Add nf-core/eager style ancient DNA (aDNA) analysis workflow#1234
Add nf-core/eager style ancient DNA (aDNA) analysis workflow#1234mertydn wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Galaxy workflow under workflows/paleogenomics/ implementing an nf-core/eager-inspired ancient DNA (aDNA) analysis pipeline, along with documentation, changelog, Dockstore metadata, and a Planemo-style test definition.
Changes:
- Introduces the
aDNA-analysis.gaworkflow with preprocessing, alignment, aDNA profiling, metagenomics screening, and MultiQC reporting steps. - Adds workflow metadata and packaging files (
.dockstore.yml,CHANGELOG.md,README.md) for distribution and maintenance. - Adds a workflow test suite (
aDNA-analysis-tests.yml) referencing public test datasets.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| workflows/paleogenomics/aDNA-analysis/aDNA-analysis.ga | New Galaxy workflow definition and exposed inputs/outputs. |
| workflows/paleogenomics/aDNA-analysis/aDNA-analysis-tests.yml | Planemo workflow test job + output assertions. |
| workflows/paleogenomics/aDNA-analysis/README.md | User-facing workflow documentation (inputs/steps/outputs/testing data). |
| workflows/paleogenomics/aDNA-analysis/CHANGELOG.md | Initial release entry for the new workflow. |
| workflows/paleogenomics/aDNA-analysis/.dockstore.yml | Dockstore registration metadata for the workflow + tests. |
| @@ -0,0 +1,1316 @@ | |||
| { | |||
| "a_galaxy_workflow": "true", | |||
| "annotation": "This workflow performs an ancient DNA based analysis similar to the one in the nf-core/eager workflow", | |||
| "description": "Input FASTQ reads (single-end or paired-end) for the sample.", | ||
| "name": "InputReads" | ||
| } | ||
| ], | ||
| "label": "InputReads", |
| { | ||
| "description": "HapMap dataset required for X-chromosome contamination estimation in ANGSD.", | ||
| "name": "HapMapChrXReference" | ||
| } | ||
| ], | ||
| "label": "HapMapChrXReference", |
| "input_connections": { | ||
| "inputFile": { | ||
| "id": 0, | ||
| "output_name": "output" | ||
| } |
There was a problem hiding this comment.
Please remove the intial SamToFastq step and expect the user to provide single-end fastq inputs.
| The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results. | ||
|
|
||
|
|
||
| ## Required Inputs | ||
| To run this workflow successfully, you need to provide the following input datasets: | ||
|
|
||
| * **`InputReads` :** The raw sequencing data for your sample in `FASTQ` format. The workflow supports both single-end and paired-end reads. |
| @@ -0,0 +1,72 @@ | |||
| # Ancient DNA analysis pipeline | |||
| This workflow performs an ancient DNA (aDNA) based analysis similar to the one in the [nf-core/eager](https://nf-co.re/eager/2.5.3/) workflow. nf-core/eager is a bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes. | |||
| * **`MultiQC Report` :** An interactive HTML report aggregating QC and analysis logs from all tools. | ||
| * **`QualiMap BamQC Report` :** A detailed HTML report containing mapping quality metrics, GC content, and coverage statistics. | ||
| * **`mapDamage Visualisation` :** Visual plots displaying the characteristic C-to-T deamination patterns at the ends of ancient DNA reads. | ||
| * **`Kraken2 Report `:** A tabular report showing the taxonomic classification of unmapped reads. | ||
| * **`EndorSpy Report `:** A JSON file containing the calculated endogenous DNA percentage. | ||
| * **`Sex.DetERRmine Report` :** A JSON file containing relative chromosomal coverage and the calculated biological sex metrics. | ||
| * **`Mt/Nuc Ratio Report` :** A JSON file containing the calculated ratio between mitochondrial and nuclear reads. | ||
| * **`ANGSD Contamination Report` :** A tabular text file detailing the estimates of nuclear X-chromosome contamination. | ||
| * **`Bcftools Stats Report` :** A text file containing comprehensive summary statistics for the called variants (VCF). |
| outputs: | ||
| EndorSpyReport: | ||
| asserts: | ||
| has_text: | ||
| text: "percent_on_target" |
| version: 1.2 | ||
| workflows: | ||
| - name: aDNA-analysis | ||
| subclass: Galaxy | ||
| publish: true | ||
| primaryDescriptorPath: /aDNA-analysis.ga |
wm75
left a comment
There was a problem hiding this comment.
Broadening the scope of the workflow
From one of CoPilot's comments:
ANGSD is configured with a hard-coded region (X:5000000-154900000), which makes this workflow human/reference-specific and not generic for other organisms/assemblies. Please expose the region as a workflow parameter (or derive it from inputs), and/or clearly document that this workflow is intended for human X-chromosome contamination estimation only.
I think you should try to make this workflow as broadly usable as possible and here are some steps to do so:
- Make the "Mitochondrial Chromosome Name" in Mt/Nuc Ratio Calculator a workflow input parameter with MT as the default.
- Introduce an optional BED file input for use at an alternative SexDetERRmine step. If the user doesn't supply that file, run the current SexDetERRmine step, which assumes human DNA.
- Make the two ANGSD steps optional and configurable. For this turn the ANSGD "Region" into a WF parameter with the current hard-coded string as the default. Then make the HapMap file input optional and skip the ANSGD steps if it's not provided. This can be achieved by turning the input into a boolean with the Map parameter value tool and activating "Conditionally skip step" for the two ANSGD steps. (Those steps are also optional in the nf-core pipeline, see: https://nf-co.re/eager/2.5.3/parameters/#nuclear-contamination-for-human-dna.)
User choice at mapping step
Can you please add a workflow input param that lets the user choose between bwa and bowtie2? Again this involves contional execution of those steps based on user input, then picking the actually generated result with the Pick parameter value tool.
This should be straightforward to implement, but immediately brings this WF a good bit towards the flexibility offered by the nf-core pipeline.
More workflow outputsand more descriptive naming
The deduplicated final BAM file and the freebayes result should both be workflow outputs. All workflow outputs should have names describing what is in the data, not just the default name generated by the tool that produced them, e.g. use "ANSGD Report of nuclear contamination" for the ANSGD output, "Fully post-processed mapping results" for the MarkDuplicates output. Reuse those names also as the output labels.
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"confidence\": \"0.0\", \"kraken2_database\": \"2026-02-13T204037Z_minikraken2_v2_8GB\", \"min_base_quality\": \"0\", \"minimum_hit_groups\": \"2\", \"quick\": false, \"report\": {\"create_report\": \"true\", \"__current_case__\": 0, \"use_mpa_style\": false, \"report_zero_counts\": false, \"report_minimizer_data\": true}, \"single_paired\": {\"single_paired_selector\": \"no\", \"__current_case__\": 1, \"input_sequences\": {\"__class__\": \"ConnectedValue\"}}, \"split_reads\": false, \"use_names\": false, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | ||
| "tool_uuid": null, | ||
| "tool_version": "2.17.1+galaxy0", |
There was a problem hiding this comment.
Copilot is exactly right here. You need to turn the hard-coded DB into a workflow input parameter.
| "input_connections": { | ||
| "inputFile": { | ||
| "id": 0, | ||
| "output_name": "output" | ||
| } |
There was a problem hiding this comment.
Please remove the intial SamToFastq step and expect the user to provide single-end fastq inputs.
|
and please do not mark comments as resolved until they actually are :-) |
… expose parameters, descriptive outputs, and lowercase filenames
|
hi @wm75 , thank you for the detailed feedback! i have updated the workflow according to all of your suggestions. the workflow passes planemo tests successfully. ready for your review! |
| ## Required & Optional Inputs | ||
| To run this workflow successfully, you need to provide the following input datasets and parameters: | ||
|
|
||
| * **`Input reads` :** Input single-end FASTQ reads for the sample. |
There was a problem hiding this comment.
nextflow run nf-core/eager -profile <docker/singularity/podman/conda/institute> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta' makes me think this is an odd default input. I would assume modern data is almost always going to be paired end ?
There was a problem hiding this comment.
because of this comment i am using single end input
#1234 (comment)
you also run the test, but i do not think it will find the kraken2 db. it will likely give error. what should i do to make it work?
There was a problem hiding this comment.
yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit?
I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).
There was a problem hiding this comment.
yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit? I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).
i can add input collection if it preferred, but i would rather keep it simple for now if you both agree
There was a problem hiding this comment.
I don't think we should merge things that aren't generally useful just because it's easy. Surely the pipeline can handle either if single end is really something the field uses, but claude thinks that is not the case:
Ancient DNA (aDNA) sequencing has some distinctive conventions driven by the nature of the molecules being sequenced. Let me walk through what's standard in the field.
Single-end vs paired-end
The field has largely moved toward paired-end sequencing, but with an important caveat: aDNA fragments are very short (typically 30-80 bp, often peaking around 40-60 bp), so the two reads in a pair usually overlap completely or near-completely. Paired-end is preferred because:
The overlap between R1 and R2 can be merged (using tools like AdapterRemoval, leeHom, or fastp) to produce a single high-confidence consensus sequence, which improves base quality and helps resolve damage-induced miscalls.
Merging also confirms the true fragment length and removes adapter readthrough cleanly, which is essential since adapters dominate the 3' end of short aDNA inserts.
It improves mapping for short reads to large reference genomes.
Single-end is sometimes still used for screening or very low-budget shotgun runs, but published pipelines (e.g., nf-core/eager, PALEOMIX) assume paired-end by default.
Read length
Typical setups are 2×50 bp or 2×75 bp on Illumina platforms (NextSeq, NovaSeq). Going longer than ~2×75 is wasteful because reads will simply run past the insert into the adapter on the other side — you're paying for cycles that produce no new sequence information. Some labs use 2×100 or 2×150 when sequencing alongside modern samples on shared flow cells, but this is for logistical convenience rather than aDNA-specific benefit.
Library type
Most modern aDNA work uses double-stranded libraries (often with UDG or partial-UDG treatment to remove deamination damage), though single-stranded library preparation (ssDNA, e.g., the Gansauge & Meyer protocol) is increasingly common for very degraded samples because it recovers shorter and more damaged molecules. Both are typically sequenced paired-end.
Platform
Illumina dominates. NovaSeq and NextSeq are the workhorses. Long-read platforms (ONT, PacBio) are essentially not used for genuine aDNA since the molecules are far too fragmented.
So the short answer: paired-end, ~2×50 to 2×75 bp on Illumina, with read merging as a standard processing step.You said: Got references ?Got references ?18:07Claude responded: Let me find some good references for you.Let me find some good references for you.Searched the webSearched the webLet me also pull up the foundational protocol papers and the standard pipeline references.Searched the webSearched the webI have what I need. Here are the key references organized by topic.
On fragment length and the rationale for paired-end + merging
Sawyer et al. 2012 — the classic empirical characterization of aDNA fragment length distributions, frequently cited for the "~44–172 bp" range. Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S. "Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA." PLoS ONE 7(3): e34131.
Seguin-Orlando et al. 2024 (Frontiers in Bioinformatics) — recent benchmarking paper that explicitly lays out why paired-end + merging is the aDNA standard. Reads generated with modern high-throughput sequencing machines are generally longer than ancient DNA molecules, therefore the reads often contain some portion of the sequencing adaptors, and overlapping portions when DNA has been read forward and backward (paired-end) can be merged to correct sequencing errors and improve read quality. DOI: 10.3389/fbinf.2023.1260486 (PMC10733496). [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC10733496/)[PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC10733496/)
On read-merging tools (each paper justifies the PE-then-merge workflow)
leeHom — Renaud G, Stenzel U, Kelso J. 2014. "leeHom: adaptor trimming and merging for Illumina sequencing reads." Nucleic Acids Research 42(18):e141. Designed explicitly with aDNA in mind.
AdapterRemoval v2 — Schubert M, Lindgreen S, Orlando L. 2016. BMC Research Notes 9:88.
ClipAndMerge — part of the original EAGER (Peltzer et al. 2016).
PEAR — Zhang J et al. 2014. Bioinformatics 30(5):614–620. Not aDNA-specific but widely used.
On the standard pipelines (which assume PE input)
nf-core/eager — Fellows Yates JA, Lamnidis TC, Borry M, Andrades Valtueña A, Fagernäs Z, Clayton S, Garcia MU, Neukamm J, Peltzer A. 2021. "Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager." PeerJ 9:e10947. DOI: 10.7717/peerj.10947. The current community standard; nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of adapter-trimmed data. [Spaam-community](https://www.spaam-community.org/intro-to-ancient-metagenomics-book/ancient-metagenomic-pipelines.html)
EAGER v1 (predecessor) — Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J, Nieselt K. 2016. "EAGER: efficient ancient genome reconstruction." Genome Biology 17:60.
PALEOMIX — Schubert M, Ermini L, Der Sarkissian C, et al. 2014. "Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX." Nature Protocols 9:1056–1082.
On library preparation (relevant to why short PE works)
Double-stranded (Meyer & Kircher) — Meyer M, Kircher M. 2010. "Illumina sequencing library preparation for highly multiplexed target capture and sequencing." Cold Spring Harbor Protocols 2010(6):pdb.prot5448.
Single-stranded (Gansauge & Meyer) — Gansauge M-T, Meyer M. 2013. "Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA." Nature Protocols 8:737–748. Updated protocol: Gansauge M-T et al. 2020. Nucleic Acids Research 48(14):e76.
Useful assembly-perspective reference
Seitz & Nieselt 2017 ("Improving ancient DNA genome assembly," PeerJ) puts it plainly: samples from ancient DNA (aDNA) mostly contain only very short fragments between 44 and 172 bp (Sawyer et al., 2012). Paired-end sequencing of these short fragments therefore often results in overlapping forward and reverse reads, and mate-pair sequencing as well as sequencing technologies producing long reads (like PacBio) do not result in the same information gain that can be achieved on modern samples. [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC5384568/)[PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC5384568/)
If you want a single citation to anchor the "paired-end is standard, merging is mandatory" claim in a methods section, Fellows Yates et al. 2021 (PeerJ) for the pipeline convention plus Sawyer et al. 2012 (PLoS ONE) for the underlying biology is the canonical pair.
There was a problem hiding this comment.
@wm75 since most of the tools are not available on usegalaxy, i cannot share them there. i am sending .ga file as an attachment.
adna-analysis.zip
There was a problem hiding this comment.
What do you mean by missing tools? I just uploaded your workflow to Galaxy Europe and it's just fine?
There was a problem hiding this comment.
@wm75 you are right. i tried to import it to usegalaxy.org. that is why most of the tools appeared to be missing. that is fine on usegalaxy.eu.
There was a problem hiding this comment.
but I'm also confused because that shared workflow doesn't have any Flatten Collection step anywhere now. Am I missing something or did you solve this problem differently already?
There was a problem hiding this comment.
@wm75 i uploaded wrong one. here is current state: https://usegalaxy.eu/u/mert1907/w/ancient-dna-analysis
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
changing to a smaller db due to memory restrictions
This comment was marked as outdated.
This comment was marked as outdated.
| "owner": "iuc", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"input_param_type\": {\"type\": \"text\", \"__current_case__\": 0, \"input_param\": {\"__class__\": \"ConnectedValue\"}, \"mappings\": [{\"__index__\": 0, \"from\": \"null\", \"to\": \"True\"}]}, \"output_param_type\": \"boolean\", \"unmapped\": {\"on_unmapped\": \"default\", \"__current_case__\": 2, \"default_value\": \"False\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
| "owner": "iuc", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"input_param_type\": {\"type\": \"text\", \"__current_case__\": 0, \"input_param\": {\"__class__\": \"ConnectedValue\"}, \"mappings\": [{\"__index__\": 0, \"from\": \"null\", \"to\": \"False\"}]}, \"output_param_type\": \"boolean\", \"unmapped\": {\"on_unmapped\": \"default\", \"__current_case__\": 2, \"default_value\": \"True\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
| "workflow_outputs": [] | ||
| }, | ||
| "3": { | ||
| "annotation": "The specific genomic region to restrict the ANGSD analysis (typically 'X' for human male nuclear contamination estimation).", |
| "workflow_outputs": [] | ||
| }, | ||
| "7": { | ||
| "annotation": "", |
| "input_connections": {}, | ||
| "inputs": [ | ||
| { | ||
| "description": "", |
| "top": 1870.952666103986 | ||
| }, | ||
| "tool_id": null, | ||
| "tool_state": "{\"multiple\": false, \"validators\": [], \"parameter_type\": \"text\", \"optional\": false}", |
| filetype: gz | ||
| Choose Mapper: BWA | ||
| Optional BED file for Sex.DetERRmine: null | ||
| Kraken2 database directory: viral2019-03 |
Summary
Adds the Ancient DNA (aDNA) analysis workflow.
Technical Details
Implementation based on
nf-core/eagerlogic:Validation
JK2067,hs37d5,HapMap).FOR CONTRIBUTOR:
FOR REVIEWERS:
This workflow does/runs/performs … xyz … to generate/analyze/etc …namefield should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id