Add nf-core/eager style ancient DNA (aDNA) analysis workflow by mertydn · Pull Request #1234 · galaxyproject/iwc

mertydn · 2026-05-09T11:24:01Z

Summary

Adds the Ancient DNA (aDNA) analysis workflow.

Technical Details

Implementation based on nf-core/eager logic:

Preprocessing & Alignment: AdapterRemoval, BWA, and Picard MarkDuplicates.
aDNA Profiling: mapDamage, EndorSpy, and ANGSD.
Biological Metrics: Sex.DetERRmine and MtNucRatioCalculator.
Genotyping & Microbiome: FreeBayes, BBduk, and Kraken2.
QC & Reporting: FastQC, QualiMap BamQC, Preseq, and MultiQC.

Validation

Tests: Planemo-validated suite verifying workflow outputs and technical parity using sub-sampled reference datasets (JK2067, hs37d5, HapMap).

FOR CONTRIBUTOR:

I have read the Adding workflows guidelines
License permits unrestricted use (educational + commercial)
Please also take note of the reviewer guidelines below to facilitate a smooth review process.

FOR REVIEWERS:

.dockstore.yml: file is present and aligned with creator metadata in workflow. ORCID identifiers are strongly encouraged in creator metadata. The .dockstore.yml file is required to run tests
Workflow is sufficiently generic to be used with lab data and does not hardcode sample names, reference data and can be run without reading an accompanying tutorial.
In workflow: annotation field contains short description of what the workflow does. Should start with This workflow does/runs/performs … xyz … to generate/analyze/etc …
In workflow: workflow inputs and outputs have human readable names (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless it is generally understood. Altering input or output labels requires adjusting these labels in the the workflow-tests.yml file as well
In workflow: name field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
Workflow folder: prefer dash (-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id
Readme explains what workflow does, what are valid inputs and what outputs users can expect. If a tutorial or other resources exist they can be linked. If a similar workflow exists in IWC readme should explain differences with existing workflow and when one might prefer one workflow over another
Changelog contains appropriate entries
Large files (> 100 KB) are uploaded to zenodo and location urls are used in test file

Copilot

Pull request overview

Adds a new Galaxy workflow under workflows/paleogenomics/ implementing an nf-core/eager-inspired ancient DNA (aDNA) analysis pipeline, along with documentation, changelog, Dockstore metadata, and a Planemo-style test definition.

Changes:

Introduces the aDNA-analysis.ga workflow with preprocessing, alignment, aDNA profiling, metagenomics screening, and MultiQC reporting steps.
Adds workflow metadata and packaging files (.dockstore.yml, CHANGELOG.md, README.md) for distribution and maintenance.
Adds a workflow test suite (aDNA-analysis-tests.yml) referencing public test datasets.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Show a summary per file

File	Description
workflows/paleogenomics/aDNA-analysis/aDNA-analysis.ga	New Galaxy workflow definition and exposed inputs/outputs.
workflows/paleogenomics/aDNA-analysis/aDNA-analysis-tests.yml	Planemo workflow test job + output assertions.
workflows/paleogenomics/aDNA-analysis/README.md	User-facing workflow documentation (inputs/steps/outputs/testing data).
workflows/paleogenomics/aDNA-analysis/CHANGELOG.md	Initial release entry for the new workflow.
workflows/paleogenomics/aDNA-analysis/.dockstore.yml	Dockstore registration metadata for the workflow + tests.

@@ -0,0 +1,1316 @@
+{
+    "a_galaxy_workflow": "true",
+    "annotation": "This workflow performs an ancient DNA based analysis similar to the one in the nf-core/eager workflow",


+                    "description": "Input FASTQ reads (single-end or paired-end) for the sample.",
+                    "name": "InputReads"
+                }
+            ],
+            "label": "InputReads",


+                {
+                    "description": "HapMap dataset required for X-chromosome contamination estimation in ANGSD.",
+                    "name": "HapMapChrXReference"
+                }
+            ],
+            "label": "HapMapChrXReference",


wm75 · 2026-05-15T09:51:27Z

+            "input_connections": {
+                "inputFile": {
+                    "id": 0,
+                    "output_name": "output"
+                }


Please remove the intial SamToFastq step and expect the user to provide single-end fastq inputs.

+The pipeline pre-processes raw data from FASTQ inputs, or preprocessed BAM inputs. It can align reads and performs extensive general NGS and aDNA specific quality-control on the results.
+
+
+## Required Inputs
+To run this workflow successfully, you need to provide the following input datasets:
+
+* **`InputReads` :** The raw sequencing data for your sample in `FASTQ` format. The workflow supports both single-end and paired-end reads.


@@ -0,0 +1,72 @@
+# Ancient DNA analysis pipeline
+This workflow performs an ancient DNA (aDNA) based analysis similar to the one in the [nf-core/eager](https://nf-co.re/eager/2.5.3/) workflow. nf-core/eager is a bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.


+* **`MultiQC Report` :** An interactive HTML report aggregating QC and analysis logs from all tools.
+* **`QualiMap BamQC Report` :** A detailed HTML report containing mapping quality metrics, GC content, and coverage statistics.
+* **`mapDamage Visualisation` :** Visual plots displaying the characteristic C-to-T deamination patterns at the ends of ancient DNA reads.
+* **`Kraken2 Report `:** A tabular report showing the taxonomic classification of unmapped reads.
+* **`EndorSpy Report `:** A JSON file containing the calculated endogenous DNA percentage.
+* **`Sex.DetERRmine Report` :** A JSON file containing relative chromosomal coverage and the calculated biological sex metrics.
+* **`Mt/Nuc Ratio Report` :** A JSON file containing the calculated ratio between mitochondrial and nuclear reads.
+* **`ANGSD Contamination Report` :** A tabular text file detailing the estimates of nuclear X-chromosome contamination.
+* **`Bcftools Stats Report` :** A text file containing comprehensive summary statistics for the called variants (VCF).


+  outputs:
+    EndorSpyReport:
+      asserts:
+        has_text:
+            text: "percent_on_target"


+version: 1.2
+workflows:
+- name: aDNA-analysis
+  subclass: Galaxy
+  publish: true
+  primaryDescriptorPath: /aDNA-analysis.ga


wm75

Broadening the scope of the workflow

From one of CoPilot's comments:

ANGSD is configured with a hard-coded region (X:5000000-154900000), which makes this workflow human/reference-specific and not generic for other organisms/assemblies. Please expose the region as a workflow parameter (or derive it from inputs), and/or clearly document that this workflow is intended for human X-chromosome contamination estimation only.

I think you should try to make this workflow as broadly usable as possible and here are some steps to do so:

Make the "Mitochondrial Chromosome Name" in Mt/Nuc Ratio Calculator a workflow input parameter with MT as the default.
Introduce an optional BED file input for use at an alternative SexDetERRmine step. If the user doesn't supply that file, run the current SexDetERRmine step, which assumes human DNA.
Make the two ANGSD steps optional and configurable. For this turn the ANSGD "Region" into a WF parameter with the current hard-coded string as the default. Then make the HapMap file input optional and skip the ANSGD steps if it's not provided. This can be achieved by turning the input into a boolean with the Map parameter value tool and activating "Conditionally skip step" for the two ANSGD steps. (Those steps are also optional in the nf-core pipeline, see: https://nf-co.re/eager/2.5.3/parameters/#nuclear-contamination-for-human-dna.)

User choice at mapping step

Can you please add a workflow input param that lets the user choose between bwa and bowtie2? Again this involves contional execution of those steps based on user input, then picking the actually generated result with the Pick parameter value tool.
This should be straightforward to implement, but immediately brings this WF a good bit towards the flexibility offered by the nf-core pipeline.

More workflow outputsand more descriptive naming

The deduplicated final BAM file and the freebayes result should both be workflow outputs. All workflow outputs should have names describing what is in the data, not just the default name generated by the tool that produced them, e.g. use "ANSGD Report of nuclear contamination" for the ANSGD output, "Fully post-processed mapping results" for the MarkDuplicates output. Reuse those names also as the output labels.

wm75 · 2026-05-15T09:45:23Z

+                "tool_shed": "toolshed.g2.bx.psu.edu"
+            },
+            "tool_state": "{\"confidence\": \"0.0\", \"kraken2_database\": \"2026-02-13T204037Z_minikraken2_v2_8GB\", \"min_base_quality\": \"0\", \"minimum_hit_groups\": \"2\", \"quick\": false, \"report\": {\"create_report\": \"true\", \"__current_case__\": 0, \"use_mpa_style\": false, \"report_zero_counts\": false, \"report_minimizer_data\": true}, \"single_paired\": {\"single_paired_selector\": \"no\", \"__current_case__\": 1, \"input_sequences\": {\"__class__\": \"ConnectedValue\"}}, \"split_reads\": false, \"use_names\": false, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
+            "tool_uuid": null,
+            "tool_version": "2.17.1+galaxy0",


Copilot is exactly right here. You need to turn the hard-coded DB into a workflow input parameter.

wm75 · 2026-05-15T09:51:27Z

+            "input_connections": {
+                "inputFile": {
+                    "id": 0,
+                    "output_name": "output"
+                }


Please remove the intial SamToFastq step and expect the user to provide single-end fastq inputs.

wm75 · 2026-05-15T11:08:40Z

and please do not mark comments as resolved until they actually are :-)

… expose parameters, descriptive outputs, and lowercase filenames

mertydn · 2026-05-18T14:59:47Z

hi @wm75 , thank you for the detailed feedback! i have updated the workflow according to all of your suggestions.

the workflow passes planemo tests successfully. ready for your review!

mvdbeek · 2026-05-18T15:16:42Z

+## Required & Optional Inputs
+To run this workflow successfully, you need to provide the following input datasets and parameters:
+
+* **`Input reads` :** Input single-end FASTQ reads for the sample.


nextflow run nf-core/eager -profile <docker/singularity/podman/conda/institute> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta' makes me think this is an odd default input. I would assume modern data is almost always going to be paired end ?

because of this comment i am using single end input
#1234 (comment)

you also run the test, but i do not think it will find the kraken2 db. it will likely give error. what should i do to make it work?

yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit?
I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).

yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit? I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).

i can add input collection if it preferred, but i would rather keep it simple for now if you both agree

I don't think we should merge things that aren't generally useful just because it's easy. Surely the pipeline can handle either if single end is really something the field uses, but claude thinks that is not the case:

Ancient DNA (aDNA) sequencing has some distinctive conventions driven by the nature of the molecules being sequenced. Let me walk through what's standard in the field. Single-end vs paired-end The field has largely moved toward paired-end sequencing, but with an important caveat: aDNA fragments are very short (typically 30-80 bp, often peaking around 40-60 bp), so the two reads in a pair usually overlap completely or near-completely. Paired-end is preferred because: The overlap between R1 and R2 can be merged (using tools like AdapterRemoval, leeHom, or fastp) to produce a single high-confidence consensus sequence, which improves base quality and helps resolve damage-induced miscalls. Merging also confirms the true fragment length and removes adapter readthrough cleanly, which is essential since adapters dominate the 3' end of short aDNA inserts. It improves mapping for short reads to large reference genomes. Single-end is sometimes still used for screening or very low-budget shotgun runs, but published pipelines (e.g., nf-core/eager, PALEOMIX) assume paired-end by default. Read length Typical setups are 2×50 bp or 2×75 bp on Illumina platforms (NextSeq, NovaSeq). Going longer than ~2×75 is wasteful because reads will simply run past the insert into the adapter on the other side — you're paying for cycles that produce no new sequence information. Some labs use 2×100 or 2×150 when sequencing alongside modern samples on shared flow cells, but this is for logistical convenience rather than aDNA-specific benefit. Library type Most modern aDNA work uses double-stranded libraries (often with UDG or partial-UDG treatment to remove deamination damage), though single-stranded library preparation (ssDNA, e.g., the Gansauge & Meyer protocol) is increasingly common for very degraded samples because it recovers shorter and more damaged molecules. Both are typically sequenced paired-end. Platform Illumina dominates. NovaSeq and NextSeq are the workhorses. Long-read platforms (ONT, PacBio) are essentially not used for genuine aDNA since the molecules are far too fragmented. So the short answer: paired-end, ~2×50 to 2×75 bp on Illumina, with read merging as a standard processing step.You said: Got references ?Got references ?18:07Claude responded: Let me find some good references for you.Let me find some good references for you.Searched the webSearched the webLet me also pull up the foundational protocol papers and the standard pipeline references.Searched the webSearched the webI have what I need. Here are the key references organized by topic. On fragment length and the rationale for paired-end + merging Sawyer et al. 2012 — the classic empirical characterization of aDNA fragment length distributions, frequently cited for the "~44–172 bp" range. Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S. "Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA." PLoS ONE 7(3): e34131. Seguin-Orlando et al. 2024 (Frontiers in Bioinformatics) — recent benchmarking paper that explicitly lays out why paired-end + merging is the aDNA standard. Reads generated with modern high-throughput sequencing machines are generally longer than ancient DNA molecules, therefore the reads often contain some portion of the sequencing adaptors, and overlapping portions when DNA has been read forward and backward (paired-end) can be merged to correct sequencing errors and improve read quality. DOI: 10.3389/fbinf.2023.1260486 (PMC10733496). [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC10733496/)[PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC10733496/) On read-merging tools (each paper justifies the PE-then-merge workflow) leeHom — Renaud G, Stenzel U, Kelso J. 2014. "leeHom: adaptor trimming and merging for Illumina sequencing reads." Nucleic Acids Research 42(18):e141. Designed explicitly with aDNA in mind. AdapterRemoval v2 — Schubert M, Lindgreen S, Orlando L. 2016. BMC Research Notes 9:88. ClipAndMerge — part of the original EAGER (Peltzer et al. 2016). PEAR — Zhang J et al. 2014. Bioinformatics 30(5):614–620. Not aDNA-specific but widely used. On the standard pipelines (which assume PE input) nf-core/eager — Fellows Yates JA, Lamnidis TC, Borry M, Andrades Valtueña A, Fagernäs Z, Clayton S, Garcia MU, Neukamm J, Peltzer A. 2021. "Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager." PeerJ 9:e10947. DOI: 10.7717/peerj.10947. The current community standard; nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of adapter-trimmed data. [Spaam-community](https://www.spaam-community.org/intro-to-ancient-metagenomics-book/ancient-metagenomic-pipelines.html) EAGER v1 (predecessor) — Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J, Nieselt K. 2016. "EAGER: efficient ancient genome reconstruction." Genome Biology 17:60. PALEOMIX — Schubert M, Ermini L, Der Sarkissian C, et al. 2014. "Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX." Nature Protocols 9:1056–1082. On library preparation (relevant to why short PE works) Double-stranded (Meyer & Kircher) — Meyer M, Kircher M. 2010. "Illumina sequencing library preparation for highly multiplexed target capture and sequencing." Cold Spring Harbor Protocols 2010(6):pdb.prot5448. Single-stranded (Gansauge & Meyer) — Gansauge M-T, Meyer M. 2013. "Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA." Nature Protocols 8:737–748. Updated protocol: Gansauge M-T et al. 2020. Nucleic Acids Research 48(14):e76. Useful assembly-perspective reference Seitz & Nieselt 2017 ("Improving ancient DNA genome assembly," PeerJ) puts it plainly: samples from ancient DNA (aDNA) mostly contain only very short fragments between 44 and 172 bp (Sawyer et al., 2012). Paired-end sequencing of these short fragments therefore often results in overlapping forward and reverse reads, and mate-pair sequencing as well as sequencing technologies producing long reads (like PacBio) do not result in the same information gain that can be achieved on modern samples. [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC5384568/)[PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC5384568/) If you want a single citation to anchor the "paired-end is standard, merging is mandatory" claim in a methods section, Fellows Yates et al. 2021 (PeerJ) for the pipeline convention plus Sawyer et al. 2012 (PLoS ONE) for the underlying biology is the canonical pair.

@wm75 since most of the tools are not available on usegalaxy, i cannot share them there. i am sending .ga file as an attachment.
adna-analysis.zip

What do you mean by missing tools? I just uploaded your workflow to Galaxy Europe and it's just fine?

@wm75 you are right. i tried to import it to usegalaxy.org. that is why most of the tools appeared to be missing. that is fine on usegalaxy.eu.

but I'm also confused because that shared workflow doesn't have any Flatten Collection step anywhere now. Am I missing something or did you solve this problem differently already?

@wm75 i uploaded wrong one. here is current state: https://usegalaxy.eu/u/mert1907/w/ancient-dna-analysis

changing to a smaller db due to memory restrictions

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

+                "owner": "iuc",
+                "tool_shed": "toolshed.g2.bx.psu.edu"
+            },
+            "tool_state": "{\"input_param_type\": {\"type\": \"text\", \"__current_case__\": 0, \"input_param\": {\"__class__\": \"ConnectedValue\"}, \"mappings\": [{\"__index__\": 0, \"from\": \"null\", \"to\": \"True\"}]}, \"output_param_type\": \"boolean\", \"unmapped\": {\"on_unmapped\": \"default\", \"__current_case__\": 2, \"default_value\": \"False\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",


+                "owner": "iuc",
+                "tool_shed": "toolshed.g2.bx.psu.edu"
+            },
+            "tool_state": "{\"input_param_type\": {\"type\": \"text\", \"__current_case__\": 0, \"input_param\": {\"__class__\": \"ConnectedValue\"}, \"mappings\": [{\"__index__\": 0, \"from\": \"null\", \"to\": \"False\"}]}, \"output_param_type\": \"boolean\", \"unmapped\": {\"on_unmapped\": \"default\", \"__current_case__\": 2, \"default_value\": \"True\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}",


+            "workflow_outputs": []
+        },
+        "3": {
+            "annotation": "The specific genomic region to restrict the ANGSD analysis (typically 'X' for human male nuclear contamination estimation).",


+            "workflow_outputs": []
+        },
+        "7": {
+            "annotation": "",


+            "input_connections": {},
+            "inputs": [
+                {
+                    "description": "",


+                "top": 1870.952666103986
+            },
+            "tool_id": null,
+            "tool_state": "{\"multiple\": false, \"validators\": [], \"parameter_type\": \"text\", \"optional\": false}",


+      filetype: gz
+    Choose Mapper: BWA
+    Optional BED file for Sex.DetERRmine: null
+    Kraken2 database directory: viral2019-03


Add nf-core/eager style ancient DNA (aDNA) analysis workflow

2094f59

mvdbeek requested a review from Copilot May 11, 2026 17:18

Copilot started reviewing on behalf of mvdbeek May 11, 2026 17:18 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

wm75 reviewed May 15, 2026

View reviewed changes

Apply reviewer feedback: add BWA/Bowtie2 switch, optional HapMap/BED,…

b5c54f5

… expose parameters, descriptive outputs, and lowercase filenames

mvdbeek reviewed May 18, 2026

View reviewed changes

mertydn commented May 18, 2026

View reviewed changes

Comment thread workflows/paleogenomics/adna-analysis/adna-analysis-tests.yml Outdated

This comment was marked as outdated.

Sign in to view

mertydn added 2 commits May 18, 2026 18:22

Update Kraken2 database directory in workflow

2529c72

Add release version to adna-analysis.ga

ef57447

This comment was marked as outdated.

Sign in to view

Update Kraken2 database directory in workflow

d113ab3

changing to a smaller db due to memory restrictions

This comment was marked as outdated.

Sign in to view

Update Kraken2 database directory in workflow

48a96e0

wm75 requested a review from Copilot June 29, 2026 08:38

Copilot started reviewing on behalf of wm75 June 29, 2026 08:39 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

		@@ -0,0 +1,72 @@
		# Ancient DNA analysis pipeline
		This workflow performs an ancient DNA (aDNA) based analysis similar to the one in the [nf-core/eager](https://nf-co.re/eager/2.5.3/) workflow. nf-core/eager is a bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

Uh oh!

Conversation

mertydn commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Technical Details

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wm75 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wm75 commented May 15, 2026

Uh oh!

mertydn commented May 18, 2026

Uh oh!

mvdbeek May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mertydn May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mertydn May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mertydn commented May 9, 2026 •

edited

Loading

mvdbeek May 18, 2026 •

edited

Loading

mertydn May 22, 2026 •

edited

Loading

mertydn May 22, 2026 •

edited

Loading