Changes from 4 commits
11 changes: 11 additions & 0 deletions workflows/paleogenomics/adna-analysis/.dockstore.yml
@@ -0,0 +1,11 @@
version: 1.2
workflows:
  - name: adna-analysis
    subclass: Galaxy
    publish: true
    primaryDescriptorPath: /adna-analysis.ga
    testParameterFiles:
      - /adna-analysis-tests.yml
    authors:
      - name: Ali Mert AYDIN
        orcid: "https://orcid.org/0009-0008-9038-0815"
5 changes: 5 additions & 0 deletions workflows/paleogenomics/adna-analysis/CHANGELOG.md
@@ -0,0 +1,5 @@
# Changelog

## [0.1] - 2026-05-18

- First release.
79 changes: 79 additions & 0 deletions workflows/paleogenomics/adna-analysis/README.md
@@ -0,0 +1,79 @@
# Ancient DNA analysis pipeline
This workflow performs an ancient DNA (aDNA) based analysis similar to the one in the [nf-core/eager](https://nf-co.re/eager/2.5.3/) workflow. nf-core/eager is a bioinformatics best-practice processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

The pipeline processes the sequencing-read input provided to the workflow together with a reference genome and optional supporting reference data. It aligns reads and performs extensive general NGS and aDNA-specific quality control on the results.


## Required & Optional Inputs
To run this workflow successfully, you need to provide the following input datasets and parameters:

* **`Input reads` :** Input single-end FASTQ reads for the sample.

@mvdbeek mvdbeek May 18, 2026

Member

`nextflow run nf-core/eager -profile <docker/singularity/podman/conda/institute> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'` makes me think this is an odd default input. I would assume modern data is almost always going to be paired-end?

Author

because of this comment i am using single end input
#1234 (comment)

you also ran the test, but i do not think it will find the kraken2 db. it will likely give an error. what should i do to make it work?

Member

yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit?
I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).

Author

> yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit? I would be happy with a single-end version for now (what about you @mvdbeek?) , but @mertydn we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).

i can add an input collection if it is preferred, but i would rather keep it simple for now if you both agree

Member

I don't think we should merge things that aren't generally useful just because it's easy. Surely the pipeline can handle either if single end is really something the field uses, but claude thinks that is not the case:

Ancient DNA (aDNA) sequencing has some distinctive conventions driven by the nature of the molecules being sequenced. Let me walk through what's standard in the field.
Single-end vs paired-end
The field has largely moved toward paired-end sequencing, but with an important caveat: aDNA fragments are very short (typically 30-80 bp, often peaking around 40-60 bp), so the two reads in a pair usually overlap completely or near-completely. Paired-end is preferred because:

The overlap between R1 and R2 can be merged (using tools like AdapterRemoval, leeHom, or fastp) to produce a single high-confidence consensus sequence, which improves base quality and helps resolve damage-induced miscalls.
Merging also confirms the true fragment length and removes adapter readthrough cleanly, which is essential since adapters dominate the 3' end of short aDNA inserts.
It improves mapping for short reads to large reference genomes.

Single-end is sometimes still used for screening or very low-budget shotgun runs, but published pipelines (e.g., nf-core/eager, PALEOMIX) assume paired-end by default.
Read length
Typical setups are 2×50 bp or 2×75 bp on Illumina platforms (NextSeq, NovaSeq). Going longer than ~2×75 is wasteful because reads will simply run past the insert into the adapter on the other side — you're paying for cycles that produce no new sequence information. Some labs use 2×100 or 2×150 when sequencing alongside modern samples on shared flow cells, but this is for logistical convenience rather than aDNA-specific benefit.
Library type
Most modern aDNA work uses double-stranded libraries (often with UDG or partial-UDG treatment to remove deamination damage), though single-stranded library preparation (ssDNA, e.g., the Gansauge & Meyer protocol) is increasingly common for very degraded samples because it recovers shorter and more damaged molecules. Both are typically sequenced paired-end.
Platform
Illumina dominates. NovaSeq and NextSeq are the workhorses. Long-read platforms (ONT, PacBio) are essentially not used for genuine aDNA since the molecules are far too fragmented.
So the short answer: paired-end, ~2×50 to 2×75 bp on Illumina, with read merging as a standard processing step.

You said: Got references?

Claude responded: Let me find some good references for you. Let me also pull up the foundational protocol papers and the standard pipeline references. I have what I need. Here are the key references organized by topic.
On fragment length and the rationale for paired-end + merging
Sawyer et al. 2012 — the classic empirical characterization of aDNA fragment length distributions, frequently cited for the "~44–172 bp" range. Sawyer S, Krause J, Guschanski K, Savolainen V, Pääbo S. "Temporal patterns of nucleotide misincorporations and DNA fragmentation in ancient DNA." PLoS ONE 7(3): e34131.
Seguin-Orlando et al. 2024 (Frontiers in Bioinformatics) — recent benchmarking paper that explicitly lays out why paired-end + merging is the aDNA standard. Reads generated with modern high-throughput sequencing machines are generally longer than ancient DNA molecules, therefore the reads often contain some portion of the sequencing adaptors, and overlapping portions when DNA has been read forward and backward (paired-end) can be merged to correct sequencing errors and improve read quality. DOI: 10.3389/fbinf.2023.1260486 (PMC10733496). [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC10733496/)
On read-merging tools (each paper justifies the PE-then-merge workflow)

leeHom — Renaud G, Stenzel U, Kelso J. 2014. "leeHom: adaptor trimming and merging for Illumina sequencing reads." Nucleic Acids Research 42(18):e141. Designed explicitly with aDNA in mind.
AdapterRemoval v2 — Schubert M, Lindgreen S, Orlando L. 2016. BMC Research Notes 9:88.
ClipAndMerge — part of the original EAGER (Peltzer et al. 2016).
PEAR — Zhang J et al. 2014. Bioinformatics 30(5):614–620. Not aDNA-specific but widely used.

On the standard pipelines (which assume PE input)
nf-core/eager — Fellows Yates JA, Lamnidis TC, Borry M, Andrades Valtueña A, Fagernäs Z, Clayton S, Garcia MU, Neukamm J, Peltzer A. 2021. "Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager." PeerJ 9:e10947. DOI: 10.7717/peerj.10947. The current community standard; nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of adapter-trimmed data. [Spaam-community](https://www.spaam-community.org/intro-to-ancient-metagenomics-book/ancient-metagenomic-pipelines.html)
EAGER v1 (predecessor) — Peltzer A, Jäger G, Herbig A, Seitz A, Kniep C, Krause J, Nieselt K. 2016. "EAGER: efficient ancient genome reconstruction." Genome Biology 17:60.
PALEOMIX — Schubert M, Ermini L, Der Sarkissian C, et al. 2014. "Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX." Nature Protocols 9:1056–1082.
On library preparation (relevant to why short PE works)

Double-stranded (Meyer & Kircher) — Meyer M, Kircher M. 2010. "Illumina sequencing library preparation for highly multiplexed target capture and sequencing." Cold Spring Harbor Protocols 2010(6):pdb.prot5448.
Single-stranded (Gansauge & Meyer) — Gansauge M-T, Meyer M. 2013. "Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA." Nature Protocols 8:737–748. Updated protocol: Gansauge M-T et al. 2020. Nucleic Acids Research 48(14):e76.

Useful assembly-perspective reference
Seitz & Nieselt 2017 ("Improving ancient DNA genome assembly," PeerJ) puts it plainly: samples from ancient DNA (aDNA) mostly contain only very short fragments between 44 and 172 bp (Sawyer et al., 2012). Paired-end sequencing of these short fragments therefore often results in overlapping forward and reverse reads, and mate-pair sequencing as well as sequencing technologies producing long reads (like PacBio) do not result in the same information gain that can be achieved on modern samples. [PubMed Central](https://pmc.ncbi.nlm.nih.gov/articles/PMC5384568/)
If you want a single citation to anchor the "paired-end is standard, merging is mandatory" claim in a methods section, Fellows Yates et al. 2021 (PeerJ) for the pipeline convention plus Sawyer et al. 2012 (PLoS ONE) for the underlying biology is the canonical pair.

@mertydn mertydn May 22, 2026

Author

@wm75 since most of the tools are not available on usegalaxy, i cannot share them there. i am sending .ga file as an attachment.
adna-analysis.zip

Member

What do you mean by missing tools? I just uploaded your workflow to Galaxy Europe and it's just fine?

Author

@wm75 you are right. i tried to import it to usegalaxy.org. that is why most of the tools appeared to be missing. that is fine on usegalaxy.eu.

Member

but I'm also confused because that shared workflow doesn't have any Flatten Collection step anywhere now. Am I missing something or did you solve this problem differently already?

@mertydn mertydn May 22, 2026

Author

@wm75 i uploaded wrong one. here is current state: https://usegalaxy.eu/u/mert1907/w/ancient-dna-analysis

* **`Reference genome` :** Reference genome sequence in FASTA format. This is essential for read mapping and variant calling.
* **`Choose Mapper` :** Switch to select the alignment tool. Choose between BWA and Bowtie2.
* **`Hap Map ChrX reference` :** HapMap dataset required for X-chromosome contamination estimation in ANGSD.
* **`Input Mitochondrial Chromosome Name` :** The exact header name of the mitochondrial chromosome in your reference FASTA file (e.g., MT, chrM, rCRS).
* **`Kraken2 database directory` :** The database directory required for Kraken2 taxonomic classification.
* **`Optional BED file for Sex.DetERRmine` :** An optional BED file containing specific genomic coordinates to restrict the Sex.DetERRmine analysis. Leave empty for standard whole-genome human analysis, or provide targeted regions to enable sex estimation for non-human organisms.
* **`ANGSD region parameter` :** The specific genomic region to restrict the ANGSD analysis (typically 'X' for human male nuclear contamination estimation).
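
Because `Input Mitochondrial Chromosome Name` has to match the reference header exactly, it can help to list the sequence names in the FASTA file before launching the workflow. The following is a minimal Python sketch and not part of the workflow; the helper names `fasta_sequence_names` and `open_fasta` and the toy records are illustrative assumptions.

```python
import gzip
import io


def fasta_sequence_names(handle):
    """Return the sequence names (first word after '>') from a FASTA text handle."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            # Mappers typically keep only the first whitespace-delimited word.
            names.append(fields[0] if fields else "")
    return names


def open_fasta(path):
    """Open a plain or gzip-compressed FASTA file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Toy in-memory example; a real reference would be opened with open_fasta(path).
example = io.StringIO(">21 chromosome 21\nACGT\n>MT mitochondrion\nGATC\n")
names = fasta_sequence_names(example)
print(names)
```

For this toy input the mitochondrial name to enter would be `MT`; for a real reference, use whatever name the listing shows for the mitochondrial sequence.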


## Workflow Steps
By default the pipeline currently performs the following:

## 1. Preprocessing and Quality Control
* **Quality Control:** Evaluates read quality before and after trimming (`FastQC`)
* **Adapter Trimming:** Removes adapter sequences (`AdapterRemoval`)

## 2. Read Mapping and Processing
* **Alignment:** Maps reads to the provided reference genome with the mapper selected by the user (`BWA` or `Bowtie2`)
* **Filtering and Statistics:** Separates unmapped reads and calculates alignment statistics (`Samtools View and Flagstat`)
* **Duplicate Removal:** Detects and marks PCR duplicates (`Picard MarkDuplicates`)
* **Alignment Quality:** Generates detailed BAM quality metrics (`QualiMap BamQC`)
* **Library Complexity:** Estimates library complexity (`Preseq`)
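
The separation of unmapped reads relies on the SAM FLAG field, where bit `0x4` marks a read as unmapped. The workflow uses Samtools for this; the Python sketch below only illustrates the idea on plain SAM text lines. The function name `split_by_mapping` and the toy records are assumptions made for the example.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def split_by_mapping(sam_lines):
    """Split SAM alignment lines into (mapped, unmapped) lists using FLAG bit 0x4."""
    mapped, unmapped = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])
        (unmapped if flag & UNMAPPED else mapped).append(line)
    return mapped, unmapped


# Toy records: read2 carries FLAG 4 (unmapped), the others are mapped.
toy_sam = [
    "@SQ\tSN:21\tLN:1000",
    "read1\t0\t21\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t16\t21\t200\t37\t4M\t*\t0\t0\tGGCC\tIIII",
]
mapped, unmapped = split_by_mapping(toy_sam)
mapped_percent = 100.0 * len(mapped) / (len(mapped) + len(unmapped))
print(len(mapped), len(unmapped), round(mapped_percent, 1))
```

The unmapped list corresponds to the reads that later go into the metagenomic screening branch.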

## 3. Ancient DNA (aDNA) Analysis
* **Damage Profiling:** Visualizes aDNA-specific C-to-T damage patterns (`mapDamage`)
* **Endogenous Content:** Calculates the proportion of endogenous (target) DNA in the sample (`EndorSpy`)
* **(`Optional`) Contamination:** Estimates nuclear X-chromosome contamination; runs only if HapMap data is provided (`ANGSD X-Contamination`)
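
To make the damage-profiling idea concrete, here is a rough Python sketch of a per-position C-to-T frequency at the 5' end of reads. It is a simplification under stated assumptions (ungapped read/reference pairs of equal length, both written in the read's 5'->3' orientation) and is not how `mapDamage` is implemented; the function name `five_prime_c_to_t` and the toy pairs are made up for illustration.

```python
def five_prime_c_to_t(pairs, n_positions=5):
    """Per-position C>T frequency at the 5' end.

    pairs: iterable of (read, reference) strings of equal length, ungapped,
    both written 5'->3' in the orientation of the read.
    """
    ref_c = [0] * n_positions
    c_to_t = [0] * n_positions
    for read, ref in pairs:
        for i in range(min(n_positions, len(read), len(ref))):
            if ref[i] == "C":
                ref_c[i] += 1
                if read[i] == "T":
                    c_to_t[i] += 1
    # Positions without any reference C are reported as 0.0.
    return [c_to_t[i] / ref_c[i] if ref_c[i] else 0.0 for i in range(n_positions)]


toy_pairs = [
    ("TAGGT", "CAGGT"),  # C>T at the first base
    ("CAGCT", "CAGCT"),  # no substitution
    ("TTGCA", "CTGCA"),  # C>T at the first base
    ("CCGTA", "CCGCA"),  # C>T at the fourth base
]
freqs = five_prime_c_to_t(toy_pairs, n_positions=5)
print(freqs)
```

In authentic aDNA the frequency is expected to be highest at the first positions and to decay towards the read interior, which is the pattern the `mapDamage` plots visualize.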

## 4. Biological Information
* **Sex Determination:** Determines biological sex from relative chromosome coverage ratios. The step runs with or without the optional BED file, depending on whether one is provided (`Sex.DetERRmine`)
* **Mt/Nuc Ratio:** Calculates the ratio of mitochondrial reads to nuclear reads utilizing the specified mitochondrial chromosome name (`MtNucRatioCalculator`)
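
As an illustration of why the exact mitochondrial chromosome name matters, the sketch below computes a simple read-count ratio from per-sequence counts in the four-column layout written by `samtools idxstats` (name, length, mapped, unmapped). This is an assumption-laden simplification: `MtNucRatioCalculator` may compute its metrics differently, and the function name `mt_nuc_read_ratio` and the toy counts are hypothetical.

```python
def mt_nuc_read_ratio(idxstats_text, mt_name):
    """Ratio of reads mapped to the mitochondrial sequence vs. all other sequences.

    idxstats_text: tab-separated lines of (name, length, mapped, unmapped),
    the layout written by `samtools idxstats`.
    """
    mt_reads = 0
    nuc_reads = 0
    found = False
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue  # reads without a reference sequence
        if name == mt_name:
            mt_reads += int(mapped)
            found = True
        else:
            nuc_reads += int(mapped)
    if not found:
        raise ValueError(f"sequence {mt_name!r} not found; check the exact FASTA header")
    if nuc_reads == 0:
        raise ValueError("no nuclear reads mapped")
    return mt_reads / nuc_reads


# Toy counts: 950 reads on sequence "21", 50 on "MT", 12 unplaced.
toy_idxstats = "21\t1000\t950\t0\nMT\t100\t50\t0\n*\t0\t0\t12\n"
ratio = mt_nuc_read_ratio(toy_idxstats, "MT")
print(ratio)
```

Passing a name that does not occur in the reference (for example `chrM` when the header is `MT`) raises an error in this sketch instead of silently returning zero.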

## 5. Genotyping
* **Variant Analysis:** Performs variant calling to generate VCF files (`FreeBayes`)
* **Variant Statistics:** Calculates statistics for the generated variants (`Bcftools stats`)

## 6. Metagenomic Screening (For Unmapped Reads)
* **Read Extraction:** Extracts unmapped reads for microbial analysis (`Picard SamToFastq`)
* **Quality Filter:** Filters low-complexity sequences (`BBTools BBduk`)
* **Taxonomic Classification:** Performs microbiome/taxonomic screening on the filtered unmapped reads (`Kraken2`)
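
The Kraken2 report can be inspected with a few lines of Python. The sketch below assumes the default six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name); the function name `top_taxa` and the toy report are illustrative only.

```python
def top_taxa(report_text, rank="S", n=3):
    """Return the n taxa with the most clade-level reads at the given rank code."""
    rows = []
    for line in report_text.strip().splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # not the default six-column layout
        percent, clade_reads, _direct, rank_code, _taxid, name = fields
        if rank_code == rank:
            rows.append((name.strip(), int(clade_reads), float(percent)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


toy_report = (
    " 60.00\t60\t60\tU\t0\tunclassified\n"
    " 40.00\t40\t0\tR\t1\troot\n"
    " 40.00\t40\t0\tD\t2\t    Bacteria\n"
    " 30.00\t30\t30\tS\t562\t      Escherichia coli\n"
    " 10.00\t10\t10\tS\t1280\t      Staphylococcus aureus\n"
)
species = top_taxa(toy_report, rank="S", n=3)
print(species)
```

Changing `rank` to `"G"` or `"D"` would summarize at genus or domain level instead of species level.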

## 7. Reporting
* **Summary Report:** Aggregates logs and statistics from all these tools into a single interactive HTML report (`MultiQC`)


## Workflow Outputs
Upon successful execution, the workflow explicitly provides the following final files for analysis:

* **`MultiQC aggregated workflow summary report` :** An interactive HTML report aggregating QC and analysis logs from all tools.
* **`QualiMap BamQC general alignment quality metrics report` :** A detailed HTML report containing mapping quality metrics, GC content, and coverage statistics.
* **`mapDamage Visualisation` :** Visual plots displaying the characteristic C-to-T deamination patterns at the ends of ancient DNA reads.
* **`Kraken2 taxonomic classification and microbial screening report` :** A tabular report showing the taxonomic classification of unmapped reads.
* **`EndorSpy endogenous DNA authentication report` :** A JSON file containing the calculated endogenous DNA percentage.
* **`Sex.DetERRmine (Without BED) report of chromosomal gender estimation` :** A JSON file containing biological sex metrics for human-genome alignments.
* **`Sex.DetERRmine (With BED) report of chromosomal gender estimation` :** A JSON file containing biological sex metrics for targeted capture regions.
* **`Mitochondrial to nuclear DNA ratio calculation report` :** A JSON file containing the calculated ratio between mitochondrial and nuclear reads.
* **`ANGSD report of nuclear contamination estimation` :** A tabular text file detailing the estimates of nuclear X-chromosome contamination.
* **`Bcftools variant calling summary statistics report` :** A text file containing comprehensive summary statistics for the called variants (VCF).
* **`Fully post-processed mapping results` :** The final deduplicated and filtered alignment BAM file.
* **`Freebayes raw genomic variant calls` :** The raw VCF file generated from variant analysis.


## Testing Data
To ensure the workflow functions correctly, it was validated using the following datasets and databases:

* **`Primary Test Data` :** A downsampled single-end FASTQ dataset [NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz](https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz) optimized for rapid workflow testing and validation.
* **`Primary Reference Genome` :** The [hs37d5_chr21-MT.fa.gz](https://github.com/nf-core/test-datasets/blob/eager/reference/Human/hs37d5_chr21-MT.fa.gz) file was utilized as the primary reference genome sequence.
* **`X-Chromosome Contamination Reference` :** The [HapMap ChrX](https://github.com/ANGSD/angsd/blob/master/RES/HapMapChrX.gz) dataset was provided as the initial reference for the estimation of X-chromosome contamination using the ANGSD tool.
* **`Taxonomic Classification Database` :** The Minikraken v2 database was utilized to perform taxonomic classification via Kraken2.
88 changes: 88 additions & 0 deletions workflows/paleogenomics/adna-analysis/adna-analysis-tests.yml
@@ -0,0 +1,88 @@
- doc: Test outline for adna-analysis.ga
  job:
    Input reads:
      class: File
      location: https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz
      filetype: fastqsanger.gz
    Reference genome:
      class: File
      location: https://github.com/nf-core/test-datasets/raw/eager/reference/Human/hs37d5_chr21-MT.fa.gz
      filetype: fasta.gz
    Hap Map ChrX reference:
      class: File
      location: https://github.com/ANGSD/angsd/raw/master/RES/HapMapChrX.gz
      filetype: gz
    Choose Mapper: BWA
    Optional BED file for Sex.DetERRmine: null
    Kraken2 database directory: k2_standard_20210517
  outputs:
    Fully post-processed mapping results:
      asserts:
        has_size:
          min: 100
    EndorSpy endogenous DNA authentication report:
      asserts:
        has_text:
          text: "percent_on_target"
    Sex.DetERRmine (Without BED) report of chromosomal gender estimation:
      asserts:
        has_text:
          text: "Sex.DetERRmine"
    QualiMap BamQC general alignment quality metrics report:
      asserts:
        has_text:
          text: "Qualimap Report: BAM QC"
    Mitochondrial to nuclear DNA ratio calculation report:
      asserts:
        has_text:
          text: "mtnuccalculator"
    mapDamage Visualisation:
      element_tests:
        dnacomp:
          asserts:
            has_text:
              text: "mapDamage"
        misincorporation:
          asserts:
            has_text:
              text: "mapDamage"
        5pCtoT_freq:
          asserts:
            has_text:
              text: "5pC>T"
        3pGtoA_freq:
          asserts:
            has_text:
              text: "3pG>A"
        Fragmisincorporation_plot:
          asserts:
            has_size:
              min: 100
        lgdistribution:
          asserts:
            has_text:
              text: "mapDamage"
        Length_plot:
          asserts:
            has_size:
              min: 100
    Freebayes raw genomic variant calls:
      asserts:
        has_text:
          text: "freeBayes"
    ANGSD report of nuclear contamination estimation:
      asserts:
        has_text:
          text: "Method1_MOM_estimate"
    Bcftools variant calling summary statistics report:
      asserts:
        has_text:
          text: "ACT>TCGA"
    Kraken2 taxonomic classification and microbial screening report:
      asserts:
        has_text:
          text: "root"
    MultiQC aggregated workflow summary report:
      asserts:
        has_text:
          text: "MultiQC"