Add nf-core/eager style ancient DNA (aDNA) analysis workflow #1234
Open
mertydn wants to merge 6 commits into galaxyproject:main from mertydn:add-paleogenomics-adna-workflow
2094f59 Add nf-core/eager style ancient DNA (aDNA) analysis workflow (mertydn)
b5c54f5 Apply reviewer feedback: add BWA/Bowtie2 switch, optional HapMap/BED,… (mertydn)
2529c72 Update Kraken2 database directory in workflow (mertydn)
ef57447 Add release version to adna-analysis.ga (mertydn)
d113ab3 Update Kraken2 database directory in workflow (mertydn)
48a96e0 Update Kraken2 database directory in workflow (mertydn)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| version: 1.2 | ||
| workflows: | ||
| - name: adna-analysis | ||
| subclass: Galaxy | ||
| publish: true | ||
| primaryDescriptorPath: /adna-analysis.ga | ||
| testParameterFiles: | ||
| - /adna-analysis-tests.yml | ||
| authors: | ||
| - name: Ali Mert AYDIN | ||
| orcid: "https://orcid.org/0009-0008-9038-0815" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| # Changelog | ||
|
|
||
| ## [0.1] - 2026-05-18 | ||
|
|
||
| - First release. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| # Ancient DNA analysis pipeline | ||
| This workflow performs an ancient DNA (aDNA) based analysis similar to the one in the [nf-core/eager](https://nf-co.re/eager/2.5.3/) workflow. nf-core/eager is a bioinformatics best-practice processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes. | ||
|
|
||
| The pipeline processes the sequencing-read input provided to the workflow together with a reference genome and optional supporting reference data. It aligns reads and performs extensive general NGS and aDNA-specific quality control on the results. | ||
|
|
||
|
|
||
| ## Required & Optional Inputs | ||
| To run this workflow successfully, you need to provide the following input datasets and parameters: | ||
|
|
||
| * **`Input reads` :** Input single-end FASTQ reads for the sample. | ||
| * **`Reference genome` :** Reference genome sequence in FASTA format. This is essential for read mapping and variant calling. | ||
| * **`Choose Mapper` :** Switch to select the alignment tool. Choose between BWA and Bowtie2. | ||
| * **`Hap Map ChrX reference` :** HapMap dataset required for X-chromosome contamination estimation in ANGSD. | ||
| * **`Input Mitochondrial Chromosome Name` :** The exact header name of the mitochondrial chromosome in your reference FASTA file (e.g., MT, chrM, rCRS). | ||
| * **`Kraken2 database directory` :** The database directory required for Kraken2 taxonomic classification. | ||
| * **`Optional BED file for Sex.DetERRmine` :** An optional BED file containing specific genomic coordinates to restrict the Sex.DetERRmine analysis. Leave empty for standard whole-genome human analysis, or provide targeted regions to enable gender estimation for non-human organisms. | ||
| * **`ANGSD region parameter` :** The specific genomic region to restrict the ANGSD analysis (typically 'X' for human male nuclear contamination estimation). | ||
|
|
||
|
|
||
| ## Workflow Steps | ||
| By default the pipeline currently performs the following: | ||
|
|
||
| ## 1. Preprocessing and Quality Control | ||
| * **Quality Control:** Evaluates read quality before and after trimming (`FastQC`) | ||
| * **Adapter Trimming:** Removes adapter sequences (`AdapterRemoval`) | ||
|
|
||
| ## 2. Read Mapping and Processing | ||
| * **Alignment:** Maps reads to the provided reference genome with the mapper selected by the user (`BWA` or `Bowtie2`) | ||
| * **Filtering and Statistics:** Separates unmapped reads and calculates alignment statistics (`Samtools View and Flagstat`) | ||
| * **Duplicate Removal:** Detects and marks PCR duplicates (`Picard MarkDuplicates`) | ||
| * **Alignment Quality:** Generates detailed BAM quality metrics (`QualiMap BamQC`) | ||
| * **Library Complexity:** Estimates library complexity (`Preseq`) | ||
|
|
||
| ## 3. Ancient DNA (aDNA) Analysis | ||
| * **Damage Profiling:** Visualizes aDNA-specific C-to-T damage patterns (`mapDamage`) | ||
| * **Endogenous Content:** Calculates the proportion of endogenous (target) DNA in the sample (`EndorSpy`) | ||
| * **(`Optional`) Contamination:** Estimates nuclear X-chromosome contamination conditionally if HapMap data is provided (`ANGSD X-Contamination`) | ||
|
|
||
| ## 4. Biological Information | ||
| * **Sex Determination:** Determines biological sex from relative chromosome coverage ratios. The step runs with or without region restriction, depending on whether the optional BED file is provided (`Sex.DetERRmine`) | ||
| * **Mt/Nuc Ratio:** Calculates the ratio of mitochondrial reads to nuclear reads utilizing the specified mitochondrial chromosome name (`MtNucRatioCalculator`) | ||
|
|
||
| ## 5. Genotyping | ||
| * **Variant Analysis:** Performs variant calling to generate VCF files (`FreeBayes`) | ||
| * **Variant Statistics:** Calculates statistics for the generated variants (`Bcftools stats`) | ||
|
|
||
| ## 6. Metagenomic Screening (For Unmapped Reads) | ||
| * **Read Extraction:** Extracts unmapped reads for microbial analysis (`Picard SamToFastq`) | ||
| * **Quality Filter:** Filters low-complexity sequences (`BBTools BBduk`) | ||
| * **Taxonomic Classification:** Performs microbiome/taxonomic screening on the filtered unmapped reads (`Kraken2`) | ||
|
|
||
| ## 7. Reporting | ||
| * **Summary Report:** Aggregates logs and statistics from all these tools into a single interactive HTML report (`MultiQC`) | ||
|
|
||
|
|
||
| ## Workflow Outputs | ||
| Upon successful execution, the workflow explicitly provides the following final files for analysis: | ||
|
|
||
| * **`MultiQC aggregated workflow summary report` :** An interactive HTML report aggregating QC and analysis logs from all tools. | ||
| * **`QualiMap BamQC general alignment quality metrics report` :** A detailed HTML report containing mapping quality metrics, GC content, and coverage statistics. | ||
| * **`mapDamage Visualisation` :** Visual plots displaying the characteristic C-to-T deamination patterns at the ends of ancient DNA reads. | ||
| * **`Kraken2 taxonomic classification and microbial screening report` :** A tabular report showing the taxonomic classification of unmapped reads. | ||
| * **`EndorSpy endogenous DNA authentication report` :** A JSON file containing the calculated endogenous DNA percentage. | ||
| * **`Sex.DetERRmine (Without BED) report of chromosomal gender estimation` :** A JSON file containing biological sex metrics for human-genome alignments. | ||
| * **`Sex.DetERRmine (With BED) report of chromosomal gender estimation` :** A JSON file containing biological sex metrics for targeted capture regions. | ||
| * **`Mitochondrial to nuclear DNA ratio calculation report` :** A JSON file containing the calculated ratio between mitochondrial and nuclear reads. | ||
| * **`ANGSD report of nuclear contamination estimation` :** A tabular text file detailing the estimates of nuclear X-chromosome contamination. | ||
| * **`Bcftools variant calling summary statistics report` :** A text file containing comprehensive summary statistics for the called variants (VCF). | ||
| * **`Fully post-processed mapping results` :** The final deduplicated and filtered alignment BAM file. | ||
| * **`Freebayes raw genomic variant calls` :** The raw VCF file generated from variant analysis. | ||
|
|
||
|
|
||
| ## Testing Data | ||
| To ensure the workflow functions correctly, it was validated using the following datasets and databases: | ||
|
|
||
| * **`Primary Test Data` :** A downsampled single-end FASTQ dataset [NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz](https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz) optimized for rapid workflow testing and validation. | ||
| * **`Primary Reference Genome` :** The [hs37d5_chr21-MT.fa.gz](https://github.com/nf-core/test-datasets/blob/eager/reference/Human/hs37d5_chr21-MT.fa.gz) file was utilized as the primary reference genome sequence. | ||
| * **`X-Chromosome Contamination Reference` :** The [HapMap ChrX](https://github.com/ANGSD/angsd/blob/master/RES/HapMapChrX.gz) dataset was provided as the initial reference for the estimation of X-chromosome contamination using the ANGSD tool. | ||
| * **`Taxonomic Classification Database` :** The Minikraken v2 database was utilized to perform taxonomic classification via Kraken2. | ||
88 changes: 88 additions & 0 deletions
workflows/paleogenomics/adna-analysis/adna-analysis-tests.yml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| - doc: Test outline for adna-analysis.ga | ||
| job: | ||
| Input reads: | ||
| class: File | ||
| location: https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz | ||
| filetype: fastqsanger.gz | ||
| Reference genome: | ||
| class: File | ||
| location: https://github.com/nf-core/test-datasets/raw/eager/reference/Human/hs37d5_chr21-MT.fa.gz | ||
| filetype: fasta.gz | ||
| Hap Map ChrX reference: | ||
| class: File | ||
| location: https://github.com/ANGSD/angsd/raw/master/RES/HapMapChrX.gz | ||
| filetype: gz | ||
| Choose Mapper: BWA | ||
| Optional BED file for Sex.DetERRmine: null | ||
| Kraken2 database directory: viral2019-03 | ||
|
|
||
| outputs: | ||
| Fully post-processed mapping results: | ||
| asserts: | ||
| has_size: | ||
| min: 100 | ||
| EndorSpy endogenous DNA authentication report: | ||
| asserts: | ||
| has_text: | ||
| text: "percent_on_target" | ||
|
Comment on lines
+18
to
+26
|
||
| Sex.DetERRmine (Without BED) report of chromosomal gender estimation: | ||
| asserts: | ||
| has_text: | ||
| text: "Sex.DetERRmine" | ||
| QualiMap BamQC general alignment quality metrics report: | ||
| asserts: | ||
| has_text: | ||
| text: "Qualimap Report: BAM QC" | ||
| Mitochondrial to nuclear DNA ratio calculation report: | ||
| asserts: | ||
| has_text: | ||
| text: "mtnuccalculator" | ||
| mapDamage Visualisation: | ||
| element_tests: | ||
| dnacomp: | ||
| asserts: | ||
| has_text: | ||
| text: "mapDamage" | ||
| misincorporation: | ||
| asserts: | ||
| has_text: | ||
| text: "mapDamage" | ||
| 5pCtoT_freq: | ||
| asserts: | ||
| has_text: | ||
| text: "5pC>T" | ||
| 3pGtoA_freq: | ||
| asserts: | ||
| has_text: | ||
| text: "3pG>A" | ||
| Fragmisincorporation_plot: | ||
| asserts: | ||
| has_size: | ||
| min: 100 | ||
| lgdistribution: | ||
| asserts: | ||
| has_text: | ||
| text: "mapDamage" | ||
| Length_plot: | ||
| asserts: | ||
| has_size: | ||
| min: 100 | ||
| Freebayes raw genomic variant calls: | ||
| asserts: | ||
| has_text: | ||
| text: "freeBayes" | ||
| ANGSD report of nuclear contamination estimation: | ||
| asserts: | ||
| has_text: | ||
| text: "Method1_MOM_estimate" | ||
| Bcftools variant calling summary statistics report: | ||
| asserts: | ||
| has_text: | ||
| text: "ACT>TCGA" | ||
| Kraken2 taxonomic classification and microbial screening report: | ||
| asserts: | ||
| has_text: | ||
| text: "root" | ||
| MultiQC aggregated workflow summary report: | ||
| asserts: | ||
| has_text: | ||
| text: "MultiQC" | ||
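The test file above exercises only the `BWA` branch of the `Choose Mapper` switch. A second test entry for the other branch might look like the sketch below. It is an illustration rather than part of the PR: the selector value `Bowtie2` is an assumption (only `BWA` is confirmed by the existing test), the indentation is reconstructed because the diff view above has lost it, and the single size assertion is a placeholder for whichever outputs are worth checking.

```yaml
# Hypothetical second test entry, appended to adna-analysis-tests.yml.
# Assumption: the "Choose Mapper" parameter accepts the literal value "Bowtie2".
- doc: Test outline for adna-analysis.ga with the Bowtie2 mapper
  job:
    Input reads:
      class: File
      location: https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz
      filetype: fastqsanger.gz
    Reference genome:
      class: File
      location: https://github.com/nf-core/test-datasets/raw/eager/reference/Human/hs37d5_chr21-MT.fa.gz
      filetype: fasta.gz
    Hap Map ChrX reference:
      class: File
      location: https://github.com/ANGSD/angsd/raw/master/RES/HapMapChrX.gz
      filetype: gz
    Choose Mapper: Bowtie2
    Optional BED file for Sex.DetERRmine: null
    Kraken2 database directory: viral2019-03
  outputs:
    # Same minimal check as the BWA test: the final BAM exists and is non-trivial.
    Fully post-processed mapping results:
      asserts:
        has_size:
          min: 100
```

Reusing the inputs of the first test keeps the two entries comparable, so a failure in this one would point at the mapper branch rather than at the data.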
`nextflow run nf-core/eager -profile <docker/singularity/podman/conda/institute> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'` makes me think this is an odd default input. I would assume modern data is almost always going to be paired-end?
because of this comment i am using single-end input:
#1234 (comment)
you also ran the test, but i do not think it will find the kraken2 db. it will likely give an error. what should i do to make it work?
yeah, a paired-end version of the wf would also be good, but I don't know much about the state of the ancient dna field. Maybe that DNA is often so degraded that pe doesn't offer much benefit?
I would be happy with a single-end version for now (what about you, @mvdbeek?), but @mertydn, we can discuss tomorrow whether the wf shouldn't use an input collection (similar to the proposed change over in #1188).
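For reference, if the workflow did switch to an input collection, the test job would have to supply a collection instead of a single file. The fragment below is a minimal sketch of that, assuming a flat `list` collection of single-end reads. The element identifier `NIST7035` is made up for illustration, and which outputs would become collections depends on how the workflow is rewired, so the `element_tests` part is an assumption as well.

```yaml
# Hypothetical variant of the test job using a list collection as input.
- doc: Test outline for adna-analysis.ga with a collection input
  job:
    Input reads:
      class: Collection
      collection_type: list
      elements:
        - class: File
          identifier: NIST7035   # assumed sample name
          location: https://zenodo.org/records/20271974/files/NIST7035_TAAGGCGA_L001_R1_001_10MB.fastq.gz
          filetype: fastqsanger.gz
  outputs:
    # With a collection input, per-sample outputs are collections too,
    # so assertions are nested under the element identifier.
    EndorSpy endogenous DNA authentication report:
      element_tests:
        NIST7035:
          asserts:
            has_text:
              text: "percent_on_target"
```

The remaining job parameters (reference genome, mapper choice, Kraken2 database) would stay as in the existing test and are omitted here for brevity.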
i can add an input collection if it is preferred, but i would rather keep it simple for now if you both agree
I don't think we should merge things that aren't generally useful just because it's easy. Surely the pipeline can handle either if single end is really something the field uses, but claude thinks that is not the case:
@wm75 since most of the tools are not available on usegalaxy, i cannot share them there. i am sending the `.ga` file as an attachment: adna-analysis.zip
What do you mean by missing tools? I just uploaded your workflow to Galaxy Europe and it's just fine?
@wm75 you are right. i tried to import it to `usegalaxy.org`. that is why most of the tools appeared to be missing. it is fine on `usegalaxy.eu`.
but I'm also confused because that shared workflow doesn't have any Flatten Collection step anywhere now. Am I missing something or did you solve this problem differently already?
@wm75 i uploaded the wrong one. here is the current state: https://usegalaxy.eu/u/mert1907/w/ancient-dna-analysis