galaxyproject · nekrut · May 27, 2026
diff --git a/workflows/comparative_genomics/pggb-pangenome-build/.dockstore.yml b/workflows/comparative_genomics/pggb-pangenome-build/.dockstore.yml
@@ -0,0 +1,11 @@
+version: 1.2
+workflows:
+- name: pggb-pangenome-build
+  subclass: Galaxy
+  publish: true
+  primaryDescriptorPath: /pggb-pangenome-build.ga
+  testParameterFiles:
+  - /pggb-pangenome-build-tests.yml
+  authors:
+  - name: Anton Nekrutenko
+    orcid: 0000-0002-5987-8032
diff --git a/workflows/comparative_genomics/pggb-pangenome-build/.workflowhub.yml b/workflows/comparative_genomics/pggb-pangenome-build/.workflowhub.yml
@@ -0,0 +1,5 @@
+version: '0.1'
+registries:
+- url: https://workflowhub.eu
+  project: iwc
+  workflow: comparative_genomics/pggb-pangenome-build/pggb-pangenome-build
diff --git a/workflows/comparative_genomics/pggb-pangenome-build/CHANGELOG.md b/workflows/comparative_genomics/pggb-pangenome-build/CHANGELOG.md
@@ -0,0 +1,18 @@
+# Changelog
+
+## [0.1] - 2026-05-27
+
+### Added
+
+- Initial release of the PGGB pangenome build workflow.
+- Steps: PanSN rename (mapped over a strain FASTA collection)
+  → FASTA collection concat → PGGB → odgi stats.
+- Outputs the smoothed GFA1 (gzipped), odgi `.og` binary graph,
+  2D layout (`.og.lay`), layout/viz PNGs, run log, optional VCF
+  via `vg deconstruct`, graph stats TSV, and a MultiQC HTML report
+  collating odgi stats + viz across the build.
+- Validated end-to-end on the 8-strain *Plasmodium vivax* v3
+  reference pangenome (~25 Mb haploid genomes; runtime ~28 min
+  on 16 cores); graph nucleotide length within 2.7 % of the v2
+  native reference build. Also exercised on PGGB's own
+  DRB1-3123 CI fixture (12 HLA haplotypes, 50 KB).
diff --git a/workflows/comparative_genomics/pggb-pangenome-build/README.md b/workflows/comparative_genomics/pggb-pangenome-build/README.md
@@ -0,0 +1,90 @@
+# PGGB pangenome build
+
+Galaxy workflow that constructs a pangenome variation graph from per-strain
+assembly FASTAs using the [PGGB](https://github.com/pangenome/pggb) pipeline
+(wfmash → seqwish → smoothxg → gfaffix → odgi) and emits an interactive
+MultiQC report summarising the result.
+
+## Inputs
+
+| Step | Param | Type | Default | Notes |
+|------|-------|------|--------:|-------|
+| 0 | Strain FASTAs | `list` collection | — | One FASTA per haplotype; element identifier becomes the PanSN `SAMPLE` prefix. Plain `.fa` or gzipped `.fa.gz`. |
+| 1 | `n_haplotypes` | integer | 8 | Total number of haplotypes; for haploid panels this equals the collection size. |
+| 2 | `segment_length` | integer | 5000 | wfmash segment length (`-s`). 5000 fits ~25 Mb apicomplexan-scale genomes; raise to 10000 for >100 Mb genomes. |
+| 3 | `map_pct_id` | float | 90.0 | wfmash identity threshold. 90 for intra-species panels; drop to 70–80 for inter-species. |
+| 4 | `min_match_len` | integer | 23 | seqwish `-k`: exact matches shorter than this are ignored during graph induction. Lower admits more short matches; higher filters more of them out. |
+| 5 | `vcf_spec` | text | `""` | Reference accession prefix for `vg deconstruct`. Blank skips VCF emission. |
+
+## Steps
+
+```
+input: list collection of strain FASTAs
+        │
+        ▼
+   PanSN rename (mapped over collection)
+        │                     ─► per-strain renamed FASTA (SAMPLE#HAP#contig)
+        ▼
+   FASTA collection concat    ─► single PanSN-named multifasta
+        │
+        ▼
+   PGGB                       ─► canonical PGGB pipeline (wfmash + seqwish
+        │                        + smoothxg + gfaffix + odgi). MultiQC
+        │                        report enabled by default.
+        ▼
+   odgi stats                 ─► tabular graph metrics on the final .og
+```
+
+## Outputs
+
+- **smoothed GFA** — final canonical GFA1 (gzipped)
+- **odgi .og** — succinct binary graph (input for downstream odgi queries)
+- **layout (.og.lay)** — 2D graph layout
+- **layout PNG** — 2D layout rendered
+- **viz PNG** — 1D path-coloured visualisation
+- **pggb log** — full run log with all parameter hashes
+- **MultiQC report** — interactive HTML collating odgi stats + viz across
+  both the seqwish-induced intermediate and final smoothed graphs
+- **deconstruct VCF** (optional) — only when `vcf_spec` is set
+- **graph stats** — tab-delimited length / nodes / edges / paths / steps
+
+## Recommended use
+
+The workflow targets apicomplexan-scale haploid panels (5–15 strains,
+~25 Mb each). Parameter guidance by panel type:
+
+- *Closely related strains, intra-species*: the defaults (`-p 90 -s 5000 -k 23`)
+  are appropriate. Validated on *P. vivax* (8 strains) and tested on PGGB's
+  DRB1-3123 fixture (12 HLA haplotypes).
+- *Inter-species panel* (e.g., *P. vivax* / *P. cynomolgi* / *P. knowlesi*
+  mix at ~10–15 % divergence): drop `map_pct_id` to 80, keep
+  `segment_length` at 5000.
+- *Larger genomes (>100 Mb)*: raise `segment_length` to 10000 to keep
+  wfmash memory in check.
+
+## Resource notes
+
+PGGB is the dominant compute step. For 8 × ~25 Mb haploid genomes at
+default parameters: ~30 min wall on a 32-core box; ~3 GB peak RAM. The
+`poa_length_target` parameter (currently fixed at the wrapper default
+`700,1100`) dominates runtime: setting it to `4001,4507` (the older pggb 0.6
+default that the v3 *P. vivax* reference build used) makes the graph ~50 %
+more collapsed but ~3× slower.
+
+## Author
+
+- Anton Nekrutenko ([@nekrut](https://github.com/nekrut),
+  [ORCID 0000-0002-5987-8032](https://orcid.org/0000-0002-5987-8032))
+
+## License
+
+MIT
+
+## Citations
+
+- PGGB / smoothxg: Garrison et al., *Nat. Methods* 2024 ([doi:10.1038/s41592-024-02430-3](https://doi.org/10.1038/s41592-024-02430-3))
+- wfmash: Guarracino et al., *Bioinformatics* 2024 ([doi:10.1093/bioinformatics/btae155](https://doi.org/10.1093/bioinformatics/btae155))
+- seqwish: Garrison & Guarracino, *Bioinformatics* 2023 ([doi:10.1093/bioinformatics/btac743](https://doi.org/10.1093/bioinformatics/btac743))
+- odgi: Guarracino et al., *Bioinformatics* 2022 ([doi:10.1093/bioinformatics/btac308](https://doi.org/10.1093/bioinformatics/btac308))
+- gfaffix: Steiner et al., *Bioinformatics* 2023 ([doi:10.1093/bioinformatics/btac788](https://doi.org/10.1093/bioinformatics/btac788))
+- PanSN-spec: <https://github.com/pangenome/PanSN-spec>
diff --git a/workflows/comparative_genomics/pggb-pangenome-build/pggb-pangenome-build-tests.yml b/workflows/comparative_genomics/pggb-pangenome-build/pggb-pangenome-build-tests.yml
@@ -0,0 +1,42 @@
+- doc: |
+    PGGB pangenome build smoke test. Three small PanSN-named FASTAs concatenated
+    via fasta_concat, fed to pggb, with odgi_stats on the resulting .og.
+  job:
+    Strain FASTAs:
+      class: Collection
+      collection_type: list
+      elements:
+      - identifier: PvA
+        class: File
+        path: test-data/PvA.fa
+        filetype: fasta
+      - identifier: PvB
+        class: File
+        path: test-data/PvB.fa
+        filetype: fasta
+      - identifier: PvC
+        class: File
+        path: test-data/PvC.fa
+        filetype: fasta
+    n_haplotypes: 3
+    segment_length: 500
+    map_pct_id: 90.0
+    min_match_len: 11
+    vcf_spec: ""
+  outputs:
+    smoothed GFA:
+      asserts:
+        has_size:
+          min: 100
+    odgi .og:
+      asserts:
+        has_size:
+          min: 100
+    graph stats:
+      asserts:
+        has_text:
+          text: "length"
+    MultiQC report:
+      asserts:
+        has_text:
+          text: "MultiQC"
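
The README's "PanSN rename" step turns each strain's contig names into `SAMPLE#HAP#contig`, with the collection element identifier as `SAMPLE`. The sketch below illustrates that naming scheme in plain Python. It is not the Galaxy tool's implementation: the function name `pansn_rename`, the choice to default the haplotype to `1` for haploid panels, and the choice to drop any header description after the contig ID are all assumptions made for illustration.

```python
def pansn_rename(fasta_text: str, sample: str, haplotype: int = 1, sep: str = "#") -> str:
    """Prefix every FASTA header with SAMPLE#HAP# (PanSN-style naming).

    Hypothetical helper for illustration only; the haplotype default of 1
    assumes a haploid panel.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep only the contig ID (first whitespace-delimited token);
            # dropping the rest of the header is an assumption of this sketch.
            fields = line[1:].split()
            contig = fields[0] if fields else ""
            out.append(f">{sample}{sep}{haplotype}{sep}{contig}")
        else:
            # Sequence lines pass through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


renamed = pansn_rename(">chr01 some description\nACGT\n", "PvA")
print(renamed, end="")
```

With the test identifier `PvA`, a header such as `>chr01 some description` would become `>PvA#1#chr01`, which is the form the concatenated multifasta is expected to carry into PGGB.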