70 changes: 21 additions & 49 deletions AUDIT.md
@@ -19,10 +19,11 @@ web client (`web-client/`) that was added later for Vercel deployment.
### 2. What state is it in — does anything run end-to-end?

Backend runs end-to-end via `python tabvision-server/run.py` (Flask, port 5000).
Most recent eval: 91.6% mean Exact F1 across 11 hand-curated videos (per the
v0 11-video benchmark; details preserved on branch `agent-farm-improvements`
via the v0 dev history). The 20-video training set has lower vanilla-baseline
metrics (~0.43–0.51 mean Exact F1) — see §6 below.
Older v0 personal-video measurements are now invalidated: the 11-video eval
set and private training corpus used private recordings and the tab labels are
not trusted. They were removed from this repo on 2026-06-11; current acceptance
evidence must come from checked-in fixtures and license-checked public/offline
corpora.

Most recent in-flight work: Phase 1 audio fine-tune of Basic Pitch on
GuitarSet, on branch `feature/audio-finetune-phase1`. **Frozen mid-experiment**
@@ -57,15 +58,12 @@ filters tuned to specific failure cases.

| Asset | Path | Reuse target |
|---|---|---|
| 11 self-recorded eval videos + ground-truth | `test-data/existing/` + tabs files | Spec Phase 1.5 iPhone OOD bonus tier |
| 20 self-recorded training videos | `test-data/training-tabs/` (.txt tab files) + `test-data/existing/` | Phase 1.5 + Phase 7 |
| GuitarSet TFRecord splits | `tabvision-server/tools/outputs/tfrecords/guitarset/splits/{train,validation}/` | Phase 7 fine-tuning data (5 train players + 1 validation player) |
| Pretrained Basic Pitch weight loader | `tabvision-server/app/training/load_pretrained.py` | Phase 7 (verified equivalent 2026-04-29) |
| GuitarSet dataset wrapper | `tabvision-server/app/training/guitarset_dataset.py` | Phase 7 |
| Fine-tune training scripts | `tabvision-server/tools/finetune_basic_pitch_{smoke,modal}.py` | Phase 7 reference |
| Error analysis harness | `tabvision-server/tools/error_analysis.py` | Phase 8 (deterministic eval harness port) |
| Vanilla Basic Pitch baseline (20-video) | `tabvision-server/tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json` | Phase 1 / Phase 7 reference point |
| Benchmark history (baseline_v1..v3, tuning_v1..v13) | `tabvision-server/tests/fixtures/benchmarks/results/` | Reference; not directly ported |
| Public/offline eval corpora | `$TABVISION_DATA_ROOT/eval/` manifests; not committed | Phase 1.5 / v1.1 replacement source |
| 17 design docs in `docs/plans/` | `docs/plans/2026-01-* … 2026-05-*` | Context / cross-references in v1 design doc |

### 5. Branches with abandoned approaches worth revisiting?
@@ -131,12 +129,10 @@ filters tuned to specific failure cases.
- `eval_basic_pitch_baseline.py` — Phase 1 / Phase 7 baseline
- `finetune_basic_pitch_{smoke,modal}.py` — Phase 7 training
- `error_analysis.py` — Phase 8 harness
- `build_position_dataset.py`, `dump_position_features.py`, `train_position_selector.py` — NO_SHIP learned-fusion artifacts (preserve for documentation; not ported)

**Test fixtures** (`tabvision-server/tests/fixtures/`):
- `test_a440.mp4` — single A440 reference clip
- `benchmarks/index.json`, `baseline*.json`, `tuning_v*.json`, `training_baseline.json`, `sample-video-tabs.txt` — eval baselines
- `benchmarks/results/vanilla-baseline-2026-05-01.json` — most recent reference point
- `benchmarks/index.json`, `sample-video-tabs.txt` - legacy sample fixture only; personal-corpus benchmark results were removed.

### Frontend (`tabvision-client/` and `web-client/`)

@@ -173,14 +169,12 @@ case).
**Concretely verified:**

1. **End-to-end pipeline runs.** `python tabvision-server/run.py` starts Flask
on port 5000; `POST /jobs` → process video → `GET /jobs/:id/result` returns
a TabDocument JSON. Verified by the existence of v0 11-video eval results
averaging 91.6% Exact F1 (per design doc and v0 history).
2. **20-video benchmark harness produces results.** Most recent run:
`tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json` — 20
training-NN clips run through the current pipeline with full per-clip
metrics breakdown (exact / pitch / position / chord, plus per-error
classification).
on port 5000; `POST /jobs` -> process video -> `GET /jobs/:id/result` returns
a TabDocument JSON. The old personal-video eval results are retained only in
git history and no longer count as validation evidence.
2. **Historical personal-video benchmark harness was removed.** Those
measurements depended on inaccurate personal tab labels and no longer
count as reference evidence.
3. **Phase 1 audio fine-tune scaffolding works.** GuitarSet TFRecords built,
pretrained-weight loader verified equivalent to SavedModel (2026-04-29),
smoke trainer ran on 5 clips successfully. Five full fine-tune runs
@@ -250,8 +244,7 @@ case).
| GuitarSet PyTorch wrapper | `tabvision-server/app/training/guitarset_dataset.py` | Phase 7 |
| Fine-tune training scripts | `tabvision-server/tools/finetune_basic_pitch_{smoke,modal}.py` | Phase 7 reference |
| Error analysis harness | `tabvision-server/tools/error_analysis.py` | Phase 8 (port + harden) |
| 11 + 20 self-recorded videos + tabs | `test-data/{existing,training-tabs}/` | Phase 1.5 (iPhone OOD) |
| Benchmark JSONs | `tabvision-server/tests/fixtures/benchmarks/results/` | Reference baseline |
| Public/offline eval manifests | `$TABVISION_DATA_ROOT/eval/` | Phase 1.5 / v1.1 replacement source |
| Design docs | `docs/plans/2026-01-* … 2026-05-*` | Cross-reference in v1 designs |

---
@@ -261,40 +254,19 @@ case).
**Per SPEC.md §2.1: "Score the baseline pipeline output against the user's
reference annotation for that one clip. Record the metrics from §1.4."**

We have richer data than one clip: a recent 20-clip vanilla-baseline JSON
already exists (`tabvision-server/tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json`,
generated 2026-05-01).

### 20-clip training set (vanilla Basic Pitch + heuristic fusion, 2026-05-01)

Spot-check from per-clip F1s in the JSON (Exact F1 = string + fret + onset match):

| Clip | Exact F1 | Pitch F1 |
|---|---:|---:|
| training-01 | 0.42 | 0.68 |
| training-02 (estimate from JSON) | 0.32 | — |
| training-03 (estimate from JSON) | 0.86 | — |
| training-04 (estimate from JSON) | 0.57 | — |
| ... | ... | ... |

Mean Exact F1 across the 20-clip set is approximately **0.43–0.51** depending
on which harness alignment is used (corrected harness with `_find_best_time_offset`
vs original).

### 11-clip eval set (best v0 result, 2026-04-02)

Mean Exact F1: **0.916** across `sample-video, video-3, video-4, video-5,
video-6, video-7, video-8, video-9, video-10, video-11, video-12`. Per-clip
range 0.696 to 0.976.
The personal-video baseline JSONs and tab files were removed on 2026-06-11
because the labels are not trusted. Use the later public/fixture eval reports
(`docs/EVAL_REPORTS/v1_acceptance_2026-06-03.md` and v1.1 reports) for current
metrics.

### Spec §1.4 targets (for context)

| Metric | v0 status | Spec v1 target |
|---|---|---|
| Onset F1 (50 ms) | unmeasured for §1.4 definition | ≥ 0.92 |
| Pitch F1 (50 ms, no offset) | ~0.68–0.75 (20-clip), better on 11-clip | ≥ 0.90 |
| Tab F1 (string + fret + onset) | 0.43–0.51 (20-clip) / 0.916 (11-clip) | ≥ 0.88 |
| Chord-instance accuracy | low on 20-clip (many 0.0s) | ≥ 0.85 |
| Pitch F1 (50 ms, no offset) | current public/fixture reports only | >= 0.90 |
| Tab F1 (string + fret + onset) | current public/fixture reports only | >= 0.88 |
| Chord-instance accuracy | current public/fixture reports only | >= 0.85 |
| End-to-end latency (60 s clip on laptop CPU) | unmeasured | ≤ 5 min |

**Interpretation:** v0 already exceeds the aggregate Tab F1 target on the
5 changes: 3 additions & 2 deletions LICENSES.md
@@ -61,10 +61,11 @@ Phase 0 (this document) produces the initial map; Phase 9 verifies.
| Guitar-TECHS | Phase 0 (eval) / 1.5 / 7 | CC-BY-4.0 (Zenodo record 14963133) | ✅ eval-only | arXiv:2501.03720 — 3 electric guitarists, 5h12m multi-mic + DI; per-string 6-track MIDI. **Acquirer landed** (`scripts.acquire.datasets guitar-techs`, Zenodo API). **Scanner landed** (`manifest_builder.scan_guitar_techs` → `clean_electric` tier) — layout *inferred*, verify against first real download. Not redistributed here; required attribution must appear in the public README. |
| IDMT-SMT-Guitar | 1.5 / 7 | research-use, registration | ⚠️ | Training-only; not redistributed in our repo. Verified 2026-05-13 research pass; superseded by Guitar-TECHS for v1 acceptance — kept for potential future training augmentation. |
| EGDB | 1.5 / 7 / Phase 0 (eval) | **author-granted use (2026-06-01)** | ✅ eval-only | https://ss12f32v.github.io/Guitar-Transcription/ — 240 tracks, ~12h with multi-amp electric variants, GuitarPro tabs + aligned MIDI. **Access is open** — the audio is a public Google Drive folder linked from the project page; the *license* was the only gate (the repo has no LICENSE file → default all-rights-reserved). Author (`f08946011@ntu.edu.tw`) granted portfolio use 2026-06-01. **ACTION REQUIRED: save the grant email under `docs/` (e.g. `docs/licenses/egdb-grant-2026-06-01.eml`) and log it in `docs/DECISIONS.md` — the written grant is the only evidence the gate cleared (SPEC §1.4 hard rule).** Treated like GuitarSet: held-out distorted-electric eval source, **not redistributed** here and **not a shipped-weight substrate** unless the grant explicitly permits portfolio distribution. If the grant is research-only, it remains an eval gate only. |
| ~~GOAT~~ | DROPPED | request-only, research-only | ❌ | arXiv:2509.22655. Verified 2026-05-13: distribution gated per-use ("for research purposes only, upon request") due to copyrighted cover-song content. Not portfolio-compatible per SPEC §1.5; removed from the eval composite. |
| GAPS | v1.1 optional real-video/audio research eval | CC-BY-NC-SA-4.0 | ⚠️ eval-only | Zenodo 10.5281/zenodo.13962272. 14h of real classical guitar audio-score aligned pairs with high-resolution MIDI alignments and performance-video links. Do not commit or redistribute media; use only for offline research metrics with attribution, and keep NC data out of shipped weights/default artifacts. |
| ~~GOAT~~ | DROPPED from default pipeline; candidate only if access/license changes | request-only, license pending | ❌ | arXiv:2509.22655 / GOAT-Dataset. The paper describes DI electric guitar audio plus amp-rendered variants annotated with string/fret tablatures, but dataset access is by request and must be rechecked before any use. Not portfolio-compatible until explicit access and dataset license terms are saved. |
| ~~SynthTab~~ | DROPPED from default pipeline | dataset CC-BY-NC-4.0 (code CC-BY-4.0) | ❌ | github.com/yongyizang/SynthTab. Dataset NC clause taints derived weights (SynthTab paper treats trained models as derivative work). Not portfolio-compatible per SPEC §1.5; removed from the planned pretrain pipeline 2026-05-13. The repo code (Apache/CC-BY) remains MIT-style usable for our own renderers if needed. |
| DadaGP | research/dev only — **not in default pipeline** | access-by-email; underlying GP tabs derive from copyrighted songs | ⚠️ | https://github.com/dada-bots/dadaGP. Per 2026-05-13 design plan §4.2, acceptable as internal training augmentation only. Synthetic-source clips are blocked from non-train manifest splits by `tabvision.eval.manifest.validate_manifest` (the `SYNTHETIC_IN_EVAL_SPLIT` guard). |
| ~~User clips (the 20 self-recorded set)~~ | BANNED | self-owned | ⛔ | Banned from all roles per 2026-05-13 design plan D10 — not as accuracy gate, dev set, or label source. Replaced by the public-corpus composite. |
| ~~User clips (the private eval + training corpus)~~ | BANNED | self-owned | ⛔ | Banned from all roles per 2026-06-11 cleanup: not as accuracy gate, dev set, label source, or historical benchmark source. The tracked tabs and stale result artifacts were removed; replace with GuitarSet / Kaggle UT-Austin / GAPS-style offline public corpora depending on the eval tier. |
| Roboflow `b101/guitar-3` | 3 (training) | **CC BY 4.0** | ✅ | **Verified 2026-05-05.** Source: https://universe.roboflow.com/b101/guitar-3. Forked into Patrick's workspace as `patricks-workspace-vozcg/guitar-3-4efcd` v2; YOLOv8-OBB export downloaded (926 images, 710/144/72 split, classes: fret / neck / nut). License declared in the dataset's README.dataset.txt: "License: CC BY 4.0". Attribution: "guitar 3" by b101 on Roboflow Universe (https://universe.roboflow.com/b101/guitar-3), CC BY 4.0; export downloaded May 5, 2026 via the Roboflow SDK. **Required attribution must appear in the public README and any blog post.** |

## Library dependencies (default pipeline)
30 changes: 15 additions & 15 deletions SPEC.md
@@ -367,7 +367,7 @@ tabvision/
├── data/
│ ├── README.md # how to acquire (do not commit large files)
│ ├── fixtures/ # tiny clips checked in for unit tests
│ ├── eval/ # held-out user-recorded clips + annotations
│ ├── eval/ # public/offline eval manifests + annotations
│ └── augmented/ # generated; .gitignored
├── scripts/
│ ├── acquire/ # one script per dataset/model in §6
@@ -579,12 +579,12 @@ flowchart TD
5. `tabvision.render.ascii` producing readable ASCII tab.
6. `tabvision` CLI: `tabvision transcribe input.mov -o output.tab`.
7. End-to-end integration test on a fixture clip.
8. First eval-harness run on user clips - record metrics, **even if bad**.
8. First eval-harness run on public/offline clips - record metrics, **even if bad**.

**Acceptance test:**
- `tabvision transcribe data/fixtures/scale_clean.wav` produces non-empty ASCII tab.
- Integration test passes.
- Eval harness reports **any** numbers on at least 3 user clips. Metrics logged to `docs/EVAL_REPORTS/phase1_<date>.md`.
- Eval harness reports **any** numbers on at least 3 public/offline clips. Metrics logged to `docs/EVAL_REPORTS/phase1_<date>.md`.

**Acceptance does NOT require** good metrics. The point is the harness, not the score.

@@ -605,15 +605,15 @@ flowchart TD

### Phase 1.5 — Annotation tool & eval set

**Goal:** Build the user-recorded eval set so subsequent phases have something to optimize against.
**Goal:** Build a public/offline eval manifest so subsequent phases have something to optimize against without private home-video labels.

**Deliverables:**
1. `scripts/annotate/cli.py` — interactive annotator. Plays clip, shows candidate onsets, lets user correct pitch/string/fret per onset, writes `.jams`.
2. 15 user-recorded clips with annotations under `$TABVISION_DATA_ROOT/eval/`.
3. Manifest `data/eval/manifest.toml` listing clips + checksums + difficulty tier (easy/med/hard).
2. Public/offline corpus adapters with annotations under `$TABVISION_DATA_ROOT/eval/`.
3. Manifest `data/eval/manifest.toml` listing clips + checksums + source, license, split, and difficulty tier (easy/med/hard).

**Acceptance test:**
- 15 annotated clips exist.
- Public/offline annotated clips exist for each enabled eval tier.
- `pytest -m eval` runs end-to-end.
- A baseline metrics report is generated and checked in.
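
A minimal sketch of what a manifest check could look like, assuming the manifest has already been parsed into a dict (for example with the standard-library `tomllib`) and that each clip is an entry in a `clips` array. The key names `clips`, `path`, `sha256`, `source`, `license`, `split`, and `tier`, and the function `check_manifest`, are illustrative assumptions; they are not taken from the spec or from `tabvision-eval`.

```python
# Hypothetical field names; the spec only says the manifest lists clips,
# checksums, source, license, split, and difficulty tier.
REQUIRED_FIELDS = ("path", "sha256", "source", "license", "split", "tier")
ALLOWED_TIERS = {"easy", "med", "hard"}


def check_manifest(manifest: dict) -> list:
    """Return a list of problems; an empty list means the manifest passed."""
    clips = manifest.get("clips", [])
    if not clips:
        return ["manifest lists no clips"]
    problems = []
    for index, clip in enumerate(clips):
        # Fall back to the array position when the clip has no path to name it by.
        label = clip.get("path") or f"clips[{index}]"
        for field in REQUIRED_FIELDS:
            if not clip.get(field):
                problems.append(f"{label}: missing {field}")
        tier = clip.get("tier")
        if tier and tier not in ALLOWED_TIERS:
            problems.append(f"{label}: unknown tier {tier!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a report show every missing license or checksum in a single run.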

@@ -777,7 +777,7 @@ flowchart TD
5. Tunable mixing weight λ_v between audio and vision evidence (CLI: `--fusion-lambda-vision 1.0`).

**Acceptance test:**
- On user eval set, **Tab F1 ≥ 0.85** (target: 0.88 by Phase 9).
- On enabled public/offline eval set, **Tab F1 >= 0.85** (target: 0.88 by Phase 9).
- Chord-instance accuracy ≥ 0.80 (target: 0.85 by Phase 9).
- Ablation report: audio-only vs. audio+vision. The audio+vision configuration must beat audio-only by **≥ 8 points** on Tab F1.
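
The three acceptance conditions above can be sketched as a single gate function. This is an illustration only: the function name `phase5_gate` and its parameters are hypothetical, and it assumes metrics are stored on a 0 to 1 scale so that "8 points" means a difference of 0.08.

```python
def phase5_gate(tab_f1_audio_only, tab_f1_fused, chord_accuracy,
                min_tab_f1=0.85, min_chord_accuracy=0.80, min_gain_points=8.0):
    """Evaluate each acceptance condition separately and return the results."""
    # Convert the 0-1 F1 difference into percentage points (assumed convention).
    gain_points = (tab_f1_fused - tab_f1_audio_only) * 100.0
    return {
        "tab_f1": tab_f1_fused >= min_tab_f1,
        "chord_accuracy": chord_accuracy >= min_chord_accuracy,
        "vision_gain": gain_points >= min_gain_points,
    }


# A fused run that clears the F1 bar but gains only about 6 points over audio-only.
result = phase5_gate(tab_f1_audio_only=0.80, tab_f1_fused=0.86, chord_accuracy=0.82)
print(result)
```

Keeping the three checks as separate keys makes it visible which condition failed, which matters for the ablation requirement since a run can pass the absolute Tab F1 bar and still fail on vision gain.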

@@ -817,8 +817,8 @@ flowchart TD

**Acceptance test:**
- All four formats round-trip on snapshot fixtures.
- A user-recorded clip's output `.gp5` opens correctly in TuxGuitar (free) and Guitar Pro 7 if available.
- A user-recorded clip's output `.mid` plays back at correct pitches and timing in any standard MIDI player.
- A fixture or public/offline eval clip's output `.gp5` opens correctly in TuxGuitar (free) and Guitar Pro 7 if available.
- A fixture or public/offline eval clip's output `.mid` plays back at correct pitches and timing in any standard MIDI player.

**Decision tree:**

@@ -848,16 +848,16 @@ flowchart TD
- **Distorted-electric oversampling.** Distorted variants are rendered with multiple amp + cab IR pairs and oversampled in fine-tune batches relative to their share of the source set, since the §1.4 distorted-electric tier target is the most likely to be the long-pole.

2. **Video augmentation pipeline** (`scripts/augment/video.py`):
- Source: user-recorded clips (eval set + any additional training clips the user has recorded).
- Source: license-checked public/offline video corpora only; personal recordings are not an accuracy gate, dev set, or label source.
- Augmentations: color jitter, brightness/contrast, hue shift, slight perspective warp (±5°), slight rotation, motion blur, simulated lens distortion, time-domain crops, frame dropout to simulate FPS variation.
- Aligned annotations preserved (transformed where appropriate — e.g., perspective warp updates the homography ground truth so the fretboard regressor sees consistent labels).

3. **Fine-tuning recipes** (`scripts/train/audio_finetune.py`, `scripts/train/hand_finetune.py`):
- Audio: fine-tune the chosen audio backend on the augmented DadaGP set + user clips. Distorted-electric variants oversampled. Run on Kaggle T4 (or Lightning Studios for longer jobs).
- Vision: fine-tune the fingertip-to-fret head on augmented user clips. The fretboard detector and guitar detector are normally good after Phase 3 and rarely need refining in Phase 7; if Phase 5 fusion exposes systematic vision errors, retrain those too.
- Audio: fine-tune the chosen audio backend on the augmented DadaGP set plus license-checked public/offline corpora. Distorted-electric variants oversampled. Run on Kaggle T4 (or Lightning Studios for longer jobs).
- Vision: fine-tune the fingertip-to-fret head on augmented public/offline video corpora. The fretboard detector and guitar detector are normally good after Phase 3 and rarely need refining in Phase 7; if Phase 5 fusion exposes systematic vision errors, retrain those too.

4. **Self-labeling loop** (`scripts/train/self_label.py`):
- Run pipeline on a corpus of unlabeled user clips → pipeline emits TabEvents with confidences.
- Run pipeline on a corpus of unlabeled public/offline clips -> pipeline emits TabEvents with confidences.
- High-confidence outputs (per-note confidence > 0.9) are auto-accepted as pseudo-labels.
- Low-confidence outputs are flagged for the user via the Phase 1.5 annotator; user corrects.
- Both auto-labels and corrected-labels are added to the training set with appropriate weights (corrected: 1.0, auto: 0.5).
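
The routing rule in the self-labeling loop can be sketched as follows. The thresholds and weights come from the bullets above; the `TabEvent` dataclass here is a hypothetical minimal stand-in, not the project's actual type, and `route_events` is an illustrative name rather than anything in `scripts/train/self_label.py`.

```python
from dataclasses import dataclass

AUTO_ACCEPT_THRESHOLD = 0.9  # per-note confidence must be strictly above this
CORRECTED_WEIGHT = 1.0       # training weight for user-corrected labels
AUTO_WEIGHT = 0.5            # training weight for auto-accepted pseudo-labels


@dataclass(frozen=True)
class TabEvent:
    """Hypothetical minimal event shape, for illustration only."""
    onset_s: float
    string: int
    fret: int
    confidence: float


def route_events(events):
    """Split events into weighted pseudo-labels and a review queue."""
    auto_labels, review_queue = [], []
    for event in events:
        if event.confidence > AUTO_ACCEPT_THRESHOLD:
            auto_labels.append((event, AUTO_WEIGHT))
        else:
            # Goes to the annotator; re-enters training at CORRECTED_WEIGHT after correction.
            review_queue.append(event)
    return auto_labels, review_queue
```

An event at exactly 0.9 goes to review here because the rule is written as "> 0.9"; if the intended rule is inclusive, the comparison would need to change.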
@@ -1109,7 +1109,7 @@ A single source of truth so Claude Code does not drift:
|---|---|---|
| 0 | `AUDIT.md` exists; CI green; `LICENSES.md` initial map | `make ci` |
| 1 | `docs/EVAL_REPORTS/phase1_*.md` exists | `pytest -m integration && pytest -m eval -k phase1` |
| 1.5 | 15 user clips in manifest, all 4 tiers represented | `tabvision-eval --manifest data/eval/manifest.toml --check` |
| 1.5 | Public/offline clips in manifest, enabled tiers represented | `tabvision-eval --manifest data/eval/manifest.toml --check` |
| 2 | Highres beats baseline ≥ 5 pts Pitch F1 | `pytest -m eval -k phase2 --compare baseline` |
| 3 | Guitar IoU ≥ 0.95; preflight ≥ 9/10; homography median error < 5 px | `pytest -m guitar_eval -m preflight_eval -m fretboard_eval` |
| 4 | Fingertip top-1 ≥ 0.75 on 100 labeled frames | `pytest -m hand_eval` |