pgil256 · Jun 18, 2026 · Jun 11, 2026
diff --git a/AUDIT.md b/AUDIT.md
@@ -19,10 +19,11 @@ web client (`web-client/`) that was added later for Vercel deployment.
 ### 2. What state is it in — does anything run end-to-end?
 
 Backend runs end-to-end via `python tabvision-server/run.py` (Flask, port 5000).
-Most recent eval: 91.6% mean Exact F1 across 11 hand-curated videos (per the
-v0 11-video benchmark; details preserved on branch `agent-farm-improvements`
-via the v0 dev history). The 20-video training set has lower vanilla-baseline
-metrics (~0.43–0.51 mean Exact F1) — see §6 below.
+Older v0 personal-video measurements are now invalidated: the 11-video eval
+set and the private training corpus were private recordings whose tab labels
+are not trusted. Both were removed from this repo on 2026-06-11; current
+acceptance evidence must come from checked-in fixtures and license-checked
+public/offline corpora.
 
 Most recent in-flight work: Phase 1 audio fine-tune of Basic Pitch on
 GuitarSet, on branch `feature/audio-finetune-phase1`. **Frozen mid-experiment**
@@ -57,15 +58,12 @@ filters tuned to specific failure cases.
 
 | Asset | Path | Reuse target |
 |---|---|---|
-| 11 self-recorded eval videos + ground-truth | `test-data/existing/` + tabs files | Spec Phase 1.5 iPhone OOD bonus tier |
-| 20 self-recorded training videos | `test-data/training-tabs/` (.txt tab files) + `test-data/existing/` | Phase 1.5 + Phase 7 |
 | GuitarSet TFRecord splits | `tabvision-server/tools/outputs/tfrecords/guitarset/splits/{train,validation}/` | Phase 7 fine-tuning data (5 train players + 1 validation player) |
 | Pretrained Basic Pitch weight loader | `tabvision-server/app/training/load_pretrained.py` | Phase 7 (verified equivalent 2026-04-29) |
 | GuitarSet dataset wrapper | `tabvision-server/app/training/guitarset_dataset.py` | Phase 7 |
 | Fine-tune training scripts | `tabvision-server/tools/finetune_basic_pitch_{smoke,modal}.py` | Phase 7 reference |
 | Error analysis harness | `tabvision-server/tools/error_analysis.py` | Phase 8 (deterministic eval harness port) |
-| Vanilla Basic Pitch baseline (20-video) | `tabvision-server/tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json` | Phase 1 / Phase 7 reference point |
-| Benchmark history (baseline_v1..v3, tuning_v1..v13) | `tabvision-server/tests/fixtures/benchmarks/results/` | Reference; not directly ported |
+| Public/offline eval corpora | `$TABVISION_DATA_ROOT/eval/` manifests; not committed | Phase 1.5 / v1.1 replacement source |
 | 17 design docs in `docs/plans/` | `docs/plans/2026-01-* … 2026-05-*` | Context / cross-references in v1 design doc |
 
 ### 5. Branches with abandoned approaches worth revisiting?
@@ -131,12 +129,10 @@ filters tuned to specific failure cases.
 - `eval_basic_pitch_baseline.py` — Phase 1 / Phase 7 baseline
 - `finetune_basic_pitch_{smoke,modal}.py` — Phase 7 training
 - `error_analysis.py` — Phase 8 harness
-- `build_position_dataset.py`, `dump_position_features.py`, `train_position_selector.py` — NO_SHIP learned-fusion artifacts (preserve for documentation; not ported)
 
 **Test fixtures** (`tabvision-server/tests/fixtures/`):
 - `test_a440.mp4` — single A440 reference clip
-- `benchmarks/index.json`, `baseline*.json`, `tuning_v*.json`, `training_baseline.json`, `sample-video-tabs.txt` — eval baselines
-- `benchmarks/results/vanilla-baseline-2026-05-01.json` — most recent reference point
+- `benchmarks/index.json`, `sample-video-tabs.txt` - legacy sample fixture only; personal-corpus benchmark results were removed.
 
 ### Frontend (`tabvision-client/` and `web-client/`)
 
@@ -173,14 +169,12 @@ case).
 **Concretely verified:**
 
 1. **End-to-end pipeline runs.** `python tabvision-server/run.py` starts Flask
-   on port 5000; `POST /jobs` → process video → `GET /jobs/:id/result` returns
-   a TabDocument JSON. Verified by the existence of v0 11-video eval results
-   averaging 91.6% Exact F1 (per design doc and v0 history).
-2. **20-video benchmark harness produces results.** Most recent run:
-   `tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json` — 20
-   training-NN clips run through the current pipeline with full per-clip
-   metrics breakdown (exact / pitch / position / chord, plus per-error
-   classification).
+   on port 5000; `POST /jobs` → process video → `GET /jobs/:id/result` returns
+   a TabDocument JSON. The old personal-video eval results are retained only in
+   git history and no longer count as validation evidence.
+2. **Historical personal-video benchmark harness was removed.** Those
+   measurements depended on inaccurate personal tab labels and no longer
+   count as reference evidence.
 3. **Phase 1 audio fine-tune scaffolding works.** GuitarSet TFRecords built,
    pretrained-weight loader verified equivalent to SavedModel (2026-04-29),
    smoke trainer ran on 5 clips successfully. Five full fine-tune runs
@@ -250,8 +244,7 @@ case).
 | GuitarSet PyTorch wrapper | `tabvision-server/app/training/guitarset_dataset.py` | Phase 7 |
 | Fine-tune training scripts | `tabvision-server/tools/finetune_basic_pitch_{smoke,modal}.py` | Phase 7 reference |
 | Error analysis harness | `tabvision-server/tools/error_analysis.py` | Phase 8 (port + harden) |
-| 11 + 20 self-recorded videos + tabs | `test-data/{existing,training-tabs}/` | Phase 1.5 (iPhone OOD) |
-| Benchmark JSONs | `tabvision-server/tests/fixtures/benchmarks/results/` | Reference baseline |
+| Public/offline eval manifests | `$TABVISION_DATA_ROOT/eval/` | Phase 1.5 / v1.1 replacement source |
 | Design docs | `docs/plans/2026-01-* … 2026-05-*` | Cross-reference in v1 designs |
 
 ---
@@ -261,40 +254,19 @@ case).
 **Per SPEC.md §2.1: "Score the baseline pipeline output against the user's
 reference annotation for that one clip. Record the metrics from §1.4."**
 
-We have richer data than one clip: a recent 20-clip vanilla-baseline JSON
-already exists (`tabvision-server/tests/fixtures/benchmarks/results/vanilla-baseline-2026-05-01.json`,
-generated 2026-05-01).
-
-### 20-clip training set (vanilla Basic Pitch + heuristic fusion, 2026-05-01)
-
-Spot-check from per-clip F1s in the JSON (Exact F1 = string + fret + onset match):
-
-| Clip | Exact F1 | Pitch F1 |
-|---|---:|---:|
-| training-01 | 0.42 | 0.68 |
-| training-02 (estimate from JSON) | 0.32 | — |
-| training-03 (estimate from JSON) | 0.86 | — |
-| training-04 (estimate from JSON) | 0.57 | — |
-| ... | ... | ... |
-
-Mean Exact F1 across the 20-clip set is approximately **0.43–0.51** depending
-on which harness alignment is used (corrected harness with `_find_best_time_offset`
-vs original).
-
-### 11-clip eval set (best v0 result, 2026-04-02)
-
-Mean Exact F1: **0.916** across `sample-video, video-3, video-4, video-5,
-video-6, video-7, video-8, video-9, video-10, video-11, video-12`. Per-clip
-range 0.696 to 0.976.
+The personal-video baseline JSONs and tab files were removed on 2026-06-11
+because the labels are not trusted. Use the later public/fixture eval reports
+(`docs/EVAL_REPORTS/v1_acceptance_2026-06-03.md` and v1.1 reports) for current
+metrics.
 
 ### Spec §1.4 targets (for context)
 
 | Metric | v0 status | Spec v1 target |
 |---|---|---|
 | Onset F1 (50 ms) | unmeasured for §1.4 definition | ≥ 0.92 |
-| Pitch F1 (50 ms, no offset) | ~0.68–0.75 (20-clip), better on 11-clip | ≥ 0.90 |
-| Tab F1 (string + fret + onset) | 0.43–0.51 (20-clip) / 0.916 (11-clip) | ≥ 0.88 |
-| Chord-instance accuracy | low on 20-clip (many 0.0s) | ≥ 0.85 |
+| Pitch F1 (50 ms, no offset) | current public/fixture reports only | ≥ 0.90 |
+| Tab F1 (string + fret + onset) | current public/fixture reports only | ≥ 0.88 |
+| Chord-instance accuracy | current public/fixture reports only | ≥ 0.85 |
 | End-to-end latency (60 s clip on laptop CPU) | unmeasured | ≤ 5 min |
 
 **Interpretation:** v0 already exceeds the aggregate Tab F1 target on the

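Review note on the AUDIT.md §1.4 table: the Tab F1 row counts a predicted note as correct only when string, fret, and onset all agree with the reference. The sketch below is one plausible way to compute it, not the project's harness. The 50 ms tolerance is borrowed from the Onset F1 row and is an assumption for Tab F1; the greedy nearest-onset matching and the names `TabNote` and `tab_f1` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TabNote:
    onset_s: float  # onset time in seconds
    string: int     # string number (numbering convention is an assumption)
    fret: int       # 0 = open string


def tab_f1(reference, predicted, tolerance_s=0.05):
    """F1 over notes that match on string, fret, and onset within tolerance_s."""
    unmatched = list(predicted)
    true_pos = 0
    for ref in sorted(reference, key=lambda n: n.onset_s):
        # Candidates must agree on string and fret and fall inside the onset window.
        candidates = [
            p for p in unmatched
            if p.string == ref.string
            and p.fret == ref.fret
            and abs(p.onset_s - ref.onset_s) <= tolerance_s
        ]
        if candidates:
            # Greedy choice: consume the closest prediction so it cannot match twice.
            best = min(candidates, key=lambda p: abs(p.onset_s - ref.onset_s))
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With three reference notes and three predictions of which one has the wrong fret and one is 200 ms late, only one note matches, so precision and recall are both 1/3.
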
diff --git a/LICENSES.md b/LICENSES.md
@@ -61,10 +61,11 @@ Phase 0 (this document) produces the initial map; Phase 9 verifies.
 | Guitar-TECHS | Phase 0 (eval) / 1.5 / 7 | CC-BY-4.0 (Zenodo record 14963133) | ✅ eval-only | arXiv:2501.03720 — 3 electric guitarists, 5h12m multi-mic + DI; per-string 6-track MIDI. **Acquirer landed** (`scripts.acquire.datasets guitar-techs`, Zenodo API). **Scanner landed** (`manifest_builder.scan_guitar_techs` → `clean_electric` tier) — layout *inferred*, verify against first real download. Not redistributed here; required attribution must appear in the public README. |
 | IDMT-SMT-Guitar | 1.5 / 7 | research-use, registration | ⚠️ | Training-only; not redistributed in our repo. Verified 2026-05-13 research pass; superseded by Guitar-TECHS for v1 acceptance — kept for potential future training augmentation. |
 | EGDB | 1.5 / 7 / Phase 0 (eval) | **author-granted use (2026-06-01)** | ✅ eval-only | https://ss12f32v.github.io/Guitar-Transcription/ — 240 tracks, ~12h with multi-amp electric variants, GuitarPro tabs + aligned MIDI. **Access is open** — the audio is a public Google Drive folder linked from the project page; the *license* was the only gate (the repo has no LICENSE file → default all-rights-reserved). Author (`f08946011@ntu.edu.tw`) granted portfolio use 2026-06-01. **ACTION REQUIRED: save the grant email under `docs/` (e.g. `docs/licenses/egdb-grant-2026-06-01.eml`) and log it in `docs/DECISIONS.md` — the written grant is the only evidence the gate cleared (SPEC §1.4 hard rule).** Treated like GuitarSet: held-out distorted-electric eval source, **not redistributed** here and **not a shipped-weight substrate** unless the grant explicitly permits portfolio distribution. If the grant is research-only, it remains an eval gate only. |
-| ~~GOAT~~ | DROPPED | request-only, research-only | ❌ | arXiv:2509.22655. Verified 2026-05-13: distribution gated per-use ("for research purposes only, upon request") due to copyrighted cover-song content. Not portfolio-compatible per SPEC §1.5; removed from the eval composite. |
+| GAPS | v1.1 optional real-video/audio research eval | CC-BY-NC-SA-4.0 | ⚠️ eval-only | Zenodo 10.5281/zenodo.13962272. 14h of real classical guitar audio-score aligned pairs with high-resolution MIDI alignments and performance-video links. Do not commit or redistribute media; use only for offline research metrics with attribution, and keep NC data out of shipped weights/default artifacts. |
+| ~~GOAT~~ | DROPPED from default pipeline; candidate only if access/license changes | request-only, license pending | ❌ | arXiv:2509.22655 / GOAT-Dataset. The paper describes DI electric guitar audio plus amp-rendered variants annotated with string/fret tablatures, but dataset access is by request and must be rechecked before any use. Not portfolio-compatible until explicit access and dataset license terms are saved. |
 | ~~SynthTab~~ | DROPPED from default pipeline | dataset CC-BY-NC-4.0 (code CC-BY-4.0) | ❌ | github.com/yongyizang/SynthTab. Dataset NC clause taints derived weights (SynthTab paper treats trained models as derivative work). Not portfolio-compatible per SPEC §1.5; removed from the planned pretrain pipeline 2026-05-13. The repo code (Apache/CC-BY) remains MIT-style usable for our own renderers if needed. |
 | DadaGP | research/dev only — **not in default pipeline** | access-by-email; underlying GP tabs derive from copyrighted songs | ⚠️ | https://github.com/dada-bots/dadaGP. Per 2026-05-13 design plan §4.2, acceptable as internal training augmentation only. Synthetic-source clips are blocked from non-train manifest splits by `tabvision.eval.manifest.validate_manifest` (the `SYNTHETIC_IN_EVAL_SPLIT` guard). |
-| ~~User clips (the 20 self-recorded set)~~ | BANNED | self-owned | ⛔ | Banned from all roles per 2026-05-13 design plan D10 — not as accuracy gate, dev set, or label source. Replaced by the public-corpus composite. |
+| ~~User clips (the private eval + training corpus)~~ | BANNED | self-owned | ⛔ | Banned from all roles per 2026-06-11 cleanup: not as accuracy gate, dev set, label source, or historical benchmark source. The tracked tabs and stale result artifacts were removed; replace with GuitarSet / Kaggle UT-Austin / GAPS-style offline public corpora depending on the eval tier. |
 | Roboflow `b101/guitar-3` | 3 (training) | **CC BY 4.0** | ✅ | **Verified 2026-05-05.** Source: https://universe.roboflow.com/b101/guitar-3. Forked into Patrick's workspace as `patricks-workspace-vozcg/guitar-3-4efcd` v2; YOLOv8-OBB export downloaded (926 images, 710/144/72 split, classes: fret / neck / nut). License declared in the dataset's README.dataset.txt: "License: CC BY 4.0". Attribution: "guitar 3" by b101 on Roboflow Universe (https://universe.roboflow.com/b101/guitar-3), CC BY 4.0; export downloaded May 5, 2026 via the Roboflow SDK. **Required attribution must appear in the public README and any blog post.** |
 
 ## Library dependencies (default pipeline)

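Review note on the LICENSES.md table: the DadaGP row says synthetic-source clips are blocked from non-train manifest splits by the `SYNTHETIC_IN_EVAL_SPLIT` guard in `tabvision.eval.manifest.validate_manifest`. That implementation is not part of this diff, so the following is only a sketch of what such a guard could look like. `ManifestEntry`, its field names, and `find_synthetic_in_eval` are hypothetical; only the error code string comes from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    clip_id: str
    source: str      # e.g. "guitarset", "dadagp" (illustrative values)
    license: str     # e.g. "CC-BY-4.0"
    split: str       # "train", "validation", or "test" (assumed split names)
    synthetic: bool  # True for rendered / synthetic-source clips


def find_synthetic_in_eval(entries):
    """Return one error string per synthetic clip placed outside the train split."""
    return [
        f"SYNTHETIC_IN_EVAL_SPLIT: {e.clip_id} (source={e.source}, split={e.split})"
        for e in entries
        if e.synthetic and e.split != "train"
    ]
```

An empty return value means the manifest passes this one check; a caller could fail the eval run on any non-empty list.
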
diff --git a/SPEC.md b/SPEC.md
@@ -367,7 +367,7 @@ tabvision/
 ├── data/
 │   ├── README.md           # how to acquire (do not commit large files)
 │   ├── fixtures/           # tiny clips checked in for unit tests
-│   ├── eval/               # held-out user-recorded clips + annotations
+│   ├── eval/               # public/offline eval manifests + annotations
 │   └── augmented/          # generated; .gitignored
 ├── scripts/
 │   ├── acquire/            # one script per dataset/model in §6
@@ -579,12 +579,12 @@ flowchart TD
 5. `tabvision.render.ascii` producing readable ASCII tab.
 6. `tabvision` CLI: `tabvision transcribe input.mov -o output.tab`.
 7. End-to-end integration test on a fixture clip.
-8. First eval-harness run on user clips — record metrics, **even if bad**.
+8. First eval-harness run on public/offline clips - record metrics, **even if bad**.
 
 **Acceptance test:**
 - `tabvision transcribe data/fixtures/scale_clean.wav` produces non-empty ASCII tab.
 - Integration test passes.
-- Eval harness reports **any** numbers on at least 3 user clips. Metrics logged to `docs/EVAL_REPORTS/phase1_<date>.md`.
+- Eval harness reports **any** numbers on at least 3 public/offline clips. Metrics logged to `docs/EVAL_REPORTS/phase1_<date>.md`.
 
 **Acceptance does NOT require** good metrics. The point is the harness, not the score.
 
@@ -605,15 +605,15 @@ flowchart TD
 
 ### Phase 1.5 — Annotation tool & eval set
 
-**Goal:** Build the user-recorded eval set so subsequent phases have something to optimize against.
+**Goal:** Build a public/offline eval manifest so subsequent phases have something to optimize against without private home-video labels.
 
 **Deliverables:**
 1. `scripts/annotate/cli.py` — interactive annotator. Plays clip, shows candidate onsets, lets user correct pitch/string/fret per onset, writes `.jams`.
-2. 15 user-recorded clips with annotations under `$TABVISION_DATA_ROOT/eval/`.
-3. Manifest `data/eval/manifest.toml` listing clips + checksums + difficulty tier (easy/med/hard).
+2. Public/offline corpus adapters with annotations under `$TABVISION_DATA_ROOT/eval/`.
+3. Manifest `data/eval/manifest.toml` listing clips + checksums + source, license, split, and difficulty tier (easy/med/hard).
 
 **Acceptance test:**
-- 15 annotated clips exist.
+- Public/offline annotated clips exist for each enabled eval tier.
 - `pytest -m eval` runs end-to-end.
 - A baseline metrics report is generated and checked in.
 
@@ -777,7 +777,7 @@ flowchart TD
 5. Tunable mixing weight λ_v between audio and vision evidence (CLI: `--fusion-lambda-vision 1.0`).
 
 **Acceptance test:**
-- On user eval set, **Tab F1 ≥ 0.85** (target: 0.88 by Phase 9).
+- On the enabled public/offline eval set, **Tab F1 ≥ 0.85** (target: 0.88 by Phase 9).
 - Chord-instance accuracy ≥ 0.80 (target: 0.85 by Phase 9).
 - Ablation report: audio-only vs. audio+vision. The audio+vision configuration must beat audio-only by **≥ 8 points** on Tab F1.
 
@@ -817,8 +817,8 @@ flowchart TD
 
 **Acceptance test:**
 - All four formats round-trip on snapshot fixtures.
-- A user-recorded clip's output `.gp5` opens correctly in TuxGuitar (free) and Guitar Pro 7 if available.
-- A user-recorded clip's output `.mid` plays back at correct pitches and timing in any standard MIDI player.
+- A fixture or public/offline eval clip's output `.gp5` opens correctly in TuxGuitar (free) and Guitar Pro 7 if available.
+- A fixture or public/offline eval clip's output `.mid` plays back at correct pitches and timing in any standard MIDI player.
 
 **Decision tree:**
 
@@ -848,16 +848,16 @@ flowchart TD
    - **Distorted-electric oversampling.** Distorted variants are rendered with multiple amp + cab IR pairs and oversampled in fine-tune batches relative to their share of the source set, since the §1.4 distorted-electric tier target is the most likely to be the long-pole.
 
 2. **Video augmentation pipeline** (`scripts/augment/video.py`):
-   - Source: user-recorded clips (eval set + any additional training clips the user has recorded).
+   - Source: license-checked public/offline video corpora only; personal recordings are not an accuracy gate, dev set, or label source.
    - Augmentations: color jitter, brightness/contrast, hue shift, slight perspective warp (±5°), slight rotation, motion blur, simulated lens distortion, time-domain crops, frame dropout to simulate FPS variation.
    - Aligned annotations preserved (transformed where appropriate — e.g., perspective warp updates the homography ground truth so the fretboard regressor sees consistent labels).
 
 3. **Fine-tuning recipes** (`scripts/train/audio_finetune.py`, `scripts/train/hand_finetune.py`):
-   - Audio: fine-tune the chosen audio backend on the augmented DadaGP set + user clips. Distorted-electric variants oversampled. Run on Kaggle T4 (or Lightning Studios for longer jobs).
-   - Vision: fine-tune the fingertip-to-fret head on augmented user clips. The fretboard detector and guitar detector are normally good after Phase 3 and rarely need refining in Phase 7; if Phase 5 fusion exposes systematic vision errors, retrain those too.
+   - Audio: fine-tune the chosen audio backend on the augmented DadaGP set plus license-checked public/offline corpora. Distorted-electric variants oversampled. Run on Kaggle T4 (or Lightning Studios for longer jobs).
+   - Vision: fine-tune the fingertip-to-fret head on augmented public/offline video corpora. The fretboard detector and guitar detector are normally good after Phase 3 and rarely need refining in Phase 7; if Phase 5 fusion exposes systematic vision errors, retrain those too.
 
 4. **Self-labeling loop** (`scripts/train/self_label.py`):
-   - Run pipeline on a corpus of unlabeled user clips → pipeline emits TabEvents with confidences.
+   - Run pipeline on a corpus of unlabeled public/offline clips → pipeline emits TabEvents with confidences.
    - High-confidence outputs (per-note confidence > 0.9) are auto-accepted as pseudo-labels.
    - Low-confidence outputs are flagged for the user via the Phase 1.5 annotator; user corrects.
    - Both auto-labels and corrected-labels are added to the training set with appropriate weights (corrected: 1.0, auto: 0.5).
@@ -1109,7 +1109,7 @@ A single source of truth so Claude Code does not drift:
 |---|---|---|
 | 0 | `AUDIT.md` exists; CI green; `LICENSES.md` initial map | `make ci` |
 | 1 | `docs/EVAL_REPORTS/phase1_*.md` exists | `pytest -m integration && pytest -m eval -k phase1` |
-| 1.5 | 15 user clips in manifest, all 4 tiers represented | `tabvision-eval --manifest data/eval/manifest.toml --check` |
+| 1.5 | Public/offline clips in manifest, enabled tiers represented | `tabvision-eval --manifest data/eval/manifest.toml --check` |
 | 2 | Highres beats baseline ≥ 5 pts Pitch F1 | `pytest -m eval -k phase2 --compare baseline` |
 | 3 | Guitar IoU ≥ 0.95; preflight ≥ 9/10; homography median error < 5 px | `pytest -m guitar_eval -m preflight_eval -m fretboard_eval` |
 | 4 | Fingertip top-1 ≥ 0.75 on 100 labeled frames | `pytest -m hand_eval` |
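
Review note on the Phase 7 self-labeling loop in the SPEC.md hunk: the thresholds there (auto-accept above 0.9 per-note confidence, weight 1.0 for corrected labels, 0.5 for auto labels) can be sketched as below. This is an illustration under stated assumptions, not the contents of `scripts/train/self_label.py`. `PredictedNote`, `triage`, and `build_training_rows` are hypothetical names, and the strict `>` comparison follows the "> 0.9" wording.

```python
from dataclasses import dataclass

AUTO_ACCEPT_THRESHOLD = 0.9   # per-note confidence, from the spec text
AUTO_LABEL_WEIGHT = 0.5       # weight for auto-accepted pseudo-labels
CORRECTED_LABEL_WEIGHT = 1.0  # weight for user-corrected labels


@dataclass(frozen=True)
class PredictedNote:
    onset_s: float
    string: int
    fret: int
    confidence: float


def triage(notes):
    """Split pipeline output into auto-accepted notes and notes needing review."""
    auto, review = [], []
    for note in notes:
        if note.confidence > AUTO_ACCEPT_THRESHOLD:
            auto.append(note)
        else:
            review.append(note)
    return auto, review


def build_training_rows(auto_accepted, corrected):
    """Pair each label with its sample weight for the fine-tuning set."""
    rows = [(note, AUTO_LABEL_WEIGHT) for note in auto_accepted]
    rows += [(note, CORRECTED_LABEL_WEIGHT) for note in corrected]
    return rows
```

Notes in the review list would go through the Phase 1.5 annotator before being passed back in as `corrected`.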