diff --git a/docs/plans/2026-06-03-v1.1-video-string-resolution-design.md b/docs/plans/2026-06-03-v1.1-video-string-resolution-design.md
new file mode 100644
index 0000000..ba73252
--- /dev/null
+++ b/docs/plans/2026-06-03-v1.1-video-string-resolution-design.md
@@ -0,0 +1,165 @@
+# v1.1 — Video string-resolution for single-line (design)
+
+**Date:** 2026-06-03
+**Status:** design / prep. No code in this doc; it scopes the v1.1 milestone.
+**Depends on:** v1 ACCEPTED (audio-only acoustic, 2026-06-03 — see
+`docs/EVAL_REPORTS/v1_acceptance_2026-06-03.md`).
+
+## 1. Goal
+
+Lift the **clean-acoustic single-line** tier from its v1 audio-only ceiling
+(**Tab F1 ≈ 0.52**, lower-95 0.457) toward the original SPEC §1.4 target
+(**0.94**), and **chord-instance accuracy** from ≈ 0.48 toward **0.85**, by using
+the **fretting-hand video** to resolve the *string* that audio cannot. Both
+numbers were re-scoped from v1 gates to **v1.1 video targets** in SPEC §1.4.1
+(2026-06-03) precisely because they are audio-information-limited.
+
+Hard constraint: **video is additive and confidence-gated.** The audio-only
+tiers must not regress when video is absent, occluded, or low-confidence.
+
+## 2. Why this is the right lever (diagnosis)
+
+The failure is **not** pitch and **not** tuning — it is *which string*:
+
+- Error decomposition (`docs/EVAL_REPORTS/acoustic_single_line_2026-06-02.md`):
+  single-line loss is **322 `wrong_position_same_pitch`** vs **8 `pitch_off`** —
+  the pitch is correct, the string is wrong.
+- The 2026-06-03 acceptance run confirmed the cap holds after fixes: single-line
+  Tab F1 0.523, and chord-instance accuracy 0.521 ≈ Tab F1 (chord inherits the
+  same string error).
+- Onset F1 0.94 / pitch F1 0.93 are already at spec. The bottleneck is the
+  pitch → (string, fret) assignment, and **the same pitch is acoustically
+  near-identical across strings**, so audio cannot break the tie.
+
+The fretting hand's position on the neck *directly observes* the string and
+fret. This is the one signal that carries the missing information.
+
+## 3. What already exists (v1.1 wires + strengthens; it does not build)
+
+The full video stack shipped earlier (Phase 2–5; `run_pipeline` already runs it
+under `video_enabled=True`). Inventory of the relevant pieces:
+
+| Concern | Module | Produces |
+|---|---|---|
+| Frames | `tabvision/demux` | per-frame iterator (ffmpeg + cv2) |
+| Neck/guitar detect | `video/guitar/yolo_backend.py`, `guitar/tracker.py` | guitar + neck ROI |
+| Fretboard geometry | `video/fretboard/{geometric,keypoint,tracker}.py` | per-frame `Homography` (canonical fret/string grid) |
+| Hand landmarks | `video/hand/mediapipe_backend.py` | MediaPipe fingertip samples |
+| **Per-finger (string,fret)** | `video/hand/fingertip_to_fret.py` | **`FrameFingering`** — a per-(string, fret) posterior |
+| Coarse hand anchor | `video/hand/neck_anchor.py` | `NeckAnchor` (center/min/max fret + confidence) |
+| Pitch → positions | `fusion/candidates.py` | `candidate_positions(pitch)` — every playable (string, fret) |
+| Audio position prior | `fusion/position_prior.py` | learned (string, fret) reweighting |
+| Neck-anchor prior | `fusion/neck_prior.py` | attaches `AudioEvent.fret_prior` from the anchor |
+| Emit | `fusion/viterbi.py`, `fusion/playability.py` | `TabEvent` (string_idx, fret) with position continuity |
+
+`run_pipeline` already demuxes frames, runs `_run_video_stack` → `fingerings`
+(`FrameFingering`) + `neck_anchors`, and calls `apply_neck_anchor_priors`.
+
+## 4. The gap (precise, and why single-line didn't move)
+
+`fusion/neck_prior.anchor_position_prior` builds a Gaussian over **fret** and
+then **tiles it across every string** (`neck_prior.py:69`:
+`np.tile(fret_probs[None, :], (n_strings, 1))`). So the video signal that
+currently reaches fusion constrains the **fret region** but is **string-agnostic**
+— it says "the hand is around fret 5," not "on the D string." That is the wrong
+axis: single-line errors are wrong-*string*, right-fret-region.
+
+Meanwhile the **string-discriminative** signal already exists in `FrameFingering`
+(fingertip → per-(string, fret) posterior) but is **not** consumed by the per-note
+resolver — only the coarse, fret-only `NeckAnchor` is. **v1.1 closes exactly this
+gap.**
+
+## 5. Method
+
+A new confidence-gated fusion step that turns per-frame `FrameFingering` into a
+per-note **string** prior, restricted to the note's pitch candidates:
+
+1. **Temporal align.** For each audio note `(pitch, onset_s)`, collect the
+   `FrameFingering`s within ±Δt of the onset (Δt ≈ one fret-change window,
+   ~0.1–0.15 s; reuse the `max_time_distance_s` pattern from `neck_prior`).
+2. **Restrict to candidates.** `candidate_positions(pitch, cfg)` gives the few
+   playable `(string, fret)` for that pitch. Score each candidate by the video
+   posterior mass on that exact `(string, fret)` cell (multi-frame vote: sum /
+   max over the window).
+3. **Confidence-gated fuse.** Combine the video string-prior with the audio
+   posterior multiplicatively, weighted by a per-note video confidence
+   (hand-detection + homography quality). High confidence → video decides the
+   string; low/occluded → weight → 0 and the audio prior stands unchanged
+   (**no-regression guarantee**).
+4. **Emit unchanged.** Feed the resolved per-note `(string, fret)` posterior into
+   the existing `viterbi`/`playability` emission (position continuity via the
+   already-tuned `POSITION_SHIFT_COST`).
+
+Net new code: a `fusion/video_string_prior.py` (note ← FrameFingering, candidate-
+restricted) + wiring in `pipeline.run_pipeline` / the fusion entry, alongside (not
+replacing) the fret-only neck anchor. Chords (§7) extend step 2 to the multi-note
+cluster.
+
+## 6. The hard part — eval data (the real gate)
+
+**GuitarSet, the v1 eval set, is audio-only.** It cannot validate video
+string-resolution. v1.1 needs a corpus with (a) fretting-hand video and (b)
+frame/note-accurate **string + fret** ground truth. This is the gating decision,
+analogous to "no in-repo trainer" for v2-electric. Options, cheapest first:
+
+1. **Synthetic video rendered from GuitarSet's own string/fret labels.**
+   GuitarSet's JAMS already carry per-note string + fret (hex-pickup). Render a
+   synthetic neck + fretting-hand animation from them → free, re-derivable,
+   license-clean, frame-perfect labels. **Validates the resolver's ceiling**
+   under clean video, decoupled from MediaPipe noise. (Does not test real-hand
+   robustness.) Reuses `scripts/viz/overlay_fretboard.py` conventions.
+2. **A license-clean public guitar-video dataset with tab/string labels** (e.g.
+   IDMT-SMT-Guitar video subsets, or a tab-aligned performance corpus). This is
+   the **real acceptance gate** — must pass SPEC §1.5 portfolio licensing.
+3. **A small self-recorded video dev set** — iteration aid only. SPEC §1.4.1
+   **bans personal clips as a gate**, so this never becomes the acceptance
+   number; keep the gate on (2).
+
+**Recommendation:** (1) first to prove the method moves single-line on clean
+video, then (2) as the gate. Escalate to the user if no §1.5-clean public
+video+string corpus is found — that decision blocks the acceptance gate.
+
+## 7. Phased plan
+
+- **P0 — data + harness.** Pick/build the eval set (§6). Add a
+  `clean_acoustic_single_line_video` (and strummed/chord) tier + parser to the
+  composite manifest/harness; the harness already reports per-tier Tab F1 +
+  chord + bootstrap CIs (shipped 2026-06-03, commit `292252d`).
+- **P1 — resolver.** Implement §5 (per-note FrameFingering → candidate-restricted
+  string prior, confidence-gated). Eval audio-only vs +video on the new tier;
+  target single-line Tab F1 → 0.94.
+- **P2 — robustness + chord.** Occlusion / dropped-frame handling, multi-frame
+  voting, and multi-finger chord resolution; re-check chord-instance ≥ 0.85.
+
+## 8. Acceptance test
+
+On the **video** tier(s), `lower_95_CI ≥ target` over clips (95% bootstrap):
+single-line Tab F1 **≥ 0.94**, chord-instance accuracy **≥ 0.85**. AND the
+audio-only acoustic tiers **do not regress** vs the v1 numbers (video additive).
+Latency **≤ 5 min / 60 s clip** including the video pass on laptop CPU.
+
+## 9. Decision tree
+
+- No §1.5-clean public video+string dataset found → ship the synthetic-from-
+  GuitarSet validation, **flag the public-gate as blocked, escalate to user.**
+- Resolver fails to lift single-line past ~0.7 **on clean synthetic** video →
+  the bug is the resolver/wiring (§4/§5), not the data; fix before real video.
+- Lifts on synthetic but not on real video → **hand-detection robustness** is the
+  bottleneck (occlusion, fast runs); that is P2, not P1.
+- Video regresses audio-only tiers → the confidence gate (§5.3) is mis-tuned;
+  it must collapse to weight 0, recovering the audio path exactly.
+
+## 10. Free-tools / licensing (SPEC §1.5)
+
+All compute is free + CPU: MediaPipe (Apache-2.0) and the existing video stack;
+no new paid dependency, no GPU. The **only** §1.5 risk is the eval corpus — the
+shipping acceptance gate must use a portfolio-clean public video+string dataset
+(§6.2). Synthetic-from-GuitarSet (§6.1) is re-derivable from a public source and
+clean by construction.
+
+## 11. Non-goals
+
+- **Electric** (clean/distorted) — that is v2, behind the tone toggle
+  (`docs/plans/2026-06-02-electric-backbone-finetune-design.md`).
+- Real-time / streaming.
+- Expressive markings (bends, hammer-ons, slides) — separate ≥ 0.70 stretch.
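
The first three steps of §5 (temporal align, candidate restriction, confidence-gated fuse) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the contents of the planned `fusion/video_string_prior.py`. The helper names (`toy_candidate_positions`, `window_vote`, `fuse`), the open-string MIDI pitches for standard tuning, the 20-fret limit, the 0.125 s window, and the `audio * video ** confidence` weighting are all hypothetical choices made for the example. The real `candidate_positions(pitch, cfg)` and `FrameFingering` types may look quite different.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (string_idx, fret); string 0 = low E (assumed convention)
Frame = Tuple[float, Dict[Position, float]]  # (timestamp_s, per-cell video posterior)

# Assumption: standard tuning as MIDI pitches E2 A2 D3 G3 B3 E4, 20 usable frets.
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)
MAX_FRET = 20


def toy_candidate_positions(pitch: int) -> List[Position]:
    """Every (string, fret) that sounds `pitch`. Hypothetical stand-in helper."""
    return [
        (string_idx, pitch - open_pitch)
        for string_idx, open_pitch in enumerate(OPEN_STRING_MIDI)
        if 0 <= pitch - open_pitch <= MAX_FRET
    ]


def window_vote(
    frames: List[Frame],
    onset_s: float,
    candidates: List[Position],
    max_dt_s: float = 0.125,
) -> Dict[Position, float]:
    """Sum video posterior mass per candidate over frames within +/- max_dt_s."""
    mass = {pos: 0.0 for pos in candidates}
    for timestamp_s, posterior in frames:
        if abs(timestamp_s - onset_s) <= max_dt_s:
            for pos in candidates:
                mass[pos] += posterior.get(pos, 0.0)
    return mass


def fuse(
    audio_prior: Dict[Position, float],
    video_mass: Dict[Position, float],
    confidence: float,
    eps: float = 1e-6,
) -> Dict[Position, float]:
    """Multiply the audio prior by video mass raised to the confidence weight."""
    weight = min(max(confidence, 0.0), 1.0)
    if weight == 0.0:
        # Gate closed: hand the audio prior back untouched (no-regression path).
        return dict(audio_prior)
    scores = {
        pos: p * (video_mass.get(pos, 0.0) + eps) ** weight
        for pos, p in audio_prior.items()
    }
    total = sum(scores.values())
    if total == 0.0:
        return dict(audio_prior)
    return {pos: score / total for pos, score in scores.items()}


# Usage: A3 (MIDI 57) has four playable positions; audio alone cannot rank them.
candidates = toy_candidate_positions(57)
audio_prior = {pos: 1.0 / len(candidates) for pos in candidates}
frames = [
    (0.95, {(2, 7): 0.6, (3, 2): 0.2}),
    (1.05, {(2, 7): 0.7}),
    (2.00, {(3, 2): 0.9}),  # outside the window around onset 1.0 s, ignored
]
video_mass = window_vote(frames, onset_s=1.0, candidates=candidates)
fused = fuse(audio_prior, video_mass, confidence=0.9)
best = max(fused, key=fused.get)
```

With `confidence=0.0` the sketch returns the audio prior unchanged, which is the behaviour §5 step 3 and §9 require of the gate: a low-confidence or occluded clip must fall back to the audio-only path exactly.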