pgil256 · Jun 8, 2026
diff --git a/docs/plans/2026-06-03-v1.1-video-string-resolution-design.md b/docs/plans/2026-06-03-v1.1-video-string-resolution-design.md
@@ -0,0 +1,165 @@
+# v1.1 — Video string-resolution for single-line (design)
+
+**Date:** 2026-06-03
+**Status:** design / prep. No code in this doc; it scopes the v1.1 milestone.
+**Depends on:** v1 ACCEPTED (audio-only acoustic, 2026-06-03 — see
+`docs/EVAL_REPORTS/v1_acceptance_2026-06-03.md`).
+
+## 1. Goal
+
+Lift the **clean-acoustic single-line** tier from its v1 audio-only ceiling
+(**Tab F1 ≈ 0.52**, lower-95 0.457) toward the original SPEC §1.4 target
+(**0.94**), and **chord-instance accuracy** from ≈ 0.48 toward **0.85**, by using
+the **fretting-hand video** to resolve the *string* that audio cannot. Both
+numbers were re-scoped from v1 gates to **v1.1 video targets** in SPEC §1.4.1
+(2026-06-03) precisely because they are audio-information-limited.
+
+Hard constraint: **video is additive and confidence-gated.** The audio-only
+tiers must not regress when video is absent, occluded, or low-confidence.
+
+## 2. Why this is the right lever (diagnosis)
+
+The failure is **not** pitch and **not** tuning — it is *which string*:
+
+- Error decomposition (`docs/EVAL_REPORTS/acoustic_single_line_2026-06-02.md`):
+  single-line loss is **322 `wrong_position_same_pitch`** vs **8 `pitch_off`** —
+  the pitch is correct, the string is wrong.
+- The 2026-06-03 acceptance run confirmed the cap holds after fixes: single-line
+  Tab F1 0.523, and chord-instance accuracy 0.521 ≈ Tab F1 (chord inherits the
+  same string error).
+- Onset F1 0.94 / pitch F1 0.93 are already at spec. The bottleneck is the
+  pitch → (string, fret) assignment, and **the same pitch is acoustically
+  near-identical across strings**, so audio cannot break the tie.
+
+The fretting hand's position on the neck *directly observes* the string and
+fret. This is the one signal that carries the missing information.
+
+## 3. What already exists (v1.1 wires + strengthens; it does not build)
+
+The full video stack shipped earlier (Phase 2–5; `run_pipeline` already runs it
+under `video_enabled=True`). Inventory of the relevant pieces:
+
+| Concern | Module | Produces |
+|---|---|---|
+| Frames | `tabvision/demux` | per-frame iterator (ffmpeg + cv2) |
+| Neck/guitar detect | `video/guitar/yolo_backend.py`, `guitar/tracker.py` | guitar + neck ROI |
+| Fretboard geometry | `video/fretboard/{geometric,keypoint,tracker}.py` | per-frame `Homography` (canonical fret/string grid) |
+| Hand landmarks | `video/hand/mediapipe_backend.py` | MediaPipe fingertip samples |
+| **Per-finger (string,fret)** | `video/hand/fingertip_to_fret.py` | **`FrameFingering`** — a per-(string, fret) posterior |
+| Coarse hand anchor | `video/hand/neck_anchor.py` | `NeckAnchor` (center/min/max fret + confidence) |
+| Pitch → positions | `fusion/candidates.py` | `candidate_positions(pitch)` — every playable (string, fret) |
+| Audio position prior | `fusion/position_prior.py` | learned (string, fret) reweighting |
+| Neck-anchor prior | `fusion/neck_prior.py` | attaches `AudioEvent.fret_prior` from the anchor |
+| Emit | `fusion/viterbi.py`, `fusion/playability.py` | `TabEvent` (string_idx, fret) with position continuity |
+
+`run_pipeline` already demuxes frames, runs `_run_video_stack` → `fingerings`
+(`FrameFingering`) + `neck_anchors`, and calls `apply_neck_anchor_priors`.
+
+## 4. The gap (precise, and why single-line didn't move)
+
+`fusion/neck_prior.anchor_position_prior` builds a Gaussian over **fret** and
+then **tiles it across every string** (`neck_prior.py:69`:
+`np.tile(fret_probs[None, :], (n_strings, 1))`). So the video signal that
+currently reaches fusion constrains the **fret region** but is **string-agnostic**
+— it says "the hand is around fret 5," not "on the D string." That is the wrong
+axis: single-line errors are wrong-*string*, right-fret-region.
+
+Meanwhile the **string-discriminative** signal already exists in `FrameFingering`
+(fingertip → per-(string, fret) posterior) but is **not** consumed by the per-note
+resolver — only the coarse, fret-only `NeckAnchor` is. **v1.1 closes exactly this
+gap.**
+
+## 5. Method
+
+A new confidence-gated fusion step that turns per-frame `FrameFingering` into a
+per-note **string** prior, restricted to the note's pitch candidates:
+
+1. **Temporal align.** For each audio note `(pitch, onset_s)`, collect the
+   `FrameFingering`s within ±Δt of the onset (Δt ≈ one fret-change window,
+   ~0.1–0.15 s; reuse the `max_time_distance_s` pattern from `neck_prior`).
+2. **Restrict to candidates.** `candidate_positions(pitch, cfg)` gives the few
+   playable `(string, fret)` for that pitch. Score each candidate by the video
+   posterior mass on that exact `(string, fret)` cell (multi-frame vote: sum or
+   max over the window).
+3. **Confidence-gated fuse.** Combine the video string-prior with the audio
+   posterior multiplicatively, weighted by a per-note video confidence
+   (hand-detection + homography quality). High confidence: video decides the
+   string. Low confidence or occlusion: the weight goes to 0 and the audio
+   prior stands unchanged (**no-regression guarantee**).
+4. **Emit unchanged.** Feed the resolved per-note `(string, fret)` posterior into
+   the existing `viterbi`/`playability` emission (position continuity via the
+   already-tuned `POSITION_SHIFT_COST`).
+
+Net new code: a `fusion/video_string_prior.py` (note ← FrameFingering, candidate-
+restricted) + wiring in `pipeline.run_pipeline` / the fusion entry, alongside (not
+replacing) the fret-only neck anchor. Chords (§7) extend step 2 to the multi-note
+cluster.
+
+## 6. The hard part — eval data (the real gate)
+
+**GuitarSet, the v1 eval set, is audio-only.** It cannot validate video
+string-resolution. v1.1 needs a corpus with (a) fretting-hand video and (b)
+frame/note-accurate **string + fret** ground truth. This is the gating decision,
+analogous to "no in-repo trainer" for v2-electric. Options, cheapest first:
+
+1. **Synthetic video rendered from GuitarSet's own string/fret labels.**
+   GuitarSet's JAMS already carry per-note string + fret (hex-pickup). Render a
+   synthetic neck + fretting-hand animation from them → free, re-derivable,
+   license-clean, frame-perfect labels. **Validates the resolver's ceiling**
+   under clean video, decoupled from MediaPipe noise. (Does not test real-hand
+   robustness.) Reuses `scripts/viz/overlay_fretboard.py` conventions.
+2. **A license-clean public guitar-video dataset with tab/string labels** (e.g.
+   IDMT-SMT-Guitar video subsets, or a tab-aligned performance corpus). This is
+   the **real acceptance gate** — must pass SPEC §1.5 portfolio licensing.
+3. **A small self-recorded video dev set** — iteration aid only. SPEC §1.4.1
+   **bans personal clips as a gate**, so this never becomes the acceptance
+   number; keep the gate on (2).
+
+**Recommendation:** (1) first to prove the method moves single-line on clean
+video, then (2) as the gate. Escalate to the user if no §1.5-clean public
+video+string corpus is found — that decision blocks the acceptance gate.
+
+## 7. Phased plan
+
+- **P0 — data + harness.** Pick/build the eval set (§6). Add a
+  `clean_acoustic_single_line_video` (and strummed/chord) tier + parser to the
+  composite manifest/harness; the harness already reports per-tier Tab F1 +
+  chord + bootstrap CIs (shipped 2026-06-03, commit `292252d`).
+- **P1 — resolver.** Implement §5 (per-note FrameFingering → candidate-restricted
+  string prior, confidence-gated). Eval audio-only vs +video on the new tier;
+  target single-line Tab F1 → 0.94.
+- **P2 — robustness + chord.** Occlusion / dropped-frame handling, multi-frame
+  voting, and multi-finger chord resolution; re-check chord-instance ≥ 0.85.
+
+## 8. Acceptance test
+
+On the **video** tier(s), the lower 95% bootstrap CI bound over clips must meet
+the target: single-line Tab F1 **≥ 0.94**, chord-instance accuracy **≥ 0.85**.
+The audio-only acoustic tiers must also **not regress** vs the v1 numbers (video
+is additive). Latency **≤ 5 min per 60 s clip**, video pass included, on laptop CPU.
+
+## 9. Decision tree
+
+- No §1.5-clean public video+string dataset found → ship the synthetic-from-
+  GuitarSet validation, **flag the public-gate as blocked, escalate to user.**
+- Resolver fails to lift single-line past ~0.7 **on clean synthetic** video →
+  the bug is the resolver/wiring (§4/§5), not the data; fix before real video.
+- Lifts on synthetic but not on real video → **hand-detection robustness** is the
+  bottleneck (occlusion, fast runs); that is P2, not P1.
+- Video regresses audio-only tiers → the confidence gate (§5.3) is mis-tuned;
+  it must collapse to weight 0, recovering the audio path exactly.
+
+## 10. Free-tools / licensing (SPEC §1.5)
+
+All compute is free + CPU: MediaPipe (Apache-2.0) and the existing video stack;
+no new paid dependency, no GPU. The **only** §1.5 risk is the eval corpus — the
+shipping acceptance gate must use a portfolio-clean public video+string dataset
+(§6.2). Synthetic-from-GuitarSet (§6.1) is re-derivable from a public source and
+clean by construction.
+
+## 11. Non-goals
+
+- **Electric** (clean/distorted) — that is v2, behind the tone toggle
+  (`docs/plans/2026-06-02-electric-backbone-finetune-design.md`).
+- Real-time / streaming.
+- Expressive markings (bends, hammer-ons, slides) — separate ≥ 0.70 stretch.
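
Review note (outside the doc itself): a minimal Python sketch of how §5 steps 1 to 3 could fit together, under stated assumptions. `FrameObs`, `candidates_for`, `video_string_prior`, and `fuse` are hypothetical names, not the repo's API. `FrameObs` only stands in for `FrameFingering`, whose real fields are not shown in this doc. The sketch assumes standard tuning, string index 0 = low E, a 20-fret neck, and a ±0.12 s window. Raising the video prior to the power of the confidence is one possible way to get the weight-0 behaviour; the doc does not fix the exact form.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (string_idx, fret); string 0 = low E (assumption)

OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)  # standard tuning, E2..E4
MAX_FRET = 20  # assumed neck length


@dataclass
class FrameObs:
    """Hypothetical stand-in for one per-frame FrameFingering sample."""
    t_s: float                        # frame timestamp in seconds
    posterior: Dict[Position, float]  # video mass per (string, fret) cell
    confidence: float                 # 0..1, e.g. hand + homography quality


def candidates_for(pitch: int) -> List[Position]:
    """Every playable (string, fret) for a MIDI pitch (stand-in for candidate_positions)."""
    return [(s, pitch - o) for s, o in enumerate(OPEN_STRING_MIDI)
            if 0 <= pitch - o <= MAX_FRET]


def video_string_prior(onset_s: float, candidates: List[Position],
                       frames: List[FrameObs], window_s: float = 0.12):
    """Steps 1 and 2: sum video mass on each candidate cell within the onset window."""
    uniform = {c: 1.0 / len(candidates) for c in candidates}
    near = [f for f in frames if abs(f.t_s - onset_s) <= window_s]
    if not near:
        return uniform, 0.0  # no video evidence: confidence 0
    votes = {c: sum(f.posterior.get(c, 0.0) for f in near) for c in candidates}
    total = sum(votes.values())
    if total <= 0.0:
        return uniform, 0.0  # video saw nothing on any candidate cell
    prior = {c: v / total for c, v in votes.items()}
    confidence = sum(f.confidence for f in near) / len(near)
    return prior, confidence


def fuse(audio_post: Dict[Position, float], video_prior: Dict[Position, float],
         confidence: float, eps: float = 1e-6) -> Dict[Position, float]:
    """Step 3: multiplicative fuse; confidence 0 leaves the audio posterior as is."""
    w = min(max(confidence, 0.0), 1.0)
    scores = {c: p * (video_prior.get(c, 0.0) + eps) ** w
              for c, p in audio_post.items()}
    total = sum(scores.values())
    if total <= 0.0:
        return dict(audio_post)
    return {c: s / total for c, s in scores.items()}
```

With confidence 0 the exponent is 0, every video factor becomes 1, and `fuse` returns the normalized audio posterior, which is the no-regression property §5 asks for. Swapping the sum in `video_string_prior` for a max would give the other voting variant mentioned in step 2.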