165 changes: 165 additions & 0 deletions docs/plans/2026-06-03-v1.1-video-string-resolution-design.md
@@ -0,0 +1,165 @@
# v1.1 — Video string-resolution for single-line (design)

**Date:** 2026-06-03
**Status:** design / prep. No implementation code ships with this doc; it scopes the v1.1 milestone.
**Depends on:** v1 ACCEPTED (audio-only acoustic, 2026-06-03 — see
`docs/EVAL_REPORTS/v1_acceptance_2026-06-03.md`).

## 1. Goal

Lift the **clean-acoustic single-line** tier from its v1 audio-only ceiling
(**Tab F1 ≈ 0.52**, lower-95 0.457) toward the original SPEC §1.4 target
(**0.94**), and **chord-instance accuracy** from ≈ 0.48 toward **0.85**, by using
the **fretting-hand video** to resolve the *string* that audio cannot. Both
numbers were re-scoped from v1 gates to **v1.1 video targets** in SPEC §1.4.1
(2026-06-03) precisely because they are audio-information-limited.

Hard constraint: **video is additive and confidence-gated.** The audio-only
tiers must not regress when video is absent, occluded, or low-confidence.

## 2. Why this is the right lever (diagnosis)

The failure is **not** pitch and **not** tuning — it is *which string*:

- Error decomposition (`docs/EVAL_REPORTS/acoustic_single_line_2026-06-02.md`):
single-line loss is **322 `wrong_position_same_pitch`** vs **8 `pitch_off`** —
the pitch is correct, the string is wrong.
- The 2026-06-03 acceptance run confirmed the cap holds after fixes: single-line
Tab F1 0.523, and chord-instance accuracy 0.521 ≈ Tab F1 (chord inherits the
same string error).
- Onset F1 0.94 / pitch F1 0.93 are already at spec. The bottleneck is the
pitch → (string, fret) assignment, and **the same pitch is acoustically
near-identical across strings**, so audio cannot break the tie.

The fretting hand's position on the neck *directly observes* the string and
fret. This is the one signal that carries the missing information.
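
To make the ambiguity concrete, here is a minimal sketch of how one pitch maps to several fretboard positions. It is not the repo's `candidate_positions`; the helper name, the 20-fret limit, and the convention that string index 0 is the low E are assumptions for illustration. The open-string MIDI numbers are standard tuning.

```python
from typing import List, Sequence, Tuple

# Standard-tuning open-string MIDI pitches, low E to high E (E2 A2 D3 G3 B3 E4).
STANDARD_TUNING = (40, 45, 50, 55, 59, 64)


def candidate_positions_sketch(
    pitch: int,
    tuning: Sequence[int] = STANDARD_TUNING,
    max_fret: int = 20,  # assumed fret count, not taken from the repo config
) -> List[Tuple[int, int]]:
    """Return every (string_idx, fret) that sounds `pitch` (hypothetical helper)."""
    return [
        (string_idx, pitch - open_pitch)
        for string_idx, open_pitch in enumerate(tuning)
        if 0 <= pitch - open_pitch <= max_fret
    ]


# A3 (MIDI 57) is playable on four different strings.
a3_candidates = candidate_positions_sketch(57)
print(a3_candidates)  # [(0, 17), (1, 12), (2, 7), (3, 2)]
```

All four cells produce the same fundamental, which is why an audio-only model has little to separate them.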

## 3. What already exists (v1.1 wires + strengthens; it does not build)

The full video stack shipped earlier (Phase 2–5; `run_pipeline` already runs it
under `video_enabled=True`). Inventory of the relevant pieces:

| Concern | Module | Produces |
|---|---|---|
| Frames | `tabvision/demux` | per-frame iterator (ffmpeg + cv2) |
| Neck/guitar detect | `video/guitar/yolo_backend.py`, `guitar/tracker.py` | guitar + neck ROI |
| Fretboard geometry | `video/fretboard/{geometric,keypoint,tracker}.py` | per-frame `Homography` (canonical fret/string grid) |
| Hand landmarks | `video/hand/mediapipe_backend.py` | MediaPipe fingertip samples |
| **Per-finger (string,fret)** | `video/hand/fingertip_to_fret.py` | **`FrameFingering`** — a per-(string, fret) posterior |
| Coarse hand anchor | `video/hand/neck_anchor.py` | `NeckAnchor` (center/min/max fret + confidence) |
| Pitch → positions | `fusion/candidates.py` | `candidate_positions(pitch)` — every playable (string, fret) |
| Audio position prior | `fusion/position_prior.py` | learned (string, fret) reweighting |
| Neck-anchor prior | `fusion/neck_prior.py` | attaches `AudioEvent.fret_prior` from the anchor |
| Emit | `fusion/viterbi.py`, `fusion/playability.py` | `TabEvent` (string_idx, fret) with position continuity |

`run_pipeline` already demuxes frames, runs `_run_video_stack` → `fingerings`
(`FrameFingering`) + `neck_anchors`, and calls `apply_neck_anchor_priors`.

## 4. The gap (precise, and why single-line didn't move)

`fusion/neck_prior.anchor_position_prior` builds a Gaussian over **fret** and
then **tiles it across every string** (`neck_prior.py:69`:
`np.tile(fret_probs[None, :], (n_strings, 1))`). So the video signal that
currently reaches fusion constrains the **fret region** but is **string-agnostic**
— it says "the hand is around fret 5," not "on the D string." That is the wrong
axis: single-line errors are wrong-*string*, right-fret-region.

Meanwhile the **string-discriminative** signal already exists in `FrameFingering`
(fingertip → per-(string, fret) posterior) but is **not** consumed by the per-note
resolver — only the coarse, fret-only `NeckAnchor` is. **v1.1 closes exactly this
gap.**
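
The string-agnostic behaviour can be reproduced in a few lines. This sketch assumes a Gaussian width and grid size chosen for illustration; only the `np.tile` call mirrors the line quoted from `neck_prior.py`.

```python
import numpy as np


def tiled_fret_prior(
    center_fret: float,
    sigma: float = 2.0,   # assumed width, not the tuned value
    n_strings: int = 6,
    n_frets: int = 21,    # assumed grid: frets 0..20
) -> np.ndarray:
    """Build a fret-only Gaussian and copy it onto every string row."""
    frets = np.arange(n_frets)
    fret_probs = np.exp(-0.5 * ((frets - center_fret) / sigma) ** 2)
    fret_probs /= fret_probs.sum()
    # One fret row tiled across all strings, as in the quoted call.
    return np.tile(fret_probs[None, :], (n_strings, 1))


prior = tiled_fret_prior(center_fret=5.0)

# Every string row is identical, so the prior cannot prefer one string.
rows_identical = bool(np.allclose(prior, prior[0]))

# Marginalising over fret leaves a uniform distribution over strings.
string_marginal = prior.sum(axis=1) / prior.sum()
```

Whatever the anchor says about fret region, the marginal over strings stays flat, which matches the observation that single-line Tab F1 did not move.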

## 5. Method

A new confidence-gated fusion step that turns per-frame `FrameFingering` into a
per-note **string** prior, restricted to the note's pitch candidates:

1. **Temporal align.** For each audio note `(pitch, onset_s)`, collect the
`FrameFingering`s within ±Δt of the onset (Δt ≈ one fret-change window,
~0.1–0.15 s; reuse the `max_time_distance_s` pattern from `neck_prior`).
2. **Restrict to candidates.** `candidate_positions(pitch, cfg)` gives the few
playable `(string, fret)` for that pitch. Score each candidate by the video
posterior mass on that exact `(string, fret)` cell, pooled over the frames in
the window (multi-frame vote: sum or max).
3. **Confidence-gated fuse.** Combine the video string-prior with the audio
posterior multiplicatively, weighted by a per-note video confidence
(hand-detection + homography quality). High confidence → video decides the
string; low/occluded → weight → 0 and the audio prior stands unchanged
(**no-regression guarantee**).
4. **Emit unchanged.** Feed the resolved per-note `(string, fret)` posterior into
the existing `viterbi`/`playability` emission (position continuity via the
already-tuned `POSITION_SHIFT_COST`).
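
Steps 1 and 2 can be sketched as follows. The `FrameFingeringSketch` dataclass is a hypothetical stand-in: the real `FrameFingering` layout is not specified here, so the `(n_strings, n_frets)` posterior grid, the timestamp field, and the default window are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

import numpy as np


@dataclass
class FrameFingeringSketch:
    """Hypothetical stand-in for FrameFingering: one posterior grid per frame."""
    time_s: float
    posterior: np.ndarray  # assumed shape (n_strings, n_frets)


def video_candidate_scores(
    onset_s: float,
    candidates: Sequence[Tuple[int, int]],
    frames: Sequence[FrameFingeringSketch],
    max_time_distance_s: float = 0.12,  # assumed, within the 0.1-0.15 s range
    reduce: str = "sum",                # "sum" or "max" over the window
) -> np.ndarray:
    """Score each candidate (string, fret) by video posterior mass near the onset."""
    window = [f for f in frames if abs(f.time_s - onset_s) <= max_time_distance_s]
    if not window:
        return np.zeros(len(candidates))  # no video evidence for this note
    # Rows are frames in the window, columns are candidate cells.
    per_frame = np.array(
        [[f.posterior[s, fr] for (s, fr) in candidates] for f in window]
    )
    return per_frame.sum(axis=0) if reduce == "sum" else per_frame.max(axis=0)


def one_hot(string_idx: int, fret: int, n_strings: int = 6, n_frets: int = 21) -> np.ndarray:
    grid = np.zeros((n_strings, n_frets))
    grid[string_idx, fret] = 1.0
    return grid


frames = [
    FrameFingeringSketch(1.00, one_hot(2, 7)),
    FrameFingeringSketch(1.05, one_hot(2, 7)),
    FrameFingeringSketch(1.60, one_hot(3, 2)),  # outside the window, ignored
]
candidates = [(0, 17), (1, 12), (2, 7), (3, 2)]  # A3 in standard tuning
scores = video_candidate_scores(onset_s=1.02, candidates=candidates, frames=frames)
```

Only the two frames near the onset vote, and all of their mass lands on the D-string candidate `(2, 7)`.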

Net new code: a `fusion/video_string_prior.py` module (note ← FrameFingering,
candidate-restricted) plus wiring in `pipeline.run_pipeline` / the fusion entry,
alongside (not replacing) the fret-only neck anchor. Chords (§7, P2) extend step 2
to the multi-note cluster.
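
One possible shape for the confidence gate in step 3 is a geometric weighting, `audio * video ** weight`, with a hard cutoff below a minimum confidence. This is a sketch of one option, not a committed design: the function name, `min_conf`, and `eps` are assumptions, and the audio posterior is assumed to be already normalised over the note's candidates.

```python
import numpy as np


def fuse_string_prior(
    audio_post,
    video_scores,
    video_conf: float,
    min_conf: float = 0.3,  # assumed gate threshold
    eps: float = 1e-6,      # floor so a zero video cell cannot erase a candidate
) -> np.ndarray:
    """Fuse per-candidate audio and video evidence, gated by video confidence."""
    audio_post = np.asarray(audio_post, dtype=float)
    video_scores = np.asarray(video_scores, dtype=float)
    # Gate closed: low confidence or no video mass returns the audio posterior untouched.
    if video_conf < min_conf or video_scores.sum() <= 0.0:
        return audio_post.copy()
    video_prior = video_scores / video_scores.sum()
    weight = float(np.clip(video_conf, 0.0, 1.0))
    fused = audio_post * (video_prior + eps) ** weight
    return fused / fused.sum()


audio_post = np.array([0.4, 0.3, 0.2, 0.1])    # audio prefers candidate 0
video_scores = np.array([0.0, 0.0, 2.0, 0.0])  # video saw candidate 2

low = fuse_string_prior(audio_post, video_scores, video_conf=0.1)
high = fuse_string_prior(audio_post, video_scores, video_conf=0.9)
```

With the gate closed, `low` is bit-for-bit the audio posterior, which is the property the no-regression guarantee depends on. With the gate open, `high` moves the argmax to the candidate the video observed.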

## 6. The hard part — eval data (the real gate)

**GuitarSet, the v1 eval set, is audio-only.** It cannot validate video
string-resolution. v1.1 needs a corpus with (a) fretting-hand video and (b)
frame/note-accurate **string + fret** ground truth. This is the gating decision,
analogous to "no in-repo trainer" for v2-electric. Options, cheapest first:

1. **Synthetic video rendered from GuitarSet's own string/fret labels.**
GuitarSet's JAMS already carry per-note string + fret (hex-pickup). Render a
synthetic neck + fretting-hand animation from them → free, re-derivable,
license-clean, frame-perfect labels. **Validates the resolver's ceiling**
under clean video, decoupled from MediaPipe noise. (Does not test real-hand
robustness.) Reuses `scripts/viz/overlay_fretboard.py` conventions.
2. **A license-clean public guitar-video dataset with tab/string labels** (e.g.
IDMT-SMT-Guitar video subsets, or a tab-aligned performance corpus). This is
the **real acceptance gate** — must pass SPEC §1.5 portfolio licensing.
3. **A small self-recorded video dev set** — iteration aid only. SPEC §1.4.1
**bans personal clips as a gate**, so this never becomes the acceptance
number; keep the gate on (2).

**Recommendation:** (1) first to prove the method moves single-line on clean
video, then (2) as the gate. Escalate to the user if no §1.5-clean public
video+string corpus is found — that decision blocks the acceptance gate.
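
For option 1, the geometry needed to place a synthetic fingertip from a `(string, fret)` label is small. The sketch below uses the standard equal-temperament fret spacing; the canonical coordinate conventions, the function names, and the choice to put the fingertip midway between two fret wires are assumptions, not the conventions of `scripts/viz/overlay_fretboard.py`.

```python
from typing import Tuple


def fret_distance_from_nut(fret: int, scale_length: float = 1.0) -> float:
    """Distance of fret wire `fret` from the nut under 12-tone equal temperament."""
    return scale_length * (1.0 - 2.0 ** (-fret / 12.0))


def fingertip_canonical_xy(
    string_idx: int, fret: int, n_strings: int = 6
) -> Tuple[float, float]:
    """Place a synthetic fingertip for a fretted note (hypothetical helper).

    x runs along the neck (0 = nut, 1 = bridge); y runs across the strings
    (0 = string index 0, 1 = last string). Both conventions are assumed.
    """
    if fret < 1:
        raise ValueError("open strings have no fretting fingertip")
    # Midpoint between the previous fret wire and the target fret wire.
    x = 0.5 * (fret_distance_from_nut(fret - 1) + fret_distance_from_nut(fret))
    y = string_idx / (n_strings - 1)
    return x, y


twelfth = fret_distance_from_nut(12)  # the 12th fret sits at half the scale length
x, y = fingertip_canonical_xy(string_idx=2, fret=7)
```

A renderer would then map these canonical coordinates into image space through a chosen homography, so the rendered frame and its label are consistent by construction.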

## 7. Phased plan

- **P0 — data + harness.** Pick/build the eval set (§6). Add a
`clean_acoustic_single_line_video` (and strummed/chord) tier + parser to the
composite manifest/harness; the harness already reports per-tier Tab F1 +
chord + bootstrap CIs (shipped 2026-06-03, commit `292252d`).
- **P1 — resolver.** Implement §5 (per-note FrameFingering → candidate-restricted
string prior, confidence-gated). Eval audio-only vs +video on the new tier;
target single-line Tab F1 → 0.94.
- **P2 — robustness + chord.** Occlusion / dropped-frame handling, multi-frame
voting, and multi-finger chord resolution; re-check chord-instance ≥ 0.85.

## 8. Acceptance test

On the **video** tier(s), `lower_95_CI ≥ target` over clips (95% bootstrap):
single-line Tab F1 **≥ 0.94**, chord-instance accuracy **≥ 0.85**. In addition, the
audio-only acoustic tiers must **not regress** vs the v1 numbers (video is additive),
and latency must stay **≤ 5 min per 60 s clip** on a laptop CPU, including the video pass.
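
The gate check can be sketched as a percentile bootstrap over clips. This is not the harness shipped in commit `292252d`. It assumes the tier metric is the mean of per-clip scores and that `lower_95_CI` is the lower end of a two-sided 95% interval (the 2.5th percentile); the resample count and seed are arbitrary.

```python
import numpy as np


def bootstrap_lower_bound(
    per_clip_scores,
    n_boot: int = 10_000,  # assumed resample count
    alpha: float = 0.05,   # two-sided 95% interval
    seed: int = 0,
) -> float:
    """Lower end of a percentile-bootstrap CI on the mean per-clip score."""
    scores = np.asarray(per_clip_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample clips with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    return float(np.quantile(boot_means, alpha / 2.0))


def passes_gate(per_clip_scores, target: float) -> bool:
    """True when the lower CI bound meets the target, as in `lower_95_CI >= target`."""
    return bootstrap_lower_bound(per_clip_scores) >= target


clip_scores = [0.90, 0.95, 1.00, 0.85]
lower = bootstrap_lower_bound(clip_scores)
```

Gating on the lower bound rather than the point estimate means a small or high-variance clip set has to clear the target by a margin, which is the intent of the `lower_95_CI` wording.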

## 9. Decision tree

- No §1.5-clean public video+string dataset found → ship the
synthetic-from-GuitarSet validation, **flag the public-gate as blocked, escalate to user.**
- Resolver fails to lift single-line past ~0.7 **on clean synthetic** video →
the bug is the resolver/wiring (§4/§5), not the data; fix before real video.
- Lifts on synthetic but not on real video → **hand-detection robustness** is the
bottleneck (occlusion, fast runs); that is P2, not P1.
- Video regresses audio-only tiers → the confidence gate (§5.3) is mis-tuned;
it must collapse to weight 0, recovering the audio path exactly.

## 10. Free-tools / licensing (SPEC §1.5)

All compute is free + CPU: MediaPipe (Apache-2.0) and the existing video stack;
no new paid dependency, no GPU. The **only** §1.5 risk is the eval corpus — the
shipping acceptance gate must use a portfolio-clean public video+string dataset
(§6.2). Synthetic-from-GuitarSet (§6.1) is re-derivable from a public source and
clean by construction.

## 11. Non-goals

- **Electric** (clean/distorted) — that is v2, behind the tone toggle
(`docs/plans/2026-06-02-electric-backbone-finetune-design.md`).
- Real-time / streaming.
- Expressive markings (bends, hammer-ons, slides) — separate ≥ 0.70 stretch.