Skip to content

IxVM kernel: unblock UTF-8 decode/encode proof + Nat-layer FFT cuts#450

Open
arthurpaulino wants to merge 7 commits into
mainfrom
ap/utf8-tier-1d
Open

IxVM kernel: unblock UTF-8 decode/encode proof + Nat-layer FFT cuts#450
arthurpaulino wants to merge 7 commits into
mainfrom
ap/utf8-tier-1d

Conversation

@arthurpaulino

Copy link
Copy Markdown
Member

Headline

Aiur kernel previously OOM'd on
_private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?.assemble₄_eq_some_of_toBitVec._proof_1_8.
This constant is a prerequisite of ByteArray.utf8DecodeChar?_utf8EncodeChar_append
(part of the UTF-8 round-trip lemma chain).

Now it typechecks at 37.31B FFT (down from the OOM baseline). The
two-direction lemma …_utf8EncodeChar_append itself remains
out-of-reach (still OOMs further along the same dependency tree under
the user's memory cap), but the prerequisite that was the immediate
blocker is unstuck.

The same kernel changes also deliver large FFT cuts on previously-
expensive targets — Tier 1d's structural short-circuit replaces full
delta-whnf cascades wherever it fires:

Constant Before After Δ
Vector.append 4.02B 3.16B −21.4%
Array.append_assoc 3.94B 3.08B −21.8%

Commits

1. a6cd34a — Tier 1d def-eq short-circuit (the unblock)

Aiur's k_is_def_eq_core jumped from Tier 1.5 straight to full delta
WHNF (Tier 2). The cascading Nat.rec / Nat.succ iota expansions then
drove whnf_const_head past 1M unique entries before either side
reached a comparable canonical form — OOM. Rust's def-eq settles the
same pair via no-delta whnf + quick structural recursion before any
of that fires.

Three minimum-necessary pieces ported. No KStore. No FVar. No Subst /
KernelTypes change.

  • whnf_nd family (Whnf.lean, mirror Rust
    whnf_no_delta_for_def_eq). Same dispatch tree as whnf, but
    whnf_nd_const_head's Defn arm falls through to a stuck
    apply_spine instead of delta-unfolding. Iota / proj / quot /
    primitives still fire.

  • k_infer_only family (Infer.lean, mirror Rust
    with_infer_only). App drops k_check(a, dom); Lam drops
    k_ensure_sort(ty); Let drops val/ty validation. Used at
    try_proof_irrel, is_prop_type, try_unit_like — the def-eq
    tactics that only need the synthesized type.

  • k_is_def_eq_struct_safe + Tier 1d wiring (DefEq.lean,
    mirror Rust quick_def_eq + post-try_def_eq_app). Sort-Sort via
    level_equal; Lam-Lam / All-All via recursive k_is_def_eq on the
    type and on the body under Cons(ty_a, types) (types-cons, NOT
    FVar opening). Inserted between Tier 1c (string lit) and Tier 2
    (full whnf):

    aw_nd = whnf_nd(a); bw_nd = whnf_nd(b)
    ptr_eq(aw_nd, bw_nd) → 1
    k_is_def_eq_struct_safe(aw_nd, bw_nd) → 1
    try_lazy_delta_app(aw_nd, bw_nd) → 1     (rerun: spine args may have
      reduced past what Tier 1.5's pre-whnf attempt could see)
    

Each piece independently validated necessary. The previously-tried
KStore explicit caches, FVar variant + opens, FVar-based binder
opening — all confirmed NOT necessary for this unblock and left out.

3 files changed, +296/−6 lines.

2. 36d6c3a — drop g_or from u64_sub_with_borrow

u64_sub_with_borrow combined two per-byte borrow bits with g_or.
The two bits are mutually exclusive: u_t = 1 ⇒ intermediate t_i ≥ 1 ⇒ subtracting br_in ∈ {0,1} cannot underflow ⇒ u_r = 0. Field
+ substitutes for g_or directly. Per Aiur cost model g_or adds
+1 aux + 1 lookup per call (≈ 5 width); field + is free. 7 g_ors ×
2.23M rows.

UTF-8 _proof_1_8: 39.12B → 38.14B (−2.6%).

3. 9e787da — drop g_or from klimbs_add_carry / klimbs_sub_borrow

Same mutually-exclusive-carry pattern. Two limb-level borrows /
carries from sequential u64 ops cannot both be 1.

UTF-8 _proof_1_8: 38.14B → 38.07B (−0.18%).

4. 4e379e7 — hot/cold split try_nat_dispatch, extract binop arm

try_nat_dispatch's width 90 was floored by the binop arm (2× whnf

  • 2× try_extract_nat + try_nat_binop_addr + apply_spine), charged on
    every Nat.succ / Nat.pred row. Factor binop dispatch into
    try_nat_binop_dispatch. Main narrows to the max of succ / pred
    arms.

UTF-8 _proof_1_8: 38.07B → 37.80B (−0.7%).

5. 80ce3d2 — hot/cold split expr_lbr, extract Let arm

expr_lbr's width was floored by the Let arm (3 recursive expr_lbr
calls + 2 lbr_max + 1 lbr_dec), charged on every row even though Let
is rare. Factor into expr_lbr_let.

Nat.add_comm: 55.63M → 55.50M (−0.2%). UTF-8 _proof_1_8: 37.80B
→ 37.62B (−0.5%)
.

6. 1f8effd — hot/cold split try_extract_nat, extract App arm

try_extract_nat's width was floored by the App arm (list_lookup +
address_eq + recursive try_extract_nat + klimbs_succ). Factor into
try_extract_nat_app. Main narrows to leaf-arm width.

UTF-8 _proof_1_8: 37.62B → 37.31B (−0.8%).

7. 039e9cf — re-pin IxVM FFT costs

41 pins in Tests/Ix/IxVM.lean::kernelCheckEntries updated. Every
constant got cheaper; none regressed. Largest reductions:

  • Vector.append: 4.02B → 3.16B (−21.4%)
  • Array.append_assoc: 3.94B → 3.08B (−21.8%)

lake test -- --ignored ixvm passes with 0 FFT mismatches.

Cumulative on UTF-8 _proof_1_8

Commit FFT
baseline (main) OOM
a6cd34a Tier 1d 39.12B
36d6c3a g_or → + in u64_sub_with_borrow 38.14B (−2.6%)
9e787da g_or → + in klimbs_add_carry / klimbs_sub_borrow 38.07B (−0.18%)
4e379e7 hot/cold try_nat_dispatch 37.80B (−0.7%)
80ce3d2 hot/cold expr_lbr 37.62B (−0.5%)
1f8effd hot/cold try_extract_nat 37.31B (−0.8%)

Post-unlock optimization: −4.6% (39.12B → 37.31B).

Cost on small targets

  • Main baseline Nat.add_comm: 56.08M FFT.
  • This branch Nat.add_comm: 55.50M FFT.

Tier 1d itself adds no overhead on the common case (whnf_nd +
struct_safe + try_lazy_delta_app are themselves Aiur-memoized); the
follow-up optimizations are net wins.

Test plan

  • lake exe check Nat.add_comm passes (55.50M FFT).
  • lake exe check Vector.extract_append passes.
  • lake exe check "_private.…assemble₄_eq_some_of_toBitVec._proof_1_8"
    passes (37.31B FFT, previously OOM).
  • lake test -- --ignored ixvm — all 41 FFT pins updated, suite
    passes with 0 mismatches.

…ost-spine-congruence)

UTF-8 `_private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?
.assemble₄_eq_some_of_toBitVec._proof_1_8` OOMs on the previous
pipeline because `k_is_def_eq_core` jumps from Tier 1.5 straight to
full delta WHNF (Tier 2); the cascading Nat.rec / Nat.succ iota
expansions then drive `whnf_const_head` past 1M unique entries before
either side reaches a comparable canonical form. Rust's def-eq settles
the same pair via the no-delta whnf + quick structural recursion
before any of that fires.

This patch ports the three pieces of that short-circuit and nothing
else — no FVar variant, no KStore, no Subst changes, no signature
sweep.

* `whnf_nd` family in `Whnf.lean` (mirror Rust `whnf_no_delta_for_def_eq`).
  Same dispatch tree as `whnf` (beta / let zeta / iota / proj / quot /
  primitives all fire), except `whnf_nd_const_head`'s Defn arm falls
  through to a stuck `apply_spine` instead of delta-unfolding.

* `k_infer_only` family in `Infer.lean` (mirror Rust `with_infer_only`).
  App drops `k_check(a, dom)`; Lam drops `k_ensure_sort(ty)`; Let drops
  the val/ty validations. Distinct Aiur memo from `k_infer`, parity
  with Rust's separate `infer_cache` / `infer_only_cache`.

* `k_is_def_eq_struct_safe` in `DefEq.lean` (mirror Rust
  `quick_def_eq`). Sort-Sort via `level_equal`; Lam-Lam / All-All
  recurse on type and on body under `Cons(ty_a, types)`. Returns 1
  only when DEFINITELY def-eq; 0 means fall through. Sound on
  partially-whnf'd (no-delta) inputs because the handled shapes
  don't depend on further reductions.

* `k_is_def_eq_core` Tier 1d wiring inserted between Tier 1c (string
  lit) and Tier 2 (full whnf):
    aw_nd = whnf_nd(a); bw_nd = whnf_nd(b)
    ptr_eq(aw_nd, bw_nd) → 1
    k_is_def_eq_struct_safe(aw_nd, bw_nd) → 1 if 1
    try_lazy_delta_app(aw_nd, bw_nd) → 1 if 1   (rerun post-whnf_nd:
      spine args may have reduced past what Tier 1.5's pre-whnf attempt
      could see, exposing Const-Const congruence that was hidden)

* `try_proof_irrel`, `is_prop_type`, `try_unit_like` switch from
  `k_infer` to `k_infer_only` — these helpers only need the synthesized
  type, not the full re-validation work that `k_infer` does for each
  recursive App/Let/Lam.

Each piece individually validated necessary (removing it puts UTF-8
back into the OOM regime). FVar variant + opens, KStore explicit
caches, infer_only's FVar-based binder opening — all confirmed NOT
necessary for the UTF-8 unblock and left out (see PLAN.md for future
experiments).

Measured (FFT cost):
  Nat.add_comm: 56.08M → 55.63M (~stable; new code paths add no
  overhead on the common case because Tier 1d's whnf_nd + struct_safe
  are themselves Aiur-memoized).
  _private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?
   .assemble₄_eq_some_of_toBitVec._proof_1_8: OOM → 39.12B FFT, passes.

3 files, +296/-6 lines.
`u64_sub_with_borrow` combines two per-byte borrow bits with `g_or`. The
two bits are MUTUALLY EXCLUSIVE: `u_t = borrow(a_i - b_i)` and
`u_r = borrow((a_i + 256 - b_i) - br_in)`. If `u_t = 1` the intermediate
`t_i ≥ 1`, so subtracting `br_in ∈ {0,1}` cannot underflow ⇒ `u_r = 0`.
Field `+` substitutes for `g_or` directly (per the same pattern as
`u64_add` in `ByteStream.lean`).

Per Aiur cost model, `g_or` adds +1 aux + 1 lookup per call; field `+`
is free. 8 g_or call sites in `u64_sub_with_borrow` each charged on every
one of the function's 2.23M rows.

Measured (FFT cost) on UTF-8 `_proof_1_8`:
  39.12B → 38.14B (-2.6%)
  Nat.add_comm unchanged (55.63M).

See [[reference_aiur_carry_add]].
Same mutually-exclusive-carry pattern as `u64_sub_with_borrow`:
* `klimbs_add_carry`: u64_add of (la, lb) yields carry1; u64_add of
  (sum1, carry_in) yields carry2. carry1=1 ⇒ sum1 ≤ 2^64-2 ⇒
  carry2=0.
* `klimbs_sub_borrow`: symmetric for borrows.

Replace `g_or(c1, c2)` with `c1 + c2` (field +). Both helpers run on
hot Nat-primitive paths.

Measured on UTF-8 `_proof_1_8`:
  38.14B → 38.07B (-0.18%)
  Nat.add_comm unchanged.

See [[reference_aiur_carry_add]].
`try_nat_dispatch` ran 1.12M rows in UTF-8 `_proof_1_8` at width 90,
charging 5.16% of total FFT. Width was floored by its widest match arm
(the binop branch with 2× whnf + 2× try_extract_nat + try_nat_binop_addr
+ apply_spine), even on Nat.succ / Nat.pred rows that never touched it.

Factor binop dispatch into its own `try_nat_binop_dispatch` fn. Main
dispatcher narrows to the max of succ / pred arms (single whnf +
try_extract_nat + klimbs_succ/dec + apply_spine). The cold fn's width
only charges the rows that actually dispatch a binop.

Measured on UTF-8 `_proof_1_8`:
  38.07B → 37.80B (-0.7%)
  Nat.add_comm unchanged.
`expr_lbr` ran 1.47M rows in UTF-8 `_proof_1_8` at width 39, charging
3.01% of total FFT. The Let arm (3 recursive expr_lbr calls + 2 lbr_max
+ 1 lbr_dec) is the widest match arm, charged on every row of expr_lbr
even though Let is rare in most expressions encountered.

Factor the Let arm into `expr_lbr_let(ty, val, body)`. Main expr_lbr
narrows to max of the 2-recursion arms (App / Lam / Forall). Cold fn
only charges Let-arm rows.

Measured:
  Nat.add_comm: 55.63M → 55.50M (-0.2%)
  UTF-8 `_proof_1_8`: 37.80B → 37.62B (-0.5%)
`try_extract_nat` ran 1.12M rows at width 45, charging 2.68% of UTF-8
`_proof_1_8` total FFT. The App arm (list_lookup + address_eq +
recursive try_extract_nat + klimbs_succ) is the widest match arm; the
Lit / Const / default arms are leaf compares.

Factor App into `try_extract_nat_app(f, a, addrs)`. Main extractor
narrows to leaf-arm width. Cold fn only charges App-arm rows.

Measured on UTF-8 `_proof_1_8`:
  37.62B → 37.31B (-0.8%)
  Nat.add_comm unchanged.
Updates 41 pinned FFT costs in `Tests/Ix/IxVM.lean::kernelCheckEntries`
to match the new kernel's output. All pins moved DOWN — every constant
got cheaper, none regressed.

Largest reductions (% change):
  Vector.append:          4_023_268_168 → 3_160_970_390  (-21.4%)
  Array.append_assoc:     3_938_574_533 → 3_079_334_815  (-21.8%)
  String.Internal.append:   793_580_333 →   775_968_134  ( -2.2%)
  bv_to_nat_lit:            635_780_327 →   619_870_154  ( -2.5%)
  nat_gcd_lit:              665_518_356 →   649_859_784  ( -2.4%)
  Nat.sub_le_of_le_add:     567_575_653 →   557_867_526  ( -1.7%)
  IxVMPrim.nat_mod_lit:     414_695_549 →   407_517_834  ( -1.7%)
  IxVMPrim.nat_div_lit:     405_607_545 →   398_641_590  ( -1.7%)
  IxVMPrim.nat_shr_lit:     411_128_901 →   404_158_486  ( -1.7%)
  Nat.decLe:                209_641_496 →   206_196_563  ( -1.6%)
  Nat.add_comm:              56_084_908 →    55_504_714  ( -1.0%)

`lake test -- --ignored ixvm` passes with 0 FFT mismatches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant