IxVM kernel: unblock UTF-8 decode/encode proof + Nat-layer FFT cuts#450
Open
arthurpaulino wants to merge 7 commits into
…ost-spine-congruence)
UTF-8 `_private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?
.assemble₄_eq_some_of_toBitVec._proof_1_8` OOMs on the previous
pipeline because `k_is_def_eq_core` jumps from Tier 1.5 straight to
full delta WHNF (Tier 2); the cascading Nat.rec / Nat.succ iota
expansions then drive `whnf_const_head` past 1M unique entries before
either side reaches a comparable canonical form. Rust's def-eq settles
the same pair via the no-delta whnf + quick structural recursion
before any of that fires.
This patch ports the three pieces of that short-circuit and nothing
else — no FVar variant, no KStore, no Subst changes, no signature
sweep.
* `whnf_nd` family in `Whnf.lean` (mirror Rust `whnf_no_delta_for_def_eq`).
  Same dispatch tree as `whnf` (beta / let zeta / iota / proj / quot /
  primitives all fire), except `whnf_nd_const_head`'s Defn arm falls
  through to a stuck `apply_spine` instead of delta-unfolding.
* `k_infer_only` family in `Infer.lean` (mirror Rust `with_infer_only`).
  App drops `k_check(a, dom)`; Lam drops `k_ensure_sort(ty)`; Let drops
  the val/ty validations. Distinct Aiur memo from `k_infer`, parity
  with Rust's separate `infer_cache` / `infer_only_cache`.
* `k_is_def_eq_struct_safe` in `DefEq.lean` (mirror Rust
  `quick_def_eq`). Sort-Sort via `level_equal`; Lam-Lam / All-All
  recurse on type and on body under `Cons(ty_a, types)`. Returns 1
  only when DEFINITELY def-eq; 0 means fall through. Sound on
  partially-whnf'd (no-delta) inputs because the handled shapes
  don't depend on further reductions.
* `k_is_def_eq_core` Tier 1d wiring inserted between Tier 1c (string
  lit) and Tier 2 (full whnf):
      aw_nd = whnf_nd(a); bw_nd = whnf_nd(b)
      ptr_eq(aw_nd, bw_nd)                    → 1
      k_is_def_eq_struct_safe(aw_nd, bw_nd)   → 1 if 1
      try_lazy_delta_app(aw_nd, bw_nd)        → 1 if 1
  (`try_lazy_delta_app` is rerun post-whnf_nd: spine args may have
  reduced past what Tier 1.5's pre-whnf attempt could see, exposing
  Const-Const congruence that was hidden.)
* `try_proof_irrel`, `is_prop_type`, `try_unit_like` switch from
  `k_infer` to `k_infer_only`: these helpers only need the synthesized
  type, not the full re-validation work that `k_infer` does for each
  recursive App/Let/Lam.
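The difference between a checking `infer` and an infer-only mode can be sketched on a toy simply-typed calculus. Everything below (the term classes, the `CHECKS` counter, the `check` flag) is hypothetical and only illustrates the idea of skipping argument validation in the App case; it is not the kernel's `k_infer` / `k_infer_only`, which also handle Let, sorts, and separate memo tables.

```python
from dataclasses import dataclass

# Toy simply-typed terms; all names are hypothetical, not the kernel's.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Lam:
    param: str
    param_ty: object
    body: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Arrow:
    dom: object
    cod: object

CHECKS = {"count": 0}  # number of argument validations performed

def infer(ctx, e, check=True):
    """Synthesize the type of e; with check=False, skip argument validation."""
    if isinstance(e, Var):
        return ctx[e.name]
    if isinstance(e, Lam):
        body_ty = infer({**ctx, e.param: e.param_ty}, e.body, check)
        return Arrow(e.param_ty, body_ty)
    if isinstance(e, App):
        fn_ty = infer(ctx, e.fn, check)
        if not isinstance(fn_ty, Arrow):
            raise TypeError("function expected")
        if check:
            # The full mode re-infers the argument and compares with the domain.
            CHECKS["count"] += 1
            if infer(ctx, e.arg, check) != fn_ty.dom:
                raise TypeError("argument type mismatch")
        # Infer-only mode trusts the term and just returns the codomain.
        return fn_ty.cod
    raise TypeError("unknown term")

def infer_only(ctx, e):
    return infer(ctx, e, check=False)
```

On a term that has already been validated, both modes return the same type, but the infer-only mode never walks into the arguments. That is the saving a helper gets when it needs only the synthesized type.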
Each piece individually validated necessary (removing it puts UTF-8
back into the OOM regime). FVar variant + opens, KStore explicit
caches, infer_only's FVar-based binder opening — all confirmed NOT
necessary for the UTF-8 unblock and left out (see PLAN.md for future
experiments).
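The tiering idea (reduce without delta, try a cheap structural comparison, and only then fall back to full unfolding) can be sketched on a toy term language. This is a minimal model under stated assumptions: the tuple encoding, `DEFS`, `STATS`, and the syntactic fallback comparison are all hypothetical, and the real `k_is_def_eq_core` has more tiers and a types context that this sketch omits.

```python
# Toy terms as tuples (hypothetical, not the kernel's representation):
#   ("sort", n) | ("const", name) | ("bvar", i) | ("app", f, a)
#   ("lam", ty, body) | ("all", ty, body)
DEFS = {
    "f": ("lam", ("sort", 0), ("bvar", 0)),  # f := fun x => x
    "c": ("sort", 1),                        # c := Sort 1
}
STATS = {"delta": 0}  # counts definition unfoldings

def instantiate(body, arg, depth=0):
    """Substitute a closed `arg` for bound variable `depth` in `body`."""
    tag = body[0]
    if tag == "bvar":
        i = body[1]
        return arg if i == depth else ("bvar", i - 1 if i > depth else i)
    if tag == "app":
        return ("app", instantiate(body[1], arg, depth),
                instantiate(body[2], arg, depth))
    if tag in ("lam", "all"):
        return (tag, instantiate(body[1], arg, depth),
                instantiate(body[2], arg, depth + 1))
    return body  # sort / const

def whnf(t, delta):
    """Head-reduce: beta always fires; definitions unfold only if delta=True."""
    while True:
        if t[0] == "app":
            f = whnf(t[1], delta)
            if f[0] == "lam":
                t = instantiate(f[2], t[2])
                continue
            return ("app", f, t[2])  # stuck application
        if t[0] == "const" and delta and t[1] in DEFS:
            STATS["delta"] += 1
            t = DEFS[t[1]]
            continue
        return t

def struct_safe(a, b):
    """True only when a and b are definitely equal; False means 'unknown'."""
    if a == b:
        return True
    if a[0] == b[0] and a[0] in ("lam", "all"):
        return is_def_eq(a[1], b[1]) and is_def_eq(a[2], b[2])
    return False

def is_def_eq(a, b):
    if a == b:
        return True
    a_nd, b_nd = whnf(a, delta=False), whnf(b, delta=False)
    if struct_safe(a_nd, b_nd):  # cheap tier: no definition was unfolded
        return True
    # Fallback tier: full delta whnf, then a (toy) syntactic comparison.
    return whnf(a_nd, delta=True) == whnf(b_nd, delta=True)

# (fun x => f x) c  versus  f c: equal after beta alone, zero unfoldings.
lhs = ("app", ("lam", ("sort", 2), ("app", ("const", "f"), ("bvar", 0))),
       ("const", "c"))
rhs = ("app", ("const", "f"), ("const", "c"))
```

In this model, `is_def_eq(lhs, rhs)` succeeds with `STATS["delta"]` still at 0, whereas comparing `("const", "c")` with `("sort", 1)` has to reach the fallback tier and unfold `c` once.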
Measured (FFT cost):
Nat.add_comm: 56.08M → 55.63M (~stable; new code paths add no
overhead on the common case because Tier 1d's whnf_nd + struct_safe
are themselves Aiur-memoized).
_private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?
.assemble₄_eq_some_of_toBitVec._proof_1_8: OOM → 39.12B FFT, passes.
3 files, +296/-6 lines.
`u64_sub_with_borrow` combines two per-byte borrow bits with `g_or`. The
two bits are MUTUALLY EXCLUSIVE: `u_t = borrow(a_i - b_i)` and
`u_r = borrow((a_i + 256 - b_i) - br_in)`. If `u_t = 1` the intermediate
`t_i ≥ 1`, so subtracting `br_in ∈ {0,1}` cannot underflow ⇒ `u_r = 0`.
Field `+` substitutes for `g_or` directly (per the same pattern as
`u64_add` in `ByteStream.lean`).
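The mutual-exclusivity argument is small enough to check exhaustively. The sketch below models one byte of the subtraction as two steps (first `a - b`, then minus `br_in`); the function names are hypothetical and the model assumes borrows are computed as plain "less than" tests, which may differ in detail from the Aiur implementation.

```python
def byte_sub_with_borrow(a, b, br_in):
    """Two-step byte subtraction; returns (result, u_t, u_r)."""
    u_t = 1 if a < b else 0        # borrow of a - b
    t = (a - b) % 256              # equals a + 256 - b when u_t = 1, so t >= 1
    u_r = 1 if t < br_in else 0    # borrow of t - br_in
    r = (t - br_in) % 256
    return r, u_t, u_r

def check_all():
    """Exhaustively confirm the two borrow bits are never both set."""
    for a in range(256):
        for b in range(256):
            for br_in in (0, 1):
                r, u_t, u_r = byte_sub_with_borrow(a, b, br_in)
                assert not (u_t and u_r)              # mutually exclusive
                assert u_t + u_r == (u_t | u_r)       # so + can replace OR
                assert r == (a - b - br_in) % 256     # result is still right
                assert u_t + u_r == (1 if a < b + br_in else 0)
    return True
```

Because the sum of the two bits never reaches 2, the field addition stays in `{0, 1}` and can be used wherever the OR was.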
Per Aiur cost model, `g_or` adds +1 aux + 1 lookup per call; field `+`
is free. 8 g_or call sites in `u64_sub_with_borrow` each charged on every
one of the function's 2.23M rows.
Measured (FFT cost) on UTF-8 `_proof_1_8`:
39.12B → 38.14B (-2.6%)
Nat.add_comm unchanged (55.63M).
See [[reference_aiur_carry_add]].
Same mutually-exclusive-carry pattern as `u64_sub_with_borrow`:
* `klimbs_add_carry`: u64_add of (la, lb) yields carry1; u64_add of
  (sum1, carry_in) yields carry2. carry1=1 ⇒ sum1 ≤ 2^64-2 ⇒ carry2=0.
* `klimbs_sub_borrow`: symmetric for borrows.
Replace `g_or(c1, c2)` with `c1 + c2` (field +). Both helpers run on hot
Nat-primitive paths.
Measured on UTF-8 `_proof_1_8`:
38.14B → 38.07B (-0.18%)
Nat.add_comm unchanged.
See [[reference_aiur_carry_add]].
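The carry version of the argument can be sketched the same way. The helpers below are hypothetical stand-ins for a limb add with carry-in; the `bits` parameter exists only so the claim can be checked exhaustively at a small width, on the assumption that the 64-bit case behaves identically.

```python
def add_limb(la, lb, carry_in, bits=64):
    """Add two limbs plus a carry bit; returns (sum, carry1, carry2)."""
    mask = (1 << bits) - 1
    s1 = la + lb
    carry1, sum1 = s1 >> bits, s1 & mask   # carry1 = 1 forces sum1 <= 2^bits - 2
    s2 = sum1 + carry_in
    carry2, sum2 = s2 >> bits, s2 & mask
    return sum2, carry1, carry2

def add_limbs(xs, ys, bits=64):
    """Little-endian multi-limb add that combines carries with + instead of OR."""
    out, carry = [], 0
    for la, lb in zip(xs, ys):
        s, c1, c2 = add_limb(la, lb, carry, bits)
        out.append(s)
        carry = c1 + c2  # never 2, because c1 and c2 are mutually exclusive
    return out, carry

def carries_exclusive(bits):
    """Exhaustive check (only feasible for small widths)."""
    limit = 1 << bits
    return all(
        c1 + c2 <= 1
        for la in range(limit)
        for lb in range(limit)
        for cin in (0, 1)
        for _, c1, c2 in [add_limb(la, lb, cin, bits)]
    )
```

`carries_exclusive(8)` covers every 8-bit input; at 64 bits the extreme case `la = lb = 2^64 - 1` with `carry_in = 1` gives `carry1 = 1`, `sum1 = 2^64 - 2`, and `carry2 = 0`.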
`try_nat_dispatch` ran 1.12M rows in UTF-8 `_proof_1_8` at width 90,
charging 5.16% of total FFT. Width was floored by its widest match arm
(the binop branch with 2× whnf + 2× try_extract_nat + try_nat_binop_addr
+ apply_spine), even on Nat.succ / Nat.pred rows that never touched it.
Factor binop dispatch into its own `try_nat_binop_dispatch` fn. Main
dispatcher narrows to the max of succ / pred arms (single whnf +
try_extract_nat + klimbs_succ/dec + apply_spine). The cold fn's width
only charges the rows that actually dispatch a binop.
Measured on UTF-8 `_proof_1_8`:
38.07B → 37.80B (-0.7%)
Nat.add_comm unchanged.
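The arithmetic behind a hot/cold split can be sketched with a deliberately simplified cost model: cost is taken to be rows × width, with a function's width equal to its widest arm. That model is an assumption made for illustration (the real FFT cost depends on more than this product), and apart from the 1.12M rows and width 90 quoted above, every number below is hypothetical.

```python
def unsplit_cost(rows, wide_width):
    """Every row pays the width of the widest arm."""
    return rows * wide_width

def split_cost(rows, cold_rows, hot_width, cold_width):
    """Every row pays the narrow hot width; only cold rows pay the cold fn."""
    return rows * hot_width + cold_rows * cold_width

def break_even_fraction(wide_width, hot_width, cold_width):
    """Fraction of cold rows below which the split is cheaper."""
    return (wide_width - hot_width) / cold_width

ROWS, WIDE = 1_120_000, 90                       # figures quoted in the text
HOT, COLD, COLD_ROWS = 50, 60, 200_000           # hypothetical, illustration only

before = unsplit_cost(ROWS, WIDE)                # 100_800_000
after = split_cost(ROWS, COLD_ROWS, HOT, COLD)   # 68_000_000
```

Under these made-up widths the split wins as long as fewer than two thirds of the rows take the cold arm, which is why factoring out a rarely taken wide arm pays off.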
`expr_lbr` ran 1.47M rows in UTF-8 `_proof_1_8` at width 39, charging
3.01% of total FFT. The Let arm (3 recursive expr_lbr calls, 2 lbr_max,
1 lbr_dec) is the widest match arm, charged on every row of expr_lbr
even though Let is rare in most expressions encountered.
Factor the Let arm into `expr_lbr_let(ty, val, body)`. Main expr_lbr
narrows to max of the 2-recursion arms (App / Lam / Forall). Cold fn
only charges Let-arm rows.
Measured:
Nat.add_comm: 55.63M → 55.50M (-0.2%)
UTF-8 `_proof_1_8`: 37.80B → 37.62B (-0.5%)
`try_extract_nat` ran 1.12M rows at width 45, charging 2.68% of UTF-8
`_proof_1_8` total FFT. The App arm (list_lookup + address_eq +
recursive try_extract_nat + klimbs_succ) is the widest match arm; the
Lit / Const / default arms are leaf compares.
Factor App into `try_extract_nat_app(f, a, addrs)`. Main extractor
narrows to leaf-arm width. Cold fn only charges App-arm rows.
Measured on UTF-8 `_proof_1_8`:
37.62B → 37.31B (-0.8%)
Nat.add_comm unchanged.
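The shape of this refactor can be sketched as follows. The sketch assumes the App arm recognizes `Nat.succ a` and increments the extracted value; that reading is inferred from the arm's description and is not confirmed by the source. The tuple encoding is hypothetical and the `addrs` parameter is omitted. In Python the split changes nothing about cost, so the sketch only shows the structure that lets the main function keep leaf-only arms.

```python
NAT_ZERO, NAT_SUCC = "Nat.zero", "Nat.succ"

def try_extract_nat(e):
    """Hot path: leaf arms only. Returns an int, or None if e is not a numeral."""
    tag = e[0]
    if tag == "lit":
        return e[1]
    if tag == "const":
        return 0 if e[1] == NAT_ZERO else None
    if tag == "app":
        # The wide arm lives in its own function, so it is paid for only here.
        return try_extract_nat_app(e[1], e[2])
    return None

def try_extract_nat_app(f, a):
    """Cold path: handles `Nat.succ a` by recursing and adding one."""
    if f != ("const", NAT_SUCC):
        return None
    n = try_extract_nat(a)
    return None if n is None else n + 1

two = ("app", ("const", NAT_SUCC),
       ("app", ("const", NAT_SUCC), ("const", NAT_ZERO)))
```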
Updates 41 pinned FFT costs in `Tests/Ix/IxVM.lean::kernelCheckEntries`
to match the new kernel's output. All pins moved DOWN: every constant
got cheaper, none regressed.
Largest reductions (% change):
Vector.append:           4_023_268_168 → 3_160_970_390 (-21.4%)
Array.append_assoc:      3_938_574_533 → 3_079_334_815 (-21.8%)
String.Internal.append:    793_580_333 →   775_968_134 ( -2.2%)
bv_to_nat_lit:             635_780_327 →   619_870_154 ( -2.5%)
nat_gcd_lit:               665_518_356 →   649_859_784 ( -2.4%)
Nat.sub_le_of_le_add:      567_575_653 →   557_867_526 ( -1.7%)
IxVMPrim.nat_mod_lit:      414_695_549 →   407_517_834 ( -1.7%)
IxVMPrim.nat_div_lit:      405_607_545 →   398_641_590 ( -1.7%)
IxVMPrim.nat_shr_lit:      411_128_901 →   404_158_486 ( -1.7%)
Nat.decLe:                 209_641_496 →   206_196_563 ( -1.6%)
Nat.add_comm:               56_084_908 →    55_504_714 ( -1.0%)
`lake test -- --ignored ixvm` passes with 0 FFT mismatches.
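The percentage column can be recomputed from the pinned values. The sketch below uses three of the pairs listed above; the `PINS` dictionary and helper names are hypothetical and unrelated to how `kernelCheckEntries` stores its pins.

```python
# (before, after) FFT costs copied from the list above.
PINS = {
    "Vector.append": (4_023_268_168, 3_160_970_390),
    "Array.append_assoc": (3_938_574_533, 3_079_334_815),
    "Nat.add_comm": (56_084_908, 55_504_714),
}

def pct_change(before, after):
    """Relative change in percent, rounded to one decimal place."""
    return round(100.0 * (after - before) / before, 1)

def regressions(pins):
    """Names of constants whose cost went up."""
    return [name for name, (before, after) in pins.items() if after > before]

for name, (before, after) in PINS.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```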
039e9cf to
07712ff
Headline
The Aiur kernel previously OOM'd on
`_private.Init.Data.String.Decode.0.ByteArray.utf8DecodeChar?.assemble₄_eq_some_of_toBitVec._proof_1_8`.
This constant is a prerequisite of
`ByteArray.utf8DecodeChar?_utf8EncodeChar_append` (part of the UTF-8
round-trip lemma chain).
Now it typechecks at 37.31B FFT (down from the OOM baseline). The
two-direction lemma `…_utf8EncodeChar_append` itself remains
out of reach (still OOMs further along the same dependency tree under
the user's memory cap), but the prerequisite that was the immediate
blocker is unstuck.
The same kernel changes also deliver large FFT cuts on previously
expensive targets: Tier 1d's structural short-circuit replaces full
delta-whnf cascades wherever it fires.
* `Vector.append`: 4_023_268_168 → 3_160_970_390 (−21.4%)
* `Array.append_assoc`: 3_938_574_533 → 3_079_334_815 (−21.8%)

Commits
1. `a6cd34a`: Tier 1d def-eq short-circuit (the unblock)
Aiur's `k_is_def_eq_core` jumped from Tier 1.5 straight to full delta
WHNF (Tier 2). The cascading Nat.rec / Nat.succ iota expansions then
drove `whnf_const_head` past 1M unique entries before either side
reached a comparable canonical form, hence the OOM. Rust's def-eq
settles the same pair via no-delta whnf + quick structural recursion
before any of that fires.
Three minimum-necessary pieces ported. No KStore. No FVar. No Subst /
KernelTypes change.
* `whnf_nd` family (`Whnf.lean`, mirror Rust `whnf_no_delta_for_def_eq`).
  Same dispatch tree as `whnf`, but `whnf_nd_const_head`'s Defn arm falls
  through to a stuck `apply_spine` instead of delta-unfolding. Iota /
  proj / quot / primitives still fire.
* `k_infer_only` family (`Infer.lean`, mirror Rust `with_infer_only`).
  App drops `k_check(a, dom)`; Lam drops `k_ensure_sort(ty)`; Let drops
  val/ty validation. Used at `try_proof_irrel`, `is_prop_type`,
  `try_unit_like`: the def-eq tactics that only need the synthesized type.
* `k_is_def_eq_struct_safe` + Tier 1d wiring (`DefEq.lean`, mirror Rust
  `quick_def_eq` + post-`try_def_eq_app`). Sort-Sort via `level_equal`;
  Lam-Lam / All-All via recursive `k_is_def_eq` on the type and on the
  body under `Cons(ty_a, types)` (types-cons, NOT FVar opening).
  Inserted between Tier 1c (string lit) and Tier 2 (full whnf):
      aw_nd = whnf_nd(a); bw_nd = whnf_nd(b)
      ptr_eq(aw_nd, bw_nd)                    → 1
      k_is_def_eq_struct_safe(aw_nd, bw_nd)   → 1 if 1
      try_lazy_delta_app(aw_nd, bw_nd)        → 1 if 1
Each piece independently validated necessary. The previously-tried
KStore explicit caches, FVar variant + opens, and FVar-based binder
opening were all confirmed NOT necessary for this unblock and left out.
3 files changed, +296/−6 lines.

2. `36d6c3a`: drop `g_or` from `u64_sub_with_borrow`
`u64_sub_with_borrow` combined two per-byte borrow bits with `g_or`.
The two bits are mutually exclusive: `u_t = 1` ⇒ intermediate
`t_i ≥ 1` ⇒ subtracting `br_in ∈ {0,1}` cannot underflow ⇒ `u_r = 0`.
Field `+` substitutes for `g_or` directly. Per Aiur cost model `g_or`
adds +1 aux + 1 lookup per call (≈ 5 width); field `+` is free.
7 g_ors × 2.23M rows.
UTF-8 `_proof_1_8`: 39.12B → 38.14B (−2.6%).

3. `9e787da`: drop `g_or` from `klimbs_add_carry` / `klimbs_sub_borrow`
Same mutually-exclusive-carry pattern. Two limb-level borrows /
carries from sequential u64 ops cannot both be 1.
UTF-8 `_proof_1_8`: 38.14B → 38.07B (−0.18%).

4. `4e379e7`: hot/cold split `try_nat_dispatch`, extract binop arm
`try_nat_dispatch`'s width 90 was floored by the binop arm (2× whnf +
2× try_extract_nat + try_nat_binop_addr + apply_spine), charged on
every Nat.succ / Nat.pred row. Factor binop dispatch into
`try_nat_binop_dispatch`. Main narrows to the max of succ / pred arms.
UTF-8 `_proof_1_8`: 38.07B → 37.80B (−0.7%).

5. `80ce3d2`: hot/cold split `expr_lbr`, extract Let arm
`expr_lbr`'s width was floored by the Let arm (3 recursive expr_lbr
calls + 2 lbr_max + 1 lbr_dec), charged on every row even though Let
is rare. Factor into `expr_lbr_let`.
Nat.add_comm: 55.63M → 55.50M (−0.2%). UTF-8 `_proof_1_8`: 37.80B
→ 37.62B (−0.5%).

6. `1f8effd`: hot/cold split `try_extract_nat`, extract App arm
`try_extract_nat`'s width was floored by the App arm (list_lookup +
address_eq + recursive try_extract_nat + klimbs_succ). Factor into
`try_extract_nat_app`. Main narrows to leaf-arm width.
UTF-8 `_proof_1_8`: 37.62B → 37.31B (−0.8%).

7. `039e9cf`: re-pin IxVM FFT costs
41 pins in `Tests/Ix/IxVM.lean::kernelCheckEntries` updated. Every
constant got cheaper; none regressed. Largest reductions:
`Array.append_assoc` (−21.8%) and `Vector.append` (−21.4%).
`lake test -- --ignored ixvm` passes with 0 FFT mismatches.

Cumulative on UTF-8 `_proof_1_8`
* (main): OOM
* `a6cd34a` Tier 1d: 39.12B
* `36d6c3a` g_or → + in `u64_sub_with_borrow`: 38.14B
* `9e787da` g_or → + in `klimbs_add_carry` / `klimbs_sub_borrow`: 38.07B
* `4e379e7` hot/cold `try_nat_dispatch`: 37.80B
* `80ce3d2` hot/cold `expr_lbr`: 37.62B
* `1f8effd` hot/cold `try_extract_nat`: 37.31B
Post-unlock optimization: −4.6% (39.12B → 37.31B).
Cost on small targets
* Before: `Nat.add_comm`: 56.08M FFT.
* After: `Nat.add_comm`: 55.50M FFT.
Tier 1d itself adds no overhead on the common case (`whnf_nd` +
`struct_safe` + `try_lazy_delta_app` are themselves Aiur-memoized); the
follow-up optimizations are net wins.
Test plan
* `lake exe check Nat.add_comm` passes (55.50M FFT).
* `lake exe check Vector.extract_append` passes.
* `lake exe check "_private.…assemble₄_eq_some_of_toBitVec._proof_1_8"`
  passes (37.31B FFT, previously OOM).
* `lake test -- --ignored ixvm`: all 41 FFT pins updated, suite
  passes with 0 mismatches.