fix: size FP16 and ternary (TQ1_0/TQ2_0) GGUF quant types correctly #125

Open
SuperMarioYL wants to merge 1 commit into Andyyyy64:main from SuperMarioYL:fix/gguf-quant-vocab-fp16-ternary

@SuperMarioYL

What

_extract_quant_type() (the GGUF-filename → quant-type parser in models/fetcher.py) emitted tokens that the canonical QUANT_BYTES_PER_WEIGHT / QUANT_QUALITY_PENALTY tables don't key on, so two quant families were mishandled at fetch time.

1. FP16 GGUFs were sized ~3.5× too small

A *-FP16.gguf sibling extracts to the literal "FP16", but data/quantization.py keys full precision as F16 (there is no FP16 key). _estimate_gguf_size() does:

bpw = QUANT_BYTES_PER_WEIGHT.get(quant_type.upper(), 0.5625)  # default Q4_K_M

so the lookup misses and falls back to the Q4_K_M default of 0.5625 bytes/weight. A 7B FP16 GGUF is then estimated at ~3.94 GB (7B × 0.5625 bytes) instead of 14 GB (7B × 2 bytes). Because the HF siblings listing carries no per-file size, _estimate_gguf_size() is the primary GGUF sizing path, so this affects essentially every FP16 GGUF. The result is optimistic "fits" verdicts, the opposite of what a hardware-fit advisor should produce.
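
The arithmetic behind the two figures can be reproduced with a minimal sketch. The table below is a hypothetical two-entry stand-in for QUANT_BYTES_PER_WEIGHT (only the 0.5625 default is quoted in this PR; 2.0 bytes/weight for F16 follows from 16-bit floats), and estimate_size_gb is an illustrative helper, not the project's _estimate_gguf_size():

```python
# Hypothetical subset of the byte table, for illustration only.
QUANT_BYTES_PER_WEIGHT = {"F16": 2.0, "Q4_K_M": 0.5625}


def estimate_size_gb(params_billions: float, quant_type: str) -> float:
    """Mirror the lookup-with-default behaviour described above."""
    bpw = QUANT_BYTES_PER_WEIGHT.get(quant_type.upper(), 0.5625)
    # (params_billions * 1e9 weights) * bpw bytes / 1e9 bytes per GB
    return params_billions * bpw


print(estimate_size_gb(7, "FP16"))  # key missing, default used -> 3.9375
print(estimate_size_gb(7, "F16"))   # key present -> 14.0
```

The silent default is what makes the miss hard to spot: an unrecognised token yields a plausible-looking number rather than an error.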

2. Ternary GGUFs were dropped entirely

*-TQ1_0.gguf / *-TQ2_0.gguf had no matching pattern in the extractor, so they returned "unknown" and were skipped by the if quant == "unknown": continue filter — even though the tables already fully price TQ1_0 and TQ2_0 (and list them in QUANT_PREFERENCE_ORDER). BitNet-class ternary builds therefore never appeared in results.

Fix

Additive, no table or heuristic changes:

  • Add a TQ\d+_\d+ pattern to the extractor.
  • Canonicalize full-precision aliases FP16 → F16 / FP32 → F32 to the keys the tables use (and add FP32 to the precision alternation so the alias is reachable).
  • Regression tests, including a drift guard (test_extract_quant_type_keys_resolve_in_byte_table) asserting every quant the extractor surfaces from a real GGUF filename resolves in QUANT_BYTES_PER_WEIGHT, so the extractor and tables can't silently diverge again.
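
The first two bullets can be sketched roughly as follows. This is not the actual _extract_quant_type() implementation: the regex, the _ALIASES map, and the function name extract_quant_type are assumptions made for illustration, and the real extractor likely covers more quant families.

```python
import re

# Assumed alias map: filename spellings -> keys the tables use.
_ALIASES = {"FP16": "F16", "FP32": "F32"}

# Illustrative pattern only. The lookarounds keep a token from matching
# inside a longer alphanumeric run (e.g. "Q1_0" inside "TQ1_0").
_QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])("
    r"TQ\d+_\d+"                   # ternary: TQ1_0, TQ2_0
    r"|I?Q\d+(?:_[A-Za-z0-9]+)*"   # e.g. Q4_K_M, Q8_0, IQ4_XS
    r"|FP16|FP32|BF16|F16|F32"     # full/half precision spellings
    r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def extract_quant_type(filename: str) -> str:
    """Return a canonical quant key, or "unknown" if nothing matches."""
    match = _QUANT_RE.search(filename)
    if match is None:
        return "unknown"
    token = match.group(1).upper()
    # Canonicalize aliases so the result matches the table keys.
    return _ALIASES.get(token, token)


print(extract_quant_type("model-TQ1_0.gguf"))  # TQ1_0
print(extract_quant_type("model-FP16.gguf"))   # F16
```

Canonicalizing at the extractor, rather than adding FP16 keys to the tables, keeps a single spelling per quant type in the data layer, which matches the "no table changes" constraint stated above.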

The quality-penalty path is unchanged: F16 already resolves to 0.0 in QUANT_QUALITY_PENALTY, so only the size estimate is corrected.

Testing

uv run pytest -q     # 400 passed
uv run ruff check    # clean

The four new tests are red on main and green here. No GPU required.
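
For readers unfamiliar with the drift-guard idea, a minimal sketch of how such a test can go red and then green is shown here. Everything in it is hypothetical: TABLE_KEYS stands in for the keys of QUANT_BYTES_PER_WEIGHT, the sample filenames are invented, and the "before"/"after" extractors are simplified approximations of the behaviour described in this PR, not the code on either branch.

```python
import re
from typing import Callable, Iterable

# Stand-in for the keys of QUANT_BYTES_PER_WEIGHT (hypothetical subset).
TABLE_KEYS = frozenset({"F16", "F32", "Q4_K_M", "TQ1_0", "TQ2_0"})
SAMPLE_FILENAMES = [
    "m-FP16.gguf", "m-F16.gguf", "m-Q4_K_M.gguf", "m-TQ1_0.gguf", "m-TQ2_0.gguf",
]


def make_extractor(pattern: str, aliases: dict) -> Callable[[str], str]:
    """Build a toy filename -> quant-type extractor from a regex alternation."""
    rx = re.compile(rf"(?<![A-Za-z0-9])({pattern})(?![A-Za-z0-9])", re.IGNORECASE)

    def extract(filename: str) -> str:
        m = rx.search(filename)
        if m is None:
            return "unknown"
        token = m.group(1).upper()
        return aliases.get(token, token)

    return extract


def unresolved(extract: Callable[[str], str], filenames: Iterable[str]) -> list:
    """Extracted tokens that do not resolve to a table key."""
    return sorted({q for q in map(extract, filenames) if q not in TABLE_KEYS})


# Approximation of the pre-fix behaviour: no TQ pattern, no alias map.
before = make_extractor(r"Q\d+(?:_[A-Z0-9]+)*|FP16|F16", {})
# Approximation of the post-fix behaviour.
after = make_extractor(
    r"TQ\d+_\d+|Q\d+(?:_[A-Z0-9]+)*|FP16|FP32|F16|F32",
    {"FP16": "F16", "FP32": "F32"},
)


def test_extracted_keys_resolve() -> None:
    assert unresolved(after, SAMPLE_FILENAMES) == []


print(unresolved(before, SAMPLE_FILENAMES))  # ['FP16', 'unknown']
print(unresolved(after, SAMPLE_FILENAMES))   # []
```

Checking the extractor's output against the table's keys, instead of against a hard-coded expected list, is what lets the guard catch future divergence in either direction.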

Notes

  • Scope is limited to models/fetcher.py + tests; the separate non-GGUF (engine/quantization.py) FP16-keyed vocabulary used for repo-name inference is intentionally left untouched.
  • "Does a real repo actually ship *-FP16.gguf / *-TQ1_0.gguf?" — yes; these are common llama.cpp naming conventions (full-precision conversions and BitNet ternary builds).

_extract_quant_type emitted tokens the byte/penalty tables don't key on,
so two quant families were mishandled at fetch time:

- "*-FP16.gguf" extracted to "FP16", but QUANT_BYTES_PER_WEIGHT keys full
  precision as "F16". _estimate_gguf_size then missed the table and fell
  back to the Q4_K_M 0.5625 bytes/weight default, sizing a 7B FP16 GGUF at
  ~3.94 GB instead of 14 GB. Since the HF siblings listing carries no
  per-file size, this estimate is the primary GGUF sizing path.
- "*-TQ1_0.gguf"/"*-TQ2_0.gguf" had no matching pattern and extracted to
  "unknown", so BitNet-class ternary GGUFs were dropped entirely even
  though the tables already price TQ1_0/TQ2_0.

Add a TQ pattern, canonicalize FP16->F16 / FP32->F32 to the table keys, and
add regression tests including a drift guard that asserts every extracted
quant resolves in QUANT_BYTES_PER_WEIGHT.