fix: size FP16 and ternary (TQ1_0/TQ2_0) GGUF quant types correctly#125
Open
SuperMarioYL wants to merge 1 commit into
Open
fix: size FP16 and ternary (TQ1_0/TQ2_0) GGUF quant types correctly#125SuperMarioYL wants to merge 1 commit into
SuperMarioYL wants to merge 1 commit into
Conversation
_extract_quant_type emitted tokens the byte/penalty tables don't key on, so two quant families were mishandled at fetch time: - "*-FP16.gguf" extracted to "FP16", but QUANT_BYTES_PER_WEIGHT keys full precision as "F16". _estimate_gguf_size then missed the table and fell back to the Q4_K_M 0.5625 bytes/weight default, sizing a 7B FP16 GGUF at ~3.94 GB instead of 14 GB. Since the HF siblings listing carries no per-file size, this estimate is the primary GGUF sizing path. - "*-TQ1_0.gguf"/"*-TQ2_0.gguf" had no matching pattern and extracted to "unknown", so BitNet-class ternary GGUFs were dropped entirely even though the tables already price TQ1_0/TQ2_0. Add a TQ pattern, canonicalize FP16->F16 / FP32->F32 to the table keys, and add regression tests including a drift guard that asserts every extracted quant resolves in QUANT_BYTES_PER_WEIGHT.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
_extract_quant_type()(the GGUF-filename → quant-type parser inmodels/fetcher.py) emitted tokens that the canonicalQUANT_BYTES_PER_WEIGHT/QUANT_QUALITY_PENALTYtables don't key on, so two quant families were mishandled at fetch time.1. FP16 GGUFs were sized ~3.5× too small
A
*-FP16.ggufsibling extracts to the literal"FP16", butdata/quantization.pykeys full precision asF16(there is noFP16key)._estimate_gguf_size()does:so the lookup misses and falls back to the Q4_K_M 0.5625 bytes/weight default. A 7B FP16 GGUF is then estimated at ~3.94 GB instead of 14 GB. Because the HF
siblingslisting carries no per-filesize,_estimate_gguf_size()is the primary GGUF sizing path, so this affects essentially every FP16 GGUF — directly producing optimistic "fits" results, the opposite of what a hardware-fit advisor should do.2. Ternary GGUFs were dropped entirely
*-TQ1_0.gguf/*-TQ2_0.ggufhad no matching pattern in the extractor, so they returned"unknown"and were skipped by theif quant == "unknown": continuefilter — even though the tables already fully priceTQ1_0andTQ2_0(and list them inQUANT_PREFERENCE_ORDER). BitNet-class ternary builds therefore never appeared in results.Fix
Additive, no table or heuristic changes:
TQ\d+_\d+pattern to the extractor.FP16 → F16/FP32 → F32to the keys the tables use (and addFP32to the precision alternation so the alias is reachable).test_extract_quant_type_keys_resolve_in_byte_table) asserting every quant the extractor surfaces from a real GGUF filename resolves inQUANT_BYTES_PER_WEIGHT, so the extractor and tables can't silently diverge again.The quality-penalty path is unchanged:
F16already resolves to0.0inQUANT_QUALITY_PENALTY, so only the size estimate is corrected.Testing
The four new tests are red on
mainand green here. No GPU required.Notes
models/fetcher.py+ tests; the separate non-GGUF (engine/quantization.py)FP16-keyed vocabulary used for repo-name inference is intentionally left untouched.*-FP16.gguf/*-TQ1_0.gguf?" — yes; these are common llama.cpp naming conventions (full-precision conversions and BitNet ternary builds).