feat(vram): model sliding-window attention in KV cache estimation#124
Open
SuperMarioYL wants to merge 1 commit into
estimate_kv_cache scaled the KV cache linearly with the full requested context for every model, so it over-counted VRAM for sliding-window-attention (SWA) models whose local-attention layers only cache the last `window` tokens. At long context this inflated the estimate enough to make the ranker demote models that actually fit (e.g. Gemma-3-27B at 128K: 12.7 GB KV estimated vs ~2.2 GB real).

Add architecture-gated SWA modeling:

- ModelInfo gains sliding_window and sliding_window_global_ratio.
- The fetcher populates them only for architectures whose mainline runtimes actually honor interleaved SWA (Gemma-2/3, gpt-oss, Cohere2), read from HF config sliding_window/sliding_window_pattern, then GGUF metadata architecture, then a boundary-matched, conflict-guarded model-id fallback for config-less GGUF repos.
- estimate_kv_cache blends global and windowed layers into an effective context length: global_ratio*ctx + (1-global_ratio)*min(ctx, window).

Models outside the allowlist keep the full-context KV figure, including Mistral-7B-v0.1, whose config advertises a 4096 window that mainline runtimes ignore. The reduction can therefore only ever lower an estimate where the saving is real, and stays conservative everywhere else. sliding_window=None reproduces the previous formula exactly.

Addresses the sliding-window-attention item in Andyyyy64#25.

Tests cover dense, pure-SWA, hybrid, the conservative no-window default, GGUF-metadata and id-hint resolution, merge/boundary false-positive guards, and cache round-trip.
Summary
`estimate_kv_cache()` scales the KV cache linearly with the full requested context for every model. That over-counts VRAM for sliding-window-attention (SWA) models, whose local-attention layers only ever cache the last `window` tokens. At long context the inflated KV term is large enough to push `engine/ranker.py` (which since #73 demotes models that cannot hold the requested context) into demoting models that actually fit.

Measured on `v0.5.12`:

This PR addresses the sliding-window-attention item of #25.
Approach
The estimate is reduced only for architectures whose mainline runtimes actually honor interleaved SWA (`llama.cpp`'s ISWA path / MLX), namely Gemma-2, Gemma-3, gpt-oss, and Cohere2. Everything else keeps the full-context KV figure, so the change can only ever lower an estimate where the saving is real and stays conservative otherwise.

- `ModelInfo` gains `sliding_window` and `sliding_window_global_ratio` (fraction of layers using full/global attention).
- `fetcher` resolves them from authoritative sources first (HF config `sliding_window` / `sliding_window_pattern`, then GGUF metadata architecture) and only then falls back to a narrow, boundary-matched, conflict-guarded model-id match for config-less GGUF repos. A merge/finetune whose id merely contains `gemma-3` (and names another base like `llama`) is not given a window.
- `estimate_kv_cache` blends the two layer types into an effective context: `global_ratio · ctx + (1 − global_ratio) · min(ctx, window)`.

Why Mistral is deliberately excluded
Mistral-7B-v0.1's config advertises `sliding_window: 4096`, but mainline `llama.cpp` / MLX do not apply SWA for it (and later Mistral releases set `sliding_window: null`). Honoring the declared window would under-count VRAM and recommend a model that won't fit, the one direction #25 asks us to avoid. It therefore keeps the dense estimate.

Conservatism guarantee
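To make the guarantee concrete, here is a minimal sketch of the blend described under Approach. It is an illustration rather than the code in this PR: `effective_context` is a hypothetical helper name, and the example values (a 1024-token window with one global layer in six) are assumptions chosen only for the arithmetic.

```python
from typing import Optional


def effective_context(
    ctx: int,
    sliding_window: Optional[int],
    global_ratio: float,
) -> float:
    """Blend global and windowed layers into one effective context length.

    Hypothetical helper for illustration; the PR's real function may differ.
    """
    if sliding_window is None:
        # No window known: keep the dense, full-context figure.
        return float(ctx)
    # Local-attention layers cache at most `sliding_window` tokens.
    windowed = min(ctx, sliding_window)
    return global_ratio * ctx + (1.0 - global_ratio) * windowed


if __name__ == "__main__":
    ctx = 128 * 1024
    dense = effective_context(ctx, None, 1 / 6)
    hybrid = effective_context(ctx, 1024, 1 / 6)  # assumed: 1 global layer in 6
    pure = effective_context(ctx, 1024, 0.0)      # assumed: every layer windowed
    print(dense, hybrid, pure)
    print(f"hybrid / dense = {hybrid / dense:.3f}")
```

Under those assumed values the hybrid case comes out at roughly 17% of the dense context, which is in line with the 12.7 GB vs ~2.2 GB proportion quoted in the commit message. Below the window (`ctx ≤ window`) both terms use `ctx`, so the blend collapses to the dense figure.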
`sliding_window = None` reproduces the previous formula byte-for-byte, and the reduction is monotonic (`effective_ctx ≤ ctx` always). A test pins literal KV byte values so a coefficient/formula drift is caught.

Tests
`uv run pytest` → 412 passed (+16). New coverage: dense (pinned bytes), pure-SWA plateau, hybrid partial growth, never-exceeds-dense, below-window equality, per-arch config resolution (Gemma-3 / gpt-oss), Mistral-not-honored, `use_sliding_window: False` opt-out, Gemma-1 negative, GGUF-metadata + id-hint resolution, merge/boundary false-positive guards, and cache round-trip. `ruff check` / `ruff format --check` clean.

Scope
Focused on the SWA half of #25. KV-cache quantization and backend-specific speed paths (the other #25 items) are intentionally left for follow-ups.
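For reviewers wondering what "boundary-matched, conflict-guarded" means for the model-id fallback under Approach, the idea can be sketched roughly as below. Everything here is hypothetical and not taken from the PR: the function name `swa_arch_from_id`, the hint patterns, the architecture labels, and the list of conflicting base names are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical hint table: architecture label -> boundary-matched id pattern.
# The lookarounds keep "gemma-2" from matching inside "gemma-2b" (a Gemma-1
# model size) or "gemma-27b".
_ID_HINTS = {
    "gemma2": re.compile(r"(?<![a-z0-9])gemma-?2(?![a-z0-9])"),
    "gemma3": re.compile(r"(?<![a-z0-9])gemma-?3(?![a-z0-9])"),
    "gpt-oss": re.compile(r"(?<![a-z0-9])gpt-oss(?![a-z0-9])"),
}

# Hypothetical conflict list: other base families that suggest a merge/finetune.
_OTHER_BASES = ("llama", "mistral", "qwen")


def swa_arch_from_id(model_id: str) -> Optional[str]:
    """Return an SWA architecture label only when the id is unambiguous."""
    name = model_id.lower()
    # Conflict guard: an id that names another base family gets no window.
    if any(base in name for base in _OTHER_BASES):
        return None
    matches = [arch for arch, pattern in _ID_HINTS.items() if pattern.search(name)]
    # Exactly one hint must match; zero or several means "unknown".
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for repo in (
        "unsloth/gemma-3-27b-it-GGUF",
        "google/gemma-2b",
        "someone/Llama-3-gemma-3-merge",
    ):
        print(repo, "->", swa_arch_from_id(repo))
```

Returning `None` on any ambiguity keeps the fallback on the conservative side: an unresolved model simply keeps the full-context KV estimate.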