feat(vram): model sliding-window attention in KV cache estimation by SuperMarioYL · Pull Request #124 · Andyyyy64/whichllm

SuperMarioYL · 2026-06-19T14:47:23Z

Summary

estimate_kv_cache() scales the KV cache linearly with the full requested context for every model. That over-counts VRAM for sliding-window-attention (SWA) models, whose local-attention layers only ever cache the last `window` tokens. At long context the inflated KV term is large enough to push engine/ranker.py (which since #73 demotes models that cannot hold the requested context) into demoting models that actually fit.

Measured on v0.5.12:

| Model | Context | KV (before) | KV (after) |
|---|---|---|---|
| Gemma-3-27B (1024 window, 1/6 global) | 128K | 12.68 GB | 2.20 GB |
| Mistral-7B-v0.1 (window ignored by runtimes) | 128K | 3.29 GB | 3.29 GB (unchanged) |

This PR addresses the sliding-window-attention item of #25.

Approach

The estimate is reduced only for architectures whose mainline runtimes actually honor interleaved SWA — llama.cpp's ISWA path / MLX — namely Gemma-2, Gemma-3, gpt-oss, and Cohere2. Everything else keeps the full-context KV figure, so the change can only ever lower an estimate where the saving is real and stays conservative otherwise.

- ModelInfo gains sliding_window and sliding_window_global_ratio (fraction of layers using full/global attention).
- fetcher resolves them from authoritative sources first (HF config sliding_window / sliding_window_pattern, then GGUF metadata architecture) and only then falls back to a narrow, boundary-matched, conflict-guarded model-id match for config-less GGUF repos. A merge/finetune whose id merely contains gemma-3 (and names another base like llama) is not given a window.
- estimate_kv_cache blends the two layer types into an effective context: `global_ratio · ctx + (1 − global_ratio) · min(ctx, window)`.
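As a minimal sketch of the blend above (not the PR's actual code): the helper name `effective_context` and its signature are assumptions for illustration. Because the estimate is linear in context, the ratio `effective_ctx / ctx` is also the factor by which the KV figure shrinks, which lets the Gemma-3-27B row of the table be checked by hand.

```python
from typing import Optional


def effective_context(ctx: int, window: Optional[int], global_ratio: float) -> float:
    """Blend global and windowed layers into one effective context length.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if window is None:
        # No window recorded: behave exactly like the dense estimate.
        return float(ctx)
    # Global layers see the full context; local layers cache at most `window` tokens.
    return global_ratio * ctx + (1.0 - global_ratio) * min(ctx, window)


ctx = 128 * 1024  # 128K tokens
eff = effective_context(ctx, window=1024, global_ratio=1 / 6)
scale = eff / ctx  # fraction of the dense KV estimate that remains

print(round(eff))                # effective tokens cached per layer on average
print(round(12.68 * scale, 2))   # dense 12.68 GB scaled down: about 2.20 GB, as in the table
```

Below the window (`ctx <= window`) both terms collapse to `ctx`, so the result equals the dense figure regardless of `global_ratio`.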

Why Mistral is deliberately excluded

Mistral-7B-v0.1's config advertises sliding_window: 4096, but mainline llama.cpp / MLX do not apply SWA for it (and later Mistral releases set sliding_window: null). Honoring the declared window would under-count VRAM and recommend a model that won't fit — the one direction #25 asks us to avoid. It therefore keeps the dense estimate.
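The gating described here could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name `resolve_swa`, the flattened config dict, the `model_type` spellings in the allowlist, and the reading of `sliding_window_pattern = N` as "one global layer in every N". Returning `None` (the dense estimate) when no pattern is declared is this sketch's choice, not necessarily the PR's.

```python
from typing import Optional, Tuple

# Assumed model_type spellings for the architectures named in this PR.
SWA_HONORED = frozenset({"gemma2", "gemma3", "gpt_oss", "cohere2"})


def resolve_swa(config: dict) -> Optional[Tuple[int, float]]:
    """Return (window, global_ratio), or None to keep the dense estimate.

    Hypothetical helper operating on a flattened, HF-style config dict.
    """
    if config.get("model_type") not in SWA_HONORED:
        # Not allowlisted: a declared window is ignored (e.g. Mistral-7B-v0.1).
        return None
    if config.get("use_sliding_window") is False:
        # Explicit opt-out in the config.
        return None
    window = config.get("sliding_window")
    pattern = config.get("sliding_window_pattern")
    if not isinstance(window, int) or window <= 0:
        return None
    if not isinstance(pattern, int) or pattern <= 0:
        # No usable pattern: stay conservative and keep the dense figure.
        return None
    return window, 1.0 / pattern


mistral = {"model_type": "mistral", "sliding_window": 4096}
gemma3 = {"model_type": "gemma3", "sliding_window": 1024, "sliding_window_pattern": 6}

print(resolve_swa(mistral))  # None: window declared but not honored
print(resolve_swa(gemma3))   # window 1024 with a 1/6 global ratio
```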

Conservatism guarantee

sliding_window = None reproduces the previous formula byte-for-byte, and the reduction is monotonic (effective_ctx ≤ ctx always). A test pins literal KV byte values so a coefficient/formula drift is caught.
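The properties above can be illustrated with a small self-check. This is a sketch under stated assumptions, not the PR's test suite: `kv_bytes`, `check_conservatism`, and the `BYTES_PER_TOKEN` constant are hypothetical stand-ins for the real per-model coefficient.

```python
from typing import Iterable, Optional

BYTES_PER_TOKEN = 100_000  # hypothetical per-token KV cost, for illustration only


def kv_bytes(ctx: int, window: Optional[int] = None, global_ratio: float = 1.0) -> int:
    """Toy KV estimate: dense when window is None, blended otherwise."""
    if window is None:
        return ctx * BYTES_PER_TOKEN
    eff = global_ratio * ctx + (1.0 - global_ratio) * min(ctx, window)
    return round(eff * BYTES_PER_TOKEN)


def check_conservatism(window: int, global_ratio: float, contexts: Iterable[int]) -> bool:
    """Assert never-exceeds-dense, below-window equality, and non-decreasing growth."""
    previous = 0
    for ctx in sorted(contexts):
        dense = kv_bytes(ctx)
        swa = kv_bytes(ctx, window, global_ratio)
        assert swa <= dense, "SWA estimate must never exceed the dense one"
        if ctx <= window:
            assert swa == dense, "below the window the two estimates coincide"
        assert swa >= previous, "estimate must not shrink as context grows"
        previous = swa
    return True


contexts = [512, 1024, 4096, 32768, 131072]
for ratio in (0.0, 1 / 6, 0.5, 1.0):
    check_conservatism(window=1024, global_ratio=ratio, contexts=contexts)

# Pure SWA (no global layers) plateaus once the context exceeds the window.
print(kv_bytes(4096, 1024, 0.0) == kv_bytes(131072, 1024, 0.0))
```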

Tests

uv run pytest → 412 passed (+16). New coverage: dense (pinned bytes), pure-SWA plateau, hybrid partial growth, never-exceeds-dense, below-window equality, per-arch config resolution (Gemma-3 / gpt-oss), Mistral-not-honored, use_sliding_window: False opt-out, Gemma-1 negative, GGUF-metadata + id-hint resolution, merge/boundary false-positive guards, and cache round-trip. ruff check / ruff format --check clean.
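To make the merge/boundary false-positive guards concrete, here is one way a boundary-matched, conflict-guarded id fallback might be written. It is a hypothetical sketch: the function name, the regexes, the list of conflicting base names, and the example repo ids are all invented for illustration and only cover the Gemma families.

```python
import re
from typing import Optional

# A family token must not be glued to surrounding letters or digits, so
# "gemma-2b" (Gemma-1, 2B parameters) and "gemma-30b" do not match.
_FAMILY_PATTERNS = {
    "gemma2": re.compile(r"(?<![a-z0-9])gemma-?2(?![a-z0-9])"),
    "gemma3": re.compile(r"(?<![a-z0-9])gemma-?3(?![a-z0-9])"),
}
# Illustrative list of other base names that signal a merge or cross-family finetune.
_CONFLICTING_BASES = re.compile(r"(?<![a-z0-9])(llama|mistral|qwen)")


def family_from_model_id(model_id: str) -> Optional[str]:
    """Guess an SWA family from a repo id, or return None when unsure."""
    lowered = model_id.lower()
    if _CONFLICTING_BASES.search(lowered):
        # Another base is named: do not assign a window.
        return None
    hits = [name for name, pat in _FAMILY_PATTERNS.items() if pat.search(lowered)]
    # Exactly one family must match; ambiguity falls back to the dense estimate.
    return hits[0] if len(hits) == 1 else None


for repo in (
    "example/gemma-3-27b-it-GGUF",
    "example/llama-3-gemma-3-merge-GGUF",
    "example/gemma-2b-GGUF",
):
    print(repo, "->", family_from_model_id(repo))
```

Returning `None` on any doubt keeps the fallback in line with the PR's rule that the estimate is only lowered where the saving is real.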

Scope

Focused on the SWA half of #25. KV-cache quantization and backend-specific speed paths (the other #25 items) are intentionally left for follow-ups.

estimate_kv_cache scaled the KV cache linearly with the full requested context for every model, so it over-counted VRAM for sliding-window-attention (SWA) models whose local-attention layers only cache the last `window` tokens. At long context this inflated the estimate enough to make the ranker demote models that actually fit (e.g. Gemma-3-27B at 128K: 12.7 GB KV estimated vs ~2.2 GB real).

Add architecture-gated SWA modeling:

- ModelInfo gains sliding_window and sliding_window_global_ratio.
- The fetcher populates them only for architectures whose mainline runtimes actually honor interleaved SWA (Gemma-2/3, gpt-oss, Cohere2), read from HF config sliding_window/sliding_window_pattern, then GGUF metadata architecture, then a boundary-matched, conflict-guarded model-id fallback for config-less GGUF repos.
- estimate_kv_cache blends global and windowed layers into an effective context length: global_ratio*ctx + (1-global_ratio)*min(ctx, window).

Models outside the allowlist keep the full-context KV figure, including Mistral-7B-v0.1, whose config advertises a 4096 window that mainline runtimes ignore. The reduction can therefore only ever lower an estimate where the saving is real, and stays conservative everywhere else. sliding_window=None reproduces the previous formula exactly.

Addresses the sliding-window-attention item in Andyyyy64#25.

Tests cover dense, pure-SWA, hybrid, the conservative no-window default, GGUF-metadata and id-hint resolution, merge/boundary false-positive guards, and cache round-trip.

SuperMarioYL wants to merge 1 commit into Andyyyy64:main from SuperMarioYL:feature/swa-kv-cache-estimation
