feat(vram): model sliding-window attention in KV cache estimation #124

Open: SuperMarioYL wants to merge 1 commit into Andyyyy64:main from SuperMarioYL:feature/swa-kv-cache-estimation

@SuperMarioYL

Contributor

Summary

estimate_kv_cache() scales the KV cache linearly with the full requested context for every model. That over-counts VRAM for sliding-window-attention (SWA) models, whose local-attention layers cache only the most recent `window` tokens. At long context the inflated KV term is large enough to push engine/ranker.py (which, since #73, demotes models that cannot hold the requested context) into demoting models that actually fit.

Measured on v0.5.12:

| Model | Context | KV (before) | KV (after) |
|---|---|---|---|
| Gemma-3-27B (1024 window, 1/6 global) | 128K | 12.68 GB | 2.20 GB |
| Mistral-7B-v0.1 (window ignored by runtimes) | 128K | 3.29 GB | 3.29 GB (unchanged) |

This PR addresses the sliding-window-attention item of #25.

Approach

The estimate is reduced only for architectures whose mainline runtimes actually honor interleaved SWA — llama.cpp's ISWA path / MLX — namely Gemma-2, Gemma-3, gpt-oss, and Cohere2. Everything else keeps the full-context KV figure, so the change can only ever lower an estimate where the saving is real and stays conservative otherwise.

  • ModelInfo gains sliding_window and sliding_window_global_ratio (fraction of layers using full/global attention).
  • fetcher resolves them from authoritative sources first — HF config sliding_window / sliding_window_pattern, then GGUF metadata architecture — and only then a narrow, boundary-matched, conflict-guarded model-id fallback for config-less GGUF repos. A merge/finetune whose id merely contains gemma-3 (and names another base like llama) is not given a window.
  • estimate_kv_cache blends the two layer types into an effective context:
    global_ratio · ctx + (1 − global_ratio) · min(ctx, window).
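
The blend above can be checked against the Gemma-3-27B row with a few lines of Python. This is a minimal sketch, not the PR's actual code: the function name `effective_context` and its signature are hypothetical, and 128K is assumed to mean 131072 tokens.

```python
from typing import Optional


def effective_context(ctx: int, window: Optional[int], global_ratio: float) -> float:
    """Blend global and windowed layers into one effective context length.

    Hypothetical helper: global layers see the full ctx, local layers see
    at most `window` tokens. window=None means a dense model.
    """
    if window is None:
        return float(ctx)
    return global_ratio * ctx + (1.0 - global_ratio) * min(ctx, window)


# Gemma-3-27B figures from the table: 1024-token window, 1/6 of layers global.
ctx = 128 * 1024  # assumption: "128K" means 131072 tokens
eff = effective_context(ctx, window=1024, global_ratio=1 / 6)
ratio = eff / ctx

print(f"effective context: {eff:.0f} tokens ({ratio:.1%} of dense)")
print(f"12.68 GB dense KV -> {12.68 * ratio:.2f} GB")
```

Because the KV term is linear in context, scaling the dense 12.68 GB by the same ratio gives roughly 2.20 GB, which agrees with the measured "after" column.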

Why Mistral is deliberately excluded

Mistral-7B-v0.1's config advertises sliding_window: 4096, but mainline llama.cpp / MLX do not apply SWA for it (and later Mistral releases set sliding_window: null). Honoring the declared window would under-count VRAM and recommend a model that won't fit — the one direction #25 asks us to avoid. It therefore keeps the dense estimate.
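
The gating idea can be sketched as an allowlist check that runs before any declared window is trusted. Everything here is illustrative: `SWA_HONORED`, `resolve_swa`, the `model_type` strings, and the reading of `sliding_window_pattern` as "one global layer in every N" are assumptions, not the fetcher's real code.

```python
from typing import Optional, Tuple

# Hypothetical allowlist keyed by the HF config's `model_type`.
# The exact strings are assumptions made for this sketch.
SWA_HONORED = {"gemma2", "gemma3", "gemma3_text", "gpt_oss", "cohere2"}

DENSE: Tuple[Optional[int], float] = (None, 1.0)


def resolve_swa(config: dict) -> Tuple[Optional[int], float]:
    """Return (sliding_window, global_ratio); (None, 1.0) means dense."""
    # A declared window is ignored unless the architecture is allowlisted.
    if config.get("model_type") not in SWA_HONORED:
        return DENSE
    # Explicit opt-out in the config wins.
    if config.get("use_sliding_window") is False:
        return DENSE
    window = config.get("sliding_window")
    pattern = config.get("sliding_window_pattern")
    # Assumption: without both fields, stay dense (the conservative choice).
    if not window or not pattern:
        return DENSE
    # Assumption: pattern N means one global layer in every N layers.
    return int(window), 1.0 / pattern


print(resolve_swa({"model_type": "mistral", "sliding_window": 4096}))
print(resolve_swa({"model_type": "gemma3", "sliding_window": 1024,
                   "sliding_window_pattern": 6}))
```

In this sketch the Mistral config falls through to the dense default even though it declares a window, while the Gemma-3 config resolves to a 1024-token window with a 1/6 global ratio.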

Conservatism guarantee

sliding_window = None reproduces the previous formula byte-for-byte, and the reduction is monotonic (effective_ctx ≤ ctx always). A test pins literal KV byte values so a coefficient/formula drift is caught.
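
A test in this spirit might look like the sketch below. It uses the textbook KV size (two tensors per layer times KV heads times head dimension times element size times tokens), which is an assumption and may differ from the coefficients in estimate_kv_cache(). The name `kv_bytes` and the model shape are hypothetical and purely illustrative.

```python
def kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2,
             window=None, global_ratio=1.0):
    """Textbook KV-cache size in bytes, with the SWA blend applied.

    Hypothetical stand-in for the project's estimator.
    """
    if window is None:
        eff = ctx  # dense path: untouched integer arithmetic
    else:
        eff = global_ratio * ctx + (1.0 - global_ratio) * min(ctx, window)
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return round(per_token * eff)


# Illustrative shape only; not taken from the PR.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

# Pin a literal byte value so a coefficient drift fails loudly.
assert kv_bytes(4096, **SHAPE) == 536_870_912

for ctx in (512, 1024, 8192, 131072):
    dense = kv_bytes(ctx, **SHAPE)
    swa = kv_bytes(ctx, window=1024, global_ratio=1 / 6, **SHAPE)
    assert swa <= dense          # never exceeds the dense estimate
    if ctx <= 1024:
        assert swa == dense      # below the window nothing is saved

# Pure SWA (no global layers) plateaus once ctx passes the window.
plateau = kv_bytes(8192, window=1024, global_ratio=0.0, **SHAPE)
assert plateau == kv_bytes(131072, window=1024, global_ratio=0.0, **SHAPE)
print("all invariants hold")
```

Keeping the `window is None` branch free of the blend arithmetic is what makes the dense result identical to the old formula, rather than merely close to it.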

Tests

uv run pytest: 412 passed (+16). New coverage: dense (pinned bytes), pure-SWA plateau, hybrid partial growth, never-exceeds-dense, below-window equality, per-arch config resolution (Gemma-3 / gpt-oss), Mistral-not-honored, use_sliding_window: False opt-out, Gemma-1 negative, GGUF-metadata + id-hint resolution, merge/boundary false-positive guards, and cache round-trip. ruff check and ruff format --check are both clean.

Scope

Focused on the SWA half of #25. KV-cache quantization and backend-specific speed paths (the other #25 items) are intentionally left for follow-ups.

estimate_kv_cache scaled the KV cache linearly with the full requested
context for every model, so it over-counted VRAM for sliding-window-attention
(SWA) models whose local-attention layers only cache the last `window` tokens.
At long context this inflated the estimate enough to make the ranker demote
models that actually fit (e.g. Gemma-3-27B at 128K: 12.7 GB KV estimated vs
~2.2 GB real).

Add architecture-gated SWA modeling:

- ModelInfo gains sliding_window and sliding_window_global_ratio.
- The fetcher populates them only for architectures whose mainline runtimes
  actually honor interleaved SWA (Gemma-2/3, gpt-oss, Cohere2), read from HF
  config sliding_window/sliding_window_pattern, then GGUF metadata architecture,
  then a boundary-matched, conflict-guarded model-id fallback for config-less
  GGUF repos.
- estimate_kv_cache blends global and windowed layers into an effective context
  length: global_ratio*ctx + (1-global_ratio)*min(ctx, window).

Models outside the allowlist keep the full-context KV figure — including
Mistral-7B-v0.1, whose config advertises a 4096 window that mainline runtimes
ignore. The reduction can therefore only ever lower an estimate where the
saving is real, and stays conservative everywhere else. sliding_window=None
reproduces the previous formula exactly.

Addresses the sliding-window-attention item in Andyyyy64#25. Tests cover dense,
pure-SWA, hybrid, the conservative no-window default, GGUF-metadata and
id-hint resolution, merge/boundary false-positive guards, and cache round-trip.