[None][perf] kv_cache_manager_v2: batch block-key SHA-256 hashing #14994
lancelly wants to merge 1 commit into
Hasher.update hashed each token of a block with its own
int.to_bytes(8, "little") + sha256.update() call. For long warm prefix matches
this is the dominant cost of BlockRadixTree.match, which the attention-DP
KV-cache-aware router (KVCacheAwareADPRouter) runs as a per-request probe on
every DP rank before routing, and which create_kv_cache repeats for the actual
reuse lookup.
Pack the whole token block into bytes once (array("Q", block).tobytes()) and
do a single sha256.update(). All NVIDIA GPU host platforms (x86_64, aarch64/
Grace) are little-endian, so this is byte-identical to the per-token
to_bytes(8, "little") loop -- block reuse / cross-run cache-hit behavior is
unchanged. Multimodal blocks (which contain bytes items) fall back to the
per-token loop via except (TypeError, OverflowError).
Speeds up the probe and the create-time reuse lookup equally. On a GB300 Grace
node the real BlockRadixTree.match warm-prefix cost at ISL ~38k drops by
2.85-3.05x at tokens_per_block=128/256 (DeepseekV4CacheManager). Adds
TestBlockKeyHashing to lock in the bit-identical contract, including
multimodal blocks.
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
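
For readers who want to see the idea in isolation, here is a minimal sketch of the batched path and its fallback as the commit message describes them. The function names `hash_block_per_token` and `hash_block_batched`, the `parent` argument, and the way `bytes` items are fed to the hash are assumptions made for illustration; they are not taken from `_block_radix_tree.py`. The explicit byte-swap guard is also an addition of this sketch (the commit message relies on all supported hosts being little-endian).

```python
import hashlib
import sys
from array import array


def hash_block_per_token(block, parent=b""):
    """Reference path: one to_bytes() + update() call per item."""
    h = hashlib.sha256(parent)
    for item in block:
        if isinstance(item, bytes):
            # Assumption: non-integer (e.g. multimodal) items are hashed as raw bytes.
            h.update(item)
        else:
            h.update(item.to_bytes(8, "little"))
    return h.digest()


def hash_block_batched(block, parent=b""):
    """Fast path: pack the block once, call update() once."""
    try:
        # array("Q") stores unsigned 64-bit values in native byte order.
        # A bytes item raises TypeError; an out-of-range int raises OverflowError.
        packed = array("Q", block)
    except (TypeError, OverflowError):
        return hash_block_per_token(block, parent)
    if sys.byteorder != "little":
        # Not in the PR: keeps the sketch byte-identical on big-endian hosts.
        packed.byteswap()
    h = hashlib.sha256(parent)
    h.update(packed.tobytes())
    return h.digest()
```

On a little-endian host, both functions should return the same digest for any block of non-negative 64-bit token IDs, which is the bit-identical contract the commit message refers to.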
Description
`Hasher.update` in `kv_cache_manager_v2/_block_radix_tree.py` hashed each token of a block one at a time: a Python-level `int.to_bytes(8, "little")` + `sha256.update()` call per token. For a long warm prefix this chained per-token hashing is the dominant cost of `BlockRadixTree.match`, which is invoked:

- by the attention-DP KV-cache-aware router (`KVCacheAwareADPRouter.gather_prefix_matches` → `probe_prefix_match_length` → `probe_reuse`) as a per-request probe on every DP rank before routing, gating the `tp_allgather`, and
- by `create_kv_cache` for the actual reuse lookup (same `_match_reuse`).

This is especially hot for DeepSeek-V4 (`DeepseekV4CacheManager(KVCacheManagerV2)`, `tokens_per_block ∈ {128, 256}`) on long-context agentic workloads (mean ISL ~38k).
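
To get a rough feel for the per-token overhead at the sizes mentioned above, a micro-benchmark along the following lines can be run locally. It is only a sketch: `chain_per_token` and `chain_batched` are hypothetical helpers, and the chaining scheme (each block key is SHA-256 over the parent digest followed by the block's tokens) is an assumption about how block keys depend on their prefix, not a copy of `Hasher`. It measures hashing only, not the full `BlockRadixTree.match`, so its ratios are not comparable to the GB300 numbers in the commit message.

```python
import hashlib
import sys
import timeit
from array import array


def _pack(block):
    """Pack token IDs as 8-byte little-endian words in one call."""
    arr = array("Q", block)
    if sys.byteorder != "little":
        arr.byteswap()  # keep the layout little-endian on any host
    return arr.tobytes()


def chain_per_token(tokens, tokens_per_block):
    """Chained block keys, hashing one token per update() call."""
    keys, parent = [], b""
    for start in range(0, len(tokens) - tokens_per_block + 1, tokens_per_block):
        h = hashlib.sha256(parent)
        for tok in tokens[start:start + tokens_per_block]:
            h.update(tok.to_bytes(8, "little"))
        parent = h.digest()
        keys.append(parent)
    return keys


def chain_batched(tokens, tokens_per_block):
    """Chained block keys, hashing each block with a single update() call."""
    keys, parent = [], b""
    for start in range(0, len(tokens) - tokens_per_block + 1, tokens_per_block):
        h = hashlib.sha256(parent)
        h.update(_pack(tokens[start:start + tokens_per_block]))
        parent = h.digest()
        keys.append(parent)
    return keys


if __name__ == "__main__":
    tokens = list(range(38_000))  # stand-in for a ~38k-token prompt
    for tpb in (128, 256):
        # Both paths must produce identical keys before timing means anything.
        assert chain_per_token(tokens, tpb) == chain_batched(tokens, tpb)
        slow = timeit.timeit(lambda: chain_per_token(tokens, tpb), number=20)
        fast = timeit.timeit(lambda: chain_batched(tokens, tpb), number=20)
        print(f"tokens_per_block={tpb}: per-token {slow:.3f}s, batched {fast:.3f}s")
```

Only full blocks are keyed here; a trailing partial block is ignored, which is a simplification of this sketch.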