
[None][perf] kv_cache_manager_v2: batch block-key SHA-256 hashing #14994

Open
lancelly wants to merge 1 commit into NVIDIA:main from lancelly:perf/kvcache-v2-batched-blockkey-hashing-main

@lancelly
Collaborator

@lancelly lancelly commented Jun 5, 2026

Description

Hasher.update in kv_cache_manager_v2/_block_radix_tree.py hashed each token of a block one at a time: one Python-level int.to_bytes(8, "little") call plus one sha256.update() call per token. For a long warm prefix, this chained per-token hashing is the dominant cost of BlockRadixTree.match, which is invoked:

  • by the attention-DP KV-cache-aware router (KVCacheAwareADPRouter.gather_prefix_matches → probe_prefix_match_length → probe_reuse) as a per-request probe on every DP rank before routing, gating the tp_allgather, and
  • again by create_kv_cache for the actual reuse lookup (same _match_reuse).

This is especially hot for DeepSeek-V4 (DeepseekV4CacheManager(KVCacheManagerV2), tokens_per_block ∈ {128, 256}) on long-context agentic workloads (mean ISL ~38k).
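
For reference, the per-token path described above can be sketched roughly as follows. This is an illustration only, not the actual Hasher code: the function name hash_block_per_token is hypothetical, and it hashes a single block in isolation rather than reproducing how Hasher chains state across blocks.

```python
import hashlib


def hash_block_per_token(token_ids):
    """Hash one block of integer token IDs, one token at a time.

    Hypothetical stand-in for the per-token loop the description refers to.
    """
    h = hashlib.sha256()
    for t in token_ids:
        # One Python-level to_bytes() call and one update() call per token.
        h.update(t.to_bytes(8, "little"))
    return h.digest()
```

For a 128-token block this issues 128 separate to_bytes() and update() calls, which is the Python-level overhead the PR targets.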

Summary by CodeRabbit

  • Tests

    • Added comprehensive unit tests to verify KV cache block key hashing produces deterministic and correct SHA-256 values for both integer and multimodal (mixed-type) token blocks.
  • Refactor

    • Optimized block hashing performance through efficient batch processing of integer sequences, with graceful fallback for mixed-type blocks.

@lancelly force-pushed the perf/kvcache-v2-batched-blockkey-hashing-main branch from dd25e51 to 2ddd0f8 June 5, 2026 06:37
@lancelly lancelly marked this pull request as ready for review June 5, 2026 06:38
@lancelly
Collaborator Author

lancelly commented Jun 5, 2026

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f5e5e4f-dd3d-4a51-ae48-e2ea1c4637c6

📥 Commits

Reviewing files that changed from the base of the PR and between 21ffdc7 and 2ddd0f8.

📒 Files selected for processing (2)
  • tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py
  • tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py

Walkthrough

This PR optimizes the Hasher.update() method in the KV cache radix tree to bulk-hash integer sequences using Python's array type, with a fallback to per-item hashing for mixed or non-integer blocks. Tests verify the optimization produces correct SHA-256 hashes.

Changes

Hasher Bulk Hashing Optimization

Layer / File(s) | Summary
Hasher bulk hashing implementation: tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py | Imports array and modifies Hasher.update() to attempt packing int items into array("Q") and hashing the bytes in a single call; on type or overflow error, falls back to the prior per-item hashing logic for mixed/byte-containing blocks.
TestBlockKeyHashing verification: tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py | Imports hashlib and updates conditional _block_radix_tree imports to include Hasher, then adds TestBlockKeyHashing with a reference SHA-256 implementation and test assertions for integer-only and multimodal (bytes + ints) blocks.
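
A minimal sketch of the batch-then-fallback pattern summarized in the walkthrough, assuming token IDs fit in an unsigned 64-bit integer and that bytes items in a multimodal block are fed to the hasher unchanged (the source does not show how the original fallback treats them). The helper name update_block is hypothetical, and the byte swap on big-endian hosts is an addition for portability that is not stated to be part of the PR.

```python
import hashlib
import sys
from array import array


def update_block(hasher, block):
    """Feed one block into `hasher`, batching when every item is a uint64 int."""
    try:
        # Raises TypeError for non-int items (e.g. bytes) and OverflowError
        # for ints outside the unsigned 64-bit range, before anything is hashed.
        packed = array("Q", block)
    except (TypeError, OverflowError):
        # Fallback: per-item hashing for mixed blocks.
        for item in block:
            if isinstance(item, int):
                hasher.update(item.to_bytes(8, "little"))
            else:
                # Assumption: bytes items are hashed as-is.
                hasher.update(item)
        return
    if sys.byteorder == "big":
        # array stores values in native byte order; swap so the bytes match
        # the little-endian per-token encoding.
        packed.byteswap()
    hasher.update(packed.tobytes())
```

Because array("Q", block) fails before any bytes reach the hasher, the fallback always starts from an untouched hash state for that block.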

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and concisely summarizes the main performance optimization: batching block-key SHA-256 hashing in kv_cache_manager_v2, which is the primary change across both modified files.
Description check | ✅ Passed | The PR description provides a clear explanation of the performance problem, the solution, and the impact, but the description section lacks a complete PR Checklist with verification marks as required by the template.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@tensorrt-cicd
Collaborator

PR_Github #52282 [ run ] triggered by Bot. Commit: 2ddd0f8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52282 [ run ] completed with state SUCCESS. Commit: 2ddd0f8
/LLM/main/L0_MergeRequest_PR pipeline #41592 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@lancelly
Collaborator Author

lancelly commented Jun 6, 2026

/bot run --disable-fail-fast

Hasher.update hashed each token of a block with its own int.to_bytes(8) +
sha256.update() call. For long warm prefix matches this is the dominant cost
of BlockRadixTree.match, which the attention-DP KV-cache-aware router
(KVCacheAwareADPRouter) runs as a per-request probe on every DP rank before
routing -- and which create_kv_cache repeats for the actual reuse lookup.

Pack the whole token block into bytes once (array("Q", block).tobytes()) and
do a single sha256.update(). All NVIDIA GPU host platforms (x86_64, aarch64/
Grace) are little-endian, so this is byte-identical to the per-token
to_bytes(8, "little") loop -- block reuse / cross-run cache-hit behavior is
unchanged. Multimodal blocks (which contain bytes items) fall back to the
per-token loop via except (TypeError, OverflowError).

Speeds up the probe and the create-time reuse lookup equally. On a GB300 Grace
node the real BlockRadixTree.match warm-prefix cost at ISL~38k drops 2.85-3.05x
at tokens_per_block=128/256 (DeepseekV4CacheManager). Adds TestBlockKeyHashing
to lock in the bit-identical contract incl. multi-modal blocks.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
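
The equivalence and speed claims in the commit message can be spot-checked with a micro-benchmark along these lines. This is only a sketch under stated assumptions: it uses synthetic token IDs, hashes each block independently instead of chaining through the real Hasher, and measures raw hashing rather than BlockRadixTree.match, so its ratio should not be expected to match the 2.85-3.05x figure reported above. The function names per_token and batched are hypothetical.

```python
import hashlib
import sys
import timeit
from array import array

TOKENS_PER_BLOCK = 128  # one of the block sizes named in the PR
# Roughly 38k tokens, rounded down to whole blocks.
NUM_TOKENS = 38_000 // TOKENS_PER_BLOCK * TOKENS_PER_BLOCK

tokens = list(range(NUM_TOKENS))  # synthetic token IDs, not real prompt data
blocks = [tokens[i:i + TOKENS_PER_BLOCK] for i in range(0, NUM_TOKENS, TOKENS_PER_BLOCK)]


def per_token(blocks):
    """One to_bytes() + update() call per token."""
    digests = []
    for block in blocks:
        h = hashlib.sha256()
        for t in block:
            h.update(t.to_bytes(8, "little"))
        digests.append(h.digest())
    return digests


def batched(blocks):
    """Pack each block once and hash it with a single call."""
    digests = []
    for block in blocks:
        packed = array("Q", block)
        if sys.byteorder == "big":
            packed.byteswap()  # keep the little-endian byte layout
        digests.append(hashlib.sha256(packed.tobytes()).digest())
    return digests


# The two paths must produce identical digests for integer-only blocks.
assert per_token(blocks) == batched(blocks)

t_old = min(timeit.repeat(lambda: per_token(blocks), number=5, repeat=3))
t_new = min(timeit.repeat(lambda: batched(blocks), number=5, repeat=3))
print(f"per-token: {t_old:.4f}s  batched: {t_new:.4f}s  ratio: {t_old / t_new:.2f}x")
```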
@lancelly force-pushed the perf/kvcache-v2-batched-blockkey-hashing-main branch from 2ddd0f8 to be1309c June 6, 2026 08:29
@lancelly
Collaborator Author

lancelly commented Jun 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #52493 [ run ] triggered by Bot. Commit: be1309c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52493 [ run ] completed with state SUCCESS. Commit: be1309c
/LLM/main/L0_MergeRequest_PR pipeline #41785 completed with status: 'FAILURE'


@lancelly
Collaborator Author

lancelly commented Jun 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #52506 [ run ] triggered by Bot. Commit: be1309c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #52506 [ run ] completed with state SUCCESS. Commit: be1309c
/LLM/main/L0_MergeRequest_PR pipeline #41797 completed with status: 'FAILURE'

