[None][feat] Add ConversationAwareADPRouter: explicit conversation->rank affinity for attention DP#14983

Draft
lancelly wants to merge 2 commits into
NVIDIA:mainfrom
lancelly:feat/conversation-aware-adp-router-main


@lancelly lancelly commented Jun 5, 2026

What

Adds ConversationAwareADPRouter, an instance-level attention-DP router that:

  • round-robins the first request of each conversation across DP ranks, then
  • pins every subsequent request carrying the same conversation_id to that conversation's first-turn rank.

This keeps a multi-turn conversation's growing KV-cache prefix on a single rank — maximizing block reuse and minimizing recompute / cross-rank migration — while spreading the birth of new conversations evenly.
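The two rules above can be sketched in a few lines. This is a toy model, not the PR's implementation: the class name StickyRoundRobin and its route method are hypothetical, and it ignores load, capacity limits, and map bounding.

```python
from typing import Dict


class StickyRoundRobin:
    """Toy model (hypothetical name): first-turn round-robin plus pinning."""

    def __init__(self, num_ranks: int) -> None:
        self.num_ranks = num_ranks
        self.cursor = 0  # rank that the next new conversation will get
        self.rank_of: Dict[str, int] = {}  # conversation_id -> pinned rank

    def route(self, conversation_id: str) -> int:
        # Later turns: reuse the rank chosen on the first turn.
        if conversation_id in self.rank_of:
            return self.rank_of[conversation_id]
        # First turn: take the next rank in round-robin order and pin it.
        rank = self.cursor
        self.cursor = (self.cursor + 1) % self.num_ranks
        self.rank_of[conversation_id] = rank
        return rank


router = StickyRoundRobin(num_ranks=4)
turns = ["a", "b", "a", "c", "b", "a"]
assignments = [(cid, router.route(cid)) for cid in turns]
print(assignments)
# [('a', 0), ('b', 1), ('a', 0), ('c', 2), ('b', 1), ('a', 0)]
```

New conversations a, b, c land on ranks 0, 1, 2 in arrival order, while every repeat turn goes back to its conversation's first rank.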

Why / vs KVCacheAwareADPRouter

KVCacheAwareADPRouter infers affinity from probed prefix-match length. That affinity is lost the moment a conversation's blocks are evicted: the request then re-routes by load and the conversation can migrate ranks. ConversationAwareADPRouter keeps an explicit conversation_id -> rank LRU map, so stickiness is deterministic and survives eviction. It also does not require KV-cache block reuse to function (though it is most beneficial with it).
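A bounded conversation_id -> rank LRU map of the kind described above might look like the following sketch. The class name ConversationRankLRU and its get/put methods are illustrative assumptions; only the idea of an LRU map bounded by a max_sessions value comes from the description.

```python
from collections import OrderedDict
from typing import Optional


class ConversationRankLRU:
    """Hypothetical bounded conversation_id -> rank map with LRU eviction."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self._map: "OrderedDict[str, int]" = OrderedDict()

    def get(self, conversation_id: str) -> Optional[int]:
        rank = self._map.get(conversation_id)
        if rank is not None:
            # A lookup counts as use: move the entry to the "recent" end.
            self._map.move_to_end(conversation_id)
        return rank

    def put(self, conversation_id: str, rank: int) -> None:
        self._map[conversation_id] = rank
        self._map.move_to_end(conversation_id)
        # Drop least recently used entries once the bound is exceeded.
        while len(self._map) > self.max_sessions:
            self._map.popitem(last=False)


lru = ConversationRankLRU(max_sessions=2)
lru.put("a", 0)
lru.put("b", 1)
lru.get("a")      # refreshes "a", so "b" is now the oldest entry
lru.put("c", 2)   # exceeds the bound and evicts "b"
print(lru.get("a"), lru.get("b"), lru.get("c"))
# 0 None 2
```

Because the map is keyed by conversation id rather than by cached blocks, an entry disappears only when the map itself evicts it, independently of what happens to the KV cache.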

Inspired by the serve-level ConversationRouter (tensorrt_llm/serve/router.py) and the first-turn-round-robin idea from #14744, applied at the intra-instance ADP-rank level.

How it is wired

  • Selected via AttentionDpConfig.kv_cache_routing_conversation_affinity (takes precedence over enable_kv_cache_aware_routing when both are set). kv_cache_routing_max_sessions bounds the LRU map.
  • conversation_id is read from req.py_disaggregated_params.conversation_id (serve-side propagated from the X-Session-ID header). When absent — header not sent, non-disaggregated, or propagation not present — the request falls back to load-balanced round-robin and is not recorded, so behavior degrades gracefully to DefaultADPRouter-style spreading.
  • Deterministic across TP ranks: route_requests runs locally on every rank with no broadcast, so the round-robin cursor and the conversation -> rank map evolve identically (same new_requests, same order) — the same invariant the existing warmup cursor relies on. Divergence would deadlock the allgather protocol.
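The determinism argument can be illustrated with independent replicas that never communicate. This is a simplified sketch: ReplicaRouter is a hypothetical stand-in, and the fallback for requests without an id uses the plain round-robin cursor rather than the load-balanced variant the PR describes.

```python
from typing import Dict, List, Optional


class ReplicaRouter:
    """Hypothetical per-rank copy of the routing state; no broadcast."""

    def __init__(self, num_ranks: int) -> None:
        self.num_ranks = num_ranks
        self.cursor = 0
        self.rank_of: Dict[str, int] = {}

    def route_requests(self, conversation_ids: List[Optional[str]]) -> List[int]:
        targets = []
        for cid in conversation_ids:
            if cid is not None and cid in self.rank_of:
                targets.append(self.rank_of[cid])  # sticky hit
                continue
            rank = self.cursor
            self.cursor = (self.cursor + 1) % self.num_ranks
            if cid is not None:
                self.rank_of[cid] = rank  # requests without an id are not recorded
            targets.append(rank)
        return targets


batches = [["a", None, "b"], ["a", "c", None], ["b", "a"]]
replicas = [ReplicaRouter(num_ranks=4) for _ in range(4)]
decisions = [[r.route_requests(batch) for batch in batches] for r in replicas]
in_sync = all(d == decisions[0] for d in decisions)
print(in_sync, decisions[0])
# True [[0, 1, 2], [0, 3, 0], [2, 0]]

# A replica that sees the first batch in a different order pins "a" elsewhere.
skewed = ReplicaRouter(num_ranks=4)
skewed.route_requests(["b", None, "a"])
print(skewed.rank_of["a"], replicas[0].rank_of["a"])
# 2 0
```

Identical inputs in identical order give identical decisions on every replica; reordering a single batch is enough to make the maps diverge, which is the failure mode the description warns about.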

Tests

tests/unittest/_torch/executor/test_adp_router.py::TestConversationAwareADPRouter covers: first-turn round-robin, stickiness across turns, conversation-less fallback (unrecorded), cross-rank determinism, LRU eviction, sticky overflow (keeps mapping), explicit attention_dp_rank, and factory selection (enabled/disabled).

Notes

  • Draft. The full benefit requires conversation_id reaching the worker's py_disaggregated_params (serve-side propagation from X-Session-ID); without it the router is a safe round-robin no-op.
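The propagation requirement amounts to copying one optional field through both converters. The sketch below uses simplified stand-in dataclasses (ServeParams, ExecutorParams) and stand-in converter names; the real DisaggregatedParams classes carry more fields, and request_type is shown only as an example of an existing field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeParams:
    """Stand-in for the serve-layer params (illustrative fields only)."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class ExecutorParams:
    """Stand-in for the executor-layer params (illustrative fields only)."""
    request_type: str
    conversation_id: Optional[str] = None


def to_executor(params: ServeParams) -> ExecutorParams:
    # Copying conversation_id here is what lets the worker see the id.
    return ExecutorParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )


def to_serve(params: ExecutorParams) -> ServeParams:
    # The reverse direction copies it too, so a round trip preserves it.
    return ServeParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )


original = ServeParams(request_type="context_only", conversation_id="session-42")
round_tripped = to_serve(to_executor(original))
print(round_tripped == original)
# True
```

If either converter omits the field, the worker sees None and the router treats every request as conversation-less.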

Update — now self-contained (2 commits)

  1. [None][fix] serve: propagate conversation_id to the executor/worker layer — the base this router needs: adds conversation_id to the executor-layer DisaggregatedParams and copies it through to_llm_disaggregated_params / to_disaggregated_params, so the worker actually sees the id the orchestrator routed on (without it the router safely degrades to round-robin).
  2. [None][feat] Add ConversationAwareADPRouter …: the router. Sticky assignments are capped at the loose limit fair_share_multiplier * fair_share (expected), not at the hard max_num_active_requests, because a rank that exceeds expected breaks the ADP padding invariant (py_executor._pad_attention_dp_dummy_request) and hangs the instance. The regression test test_returned_expected_covers_every_rank guards this.
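The cap on sticky assignments can be sketched as follows. Several details here are assumptions made for illustration and are not taken from the PR: fair_share is computed as the ceiling of requests per rank, an overflowing request goes to the least-loaded rank, and the helper names expected_cap and place_sticky are hypothetical.

```python
import math
from typing import List


def expected_cap(num_requests: int, num_ranks: int, fair_share_multiplier: float) -> int:
    # Assumption: fair_share is the per-rank share of the requests, rounded up.
    fair_share = math.ceil(num_requests / num_ranks)
    return math.ceil(fair_share_multiplier * fair_share)


def place_sticky(pinned_rank: int, load: List[int], cap: int) -> int:
    # Honor the pin only while the pinned rank is below the loose cap.
    if load[pinned_rank] < cap:
        return pinned_rank
    # Assumption: overflow goes to the least-loaded rank (lowest index on ties).
    return min(range(len(load)), key=lambda r: load[r])


cap = expected_cap(num_requests=8, num_ranks=4, fair_share_multiplier=1.5)
load = [3, 1, 0, 2]
print(cap, place_sticky(0, load, cap), place_sticky(1, load, cap))
# 3 2 1
```

Rank 0 is already at the cap, so its sticky request overflows to rank 2, while rank 1 still has room and keeps its pinned request. Per the test list above, the conversation's mapping is kept on overflow, so a later turn can return to the pinned rank once it has room.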

@lancelly lancelly force-pushed the feat/conversation-aware-adp-router-main branch from bbf3ae0 to 823ca8a Compare June 5, 2026 02:43
lancelly added 2 commits June 4, 2026 20:34
…ayer

conversation_id existed only on the serve-layer DisaggregatedParams
(openai_protocol). to_llm_disaggregated_params() did not copy it and the
executor-layer DisaggregatedParams (tensorrt_llm/disaggregated_params.py,
== LlmDisaggregatedParams) had no such field, so it was silently dropped
when a worker converted the incoming request (openai_server.py
to_llm_disaggregated_params). As a result worker-side consumers that read
request.py_disaggregated_params.conversation_id (e.g. the ADP router) only
ever saw None, even though the orchestrator routed on a real conversation
id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and
  to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the
orchestrator-level routers are unaffected (they already read the serve
layer directly).

Extends the converter unit tests to assert conversation_id propagates and
adds a round-trip test.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each
conversation across ranks, then pins every subsequent request with the same
conversation_id to that conversation's first-turn rank. This keeps a multi-turn
conversation's growing KV-cache prefix on one rank (maximizing block reuse,
minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match
length and loses a conversation when its blocks are evicted), the
conversation_id -> rank map is explicit and survives eviction. Inspired by the
serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744,
applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side
propagated from X-Session-ID); falls back to load-balanced round-robin when it
is absent, so behavior degrades gracefully. Selected via the new
attention_dp_config.kv_cache_routing_conversation_affinity flag
(kv_cache_routing_max_sessions bounds the LRU map).

Includes unit tests covering first-turn RR, stickiness, conv-less fallback,
cross-rank determinism, LRU eviction, sticky overflow, and factory selection.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly lancelly force-pushed the feat/conversation-aware-adp-router-main branch from 823ca8a to 5b29ef5 Compare June 5, 2026 03:34