Questions about GEMV Coalescing and Persistent Device Tables in RTX 5090 Ablations #1

@xing-cong

Hi, thanks for sharing this interesting work.

I was reading the ablation results on the RTX 5090 and noticed that GEMV coalescing and persistent device tables both bring significant speedups. I would like to better understand the implementation details behind these two optimizations.

For GEMV coalescing, could you clarify what is being coalesced in the kernel?
Is the optimization mainly about improving memory coalescing when loading weights/activations, merging multiple small GEMV-related operations into a single kernel path, or reorganizing the command/task layout so that adjacent threads/warps issue more contiguous memory accesses?
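To make the first reading concrete, here is a rough sketch of what I mean by memory coalescing. It is only a toy model written for this question, not code from this repository: it assumes a row-major fp16 weight matrix, 32-lane warps, and 128-byte memory segments, and the function names are made up. It counts how many segments one warp touches for a single load under two possible thread-to-data mappings.

```python
# Toy model (assumptions: row-major fp16 weights, 32-lane warps,
# 128-byte memory segments). Not taken from any real kernel.
WARP_SIZE = 32
SEGMENT_BYTES = 128
ELEM_BYTES = 2  # fp16


def segments_touched(addresses):
    """Count the distinct 128-byte segments covering the byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def thread_per_row(num_cols, k):
    """Lane i reads W[i][k]: neighbouring lanes are a full row apart."""
    return [(lane * num_cols + k) * ELEM_BYTES for lane in range(WARP_SIZE)]


def warp_per_row(num_cols, row, k0):
    """Lane i reads W[row][k0 + i]: neighbouring lanes read adjacent elements."""
    return [(row * num_cols + k0 + lane) * ELEM_BYTES for lane in range(WARP_SIZE)]


num_cols = 4096
strided = segments_touched(thread_per_row(num_cols, k=0))
contiguous = segments_touched(warp_per_row(num_cols, row=0, k0=0))
print(strided, contiguous)  # prints "32 1"
```

In this model the strided mapping touches 32 segments per warp load while the contiguous one touches 1. If "GEMV coalescing" refers to something else, such as merging several small GEMVs into one kernel path, then this sketch does not apply.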

For persistent device tables, could you explain what tables are made persistent on the device, and why repeatedly loading or constructing them becomes a bottleneck?
In particular, are these tables used for layer/operator scheduling, command dispatch, weight metadata, KV-cache metadata, or something else?

I am asking because, when I inspected the Stanford MegaKernels implementation, I did not observe the same need for persistent device tables. In that implementation, all commands seem to be fully loaded onto the GPU before the decoding phase starts, so the command tables do not appear to introduce repeated host-device transfer or reconstruction overhead during decoding.
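The contrast I have in mind can be sketched with a toy model. Everything here is hypothetical and written only for this question: `FakeDevice` merely counts host-to-device uploads, and the command table is an invented static schedule. It does not reflect the actual data structures of either project.

```python
# Toy model: count host-to-device uploads of a command table during decoding.
# All names are hypothetical; no real GPU API is involved.
class FakeDevice:
    """Stand-in for a GPU that only records uploads."""

    def __init__(self):
        self.uploads = 0
        self.tables = {}

    def upload(self, name, table):
        self.uploads += 1
        self.tables[name] = list(table)


def build_command_table(num_layers):
    # Assumed static schedule: one entry per (layer, op) pair.
    return [(layer, op) for layer in range(num_layers) for op in ("attn", "mlp")]


def decode_rebuilding(device, num_layers, num_tokens):
    """Rebuild and re-upload the table for every decoded token."""
    for _ in range(num_tokens):
        device.upload("commands", build_command_table(num_layers))
    return device.uploads


def decode_persistent(device, num_layers, num_tokens):
    """Upload the table once before decoding, then reuse it in place."""
    device.upload("commands", build_command_table(num_layers))
    for _ in range(num_tokens):
        _ = device.tables["commands"]  # already resident, no new transfer
    return device.uploads


print(decode_rebuilding(FakeDevice(), num_layers=4, num_tokens=16))  # prints 16
print(decode_persistent(FakeDevice(), num_layers=4, num_tokens=16))  # prints 1
```

My understanding is that the Stanford MegaKernels setup resembles the second function, so I am trying to work out which part of your runtime would otherwise behave like the first.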

Could you elaborate on the difference between your setting and the Stanford MegaKernels design?
For example, is the issue caused by dynamic command generation, per-token scheduling metadata, device-side dispatch structures, or another runtime mechanism?

Any clarification on the design and motivation of these two optimizations would be very helpful. Thanks!
