Per-component offload: streaming text encoder while transformers stay GPU-resident (warm-serving OOM at 96 GB) #232

@patraxo

Description

Setup

Self-hosted serving of LTX-2.3 (22B) on a single RTX PRO 6000 (96 GB), bf16. To keep warm latency low we keep both stage transformers GPU-resident between requests (~70.8 GB) and cache text embeddings per prompt.

Symptom

The first request with a new prompt on a warm container OOMs:

70.8 GB resident transformers + ~23 GB Gemma full-GPU build → 93.75 GB → CUDA OOM

PromptEncoder.__call__ builds the full Gemma text encoder on GPU on every call: in packages/ltx-pipelines/src/ltx_pipelines/utils/blocks.py, _text_encoder_ctx returns gpu_model(self._build_text_encoder()) when offload_mode == NONE. With embedding caching, every repeated-prompt request skips this path entirely, so benchmarks stay green and the OOM only appears in production on the first unseen prompt.

Why --offload cpu doesn't solve it

The streaming text-encoder builder already exists on every PromptEncoder instance and works well, but offload_mode is a single global knob: every pipeline factory passes the same value to the PromptEncoder and to all DiffusionStages. Setting --offload cpu to fix the text encoder therefore also layer-streams the 22B transformers, which defeats persistent-GPU serving. Per the README, enabling offload also disables FP8 quantization, which is a further coupling.

So today there is no supported way to get: streaming text encoder + GPU-resident transformers.

Request

Per-component offload granularity — for example a text_encoder_offload_mode parameter on the pipeline constructors (defaulting to the global offload_mode), or an explicit offload_mode override accepted by PromptEncoder consumers. Even exposing it only at the Python API level (not CLI) would cover the serving use case.
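To make the request concrete, here is a minimal sketch of the fallback semantics I have in mind. Everything in it is hypothetical: PipelineOffloadConfig and its two methods do not exist in the repo, and OffloadMode is a stand-in enum with only the two members mentioned in this issue. It only illustrates "the override defaults to the global offload_mode".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OffloadMode(Enum):
    # Stand-in for the library's enum; only the members named in this issue.
    NONE = "none"
    CPU = "cpu"


@dataclass
class PipelineOffloadConfig:
    """Hypothetical config: one global mode plus an optional text-encoder override."""

    offload_mode: OffloadMode = OffloadMode.NONE
    text_encoder_offload_mode: Optional[OffloadMode] = None

    def for_text_encoder(self) -> OffloadMode:
        # Fall back to the global mode when no override is given,
        # so existing callers keep today's behaviour.
        if self.text_encoder_offload_mode is None:
            return self.offload_mode
        return self.text_encoder_offload_mode

    def for_diffusion_stages(self) -> OffloadMode:
        # The transformers always follow the global mode.
        return self.offload_mode
```

With this shape, the serving case would be PipelineOffloadConfig(text_encoder_offload_mode=OffloadMode.CPU): the text encoder streams while the diffusion stages stay at OffloadMode.NONE.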

Workaround we run (in case it helps others)

We wrap the prompt-encode call and flip _offload_mode to OffloadMode.CPU only for cache-miss calls when free VRAM is below a threshold, restoring it after. Same weights/dtype → identical embeddings; the request that used to die completes at 78.6 GB peak (~+10 s once per prompt, then cached). Happy to share details or a patch if useful.
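A rough sketch of that wrapper, for anyone who wants to try the same thing. It is not our exact code: temporary_offload, encode_with_fallback, the dict cache and the injected free_vram_bytes callable are illustrative names, and OffloadMode is a stand-in enum. The only thing taken from the library is that the encoder is callable and carries a private _offload_mode attribute.

```python
from contextlib import contextmanager
from enum import Enum


class OffloadMode(Enum):
    # Stand-in for the library's enum; only the members named in this issue.
    NONE = "none"
    CPU = "cpu"


@contextmanager
def temporary_offload(encoder, mode):
    """Set encoder._offload_mode to `mode` and restore the old value on exit."""
    previous = encoder._offload_mode
    encoder._offload_mode = mode
    try:
        yield encoder
    finally:
        # Restore even if encoding raises, so later calls see the original mode.
        encoder._offload_mode = previous


def encode_with_fallback(encoder, prompt, cache, free_vram_bytes, min_free_bytes):
    """Return cached embeddings, streaming the text encoder only when VRAM is tight."""
    if prompt in cache:
        # Cache hit: the text encoder is never built.
        return cache[prompt]
    if free_vram_bytes() < min_free_bytes:
        # Cache miss with little headroom: stream the encoder for this call only.
        with temporary_offload(encoder, OffloadMode.CPU):
            embedding = encoder(prompt)
    else:
        embedding = encoder(prompt)
    cache[prompt] = embedding
    return embedding
```

In a real deployment free_vram_bytes could be lambda: torch.cuda.mem_get_info()[0], which reports free device memory in bytes. Because the wrapper mutates a private attribute on a shared encoder, it is only safe if prompt encoding is serialized across requests.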

Related: #164 (Gemma VRAM pain, smaller-encoder angle), #143 (Gemma co-residency OOM), #140 (offload question), #152 (adjacent external-serving OOM).
