Setup
Self-hosted serving of LTX-2.3 (22B) on a single RTX PRO 6000 (96 GB), bf16. To keep warm latency low we keep both stage transformers GPU-resident between requests (~70.8 GB) and cache text embeddings per prompt.
Symptom
The first request with a new prompt on a warm container OOMs:
70.8 GB resident transformers + ~23 GB Gemma full-GPU build → 93.75 GB → CUDA OOM
PromptEncoder.__call__ builds the full Gemma text encoder on GPU per call (packages/ltx-pipelines/src/ltx_pipelines/utils/blocks.py — _text_encoder_ctx returns gpu_model(self._build_text_encoder()) when offload_mode == NONE). With embedding caching, every repeated-prompt request skips this entirely, so benchmarks stay green and the OOM only appears in production on the first unseen prompt.
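A minimal sketch of why the cache hides the problem in benchmarks. The `CachedPromptEncoder` wrapper and `fake_encode` are hypothetical stand-ins, not code from ltx_pipelines; the counter only tracks how often the expensive encoder-build path would run.

```python
from typing import Callable, Dict, List


class CachedPromptEncoder:
    """Hypothetical per-prompt embedding cache around an encode function."""

    def __init__(self, encode_fn: Callable[[str], List[float]]) -> None:
        self._encode_fn = encode_fn
        self._cache: Dict[str, List[float]] = {}
        self.encoder_builds = 0  # how often the expensive path runs

    def __call__(self, prompt: str) -> List[float]:
        if prompt not in self._cache:
            # Cache miss: the only path that would build the text encoder on GPU.
            self.encoder_builds += 1
            self._cache[prompt] = self._encode_fn(prompt)
        return self._cache[prompt]


def fake_encode(prompt: str) -> List[float]:
    # Stand-in for the real encoder call; returns a dummy embedding.
    return [float(len(prompt))]


encoder = CachedPromptEncoder(fake_encode)
for _ in range(100):  # benchmark loop that reuses one prompt
    encoder("a cat on a skateboard")
builds_after_benchmark = encoder.encoder_builds

encoder("a brand-new prompt")  # first unseen prompt in production
builds_after_new_prompt = encoder.encoder_builds
```

A fixed-prompt benchmark reaches the encoder once, during warm-up, before the transformers are necessarily resident; every later unseen prompt reaches it again with the transformers already occupying VRAM.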
Why --offload cpu doesn't solve it
The streaming text-encoder builder already exists on every PromptEncoder instance and works well — but offload_mode is a single global knob: every pipeline factory passes the same value to the PromptEncoder and all DiffusionStages. Setting --offload cpu to fix the text encoder also layer-streams the 22B transformers, which defeats persistent-GPU serving (and per the README, offload disables FP8 quantization as a further coupling).
So today there is no supported way to combine a streaming text encoder with GPU-resident transformers.
Request
Per-component offload granularity — for example a text_encoder_offload_mode parameter on the pipeline constructors (defaulting to the global offload_mode), or an explicit offload_mode override accepted by PromptEncoder consumers. Even exposing it only at the Python API level (not CLI) would cover the serving use case.
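To make the requested semantics concrete, here is a rough sketch of the fallback behaviour we have in mind. `PipelineOffloadConfig` and its methods are hypothetical and do not exist in ltx_pipelines; the `OffloadMode` enum below is a local stand-in that only models the two values mentioned in this issue.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OffloadMode(Enum):
    # Local stand-in for the library's enum (assumption: NONE and CPU only).
    NONE = "none"
    CPU = "cpu"


@dataclass
class PipelineOffloadConfig:
    """Hypothetical config object illustrating the proposed override."""

    offload_mode: OffloadMode = OffloadMode.NONE
    # None means "inherit the global offload_mode", so existing callers
    # would see no behaviour change.
    text_encoder_offload_mode: Optional[OffloadMode] = None

    def resolve_text_encoder_mode(self) -> OffloadMode:
        if self.text_encoder_offload_mode is None:
            return self.offload_mode
        return self.text_encoder_offload_mode

    def resolve_transformer_mode(self) -> OffloadMode:
        # The diffusion stages keep following the global knob.
        return self.offload_mode


# Serving case: transformers stay resident, text encoder streams.
serving = PipelineOffloadConfig(
    offload_mode=OffloadMode.NONE,
    text_encoder_offload_mode=OffloadMode.CPU,
)
# Default case: nothing specified, both follow the global value.
default = PipelineOffloadConfig()
```

With this shape, the pipeline factories would pass `resolve_text_encoder_mode()` to the PromptEncoder and `resolve_transformer_mode()` to the DiffusionStages instead of one shared value.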
Workaround we run (in case it helps others)
We wrap the prompt-encode call and flip _offload_mode to OffloadMode.CPU only for cache-miss calls when free VRAM is below a threshold, restoring it after. Same weights/dtype → identical embeddings; the request that used to die completes at 78.6 GB peak (~+10 s once per prompt, then cached). Happy to share details or a patch if useful.
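Roughly, the wrapper looks like the sketch below. It is a simplified illustration, not our exact patch: `FakePromptEncoder` and the local `OffloadMode` enum are stand-ins for the library classes, `streamed_if_low_vram` is our own helper name, and the 30 GiB threshold is an illustrative value. The free-memory probe is injected as a callable; in a real deployment it could be something like `lambda: torch.cuda.mem_get_info()[0]`.

```python
from contextlib import contextmanager
from enum import Enum
from typing import Callable, Iterator, Tuple


class OffloadMode(Enum):
    # Local stand-in for the library's enum (assumption: NONE and CPU only).
    NONE = "none"
    CPU = "cpu"


class FakePromptEncoder:
    """Stand-in exposing only the private attribute the workaround touches."""

    def __init__(self) -> None:
        self._offload_mode = OffloadMode.NONE

    def __call__(self, prompt: str) -> Tuple[str, OffloadMode]:
        # Report which mode the call ran under instead of real embeddings.
        return prompt, self._offload_mode


@contextmanager
def streamed_if_low_vram(
    encoder: FakePromptEncoder,
    free_bytes_fn: Callable[[], int],
    min_free_bytes: int,
) -> Iterator[None]:
    """Switch the encoder to CPU streaming only while free VRAM is too low."""
    original = encoder._offload_mode
    if free_bytes_fn() < min_free_bytes:
        encoder._offload_mode = OffloadMode.CPU
    try:
        yield
    finally:
        # Always restore, even if the encode call raises.
        encoder._offload_mode = original


GIB = 1024 ** 3
encoder = FakePromptEncoder()

# Warm container: about 25 GiB free, below the 30 GiB threshold, so stream.
with streamed_if_low_vram(encoder, lambda: 25 * GIB, 30 * GIB):
    _, mode_when_low = encoder("unseen prompt")

# Cold container: plenty of free VRAM, so the mode is left untouched.
with streamed_if_low_vram(encoder, lambda: 90 * GIB, 30 * GIB):
    _, mode_when_free = encoder("unseen prompt")
```

We only enter this context on a cache miss, so cached prompts never pay the streaming cost. It relies on a private attribute, which is why a supported per-component option would be preferable.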
Related: #164 (Gemma VRAM pain, smaller-encoder angle), #143 (Gemma co-residency OOM), #140 (offload question), #152 (adjacent external-serving OOM).