Per-component offload: streaming text encoder while transformers stay GPU-resident (warm-serving OOM at 96 GB) #232

## Setup

Self-hosted serving of LTX-2.3 (22B) on a single RTX PRO 6000 (96 GB), bf16. To keep warm latency low we keep both stage transformers GPU-resident between requests (~70.8 GB) and cache text embeddings per prompt.

## Symptom

The first request with a **new** prompt on a warm container OOMs:

```
70.8 GB resident transformers + ~23 GB Gemma full-GPU build → 93.75 GB → CUDA OOM
```

`PromptEncoder.__call__` builds the full Gemma text encoder on the GPU on every call: in `packages/ltx-pipelines/src/ltx_pipelines/utils/blocks.py`, `_text_encoder_ctx` returns `gpu_model(self._build_text_encoder())` when `offload_mode == NONE`. With embedding caching, repeated-prompt requests skip this path entirely, so benchmarks that reuse prompts stay green and the OOM only appears in production on the first unseen prompt.
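To illustrate why a repeated-prompt benchmark never exercises the failing path, here is a minimal sketch. `CountingEncoder` and `encode_cached` are hypothetical stand-ins, not the library's API; the counter stands in for the expensive full-GPU Gemma build.

```python
class CountingEncoder:
    """Hypothetical stand-in for PromptEncoder; counts encoder builds."""

    def __init__(self):
        self.builds = 0

    def __call__(self, prompt):
        # In the real pipeline this is where the full text encoder
        # would be built on the GPU (the step that OOMs).
        self.builds += 1
        return ("embedding", prompt)


def encode_cached(encoder, cache, prompt):
    """Return the cached embedding, calling the encoder only on a cache miss."""
    if prompt not in cache:
        cache[prompt] = encoder(prompt)
    return cache[prompt]


encoder = CountingEncoder()
cache = {}

# A benchmark that reuses one prompt triggers a single build.
for _ in range(5):
    encode_cached(encoder, cache, "a cat on a skateboard")
builds_after_benchmark = encoder.builds

# The first unseen prompt triggers another build; on a warm container
# this is the call that has to fit next to the resident transformers.
encode_cached(encoder, cache, "a new prompt")
builds_after_new_prompt = encoder.builds
```

A benchmark meant to catch this would need at least one cache-miss request issued while the transformers are already resident.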

## Why `--offload cpu` doesn't solve it

The streaming text-encoder builder already exists on every `PromptEncoder` instance and works well, but `offload_mode` is a single global knob: every pipeline factory passes the same value to the `PromptEncoder` **and** to all `DiffusionStage`s. Setting `--offload cpu` to fix the text encoder therefore also layer-streams the 22B transformers, which defeats persistent-GPU serving. A further coupling: per the README, enabling offload also disables FP8 quantization.

So today there is no supported way to combine a streaming text encoder with GPU-resident transformers.

## Request

Per-component offload granularity — for example a `text_encoder_offload_mode` parameter on the pipeline constructors (defaulting to the global `offload_mode`), or an explicit `offload_mode` override accepted by `PromptEncoder` consumers. Even exposing it only at the Python API level (not CLI) would cover the serving use case.
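A rough sketch of the fallback semantics we have in mind, not a proposal for the exact API. `PipelineOffloadConfig` and its method names are hypothetical, and the `OffloadMode` enum here is a local stand-in for the library's own enum so the snippet runs by itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OffloadMode(Enum):
    """Local stand-in for the library's OffloadMode enum (assumption)."""
    NONE = "none"
    CPU = "cpu"


@dataclass
class PipelineOffloadConfig:
    """Hypothetical container for per-component offload settings."""
    offload_mode: OffloadMode = OffloadMode.NONE
    # None means "inherit the global offload_mode", which keeps
    # today's behavior for callers that never set the override.
    text_encoder_offload_mode: Optional[OffloadMode] = None

    def for_text_encoder(self) -> OffloadMode:
        if self.text_encoder_offload_mode is not None:
            return self.text_encoder_offload_mode
        return self.offload_mode

    def for_diffusion_stages(self) -> OffloadMode:
        # Transformers keep following the global knob.
        return self.offload_mode


# Serving configuration: stream the text encoder, keep transformers resident.
serving = PipelineOffloadConfig(
    offload_mode=OffloadMode.NONE,
    text_encoder_offload_mode=OffloadMode.CPU,
)
```

With this shape, a pipeline factory would pass `for_text_encoder()` to the `PromptEncoder` and `for_diffusion_stages()` to each `DiffusionStage`, and existing callers would see no change.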

## Workaround we run (in case it helps others)

We wrap the prompt-encode call and flip `_offload_mode` to `OffloadMode.CPU` only for cache-miss calls when free VRAM is below a threshold, restoring it after. Same weights/dtype → identical embeddings; the request that used to die completes at 78.6 GB peak (~+10 s once per prompt, then cached). Happy to share details or a patch if useful.
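For reference, the shape of the wrapper is roughly as follows. This is a simplified sketch rather than our exact code: `FakePromptEncoder`, `temporary_offload`, `encode_with_guard`, and the threshold parameter are illustrative names, and `OffloadMode` is a local stand-in for the library's enum. The free-VRAM reading is injected as a callable; in a real deployment it could come from `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes.

```python
from contextlib import contextmanager
from enum import Enum


class OffloadMode(Enum):
    """Local stand-in for the library's OffloadMode enum (assumption)."""
    NONE = "none"
    CPU = "cpu"


class FakePromptEncoder:
    """Hypothetical stand-in that records which mode each call ran under."""

    def __init__(self):
        self._offload_mode = OffloadMode.NONE
        self.modes_seen = []

    def __call__(self, prompt):
        self.modes_seen.append(self._offload_mode)
        return f"emb({prompt})"


@contextmanager
def temporary_offload(encoder, mode):
    """Set encoder._offload_mode for the duration of the block, then restore it."""
    previous = encoder._offload_mode
    encoder._offload_mode = mode
    try:
        yield encoder
    finally:
        # Restore even if encoding raises, so later calls are unaffected.
        encoder._offload_mode = previous


def encode_with_guard(encoder, prompt, cache, free_bytes, min_free_bytes):
    """Encode with caching; stream the encoder only on a low-VRAM cache miss."""
    if prompt in cache:
        return cache[prompt]
    if free_bytes() < min_free_bytes:
        with temporary_offload(encoder, OffloadMode.CPU):
            embedding = encoder(prompt)
    else:
        embedding = encoder(prompt)
    cache[prompt] = embedding
    return embedding
```

Note that this relies on the private `_offload_mode` attribute, which is exactly why a supported per-component option would be preferable.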

Related: #164 (Gemma VRAM pain, smaller-encoder angle), #143 (Gemma co-residency OOM), #140 (offload question), #152 (adjacent external-serving OOM).
