141 changes: 141 additions & 0 deletions google-gemma-4-E2B-it/QNN/README.md
@@ -0,0 +1,141 @@
# Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe

> **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for
> running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
> on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution
> provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.).
> Has *not* yet been validated on hardware — see
> [Limitations](#limitations).

## Approach

All four Gemma 4 components (decoder, embedding, vision_encoder,
audio_encoder) are compiled into QNN EPContext binaries together.
Olive's per-component dispatch on `CompositeModelHandler` runs each
pass on every component. `EPContextBinaryGenerator` and
`ComposeOnnxModels` (both set `_accepts_composite_model = True`) then
take the composite model as a whole and finalize the multimodal package.

## Pipeline

```
HfModel (multimodal Gemma 4)
  ↓ MobiusBuilder (fp32)             4 ONNX components + genai_config + tokenizer + processors
  ↓ OnnxKQuantQuantization (INT4)    mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
  ↓ MatMulNBitsToQDQ                 MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
  ↓ OnnxStaticQuantization           activations uint16 / weights uint8 (calibrated)
  ↓ StaticLLM                        static shapes for QNN
  ↓ EPContextBinaryGenerator         HTP EPContext blobs (per component, weight-shared)
  ↓ ComposeOnnxModels                final package
```

Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`?
`OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has
fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list.
Without `MatMulNBitsToQDQ`, the QNN partitioner rejects every quantized
MatMul and those nodes silently fall back to CPU.
`MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard
`MatMul + DequantizeLinear` pair so QNN can claim the subgraph and
compile it for HTP.
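
One way to check that the rewrite took effect is to count the `MatMulNBits` nodes left in a component after the `mnb_to_qdq` pass. The sketch below assumes the `onnx` package is installed; the helper names are hypothetical, and it only walks top-level graph nodes, not nested subgraphs.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_ops(nodes: Iterable[Tuple[str, str]]) -> Counter:
    """Count (domain, op_type) pairs; an empty domain means the default ONNX domain."""
    return Counter((domain or "ai.onnx", op_type) for domain, op_type in nodes)


def load_op_pairs(model_path: str):
    """Yield (domain, op_type) for each top-level node of an ONNX file."""
    import onnx  # imported lazily so count_ops stays usable without onnx installed

    # Skip the external weight files: only the graph structure is needed here.
    model = onnx.load(model_path, load_external_data=False)
    for node in model.graph.node:
        yield node.domain, node.op_type


def leftover_matmulnbits(nodes: Iterable[Tuple[str, str]]) -> int:
    """Number of com.microsoft::MatMulNBits nodes still present (0 is the goal)."""
    return count_ops(nodes)[("com.microsoft", "MatMulNBits")]
```

Calling `leftover_matmulnbits(load_op_pairs(path))` on each exported component should return 0 if every `MatMulNBits` was rewritten; a non-zero count points at nodes the QNN EP would likely leave on CPU.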

## Prerequisites

### Quantization environment (x64, GPU recommended)
```bash
pip install -r requirements.txt
pip install cupy-cuda12x # accelerates OnnxKQuantQuantization (19–51× speedup)
```

### AOT compilation environment (separate venv, x64 with QNN SDK)
```bash
pip install olive-ai mobius-ai
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
```

Replace `/path/to/qnn/env/bin` in `config.json` with the directory
containing your QNN venv's Python executable.
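
The path substitution can also be scripted. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the config keeps the `systems.qnn_system.python_environment_path` layout shown in this recipe's `config.json`.

```python
import json
from pathlib import Path


def set_qnn_env_path(config: dict, env_bin: str, system: str = "qnn_system") -> dict:
    """Return a copy of the config with the system's python_environment_path set."""
    patched = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    patched["systems"][system]["python_environment_path"] = env_bin
    return patched


def patch_config_file(path: str, env_bin: str) -> None:
    """Rewrite the config file in place with the new environment path."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    patched = set_qnn_env_path(cfg, env_bin)
    Path(path).write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")
```

When run with the QNN venv's own interpreter, `os.path.dirname(sys.executable)` yields the directory to pass as `env_bin`.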

### Inference (Snapdragon device)
```bash
pip install onnxruntime-qnn onnxruntime-genai
```

## Build

Review comment (author): Added an explicit note in the Build section: run `olive run` from the quantization environment; Olive invokes the QNN AOT venv automatically via `systems.qnn_system.python_environment_path` for the `EPContextBinaryGenerator` pass. Fixed in cbf992c.

Run `olive run` from the **quantization environment** (not the QNN AOT
venv). Olive invokes the QNN AOT venv automatically via the
`python_environment_path` configured under `systems.qnn_system` for the
`EPContextBinaryGenerator` pass:

```bash
olive run --config config.json
```
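
Before launching a long build, a quick pre-flight check of `config.json` can catch the easy mistakes. The sketch below is not Olive's own validation; it only checks the cross-references visible in this recipe (the `target`, the placeholder environment path, and each pass's `data_config`), and the function name is hypothetical.

```python
def preflight(config: dict) -> list:
    """Return a list of human-readable problems found in an Olive-style config dict."""
    problems = []
    systems = config.get("systems", {})
    data_names = {d.get("name") for d in config.get("data_configs", [])}

    # The top-level target should name one of the configured systems.
    target = config.get("target")
    if target is not None and target not in systems:
        problems.append(f"target '{target}' is not defined under systems")

    # The README placeholder must be replaced with a real venv directory.
    for sys_name, sys_cfg in systems.items():
        if "/path/to/" in str(sys_cfg.get("python_environment_path", "")):
            problems.append(f"system '{sys_name}' still has a placeholder python_environment_path")

    # Every pass that names a data_config should point at a defined one.
    for pass_name, pass_cfg in config.get("passes", {}).items():
        ref = pass_cfg.get("data_config")
        if isinstance(ref, str) and ref not in data_names:
            problems.append(f"pass '{pass_name}' references unknown data_config '{ref}'")
    return problems
```

Loading the file with `json.load` and printing the returned list gives an empty result for a config that passes these three checks.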

## Why no `GraphSurgeries` here?

Most existing QNN recipes (Phi-3, Qwen) chain surgeries like
`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
`SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific
sub-graphs into forms HTP can lower:

| Surgery | What it does | Why ModelBuilder needs it |
|---|---|---|
| `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel |
| `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector |
| `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens |

`MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`,
`Attention`) instead of the contrib variants, so these surgeries are
either no-ops or inapplicable. Gemma-4-specific surgeries may still be
needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`),
but the set borrowed from the Phi-3 recipes does not provide them.
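
The first row of the table above relies on a simple identity: with `eps = 0`, RMS normalization equals L2 normalization scaled by `sqrt(D)`, where `D` is the hidden size. The sketch below illustrates that identity in plain Python; folding the `sqrt(D)` factor into `gamma` is an assumption made here for illustration, not a description of how the Olive surgery is implemented.

```python
import math


def rms_norm(x, gamma, eps=0.0):
    """Reference RMS normalization: x / sqrt(mean(x^2) + eps) * gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gamma)]


def l2_norm_times_gamma(x, gamma):
    """L2-normalize x, then multiply by gamma with sqrt(D) folded in (assumed)."""
    l2 = math.sqrt(sum(v * v for v in x))
    folded = [g * math.sqrt(len(x)) for g in gamma]
    return [(v / l2) * g for v, g in zip(x, folded)]
```

With a non-zero `eps` the two forms differ slightly, so the equivalence is exact only in the `eps = 0` case.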

## Limitations

This recipe has **not yet been validated end-to-end**. Known gaps:

1. **Logit soft-cap may not lower to HTP.** Gemma 4's
   `logits = cap * tanh(logits / cap)` is an unusual pattern for QNN.
   If HTP rejects it, options are (a) skip soft-cap during QNN compile
   and apply it in host post-processing, or (b) add a
   `RemoveLogitSoftcap` GraphSurgery upstream in Olive.

2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B
   alternates local sliding-window (head_dim=256) and global
   (head_dim=512) attention layers. Whether HTP can dispatch this
   correctly per-layer needs testing.

3. **`per_layer_inputs` data flow.** The embedding component emits a
   second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by
   every decoder block. When all components compile to QNN this should
   "just work" (the data path stays inside the package), but the
   `StaticLLM` pass may need a hint to recognise this extra tensor.

4. **256k tokenizer calibration.** `wikitext-2` calibration likely
   under-represents Gemma 4's image / audio special tokens. Consider
   augmenting with multimodal-formatted prompts before production.

5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN
   recipes; tune it to the target Snapdragon SKU's memory budget.

6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius
   emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)`
   only when the EP capability advertises `gqa_dtypes`. The QNN EP
   capability in mobius currently has an empty `gqa_dtypes` list, so
   `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`)
   falls back to the standard opset-23 `Attention` with an
   `attention_mask` input. QNN's HTP backend should have an attention
   kernel for the standard op, but if it doesn't lower well there are
   two options:
   - extend mobius `ep_capabilities()` to advertise QNN-supported
     dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly
     (no GraphSurgery needed); or
   - port `AttentionMaskToSequenceLengths` to operate on standard
     `Attention` (it currently checks for `GroupQueryAttention` only
     and no-ops otherwise).
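
Option (a) under limitation 1, applying the soft-cap on the host, can be sketched in a few lines. This assumes the compiled graph returns raw, uncapped logits; the function name is hypothetical, and the cap value should come from the model's own configuration rather than being hard-coded.

```python
import math


def soft_cap(logits, cap):
    """Apply cap * tanh(x / cap) element-wise to a sequence of raw logits."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(v / cap) for v in logits]
```

Because `tanh` is monotonic, the soft-cap does not change which token has the largest logit (up to floating-point saturation for very large values), so greedy decoding is unaffected; it matters for sampling, where the capped values change the softmax distribution.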

## Discussion

If you have a Snapdragon test rig and the pipeline blows up on a
specific pass, please drop the trace in a comment — this recipe is
intentionally a template for iteration, not a finished product.
81 changes: 81 additions & 0 deletions google-gemma-4-E2B-it/QNN/config.json
@@ -0,0 +1,81 @@
{
  "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
  "systems": {
    "qnn_system": {
      "type": "PythonEnvironment",
      "python_environment_path": "/path/to/qnn/env/bin",
      "accelerators": [
        { "device": "npu", "execution_providers": ["QNNExecutionProvider"] }
      ]
    }
  },
  "data_configs": [
    {
      "name": "wikitext2_train_act",
      "type": "HuggingfaceContainer",
      "load_dataset_config": {
        "data_name": "wikitext",
        "subset": "wikitext-2-raw-v1",
        "split": "train"
      },
      "pre_process_data_config": {
        "strategy": "line-by-line",
        "add_special_tokens": true,
        "max_samples": 256,
        "max_seq_len": 1024
      }
    }
  ],
  "passes": {
    "mobius_build": {
      "type": "MobiusBuilder",
      "precision": "fp32"
    },
    "int4_quantize": {
      "type": "OnnxKQuantQuantization",
      "bits": 4,
      "block_size": 32,
      "save_as_external_data": true
    },
    "mnb_to_qdq": {
      "type": "MatMulNBitsToQDQ",
      "use_int4": true,
      "add_zero_point": true,
      "save_as_external_data": true
    },
Comment on lines +40 to +45
The PR description was stale. `mnb_to_qdq` is intentional: `OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which the QNN EP doesn't claim, so without the QDQ rewrite every quantized MatMul silently falls back to CPU. Fixed by updating the PR description (commit 1e7c186 already restored the pass; the description was just out of sync).

    "static_quant": {
      "type": "OnnxStaticQuantization",
      "data_config": "wikitext2_train_act",
      "activation_type": "uint16",
      "precision": "uint8",
      "calibration_providers": ["CUDAExecutionProvider"],
      "quant_preprocess": true,
      "op_types_to_exclude": [
        "GatherBlockQuantized",
        "GroupQueryAttention",
        "MatMulNBits"
      ],
      "save_as_external_data": true
    },
    "static_llm": {
      "type": "StaticLLM",
      "batch_size": 1,
      "context_length": 64
    },
    "ep_context": {
      "type": "EPContextBinaryGenerator",
      "provider_options": {
        "htp_performance_mode": "burst",
        "htp_graph_finalization_optimization_mode": "3",
        "soc_model": "60"
      },
      "weight_sharing": true
    },
    "compose": { "type": "ComposeOnnxModels" }
  },
  "target": "qnn_system",
  "output_dir": "model/gemma4_e2b_qnn",
  "cache_dir": "cache",
  "no_artifacts": true,
  "evaluate_input_model": false
}
13 changes: 13 additions & 0 deletions google-gemma-4-E2B-it/QNN/info.yml
@@ -0,0 +1,13 @@
keywords:
  - foundry-local
  - qnn
  - gemma4
  - mobius
arch: gemma
recipes:
  - name: gemma4-e2b-qnn
    file: config.json
    devices:
      - npu
    eps: QNNExecutionProvider
name: gemma4-e2b-qnn
5 changes: 5 additions & 0 deletions google-gemma-4-E2B-it/QNN/requirements.txt
@@ -0,0 +1,5 @@
datasets
mobius-ai
olive-ai
onnxruntime-gpu
transformers