Add Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe — WIP #432
base: main
Changes from all commits
01b7061
4b8b6d0
1e7c186
7029e9b
cbf992c
811f3a0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,141 @@ | ||
| # Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe | ||
|
|
||
| > **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for | ||
| > running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) | ||
| > on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution | ||
| > provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.). | ||
| > It has *not* yet been validated on hardware; see | ||
| > [Limitations](#limitations). | ||
|
|
||
| ## Approach | ||
|
|
||
| All four Gemma 4 components (decoder, embedding, vision_encoder, | ||
| audio_encoder) are compiled into QNN EPContext binaries together. | ||
| Olive's per-component dispatch on `CompositeModelHandler` runs each | ||
| pass on every component, then `EPContextBinaryGenerator` and | ||
| `ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize | ||
| the multimodal package. | ||
|
|
||
| ## Pipeline | ||
|
|
||
| ``` | ||
| HfModel (multimodal Gemma 4) | ||
| ↓ MobiusBuilder (fp32) 4 ONNX components + genai_config + tokenizer + processors | ||
| ↓ OnnxKQuantQuantization (INT4) mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits | ||
| ↓ MatMulNBitsToQDQ MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ) | ||
| ↓ OnnxStaticQuantization activations uint16 / weights uint8 (calibrated) | ||
| ↓ StaticLLM static shapes for QNN | ||
| ↓ EPContextBinaryGenerator HTP EPContext blobs (per component, weight-shared) | ||
| ↓ ComposeOnnxModels final package | ||
| ``` | ||
|
|
||
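The pass ordering above can be checked against the config with a short script. This is a minimal sketch, assuming Olive executes passes in the order they are declared under `passes` (no explicit pass flow is configured in this recipe); `CONFIG_TEXT` is a trimmed, illustrative copy of `config.json` that keeps only the pass types, and `pass_order` is a hypothetical helper, not an Olive API.

```python
import json

# Trimmed copy of the recipe's "passes" section (types only). The real
# config.json carries more options per pass.
CONFIG_TEXT = """
{
  "passes": {
    "mobius_build":  {"type": "MobiusBuilder"},
    "int4_quantize": {"type": "OnnxKQuantQuantization"},
    "mnb_to_qdq":    {"type": "MatMulNBitsToQDQ"},
    "static_quant":  {"type": "OnnxStaticQuantization"},
    "static_llm":    {"type": "StaticLLM"},
    "ep_context":    {"type": "EPContextBinaryGenerator"},
    "compose":       {"type": "ComposeOnnxModels"}
  }
}
"""


def pass_order(config: dict) -> list[str]:
    """Return pass types in declaration order (json.loads preserves key order)."""
    return [p["type"] for p in config["passes"].values()]


order = pass_order(json.loads(CONFIG_TEXT))
print(" -> ".join(order))
```

To run the same check on the real file, load it with `json.load(open("config.json"))` instead of `CONFIG_TEXT`.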
| Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`? | ||
| `OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has | ||
| fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list | ||
| — without `MatMulNBitsToQDQ` the QNN partitioner rejects every | ||
| quantized MatMul and the model silently falls back to CPU. | ||
| `MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard | ||
| `MatMul + DequantizeLinear` pair so QNN can claim and compile the | ||
| subgraph onto HTP. | ||
|
|
||
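To make the `MatMul + DequantizeLinear` form concrete, here is a pure-Python sketch of what the rewritten subgraph computes for one weight block: 4-bit codes are expanded with `(q - zero_point) * scale` and then fed to an ordinary dot product. The quantizer below is plain asymmetric min/max rounding, used only for illustration; it is not the k-quant algorithm that `OnnxKQuantQuantization` implements, and all function names are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Asymmetric min/max quantization of one weight block to unsigned ints."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant block
    zero_point = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_linear(q, scale, zero_point):
    """Element-wise (q - zero_point) * scale, as DequantizeLinear computes."""
    return [(v - zero_point) * scale for v in q]


def dot(x, w):
    """Dot product standing in for one output element of MatMul."""
    return sum(a * b for a, b in zip(x, w))


# One 8-element block of a weight column (the recipe uses block_size=32).
w_block = [-0.8, -0.3, 0.0, 0.25, 0.5, 0.7, -0.1, 0.4]
x = [1.0, 2.0, -1.0, 0.5, 0.0, 1.5, -2.0, 1.0]

q, scale, zp = quantize_block(w_block)
w_hat = dequantize_linear(q, scale, zp)

print("int4 codes:", q)
print("fp32 result:", dot(x, w_block))
print("QDQ  result:", dot(x, w_hat))
```

The two results differ only by the rounding error of the 4-bit codes, which is at most half a quantization step per weight in this example.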
| ## Prerequisites | ||
|
|
||
| ### Quantization environment (x64, GPU recommended) | ||
| ```bash | ||
| pip install -r requirements.txt | ||
| pip install cupy-cuda12x # accelerates OnnxKQuantQuantization (19–51× speedup) | ||
| ``` | ||
|
|
||
| ### AOT compilation environment (separate venv, x64 with QNN SDK) | ||
| ```bash | ||
| pip install olive-ai mobius-ai | ||
| pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps | ||
| ``` | ||
|
|
||
| Replace `/path/to/qnn/env/bin` in `config.json` with the directory | ||
| containing your QNN venv's Python executable. | ||
|
|
||
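The path substitution can also be scripted. The sketch below patches `systems.qnn_system.python_environment_path` in an in-memory copy of the config; `set_qnn_env_path` is a hypothetical helper and `/home/me/venvs/qnn/bin` is a placeholder path, not something the recipe prescribes.

```python
import json


def set_qnn_env_path(config: dict, env_bin: str) -> dict:
    """Return a copy of the config with the QNN venv's bin directory filled in."""
    updated = json.loads(json.dumps(config))  # cheap deep copy for JSON data
    updated["systems"]["qnn_system"]["python_environment_path"] = env_bin
    return updated


# Minimal stand-in for the "systems" section of the recipe's config.json.
config = {
    "systems": {
        "qnn_system": {
            "type": "PythonEnvironment",
            "python_environment_path": "/path/to/qnn/env/bin",
        }
    }
}

patched = set_qnn_env_path(config, "/home/me/venvs/qnn/bin")  # placeholder path
print(json.dumps(patched, indent=2))
```

For the real file, read it with `json.load`, apply the helper, and write the result back with `json.dump`.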
| ### Inference (Snapdragon device) | ||
| ```bash | ||
| pip install onnxruntime-qnn onnxruntime-genai | ||
| ``` | ||
|
|
||
| ## Build | ||
|
|
||
| Run `olive run` from the **quantization environment** (not the QNN AOT | ||
| venv). Olive invokes the QNN AOT venv automatically via the | ||
| `python_environment_path` configured under `systems.qnn_system` for the | ||
| `EPContextBinaryGenerator` pass: | ||
|
|
||
| ```bash | ||
| olive run --config config.json | ||
| ``` | ||
|
|
||
| ## Why no `GraphSurgeries` here? | ||
|
|
||
| Most existing QNN recipes (Phi-3, Qwen) chain surgeries like | ||
| `RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`, | ||
| `SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific | ||
| sub-graphs into shapes HTP can lower: | ||
|
|
||
| | Surgery | What it does | Why ModelBuilder needs it | | ||
| |---|---|---| | ||
| | `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel | | ||
| | `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector | | ||
| | `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens | | ||
|
|
||
| `MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`, | ||
| `Attention`) instead of the contrib variants, so these surgeries are | ||
| either no-ops or inapplicable. Gemma-4–specific surgeries may still be | ||
| needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`), | ||
| but the surgery set borrowed from the Phi-3 recipes does not provide them. | ||
|
|
||
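For reference, the soft-cap expression `cap * tanh(x / cap)` is small enough to evaluate on the host if it has to be kept out of the compiled graph. This is a minimal sketch; the `cap` value of 30.0 is an illustrative assumption, not a value taken from the Gemma 4 configuration.

```python
import math


def soft_cap(logits, cap):
    """Apply cap * tanh(x / cap) element-wise; outputs stay inside (-cap, cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


raw = [-100.0, -1.0, 0.0, 1.0, 100.0]
capped = soft_cap(raw, cap=30.0)  # cap=30.0 is assumed for illustration only

for before, after in zip(raw, capped):
    print(f"{before:8.2f} -> {after:8.4f}")
```

Small logits pass through almost unchanged, while large ones saturate just below `cap`. The function is monotonic, so the arg-max token is unaffected, but sampling probabilities are.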
| ## Limitations | ||
|
|
||
| This recipe has **not yet been validated end-to-end**. Known gaps: | ||
|
|
||
| 1. **Logit soft-cap may not lower to HTP.** Gemma 4's | ||
| `logits = cap * tanh(logits / cap)` is unusual for QNN. If HTP | ||
| rejects it, options are (a) skip soft-cap during QNN compile and | ||
| apply it in host post-processing, or (b) add a | ||
| `RemoveLogitSoftcap` GraphSurgery upstream in Olive. | ||
|
|
||
| 2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B | ||
| alternates local sliding-window (head_dim=256) and global | ||
| (head_dim=512) attention layers. Whether HTP can dispatch this | ||
| correctly per-layer needs testing. | ||
|
|
||
| 3. **`per_layer_inputs` data flow.** The embedding component emits a | ||
| second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by | ||
| every decoder block. When all components compile to QNN this should | ||
| "just work" (the data path stays inside the package), but the | ||
| `StaticLLM` pass may need a hint to recognise this extra tensor. | ||
|
|
||
| 4. **256k tokenizer calibration.** `wikitext-2` calibration likely | ||
| under-represents Gemma 4's image / audio special tokens. Consider | ||
| augmenting with multimodal-formatted prompts before production. | ||
|
|
||
| 5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN | ||
| recipes; tune it to the memory budget of the target Snapdragon SKU. | ||
|
|
||
| 6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius only | ||
| emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)` | ||
| when the EP capability advertises `gqa_dtypes`. The QNN EP | ||
| capability in mobius currently has an empty `gqa_dtypes` list, so | ||
| `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`) | ||
| falls back to the standard opset-23 `Attention` with an | ||
| `attention_mask` input. QNN's HTP backend should have an attention | ||
| kernel for the standard op, but if it doesn't lower well there are | ||
| two options: | ||
| - extend mobius `ep_capabilities()` to advertise QNN-supported | ||
| dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly | ||
| (no GraphSurgery needed); or | ||
| - port `AttentionMaskToSequenceLengths` to operate on standard | ||
| `Attention` (it currently checks for `GroupQueryAttention` only | ||
| and no-ops otherwise). | ||
|
|
||
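The mask-to-lengths conversion discussed in item 6 can be illustrated in a few lines. This sketch assumes the `GroupQueryAttention` convention in which `seqlens_k` holds, per batch row, the number of valid key positions minus one, and `total_seq_len` is the padded length `T`; verify this against the ONNX Runtime contrib-op documentation before relying on it. `mask_to_sequence_lengths` is a hypothetical helper, not part of Olive or mobius.

```python
def mask_to_sequence_lengths(attention_mask):
    """Convert a [B, T] 0/1 attention mask into (seqlens_k, total_seq_len).

    Assumed convention: seqlens_k[b] = (number of valid tokens in row b) - 1.
    """
    total_seq_len = len(attention_mask[0])
    seqlens_k = [sum(row) - 1 for row in attention_mask]
    return seqlens_k, total_seq_len


# Batch of two sequences padded to T=4; the first has one padding position.
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

seqlens_k, total_seq_len = mask_to_sequence_lengths(mask)
print("seqlens_k:", seqlens_k)
print("total_seq_len:", total_seq_len)
```

Because the recipe fixes `batch_size` to 1, the per-row list collapses to a single value in practice.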
| ## Discussion | ||
|
|
||
| If you have a Snapdragon test rig and the pipeline blows up on a | ||
| specific pass, please drop the trace in a comment — this recipe is | ||
| intentionally a template for iteration, not a finished product. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| { | ||
| "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" }, | ||
| "systems": { | ||
| "qnn_system": { | ||
| "type": "PythonEnvironment", | ||
| "python_environment_path": "/path/to/qnn/env/bin", | ||
| "accelerators": [ | ||
| { "device": "npu", "execution_providers": ["QNNExecutionProvider"] } | ||
| ] | ||
| } | ||
| }, | ||
| "data_configs": [ | ||
| { | ||
| "name": "wikitext2_train_act", | ||
| "type": "HuggingfaceContainer", | ||
| "load_dataset_config": { | ||
| "data_name": "wikitext", | ||
| "subset": "wikitext-2-raw-v1", | ||
| "split": "train" | ||
| }, | ||
| "pre_process_data_config": { | ||
| "strategy": "line-by-line", | ||
| "add_special_tokens": true, | ||
| "max_samples": 256, | ||
| "max_seq_len": 1024 | ||
| } | ||
| } | ||
| ], | ||
| "passes": { | ||
| "mobius_build": { | ||
| "type": "MobiusBuilder", | ||
| "precision": "fp32" | ||
| }, | ||
| "int4_quantize": { | ||
| "type": "OnnxKQuantQuantization", | ||
| "bits": 4, | ||
| "block_size": 32, | ||
| "save_as_external_data": true | ||
| }, | ||
| "mnb_to_qdq": { | ||
| "type": "MatMulNBitsToQDQ", | ||
| "use_int4": true, | ||
| "add_zero_point": true, | ||
| "save_as_external_data": true | ||
| }, | ||
|
Comment on lines +40 to +45 (Contributor, Author): The PR description was stale. |
||
| "static_quant": { | ||
| "type": "OnnxStaticQuantization", | ||
| "data_config": "wikitext2_train_act", | ||
| "activation_type": "uint16", | ||
| "precision": "uint8", | ||
| "calibration_providers": ["CUDAExecutionProvider"], | ||
| "quant_preprocess": true, | ||
| "op_types_to_exclude": [ | ||
| "GatherBlockQuantized", | ||
| "GroupQueryAttention", | ||
| "MatMulNBits" | ||
| ], | ||
| "save_as_external_data": true | ||
| }, | ||
| "static_llm": { | ||
| "type": "StaticLLM", | ||
| "batch_size": 1, | ||
| "context_length": 64 | ||
| }, | ||
| "ep_context": { | ||
| "type": "EPContextBinaryGenerator", | ||
| "provider_options": { | ||
| "htp_performance_mode": "burst", | ||
| "htp_graph_finalization_optimization_mode": "3", | ||
| "soc_model": "60" | ||
| }, | ||
| "weight_sharing": true | ||
| }, | ||
| "compose": { "type": "ComposeOnnxModels" } | ||
| }, | ||
| "target": "qnn_system", | ||
| "output_dir": "model/gemma4_e2b_qnn", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true, | ||
| "evaluate_input_model": false | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| keywords: | ||
| - foundry-local | ||
| - qnn | ||
| - gemma4 | ||
| - mobius | ||
| arch: gemma | ||
| recipes: | ||
| - name: gemma4-e2b-qnn | ||
| file: config.json | ||
| devices: | ||
| - npu | ||
| eps: QNNExecutionProvider | ||
| name: gemma4-e2b-qnn |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| datasets | ||
| mobius-ai | ||
| olive-ai | ||
| onnxruntime-gpu | ||
| transformers |
Added an explicit note in the Build section: run `olive run` from the quantization environment; Olive invokes the QNN AOT venv automatically via `systems.qnn_system.python_environment_path` for the `EPContextBinaryGenerator` pass. Fixed in cbf992c.