From 01b7061c44e706a13f67f31a48613de70f9694bb Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 05:47:31 +0000 Subject: [PATCH 1/6] =?UTF-8?q?Add=20Gemma=204=20E2B=20QNN=20(Snapdragon?= =?UTF-8?q?=20Hexagon=20NPU)=20recipe=20=E2=80=94=20WIP?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds google-gemma-4-E2B-it/QNN/ as a starting-point recipe for compiling Gemma 4's text decoder into a QNN EPContext binary for HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+. Pipeline: MobiusBuilder fp32 → OnnxKQuantQuantization (INT4 weights) → MatMulNBitsToQDQ → GraphSurgeries (RemoveRopeMultiCache / AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm) → OnnxStaticQuantization (uint16 act / uint8 wt) → SplitModel + StaticLLM → EPContextBinaryGenerator (HTP blob) → ComposeOnnxModels Marked WORK IN PROGRESS in the README. Known limitations called out explicitly: 1) MobiusBuilder always exports the multimodal 4-component package for google/gemma-4-E2B-it; no current way to force the text-only gemma4_text path from the recipe config. Splitting the QNN passes to apply only to the decoder component is still TODO. 2) GraphSurgeries borrowed from Phi-3 / Qwen QNN recipes have not been verified against Gemma 4's hybrid local/global attention, dual head_dim KV cache, or final logit soft-capping (tanh-cap). 3) per_layer_inputs (second embedding output, consumed by every decoder block) needs custom split orchestration if embedding stays on CPU and decoder runs on HTP. 4) Calibration via wikitext-2 may under-represent multimodal-format tokens (256k vocab includes vision/audio specials). 5) StaticLLM context_length=64 is a placeholder for HW tuning. Filed as exploratory template so other contributors with Snapdragon HW can iterate. 
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 112 +++++++++++++++++++++ google-gemma-4-E2B-it/QNN/config.json | 91 +++++++++++++++++ google-gemma-4-E2B-it/QNN/info.yml | 13 +++ google-gemma-4-E2B-it/QNN/requirements.txt | 5 + 4 files changed, 221 insertions(+) create mode 100644 google-gemma-4-E2B-it/QNN/README.md create mode 100644 google-gemma-4-E2B-it/QNN/config.json create mode 100644 google-gemma-4-E2B-it/QNN/info.yml create mode 100644 google-gemma-4-E2B-it/QNN/requirements.txt diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md new file mode 100644 index 000000000..8a8e98a64 --- /dev/null +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -0,0 +1,112 @@ +# Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe + +> **Status: WORK IN PROGRESS / EXPLORATORY.** This recipe is a starting +> point for running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) +> on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution +> provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.). +> It has *not* yet been end-to-end validated on hardware — see the +> [Limitations](#limitations) section below before using it in +> production. + +This recipe targets the **text decoder only**. Gemma 4's vision and +audio encoders run on CPU (`google-gemma-4-E2B-it/cpu/`) or GPU +(`google-gemma-4-E2B-it/cuda/`); only the LM decoder is compiled into +an EPContext binary for HTP execution. 
+ +## Pipeline overview + +``` +HfModel (multimodal Gemma4) + ↓ MobiusBuilder fp32 → ORT GenAI multi-component package + ↓ OnnxKQuantQuantization → INT4 weights (decoder only) + ↓ MatMulNBitsToQDQ → QDQ format for static quantization + ↓ GraphSurgeries → QNN-friendly graph (Rope unmerge, mask, L2Norm) + ↓ OnnxStaticQuantization → activations uint16 / weights uint8 (calibrated) + ↓ SplitModel + StaticLLM → static-shape sub-graphs for QNN + ↓ EPContextBinaryGenerator → compiled HTP EPContext blob +``` + +## Prerequisites + +### Quantization environment (x64 with CUDA GPU) +Quantization (especially `OnnxStaticQuantization`) is resource-intensive +and accelerated by GPU: + +```bash +pip install -r requirements.txt +``` + +### AOT compilation environment (separate venv, x64 with QNN SDK) +Compilation into the EPContext binary requires +`onnxruntime-qnn`: + +```bash +pip install olive-ai mobius-ai +pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps +``` + +Set `/path/to/qnn/env/bin` in `config.json` to the directory containing +your QNN venv's Python executable. + +### Inference environment (Snapdragon device) +On Copilot+ PC / Snapdragon X / Android (Snapdragon 8 Gen 3+): + +```bash +pip install onnxruntime-qnn onnxruntime-genai +``` + +## Build + +```bash +olive run --config config.json +``` + +Output is a self-contained EPContext binary package suitable for QNN HTP +execution. + +## Limitations + +This recipe has **not yet been validated end-to-end**. Known gaps: + +1. **Multimodal `google/gemma-4-E2B-it` always produces a 4-component + package** (decoder + embedding + vision + audio). 
MobiusBuilder + currently does not expose a `model_type` / `module_class` override + to force the text-only `gemma4_text` build, so the pipeline must + either: (a) run the QNN passes only on the `decoder` component and + leave the others as fp32, or (b) wait for the upstream MobiusBuilder + to gain a `module_class` parameter. This recipe assumes (a) but the + QNN pass chain currently expects a single ONNX model — the + integration is still TODO. + +2. **Gemma 4's exotic ops may break QNN GraphSurgeries.** The recipe + borrows surgeries (`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`, + `SimplifiedLayerNormToL2Norm`) from Phi-3 / Qwen QNN recipes; they + have not been verified against Gemma 4's hybrid local/global + attention, `tie_word_embeddings`, dual head_dim KV cache, or final + logit soft-capping (`logits = cap * tanh(logits / cap)`). The + soft-cap subgraph in particular may not lower cleanly to HTP — may + need to be folded into the logit lookup or skipped during QNN + compilation. + +3. **Per-layer-input embeddings.** Gemma 4 E2B emits a second embedding + output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by every + decoder block. The split between embedding-on-CPU and decoder-on-HTP + needs a custom orchestrator (or both sub-models compiled into QNN + together). + +4. **Calibration data shape.** `OnnxStaticQuantization` calibration uses + `wikitext-2`; for Gemma 4 (which has a 256k tokenizer including + image / audio special tokens) the calibration set may under-represent + tokens that actually appear at inference time. Consider augmenting + with multimodal-formatted prompts. + +5. **`StaticLLM context_length=64`.** Mirrors Phi-3 / Qwen QNN recipes + but is unlikely to be useful for a real Gemma 4 deployment; tune to + target Snapdragon SKU memory budget once HW validation is possible. 
+ +## Discussion + +If you have a Snapdragon test rig and run into specific failures, +please add a comment with the trace; this recipe is intended as a +template that other contributors can iterate on rather than a +production-ready pipeline. diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json new file mode 100644 index 000000000..e5320eb76 --- /dev/null +++ b/google-gemma-4-E2B-it/QNN/config.json @@ -0,0 +1,91 @@ +{ + "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" }, + "systems": { + "qnn_system": { + "type": "PythonEnvironment", + "python_environment_path": "/path/to/qnn/env/bin", + "accelerators": [ + { "device": "npu", "execution_providers": ["QNNExecutionProvider"] } + ] + } + }, + "data_configs": [ + { + "name": "wikitext2_train_act", + "type": "HuggingfaceContainer", + "load_dataset_config": { + "data_name": "wikitext", + "subset": "wikitext-2-raw-v1", + "split": "train" + }, + "pre_process_data_config": { + "strategy": "line-by-line", + "add_special_tokens": true, + "max_samples": 256, + "max_seq_len": 1024 + } + } + ], + "passes": { + "mobius_build": { + "type": "MobiusBuilder", + "precision": "fp32" + }, + "matmul_nbits": { + "type": "OnnxKQuantQuantization", + "bits": 4, + "block_size": 32, + "save_as_external_data": true + }, + "mq": { + "type": "MatMulNBitsToQDQ", + "use_int4": true, + "add_zero_point": true, + "save_as_external_data": true + }, + "gs": { + "type": "GraphSurgeries", + "surgeries": [ + { "surgeon": "RemoveRopeMultiCache" }, + { "surgeon": "AttentionMaskToSequenceLengths" }, + { "surgeon": "SimplifiedLayerNormToL2Norm" } + ], + "save_as_external_data": true + }, + "sq": { + "type": "OnnxStaticQuantization", + "data_config": "wikitext2_train_act", + "activation_type": "uint16", + "precision": "uint8", + "calibration_providers": ["CUDAExecutionProvider"], + "quant_preprocess": true, + "op_types_to_exclude": [ + "GatherBlockQuantized", + "GroupQueryAttention", + 
"MatMulNBits" + ], + "save_as_external_data": true + }, + "sp": { "type": "SplitModel" }, + "st": { + "type": "StaticLLM", + "batch_size": 1, + "context_length": 64 + }, + "cb": { + "type": "EPContextBinaryGenerator", + "provider_options": { + "htp_performance_mode": "burst", + "htp_graph_finalization_optimization_mode": "3", + "soc_model": "60" + }, + "weight_sharing": true + }, + "cp": { "type": "ComposeOnnxModels" } + }, + "target": "qnn_system", + "output_dir": "model/gemma4_e2b_qnn", + "cache_dir": "cache", + "no_artifacts": true, + "evaluate_input_model": false +} diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml new file mode 100644 index 000000000..ff96347b7 --- /dev/null +++ b/google-gemma-4-E2B-it/QNN/info.yml @@ -0,0 +1,13 @@ +keywords: + - foundry-local + - qnn + - gemma4 + - mobius +arch: gemma +recipes: + - name: gemma4-e2b-qnn + file: config.json + devices: + - npu + eps: QNNExecutionProvider +name: gemma4_e2b_qnn diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt new file mode 100644 index 000000000..7e6f5c86f --- /dev/null +++ b/google-gemma-4-E2B-it/QNN/requirements.txt @@ -0,0 +1,5 @@ +datasets +mobius-ai +olive-ai +onnxruntime-gpu +transformers>=5.0 From 4b8b6d0b03a583012052c8d01e1fd3caff091bae Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 05:53:54 +0000 Subject: [PATCH 2/6] Gemma 4 E2B QNN recipe: compile all 4 components, drop ModelBuilder surgeries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All four Gemma 4 components (decoder + embedding + vision_encoder + audio_encoder) compile to QNN EPContext binaries together. Olive's CompositeModelHandler dispatch runs quant + StaticLLM per component automatically, then EPContextBinaryGenerator + ComposeOnnxModels (both _accepts_composite_model = True) finalise the multimodal package. 
Drop: * SplitModel — not needed when all components stay on QNN * MatMulNBitsToQDQ — was a ModelBuilder-specific stepping stone * GraphSurgeries with RemoveRopeMultiCache / AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm — those rewrite ModelBuilder contrib ops that mobius does not emit in the first place (mobius uses opset-23 RMSNormalization / Attention, not com.microsoft variants) The README now explains the surgery removal explicitly and lists what might still need a Gemma-4–specific upstream surgery (logit soft-cap). Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 140 +++++++++++++------------- google-gemma-4-E2B-it/QNN/config.json | 26 +---- 2 files changed, 74 insertions(+), 92 deletions(-) diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md index 8a8e98a64..fa12a835f 100644 --- a/google-gemma-4-E2B-it/QNN/README.md +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -1,56 +1,51 @@ # Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe -> **Status: WORK IN PROGRESS / EXPLORATORY.** This recipe is a starting -> point for running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) +> **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for +> running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it) > on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution > provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.). -> It has *not* yet been end-to-end validated on hardware — see the -> [Limitations](#limitations) section below before using it in -> production. +> Has *not* yet been validated on hardware — see +> [Limitations](#limitations). -This recipe targets the **text decoder only**. Gemma 4's vision and -audio encoders run on CPU (`google-gemma-4-E2B-it/cpu/`) or GPU -(`google-gemma-4-E2B-it/cuda/`); only the LM decoder is compiled into -an EPContext binary for HTP execution. 
+## Approach -## Pipeline overview +All four Gemma 4 components (decoder, embedding, vision_encoder, +audio_encoder) are compiled into QNN EPContext binaries together. +Olive's per-component dispatch on `CompositeModelHandler` runs each +pass on every component, then `EPContextBinaryGenerator` and +`ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize +the multimodal package. + +## Pipeline ``` -HfModel (multimodal Gemma4) - ↓ MobiusBuilder fp32 → ORT GenAI multi-component package - ↓ OnnxKQuantQuantization → INT4 weights (decoder only) - ↓ MatMulNBitsToQDQ → QDQ format for static quantization - ↓ GraphSurgeries → QNN-friendly graph (Rope unmerge, mask, L2Norm) - ↓ OnnxStaticQuantization → activations uint16 / weights uint8 (calibrated) - ↓ SplitModel + StaticLLM → static-shape sub-graphs for QNN - ↓ EPContextBinaryGenerator → compiled HTP EPContext blob +HfModel (multimodal Gemma 4) + ↓ MobiusBuilder (fp32) 4 ONNX components + genai_config + tokenizer + processors + ↓ OnnxKQuantQuantization (INT4) mobius-standard Q4_K_M quant (per component) + ↓ OnnxStaticQuantization activations uint16 / weights uint8 (calibrated) + ↓ StaticLLM static shapes for QNN + ↓ EPContextBinaryGenerator HTP EPContext blobs (per component, weight-shared) + ↓ ComposeOnnxModels final package ``` ## Prerequisites -### Quantization environment (x64 with CUDA GPU) -Quantization (especially `OnnxStaticQuantization`) is resource-intensive -and accelerated by GPU: - +### Quantization environment (x64, GPU recommended) ```bash pip install -r requirements.txt +pip install cupy-cuda12x # accelerates OnnxKQuantQuantization (19–51× speedup) ``` ### AOT compilation environment (separate venv, x64 with QNN SDK) -Compilation into the EPContext binary requires -`onnxruntime-qnn`: - ```bash pip install olive-ai mobius-ai pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps ``` -Set `/path/to/qnn/env/bin` 
in `config.json` to the directory containing -your QNN venv's Python executable. - -### Inference environment (Snapdragon device) -On Copilot+ PC / Snapdragon X / Android (Snapdragon 8 Gen 3+): +Replace `/path/to/qnn/env/bin` in `config.json` with the directory +containing your QNN venv's Python executable. +### Inference (Snapdragon device) ```bash pip install onnxruntime-qnn onnxruntime-genai ``` @@ -61,52 +56,55 @@ pip install onnxruntime-qnn onnxruntime-genai olive run --config config.json ``` -Output is a self-contained EPContext binary package suitable for QNN HTP -execution. +## Why no `GraphSurgeries` here? + +Most existing QNN recipes (Phi-3, Qwen) chain surgeries like +`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`, +`SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific +sub-graphs into forms HTP can lower: + +| Surgery | What it does | Why ModelBuilder needs it | +|---|---|---| +| `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel | +| `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector | +| `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens | + +`MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`, +`Attention`) instead of the contrib variants, so these surgeries are +either no-ops or inapplicable. Gemma-4–specific surgeries may still be +needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`), +but the set borrowed from the Phi-3 recipes does not cover them. ## Limitations This recipe has **not yet been validated end-to-end**. Known gaps: -1. **Multimodal `google/gemma-4-E2B-it` always produces a 4-component - package** (decoder + embedding + vision + audio).
MobiusBuilder - currently does not expose a `model_type` / `module_class` override - to force the text-only `gemma4_text` build, so the pipeline must - either: (a) run the QNN passes only on the `decoder` component and - leave the others as fp32, or (b) wait for the upstream MobiusBuilder - to gain a `module_class` parameter. This recipe assumes (a) but the - QNN pass chain currently expects a single ONNX model — the - integration is still TODO. - -2. **Gemma 4's exotic ops may break QNN GraphSurgeries.** The recipe - borrows surgeries (`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`, - `SimplifiedLayerNormToL2Norm`) from Phi-3 / Qwen QNN recipes; they - have not been verified against Gemma 4's hybrid local/global - attention, `tie_word_embeddings`, dual head_dim KV cache, or final - logit soft-capping (`logits = cap * tanh(logits / cap)`). The - soft-cap subgraph in particular may not lower cleanly to HTP — may - need to be folded into the logit lookup or skipped during QNN - compilation. - -3. **Per-layer-input embeddings.** Gemma 4 E2B emits a second embedding - output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by every - decoder block. The split between embedding-on-CPU and decoder-on-HTP - needs a custom orchestrator (or both sub-models compiled into QNN - together). - -4. **Calibration data shape.** `OnnxStaticQuantization` calibration uses - `wikitext-2`; for Gemma 4 (which has a 256k tokenizer including - image / audio special tokens) the calibration set may under-represent - tokens that actually appear at inference time. Consider augmenting - with multimodal-formatted prompts. - -5. **`StaticLLM context_length=64`.** Mirrors Phi-3 / Qwen QNN recipes - but is unlikely to be useful for a real Gemma 4 deployment; tune to - target Snapdragon SKU memory budget once HW validation is possible. +1. **Logit soft-cap may not lower to HTP.** Gemma 4's + `logits = cap * tanh(logits / cap)` is unusual for QNN. 
If HTP + rejects it, options are (a) skip soft-cap during QNN compile and + apply it in host post-processing, or (b) add a + `RemoveLogitSoftcap` GraphSurgery upstream in Olive. + +2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B + alternates local sliding-window (head_dim=256) and global + (head_dim=512) attention layers. Whether HTP can dispatch this + correctly per-layer needs testing. + +3. **`per_layer_inputs` data flow.** The embedding component emits a + second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by + every decoder block. When all components compile to QNN this should + "just work" (the data path stays inside the package), but the + `StaticLLM` pass may need a hint to recognise this extra tensor. + +4. **256k tokenizer calibration.** `wikitext-2` calibration likely + under-represents Gemma 4's image / audio special tokens. Consider + augmenting with multimodal-formatted prompts before production. + +5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN + recipes; tune to target Snapdragon SKU memory budget. ## Discussion -If you have a Snapdragon test rig and run into specific failures, -please add a comment with the trace; this recipe is intended as a -template that other contributors can iterate on rather than a -production-ready pipeline. +If you have a Snapdragon test rig and the pipeline blows up on a +specific pass, please drop the trace in a comment — this recipe is +intentionally a template for iteration, not a finished product. 
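Reviewer note on limitation 1, option (a): applying the soft-cap on the host after the HTP graph returns raw logits could look roughly like the sketch below. This is an illustration only. The function name `apply_logit_softcap` and the `cap=30.0` value are assumptions that do not come from this patch; the real cap would have to be read from the model configuration.

```python
import math


def apply_logit_softcap(logits, cap):
    """Return cap * tanh(x / cap) for every raw logit x (hypothetical helper)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


# Placeholder values: raw logits as they might come back from the compiled graph.
raw = [-100.0, -1.0, 0.0, 1.0, 100.0]
capped = apply_logit_softcap(raw, cap=30.0)  # cap value is an assumption
print(capped)
```

Because `tanh` is strictly increasing, the soft-cap does not change which token has the largest logit, so greedy decoding would be unaffected by skipping it. It changes the probabilities used for sampling, which is where a host-side step like this would matter.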
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json index e5320eb76..dd9bc0458 100644 --- a/google-gemma-4-E2B-it/QNN/config.json +++ b/google-gemma-4-E2B-it/QNN/config.json @@ -31,28 +31,13 @@ "type": "MobiusBuilder", "precision": "fp32" }, - "matmul_nbits": { + "int4_quantize": { "type": "OnnxKQuantQuantization", "bits": 4, "block_size": 32, "save_as_external_data": true }, - "mq": { - "type": "MatMulNBitsToQDQ", - "use_int4": true, - "add_zero_point": true, - "save_as_external_data": true - }, - "gs": { - "type": "GraphSurgeries", - "surgeries": [ - { "surgeon": "RemoveRopeMultiCache" }, - { "surgeon": "AttentionMaskToSequenceLengths" }, - { "surgeon": "SimplifiedLayerNormToL2Norm" } - ], - "save_as_external_data": true - }, - "sq": { + "static_quant": { "type": "OnnxStaticQuantization", "data_config": "wikitext2_train_act", "activation_type": "uint16", @@ -66,13 +51,12 @@ ], "save_as_external_data": true }, - "sp": { "type": "SplitModel" }, - "st": { + "static_llm": { "type": "StaticLLM", "batch_size": 1, "context_length": 64 }, - "cb": { + "ep_context": { "type": "EPContextBinaryGenerator", "provider_options": { "htp_performance_mode": "burst", @@ -81,7 +65,7 @@ }, "weight_sharing": true }, - "cp": { "type": "ComposeOnnxModels" } + "compose": { "type": "ComposeOnnxModels" } }, "target": "qnn_system", "output_dir": "model/gemma4_e2b_qnn", From 1e7c18658365512a3820c3c3d43fb82351629169 Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 06:05:40 +0000 Subject: [PATCH 3/6] =?UTF-8?q?Gemma=204=20E2B=20QNN:=20restore=20MatMulNB?= =?UTF-8?q?itsToQDQ=20=E2=80=94=20QNN=20can't=20run=20MatMulNBits?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit OnnxKQuantQuantization emits com.microsoft::MatMulNBits which is fast on CPU / CUDA but not in the QNN EP's supported-op list. 
Without MatMulNBitsToQDQ the QNN partitioner rejects every quantized MatMul node and the model silently falls back to CPU — defeating the point of compiling to HTP. Restore MatMulNBitsToQDQ between the INT4 quant and the static activation quant so each MatMulNBits gets rewritten into the standard MatMul + DequantizeLinear pair the QNN partitioner can claim and lower onto HTP. README updated with an explanation of why both passes are needed. Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 12 +++++++++++- google-gemma-4-E2B-it/QNN/config.json | 6 ++++++ 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md index fa12a835f..85b6df874 100644 --- a/google-gemma-4-E2B-it/QNN/README.md +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -21,13 +21,23 @@ the multimodal package. ``` HfModel (multimodal Gemma 4) ↓ MobiusBuilder (fp32) 4 ONNX components + genai_config + tokenizer + processors - ↓ OnnxKQuantQuantization (INT4) mobius-standard Q4_K_M quant (per component) + ↓ OnnxKQuantQuantization (INT4) mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits + ↓ MatMulNBitsToQDQ MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ) ↓ OnnxStaticQuantization activations uint16 / weights uint8 (calibrated) ↓ StaticLLM static shapes for QNN ↓ EPContextBinaryGenerator HTP EPContext blobs (per component, weight-shared) ↓ ComposeOnnxModels final package ``` +Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`? +`OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has +fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list +— without `MatMulNBitsToQDQ` the QNN partitioner rejects every +quantized MatMul and the model silently falls back to CPU. +`MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard +`MatMul + DequantizeLinear` pair so QNN can claim and compile the +subgraph onto HTP. 
+ ## Prerequisites ### Quantization environment (x64, GPU recommended) diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json index dd9bc0458..52fde21e9 100644 --- a/google-gemma-4-E2B-it/QNN/config.json +++ b/google-gemma-4-E2B-it/QNN/config.json @@ -37,6 +37,12 @@ "block_size": 32, "save_as_external_data": true }, + "mnb_to_qdq": { + "type": "MatMulNBitsToQDQ", + "use_int4": true, + "add_zero_point": true, + "save_as_external_data": true + }, "static_quant": { "type": "OnnxStaticQuantization", "data_config": "wikitext2_train_act", From 7029e9b0f49dc58312e1477073638601d40967c0 Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 06:07:52 +0000 Subject: [PATCH 4/6] Gemma 4 QNN README: document standard Attention vs GroupQueryAttention Make explicit that mobius emits opset-23 Attention (with attention_mask input) for QNN, not com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len), because mobius's QNN ep_capabilities() advertises an empty gqa_dtypes list. The existing AttentionMaskToSequenceLengths GraphSurgery is therefore inapplicable (it only rewrites GQA), and no surgery is needed if HTP's standard-attention kernel lowers cleanly. Two follow-up options spelled out if HW shows the standard Attention path is too slow on HTP: (a) extend mobius ep_capabilities for QNN to set gqa_dtypes so the builder emits GQA directly; or (b) port AttentionMaskToSequenceLengths to also rewrite standard Attention (currently it short-circuits when GQA is absent). 
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md index 85b6df874..45d138d5b 100644 --- a/google-gemma-4-E2B-it/QNN/README.md +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -113,6 +113,22 @@ This recipe has **not yet been validated end-to-end**. Known gaps: 5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN recipes; tune to target Snapdragon SKU memory budget. +6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius only + emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)` + when the EP capability advertises `gqa_dtypes`. The QNN EP + capability in mobius currently has an empty `gqa_dtypes` list, so + `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`) + falls back to the standard opset-23 `Attention` with an + `attention_mask` input. QNN's HTP backend should have an attention + kernel for the standard op, but if it doesn't lower well there are + two options: + - extend mobius `ep_capabilities()` to advertise QNN-supported + dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly + (no GraphSurgery needed); or + - port `AttentionMaskToSequenceLengths` to operate on standard + `Attention` (it currently checks for `GroupQueryAttention` only + and no-ops otherwise). 
+ ## Discussion If you have a Snapdragon test rig and the pipeline blows up on a From cbf992c23f16ab4be28dd306bc0d3321723da600 Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 06:16:03 +0000 Subject: [PATCH 5/6] =?UTF-8?q?Gemma=204=20QNN:=20address=20Copilot=20revi?= =?UTF-8?q?ew=20=E2=80=94=20pin=20versions,=20align=20names,=20doc=20env?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * requirements.txt: pin olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match the last-validated versions used by the other QNN recipes in this repo (e.g. microsoft-Phi-3-mini-4k-instruct/QNN/). Keep mobius-ai and transformers>=5.0 unpinned for now since this recipe is still WIP and the validated version set will only stabilize after HW validation. * README: pin onnxruntime-qnn==1.22.2 in the AOT compilation env install command, matching microsoft-Phi-3-mini-4k-instruct/QNN/. * README: state explicitly that 'olive run' runs from the quantization environment, with Olive invoking the QNN AOT venv via systems.qnn_system.python_environment_path for the EPContextBinary pass. Avoids the easy mistake of running 'olive run' from the QNN venv (which lacks GPU quantization deps). * info.yml: align the top-level name (gemma4_e2b_qnn → gemma4-e2b-qnn) with the recipe name so scanner tables aren't ambiguous. PR description updated to drop the stale 'v2 drops MatMulNBitsToQDQ' claim — that pass was restored in 1e7c186 (QNN cannot run MatMulNBits). 
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 9 +++++++-- google-gemma-4-E2B-it/QNN/info.yml | 2 +- google-gemma-4-E2B-it/QNN/requirements.txt | 5 +++-- 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md index 45d138d5b..dc9d089f5 100644 --- a/google-gemma-4-E2B-it/QNN/README.md +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -48,8 +48,8 @@ pip install cupy-cuda12x # accelerates OnnxKQuantQuantization (19–51× speed ### AOT compilation environment (separate venv, x64 with QNN SDK) ```bash -pip install olive-ai mobius-ai -pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps +pip install olive-ai==0.9.3 mobius-ai +pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.22.2" --no-deps ``` Replace `/path/to/qnn/env/bin` in `config.json` with the directory @@ -62,6 +62,11 @@ pip install onnxruntime-qnn onnxruntime-genai ## Build +Run `olive run` from the **quantization environment** (not the QNN AOT +venv). 
Olive invokes the QNN AOT venv automatically via the +`python_environment_path` configured under `systems.qnn_system` for the +`EPContextBinaryGenerator` pass: + ```bash olive run --config config.json ``` diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml index ff96347b7..eda1a22a3 100644 --- a/google-gemma-4-E2B-it/QNN/info.yml +++ b/google-gemma-4-E2B-it/QNN/info.yml @@ -10,4 +10,4 @@ recipes: devices: - npu eps: QNNExecutionProvider -name: gemma4_e2b_qnn +name: gemma4-e2b-qnn diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt index 7e6f5c86f..ce774f410 100644 --- a/google-gemma-4-E2B-it/QNN/requirements.txt +++ b/google-gemma-4-E2B-it/QNN/requirements.txt @@ -1,5 +1,6 @@ datasets mobius-ai -olive-ai -onnxruntime-gpu +olive-ai==0.9.3 +# these are the versions the recipes were last validated with +onnxruntime-gpu==1.21.1 transformers>=5.0 From 811f3a02a65a43d21757eb0e2bc7114614f716e1 Mon Sep 17 00:00:00 2001 From: justinchuby <11205048+justinchuby@users.noreply.github.com> Date: Wed, 27 May 2026 14:20:56 +0000 Subject: [PATCH 6/6] Gemma 4 QNN: unpin requirements / install commands MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mobius isn't published yet, so freezing olive-ai / onnxruntime-gpu / onnxruntime-qnn / transformers at specific versions doesn't help reproducibility — anyone trying this recipe needs the floating latest of each anyway. Revert the version pins added in cbf992c and let upstream tracking ride. When the recipe is hardware-validated and the project starts publishing pinned-version-validated recipes we can revisit. 
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com> --- google-gemma-4-E2B-it/QNN/README.md | 4 ++-- google-gemma-4-E2B-it/QNN/requirements.txt | 7 +++---- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md index dc9d089f5..1a0081f4c 100644 --- a/google-gemma-4-E2B-it/QNN/README.md +++ b/google-gemma-4-E2B-it/QNN/README.md @@ -48,8 +48,8 @@ pip install cupy-cuda12x # accelerates OnnxKQuantQuantization (19–51× speed ### AOT compilation environment (separate venv, x64 with QNN SDK) ```bash -pip install olive-ai==0.9.3 mobius-ai -pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.22.2" --no-deps +pip install olive-ai mobius-ai +pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps ``` Replace `/path/to/qnn/env/bin` in `config.json` with the directory diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt index ce774f410..6a89675c7 100644 --- a/google-gemma-4-E2B-it/QNN/requirements.txt +++ b/google-gemma-4-E2B-it/QNN/requirements.txt @@ -1,6 +1,5 @@ datasets mobius-ai -olive-ai==0.9.3 -# these are the versions the recipes were last validated with -onnxruntime-gpu==1.21.1 -transformers>=5.0 +olive-ai +onnxruntime-gpu +transformers
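
Reviewer note on pass ordering: patch 3/6 restores `MatMulNBitsToQDQ` between the INT4 pass and static quantization, and that ordering is easy to lose in a later edit. A small check along the following lines could guard it. This is a sketch under stated assumptions: it assumes Olive runs passes in the order they are listed under `passes`, the helper names `pass_types` and `qdq_bridge_in_place` are hypothetical, and the inline config is trimmed to the `type` fields of the final `config.json` in this series.

```python
import json

# Trimmed copy of the "passes" section after patch 3/6 (type fields only).
CONFIG_TEXT = """
{
  "passes": {
    "mobius_build": {"type": "MobiusBuilder"},
    "int4_quantize": {"type": "OnnxKQuantQuantization"},
    "mnb_to_qdq": {"type": "MatMulNBitsToQDQ"},
    "static_quant": {"type": "OnnxStaticQuantization"},
    "static_llm": {"type": "StaticLLM"},
    "ep_context": {"type": "EPContextBinaryGenerator"},
    "compose": {"type": "ComposeOnnxModels"}
  }
}
"""


def pass_types(config):
    """List pass types in the order they appear under "passes"."""
    return [spec["type"] for spec in config["passes"].values()]


def qdq_bridge_in_place(types):
    """True if MatMulNBitsToQDQ sits after the INT4 pass and before static quantization."""
    required = ("OnnxKQuantQuantization", "MatMulNBitsToQDQ", "OnnxStaticQuantization")
    if any(name not in types for name in required):
        return False
    positions = [types.index(name) for name in required]
    # Indices are distinct, so sorted order means strictly increasing order.
    return positions == sorted(positions)


types = pass_types(json.loads(CONFIG_TEXT))
print(qdq_bridge_in_place(types))
```

To use it against the real recipe, replace `json.loads(CONFIG_TEXT)` with the parsed contents of `google-gemma-4-E2B-it/QNN/config.json`. The check only inspects ordering; it does not confirm that the QNN partitioner actually claims the resulting nodes.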