microsoft · justinchuby · May 27, 2026
diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
@@ -0,0 +1,141 @@
+# Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe
+
+> **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for
+> running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
+> on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution
+> provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.).
+> It has *not* yet been validated on hardware; see
+> [Limitations](#limitations).
+
+## Approach
+
+All four Gemma 4 components (decoder, embedding, vision_encoder,
+audio_encoder) are compiled into QNN EPContext binaries together.
+Olive's per-component dispatch on `CompositeModelHandler` runs each
+pass on every component, then `EPContextBinaryGenerator` and
+`ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize
+the multimodal package.
+
+## Pipeline
+
+```
+HfModel (multimodal Gemma 4)
+   ↓ MobiusBuilder (fp32)               4 ONNX components + genai_config + tokenizer + processors
+   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
+   ↓ MatMulNBitsToQDQ                   MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
+   ↓ OnnxStaticQuantization             activations uint16 / weights uint8 (calibrated)
+   ↓ StaticLLM                          static shapes for QNN
+   ↓ EPContextBinaryGenerator           HTP EPContext blobs (per component, weight-shared)
+   ↓ ComposeOnnxModels                  final package
+```
+
+Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`?
+`OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has
+fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list
+— without `MatMulNBitsToQDQ` the QNN partitioner rejects every
+quantized MatMul and the model silently falls back to CPU.
+`MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard
+`MatMul + DequantizeLinear` pair so QNN can claim and compile the
+subgraph onto HTP.
+
+## Prerequisites
+
+### Quantization environment (x64, GPU recommended)
+```bash
+pip install -r requirements.txt
+pip install cupy-cuda12x   # accelerates OnnxKQuantQuantization (19–51× speedup)
+```
+
+### AOT compilation environment (separate venv, x64 with QNN SDK)
+```bash
+pip install olive-ai mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
+```
+
+Replace `/path/to/qnn/env/bin` in `config.json` with the directory
+containing your QNN venv's Python executable.
+
+### Inference (Snapdragon device)
+```bash
+pip install onnxruntime-qnn onnxruntime-genai
+```
+
+## Build
+
+Run `olive run` from the **quantization environment** (not the QNN AOT
+venv). Olive invokes the QNN AOT venv automatically via the
+`python_environment_path` configured under `systems.qnn_system` for the
+`EPContextBinaryGenerator` pass:
+
+```bash
+olive run --config config.json
+```
+
+## Why no `GraphSurgeries` here?
+
+Most existing QNN recipes (Phi-3, Qwen) chain surgeries like
+`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
+`SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific
+sub-graphs into forms HTP can lower:
+
+| Surgery | What it does | Why ModelBuilder needs it |
+|---|---|---|
+| `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel |
+| `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector |
+| `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens |
+
+`MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`,
+`Attention`) instead of the contrib variants, so these surgeries are
+either no-ops or inapplicable. Gemma-4–specific surgeries may still be
+needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`),
+but the existing set borrowed from Phi-3 does not provide them.
+
+## Limitations
+
+This recipe has **not yet been validated end-to-end**. Known gaps:
+
+1. **Logit soft-cap may not lower to HTP.** Gemma 4's
+   `logits = cap * tanh(logits / cap)` is unusual for QNN. If HTP
+   rejects it, options are (a) skip soft-cap during QNN compile and
+   apply it in host post-processing, or (b) add a
+   `RemoveLogitSoftcap` GraphSurgery upstream in Olive.
+
+2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B
+   alternates local sliding-window (head_dim=256) and global
+   (head_dim=512) attention layers. Whether HTP can dispatch this
+   correctly per-layer needs testing.
+
+3. **`per_layer_inputs` data flow.** The embedding component emits a
+   second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by
+   every decoder block. When all components compile to QNN this should
+   "just work" (the data path stays inside the package), but the
+   `StaticLLM` pass may need a hint to recognise this extra tensor.
+
+4. **256k tokenizer calibration.** `wikitext-2` calibration likely
+   under-represents Gemma 4's image / audio special tokens. Consider
+   augmenting with multimodal-formatted prompts before production.
+
+5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN
+   recipes; tune it to the target Snapdragon SKU's memory budget.
+
+6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius only
+   emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)`
+   when the EP capability advertises `gqa_dtypes`. The QNN EP
+   capability in mobius currently has an empty `gqa_dtypes` list, so
+   `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`)
+   falls back to the standard opset-23 `Attention` with an
+   `attention_mask` input. QNN's HTP backend should have an attention
+   kernel for the standard op, but if it doesn't lower well there are
+   two options:
+   - extend mobius `ep_capabilities()` to advertise QNN-supported
+     dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly
+     (no GraphSurgery needed); or
+   - port `AttentionMaskToSequenceLengths` to operate on standard
+     `Attention` (it currently checks for `GroupQueryAttention` only
+     and no-ops otherwise).
+
+## Discussion
+
+If you have a Snapdragon test rig and the pipeline blows up on a
+specific pass, please drop the trace in a comment — this recipe is
+intentionally a template for iteration, not a finished product.
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json
@@ -0,0 +1,81 @@
+{
+    "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
+    "systems": {
+        "qnn_system": {
+            "type": "PythonEnvironment",
+            "python_environment_path": "/path/to/qnn/env/bin",
+            "accelerators": [
+                { "device": "npu", "execution_providers": ["QNNExecutionProvider"] }
+            ]
+        }
+    },
+    "data_configs": [
+        {
+            "name": "wikitext2_train_act",
+            "type": "HuggingfaceContainer",
+            "load_dataset_config": {
+                "data_name": "wikitext",
+                "subset": "wikitext-2-raw-v1",
+                "split": "train"
+            },
+            "pre_process_data_config": {
+                "strategy": "line-by-line",
+                "add_special_tokens": true,
+                "max_samples": 256,
+                "max_seq_len": 1024
+            }
+        }
+    ],
+    "passes": {
+        "mobius_build": {
+            "type": "MobiusBuilder",
+            "precision": "fp32"
+        },
+        "int4_quantize": {
+            "type": "OnnxKQuantQuantization",
+            "bits": 4,
+            "block_size": 32,
+            "save_as_external_data": true
+        },
+        "mnb_to_qdq": {
+            "type": "MatMulNBitsToQDQ",
+            "use_int4": true,
+            "add_zero_point": true,
+            "save_as_external_data": true
+        },
+        "static_quant": {
+            "type": "OnnxStaticQuantization",
+            "data_config": "wikitext2_train_act",
+            "activation_type": "uint16",
+            "precision": "uint8",
+            "calibration_providers": ["CUDAExecutionProvider"],
+            "quant_preprocess": true,
+            "op_types_to_exclude": [
+                "GatherBlockQuantized",
+                "GroupQueryAttention",
+                "MatMulNBits"
+            ],
+            "save_as_external_data": true
+        },
+        "static_llm": {
+            "type": "StaticLLM",
+            "batch_size": 1,
+            "context_length": 64
+        },
+        "ep_context": {
+            "type": "EPContextBinaryGenerator",
+            "provider_options": {
+                "htp_performance_mode": "burst",
+                "htp_graph_finalization_optimization_mode": "3",
+                "soc_model": "60"
+            },
+            "weight_sharing": true
+        },
+        "compose": { "type": "ComposeOnnxModels" }
+    },
+    "target": "qnn_system",
+    "output_dir": "model/gemma4_e2b_qnn",
+    "cache_dir": "cache",
+    "no_artifacts": true,
+    "evaluate_input_model": false
+}
diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml
@@ -0,0 +1,13 @@
+keywords:
+    - foundry-local
+    - qnn
+    - gemma4
+    - mobius
+arch: gemma
+recipes:
+  - name: gemma4-e2b-qnn
+    file: config.json
+    devices:
+      - npu
+    eps: QNNExecutionProvider
+name: gemma4-e2b-qnn
diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt
@@ -0,0 +1,5 @@
+datasets
+mobius-ai
+olive-ai
+onnxruntime-gpu
+transformers
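
The README's Pipeline section relies on the passes running in a specific order, in particular `MatMulNBitsToQDQ` sitting between `OnnxKQuantQuantization` and `EPContextBinaryGenerator`. A minimal sketch of a pre-flight check for that ordering follows. The helper names `check_pass_order` and `check_config_file` are hypothetical and not part of Olive, and the sketch assumes Olive executes passes in the order they appear under `passes` in `config.json`.

```python
import json

# Pass types in the order shown in the README's pipeline diagram.
EXPECTED_ORDER = [
    "MobiusBuilder",
    "OnnxKQuantQuantization",
    "MatMulNBitsToQDQ",
    "OnnxStaticQuantization",
    "StaticLLM",
    "EPContextBinaryGenerator",
    "ComposeOnnxModels",
]


def check_pass_order(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the order matches."""
    # json.load keeps key order, so this reflects the order written in the file.
    # Assumption: Olive runs passes in that same order.
    actual = [p.get("type") for p in config.get("passes", {}).values()]

    problems = [f"missing pass: {name}" for name in EXPECTED_ORDER if name not in actual]

    # Compare only the passes we know about, ignoring any extra ones.
    present = [t for t in actual if t in EXPECTED_ORDER]
    expected_present = [t for t in EXPECTED_ORDER if t in present]
    if present != expected_present:
        problems.append(f"unexpected order: {present}")
    return problems


def check_config_file(path: str) -> list[str]:
    """Hypothetical convenience wrapper: load an Olive config and check it."""
    with open(path, encoding="utf-8") as f:
        return check_pass_order(json.load(f))
```

Calling `check_config_file("config.json")` next to the recipe should return an empty list for the config in this PR; any non-empty result is a hint to recheck the `passes` block before spending time on `olive run`.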