From 01b7061c44e706a13f67f31a48613de70f9694bb Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 05:47:31 +0000
Subject: [PATCH 1/6] =?UTF-8?q?Add=20Gemma=204=20E2B=20QNN=20(Snapdragon?=
 =?UTF-8?q?=20Hexagon=20NPU)=20recipe=20=E2=80=94=20WIP?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds google-gemma-4-E2B-it/QNN/ as a starting-point recipe for
compiling Gemma 4's text decoder into a QNN EPContext binary for
HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.

Pipeline:
  MobiusBuilder fp32 → OnnxKQuantQuantization (INT4 weights)
                     → MatMulNBitsToQDQ
                     → GraphSurgeries (RemoveRopeMultiCache /
                       AttentionMaskToSequenceLengths /
                       SimplifiedLayerNormToL2Norm)
                     → OnnxStaticQuantization (uint16 act / uint8 wt)
                     → SplitModel + StaticLLM
                     → EPContextBinaryGenerator (HTP blob)
                     → ComposeOnnxModels

Marked WORK IN PROGRESS in the README. Known limitations called out
explicitly:

  1) MobiusBuilder always exports the multimodal 4-component package
     for google/gemma-4-E2B-it; no current way to force the text-only
     gemma4_text path from the recipe config. Splitting the QNN passes
     to apply only to the decoder component is still TODO.
  2) GraphSurgeries borrowed from Phi-3 / Qwen QNN recipes have not
     been verified against Gemma 4's hybrid local/global attention,
     dual head_dim KV cache, or final logit soft-capping (tanh-cap).
  3) per_layer_inputs (second embedding output, consumed by every
     decoder block) needs custom split orchestration if embedding stays
     on CPU and decoder runs on HTP.
  4) Calibration via wikitext-2 may under-represent multimodal-format
     tokens (256k vocab includes vision/audio specials).
  5) StaticLLM context_length=64 is a placeholder for HW tuning.

Filed as an exploratory template so that other contributors with
Snapdragon hardware can iterate.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md        | 112 +++++++++++++++++++++
 google-gemma-4-E2B-it/QNN/config.json      |  91 +++++++++++++++++
 google-gemma-4-E2B-it/QNN/info.yml         |  13 +++
 google-gemma-4-E2B-it/QNN/requirements.txt |   5 +
 4 files changed, 221 insertions(+)
 create mode 100644 google-gemma-4-E2B-it/QNN/README.md
 create mode 100644 google-gemma-4-E2B-it/QNN/config.json
 create mode 100644 google-gemma-4-E2B-it/QNN/info.yml
 create mode 100644 google-gemma-4-E2B-it/QNN/requirements.txt

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
new file mode 100644
index 000000000..8a8e98a64
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -0,0 +1,112 @@
+# Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe
+
+> **Status: WORK IN PROGRESS / EXPLORATORY.** This recipe is a starting
+> point for running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
+> on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution
+> provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.).
+> It has *not* yet been end-to-end validated on hardware — see the
+> [Limitations](#limitations) section below before using it in
+> production.
+
+This recipe targets the **text decoder only**. Gemma 4's vision and
+audio encoders run on CPU (`google-gemma-4-E2B-it/cpu/`) or GPU
+(`google-gemma-4-E2B-it/cuda/`); only the LM decoder is compiled into
+an EPContext binary for HTP execution.
+
+## Pipeline overview
+
+```
+HfModel (multimodal Gemma4)
+   ↓ MobiusBuilder fp32       → ORT GenAI multi-component package
+   ↓ OnnxKQuantQuantization   → INT4 weights (decoder only)
+   ↓ MatMulNBitsToQDQ         → QDQ format for static quantization
+   ↓ GraphSurgeries           → QNN-friendly graph (Rope unmerge, mask, L2Norm)
+   ↓ OnnxStaticQuantization   → activations uint16 / weights uint8 (calibrated)
+   ↓ SplitModel + StaticLLM   → static-shape sub-graphs for QNN
+   ↓ EPContextBinaryGenerator → compiled HTP EPContext blob
+```
+
+## Prerequisites
+
+### Quantization environment (x64 with CUDA GPU)
+Quantization (especially `OnnxStaticQuantization`) is resource-intensive
+and accelerated by GPU:
+
+```bash
+pip install -r requirements.txt
+```
+
+### AOT compilation environment (separate venv, x64 with QNN SDK)
+Compilation into the EPContext binary requires
+`onnxruntime-qnn`:
+
+```bash
+pip install olive-ai mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps
+```
+
+Set `/path/to/qnn/env/bin` in `config.json` to the directory containing
+your QNN venv's Python executable.
+
+### Inference environment (Snapdragon device)
+On Copilot+ PC / Snapdragon X / Android (Snapdragon 8 Gen 3+):
+
+```bash
+pip install onnxruntime-qnn onnxruntime-genai
+```
+
+## Build
+
+```bash
+olive run --config config.json
+```
+
+Output is a self-contained EPContext binary package suitable for QNN HTP
+execution.
+
+## Limitations
+
+This recipe has **not yet been validated end-to-end**. Known gaps:
+
+1. **Multimodal `google/gemma-4-E2B-it` always produces a 4-component
+   package** (decoder + embedding + vision + audio). MobiusBuilder
+   currently does not expose a `model_type` / `module_class` override
+   to force the text-only `gemma4_text` build, so the pipeline must
+   either: (a) run the QNN passes only on the `decoder` component and
+   leave the others as fp32, or (b) wait for the upstream MobiusBuilder
+   to gain a `module_class` parameter. This recipe assumes (a) but the
+   QNN pass chain currently expects a single ONNX model — the
+   integration is still TODO.
+
+2. **Gemma 4's exotic ops may break QNN GraphSurgeries.** The recipe
+   borrows surgeries (`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
+   `SimplifiedLayerNormToL2Norm`) from Phi-3 / Qwen QNN recipes; they
+   have not been verified against Gemma 4's hybrid local/global
+   attention, `tie_word_embeddings`, dual head_dim KV cache, or final
+   logit soft-capping (`logits = cap * tanh(logits / cap)`). The
+   soft-cap subgraph in particular may not lower cleanly to HTP — may
+   need to be folded into the logit lookup or skipped during QNN
+   compilation.
+
+3. **Per-layer-input embeddings.** Gemma 4 E2B emits a second embedding
+   output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by every
+   decoder block. The split between embedding-on-CPU and decoder-on-HTP
+   needs a custom orchestrator (or both sub-models compiled into QNN
+   together).
+
+4. **Calibration data shape.** `OnnxStaticQuantization` calibration uses
+   `wikitext-2`; for Gemma 4 (which has a 256k tokenizer including
+   image / audio special tokens) the calibration set may under-represent
+   tokens that actually appear at inference time. Consider augmenting
+   with multimodal-formatted prompts.
+
+5. **`StaticLLM context_length=64`.** Mirrors Phi-3 / Qwen QNN recipes
+   but is unlikely to be useful for a real Gemma 4 deployment; tune to
+   target Snapdragon SKU memory budget once HW validation is possible.
+
+## Discussion
+
+If you have a Snapdragon test rig and run into specific failures,
+please add a comment with the trace; this recipe is intended as a
+template that other contributors can iterate on rather than a
+production-ready pipeline.
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json
new file mode 100644
index 000000000..e5320eb76
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/config.json
@@ -0,0 +1,91 @@
+{
+    "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
+    "systems": {
+        "qnn_system": {
+            "type": "PythonEnvironment",
+            "python_environment_path": "/path/to/qnn/env/bin",
+            "accelerators": [
+                { "device": "npu", "execution_providers": ["QNNExecutionProvider"] }
+            ]
+        }
+    },
+    "data_configs": [
+        {
+            "name": "wikitext2_train_act",
+            "type": "HuggingfaceContainer",
+            "load_dataset_config": {
+                "data_name": "wikitext",
+                "subset": "wikitext-2-raw-v1",
+                "split": "train"
+            },
+            "pre_process_data_config": {
+                "strategy": "line-by-line",
+                "add_special_tokens": true,
+                "max_samples": 256,
+                "max_seq_len": 1024
+            }
+        }
+    ],
+    "passes": {
+        "mobius_build": {
+            "type": "MobiusBuilder",
+            "precision": "fp32"
+        },
+        "matmul_nbits": {
+            "type": "OnnxKQuantQuantization",
+            "bits": 4,
+            "block_size": 32,
+            "save_as_external_data": true
+        },
+        "mq": {
+            "type": "MatMulNBitsToQDQ",
+            "use_int4": true,
+            "add_zero_point": true,
+            "save_as_external_data": true
+        },
+        "gs": {
+            "type": "GraphSurgeries",
+            "surgeries": [
+                { "surgeon": "RemoveRopeMultiCache" },
+                { "surgeon": "AttentionMaskToSequenceLengths" },
+                { "surgeon": "SimplifiedLayerNormToL2Norm" }
+            ],
+            "save_as_external_data": true
+        },
+        "sq": {
+            "type": "OnnxStaticQuantization",
+            "data_config": "wikitext2_train_act",
+            "activation_type": "uint16",
+            "precision": "uint8",
+            "calibration_providers": ["CUDAExecutionProvider"],
+            "quant_preprocess": true,
+            "op_types_to_exclude": [
+                "GatherBlockQuantized",
+                "GroupQueryAttention",
+                "MatMulNBits"
+            ],
+            "save_as_external_data": true
+        },
+        "sp": { "type": "SplitModel" },
+        "st": {
+            "type": "StaticLLM",
+            "batch_size": 1,
+            "context_length": 64
+        },
+        "cb": {
+            "type": "EPContextBinaryGenerator",
+            "provider_options": {
+                "htp_performance_mode": "burst",
+                "htp_graph_finalization_optimization_mode": "3",
+                "soc_model": "60"
+            },
+            "weight_sharing": true
+        },
+        "cp": { "type": "ComposeOnnxModels" }
+    },
+    "target": "qnn_system",
+    "output_dir": "model/gemma4_e2b_qnn",
+    "cache_dir": "cache",
+    "no_artifacts": true,
+    "evaluate_input_model": false
+}
diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml
new file mode 100644
index 000000000..ff96347b7
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/info.yml
@@ -0,0 +1,13 @@
+keywords:
+    - foundry-local
+    - qnn
+    - gemma4
+    - mobius
+arch: gemma
+recipes:
+  - name: gemma4-e2b-qnn
+    file: config.json
+    devices:
+      - npu
+    eps: QNNExecutionProvider
+name: gemma4_e2b_qnn
diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt
new file mode 100644
index 000000000..7e6f5c86f
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/requirements.txt
@@ -0,0 +1,5 @@
+datasets
+mobius-ai
+olive-ai
+onnxruntime-gpu
+transformers>=5.0

From 4b8b6d0b03a583012052c8d01e1fd3caff091bae Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 05:53:54 +0000
Subject: [PATCH 2/6] Gemma 4 E2B QNN recipe: compile all 4 components, drop
 ModelBuilder surgeries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

All four Gemma 4 components (decoder + embedding + vision_encoder +
audio_encoder) compile to QNN EPContext binaries together. Olive's
CompositeModelHandler dispatch runs quant + StaticLLM per component
automatically, then EPContextBinaryGenerator + ComposeOnnxModels
(both _accepts_composite_model = True) finalise the multimodal
package.

Drop:

  * SplitModel — not needed when all components stay on QNN
  * MatMulNBitsToQDQ — was a ModelBuilder-specific stepping stone
  * GraphSurgeries with RemoveRopeMultiCache /
    AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm —
    those rewrite ModelBuilder contrib ops that mobius does not emit
    in the first place (mobius uses opset-23 RMSNormalization /
    Attention, not com.microsoft variants)

The README now explains the surgery removal explicitly and lists what
might still need a Gemma-4–specific upstream surgery (logit soft-cap).
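
For reference, the soft-cap formula cap * tanh(logits / cap) could be
applied on the host if HTP rejects the subgraph. The sketch below is an
illustration only: apply_logit_softcap is a hypothetical helper and the
cap of 30.0 is a placeholder, neither is taken from mobius, Olive, or
the Gemma 4 config.

```python
import math


def apply_logit_softcap(logits, cap):
    """Return cap * tanh(x / cap) for every logit x (hypothetical helper)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


if __name__ == "__main__":
    # 30.0 is a placeholder cap, not a value read from the model config.
    print(apply_logit_softcap([-100.0, 0.0, 5.0, 100.0], cap=30.0))
```

Because tanh is strictly increasing, the cap bounds every logit to
(-cap, cap) without changing which token has the largest logit, so
greedy decoding should be unaffected by where the cap is applied.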

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md   | 140 +++++++++++++-------------
 google-gemma-4-E2B-it/QNN/config.json |  26 +----
 2 files changed, 74 insertions(+), 92 deletions(-)

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
index 8a8e98a64..fa12a835f 100644
--- a/google-gemma-4-E2B-it/QNN/README.md
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -1,56 +1,51 @@
 # Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe
 
-> **Status: WORK IN PROGRESS / EXPLORATORY.** This recipe is a starting
-> point for running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
+> **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for
+> running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
 > on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution
 > provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.).
-> It has *not* yet been end-to-end validated on hardware — see the
-> [Limitations](#limitations) section below before using it in
-> production.
+> Has *not* yet been validated on hardware — see
+> [Limitations](#limitations).
 
-This recipe targets the **text decoder only**. Gemma 4's vision and
-audio encoders run on CPU (`google-gemma-4-E2B-it/cpu/`) or GPU
-(`google-gemma-4-E2B-it/cuda/`); only the LM decoder is compiled into
-an EPContext binary for HTP execution.
+## Approach
 
-## Pipeline overview
+All four Gemma 4 components (decoder, embedding, vision_encoder,
+audio_encoder) are compiled into QNN EPContext binaries together.
+Olive's per-component dispatch on `CompositeModelHandler` runs each
+pass on every component, then `EPContextBinaryGenerator` and
+`ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize
+the multimodal package.
+
+## Pipeline
 
 ```
-HfModel (multimodal Gemma4)
-   ↓ MobiusBuilder fp32       → ORT GenAI multi-component package
-   ↓ OnnxKQuantQuantization   → INT4 weights (decoder only)
-   ↓ MatMulNBitsToQDQ         → QDQ format for static quantization
-   ↓ GraphSurgeries           → QNN-friendly graph (Rope unmerge, mask, L2Norm)
-   ↓ OnnxStaticQuantization   → activations uint16 / weights uint8 (calibrated)
-   ↓ SplitModel + StaticLLM   → static-shape sub-graphs for QNN
-   ↓ EPContextBinaryGenerator → compiled HTP EPContext blob
+HfModel (multimodal Gemma 4)
+   ↓ MobiusBuilder (fp32)               4 ONNX components + genai_config + tokenizer + processors
+   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M quant (per component)
+   ↓ OnnxStaticQuantization             activations uint16 / weights uint8 (calibrated)
+   ↓ StaticLLM                          static shapes for QNN
+   ↓ EPContextBinaryGenerator           HTP EPContext blobs (per component, weight-shared)
+   ↓ ComposeOnnxModels                  final package
 ```
 
 ## Prerequisites
 
-### Quantization environment (x64 with CUDA GPU)
-Quantization (especially `OnnxStaticQuantization`) is resource-intensive
-and accelerated by GPU:
-
+### Quantization environment (x64, GPU recommended)
 ```bash
 pip install -r requirements.txt
+pip install cupy-cuda12x   # accelerates OnnxKQuantQuantization (19–51× speedup)
 ```
 
 ### AOT compilation environment (separate venv, x64 with QNN SDK)
-Compilation into the EPContext binary requires
-`onnxruntime-qnn`:
-
 ```bash
 pip install olive-ai mobius-ai
 pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps
 ```
 
-Set `/path/to/qnn/env/bin` in `config.json` to the directory containing
-your QNN venv's Python executable.
-
-### Inference environment (Snapdragon device)
-On Copilot+ PC / Snapdragon X / Android (Snapdragon 8 Gen 3+):
+Replace `/path/to/qnn/env/bin` in `config.json` with the directory
+containing your QNN venv's Python executable.
 
+### Inference (Snapdragon device)
 ```bash
 pip install onnxruntime-qnn onnxruntime-genai
 ```
@@ -61,52 +56,55 @@ pip install onnxruntime-qnn onnxruntime-genai
 olive run --config config.json
 ```
 
-Output is a self-contained EPContext binary package suitable for QNN HTP
-execution.
+## Why no `GraphSurgeries` here?
+
+Most existing QNN recipes (Phi-3, Qwen) chain surgeries like
+`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
+`SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific
+sub-graphs into shapes HTP can lower:
+
+| Surgery | What it does | Why ModelBuilder needs it |
+|---|---|---|
+| `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel |
+| `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector |
+| `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens |
+
+`MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`,
+`Attention`) instead of the contrib variants, so these surgeries are
+either no-ops or inapplicable. Gemma-4–specific surgeries may still be
+needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`),
+but the existing borrowed-from-Phi-3 set is not it.
 
 ## Limitations
 
 This recipe has **not yet been validated end-to-end**. Known gaps:
 
-1. **Multimodal `google/gemma-4-E2B-it` always produces a 4-component
-   package** (decoder + embedding + vision + audio). MobiusBuilder
-   currently does not expose a `model_type` / `module_class` override
-   to force the text-only `gemma4_text` build, so the pipeline must
-   either: (a) run the QNN passes only on the `decoder` component and
-   leave the others as fp32, or (b) wait for the upstream MobiusBuilder
-   to gain a `module_class` parameter. This recipe assumes (a) but the
-   QNN pass chain currently expects a single ONNX model — the
-   integration is still TODO.
-
-2. **Gemma 4's exotic ops may break QNN GraphSurgeries.** The recipe
-   borrows surgeries (`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
-   `SimplifiedLayerNormToL2Norm`) from Phi-3 / Qwen QNN recipes; they
-   have not been verified against Gemma 4's hybrid local/global
-   attention, `tie_word_embeddings`, dual head_dim KV cache, or final
-   logit soft-capping (`logits = cap * tanh(logits / cap)`). The
-   soft-cap subgraph in particular may not lower cleanly to HTP — may
-   need to be folded into the logit lookup or skipped during QNN
-   compilation.
-
-3. **Per-layer-input embeddings.** Gemma 4 E2B emits a second embedding
-   output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by every
-   decoder block. The split between embedding-on-CPU and decoder-on-HTP
-   needs a custom orchestrator (or both sub-models compiled into QNN
-   together).
-
-4. **Calibration data shape.** `OnnxStaticQuantization` calibration uses
-   `wikitext-2`; for Gemma 4 (which has a 256k tokenizer including
-   image / audio special tokens) the calibration set may under-represent
-   tokens that actually appear at inference time. Consider augmenting
-   with multimodal-formatted prompts.
-
-5. **`StaticLLM context_length=64`.** Mirrors Phi-3 / Qwen QNN recipes
-   but is unlikely to be useful for a real Gemma 4 deployment; tune to
-   target Snapdragon SKU memory budget once HW validation is possible.
+1. **Logit soft-cap may not lower to HTP.** Gemma 4's
+   `logits = cap * tanh(logits / cap)` is unusual for QNN. If HTP
+   rejects it, options are (a) skip soft-cap during QNN compile and
+   apply it in host post-processing, or (b) add a
+   `RemoveLogitSoftcap` GraphSurgery upstream in Olive.
+
+2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B
+   alternates local sliding-window (head_dim=256) and global
+   (head_dim=512) attention layers. Whether HTP can dispatch this
+   correctly per-layer needs testing.
+
+3. **`per_layer_inputs` data flow.** The embedding component emits a
+   second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by
+   every decoder block. When all components compile to QNN this should
+   "just work" (the data path stays inside the package), but the
+   `StaticLLM` pass may need a hint to recognise this extra tensor.
+
+4. **256k tokenizer calibration.** `wikitext-2` calibration likely
+   under-represents Gemma 4's image / audio special tokens. Consider
+   augmenting with multimodal-formatted prompts before production.
+
+5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN
+   recipes; tune to target Snapdragon SKU memory budget.
 
 ## Discussion
 
-If you have a Snapdragon test rig and run into specific failures,
-please add a comment with the trace; this recipe is intended as a
-template that other contributors can iterate on rather than a
-production-ready pipeline.
+If you have a Snapdragon test rig and the pipeline blows up on a
+specific pass, please drop the trace in a comment — this recipe is
+intentionally a template for iteration, not a finished product.
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json
index e5320eb76..dd9bc0458 100644
--- a/google-gemma-4-E2B-it/QNN/config.json
+++ b/google-gemma-4-E2B-it/QNN/config.json
@@ -31,28 +31,13 @@
             "type": "MobiusBuilder",
             "precision": "fp32"
         },
-        "matmul_nbits": {
+        "int4_quantize": {
             "type": "OnnxKQuantQuantization",
             "bits": 4,
             "block_size": 32,
             "save_as_external_data": true
         },
-        "mq": {
-            "type": "MatMulNBitsToQDQ",
-            "use_int4": true,
-            "add_zero_point": true,
-            "save_as_external_data": true
-        },
-        "gs": {
-            "type": "GraphSurgeries",
-            "surgeries": [
-                { "surgeon": "RemoveRopeMultiCache" },
-                { "surgeon": "AttentionMaskToSequenceLengths" },
-                { "surgeon": "SimplifiedLayerNormToL2Norm" }
-            ],
-            "save_as_external_data": true
-        },
-        "sq": {
+        "static_quant": {
             "type": "OnnxStaticQuantization",
             "data_config": "wikitext2_train_act",
             "activation_type": "uint16",
@@ -66,13 +51,12 @@
             ],
             "save_as_external_data": true
         },
-        "sp": { "type": "SplitModel" },
-        "st": {
+        "static_llm": {
             "type": "StaticLLM",
             "batch_size": 1,
             "context_length": 64
         },
-        "cb": {
+        "ep_context": {
             "type": "EPContextBinaryGenerator",
             "provider_options": {
                 "htp_performance_mode": "burst",
@@ -81,7 +65,7 @@
             },
             "weight_sharing": true
         },
-        "cp": { "type": "ComposeOnnxModels" }
+        "compose": { "type": "ComposeOnnxModels" }
     },
     "target": "qnn_system",
     "output_dir": "model/gemma4_e2b_qnn",

From 1e7c18658365512a3820c3c3d43fb82351629169 Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 06:05:40 +0000
Subject: [PATCH 3/6] =?UTF-8?q?Gemma=204=20E2B=20QNN:=20restore=20MatMulNB?=
 =?UTF-8?q?itsToQDQ=20=E2=80=94=20QNN=20can't=20run=20MatMulNBits?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

OnnxKQuantQuantization emits com.microsoft::MatMulNBits which is fast
on CPU / CUDA but not in the QNN EP's supported-op list. Without
MatMulNBitsToQDQ the QNN partitioner rejects every quantized MatMul
node and the model silently falls back to CPU — defeating the point
of compiling to HTP.

Restore MatMulNBitsToQDQ between the INT4 quant and the static
activation quant so each MatMulNBits gets rewritten into the standard
MatMul + DequantizeLinear pair the QNN partitioner can claim and
lower onto HTP.

README updated with an explanation of why both passes are needed.
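
A quick way to confirm the rewrite took effect is to scan a component
for leftover MatMulNBits nodes before handing it to the QNN compile
step. This is a minimal sketch, not part of the recipe:
find_matmulnbits is a hypothetical helper, and it only inspects the
top-level graph (nodes nested in subgraphs are not visited).

```python
from types import SimpleNamespace


def find_matmulnbits(nodes):
    """Return names of nodes that are still com.microsoft::MatMulNBits."""
    return [
        node.name
        for node in nodes
        if node.op_type == "MatMulNBits" and node.domain == "com.microsoft"
    ]


if __name__ == "__main__":
    # Stand-in nodes so the sketch runs without the onnx package.
    demo = [
        SimpleNamespace(name="q_proj", op_type="MatMulNBits",
                        domain="com.microsoft"),
        SimpleNamespace(name="o_proj", op_type="MatMul", domain=""),
    ]
    print(find_matmulnbits(demo))
```

With the onnx package installed, the same function accepts
onnx.load(path, load_external_data=False).graph.node; an empty result
after MatMulNBitsToQDQ is what this commit expects.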

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md   | 12 +++++++++++-
 google-gemma-4-E2B-it/QNN/config.json |  6 ++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
index fa12a835f..85b6df874 100644
--- a/google-gemma-4-E2B-it/QNN/README.md
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -21,13 +21,23 @@ the multimodal package.
 ```
 HfModel (multimodal Gemma 4)
    ↓ MobiusBuilder (fp32)               4 ONNX components + genai_config + tokenizer + processors
-   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M quant (per component)
+   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
+   ↓ MatMulNBitsToQDQ                   MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
    ↓ OnnxStaticQuantization             activations uint16 / weights uint8 (calibrated)
    ↓ StaticLLM                          static shapes for QNN
    ↓ EPContextBinaryGenerator           HTP EPContext blobs (per component, weight-shared)
    ↓ ComposeOnnxModels                  final package
 ```
 
+Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`?
+`OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has
+fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list
+— without `MatMulNBitsToQDQ` the QNN partitioner rejects every
+quantized MatMul and the model silently falls back to CPU.
+`MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard
+`MatMul + DequantizeLinear` pair so QNN can claim and compile the
+subgraph onto HTP.
+
 ## Prerequisites
 
 ### Quantization environment (x64, GPU recommended)
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json
index dd9bc0458..52fde21e9 100644
--- a/google-gemma-4-E2B-it/QNN/config.json
+++ b/google-gemma-4-E2B-it/QNN/config.json
@@ -37,6 +37,12 @@
             "block_size": 32,
             "save_as_external_data": true
         },
+        "mnb_to_qdq": {
+            "type": "MatMulNBitsToQDQ",
+            "use_int4": true,
+            "add_zero_point": true,
+            "save_as_external_data": true
+        },
         "static_quant": {
             "type": "OnnxStaticQuantization",
             "data_config": "wikitext2_train_act",

From 7029e9b0f49dc58312e1477073638601d40967c0 Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 06:07:52 +0000
Subject: [PATCH 4/6] Gemma 4 QNN README: document standard Attention vs
 GroupQueryAttention

Make explicit that mobius emits opset-23 Attention (with attention_mask
input) for QNN, not com.microsoft::GroupQueryAttention(seqlens_k,
total_seq_len), because mobius's QNN ep_capabilities() advertises an
empty gqa_dtypes list. The existing AttentionMaskToSequenceLengths
GraphSurgery is therefore inapplicable (it only rewrites GQA), and
no surgery is needed if HTP's standard-attention kernel lowers cleanly.

Two follow-up options spelled out if HW shows the standard Attention
path is too slow on HTP:
  (a) extend mobius ep_capabilities for QNN to set gqa_dtypes so the
      builder emits GQA directly; or
  (b) port AttentionMaskToSequenceLengths to also rewrite standard
      Attention (currently it short-circuits when GQA is absent).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
index 85b6df874..45d138d5b 100644
--- a/google-gemma-4-E2B-it/QNN/README.md
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -113,6 +113,22 @@ This recipe has **not yet been validated end-to-end**. Known gaps:
 5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN
    recipes; tune to target Snapdragon SKU memory budget.
 
+6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius only
+   emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)`
+   when the EP capability advertises `gqa_dtypes`. The QNN EP
+   capability in mobius currently has an empty `gqa_dtypes` list, so
+   `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`)
+   falls back to the standard opset-23 `Attention` with an
+   `attention_mask` input. QNN's HTP backend should have an attention
+   kernel for the standard op, but if it doesn't lower well there are
+   two options:
+   - extend mobius `ep_capabilities()` to advertise QNN-supported
+     dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly
+     (no GraphSurgery needed); or
+   - port `AttentionMaskToSequenceLengths` to operate on standard
+     `Attention` (it currently checks for `GroupQueryAttention` only
+     and no-ops otherwise).
+
 ## Discussion
 
 If you have a Snapdragon test rig and the pipeline blows up on a

From cbf992c23f16ab4be28dd306bc0d3321723da600 Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 06:16:03 +0000
Subject: [PATCH 5/6] =?UTF-8?q?Gemma=204=20QNN:=20address=20Copilot=20revi?=
 =?UTF-8?q?ew=20=E2=80=94=20pin=20versions,=20align=20names,=20doc=20env?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* requirements.txt: pin olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to
  match the last-validated versions used by the other QNN recipes in
  this repo (e.g. microsoft-Phi-3-mini-4k-instruct/QNN/). Keep
  mobius-ai and transformers>=5.0 unpinned for now since this recipe
  is still WIP and the validated version set will only stabilize after
  HW validation.

* README: pin onnxruntime-qnn==1.22.2 in the AOT compilation env
  install command, matching microsoft-Phi-3-mini-4k-instruct/QNN/.

* README: state explicitly that 'olive run' runs from the
  quantization environment, with Olive invoking the QNN AOT venv via
  systems.qnn_system.python_environment_path for the
  EPContextBinaryGenerator pass. Avoids the easy mistake of running
  'olive run' from the QNN venv (which lacks GPU quantization deps).

* info.yml: align the top-level name (gemma4_e2b_qnn → gemma4-e2b-qnn)
  with the recipe name so scanner tables aren't ambiguous.

PR description updated to drop the stale 'v2 drops MatMulNBitsToQDQ'
claim — that pass was restored in 1e7c186 (QNN cannot run
MatMulNBits).
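
Since the build depends on python_environment_path pointing at the QNN
venv, a pre-flight check along these lines may save a failed run. It
is a sketch under assumptions: qnn_python_dir and has_python are
hypothetical helpers, and the executable names checked are a guess at
common venv layouts rather than anything Olive documents.

```python
import json
from pathlib import Path


def qnn_python_dir(config_path):
    """Return systems.qnn_system.python_environment_path as a Path."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return Path(config["systems"]["qnn_system"]["python_environment_path"])


def has_python(env_dir):
    """True if env_dir holds a Python executable (POSIX or Windows name)."""
    names = ("python", "python3", "python.exe")
    return any((Path(env_dir) / name).is_file() for name in names)
```

Calling has_python(qnn_python_dir("config.json")) from the quantization
environment returns False while the /path/to/qnn/env/bin placeholder is
still in place.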

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md        | 9 +++++++--
 google-gemma-4-E2B-it/QNN/info.yml         | 2 +-
 google-gemma-4-E2B-it/QNN/requirements.txt | 5 +++--
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
index 45d138d5b..dc9d089f5 100644
--- a/google-gemma-4-E2B-it/QNN/README.md
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -48,8 +48,8 @@ pip install cupy-cuda12x   # accelerates OnnxKQuantQuantization (19–51× speed
 
 ### AOT compilation environment (separate venv, x64 with QNN SDK)
 ```bash
-pip install olive-ai mobius-ai
-pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps
+pip install olive-ai==0.9.3 mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.22.2" --no-deps
 ```
 
 Replace `/path/to/qnn/env/bin` in `config.json` with the directory
@@ -62,6 +62,11 @@ pip install onnxruntime-qnn onnxruntime-genai
 
 ## Build
 
+Run `olive run` from the **quantization environment** (not the QNN AOT
+venv). Olive invokes the QNN AOT venv automatically via the
+`python_environment_path` configured under `systems.qnn_system` for the
+`EPContextBinaryGenerator` pass:
+
 ```bash
 olive run --config config.json
 ```
diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml
index ff96347b7..eda1a22a3 100644
--- a/google-gemma-4-E2B-it/QNN/info.yml
+++ b/google-gemma-4-E2B-it/QNN/info.yml
@@ -10,4 +10,4 @@ recipes:
     devices:
       - npu
     eps: QNNExecutionProvider
-name: gemma4_e2b_qnn
+name: gemma4-e2b-qnn
diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt
index 7e6f5c86f..ce774f410 100644
--- a/google-gemma-4-E2B-it/QNN/requirements.txt
+++ b/google-gemma-4-E2B-it/QNN/requirements.txt
@@ -1,5 +1,6 @@
 datasets
 mobius-ai
-olive-ai
-onnxruntime-gpu
+olive-ai==0.9.3
+# these are the versions the recipes were last validated with
+onnxruntime-gpu==1.21.1
 transformers>=5.0

From 811f3a02a65a43d21757eb0e2bc7114614f716e1 Mon Sep 17 00:00:00 2001
From: justinchuby <11205048+justinchuby@users.noreply.github.com>
Date: Wed, 27 May 2026 14:20:56 +0000
Subject: [PATCH 6/6] Gemma 4 QNN: unpin requirements / install commands
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mobius isn't published yet, so freezing olive-ai / onnxruntime-gpu /
onnxruntime-qnn / transformers at specific versions doesn't help
reproducibility: anyone trying this recipe needs the floating latest
of each anyway. Revert the version pins added in cbf992c, drop the
transformers>=5.0 lower bound, and track upstream releases instead.
Once the recipe is hardware-validated and the project starts
publishing recipes with validated version pins, we can revisit.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
---
 google-gemma-4-E2B-it/QNN/README.md        | 4 ++--
 google-gemma-4-E2B-it/QNN/requirements.txt | 7 +++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
index dc9d089f5..1a0081f4c 100644
--- a/google-gemma-4-E2B-it/QNN/README.md
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -48,8 +48,8 @@ pip install cupy-cuda12x   # accelerates OnnxKQuantQuantization (19–51× speed
 
 ### AOT compilation environment (separate venv, x64 with QNN SDK)
 ```bash
-pip install olive-ai==0.9.3 mobius-ai
-pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn==1.22.2" --no-deps
+pip install olive-ai mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
 ```
 
 Replace `/path/to/qnn/env/bin` in `config.json` with the directory
diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt
index ce774f410..6a89675c7 100644
--- a/google-gemma-4-E2B-it/QNN/requirements.txt
+++ b/google-gemma-4-E2B-it/QNN/requirements.txt
@@ -1,6 +1,5 @@
 datasets
 mobius-ai
-olive-ai==0.9.3
-# these are the versions the recipes were last validated with
-onnxruntime-gpu==1.21.1
-transformers>=5.0
+olive-ai
+onnxruntime-gpu
+transformers