diff --git a/google-gemma-4-E2B-it/QNN/README.md b/google-gemma-4-E2B-it/QNN/README.md
new file mode 100644
index 000000000..1a0081f4c
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/README.md
@@ -0,0 +1,141 @@
+# Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe
+
+> **Status: WORK IN PROGRESS / EXPLORATORY.** Starting-point recipe for
+> running [`google/gemma-4-E2B-it`](https://huggingface.co/google/gemma-4-E2B-it)
+> on the Qualcomm Hexagon NPU via the QNN ONNX Runtime execution
+> provider (Snapdragon X / Copilot+ PC, Snapdragon 8 Gen 3+, etc.).
+> Has *not* yet been validated on hardware; see
+> [Limitations](#limitations).
+
+## Approach
+
+All four Gemma 4 components (decoder, embedding, vision_encoder,
+audio_encoder) are compiled into QNN EPContext binaries together.
+Olive's per-component dispatch on `CompositeModelHandler` runs each
+pass on every component, then `EPContextBinaryGenerator` and
+`ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize
+the multimodal package.
+
+## Pipeline
+
+```
+HfModel (multimodal Gemma 4)
+  ↓ MobiusBuilder (fp32)            4 ONNX components + genai_config + tokenizer + processors
+  ↓ OnnxKQuantQuantization (INT4)   mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
+  ↓ MatMulNBitsToQDQ                MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
+  ↓ OnnxStaticQuantization          activations uint16 / weights uint8 (calibrated)
+  ↓ StaticLLM                       static shapes for QNN
+  ↓ EPContextBinaryGenerator        HTP EPContext blobs (per component, weight-shared)
+  ↓ ComposeOnnxModels               final package
+```
+
+Why both `OnnxKQuantQuantization` and `MatMulNBitsToQDQ`?
+`OnnxKQuantQuantization` emits `com.microsoft::MatMulNBits`, which has
+fast CPU / CUDA kernels but is *not* in the QNN EP's supported-op list.
+Without `MatMulNBitsToQDQ` the QNN partitioner rejects every
+quantized MatMul and the model silently falls back to CPU.
+`MatMulNBitsToQDQ` rewrites each `MatMulNBits` into a standard
+`MatMul + DequantizeLinear` pair so QNN can claim and compile the
+subgraph onto HTP.
+
+## Prerequisites
+
+### Quantization environment (x64, GPU recommended)
+```bash
+pip install -r requirements.txt
+pip install cupy-cuda12x  # accelerates OnnxKQuantQuantization (19–51× speedup)
+```
+
+### AOT compilation environment (separate venv, x64 with QNN SDK)
+```bash
+pip install olive-ai mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
+```
+
+Replace `/path/to/qnn/env/bin` in `config.json` with the directory
+containing your QNN venv's Python executable.
+
+### Inference (Snapdragon device)
+```bash
+pip install onnxruntime-qnn onnxruntime-genai
+```
+
+## Build
+
+Run `olive run` from the **quantization environment** (not the QNN AOT
+venv). Olive invokes the QNN AOT venv automatically via the
+`python_environment_path` configured under `systems.qnn_system` for the
+`EPContextBinaryGenerator` pass:
+
+```bash
+olive run --config config.json
+```
+
+## Why no `GraphSurgeries` here?
+
+Most existing QNN recipes (Phi-3, Qwen) chain surgeries like
+`RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`,
+`SimplifiedLayerNormToL2Norm`. Those rewrite ModelBuilder-specific
+sub-graphs into shapes HTP can lower:
+
+| Surgery | What it does | Why ModelBuilder needs it |
+|---|---|---|
+| `SimplifiedLayerNormToL2Norm` | `com.microsoft.SimplifiedLayerNorm` → `LpNormalization * gamma` | HTP has no SimplifiedLayerNorm kernel |
+| `RemoveRopeMultiCache` | Drop one of ModelBuilder's two RoPE caches | HTP can't dispatch on cache selector |
+| `AttentionMaskToSequenceLengths` | `GQA(attention_mask=[B,T])` → `GQA(past_seq_len, total_seq_len)` | HTP's GQA kernel wants scalar seq lens |
+
+`MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`,
+`Attention`) instead of the contrib variants, so these surgeries are
+either no-ops or inapplicable. Gemma-4–specific surgeries may still be
+needed (e.g. lowering the final logit soft-cap `cap * tanh(x / cap)`),
+but the existing set borrowed from Phi-3 is not the right one.
+
+## Limitations
+
+This recipe has **not yet been validated end-to-end**. Known gaps:
+
+1. **Logit soft-cap may not lower to HTP.** Gemma 4's
+   `logits = cap * tanh(logits / cap)` is unusual for QNN. If HTP
+   rejects it, options are (a) skip soft-cap during QNN compile and
+   apply it in host post-processing, or (b) add a
+   `RemoveLogitSoftcap` GraphSurgery upstream in Olive.
+
+2. **Hybrid local/global attention with dual head_dim.** Gemma 4 E2B
+   alternates local sliding-window (head_dim=256) and global
+   (head_dim=512) attention layers. Whether HTP can dispatch this
+   correctly per-layer needs testing.
+
+3. **`per_layer_inputs` data flow.** The embedding component emits a
+   second output (`per_layer_inputs`, shape `[B, S, L*D]`) consumed by
+   every decoder block. When all components compile to QNN this should
+   "just work" (the data path stays inside the package), but the
+   `StaticLLM` pass may need a hint to recognise this extra tensor.
+
+4. **256k tokenizer calibration.** `wikitext-2` calibration likely
+   under-represents Gemma 4's image / audio special tokens. Consider
+   augmenting with multimodal-formatted prompts before production.
+
+5. **`StaticLLM context_length=64`.** Placeholder mirroring existing QNN
+   recipes; tune to target Snapdragon SKU memory budget.
+
+6. **Standard `Attention` op, not `GroupQueryAttention`.** mobius only
+   emits `com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len)`
+   when the EP capability advertises `gqa_dtypes`. The QNN EP
+   capability in mobius currently has an empty `gqa_dtypes` list, so
+   `Gemma4TextModel.forward` (`src/mobius/models/gemma4.py:1500-1508`)
+   falls back to the standard opset-23 `Attention` with an
+   `attention_mask` input. QNN's HTP backend should have an attention
+   kernel for the standard op, but if it doesn't lower well there are
+   two options:
+   - extend mobius `ep_capabilities()` to advertise QNN-supported
+     dtypes for `gqa_dtypes`, then mobius will emit `GQA` directly
+     (no GraphSurgery needed); or
+   - port `AttentionMaskToSequenceLengths` to operate on standard
+     `Attention` (it currently checks for `GroupQueryAttention` only
+     and no-ops otherwise).
+
+## Discussion
+
+If you have a Snapdragon test rig and the pipeline blows up on a
+specific pass, please drop the trace in a comment. This recipe is
+intentionally a template for iteration, not a finished product.
diff --git a/google-gemma-4-E2B-it/QNN/config.json b/google-gemma-4-E2B-it/QNN/config.json
new file mode 100644
index 000000000..52fde21e9
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/config.json
@@ -0,0 +1,81 @@
+{
+    "input_model": { "type": "HfModel", "model_path": "google/gemma-4-E2B-it" },
+    "systems": {
+        "qnn_system": {
+            "type": "PythonEnvironment",
+            "python_environment_path": "/path/to/qnn/env/bin",
+            "accelerators": [
+                { "device": "npu", "execution_providers": ["QNNExecutionProvider"] }
+            ]
+        }
+    },
+    "data_configs": [
+        {
+            "name": "wikitext2_train_act",
+            "type": "HuggingfaceContainer",
+            "load_dataset_config": {
+                "data_name": "wikitext",
+                "subset": "wikitext-2-raw-v1",
+                "split": "train"
+            },
+            "pre_process_data_config": {
+                "strategy": "line-by-line",
+                "add_special_tokens": true,
+                "max_samples": 256,
+                "max_seq_len": 1024
+            }
+        }
+    ],
+    "passes": {
+        "mobius_build": {
+            "type": "MobiusBuilder",
+            "precision": "fp32"
+        },
+        "int4_quantize": {
+            "type": "OnnxKQuantQuantization",
+            "bits": 4,
+            "block_size": 32,
+            "save_as_external_data": true
+        },
+        "mnb_to_qdq": {
+            "type": "MatMulNBitsToQDQ",
+            "use_int4": true,
+            "add_zero_point": true,
+            "save_as_external_data": true
+        },
+        "static_quant": {
+            "type": "OnnxStaticQuantization",
+            "data_config": "wikitext2_train_act",
+            "activation_type": "uint16",
+            "precision": "uint8",
+            "calibration_providers": ["CUDAExecutionProvider"],
+            "quant_preprocess": true,
+            "op_types_to_exclude": [
+                "GatherBlockQuantized",
+                "GroupQueryAttention",
+                "MatMulNBits"
+            ],
+            "save_as_external_data": true
+        },
+        "static_llm": {
+            "type": "StaticLLM",
+            "batch_size": 1,
+            "context_length": 64
+        },
+        "ep_context": {
+            "type": "EPContextBinaryGenerator",
+            "provider_options": {
+                "htp_performance_mode": "burst",
+                "htp_graph_finalization_optimization_mode": "3",
+                "soc_model": "60"
+            },
+            "weight_sharing": true
+        },
+        "compose": { "type": "ComposeOnnxModels" }
+    },
+    "target": "qnn_system",
+    "output_dir": "model/gemma4_e2b_qnn",
+    "cache_dir": "cache",
+    "no_artifacts": true,
+    "evaluate_input_model": false
+}
diff --git a/google-gemma-4-E2B-it/QNN/info.yml b/google-gemma-4-E2B-it/QNN/info.yml
new file mode 100644
index 000000000..eda1a22a3
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/info.yml
@@ -0,0 +1,13 @@
+keywords:
+  - foundry-local
+  - qnn
+  - gemma4
+  - mobius
+arch: gemma
+recipes:
+  - name: gemma4-e2b-qnn
+    file: config.json
+    devices:
+      - npu
+    eps: QNNExecutionProvider
+name: gemma4-e2b-qnn
diff --git a/google-gemma-4-E2B-it/QNN/requirements.txt b/google-gemma-4-E2B-it/QNN/requirements.txt
new file mode 100644
index 000000000..6a89675c7
--- /dev/null
+++ b/google-gemma-4-E2B-it/QNN/requirements.txt
@@ -0,0 +1,5 @@
+datasets
+mobius-ai
+olive-ai
+onnxruntime-gpu
+transformers
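Note on `config.json`: the recipe ships with the placeholder `/path/to/qnn/env/bin`, and the `static_quant` pass refers to a data config by name, so a small pre-flight check before `olive run` may save a failed build. The sketch below is a hypothetical helper and not part of Olive. It uses only the standard library and the key names that appear in the config above; `check_recipe` and `check_recipe_file` are names invented for this example.

```python
import json
from pathlib import Path

# Placeholder string that the recipe asks the user to replace.
PLACEHOLDER_ENV = "/path/to/qnn/env/bin"


def check_recipe(config: dict) -> list[str]:
    """Return human-readable problems found in a recipe dict."""
    problems = []
    systems = config.get("systems", {})

    # The top-level "target" must name one of the declared systems.
    target = config.get("target")
    if target not in systems:
        problems.append(f"target {target!r} is not defined under 'systems'")

    # The QNN AOT venv path must have been filled in.
    for name, system in systems.items():
        if system.get("python_environment_path") == PLACEHOLDER_ENV:
            problems.append(
                f"systems.{name}.python_environment_path is still the placeholder"
            )

    # Every pass that names a data config must point at a declared one.
    declared = {d.get("name") for d in config.get("data_configs", [])}
    for name, pass_cfg in config.get("passes", {}).items():
        ref = pass_cfg.get("data_config")
        if isinstance(ref, str) and ref not in declared:
            problems.append(f"passes.{name}.data_config {ref!r} is not declared")

    return problems


def check_recipe_file(path: str = "config.json") -> list[str]:
    """Load a recipe from disk and check it."""
    return check_recipe(json.loads(Path(path).read_text(encoding="utf-8")))


# Trimmed-down copy of the recipe, enough to exercise the checks.
sample = {
    "systems": {"qnn_system": {"python_environment_path": PLACEHOLDER_ENV}},
    "data_configs": [{"name": "wikitext2_train_act"}],
    "passes": {
        "static_quant": {"data_config": "wikitext2_train_act"},
        "static_llm": {"context_length": 64},
    },
    "target": "qnn_system",
}

problems = check_recipe(sample)
for problem in problems:
    print(problem)
```

Calling `check_recipe_file()` from the recipe directory applies the same checks to the real `config.json`. An empty list means none of the three problems was found; it says nothing about whether the passes themselves will succeed.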