Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared INT8 embedding #422
Open
apsonawane wants to merge 11 commits into main from asonawane/int
Changes from 2 commits
Commits
c0b60d3 Add Qwen3.5-2B text only olive-recipe with INT4 weights and shared IN… (apsonawane)
5503317 Merge branch 'main' into asonawane/int (apsonawane)
4696d8d Fix Readme (apsonawane)
6a9d975 Fix eval (apsonawane)
eb5fbc1 Use external data (apsonawane)
6d6a815 Update recipes (apsonawane)
48bd358 Update recipes for webgpu (apsonawane)
854c18d Merge branch 'main' into asonawane/int (apsonawane)
95a6f92 Update WebGPU recipes (apsonawane)
61a4b4b Merge branch 'main' into asonawane/int (apsonawane)
affcc66 Merge branch 'main' into asonawane/int (apsonawane)
32 changes: 32 additions & 0 deletions
Qwen-Qwen3.5-2B/baseline/Qwen-Qwen3.5-2B_baseline_mmlu.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": ["CUDAExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "evaluators": { | ||
| "mmlu": { | ||
| "type": "LMEvaluator", | ||
| "tasks": ["mmlu"], | ||
| "model_class": "hf", | ||
| "batch_size": 8 | ||
| } | ||
| }, | ||
| "evaluator": "mmlu", | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "evaluate_input_model": true | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| accelerate | ||
| datasets | ||
| lm-eval | ||
| torch | ||
| transformers==4.52.4 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "cpu", | ||
| "execution_providers": ["CPUExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "passes": { | ||
| "m": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "extra_options": { | ||
| "exclude_embeds": false | ||
| } | ||
| }, | ||
| "q": { | ||
| "type": "GraphSurgeries", | ||
| "surgeries": [ | ||
| {"surgeon": "QuantizeEmbeddingInt8"}, | ||
| {"surgeon": "ShareEmbeddingLmHead"} | ||
| ] | ||
| } | ||
| }, | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "output_dir": "model", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true | ||
| } |
49 changes: 49 additions & 0 deletions
Qwen-Qwen3.5-2B/cpu/Qwen-Qwen3.5-2B_cpu_int4_with_eval.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "cpu", | ||
| "execution_providers": ["CPUExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "passes": { | ||
| "m": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "extra_options": { | ||
| "exclude_embeds": false | ||
| } | ||
| }, | ||
| "q": { | ||
| "type": "GraphSurgeries", | ||
| "surgeries": [ | ||
| {"surgeon": "QuantizeEmbeddingInt8"}, | ||
| {"surgeon": "ShareEmbeddingLmHead"} | ||
| ] | ||
| } | ||
| }, | ||
| "evaluators": { | ||
| "mmlu": { | ||
| "type": "LMEvaluator", | ||
| "tasks": ["mmlu"], | ||
| "batch_size": 8 | ||
| } | ||
| }, | ||
| "evaluator": "mmlu", | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "output_dir": "model", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| # Qwen-Qwen3.5-2B — CPU optimization | ||
| | ||
| This folder contains Olive recipes for optimizing Qwen-Qwen3.5-2B targeting the CPU EP. | ||
| | ||
| ## What this folder is for | ||
| | ||
| - Execution Provider: CPU EP | ||
| - Typical precision: INT4 precision by default | ||
| - Example recipe filename: Qwen-Qwen3.5-2B_cpu_int4.json | ||
| | ||
| ## Setup | ||
| | ||
| 1) Install the main branch of Olive: | ||
| - pip install git+https://github.com/microsoft/olive.git | ||
| 2) Install the appropriate runtime package for this backend: | ||
| - onnxruntime-genai (CPU build) | ||
| 3) Run Olive to build/optimize the model | ||
| - olive run --config Qwen-Qwen3.5-2B_cpu_int4.json | ||
| | ||
| Additional notes: | ||
| - Pipeline: `ModelBuilder` (INT4 weights) → `GraphSurgeries` (`QuantizeEmbeddingInt8`, then `ShareEmbeddingLmHead`) | ||
| - Uses text-only mode (exclude_embeds=false) for standalone LLM inference without the multimodal pipeline. | ||
| - Runs purely on CPU; no GPU required. | ||
| | ||
| --- | ||
| | ||
| This README was auto-generated for the CPU EP of Qwen-Qwen3.5-2B. | ||
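
The Setup steps in the README above end with `olive run --config <recipe>`. A small Python sketch of driving that step from a script, assuming the `olive` CLI is already installed and on `PATH`. The helper names `olive_run_command` and `run_recipe` are hypothetical.

```python
import subprocess
from pathlib import Path


def olive_run_command(config_path: str) -> list:
    """Build the argument list for the `olive run --config ...` step from the README."""
    return ["olive", "run", "--config", config_path]


def run_recipe(config_path: str) -> int:
    """Run the recipe and return the CLI exit code; fail early if the file is missing."""
    if not Path(config_path).is_file():
        raise FileNotFoundError(config_path)
    return subprocess.run(olive_run_command(config_path), check=False).returncode
```

For example, `run_recipe("Qwen-Qwen3.5-2B_cpu_int4.json")` would be called from inside the recipe folder, since the recipe uses relative `output_dir` and `cache_dir` paths.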
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| arch: qwen3_5_text | ||
| recipes: | ||
| - name: Qwen-Qwen3.5-2B_cpu_int4 | ||
| file: Qwen-Qwen3.5-2B_cpu_int4.json | ||
| devices: cpu | ||
| eps: CPUExecutionProvider |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| accelerate | ||
| datasets | ||
| onnxruntime-genai | ||
| transformers==4.52.4 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": ["CUDAExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "passes": { | ||
| "m": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "extra_options": { | ||
| "exclude_embeds": false, | ||
| "enable_cuda_graph": true | ||
| } | ||
| }, | ||
| "q": { | ||
| "type": "GraphSurgeries", | ||
| "surgeries": [ | ||
| {"surgeon": "QuantizeEmbeddingInt8"}, | ||
| {"surgeon": "ShareEmbeddingLmHead"} | ||
| ] | ||
| } | ||
| }, | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "output_dir": "model", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true | ||
| } |
51 changes: 51 additions & 0 deletions
Qwen-Qwen3.5-2B/cuda/Qwen-Qwen3.5-2B_cuda_int4_with_eval.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": ["CUDAExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "passes": { | ||
| "m": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "extra_options": { | ||
| "exclude_embeds": false, | ||
| "enable_cuda_graph": true | ||
| } | ||
| }, | ||
| "q": { | ||
| "type": "GraphSurgeries", | ||
| "surgeries": [ | ||
| {"surgeon": "QuantizeEmbeddingInt8"}, | ||
| {"surgeon": "ShareEmbeddingLmHead"} | ||
| ] | ||
| } | ||
| }, | ||
| "evaluators": { | ||
| "mmlu": { | ||
| "type": "LMEvaluator", | ||
| "tasks": ["mmlu"], | ||
| "batch_size": 8 | ||
| } | ||
| }, | ||
| "evaluator": "mmlu", | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "output_dir": "model", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true, | ||
| "evaluate_input_model": false | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # Qwen-Qwen3.5-2B — CUDA optimization | ||
| | ||
| This folder contains Olive recipes for optimizing Qwen-Qwen3.5-2B targeting the CUDA EP. | ||
| | ||
| ## What this folder is for | ||
| | ||
| - Execution Provider: CUDA EP | ||
| - Typical precision: INT4 precision by default | ||
| - Example recipe filename: Qwen-Qwen3.5-2B_cuda_int4.json | ||
| | ||
| ## Setup | ||
| | ||
| 1) Install the main branch of Olive: | ||
| - pip install git+https://github.com/microsoft/olive.git | ||
| 2) Install the appropriate runtime package for this backend: | ||
| - onnxruntime-genai-cuda (CUDA build) | ||
| 3) Run Olive to build/optimize the model | ||
| - olive run --config Qwen-Qwen3.5-2B_cuda_int4.json | ||
| | ||
| Additional notes: | ||
| - Pipeline: `ModelBuilder` (INT4 weights) → `GraphSurgeries` (`QuantizeEmbeddingInt8`, then `ShareEmbeddingLmHead`) | ||
| - Uses text-only mode (exclude_embeds=false) for standalone LLM inference without the multimodal pipeline. | ||
| | ||
| - Requires NVIDIA GPU with CUDA support. | ||
| - Ensure CUDA toolkit and cuDNN are properly installed. | ||
| | ||
| --- | ||
| | ||
| This README was auto-generated for the CUDA EP of Qwen-Qwen3.5-2B. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| arch: qwen3_5_text | ||
| recipes: | ||
| - name: Qwen-Qwen3.5-2B_cuda_int4 | ||
| file: Qwen-Qwen3.5-2B_cuda_int4.json | ||
| devices: gpu | ||
| eps: CUDAExecutionProvider |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| accelerate | ||
| datasets | ||
| onnxruntime-genai | ||
| transformers==4.52.4 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "Qwen/Qwen3.5-2B", | ||
| "load_kwargs": { | ||
| "torch_dtype": "float16" | ||
| } | ||
| }, | ||
| "systems": { | ||
| "local_system": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": ["WebGpuExecutionProvider"] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "passes": { | ||
| "m": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "extra_options": { | ||
| "exclude_embeds": false | ||
| } | ||
| }, | ||
| "q": { | ||
| "type": "GraphSurgeries", | ||
| "surgeries": [ | ||
| {"surgeon": "QuantizeEmbeddingInt8"}, | ||
| {"surgeon": "ShareEmbeddingLmHead"} | ||
| ] | ||
| } | ||
| }, | ||
| "target": "local_system", | ||
| "log_severity_level": 0, | ||
| "output_dir": "model", | ||
| "cache_dir": "cache", | ||
| "no_artifacts": true | ||
| } |