Add vision evaluation CI for VLM recipes with PyTorch comparison and perf benchmarks by apsonawane · Pull Request #473 · microsoft/olive-recipes

apsonawane · 2026-06-04T18:08:23Z

# Add vision evaluation CI for VLM recipes with PyTorch comparison and perf benchmarks

## Summary

Adds automated vision evaluation CI for VLM (Vision-Language Model) recipes. On every PR that modifies a VLM recipe, the pipeline builds the INT4 ONNX model, runs vision accuracy benchmarks against public HuggingFace datasets, compares accuracy with the original PyTorch model, and reports performance metrics.

## Changed files

### New files

| File | Description |
|------|-------------|
| `.github/scripts/run_vision_eval.py` | Reusable vision eval script with PyTorch comparison and perf benchmarks |
| `google-gemma-4-E2B-it/olive_ci.json` | CI config for Gemma 4 E2B-it (CPU + CUDA) |
| `Qwen-Qwen3-VL-2B-Instruct/olive_ci.json` | CI config for Qwen3-VL-2B (CPU + CUDA) |
| `Qwen-Qwen2.5-VL-3B-Instruct/olive_ci.json` | CI config for Qwen2.5-VL-3B (CPU + CUDA) |
| `Qwen-Qwen3.5-4B/olive_ci.json` | CI config for Qwen3.5-4B (CPU + CUDA) |

### Modified files

| File | Description |
|------|-------------|
| `.github/scripts/generate_matrix.py` | Added `--changed-files` filtering so only modified recipes run in PRs |
| `.github/workflows/main.yml` | Added trigger paths for `**/config.json` and `run_vision_eval.py`; added changed-file detection step |

## What's included

### 1. Reusable vision eval script

`.github/scripts/run_vision_eval.py` supports any VLM exported to multi-file ONNX with `genai_config.json`.

#### Supported benchmarks

| Benchmark | Dataset | Metric | Task Type |
|-----------|---------|--------|-----------|
| TextVQA | `facebook/textvqa` | `exact_match` | `vision-vqa` |
| ChartQA | `HuggingFaceM4/ChartQA` | `relaxed_accuracy` | `vision-chart-qa` |
| DocumentVQA | `HuggingFaceM4/DocumentVQA` | `word_sort_ratio` | `vision-ocr` |
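
To make the metric names in the table more concrete, here is a minimal sketch of two of them. It uses the commonly cited definitions: `exact_match` compares normalized strings, and ChartQA-style relaxed accuracy accepts numeric answers within a 5% relative error. The normalization (trim and lowercase) and the helper names are assumptions for illustration, not necessarily what Olive's evaluator does; `word_sort_ratio` is left out because its definition is not given here.

```python
from typing import Optional


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality after trimming whitespace (assumed normalization)."""
    return prediction.strip().lower() == reference.strip().lower()


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer such as '1,050' or '12%'; return None if it is not numeric."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, reference: str, tolerance: float = 0.05) -> bool:
    """Numeric answers pass within a relative tolerance; other answers need an exact match."""
    pred_num, ref_num = _to_float(prediction), _to_float(reference)
    if pred_num is not None and ref_num is not None:
        if ref_num == 0:
            return pred_num == 0
        return abs(pred_num - ref_num) / abs(ref_num) <= tolerance
    return exact_match(prediction, reference)


def accuracy(pairs, match) -> float:
    """Fraction of (prediction, reference) pairs accepted by the given match function."""
    pairs = list(pairs)
    return sum(match(p, r) for p, r in pairs) / len(pairs) if pairs else 0.0


samples = [("104", "100"), ("Stop", "stop"), ("120", "100")]
print(f"relaxed_accuracy = {accuracy(samples, relaxed_match):.4f}")
```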

#### CLI options

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | — | Olive config to build model (skipped if `--model-path` set) |
| `--model-path` | — | Pre-built ONNX model directory |
| `--pytorch-model` | — | HuggingFace model ID for PyTorch comparison |
| `--device` | `cpu` | `cpu` or `gpu` |
| `--benchmarks` | `textvqa` | Comma-separated: `textvqa`, `chartqa`, `docvqa` |
| `--limit` | `50` | Number of eval samples (0 = full dataset) |
| `--threshold` | `0.0` | Minimum accuracy threshold (fails if below) |
| `--max-delta` | `0.0` | Max accuracy drop from PyTorch to ONNX (e.g. `0.02` = 2pp) |
| `--perf` | `false` | Run performance benchmarks (latency, throughput, memory) |
| `--perf-samples` | `10` | Number of inference runs for perf measurement |
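
The flags above map naturally onto an `argparse` parser. The following is a sketch reconstructed from the table only, not the script's actual source; the function names `build_parser` and `parse_benchmarks` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Vision eval for VLM recipes (sketch)")
    parser.add_argument("--config", help="Olive config used to build the model")
    parser.add_argument("--model-path", help="Pre-built ONNX model directory")
    parser.add_argument("--pytorch-model", help="HuggingFace model ID for PyTorch comparison")
    parser.add_argument("--device", choices=["cpu", "gpu"], default="cpu")
    parser.add_argument("--benchmarks", default="textvqa",
                        help="Comma-separated: textvqa, chartqa, docvqa")
    parser.add_argument("--limit", type=int, default=50, help="0 = full dataset")
    parser.add_argument("--threshold", type=float, default=0.0)
    parser.add_argument("--max-delta", type=float, default=0.0)
    parser.add_argument("--perf", action="store_true")
    parser.add_argument("--perf-samples", type=int, default=10)
    return parser


def parse_benchmarks(value: str) -> list[str]:
    """Split the comma-separated --benchmarks value and reject unknown names."""
    known = {"textvqa", "chartqa", "docvqa"}
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"Unknown benchmark(s): {', '.join(unknown)}")
    return names


args = build_parser().parse_args(
    ["--model-path", "/path/to/model", "--benchmarks", "textvqa,chartqa", "--perf"]
)
print(args.device, args.limit, args.perf, parse_benchmarks(args.benchmarks))
# prints: cpu 50 True ['textvqa', 'chartqa']
```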

#### Usage examples

```bash
# Build + eval (standard CI flow)
python run_vision_eval.py --config cpu/int4/config.json --benchmarks textvqa --limit 100

# With PyTorch comparison
python run_vision_eval.py --model-path /path/to/model --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 100 --max-delta 0.05

# With perf benchmarks
python run_vision_eval.py --model-path /path/to/model --benchmarks textvqa --limit 100 --perf
```
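
As an illustration of what a `--perf` run could measure, the sketch below times repeated calls and derives average, P50, and P90 latency plus tokens per second. It is an assumption-laden example: `benchmark`, `percentile`, and `fake_generate` are hypothetical names, the percentile uses the nearest-rank method, and memory measurement is omitted.

```python
import math
import statistics
import time
from typing import Callable


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(run_once: Callable[[], int], samples: int = 10) -> dict[str, float]:
    """Time `samples` calls of run_once, which returns the number of generated tokens."""
    latencies, token_counts = [], []
    for _ in range(samples):
        start = time.perf_counter()
        tokens = run_once()
        latencies.append(time.perf_counter() - start)
        token_counts.append(tokens)
    ordered = sorted(latencies)
    total_time = sum(latencies)
    return {
        "avg_latency_s": statistics.fmean(latencies),
        "p50_latency_s": percentile(ordered, 50),
        "p90_latency_s": percentile(ordered, 90),
        "avg_tokens_per_run": statistics.fmean(token_counts),
        "tokens_per_sec": sum(token_counts) / total_time if total_time > 0 else 0.0,
    }


def fake_generate() -> int:
    """Stand-in for one model inference; sleeps briefly and reports 32 tokens."""
    time.sleep(0.001)
    return 32


stats = benchmark(fake_generate, samples=10)
print(f"Avg latency: {stats['avg_latency_s']:.3f}s")
print(f"Tokens/sec:  {stats['tokens_per_sec']:.1f}")
```

In a real run, `fake_generate` would be replaced by a call into the ONNX or PyTorch model.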

### 2. Per-recipe CI configs

All recipes test INT4 quantized models only:

| Recipe | CPU samples | CUDA samples | PyTorch comparison |
|--------|-------------|--------------|--------------------|
| `google-gemma-4-E2B-it` | 100 | 200 | `google/gemma-4-E2B-it` |
| `Qwen-Qwen3-VL-2B-Instruct` | 100 | 200 | `Qwen/Qwen3-VL-2B-Instruct` |
| `Qwen-Qwen2.5-VL-3B-Instruct` | 100 | 200 | `Qwen/Qwen2.5-VL-3B-Instruct` |
| `Qwen-Qwen3.5-4B` | 100 | 200 | `Qwen/Qwen3.5-4B` |

#### Example `olive_ci.json`

```json
[
    {
        "name": "gemma4-e2b-int4-cpu-vision-eval",
        "os": "ubuntu",
        "device": "cpu",
        "requirements_file": "requirements.txt",
        "command": "python ../../.github/scripts/run_vision_eval.py --config cpu/int4/config.json --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 100 --device cpu --perf --max-delta 0.05"
    },
    {
        "name": "gemma4-e2b-int4-cuda-vision-eval",
        "os": "ubuntu",
        "device": "cuda",
        "requirements_file": "requirements.txt",
        "command": "python ../../.github/scripts/run_vision_eval.py --config cuda/int4/config.json --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 200 --device gpu --perf --max-delta 0.05"
    }
]
```

### 3. Smarter CI filtering

`generate_matrix.py` now accepts `--changed-files` to run only the recipes whose files were modified:

| Scenario | What runs |
|----------|-----------|
| PR touches `Qwen-Qwen3-VL-2B-Instruct/` | Only Qwen3-VL jobs |
| PR touches `google-gemma-4-E2B-it/` | Only Gemma 4 jobs |
| PR touches `.github/scripts/run_vision_eval.py` | All recipes (shared infra) |
| PR touches `.github/workflows/main.yml` | All recipes (shared infra) |
| `workflow_dispatch` (manual trigger) | All recipes |
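
The selection rules in this section can be sketched as a small function. This is an illustrative reconstruction of the behavior described here, not the code in `generate_matrix.py`; `select_recipes` and `SHARED_INFRA` are hypothetical names, and the set of shared paths contains only the two files named above.

```python
from pathlib import PurePosixPath

# Paths treated as shared infrastructure, taken from the scenarios described above.
SHARED_INFRA = {
    ".github/scripts/run_vision_eval.py",
    ".github/workflows/main.yml",
}


def select_recipes(changed_files, all_recipes, manual_trigger=False):
    """Return the recipe folders whose CI jobs should run (hypothetical helper)."""
    # A manual trigger or a change to shared infra runs every recipe.
    if manual_trigger or any(path in SHARED_INFRA for path in changed_files):
        return sorted(all_recipes)
    # Otherwise keep only recipes whose top-level folder appears in the changed paths.
    touched = {PurePosixPath(path).parts[0] for path in changed_files if path}
    return sorted(recipe for recipe in all_recipes if recipe in touched)


recipes = ["Qwen-Qwen3-VL-2B-Instruct", "google-gemma-4-E2B-it"]
print(select_recipes(["Qwen-Qwen3-VL-2B-Instruct/olive_ci.json"], recipes))
print(select_recipes([".github/workflows/main.yml"], recipes))
print(select_recipes([], recipes, manual_trigger=True))
```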

### 4. Updated workflow triggers

`main.yml` now triggers on:

- `**/olive_ci.json` (existing)
- `**/config.json` (new: recipe config changes)
- `.github/scripts/run_vision_eval.py` (new: eval script changes)

## Example CI output

```
============================================================
Running TextVQA (exact_match)
  Dataset: facebook/textvqa (split=validation)
  Limit: 100
============================================================

  [ONNX] Evaluating...
  [ONNX] PASS: 0.3800 (45.2s)

  [PyTorch] Evaluating google/gemma-4-E2B-it...
  [PyTorch] 0.4200 (82.1s)
  [Delta] PyTorch - ONNX = +0.0400 (+4.00pp)

============================================================
PERFORMANCE BENCHMARKS
============================================================
  Running 10 inference iterations...
  Avg latency:      0.823s
  P50 latency:      0.801s
  P90 latency:      0.912s
  Avg tokens/run:   31.4
  Tokens/sec:       38.2
  Peak GPU memory:  4210 MB

============================================================
RESULTS SUMMARY
============================================================
  Benchmark                           ONNX  PyTorch    Delta  ONNX Time    PT Time
  -----------------------------------  --------  --------  --------  ----------  ----------
  TextVQA (exact_match)               0.3800   0.4200  +0.0400      45.2s      82.1s

  Avg ONNX speedup vs PyTorch: 1.82x

All benchmarks passed.
```
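
The delta, speedup, and pass/fail lines in this output follow from simple arithmetic on the reported numbers. The sketch below reproduces them under two assumptions that the description does not spell out: the delta check runs only when a PyTorch accuracy is available, and a benchmark fails when the drop is strictly greater than `--max-delta`. `check_accuracy` is a hypothetical helper name.

```python
from typing import Optional


def check_accuracy(onnx_acc: float, pytorch_acc: Optional[float],
                   threshold: float = 0.0, max_delta: float = 0.0) -> list[str]:
    """Return a list of failure messages; an empty list means the benchmark passed."""
    failures = []
    if onnx_acc < threshold:
        failures.append(f"ONNX accuracy {onnx_acc:.4f} is below threshold {threshold:.4f}")
    if pytorch_acc is not None:
        delta = pytorch_acc - onnx_acc  # positive means ONNX lost accuracy
        if delta > max_delta:
            failures.append(f"Accuracy drop {delta:+.4f} exceeds max delta {max_delta:.4f}")
    return failures


# Numbers copied from the example output above.
onnx_acc, pytorch_acc = 0.38, 0.42
onnx_time, pytorch_time = 45.2, 82.1

delta = pytorch_acc - onnx_acc
print(f"[Delta] PyTorch - ONNX = {delta:+.4f} ({delta * 100:+.2f}pp)")
print(f"Avg ONNX speedup vs PyTorch: {pytorch_time / onnx_time:.2f}x")
print(check_accuracy(onnx_acc, pytorch_acc, max_delta=0.05) or "All benchmarks passed.")
```

With `--max-delta 0.05`, the 4pp drop in the example stays within tolerance; the same run with `--max-delta 0.02` would fail under these assumptions.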

## How to add vision eval to a new VLM recipe

1. Create your recipe folder with configs:

   ```
   olive-recipes/
     my-new-vlm/
       requirements.txt
       cpu/int4/config.json
       cuda/int4/config.json
   ```

2. Add an `olive_ci.json`:

   ```json
   [
       {
           "name": "my-vlm-int4-cpu-vision-eval",
           "os": "ubuntu",
           "device": "cpu",
           "requirements_file": "requirements.txt",
           "command": "python ../../.github/scripts/run_vision_eval.py --config cpu/int4/config.json --pytorch-model org/model-name --benchmarks textvqa --limit 100 --device cpu --perf --max-delta 0.05"
       },
       {
           "name": "my-vlm-int4-cuda-vision-eval",
           "os": "ubuntu",
           "device": "cuda",
           "requirements_file": "requirements.txt",
           "command": "python ../../.github/scripts/run_vision_eval.py --config cuda/int4/config.json --pytorch-model org/model-name --benchmarks textvqa --limit 200 --device gpu --perf --max-delta 0.05"
       }
   ]
   ```

3. Open a PR. CI will pick it up automatically.
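
Before opening the PR, a quick local sanity check of the new `olive_ci.json` can catch typos early. The sketch below assumes each entry needs exactly the five keys used in the examples here and that `device` is either `cpu` or `cuda`; the real CI may accept more fields or values, and `validate_ci_config` is a hypothetical helper.

```python
import json

# Keys that appear in every example entry in this PR description.
REQUIRED_KEYS = {"name", "os", "device", "requirements_file", "command"}


def validate_ci_config(text: str) -> list[str]:
    """Return a list of problems found in an olive_ci.json document (empty if none)."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array of job entries"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: must be a JSON object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
        # Assumed value set, based on the devices used in the examples.
        if "device" in entry and entry["device"] not in {"cpu", "cuda"}:
            problems.append(f"entry {index}: unexpected device {entry['device']!r}")
    return problems


example = """
[
    {
        "name": "my-vlm-int4-cpu-vision-eval",
        "os": "ubuntu",
        "device": "cpu",
        "requirements_file": "requirements.txt",
        "command": "python ../../.github/scripts/run_vision_eval.py --config cpu/int4/config.json"
    }
]
"""
print(validate_ci_config(example) or "olive_ci.json looks consistent with the examples")
```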

## Dependencies

- Olive (`olive-ai`) with vision evaluation support (PRs #2476 and #2488, both merged)
- `onnxruntime-genai` for ONNX inference
- `mobius-ai` for model export (Gemma 4)
- `pillow` for image handling
- Public HuggingFace datasets (no auth required for TextVQA, ChartQA, DocumentVQA)

## Testing checklist

- CPU CI jobs pass for all 4 recipes
- CUDA CI jobs pass for all 4 recipes
- PyTorch vs ONNX accuracy delta within 5pp tolerance
- Perf metrics (latency, throughput, memory) reported correctly
- Matrix filtering works: only changed recipes run on PR
- `workflow_dispatch` still runs all recipes

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

hanbitmyths · 2026-06-05T05:15:44Z

```diff
 lm-eval
 mobius-ai
-olive-ai[gpu]
+onnxruntime==1.26.0
```

Should we pin the versions of onnxruntime and genai? If we want to check for regressions, should these be minimum versions instead? We can still pin the transformers or torch version, though.

apsonawane · Jun 5, 2026

I pinned this version because 1.25.1 is the minimum version needed for Qwen3.5, but I kept it at the latest. Also, the CI pipeline runs on Python 3.10, and 1.26.0 does not support it.
After 1.27.0 is released, we will need to pin to it, because Gemma would need that as its minimum.
We can definitely pin the transformers and torch versions.

Add vision evaluation CI for VLM recipes with PyTorch comparison and perf benchmarks

aea97b6

Copilot AI review requested due to automatic review settings June 4, 2026 18:08

Copilot started reviewing on behalf of apsonawane June 4, 2026 18:08

Copilot AI reviewed Jun 4, 2026

apsonawane requested a review from Copilot June 4, 2026 18:10

Copilot started reviewing on behalf of apsonawane June 4, 2026 18:10

github-code-quality Bot found potential problems Jun 4, 2026

Comment thread .github/scripts/run_vision_eval.py Fixed

Comment thread .github/scripts/run_vision_eval.py Fixed

Comment thread .github/scripts/run_vision_eval.py Fixed

Copilot AI reviewed Jun 4, 2026

Update requirements

a269849

github-code-quality Bot found potential problems Jun 4, 2026

Comment thread .github/scripts/run_vision_eval.py Fixed

Add trigger

11110cf

apsonawane requested a review from Copilot June 4, 2026 18:33

Copilot started reviewing on behalf of apsonawane June 4, 2026 18:33

Copilot AI reviewed Jun 4, 2026

apsonawane added 2 commits June 4, 2026 11:45

Add dependencies

238acf6

Add new models

64afa0d

Copilot stopped reviewing on behalf of apsonawane due to an error June 4, 2026 19:33
Copilot had to stop work due to a timeout.

apsonawane added 13 commits June 4, 2026 13:17

pin ort and genai versions

e827242

Fix precommit

94b8117

update python version

ed34c75

Add subprocess

753037b

Update requirements and path

5593f5b

Update

41ac0a5

Cleanup

43864a9

Add datasets

e710839

Run only mmmu test

fb23ab0

remove submetric

e7d0710

Fix

5f6aafc

Run only one model

b888148

Fix eval

189528e

hanbitmyths reviewed Jun 5, 2026

hanbitmyths and others added 10 commits June 5, 2026 05:18

Merge branch 'main' into asonawane/e2e

0551947

Update models

c234fc3

Merge branch 'asonawane/e2e' of https://github.com/microsoft/olive-recipes into asonawane/e2e

90b9567

Add all subjects

ee1e833

Update model

c617b41

Update tests

64b2e2f

Enable cuda model

7133176

Enable cuda model

995515f

Enable cuda model

c0ba214

Enable cuda model

a8cf28a

apsonawane wants to merge 28 commits into `main` from `asonawane/e2e`

3 participants
