Add vision evaluation CI for VLM recipes with PyTorch comparison and perf benchmarks #473

Open. apsonawane wants to merge 28 commits into `main` from `asonawane/e2e`.

@apsonawane (Contributor)

# Add vision evaluation CI for VLM recipes with PyTorch comparison and perf benchmarks

## Summary

Adds automated vision evaluation CI for VLM (Vision-Language Model) recipes. On every PR that modifies a VLM recipe, the pipeline builds the INT4 ONNX model, runs vision accuracy benchmarks against public HuggingFace datasets, compares accuracy with the original PyTorch model, and reports performance metrics.

## Changed files

### New files

| File | Description |
|------|-------------|
| `.github/scripts/run_vision_eval.py` | Reusable vision eval script with PyTorch comparison and perf benchmarks |
| `google-gemma-4-E2B-it/olive_ci.json` | CI config for Gemma 4 E2B-it (CPU + CUDA) |
| `Qwen-Qwen3-VL-2B-Instruct/olive_ci.json` | CI config for Qwen3-VL-2B (CPU + CUDA) |
| `Qwen-Qwen2.5-VL-3B-Instruct/olive_ci.json` | CI config for Qwen2.5-VL-3B (CPU + CUDA) |
| `Qwen-Qwen3.5-4B/olive_ci.json` | CI config for Qwen3.5-4B (CPU + CUDA) |

### Modified files

| File | Description |
|------|-------------|
| `.github/scripts/generate_matrix.py` | Added `--changed-files` filtering so only modified recipes run in PRs |
| `.github/workflows/main.yml` | Added trigger paths for `**/config.json` and `run_vision_eval.py`; added changed-file detection step |

## What's included

### 1. Reusable vision eval script

`.github/scripts/run_vision_eval.py` supports any VLM exported to multi-file ONNX with `genai_config.json`.

#### Supported benchmarks

| Benchmark | Dataset | Metric | Task Type |
|-----------|---------|--------|-----------|
| TextVQA | `facebook/textvqa` | `exact_match` | `vision-vqa` |
| ChartQA | `HuggingFaceM4/ChartQA` | `relaxed_accuracy` | `vision-chart-qa` |
| DocumentVQA | `HuggingFaceM4/DocumentVQA` | `word_sort_ratio` | `vision-ocr` |
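
For readers unfamiliar with ChartQA's `relaxed_accuracy`, the sketch below shows the commonly used definition: a numeric answer counts as correct when it is within a 5% relative tolerance of the target, and anything else falls back to a case-insensitive exact match. The names `relaxed_match` and `relaxed_accuracy` and the default tolerance are assumptions for illustration; the metric as implemented in Olive may differ in its details.

```python
def _to_float(text: str):
    """Parse a number, treating a trailing '%' as a percentage; None if not numeric."""
    text = text.strip()
    try:
        if text.endswith("%"):
            return float(text[:-1]) / 100.0
        return float(text)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Hypothetical helper: numeric answers may deviate by `tolerance` (relative)."""
    pred_num = _to_float(prediction)
    target_num = _to_float(target)
    # A zero or non-numeric target cannot be compared relatively, so fall through.
    if pred_num is not None and target_num:
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions, targets) -> float:
    """Fraction of predictions that pass the relaxed match against their target."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    return sum(relaxed_match(p, t) for p, t in pairs) / len(pairs)
```

The tolerance keeps small numeric reading errors on charts from being scored as misses, while text answers such as labels still need to match exactly.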

#### CLI options

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | (none) | Olive config to build model (skipped if `--model-path` set) |
| `--model-path` | (none) | Pre-built ONNX model directory |
| `--pytorch-model` | (none) | HuggingFace model ID for PyTorch comparison |
| `--device` | `cpu` | `cpu` or `gpu` |
| `--benchmarks` | `textvqa` | Comma-separated: `textvqa`, `chartqa`, `docvqa` |
| `--limit` | `50` | Number of eval samples (0 = full dataset) |
| `--threshold` | `0.0` | Minimum accuracy threshold (fails if below) |
| `--max-delta` | `0.0` | Max accuracy drop from PyTorch to ONNX (e.g. `0.02` = 2pp) |
| `--perf` | `false` | Run performance benchmarks (latency, throughput, memory) |
| `--perf-samples` | `10` | Number of inference runs for perf measurement |
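
The interaction between `--threshold` and `--max-delta` can be sketched as a pair of pass/fail gates. This is a minimal illustration, not the code in `run_vision_eval.py`: the function name `check_gates` is hypothetical, and the sketch assumes the delta gate applies only when a PyTorch score is available. Whether a `--max-delta` of `0.0` means "no drop allowed" or "gate disabled" is not stated in this PR, so the sketch simply applies the comparison as written.

```python
from typing import List, Optional


def check_gates(
    onnx_acc: float,
    pytorch_acc: Optional[float],
    threshold: float,
    max_delta: float,
) -> List[str]:
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []

    # Gate 1: absolute accuracy floor for the ONNX model.
    if onnx_acc < threshold:
        failures.append(
            f"ONNX accuracy {onnx_acc:.4f} is below threshold {threshold:.4f}"
        )

    # Gate 2: allowed accuracy drop relative to PyTorch (assumed to be skipped
    # when no PyTorch comparison was requested).
    if pytorch_acc is not None:
        drop = pytorch_acc - onnx_acc
        if drop > max_delta:
            failures.append(
                f"Accuracy drop {drop:+.4f} exceeds max delta {max_delta:.4f}"
            )

    return failures
```

With the figures from the sample run in this PR (ONNX 0.38, PyTorch 0.42), a `--max-delta` of `0.05` passes, while `0.02` would fail.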

#### Usage examples

```bash
# Build + eval (standard CI flow)
python run_vision_eval.py --config cpu/int4/config.json --benchmarks textvqa --limit 100

# With PyTorch comparison
python run_vision_eval.py --model-path /path/to/model --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 100 --max-delta 0.05

# With perf benchmarks
python run_vision_eval.py --model-path /path/to/model --benchmarks textvqa --limit 100 --perf
```

### 2. Per-recipe CI configs

All recipes test INT4 quantized models only:

| Recipe | CPU samples | CUDA samples | PyTorch comparison |
|--------|-------------|--------------|--------------------|
| `google-gemma-4-E2B-it` | 100 | 200 | `google/gemma-4-E2B-it` |
| `Qwen-Qwen3-VL-2B-Instruct` | 100 | 200 | `Qwen/Qwen3-VL-2B-Instruct` |
| `Qwen-Qwen2.5-VL-3B-Instruct` | 100 | 200 | `Qwen/Qwen2.5-VL-3B-Instruct` |
| `Qwen-Qwen3.5-4B` | 100 | 200 | `Qwen/Qwen3.5-4B` |

#### Example `olive_ci.json`

```json
[
    {
        "name": "gemma4-e2b-int4-cpu-vision-eval",
        "os": "ubuntu",
        "device": "cpu",
        "requirements_file": "requirements.txt",
        "command": "python ../../.github/scripts/run_vision_eval.py --config cpu/int4/config.json --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 100 --device cpu --perf --max-delta 0.05"
    },
    {
        "name": "gemma4-e2b-int4-cuda-vision-eval",
        "os": "ubuntu",
        "device": "cuda",
        "requirements_file": "requirements.txt",
        "command": "python ../../.github/scripts/run_vision_eval.py --config cuda/int4/config.json --pytorch-model google/gemma-4-E2B-it --benchmarks textvqa --limit 200 --device gpu --perf --max-delta 0.05"
    }
]
```

### 3. Smarter CI filtering

`generate_matrix.py` now accepts `--changed-files` to only run recipes whose files were modified:

| Scenario | What runs |
|----------|-----------|
| PR touches `Qwen-Qwen3-VL-2B-Instruct/` | Only Qwen3-VL jobs |
| PR touches `google-gemma-4-E2B-it/` | Only Gemma 4 jobs |
| PR touches `.github/scripts/run_vision_eval.py` | All recipes (shared infra) |
| PR touches `.github/workflows/main.yml` | All recipes (shared infra) |
| `workflow_dispatch` (manual trigger) | All recipes |
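
The selection rules in the table can be sketched in a few lines of Python. This is an illustration of the behavior described here, not the actual code in `generate_matrix.py`: the function name `select_recipes`, the `SHARED_INFRA` set, and the convention that `None` stands for a manual `workflow_dispatch` run are all assumptions.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

# Paths that affect every recipe (taken from the table above).
SHARED_INFRA = {
    ".github/scripts/run_vision_eval.py",
    ".github/workflows/main.yml",
}


def select_recipes(
    all_recipes: Iterable[str],
    changed_files: Optional[Iterable[str]] = None,
) -> List[str]:
    """Pick the recipe folders whose CI jobs should run.

    `changed_files=None` models a manual trigger, where everything runs.
    """
    recipes = sorted(all_recipes)
    if changed_files is None:
        return recipes

    changed = [PurePosixPath(path) for path in changed_files]

    # A change to shared infrastructure re-runs every recipe.
    if any(str(path) in SHARED_INFRA for path in changed):
        return recipes

    # Otherwise run only recipes whose top-level folder was touched.
    touched = {path.parts[0] for path in changed if path.parts}
    return [recipe for recipe in recipes if recipe in touched]
```

For example, `select_recipes(recipes, ["Qwen-Qwen3-VL-2B-Instruct/cpu/int4/config.json"])` would return only the Qwen3-VL recipe, while passing `.github/workflows/main.yml` would return all of them.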

### 4. Updated workflow triggers

`main.yml` now triggers on:

- `**/olive_ci.json` (existing)
- `**/config.json` (new: recipe config changes)
- `.github/scripts/run_vision_eval.py` (new: eval script changes)

## Example CI output

```
============================================================
Running TextVQA (exact_match)
  Dataset: facebook/textvqa (split=validation)
  Limit: 100
============================================================

  [ONNX] Evaluating...
  [ONNX] PASS: 0.3800 (45.2s)

  [PyTorch] Evaluating google/gemma-4-E2B-it...
  [PyTorch] 0.4200 (82.1s)
  [Delta] PyTorch - ONNX = +0.0400 (+4.00pp)

============================================================
PERFORMANCE BENCHMARKS
============================================================
  Running 10 inference iterations...
  Avg latency:      0.823s
  P50 latency:      0.801s
  P90 latency:      0.912s
  Avg tokens/run:   31.4
  Tokens/sec:       38.2
  Peak GPU memory:  4210 MB

============================================================
RESULTS SUMMARY
============================================================
  Benchmark                           ONNX  PyTorch    Delta  ONNX Time    PT Time
  -----------------------------------  --------  --------  --------  ----------  ----------
  TextVQA (exact_match)               0.3800   0.4200  +0.0400      45.2s      82.1s

  Avg ONNX speedup vs PyTorch: 1.82x

All benchmarks passed.
```
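
The latency and throughput figures in the performance section can be derived from per-run timings roughly as follows. This is a sketch under stated assumptions, not the script's implementation: `summarize_latency` and `percentile` are hypothetical names, percentiles use the nearest-rank method, and throughput is taken as total tokens over total time. The script may compute these differently.

```python
from typing import Dict, List, Sequence


def percentile(ordered: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_latency(
    latencies: List[float], token_counts: List[int]
) -> Dict[str, float]:
    """Aggregate per-run latencies (seconds) and generated token counts."""
    if not latencies or len(latencies) != len(token_counts):
        raise ValueError("need one token count per latency sample")

    ordered = sorted(latencies)
    total_time = sum(latencies)
    total_tokens = sum(token_counts)
    runs = len(latencies)

    return {
        "avg_latency": total_time / runs,
        "p50_latency": percentile(ordered, 50),
        "p90_latency": percentile(ordered, 90),
        "avg_tokens": total_tokens / runs,
        "tokens_per_sec": total_tokens / total_time,
    }
```

With only 10 samples (the `--perf-samples` default), P90 is effectively the second-slowest run under this method, so a larger sample count gives a more stable tail estimate.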

## How to add vision eval to a new VLM recipe

1. Create your recipe folder with configs:

   ```
   olive-recipes/
     my-new-vlm/
       requirements.txt
       cpu/int4/config.json
       cuda/int4/config.json
   ```

2. Add an `olive_ci.json`:

   ```json
   [
       {
           "name": "my-vlm-int4-cpu-vision-eval",
           "os": "ubuntu",
           "device": "cpu",
           "requirements_file": "requirements.txt",
           "command": "python ../../.github/scripts/run_vision_eval.py --config cpu/int4/config.json --pytorch-model org/model-name --benchmarks textvqa --limit 100 --device cpu --perf --max-delta 0.05"
       },
       {
           "name": "my-vlm-int4-cuda-vision-eval",
           "os": "ubuntu",
           "device": "cuda",
           "requirements_file": "requirements.txt",
           "command": "python ../../.github/scripts/run_vision_eval.py --config cuda/int4/config.json --pytorch-model org/model-name --benchmarks textvqa --limit 200 --device gpu --perf --max-delta 0.05"
       }
   ]
   ```

3. Open a PR. CI will pick it up automatically.
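
Before opening the PR, it may help to sanity-check a new `olive_ci.json` locally. The sketch below is a hypothetical helper, not part of this PR: `validate_ci_config` and the `REQUIRED_KEYS` set are assumptions inferred from the example entries above, and the real pipeline may accept other fields.

```python
import json
from typing import List

# Keys that every entry in the examples above carries (an inferred assumption).
REQUIRED_KEYS = {"name", "os", "device", "requirements_file", "command"}


def validate_ci_config(text: str) -> List[str]:
    """Return a list of problems found in an olive_ci.json document."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]

    problems = []
    seen_names = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a JSON object")
            continue

        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")

        # Duplicate job names would be hard to tell apart in CI results.
        name = entry.get("name")
        if name in seen_names:
            problems.append(f"entry {index}: duplicate name {name!r}")
        seen_names.add(name)

    return problems
```

A call such as `validate_ci_config(open("my-new-vlm/olive_ci.json").read())` returns an empty list when every entry has the expected keys.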

## Dependencies

- Olive (`olive-ai`) with vision evaluation support (PRs #2476 and #2488, both merged)
- `onnxruntime-genai` for ONNX inference
- `mobius-ai` for model export (Gemma 4)
- `pillow` for image handling
- Public HuggingFace datasets (no auth required for TextVQA, ChartQA, DocumentVQA)

## Testing checklist

- CPU CI jobs pass for all 4 recipes
- CUDA CI jobs pass for all 4 recipes
- PyTorch vs ONNX accuracy delta within 5pp tolerance
- Perf metrics (latency, throughput, memory) reported correctly
- Matrix filtering works: only changed recipes run on PR
- `workflow_dispatch` still runs all recipes

Copilot AI review requested due to automatic review settings June 4, 2026 18:08

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread .github/scripts/run_vision_eval.py Fixed
Copilot stopped reviewing on behalf of apsonawane due to an error June 4, 2026 19:33
lm-eval
mobius-ai
olive-ai[gpu]
onnxruntime==1.26.0
@hanbitmyths (Contributor) Jun 5, 2026

Should we pin the versions of onnxruntime and genai? If we want to check for regressions, should they be minimum versions instead? We can still pin the transformers or torch version, though.

@apsonawane (Contributor, Author)

I pinned this version because 1.25.1 is the minimum version needed for Qwen3.5 but I kept it to be latest. Also, the CI pipeline runs on 3.10 and 1.26.0 does not support it.
After 1.27.0 is released, we will need to pin to it, because Gemma would need that as the minimum.
We can definitely pin the transformers and torch versions.
