feat(mcv): add --no-gpu support for cache creation without GPU hardware#138

Open
maryamtahhan wants to merge 19 commits into redhat-et:main from maryamtahhan:feat/mcv-no-gpu-create

maryamtahhan (Collaborator) commented May 27, 2026

feat(mcv): Add --no-gpu support for cache creation without GPU hardware

Summary

This PR enables MCV to create and extract cache images without GPU hardware by using cache metadata instead of hardware detection. This is useful for CI/CD pipelines, development environments, and containerized workflows where GPU access isn't available.

Problem Statement

Previously, MCV required GPU hardware (NVIDIA or AMD) during cache image creation because it detected GPU information (architecture, backend, warp size) directly from the system. This made it impossible to:

  • Build cache images in CI/CD pipelines without GPUs
  • Create cache images on development machines without GPU access
  • Run MCV in containers without GPU passthrough
  • Distribute lightweight container images without GPU libraries

Solution

Core Changes

  1. --no-gpu Flag Support (main.go)

    • Fixed flag processing order so --no-gpu works with --create
    • Moved configureBoolFlags() before operation dispatch
  2. GPU Detection Bypass (vllm.go)

    • Added check for config.IsGPUEnabled() in detectActualGPUInfo()
    • Falls back to extracting GPU info from cache metadata when --no-gpu is set
    • Cache metadata already contains all necessary information:
      • VLLM_TARGET_DEVICE: Backend (cuda/rocm)
      • VLLM_PAGED_ATTN_ARCH: Architecture (sm_75, gfx1100, etc.)
      • VLLM_MAIN_CUDA_VERSION / ROCM_VERSION: Toolkit versions
  3. Multi-Variant Container Images (amd64.dockerfile)

    • mcv:minimal (~176MB): No GPU libraries, for --no-gpu workflows
    • mcv:amd (~924MB): ROCm libraries for AMD GPU validation
    • mcv:nvidia (~356MB): CUDA runtime + NVML for NVIDIA GPU validation
  4. CI/CD Integration (mcv-build-image.yml)

    • Updated GitHub Actions workflow to build all three variants
    • Added matrix strategy with variant-specific tags
    • Added separate cache scopes for each variant
  5. Documentation
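
The ordering fix in item 1 can be illustrated with a minimal Go sketch. It assumes the standard library flag package and a hypothetical settings struct; only the configureBoolFlags name comes from this PR, and its real signature and body may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// settings stands in for MCV's global configuration (hypothetical type).
type settings struct {
	gpuEnabled bool
}

// configureBoolFlags copies parsed boolean flags into the settings.
// The name mirrors the function mentioned in the PR; the body is illustrative.
func configureBoolFlags(s *settings, noGPU bool) {
	s.gpuEnabled = !noGPU
}

// run parses args and dispatches one operation, returning a short
// description of what it would do.
func run(args []string) (string, error) {
	fs := flag.NewFlagSet("mcv", flag.ContinueOnError)
	create := fs.Bool("create", false, "create a cache image")
	noGPU := fs.Bool("no-gpu", false, "skip GPU hardware detection")
	if err := fs.Parse(args); err != nil {
		return "", err
	}

	s := &settings{gpuEnabled: true}
	// Apply boolean flags first so every operation below sees them.
	// Doing this after the create branch is the kind of bug the PR describes.
	configureBoolFlags(s, *noGPU)

	if *create {
		return fmt.Sprintf("create (gpu detection: %t)", s.gpuEnabled), nil
	}
	return fmt.Sprintf("extract (gpu detection: %t)", s.gpuEnabled), nil
}

func main() {
	out, err := run([]string{"--create", "--no-gpu"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

If the flag configuration ran only after the create branch returned, --no-gpu would be parsed but never applied to create flows.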

Testing

Local Testing

# Build and test locally
cd mcv && make build

# Create cache image without GPU
./mcv --create --image docker.io/test/cache:v1 \
  --dir example/qwen-binary-cache --no-gpu

Result: ✅ Successfully created 99.8MB cache image with GPU metadata extracted from cache files:

  • Backend: cuda
  • Architecture: sm_75
  • CUDA Version: 12.9
  • Warp Size: 32

EC2 Testing

Built and verified all container variants on EC2 (x86_64):

# Build all variants
podman build --target mcv-minimal -t mcv:minimal -f mcv/images/amd64.dockerfile .
podman build --target mcv-full -t mcv:amd -f mcv/images/amd64.dockerfile .
podman build --target mcv-nvidia -t mcv:nvidia -f mcv/images/amd64.dockerfile .

Results:

  • mcv:minimal - 176MB (no GPU libraries)
  • mcv:amd - 924MB (with ROCm)
  • mcv:nvidia - 356MB (with CUDA + NVML)

How It Works

When --no-gpu is used:

  1. MCV skips GPU hardware detection via NVML/ROCm libraries
  2. Extracts GPU information from cache metadata (environment variables in cache_key_factors.json)
  3. Skips preflight compatibility checks (no hardware to compare against)
  4. Creates cache image with labels containing GPU metadata from cache

The vLLM/Triton cache files already contain all necessary GPU information stored by vLLM when the cache was generated on a GPU system.
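
A rough Go sketch of step 2 follows. The environment variable names come from this PR; the JSON layout (a top-level "env" object), the gpuInfo and cacheKeyFactors types, and the warp-size rule (32 for CUDA, 64 for gfx9 AMD parts, 32 for other AMD parts) are assumptions made for illustration, not MCV's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cacheKeyFactors models only the part of cache_key_factors.json used here.
// The top-level "env" key is an assumed layout.
type cacheKeyFactors struct {
	Env map[string]any `json:"env"`
}

// gpuInfo is a hypothetical summary of what the image labels need.
type gpuInfo struct {
	Backend  string
	Arch     string
	Toolkit  string
	WarpSize int
}

// gpuInfoFromMetadata derives GPU details from cache metadata alone.
func gpuInfoFromMetadata(data []byte) (gpuInfo, error) {
	var f cacheKeyFactors
	if err := json.Unmarshal(data, &f); err != nil {
		return gpuInfo{}, fmt.Errorf("parse cache metadata: %w", err)
	}
	// get returns the variable as a string, or "" if absent or not a string.
	get := func(key string) string {
		s, _ := f.Env[key].(string)
		return s
	}

	info := gpuInfo{
		Backend: get("VLLM_TARGET_DEVICE"),
		Arch:    get("VLLM_PAGED_ATTN_ARCH"),
	}
	if info.Backend == "" || info.Arch == "" {
		return gpuInfo{}, fmt.Errorf("cache metadata lacks backend or architecture")
	}

	switch info.Backend {
	case "cuda":
		info.Toolkit = get("VLLM_MAIN_CUDA_VERSION")
		info.WarpSize = 32
	case "rocm":
		info.Toolkit = get("ROCM_VERSION")
		info.WarpSize = 32
		if strings.HasPrefix(info.Arch, "gfx9") {
			info.WarpSize = 64 // assumed rule: gfx9 parts use 64-wide wavefronts
		}
	default:
		return gpuInfo{}, fmt.Errorf("unsupported backend %q", info.Backend)
	}
	return info, nil
}

func main() {
	sample := []byte(`{"env": {"VLLM_TARGET_DEVICE": "cuda",
		"VLLM_PAGED_ATTN_ARCH": "sm_75", "VLLM_MAIN_CUDA_VERSION": "12.9"}}`)
	info, err := gpuInfoFromMetadata(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", info)
}
```

The sample input mirrors the values reported in the local test above (cuda, sm_75, CUDA 12.9, warp size 32).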

Use Cases

| Use Case | Image Variant | Flag | GPU Required |
| --- | --- | --- | --- |
| CI/CD cache builds | minimal | --no-gpu | ❌ No |
| Extract cache (no validation) | minimal | --no-gpu | ❌ No |
| Development/testing | minimal | --no-gpu | ❌ No |
| Production (AMD) | amd | (none) | ✅ AMD GPU |
| Production (NVIDIA) | nvidia | (none) | ✅ NVIDIA GPU |
| Compatibility checks (AMD) | amd | (none) | ✅ AMD GPU |
| Compatibility checks (NVIDIA) | nvidia | (none) | ✅ NVIDIA GPU |

Container Image Usage

# Build minimal image (no GPU libraries)
docker build --target mcv-minimal -t quay.io/gkm/mcv:minimal -f mcv/images/amd64.dockerfile .

# Create cache image without GPU
docker run --rm \
  -v /path/to/cache:/cache:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  quay.io/gkm/mcv:minimal \
  /mcv --create --image quay.io/myorg/cache:v1 --dir /cache --no-gpu

# Build AMD image (with ROCm)
docker build --target mcv-amd -t quay.io/gkm/mcv:amd -f mcv/images/amd64.dockerfile .

# Build NVIDIA image (with CUDA)
docker build --target mcv-nvidia -t quay.io/gkm/mcv:nvidia -f mcv/images/amd64.dockerfile .

CI/CD Example

name: Build Cache Images
on: push

jobs:
  build-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Generate vLLM cache
        run: python generate_cache.py
      
      - name: Build cache OCI image
        run: |
          docker run --rm \
            -v $(pwd)/.cache/vllm:/cache:ro \
            -v /var/run/docker.sock:/var/run/docker.sock \
            quay.io/gkm/mcv:minimal \
            /mcv --create --image quay.io/myorg/cache:${{ github.sha }} \
            --dir /cache --no-gpu
      
      - name: Push image
        run: docker push quay.io/myorg/cache:${{ github.sha }}

Benefits

  • No GPU required for cache image creation
  • Smaller images for distribution (~176MB vs ~900MB)
  • CI/CD friendly: runs on standard GitHub Actions runners
  • Flexible deployment: choose GPU-specific or minimal variants
  • Backwards compatible: existing workflows continue to work

Breaking Changes

None. All changes are additive:

  • --no-gpu is a new optional flag
  • Default behavior (GPU detection) unchanged
  • New container variants don't affect existing images

Limitations

When using --no-gpu:

  • ⚠️ No preflight compatibility checks (assumes cache metadata is correct)
  • ⚠️ Cannot detect mismatches between cache and actual hardware
  • ⚠️ PTX version validation skipped for CUDA caches

Recommendation: Use --no-gpu for building/distribution, but validate with actual GPU hardware before production deployment.
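
To make concrete what --no-gpu skips, here is a hedged Go sketch of a preflight comparison between cache metadata and detected hardware. The gpuInfo type and preflight function are hypothetical, and MCV's real checks (including PTX version validation) are more involved than a plain equality test.

```go
package main

import "fmt"

// gpuInfo is a hypothetical pairing of backend and architecture.
type gpuInfo struct {
	Backend string
	Arch    string
}

// preflight reports an error when the cache was built for different
// hardware than the host provides. With --no-gpu there is no host value
// to compare against, so a check like this cannot run.
func preflight(cache, host gpuInfo) error {
	if cache.Backend != host.Backend {
		return fmt.Errorf("backend mismatch: cache built for %q, host has %q",
			cache.Backend, host.Backend)
	}
	if cache.Arch != host.Arch {
		return fmt.Errorf("architecture mismatch: cache built for %q, host has %q",
			cache.Arch, host.Arch)
	}
	return nil
}

func main() {
	cache := gpuInfo{Backend: "cuda", Arch: "sm_75"}
	host := gpuInfo{Backend: "cuda", Arch: "sm_90"}
	if err := preflight(cache, host); err != nil {
		fmt.Println("preflight failed:", err)
	}
}
```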

Commits

  • 6ff77bb3 feat(mcv): add --no-gpu support for cache creation without GPU hardware
  • c200115a feat(mcv): add NVIDIA/CUDA target to multi-stage Dockerfile
  • 1241116d ci(mcv): build all three image variants in GitHub Actions
  • 8407badc chore: fix trailing whitespace in no-gpu-usage.md

Related Issues

Checklist

  • Code changes implemented and tested
  • Documentation added (no-gpu-usage.md)
  • README updated with quick start
  • CI/CD workflow updated to build all variants
  • Local testing completed successfully
  • EC2 testing completed (minimal & AMD variants)
  • Pre-commit hooks passed
  • NVIDIA variant build verification
  • Test GPU images extraction and preflight checks

Summary by CodeRabbit

  • New Features

    • Added --no-gpu mode and introduced image variants: minimal (default/:latest), AMD (:amd), NVIDIA (:nvidia), and a unified GPU-support image.
  • Build & Release

    • CI and Make automation now build and publish per-variant images with variant-specific tags and isolated per-variant cache scopes; added targets to build/push all variants.
  • Bug Fixes

    • Entrypoint and GPU detection now respect --no-gpu to avoid unintended GPU initialization.
  • Documentation

    • New no-GPU and unified-image guides, README updates, examples, and usage notes.

This commit enables MCV to create and extract cache images without GPU
hardware by using cache metadata instead of hardware detection.

Changes:
- main.go: Move configureBoolFlags() before create check so --no-gpu works with --create
- vllm.go: Add GPU detection bypass in detectActualGPUInfo() when --no-gpu is set
- amd64.dockerfile: Add multi-stage build with minimal and full targets
  - mcv-minimal (~200MB): No GPU libraries, for --no-gpu workflows
  - mcv-full (~2GB): Includes ROCm for GPU validation
- docs/no-gpu-usage.md: Comprehensive usage guide with examples
- README.md: Added No-GPU Mode section with quick start

How it works:
When --no-gpu is used, MCV skips hardware detection and extracts GPU
information (backend, architecture, warp size) from cache metadata that
vLLM/Triton stores in cache files (VLLM_TARGET_DEVICE, VLLM_PAGED_ATTN_ARCH,
VLLM_MAIN_CUDA_VERSION, etc.).

Benefits:
- CI/CD pipelines without GPU access
- Containerized workflows without GPU passthrough
- Minimal container images for cache distribution
- Development environments without GPU hardware

Tested with qwen-binary-cache example - successfully created cache image
with GPU metadata extracted from cache files.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@coderabbitai

coderabbitai Bot commented May 27, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds multi-target container images (minimal, AMD ROCm, NVIDIA CUDA, unified), local Makefile targets and a CI workflow matrix to build/push them, moves flag configuration so --no-gpu applies to create flows, short-circuits GPU detection when disabled, updates the container entrypoint, and provides comprehensive no-GPU and unified-image documentation.

Changes

Multi-image GPU variant support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Multi-target Dockerfile and image targets | mcv/images/amd64.dockerfile | Reworks Dockerfile into builder + named runtime targets: mcv-minimal, mcv-amd, mcv-nvidia, mcv-unified; updates builder GO_VERSION, adds tarball SHA256 verification, labels, and build examples. |
| Container entrypoint executable | mcv/images/entrypoint.sh | Entrypoint now exec /mcv "$@" to run the compiled binary with forwarded arguments. |
| GitHub Actions workflow matrix | .github/workflows/mcv-build-image.yml | Expands build matrix to variants (minimal, amd, nvidia), updates job display name, and scopes Buildx cache and tags to the variant. |
| Local Makefile image build targets | mcv/Makefile | Adds IMAGE_* vars, detects CONTAINER_RUNTIME (docker/podman), and adds phony targets image-minimal, image-amd, image-nvidia, image-unified, images, and image-push. |
| --no-gpu CLI wiring & detection guard | mcv/cmd/main.go, mcv/pkg/cache/vllm.go | Configure boolean flags before action selection so --no-gpu/--stub apply to create flows; detectActualGPUInfo returns unknown/zero when GPU detection is disabled. |
| AMD JSON parsing robustness | mcv/pkg/accelerator/devices/amd.go | Truncate trailing non-JSON from amd-smi static --json output before unmarshalling; accept variable JSON types for some fields (e.g., MaxPCIeWidth, MaxPCIeSpeed, ECCBlockState). |
| No-GPU usage docs & README updates | mcv/docs/no-gpu-usage.md, mcv/README.md | Adds a detailed No-GPU Mode usage guide, examples, CI snippet, GPU access requirements, a mode-selection table, troubleshooting, and README clarifications about image tags (:latest/:amd/:nvidia). |
| Unified image documentation | mcv/docs/unified-mcv-container.md | New unified-image doc describing NVIDIA+AMD support, runtime detection, Podman/Kubernetes usage, CI examples, troubleshooting, migration guidance, and build/tag instructions. |
| Formatting and alignment cleanup | mcv/pkg/cache/dummytritonkey.go, mcv/pkg/accelerator/devices/amd.go | Minor whitespace/comment alignment and formatting-only rewrites; no functional changes. |
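
The AMD JSON parsing row can be approximated in Go as follows. This sketch swaps explicit truncation for json.Decoder, which reads one JSON value and leaves any trailing text unread. The amdStatic struct, its JSON field names, and the flexValue type are hypothetical and do not reflect the real amd-smi schema or the types in amd.go.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// flexValue accepts a JSON string or any other JSON literal and stores it
// as text, so a field may arrive as 16, "16", or "N/A".
type flexValue string

func (f *flexValue) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err == nil {
		*f = flexValue(s)
		return nil
	}
	// Not a JSON string: keep the raw literal (number, bool, object) as text.
	*f = flexValue(string(b))
	return nil
}

// amdStatic is a hypothetical, flattened view of one GPU entry.
type amdStatic struct {
	GPU          int       `json:"gpu"`
	MaxPCIeWidth flexValue `json:"max_pcie_width"`
}

// parseAMDStatic decodes the first JSON value in out and ignores whatever
// follows it, such as warnings printed after the JSON document.
func parseAMDStatic(out []byte) ([]amdStatic, error) {
	var gpus []amdStatic
	dec := json.NewDecoder(bytes.NewReader(out))
	if err := dec.Decode(&gpus); err != nil {
		return nil, fmt.Errorf("decode amd-smi output: %w", err)
	}
	return gpus, nil
}

func main() {
	out := []byte(`[{"gpu": 0, "max_pcie_width": 16},
		{"gpu": 1, "max_pcie_width": "N/A"}]
WARNING: this trailing line is not JSON`)
	gpus, err := parseAMDStatic(out)
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("gpu %d: width %s\n", g.GPU, g.MaxPCIeWidth)
	}
}
```

A plain json.Unmarshal on the same bytes would fail because of the trailing text, which is why some form of truncation or single-value decoding is needed.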

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(mcv): add --no-gpu support for cache creation without GPU hardware' directly and clearly describes the main feature being added, which is the primary change across the entire changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

maryamtahhan and others added 4 commits May 27, 2026 19:51
Add mcv-nvidia target using NVIDIA CUDA base image for NVIDIA GPU support.
This provides a third container image variant alongside minimal (no GPU) and
AMD (ROCm) variants.

Changes:
- amd64.dockerfile: Add mcv-nvidia target based on nvcr.io/nvidia/cuda:12.6.3-base
- README.md: Update to show all three variants (minimal, AMD, NVIDIA)
- no-gpu-usage.md: Document NVIDIA variant with usage examples

Image variants:
- mcv:minimal (~200MB): No GPU libraries, for --no-gpu workflows
- mcv:amd (~2GB): ROCm libraries for AMD GPUs
- mcv:nvidia (~1.5GB): CUDA runtime + NVML for NVIDIA GPUs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update mcv-build-image.yml workflow to build all three MCV variants:
- minimal (~200MB): No GPU libraries, for --no-gpu workflows
- amd (~900MB): ROCm libraries for AMD GPU validation
- nvidia (~1.5GB): CUDA runtime + NVML for NVIDIA GPU validation

Changes:
- Add matrix strategy with three variants
- Add target parameter to specify build stage
- Add variant-specific cache scopes
- Add suffix tags for each variant (e.g., mcv:minimal, mcv:amd, mcv:nvidia)
- Keep AMD as default (latest tag)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added detailed documentation explaining when GPU access flags
(--gpus all for NVIDIA, --device flags for AMD) are required:

- GPU access flags ONLY needed for validation/preflight checks
- NOT required when using --no-gpu flag for creation/extraction
- Added "GPU Access Requirements" section to no-gpu-usage.md
- Updated README.md to clarify when GPU flags are needed
- Includes examples for both Docker and Podman

Also includes minor formatting fixes from linter.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch from 8407bad to a8e5710 Compare May 28, 2026 08:07
Changed the default container image from AMD to minimal variant:

- quay.io/gkm/mcv:latest now maps to minimal (~200MB) instead of AMD (~2GB)
- Minimal is more versatile (works everywhere with --no-gpu)
- Smallest image size, ideal for CI/CD pipelines
- Users needing GPU validation explicitly choose :amd or :nvidia

Changes:
- Updated GitHub workflow to tag minimal with 'latest'
- Updated Makefile to build minimal as default
- Added image build targets: make image-minimal, image-amd, image-nvidia, images
- Updated README.md and no-gpu-usage.md documentation
- Added image tags to documentation for clarity

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch from a8e5710 to 5df108d Compare May 28, 2026 08:10
maryamtahhan and others added 6 commits May 28, 2026 09:15
… failures

Moved container runtime (docker/podman) detection from Makefile parse time
to target execution time. This prevents the error check from running during
`make tidy-vendor` inside the Dockerfile build, where docker/podman don't exist.

Changes:
- Changed CONTAINER_RUNTIME from immediate (?=) to deferred (=) evaluation
- Removed top-level error check that ran at parse time
- Added runtime check inside each image-* target recipe
- Now only validates container runtime when actually building images

This fixes the build failure on EC2 where the Dockerfile's "RUN make tidy-vendor"
step was failing with "No container runtime found".

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed --build-arg BUILDPLATFORM=linux/amd64 from all image build targets.
The --platform flag already specifies the target platform, making the
build arg redundant.

This eliminates the warning:
"one or more build args were not consumed: [BUILDPLATFORM]"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixed image tags to use variant names directly instead of concatenating
with IMAGE_TAG variable:

Before:
- quay.io/gkm/mcv:latest-minimal
- quay.io/gkm/mcv:latest-amd
- quay.io/gkm/mcv:latest-nvidia

After:
- quay.io/gkm/mcv:minimal (+ :latest for minimal only)
- quay.io/gkm/mcv:amd
- quay.io/gkm/mcv:nvidia

This matches the GitHub workflow tag structure.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed all documentation from docker to podman commands:
- Build commands: docker build → podman build
- Run commands: docker run → podman run
- Removed socket mounts, added --privileged where needed
- NVIDIA GPU flag: --gpus all → --device nvidia.com/gpu=all

Updated image sizes to reflect actual build results:
- Minimal: ~200MB → ~176MB
- AMD: ~2GB → ~923MB
- NVIDIA: ~1.5GB → ~356MB

Changes across:
- mcv/docs/no-gpu-usage.md
- mcv/README.md
- mcv/Makefile

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixed entrypoint.sh to invoke /mcv with the provided arguments.

Before: exec "$@" (tried to execute --extract as a command)
After: exec /mcv "$@" (invokes /mcv --extract ...)

This fixes the error:
"/entrypoint.sh: 5: exec: --extract: not found"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added hwdata package to AMD and NVIDIA container images to provide
/usr/share/hwdata/pci.ids database. This is required by the ghw library
to identify PCI devices as GPU accelerators.

Without this file, ghw returns 0 accelerators even when GPUs are present
on the PCI bus, preventing NVML/ROCm initialization.

Fixes GPU detection in containerized environments with manual device mounts.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan marked this pull request as ready for review May 28, 2026 09:24
Added missing --dir /cache parameter to the NVIDIA GPU validation example.
The command was incomplete without specifying the output directory.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch from 5189125 to 5dd6939 Compare May 28, 2026 09:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mcv/docs/no-gpu-usage.md`:
- Around line 117-118: The examples use the unpublished image tag
"quay.io/gkm/mcv:full"; update those occurrences to the published AMD tag
"quay.io/gkm/mcv:amd" (replace "quay.io/gkm/mcv:full" with
"quay.io/gkm/mcv:amd") in the example commands so the extract/image flags (e.g.,
the lines showing "--extract --image quay.io/myorg/vllm-cache:v1 --dir /cache")
refer to the correct published AMD image; ensure all instances (including the
ones around the shown diff and the occurrences noted at lines ~117 and ~130) are
changed.

In `@mcv/images/amd64.dockerfile`:
- Around line 98-101: The Dockerfile is downloading the Ubuntu Jammy ROCm .deb
into the Debian Bookworm-based mcv-full stage which is brittle; replace the
Jammy-specific install with the Debian/Bookworm-compatible ROCm install flow:
remove the wget of the jammy URL and the local .deb install steps and instead
configure the official ROCm APT repository for Debian Bookworm (using
ROCM_VERSION/AMDGPU_VERSION variables as needed), apt-get update, and install
the ROCm packages (e.g., amd-smi-lib/rocm-smi-lib) from that repo; update the
RUN commands in mcv/images/amd64.dockerfile that reference the wget line and the
apt install ./*.deb sequence so the image uses the Debian repo-based installer
rather than the Ubuntu Jammy package.
- Around line 21-23: The Dockerfile currently downloads and extracts Go using
the RUN block that references GO_VERSION without verifying the tarball; update
the RUN step that uses wget/tar to also download the official SHA256 checksum
for the matching release, compute the SHA256 of /tmp/go.tgz (e.g., via
sha256sum), compare it to the expected checksum string
(957647d3d78995393c200542ab4c23c72b220c3848b6250787a2d48083818314 for go1.24.6),
and fail the build if mismatched before extracting: keep the same GO_VERSION
variable usage and ensure the verification occurs between the download (wget)
and the tar -xzf step in the RUN that currently removes /tmp/go.tgz.

In `@mcv/README.md`:
- Around line 533-536: The README has trailing whitespace in the updated
paragraph (around the line containing "quay.io/gkm/mcv:latest" /
"quay.io/gkm/mcv:amd" / "quay.io/gkm/mcv:nvidia"); run your project's pre-commit
hooks (or manually remove trailing spaces) to clean up trailing whitespace in
mcv/README.md, then commit the whitespace-only fix so the `trailing-whitespace`
pre-commit check passes.

📥 Commits

Reviewing files that changed from the base of the PR and between eae0569 and 5189125.

📒 Files selected for processing (10)
  • .github/workflows/mcv-build-image.yml
  • mcv/Makefile
  • mcv/README.md
  • mcv/cmd/main.go
  • mcv/docs/no-gpu-usage.md
  • mcv/images/amd64.dockerfile
  • mcv/images/entrypoint.sh
  • mcv/pkg/accelerator/devices/amd.go
  • mcv/pkg/cache/dummytritonkey.go
  • mcv/pkg/cache/vllm.go

Comment thread mcv/docs/no-gpu-usage.md Outdated
Comment thread mcv/images/amd64.dockerfile
Comment thread mcv/images/amd64.dockerfile Outdated
Comment thread mcv/README.md Outdated
Fixed issues from code review:

1. **docs/no-gpu-usage.md**: Replaced unpublished tag references
   - Changed quay.io/gkm/mcv:full → quay.io/gkm/mcv:amd (lines 117, 130)
   - Ensures examples use the published AMD image tag

2. **images/amd64.dockerfile**: Added Go tarball SHA256 verification
   - Added checksum verification for go1.24.6 download (line 22)
   - Prevents installation of corrupted/tampered Go binaries
   - Checksum: 957647d3d78995393c200542ab4c23c72b220c3848b6250787a2d48083818314

3. **images/amd64.dockerfile**: Fixed ROCm installation for Debian
   - Replaced Ubuntu Jammy .deb download with Debian Bookworm APT repo (lines 98-102)
   - Uses official ROCm repository for proper Debian compatibility
   - Fixes brittle cross-distro package installation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
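
The Dockerfile performs the checksum step with shell tools, but the same verify-before-extract logic can be sketched in Go. The helper names sha256Hex and verifyBytes are hypothetical, and the payload is a stand-in string rather than the Go tarball.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// sha256Hex streams r through SHA-256 and returns the lowercase hex digest.
// Streaming avoids loading a large tarball into memory.
func sha256Hex(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", fmt.Errorf("hash input: %w", err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// verifyBytes fails unless data hashes to the expected digest.
func verifyBytes(data []byte, want string) error {
	got, err := sha256Hex(bytes.NewReader(data))
	if err != nil {
		return err
	}
	if !strings.EqualFold(got, strings.TrimSpace(want)) {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}

func main() {
	// Stand-in payload; a real check would read the downloaded archive.
	payload := []byte("hello")
	const want = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	if err := verifyBytes(payload, want); err != nil {
		fmt.Println("refusing to extract:", err)
		return
	}
	fmt.Println("checksum ok")
}
```

The key property is ordering: the comparison happens after the download and before extraction, so a mismatched archive is never unpacked.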
@coderabbitai

coderabbitai Bot commented May 28, 2026


Actionable comments posted: 0

maryamtahhan and others added 3 commits May 28, 2026 12:05
Updated Dockerfile to use Go 1.25.0 (from 1.24.6) to match project
requirements in go.mod (requires 1.25.5).

Updated SHA256 checksum to match go1.25.0.linux-amd64.tar.gz:
2852af0cb20a13139b3448992e69b868e50ed0f8a1e5940ee1de9e19a123b613

This fixes the build failure where the checksum verification was failing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…orted

Reverted ROCm installation to use Ubuntu Jammy packages because ROCm
does not officially support Debian. The AMD repository only provides
packages for Ubuntu distributions.

Error encountered:
  https://repo.radeon.com/rocm/apt/7.0.1 bookworm Release: 404 Not Found

ROCm officially supports:
- Ubuntu 20.04 (Focal)
- Ubuntu 22.04 (Jammy)
- RHEL/Rocky

Using Ubuntu packages on Debian is the standard approach when upstream
doesn't provide native Debian packages. The packages are compatible as
both use the same base libraries.

This addresses the code review comment but explains why the Ubuntu
approach is necessary rather than brittle.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Renamed the Docker build target from "mcv-full" to "mcv-amd" to match
the published image tag naming convention.

Changes:
- Dockerfile: Renamed stage "FROM mcv-minimal AS mcv-full" → "AS mcv-amd"
- Dockerfile: Updated build command comments to use mcv-amd target
- GitHub workflow: Updated target from mcv-full → mcv-amd
- Makefile: Updated image-amd target to build mcv-amd
- README: Updated build example to use mcv-amd target

This ensures consistency between the build target name and the published
image tags (quay.io/gkm/mcv:amd).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
mcv/Makefile (1)

197-207: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Push the explicit :unified tag too.

image-unified builds both :unified and :latest, but image-push only publishes :latest. That leaves the documented :unified artifact stale or missing even though local builds create it.

Suggested patch
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):minimal
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):amd
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):nvidia
+	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):latest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/Makefile` around lines 197 - 207, The image-push target currently pushes
minimal, amd, nvidia and latest tags but omits the unified tag; update the
image-push recipe (target: image-push) to also push
$(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified (e.g., add a
$(CONTAINER_RUNTIME) push ...:unified line alongside the other tag pushes) so
the :unified artifact produced by image-unified is published.
mcv/images/amd64.dockerfile (1)

118-128: ⚠️ Potential issue | 🟠 Major

Align GPGME package with Ubuntu 24.04 (Noble) base in mcv-nvidia.

mcv-nvidia (based on nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04) installs libgpgme11, but mcv-unified on the same Ubuntu 24.04 base installs libgpgme11t64. Ubuntu 24.04 replaces libgpgme11 with libgpgme11t64, so mcv-nvidia is likely to fail during apt-get install.

Suggested patch
 RUN apt-get update && apt-get install -y --no-install-recommends \
-    libgpgme11 \
+    libgpgme11t64 \
     libbtrfs0 \
     libffi8 \
     libc6 \
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/images/amd64.dockerfile` around lines 118 - 128, Replace the obsolete
package name libgpgme11 with libgpgme11t64 in the apt-get install command in the
Dockerfile RUN block (the line that currently installs libgpgme11 along with
libbtrfs0, libffi8, libc6, etc.) so the image matches the Ubuntu 24.04 (Noble)
package names used by mcv-unified and avoids apt install failures on the
nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 base.
♻️ Duplicate comments (1)
mcv/images/amd64.dockerfile (1)

98-104: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use a ROCm-supported base for the ROCm-capable images.

mcv-amd adds a Jammy ROCm repo to Debian Bookworm, and mcv-unified adds the same Jammy repo to Ubuntu 24.04. In both stages, apt is resolving ROCm packages against a foreign distro, so these builds are brittle and can break on the next ROCm or base-image update. The ROCm variants should be built from a distro ROCm actually supports instead of layering Jammy packages onto Bookworm/Noble.

Also applies to: 181-192

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/images/amd64.dockerfile` around lines 98 - 104, The Dockerfile currently
adds Jammy ROCm packages onto Debian Bookworm/Ubuntu 24.04 (see RUN wget ...
amdgpu-install_${AMDGPU_VERSION} and subsequent apt installs), which is
unsupported and brittle; update the mcv-amd and mcv-unified build stages to use
an ROCm-supported base image (e.g., an Ubuntu Jammy or official ROCm base image
matching ROCM_VERSION/OPT_ROCM_VERSION) instead of layering Jammy repos onto
Bookworm/Noble, remove the cross-distro apt repo additions and foreign-package
installs (the RUN wget/apt install ./*.deb and ln -s steps), and ensure
ROCM_VERSION/AMDGPU_VERSION/OPT_ROCM_VERSION are aligned with the chosen base so
apt update/apt install resolve natively; rebuild and verify amd-smi/rocm-smi
binaries exist in /opt/rocm-${OPT_ROCM_VERSION}/bin in those stages.
🧹 Nitpick comments (1)
mcv/docs/unified-mcv-container.md (1)

20-28: ⚖️ Poor tradeoff

Add security guidance for privileged container usage.

The documentation extensively recommends --privileged (lines 24, 65, 78, 93, 109) and privileged: true (lines 140, 195) without explaining security implications or providing alternatives. Privileged containers bypass kernel security boundaries (AppArmor, SELinux, seccomp) and grant access to all host devices.

Consider adding:

  1. A security notice early in the document explaining privileged implications
  2. Alternative approaches where applicable (e.g., specific --device flags for GPU access may be sufficient in some scenarios)
  3. A note that --privileged is required for certain operations (like buildah in cache creation) but may be reducible for read-only operations

Example addition after line 28:

> **Security Note:** The `--privileged` flag grants the container extensive host access. 
> This is required for cache creation (buildah) and certain GPU operations. 
> For production deployments, evaluate whether specific device access (`--device`) 
> or reduced capabilities meet your security requirements.

Also applies to: 59-103, 115-204

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/docs/unified-mcv-container.md` around lines 20 - 28, The document
repeatedly recommends using --privileged and privileged: true without warnings;
add a short security notice near the top of unified-mcv-container.md (after the
Basic Usage block) that explains the security implications of --privileged,
calls out that it bypasses AppArmor/SELinux/seccomp and grants host device
access, and states that --privileged is only required for certain operations
(e.g., buildah-based cache creation) while read-only or runtime GPU usage may be
satisfied with targeted alternatives like --device flags or reduced
capabilities; also annotate the other occurrences of --privileged / privileged:
true in the file (the sections around lines 59-103 and 115-204) to recommend
these alternatives and clarify when --privileged is necessary versus when
device-level access suffices.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mcv/docs/unified-mcv-container.md`:
- Around line 44-48: The fenced code block showing runtime detection logic lacks
a language specifier which prevents proper rendering; update the block around
the three lines referencing nvmlCheck(), rocmCheck(), and --no-gpu to include a
language identifier (for example use ```text) immediately after the opening
backticks so the block becomes ```text ... ``` to enable correct formatting and
highlighting.
- Around line 389-399: Update the documented snippet to reflect actual
registration order and logic: mention that registerDevices calls staticCheck(r)
first when config.IsStubEnabled() is true, then calls amdCheck(r), rocmCheck(r),
nvmlCheck(r) sequentially (no short-circuit “first wins”), and note AMD
detection relies on amd-smi via initAMDLib which uses utils.HasApp("amd-smi");
also state that AMD/ROCm exclusivity is handled in addDeviceInterface (AMD
unregisters ROCm and ROCm is skipped if AMD already registered).

---

Outside diff comments:
In `@mcv/images/amd64.dockerfile`:
- Around line 118-128: Replace the obsolete package name libgpgme11 with
libgpgme11t64 in the apt-get install command in the Dockerfile RUN block (the
line that currently installs libgpgme11 along with libbtrfs0, libffi8, libc6,
etc.) so the image matches the Ubuntu 24.04 (Noble) package names used by
mcv-unified and avoids apt install failures on the
nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 base.

In `@mcv/Makefile`:
- Around line 197-207: The image-push target currently pushes minimal, amd,
nvidia and latest tags but omits the unified tag; update the image-push recipe
(target: image-push) to also push
$(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified (e.g., add a
$(CONTAINER_RUNTIME) push ...:unified line alongside the other tag pushes) so
the :unified artifact produced by image-unified is published.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: c3d769f1-4418-406d-974c-3d013a572bc2

📥 Commits

Reviewing files that changed from the base of the PR and between 31fc11e and 6023f77.

📒 Files selected for processing (5)
  • .github/workflows/mcv-build-image.yml
  • mcv/Makefile
  • mcv/README.md
  • mcv/docs/unified-mcv-container.md
  • mcv/images/amd64.dockerfile
✅ Files skipped from review due to trivial changes (1)
  • mcv/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/mcv-build-image.yml

Comment on lines +389 to +399
### Runtime Detection Logic

From `mcv/pkg/accelerator/devices/device.go`:
```go
func registerDevices(r *Registry) {
	amdCheck(r)  // Checks for AMD GPUs via sysfs
	rocmCheck(r) // Checks for rocm-smi CLI tool
	nvmlCheck(r) // Checks for libnvidia-ml.so.1 library
	// First successful check wins!
}
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the registerDevices function matches the documented behavior

# Search for the registerDevices function implementation
ast-grep --pattern $'func registerDevices($$$) {
  $$$
}'

# Also check for the specific check function calls
rg -nP -A5 'func registerDevices' mcv/pkg/accelerator/devices/

Repository: redhat-et/GKM

Length of output: 2520


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find implementations of amdCheck/rocmCheck/nvmlCheck
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices/device.go mcv/pkg/accelerator/devices || true

# Also print surrounding code for each call site area and definitions if found
for fn in amdCheck rocmCheck nvmlCheck; do
  echo "---- $fn ----"
  rg -n -A40 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done

# Find any logic in Registry about "first successful" / "success wins" / short-circuit
rg -n "type Registry|First successful|short.?circuit|wins|return" -S mcv/pkg/accelerator/devices || true
rg -n "type Registry|func \\(r \\*Registry\\)|Add|Register|Detect|Successful" -S mcv/pkg/accelerator/devices || true

Repository: redhat-et/GKM

Length of output: 23783


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find implementations of amdCheck/rocmCheck/nvmlCheck
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices || true

# For each function, show its body and any returns
for fn in amdCheck rocmCheck nvmlCheck; do
  echo "---- ${fn} ----"
  rg -n -A120 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done

# Look for Registry methods that might short-circuit
rg -n "type Registry\\b|func \\(r \\*Registry\\)|func \\(.*Registry\\)|Add\\b|Register\\b|Detected|success" -S mcv/pkg/accelerator/devices || true

# Also inspect device.go further near registerDevices and any shared helper that might implement short-circuit
rg -n -A80 -B20 "func registerDevices" mcv/pkg/accelerator/devices/device.go || true

Repository: redhat-et/GKM

Length of output: 29327


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate amdCheck/rocmCheck/nvmlCheck implementations
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices || true

# Print each function body (enough context to see any early-return/short-circuit behavior)
for fn in amdCheck rocmCheck nvmlCheck; do
  echo "---- ${fn} ----"
  rg -n -A140 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done

# Inspect registerDevices and any nearby helpers that might implement "first successful wins"
rg -n -A100 -B30 "func registerDevices" mcv/pkg/accelerator/devices/device.go || true

# Look for Registry methods/flags that might stop after first success
rg -n "type Registry\\b|func \\(r \\*Registry\\)|short.?circuit|first.?success|wins|Detected|success" -S mcv/pkg/accelerator/devices || true

Repository: redhat-et/GKM

Length of output: 33711


Update the runtime detection snippet to match the real device registration logic

registerDevices actually calls amdCheck, then rocmCheck, then nvmlCheck sequentially; there is no “first successful check wins” short-circuit. It also has a config.IsStubEnabled() branch that loads staticCheck(r) before the non-stub checks. In addition, AMD detection is based on amd-smi availability (initAMDLib checks utils.HasApp("amd-smi")), not sysfs. AMD/ROCm exclusivity is handled only inside addDeviceInterface: AMD unregisters ROCm, and ROCm is skipped if AMD is already registered.
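
The registration behavior described in this finding can be sketched roughly as follows. This is a minimal illustration, not the code in `mcv/pkg/accelerator/devices`: the `Registry` layout, the `probe` type, the backend names, and the `Names` helper are all hypothetical stand-ins, and the finding does not say whether the stub branch returns early, so the sketch simply continues to the hardware checks.

```go
package main

import (
	"fmt"
	"sort"
)

// Registry is a hypothetical stand-in for the real device registry.
type Registry struct{ backends map[string]bool }

func NewRegistry() *Registry { return &Registry{backends: map[string]bool{}} }

// addDeviceInterface applies the AMD/ROCm exclusivity from the finding:
// registering AMD removes ROCm, and ROCm is skipped if AMD is present.
func (r *Registry) addDeviceInterface(name string) {
	switch name {
	case "amd":
		delete(r.backends, "rocm")
	case "rocm":
		if r.backends["amd"] {
			return
		}
	}
	r.backends[name] = true
}

// Names returns the registered backends in sorted order.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.backends))
	for n := range r.backends {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// probe reports whether a backend is usable on this host (assumed shape).
type probe func() bool

// registerDevices runs every check in order. There is no early return, so a
// successful AMD check does not stop the ROCm or NVML checks from running.
func registerDevices(r *Registry, stubEnabled bool, static, amd, rocm, nvml probe) {
	if stubEnabled && static() {
		r.addDeviceInterface("static")
	}
	checks := []struct {
		name string
		ok   probe
	}{{"amd", amd}, {"rocm", rocm}, {"nvml", nvml}}
	for _, c := range checks {
		if c.ok() {
			r.addDeviceInterface(c.name)
		}
	}
}

func main() {
	yes := func() bool { return true }
	no := func() bool { return false }
	r := NewRegistry()
	registerDevices(r, false, no, yes, yes, yes)
	fmt.Println(r.Names()) // prints [amd nvml]: ROCm is skipped, NVML still registers
}
```

With all three probes succeeding, both `amd` and `nvml` end up registered, which is the case a "first successful check wins" comment would misdescribe.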

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/docs/unified-mcv-container.md` around lines 389 - 399, Update the
documented snippet to reflect actual registration order and logic: mention that
registerDevices calls staticCheck(r) first when config.IsStubEnabled() is true,
then calls amdCheck(r), rocmCheck(r), nvmlCheck(r) sequentially (no
short-circuit “first wins”), and note AMD detection relies on amd-smi via
initAMDLib which uses utils.HasApp("amd-smi"); also state that AMD/ROCm
exclusivity is handled in addDeviceInterface (AMD unregisters ROCm and ROCm is
skipped if AMD already registered).

@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch 2 times, most recently from f6632ed to 2b9884b Compare June 9, 2026 09:29

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
mcv/Makefile (1)

198-207: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

image-push omits the :unified tag that image-unified creates.

Line 207 pushes :latest, but there is no push for $(IMAGE_NAME):unified. That breaks the documented public tag contract and can leave :unified missing or stale in the registry.

Proposed fix
 image-push: ## Push all container images to registry
@@
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):minimal
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):amd
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):nvidia
+	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified
 	$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):latest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/Makefile` around lines 198 - 207, The image-push target is missing a push
for the unified tag created by image-unified, so add a push for
$(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified to image-push;
update the image-push recipe (target image-push) to include a
$(CONTAINER_RUNTIME) push invocation for the :unified tag alongside :minimal,
:amd, :nvidia, and :latest so the :unified image in the registry stays current
with the image-unified build.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@mcv/docs/unified-mcv-container.md`:
- Around line 44-48: Trailing spaces in the markdown lines containing
"nvmlCheck()", "rocmCheck()", and "--no-gpu" are causing the pre-commit
trailing-whitespace hook to fail; open the markdown, remove any trailing spaces
at the ends of those lines (and the additional trailing spaces reported around
lines 99–103), save, and recommit so the pre-commit hook and CI will pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 779fa98c-d988-4924-a938-ef10a49c125e

📥 Commits

Reviewing files that changed from the base of the PR and between 6023f77 and f6632ed.

📒 Files selected for processing (3)
  • mcv/Makefile
  • mcv/docs/unified-mcv-container.md
  • mcv/images/amd64.dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
  • mcv/images/amd64.dockerfile

Comment thread mcv/docs/unified-mcv-container.md
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch 2 times, most recently from 551e46f to ec526bd Compare June 9, 2026 09:35
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch from ec526bd to 47ca401 Compare June 9, 2026 09:36
Add mcv:unified container variant that includes both NVIDIA (CUDA/NVML)
and AMD (ROCm) GPU support. The container automatically detects the GPU
vendor at runtime, simplifying deployment across mixed GPU clusters.

Key changes:
- New mcv-unified target in amd64.dockerfile (CUDA 12.6.3 + ROCm 6.2.4)
- Updated Makefile: 'make image-unified' builds unified image
- Unified image tagged as both :unified and :latest (~1.2 GB)
- Comprehensive documentation with Kubernetes examples
- Auto-detection via MCV client library (dlopen for NVML, CLI for ROCm)

Runtime behavior:
- NVIDIA nodes: Uses NVML (libnvidia-ml.so.1) for GPU detection
- AMD nodes: Uses rocm-smi for GPU detection
- CPU nodes: Gracefully falls back to no-GPU mode

Container variants now available:
- mcv:unified (NEW DEFAULT) - NVIDIA + AMD support (~1.2 GB)
- mcv:minimal - No GPU libs (~176 MB)
- mcv:nvidia - NVIDIA only (~356 MB)
- mcv:amd - AMD only (~923 MB)

This change is backward-compatible. Existing mcv:nvidia and mcv:amd
deployments continue to work unchanged.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
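
The runtime behavior listed in this commit message could be approximated as below. This is only a sketch under stated assumptions: `hasNVML`, `hasROCmSMI`, `selectMode`, the mode strings, and the library search directories are hypothetical names chosen for illustration, the commit message mentions dlopen for NVML whereas a plain file check is used here to avoid cgo, and the precedence when both probes succeed is an assumption rather than something the commit states.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// hasNVML reports whether libnvidia-ml.so.1 exists in one of dirs.
// Assumption: a file check stands in for the dlopen mentioned in the commit.
func hasNVML(dirs []string) bool {
	for _, d := range dirs {
		if _, err := os.Stat(filepath.Join(d, "libnvidia-ml.so.1")); err == nil {
			return true
		}
	}
	return false
}

// hasROCmSMI reports whether the rocm-smi CLI can be found on PATH.
func hasROCmSMI() bool {
	_, err := exec.LookPath("rocm-smi")
	return err == nil
}

// selectMode maps probe results to a detection mode: NVML on NVIDIA nodes,
// rocm-smi on AMD nodes, and no-GPU mode when neither probe succeeds.
// Preferring NVML when both succeed is an assumption of this sketch.
func selectMode(nvml, rocm bool) string {
	switch {
	case nvml:
		return "nvml"
	case rocm:
		return "rocm-smi"
	default:
		return "no-gpu"
	}
}

func main() {
	// Assumed library locations; real images may install NVML elsewhere.
	dirs := []string{"/usr/lib/x86_64-linux-gnu", "/usr/lib64"}
	fmt.Println(selectMode(hasNVML(dirs), hasROCmSMI()))
}
```

Keeping the probes separate from `selectMode` lets the fallback logic be exercised on a machine with no GPU tooling installed.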
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch 2 times, most recently from b53a988 to da8e2d9 Compare June 9, 2026 11:39

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
mcv/docs/unified-mcv-container.md (1)

44-48: ⚡ Quick win

Add language specifier to the fenced code block.

The code block at line 44 is missing a language specifier, causing markdownlint warnings.

📝 Suggested fix
 **Runtime Detection:**
-```
+```text
 On NVIDIA node: nvmlCheck() → ✓ uses NVML
 On AMD node:    rocmCheck()  → ✓ uses rocm-smi
 On CPU node:    both fail    → ✓ uses --no-gpu mode
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mcv/docs/unified-mcv-container.md` around lines 44 - 48, The fenced code
block containing the lines "On NVIDIA node: nvmlCheck() → ✓ uses NVML", "On AMD
node: rocmCheck() → ✓ uses rocm-smi", and "On CPU node: both fail → ✓ uses
--no-gpu mode" should include a language specifier to satisfy markdownlint;
update that block to open with "```text" (or another appropriate language)
instead of a bare "```" so the snippet is explicitly marked as plain text.

Source: Linters/SAST tools

---

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: db5a69e6-e85a-48a7-bf7c-eb3e991075c8

📥 Commits

Reviewing files that changed from the base of the PR and between f6632ed1c95782a5d18d038dfa65a1a12cfe1cc0 and da8e2d9573f55bc8ab91d5d67b5a792a3c90d6f3.

📒 Files selected for processing (4)
  • mcv/Makefile
  • mcv/docs/unified-mcv-container.md
  • mcv/images/amd64.dockerfile
  • mcv/pkg/accelerator/devices/amd.go

- Update unified container size to 549 MB (actual measured size)
- Fix trailing whitespace in unified-mcv-container.md (pre-commit hook)
- Add :unified tag to image-push Makefile target
- Update image pull time estimate based on smaller size

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan force-pushed the feat/mcv-no-gpu-create branch from da8e2d9 to 7ca7223 Compare June 9, 2026 11:50