feat(mcv): add --no-gpu support for cache creation without GPU hardware#138
feat(mcv): add --no-gpu support for cache creation without GPU hardware#138maryamtahhan wants to merge 19 commits into
Conversation
This commit enables MCV to create and extract cache images without GPU hardware by using cache metadata instead of hardware detection. Changes: - main.go: Move configureBoolFlags() before create check so --no-gpu works with --create - vllm.go: Add GPU detection bypass in detectActualGPUInfo() when --no-gpu is set - amd64.dockerfile: Add multi-stage build with minimal and full targets - mcv-minimal (~200MB): No GPU libraries, for --no-gpu workflows - mcv-full (~2GB): Includes ROCm for GPU validation - docs/no-gpu-usage.md: Comprehensive usage guide with examples - README.md: Added No-GPU Mode section with quick start How it works: When --no-gpu is used, MCV skips hardware detection and extracts GPU information (backend, architecture, warp size) from cache metadata that vLLM/Triton stores in cache files (VLLM_TARGET_DEVICE, VLLM_PAGED_ATTN_ARCH, VLLM_MAIN_CUDA_VERSION, etc.). Benefits: - CI/CD pipelines without GPU access - Containerized workflows without GPU passthrough - Minimal container images for cache distribution - Development environments without GPU hardware Tested with qwen-binary-cache example - successfully created cache image with GPU metadata extracted from cache files. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds multi-target container images (minimal, AMD ROCm, NVIDIA CUDA, unified), local Makefile targets and a CI workflow matrix to build/push them, moves flag configuration so --no-gpu applies to create flows, short-circuits GPU detection when disabled, updates the container entrypoint, and provides comprehensive no-GPU and unified-image documentation. ChangesMulti-image GPU variant support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Add mcv-nvidia target using NVIDIA CUDA base image for NVIDIA GPU support. This provides a third containerimage variant alongside minimal (no GPU) and AMD (ROCm) variants. Changes: - amd64.dockerfile: Add mcv-nvidia target based on nvcr.io/nvidia/cuda:12.6.3-base - README.md: Update to show all three variants (minimal, AMD, NVIDIA) - no-gpu-usage.md: Document NVIDIA variant with usage examples Image variants: - mcv:minimal (~200MB): No GPU libraries, for --no-gpu workflows - mcv:amd (~2GB): ROCm libraries for AMD GPUs - mcv:nvidia (~1.5GB): CUDA runtime + NVML for NVIDIA GPUs Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Update mcv-build-image.yml workflow to build all three MCV variants: - minimal (~200MB): No GPU libraries, for --no-gpu workflows - amd (~900MB): ROCm libraries for AMD GPU validation - nvidia (~1.5GB): CUDA runtime + NVML for NVIDIA GPU validation Changes: - Add matrix strategy with three variants - Add target parameter to specify build stage - Add variant-specific cache scopes - Add suffix tags for each variant (e.g., mcv:minimal, mcv:amd, mcv:nvidia) - Keep AMD as default (latest tag) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added detailed documentation explaining when GPU access flags (--gpus all for NVIDIA, --device flags for AMD) are required: - GPU access flags ONLY needed for validation/preflight checks - NOT required when using --no-gpu flag for creation/extraction - Added "GPU Access Requirements" section to no-gpu-usage.md - Updated README.md to clarify when GPU flags are needed - Includes examples for both Docker and Podman Also includes minor formatting fixes from linter. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
8407bad to
a8e5710
Compare
Changed the default container image from AMD to minimal variant: - quay.io/gkm/mcv:latest now maps to minimal (~200MB) instead of AMD (~2GB) - Minimal is more versatile (works everywhere with --no-gpu) - Smallest image size, ideal for CI/CD pipelines - Users needing GPU validation explicitly choose :amd or :nvidia Changes: - Updated GitHub workflow to tag minimal with 'latest' - Updated Makefile to build minimal as default - Added image build targets: make image-minimal, image-amd, image-nvidia, images - Updated README.md and no-gpu-usage.md documentation - Added image tags to documentation for clarity Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
a8e5710 to
5df108d
Compare
… failures Moved container runtime (docker/podman) detection from Makefile parse time to target execution time. This prevents the error check from running during `make tidy-vendor` inside the Dockerfile build, where docker/podman don't exist. Changes: - Changed CONTAINER_RUNTIME from immediate (?=) to deferred (=) evaluation - Removed top-level error check that ran at parse time - Added runtime check inside each image-* target recipe - Now only validates container runtime when actually building images This fixes the build failure on EC2 where the Dockerfile's "RUN make tidy-vendor" step was failing with "No container runtime found". Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Removed --build-arg BUILDPLATFORM=linux/amd64 from all image build targets. The --platform flag already specifies the target platform, making the build arg redundant. This eliminates the warning: "one or more build args were not consumed: [BUILDPLATFORM]" Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixed image tags to use variant names directly instead of concatenating with IMAGE_TAG variable: Before: - quay.io/gkm/mcv:latest-minimal - quay.io/gkm/mcv:latest-amd - quay.io/gkm/mcv:latest-nvidia After: - quay.io/gkm/mcv:minimal (+ :latest for minimal only) - quay.io/gkm/mcv:amd - quay.io/gkm/mcv:nvidia This matches the GitHub workflow tag structure. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Changed all documentation from docker to podman commands: - Build commands: docker build → podman build - Run commands: docker run → podman run - Removed socket mounts, added --privileged where needed - NVIDIA GPU flag: --gpus all → --device nvidia.com/gpu=all Updated image sizes to reflect actual build results: - Minimal: ~200MB → ~176MB - AMD: ~2GB → ~923MB - NVIDIA: ~1.5GB → ~356MB Changes across: - mcv/docs/no-gpu-usage.md - mcv/README.md - mcv/Makefile Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Fixed entrypoint.sh to invoke /mcv with the provided arguments. Before: exec "$@" (tried to execute --extract as a command) After: exec /mcv "$@" (invokes /mcv --extract ...) This fixes the error: "/entrypoint.sh: 5: exec: --extract: not found" Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added hwdata package to AMD and NVIDIA container images to provide /usr/share/hwdata/pci.ids database. This is required by the ghw library to identify PCI devices as GPU accelerators. Without this file, ghw returns 0 accelerators even when GPUs are present on the PCI bus, preventing NVML/ROCm initialization. Fixes GPU detection in containerized environments with manual device mounts. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Added missing --dir /cache parameter to the NVIDIA GPU validation example. The command was incomplete without specifying the output directory. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
5189125 to
5dd6939
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@mcv/docs/no-gpu-usage.md`:
- Around line 117-118: The examples use the unpublished image tag
"quay.io/gkm/mcv:full"; update those occurrences to the published AMD tag
"quay.io/gkm/mcv:amd" (replace "quay.io/gkm/mcv:full" with
"quay.io/gkm/mcv:amd") in the example commands so the extract/image flags (e.g.,
the lines showing "--extract --image quay.io/myorg/vllm-cache:v1 --dir /cache")
refer to the correct published AMD image; ensure all instances (including the
ones around the shown diff and the occurrences noted at lines ~117 and ~130) are
changed.
In `@mcv/images/amd64.dockerfile`:
- Around line 98-101: The Dockerfile is downloading the Ubuntu Jammy ROCm .deb
into the Debian Bookworm-based mcv-full stage which is brittle; replace the
Jammy-specific install with the Debian/Bookworm-compatible ROCm install flow:
remove the wget of the jammy URL and the local .deb install steps and instead
configure the official ROCm APT repository for Debian Bookworm (using
ROCM_VERSION/AMDGPU_VERSION variables as needed), apt-get update, and install
the ROCm packages (e.g., amd-smi-lib/rocm-smi-lib) from that repo; update the
RUN commands in mcv/images/amd64.dockerfile that reference the wget line and the
apt install ./*.deb sequence so the image uses the Debian repo-based installer
rather than the Ubuntu Jammy package.
- Around line 21-23: The Dockerfile currently downloads and extracts Go using
the RUN block that references GO_VERSION without verifying the tarball; update
the RUN step that uses wget/tar to also download the official SHA256 checksum
for the matching release, compute the SHA256 of /tmp/go.tgz (e.g., via
sha256sum), compare it to the expected checksum string
(957647d3d78995393c200542ab4c23c72b220c3848b6250787a2d48083818314 for go1.24.6),
and fail the build if mismatched before extracting: keep the same GO_VERSION
variable usage and ensure the verification occurs between the download (wget)
and the tar -xzf step in the RUN that currently removes /tmp/go.tgz.
In `@mcv/README.md`:
- Around line 533-536: The README has trailing whitespace in the updated
paragraph (around the line containing "quay.io/gkm/mcv:latest" /
"quay.io/gkm/mcv:amd" / "quay.io/gkm/mcv:nvidia"); run your project's pre-commit
hooks (or manually remove trailing spaces) to clean up trailing whitespace in
mcv/README.md, then commit the whitespace-only fix so the `trailing-whitespace`
pre-commit check passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 9e898107-7999-476e-a08b-6ac16a821111
📒 Files selected for processing (10)
.github/workflows/mcv-build-image.ymlmcv/Makefilemcv/README.mdmcv/cmd/main.gomcv/docs/no-gpu-usage.mdmcv/images/amd64.dockerfilemcv/images/entrypoint.shmcv/pkg/accelerator/devices/amd.gomcv/pkg/cache/dummytritonkey.gomcv/pkg/cache/vllm.go
Fixed issues from code review: 1. **docs/no-gpu-usage.md**: Replaced unpublished tag references - Changed quay.io/gkm/mcv:full → quay.io/gkm/mcv:amd (lines 117, 130) - Ensures examples use the published AMD image tag 2. **images/amd64.dockerfile**: Added Go tarball SHA256 verification - Added checksum verification for go1.24.6 download (line 22) - Prevents installation of corrupted/tampered Go binaries - Checksum: 957647d3d78995393c200542ab4c23c72b220c3848b6250787a2d48083818314 3. **images/amd64.dockerfile**: Fixed ROCm installation for Debian - Replaced Ubuntu Jammy .deb download with Debian Bookworm APT repo (lines 98-102) - Uses official ROCm repository for proper Debian compatibility - Fixes brittle cross-distro package installation Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
|
Actionable comments posted: 0 |
Updated Dockerfile to use Go 1.25.0 (from 1.24.6) to match project requirements in go.mod (requires 1.25.5). Updated SHA256 checksum to match go1.25.0.linux-amd64.tar.gz: 2852af0cb20a13139b3448992e69b868e50ed0f8a1e5940ee1de9e19a123b613 This fixes the build failure where the checksum verification was failing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
…orted Reverted ROCm installation to use Ubuntu Jammy packages because ROCm does not officially support Debian. The AMD repository only provides packages for Ubuntu distributions. Error encountered: https://repo.radeon.com/rocm/apt/7.0.1 bookworm Release: 404 Not Found ROCm officially supports: - Ubuntu 20.04 (Focal) - Ubuntu 22.04 (Jammy) - RHEL/Rocky Using Ubuntu packages on Debian is the standard approach when upstream doesn't provide native Debian packages. The packages are compatible as both use the same base libraries. This addresses the code review comment but explains why the Ubuntu approach is necessary rather than brittle. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Renamed the Docker build target from "mcv-full" to "mcv-amd" to match the published image tag naming convention. Changes: - Dockerfile: Renamed stage "FROM mcv-minimal AS mcv-full" → "AS mcv-amd" - Dockerfile: Updated build command comments to use mcv-amd target - GitHub workflow: Updated target from mcv-full → mcv-amd - Makefile: Updated image-amd target to build mcv-amd - README: Updated build example to use mcv-amd target This ensures consistency between the build target name and the published image tags (quay.io/gkm/mcv:amd). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
mcv/Makefile (1)
197-207:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPush the explicit
:unifiedtag too.
image-unifiedbuilds both:unifiedand:latest, butimage-pushonly publishes:latest. That leaves the documented:unifiedartifact stale or missing even though local builds create it.Suggested patch
$(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):minimal $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):amd $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):nvidia + $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):latest🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcv/Makefile` around lines 197 - 207, The image-push target currently pushes minimal, amd, nvidia and latest tags but omits the unified tag; update the image-push recipe (target: image-push) to also push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified (e.g., add a $(CONTAINER_RUNTIME) push ...:unified line alongside the other tag pushes) so the :unified artifact produced by image-unified is published.mcv/images/amd64.dockerfile (1)
118-128:⚠️ Potential issue | 🟠 MajorAlign GPGME package with Ubuntu 24.04 (Noble) base in
mcv-nvidia.
mcv-nvidia(based onnvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04) installslibgpgme11, butmcv-unifiedon the same Ubuntu 24.04 base installslibgpgme11t64. Ubuntu 24.04 replaceslibgpgme11withlibgpgme11t64, somcv-nvidiais likely to fail duringapt-get install.Suggested patch
RUN apt-get update && apt-get install -y --no-install-recommends \ - libgpgme11 \ + libgpgme11t64 \ libbtrfs0 \ libffi8 \ libc6 \🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcv/images/amd64.dockerfile` around lines 118 - 128, Replace the obsolete package name libgpgme11 with libgpgme11t64 in the apt-get install command in the Dockerfile RUN block (the line that currently installs libgpgme11 along with libbtrfs0, libffi8, libc6, etc.) so the image matches the Ubuntu 24.04 (Noble) package names used by mcv-unified and avoids apt install failures on the nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 base.
♻️ Duplicate comments (1)
mcv/images/amd64.dockerfile (1)
98-104:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUse a ROCm-supported base for the ROCm-capable images.
mcv-amdadds a Jammy ROCm repo to Debian Bookworm, andmcv-unifiedadds the same Jammy repo to Ubuntu 24.04. In both stages,aptis resolving ROCm packages against a foreign distro, so these builds are brittle and can break on the next ROCm or base-image update. The ROCm variants should be built from a distro ROCm actually supports instead of layering Jammy packages onto Bookworm/Noble.Also applies to: 181-192
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcv/images/amd64.dockerfile` around lines 98 - 104, The Dockerfile currently adds Jammy ROCm packages onto Debian Bookworm/Ubuntu 24.04 (see RUN wget ... amdgpu-install_${AMDGPU_VERSION} and subsequent apt installs), which is unsupported and brittle; update the mcv-amd and mcv-unified build stages to use an ROCm-supported base image (e.g., an Ubuntu Jammy or official ROCm base image matching ROCM_VERSION/OPT_ROCM_VERSION) instead of layering Jammy repos onto Bookworm/Noble, remove the cross-distro apt repo additions and foreign-package installs (the RUN wget/apt install ./*.deb and ln -s steps), and ensure ROCM_VERSION/AMDGPU_VERSION/OPT_ROCM_VERSION are aligned with the chosen base so apt update/apt install resolve natively; rebuild and verify amd-smi/rocm-smi binaries exist in /opt/rocm-${OPT_ROCM_VERSION}/bin in those stages.
🧹 Nitpick comments (1)
mcv/docs/unified-mcv-container.md (1)
20-28: ⚖️ Poor tradeoffAdd security guidance for privileged container usage.
The documentation extensively recommends
--privileged(lines 24, 65, 78, 93, 109) andprivileged: true(lines 140, 195) without explaining security implications or providing alternatives. Privileged containers bypass kernel security boundaries (AppArmor, SELinux, seccomp) and grant access to all host devices.Consider adding:
- A security notice early in the document explaining privileged implications
- Alternative approaches where applicable (e.g., specific
--deviceflags for GPU access may be sufficient in some scenarios)- A note that
--privilegedis required for certain operations (like buildah in cache creation) but may be reducible for read-only operationsExample addition after line 28:
> **Security Note:** The `--privileged` flag grants the container extensive host access. > This is required for cache creation (buildah) and certain GPU operations. > For production deployments, evaluate whether specific device access (`--device`) > or reduced capabilities meet your security requirements.Also applies to: 59-103, 115-204
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcv/docs/unified-mcv-container.md` around lines 20 - 28, The document repeatedly recommends using --privileged and privileged: true without warnings; add a short security notice near the top of unified-mcv-container.md (after the Basic Usage block) that explains the security implications of --privileged, calls out that it bypasses AppArmor/SELinux/seccomp and grants host device access, and states that --privileged is only required for certain operations (e.g., buildah-based cache creation) while read-only or runtime GPU usage may be satisfied with targeted alternatives like --device flags or reduced capabilities; also annotate the other occurrences of --privileged / privileged: true in the file (the sections around lines 59-103 and 115-204) to recommend these alternatives and clarify when --privileged is necessary versus when device-level access suffices.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@mcv/docs/unified-mcv-container.md`:
- Around line 44-48: The fenced code block showing runtime detection logic lacks
a language specifier which prevents proper rendering; update the block around
the three lines referencing nvmlCheck(), rocmCheck(), and --no-gpu to include a
language identifier (for example use ```text) immediately after the opening
backticks so the block becomes ```text ... ``` to enable correct formatting and
highlighting.
- Around line 389-399: Update the documented snippet to reflect actual
registration order and logic: mention that registerDevices calls staticCheck(r)
first when config.IsStubEnabled() is true, then calls amdCheck(r), rocmCheck(r),
nvmlCheck(r) sequentially (no short-circuit “first wins”), and note AMD
detection relies on amd-smi via initAMDLib which uses utils.HasApp("amd-smi");
also state that AMD/ROCm exclusivity is handled in addDeviceInterface (AMD
unregisters ROCm and ROCm is skipped if AMD already registered).
---
Outside diff comments:
In `@mcv/images/amd64.dockerfile`:
- Around line 118-128: Replace the obsolete package name libgpgme11 with
libgpgme11t64 in the apt-get install command in the Dockerfile RUN block (the
line that currently installs libgpgme11 along with libbtrfs0, libffi8, libc6,
etc.) so the image matches the Ubuntu 24.04 (Noble) package names used by
mcv-unified and avoids apt install failures on the
nvcr.io/nvidia/cuda:12.6.3-base-ubuntu24.04 base.
In `@mcv/Makefile`:
- Around line 197-207: The image-push target currently pushes minimal, amd,
nvidia and latest tags but omits the unified tag; update the image-push recipe
(target: image-push) to also push
$(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified (e.g., add a
$(CONTAINER_RUNTIME) push ...:unified line alongside the other tag pushes) so
the :unified artifact produced by image-unified is published.
---
Duplicate comments:
In `@mcv/images/amd64.dockerfile`:
- Around line 98-104: The Dockerfile currently adds Jammy ROCm packages onto
Debian Bookworm/Ubuntu 24.04 (see RUN wget ... amdgpu-install_${AMDGPU_VERSION}
and subsequent apt installs), which is unsupported and brittle; update the
mcv-amd and mcv-unified build stages to use an ROCm-supported base image (e.g.,
an Ubuntu Jammy or official ROCm base image matching
ROCM_VERSION/OPT_ROCM_VERSION) instead of layering Jammy repos onto
Bookworm/Noble, remove the cross-distro apt repo additions and foreign-package
installs (the RUN wget/apt install ./*.deb and ln -s steps), and ensure
ROCM_VERSION/AMDGPU_VERSION/OPT_ROCM_VERSION are aligned with the chosen base so
apt update/apt install resolve natively; rebuild and verify amd-smi/rocm-smi
binaries exist in /opt/rocm-${OPT_ROCM_VERSION}/bin in those stages.
---
Nitpick comments:
In `@mcv/docs/unified-mcv-container.md`:
- Around line 20-28: The document repeatedly recommends using --privileged and
privileged: true without warnings; add a short security notice near the top of
unified-mcv-container.md (after the Basic Usage block) that explains the
security implications of --privileged, calls out that it bypasses
AppArmor/SELinux/seccomp and grants host device access, and states that
--privileged is only required for certain operations (e.g., buildah-based cache
creation) while read-only or runtime GPU usage may be satisfied with targeted
alternatives like --device flags or reduced capabilities; also annotate the
other occurrences of --privileged / privileged: true in the file (the sections
around lines 59-103 and 115-204) to recommend these alternatives and clarify
when --privileged is necessary versus when device-level access suffices.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: c3d769f1-4418-406d-974c-3d013a572bc2
📒 Files selected for processing (5)
.github/workflows/mcv-build-image.ymlmcv/Makefilemcv/README.mdmcv/docs/unified-mcv-container.mdmcv/images/amd64.dockerfile
✅ Files skipped from review due to trivial changes (1)
- mcv/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/mcv-build-image.yml
| ### Runtime Detection Logic | ||
|
|
||
| From `mcv/pkg/accelerator/devices/device.go`: | ||
| ```go | ||
| func registerDevices(r *Registry) { | ||
| amdCheck(r) // Checks for AMD GPUs via sysfs | ||
| rocmCheck(r) // Checks for rocm-smi CLI tool | ||
| nvmlCheck(r) // Checks for libnvidia-ml.so.1 library | ||
| // First successful check wins! | ||
| } | ||
| ``` |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the registerDevices function matches the documented behavior
# Search for the registerDevices function implementation
ast-grep --pattern $'func registerDevices($$$) {
$$$
}'
# Also check for the specific check function calls
rg -nP -A5 'func registerDevices' mcv/pkg/accelerator/devices/Repository: redhat-et/GKM
Length of output: 2520
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find implementations of amdCheck/rocmCheck/nvmlCheck
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices/device.go mcv/pkg/accelerator/devices || true
# Also print surrounding code for each call site area and definitions if found
for fn in amdCheck rocmCheck nvmlCheck; do
echo "---- $fn ----"
rg -n -A40 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done
# Find any logic in Registry about "first successful" / "success wins" / short-circuit
rg -n "type Registry|First successful|short.?circuit|wins|return" -S mcv/pkg/accelerator/devices || true
rg -n "type Registry|func \\(r \\*Registry\\)|Add|Register|Detect|Successful" -S mcv/pkg/accelerator/devices || trueRepository: redhat-et/GKM
Length of output: 23783
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find implementations of amdCheck/rocmCheck/nvmlCheck
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices || true
# For each function, show its body and any returns
for fn in amdCheck rocmCheck nvmlCheck; do
echo "---- ${fn} ----"
rg -n -A120 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done
# Look for Registry methods that might short-circuit
rg -n "type Registry\\b|func \\(r \\*Registry\\)|func \\(.*Registry\\)|Add\\b|Register\\b|Detected|success" -S mcv/pkg/accelerator/devices || true
# Also inspect device.go further near registerDevices and any shared helper that might implement short-circuit
rg -n -A80 -B20 "func registerDevices" mcv/pkg/accelerator/devices/device.go || trueRepository: redhat-et/GKM
Length of output: 29327
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate amdCheck/rocmCheck/nvmlCheck implementations
rg -n "func (amdCheck|rocmCheck|nvmlCheck)\b" -S mcv/pkg/accelerator/devices || true
# Print each function body (enough context to see any early-return/short-circuit behavior)
for fn in amdCheck rocmCheck nvmlCheck; do
echo "---- ${fn} ----"
rg -n -A140 -B5 "func ${fn}\\b" mcv/pkg/accelerator/devices || true
done
# Inspect registerDevices and any nearby helpers that might implement "first successful wins"
rg -n -A100 -B30 "func registerDevices" mcv/pkg/accelerator/devices/device.go || true
# Look for Registry methods/flags that might stop after first success
rg -n "type Registry\\b|func \\(r \\*Registry\\)|short.?circuit|first.?success|wins|Detected|success" -S mcv/pkg/accelerator/devices || trueRepository: redhat-et/GKM
Length of output: 33711
Update the runtime detection snippet to match the real device registration logic
registerDevices actually calls amdCheck, then rocmCheck, then nvmlCheck sequentially (no “first successful check wins” short-circuit). It also has a config.IsStubEnabled() branch that loads staticCheck(r) before the non-stub checks. Additionally, AMD detection is based on amd-smi availability (initAMDLib checks utils.HasApp("amd-smi")), not sysfs; AMD/ROCm exclusivity is handled only inside addDeviceInterface (AMD unregisters ROCM; ROCM is skipped if AMD is already registered).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@mcv/docs/unified-mcv-container.md` around lines 389 - 399, Update the
documented snippet to reflect actual registration order and logic: mention that
registerDevices calls staticCheck(r) first when config.IsStubEnabled() is true,
then calls amdCheck(r), rocmCheck(r), nvmlCheck(r) sequentially (no
short-circuit “first wins”), and note AMD detection relies on amd-smi via
initAMDLib which uses utils.HasApp("amd-smi"); also state that AMD/ROCm
exclusivity is handled in addDeviceInterface (AMD unregisters ROCm and ROCm is
skipped if AMD already registered).
f6632ed to
2b9884b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mcv/Makefile (1)
198-207:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
image-pushomits the:unifiedtag thatimage-unifiedcreates.Line 207 pushes
:latest, but there is no push for$(IMAGE_NAME):unified. That breaks the documented/public tag contract and can leave:unifiedmissing or stale in registry.Proposed fix
image-push: ## Push all container images to registry @@ $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):minimal $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):amd $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):nvidia + $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified $(CONTAINER_RUNTIME) push $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):latest🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@mcv/Makefile` around lines 198 - 207, The image-push target is missing a push for the unified tag created by image-unified, so add a push for $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified to image-push; update the image-push recipe (target image-push) to include a $(CONTAINER_RUNTIME) push invocation for the :unified tag alongside :minimal, :amd, :nvidia, and :latest so the :unified image in the registry stays current with the image-unified build.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@mcv/docs/unified-mcv-container.md`:
- Around line 44-48: Trailing spaces in the markdown lines containing
"nvmlCheck()", "rocmCheck()", and "--no-gpu" are causing the pre-commit
trailing-whitespace hook to fail; open the markdown, remove any trailing spaces
at the ends of those lines (and the additional trailing spaces reported around
lines 99–103), save, and recommit so the pre-commit hook and CI will pass.
---
Outside diff comments:
In `@mcv/Makefile`:
- Around line 198-207: The image-push target is missing a push for the unified
tag created by image-unified, so add a push for
$(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY)/$(IMAGE_NAME):unified to image-push;
update the image-push recipe (target image-push) to include a
$(CONTAINER_RUNTIME) push invocation for the :unified tag alongside :minimal,
:amd, :nvidia, and :latest so the :unified image in the registry stays current
with the image-unified build.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 779fa98c-d988-4924-a938-ef10a49c125e
📒 Files selected for processing (3)
mcv/Makefilemcv/docs/unified-mcv-container.mdmcv/images/amd64.dockerfile
🚧 Files skipped from review as they are similar to previous changes (1)
- mcv/images/amd64.dockerfile
551e46f to
ec526bd
Compare
ec526bd to
47ca401
Compare
Add mcv:unified container variant that includes both NVIDIA (CUDA/NVML) and AMD (ROCm) GPU support. The container automatically detects the GPU vendor at runtime, simplifying deployment across mixed GPU clusters. Key changes: - New mcv-unified target in amd64.dockerfile (CUDA 12.6.3 + ROCm 6.2.4) - Updated Makefile: 'make image-unified' builds unified image - Unified image tagged as both :unified and :latest (~1.2 GB) - Comprehensive documentation with Kubernetes examples - Auto-detection via MCV client library (dlopen for NVML, CLI for ROCm) Runtime behavior: - NVIDIA nodes: Uses NVML (libnvidia-ml.so.1) for GPU detection - AMD nodes: Uses rocm-smi for GPU detection - CPU nodes: Gracefully falls back to no-GPU mode Container variants now available: - mcv:unified (NEW DEFAULT) - NVIDIA + AMD support (~1.2 GB) - mcv:minimal - No GPU libs (~176 MB) - mcv:nvidia - NVIDIA only (~356 MB) - mcv:amd - AMD only (~923 MB) This change is backward-compatible. Existing mcv:nvidia and mcv:amd deployments continue to work unchanged. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
b53a988 to
da8e2d9
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
mcv/docs/unified-mcv-container.md (1)
44-48: ⚡ Quick winAdd language specifier to the fenced code block.
The code block at line 44 is missing a language specifier, causing markdownlint warnings.
📝 Suggested fix
**Runtime Detection:** -``` +```text On NVIDIA node: nvmlCheck() → ✓ uses NVML On AMD node: rocmCheck() → ✓ uses rocm-smi On CPU node: both fail → ✓ uses --no-gpu mode</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@mcv/docs/unified-mcv-container.mdaround lines 44 - 48, The fenced code
block containing the lines "On NVIDIA node: nvmlCheck() → ✓ uses NVML", "On AMD
node: rocmCheck() → ✓ uses rocm-smi", and "On CPU node: both fail → ✓
uses --no-gpu mode" should include a language specifier to satisfy markdownlint;
update that block to use "text" (or another appropriate language) instead of just "" so the snippet is explicitly marked as plain text.</details> <!-- cr-comment:v1:d47fce6780822b7ff400d3ce --> _Source: Linters/SAST tools_ </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Nitpick comments:
In@mcv/docs/unified-mcv-container.md:
- Around line 44-48: The fenced code block containing the lines "On NVIDIA node:
nvmlCheck() → ✓ uses NVML", "On AMD node: rocmCheck() → ✓ uses rocm-smi",
and "On CPU node: both fail → ✓ uses --no-gpu mode" should include a
language specifier to satisfy markdownlint; update that block to use "text" (or another appropriate language) instead of just "" so the snippet is
explicitly marked as plain text.</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Enterprise **Run ID**: `db5a69e6-e85a-48a7-bf7c-eb3e991075c8` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between f6632ed1c95782a5d18d038dfa65a1a12cfe1cc0 and da8e2d9573f55bc8ab91d5d67b5a792a3c90d6f3. </details> <details> <summary>📒 Files selected for processing (4)</summary> * `mcv/Makefile` * `mcv/docs/unified-mcv-container.md` * `mcv/images/amd64.dockerfile` * `mcv/pkg/accelerator/devices/amd.go` </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
- Update unified container size to 549 MB (actual measured size) - Fix trailing whitespace in unified-mcv-container.md (pre-commit hook) - Add :unified tag to image-push Makefile target - Update image pull time estimate based on smaller size Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
da8e2d9 to
7ca7223
Compare
feat(mcv): Add
--no-gpusupport for cache creation without GPU hardwareSummary
This PR enables MCV to create and extract cache images without GPU hardware by using cache metadata instead of hardware detection. This is useful for CI/CD pipelines, development environments, and containerized workflows where GPU access isn't available.
Problem Statement
Previously, MCV required GPU hardware (NVIDIA or AMD) during cache image creation because it detected GPU information (architecture, backend, warp size) directly from the system. This made it impossible to:
Solution
Core Changes
--no-gpuFlag Support (main.go)--no-gpuworks with--createconfigureBoolFlags()before operation dispatchGPU Detection Bypass (vllm.go)
config.IsGPUEnabled()indetectActualGPUInfo()--no-gpuis setVLLM_TARGET_DEVICE: Backend (cuda/rocm)VLLM_PAGED_ATTN_ARCH: Architecture (sm_75, gfx1100, etc.)VLLM_MAIN_CUDA_VERSION/ROCM_VERSION: Toolkit versionsMulti-Variant Container Images (amd64.dockerfile)
--no-gpuworkflowsCI/CD Integration (mcv-build-image.yml)
Documentation
Testing
Local Testing
Result: ✅ Successfully created 99.8MB cache image with GPU metadata extracted from cache files:
cudasm_7512.932EC2 Testing
Built and verified all container variants on EC2 (x86_64):
Results:
How It Works
When
--no-gpuis used:cache_key_factors.json)The vLLM/Triton cache files already contain all necessary GPU information stored by vLLM when the cache was generated on a GPU system.
Use Cases
--no-gpu--no-gpu--no-gpuContainer Image Usage
CI/CD Example
Benefits
✅ No GPU required for cache image creation
✅ Smaller images for distribution (~176MB vs ~900MB)
✅ CI/CD friendly - runs in standard GitHub Actions runners
✅ Flexible deployment - Choose GPU-specific or minimal variants
✅ Backwards compatible - Existing workflows continue to work
Breaking Changes
None. All changes are additive:
--no-gpuis a new optional flagLimitations
When using
--no-gpu:Recommendation: Use
--no-gpufor building/distribution, but validate with actual GPU hardware before production deployment.Commits
6ff77bb3feat(mcv): add --no-gpu support for cache creation without GPU hardwarec200115afeat(mcv): add NVIDIA/CUDA target to multi-stage Dockerfile1241116dci(mcv): build all three image variants in GitHub Actions8407badcchore: fix trailing whitespace in no-gpu-usage.mdRelated Issues
Checklist
Summary by CodeRabbit
New Features
--no-gpumode and introduced image variants: minimal (default/:latest), AMD (:amd), NVIDIA (:nvidia), and a unified GPU-support image.Build & Release
Bug Fixes
--no-gputo avoid unintended GPU initialization.Documentation