ci: add HIP debug probes for DO runner #1399

Open
gyohuangxin wants to merge 1 commit into main from ci/atom-do-torch-dist-smoke

@gyohuangxin

Member

No description provided.

Copilot AI review requested due to automatic review settings June 29, 2026 09:30

Copilot AI left a comment

Contributor

Pull request overview

This PR augments the atom-test GitHub Actions workflow with additional ROCm/HIP diagnostics for the DigitalOcean MI350x-8 runner. The goal is to surface device, driver, and runtime issues early and to retain debug artifacts for postmortem analysis when failures occur.

Changes:

  • Adds a single-process HIP/PyTorch initialization probe inside the CI container (environment + device mapping + basic allocations).
  • Adds a TP=4 distributed smoke test using torch.distributed all-reduce to validate multi-GPU comms early.
  • Collects and uploads HIP/ROCm debug artifacts (logs, env snapshot, device nodes, rocm-smi outputs) on all outcomes for the targeted runner.
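
For readers who want a concrete picture of the first two items, here is a minimal Python sketch of what such probes could look like. It is not the code in this PR: the function names (`probe_hip`, `expected_all_reduce_sum`, `run_all_reduce_smoke`), the port number, the environment variables sampled, and the fallback to the `gloo` backend when fewer GPUs than ranks are visible are all assumptions made for illustration. On ROCm builds of PyTorch the `torch.cuda` API addresses HIP devices, and the `nccl` backend name is served by RCCL.

```python
import os


def probe_hip() -> dict:
    """Single-process probe: env snapshot, device mapping, one small allocation per GPU."""
    env_keys = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")
    report = {
        "env": {key: os.environ.get(key) for key in env_keys},
        "torch_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return report  # nothing more to check without PyTorch

    report["torch_available"] = True
    # torch.version.hip is None on non-ROCm builds.
    report["hip_version"] = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return report

    for idx in range(torch.cuda.device_count()):
        x = torch.ones(1024, device=f"cuda:{idx}")  # basic allocation on this device
        torch.cuda.synchronize(idx)
        report["devices"].append({
            "index": idx,
            "name": torch.cuda.get_device_name(idx),
            "alloc_ok": float(x.sum().item()) == 1024.0,
        })
    return report


def expected_all_reduce_sum(world_size: int) -> float:
    """Each rank contributes rank + 1, so the SUM all-reduce yields 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) / 2


def _all_reduce_worker(rank: int, world_size: int, port: int) -> None:
    import torch
    import torch.distributed as dist

    # Assumption: fall back to CPU/gloo when there are fewer GPUs than ranks.
    use_gpu = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    if use_gpu:
        torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl" if use_gpu else "gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        rank=rank,
        world_size=world_size,
    )
    try:
        device = torch.device(f"cuda:{rank}") if use_gpu else torch.device("cpu")
        value = torch.full((1,), float(rank + 1), device=device)
        dist.all_reduce(value, op=dist.ReduceOp.SUM)
        assert value.item() == expected_all_reduce_sum(world_size)
    finally:
        dist.destroy_process_group()


def run_all_reduce_smoke(world_size: int = 4, port: int = 29533) -> None:
    """Spawn one process per rank (TP=4 by default); raises if any rank fails."""
    import torch.multiprocessing as mp

    mp.spawn(_all_reduce_worker, args=(world_size, port), nprocs=world_size, join=True)
```

In a CI step, one would typically print `probe_hip()` first and then call `run_all_reduce_smoke()` from a script's `if __name__ == "__main__":` block, so that a failure in either stage stops the job before the longer test suite starts.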

@zufayu requested a review from valarLip June 30, 2026 02:30
2 participants