diff --git a/README.md b/README.md
index ceb106584..616b765be 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 ## 📢 News
 
 - **[2026/06] Featured ROCm Blog:** [DP Attention and TBO for DeepSeek-V4 on MI355X](https://rocm.blogs.amd.com/software-tools-optimization/atom-optimiztion/README.html) highlights how ATOM optimizes DeepSeek-V4 inference on AMD Instinct MI355X GPUs with DP Attention using all-gather/reduce-scatter and Two-Batch Overlap, achieving strongly competitive DeepSeek-V4 inference performance.
-- **[2026/06]** Experimental **Navi 4 (RDNA4 / gfx1201)** support — AMD Radeon RX 9070 / RX 9070 XT and Radeon AI PRO R9700. See the [Qwen3-8B-FP8](recipes/Qwen3-8B-FP8.md) and [Ministral-3-8B](recipes/Ministral-3-8B.md) recipes.
+- **[2026/06]** Experimental **Navi 4 (RDNA4 / gfx1200, gfx1201)** support — AMD Radeon RX 9060 / RX 9060 XT (Navi 44 / gfx1200) and RX 9070 / RX 9070 XT and Radeon AI PRO R9700 (Navi 48 / gfx1201). Both chips share the same Triton fallback path; build aiter for the matching arch (`GPU_ARCHS=gfx1200` or `gfx1201`). See the [Qwen3-8B-FP8](recipes/Qwen3-8B-FP8.md) and [Ministral-3-8B](recipes/Ministral-3-8B.md) recipes.
 - **[2026/06]** ATOM now supports **GLM-5.2** (`glm_moe_dsa`) in FP8, including the new **IndexShare** DSA schedule (shared layers reuse the preceding full layer's indexer). See [GLM-5.2 recipe](recipes/GLM-5.md#glm-52-indexshare).
 - **[2026/05]** ATOM now supports **Qwen3.5 multimodal image+text inference** on the native engine and OpenAI-compatible chat API. See [Qwen3.5 multimodal recipe](recipes/Qwen3.5_multimodel.md).
 - **[2026/05]** ATOM now supports **online quantization** — re-quantize unquantized or FP8-block source checkpoints to PTPC-FP8 / MXFP4 mixed precision at load time via `--online_quant_config`, no offline re-packing required. See [online quantization guide](docs/online_quantization_guide.md).
diff --git a/recipes/Ministral-3-8B.md b/recipes/Ministral-3-8B.md
index 448cbe213..2c6caf3e9 100644
--- a/recipes/Ministral-3-8B.md
+++ b/recipes/Ministral-3-8B.md
@@ -5,10 +5,13 @@ RDNA4 GPU. ATOM runs attention and GEMM through Triton
 (`ATOM_USE_UNIFIED_ATTN=1`, `ATOM_USE_TRITON_GEMM=1`); the KV-cache write,
 RoPE and norms use native aiter HIP kernels.
 
-> **Navi (gfx1201) prerequisite:** aiter must be built for the arch — see
+> **Navi (gfx1200 / gfx1201) prerequisite:** aiter must be built for the arch — see
 > [ROCm/aiter#3846](https://github.com/ROCm/aiter/issues/3846). Short-term
-> fix: build aiter from source with `GPU_ARCHS=gfx1201` (a native build on
-> the card does this automatically).
+> fix: build aiter from source with `GPU_ARCHS=gfx1201` (Navi 48: RX 9070 /
+> RX 9070 XT / AI PRO R9700) or `GPU_ARCHS=gfx1200` (Navi 44: RX 9060 /
+> RX 9060 XT). A native build on the card does this automatically. Both are
+> RDNA4 and use the same Triton path below; the benchmarks here were
+> measured on gfx1201.
 
 ## Model
 
diff --git a/recipes/Qwen3-8B-FP8.md b/recipes/Qwen3-8B-FP8.md
index f6981325b..f961f7f4a 100644
--- a/recipes/Qwen3-8B-FP8.md
+++ b/recipes/Qwen3-8B-FP8.md
@@ -1,9 +1,12 @@
 # Qwen3-8B-FP8 (block-128) on RX 9070 XT (gfx1201) via ROCm/ATOM
 
 Verified path on RX 9070 XT (gfx1201). Attention and GEMM run through
-Triton; same backend setup and the **build-aiter-for-gfx1201** prerequisite
+Triton; same backend setup and the **build-aiter-for-the-arch** prerequisite
 ([ROCm/aiter#3846](https://github.com/ROCm/aiter/issues/3846)) as the
-[Ministral-3-8B recipe](./Ministral-3-8B.md).
+[Ministral-3-8B recipe](./Ministral-3-8B.md) — build aiter with
+`GPU_ARCHS=gfx1201` (Navi 48) or `GPU_ARCHS=gfx1200` (Navi 44: RX 9060 /
+RX 9060 XT). Both RDNA4 chips share this Triton path; the numbers below are
+from gfx1201.
 
 ## Model
 
@@ -34,7 +37,7 @@ export ATOM_ENABLE_ALLREDUCE_RMSNORM_FUSION=0
 `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_RMSNORM_QUANT=1` and
 `ATOM_LLAMA_ENABLE_AITER_TRITON_FUSED_SILU_MUL_QUANT=1` to fuse
 normalization/activation with FP8 quantization. Requires HIP
-`rmsnorm_quant` to JIT-compile on gfx1201 — test before enabling.
+`rmsnorm_quant` to JIT-compile on gfx1200 / gfx1201 — test before enabling.
 
 ## Required CLI flags