feat: RDMA multinode GPU support (chain SDK pieces) #315
Walkthrough
This PR adds comprehensive RDMA (Remote Direct Memory Access) support throughout the Akash chain SDK. Changes span protobuf schema extensions for RDMA capabilities, safe deep-copy operations in inventory services, SDL YAML parsing with cross-field validation, manifest propagation, and extensive test coverage validating signal preservation and validation invariants.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go/inventory/v1/resources.go`:
- Line 11: NodeResources.Dup() currently calls s.RDMA.Dup() unconditionally
which panics for zero-value RDMA; fix by making ResourcePair.Dup() zero-safe: in
ResourcePair.Dup() detect the zero/uninitialized receiver and return a
zero-value ResourcePair (or a safe initialized copy) instead of dereferencing
nil/internal fields, or alternatively change NodeResources.Dup() to guard the
call (e.g., if s.RDMA.IsZero() { copy.RDMA = ResourcePair{} } else { copy.RDMA =
s.RDMA.Dup() }); update ResourcePair.Dup() and/or add an IsZero helper so RDMA
is safe to duplicate without panics.
In `@go/sdl/v2.go`:
- Around line 561-566: The RDMA validation currently triggers based solely on
gpu.Attributes/rdma_group even for profiles with zero GPUs; update the logic in
the sdl.Profiles.Compute loop (the compute.Resources.GPU handling) to only
consider RDMA attributes or RDMAGroup when the GPU actually has Units > 0, or
alternatively reject rdma/rdma_group when gpu.Units == 0 in
v2ResourceGPU.UnmarshalYAML; specifically change the condition that calls
gpuAttributesHaveRDMA and checks gpu.RDMAGroup to also require gpu.Units > 0 (or
add a guard earlier to return an error if rdma/rdma_group is present while
gpu.Units == 0) so zero-unit GPU profiles do not appear RDMA-enabled.
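The two options named in the `go/sdl/v2.go` comment above can be sketched as follows. This is a minimal illustration, not the SDK's code: `gpuProfile`, `rejectRDMAWithoutUnits`, and `isRDMAEnabled` are hypothetical names, and the struct is a simplified stand-in for the parsed SDL GPU resource.

```go
package main

import (
	"errors"
	"fmt"
)

// gpuProfile is a hypothetical, simplified view of a parsed SDL GPU resource.
type gpuProfile struct {
	Units     uint64
	RDMA      bool   // gpu.attributes.rdma
	RDMAGroup string // gpu.attributes.rdma_group
}

// rejectRDMAWithoutUnits sketches the parser-side option: fail early when
// rdma or rdma_group is set on a profile that requests zero GPUs.
func rejectRDMAWithoutUnits(g gpuProfile) error {
	if g.Units == 0 && (g.RDMA || g.RDMAGroup != "") {
		return errors.New("rdma/rdma_group requires gpu units > 0")
	}
	return nil
}

// isRDMAEnabled sketches the validator-side option: a profile only counts
// as RDMA-enabled when it actually requests at least one GPU.
func isRDMAEnabled(g gpuProfile) bool {
	return g.Units > 0 && (g.RDMA || g.RDMAGroup != "")
}

func main() {
	zero := gpuProfile{Units: 0, RDMA: true, RDMAGroup: "ring-a"}
	one := gpuProfile{Units: 1, RDMA: true}

	fmt.Println(isRDMAEnabled(zero), isRDMAEnabled(one)) // false true
	fmt.Println(rejectRDMAWithoutUnits(zero) != nil)     // true
}
```

Applying both checks is redundant by design: the parser check gives the user an early error, while the validator check stays correct even if the parser path is bypassed.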
ℹ️ Review info
⛔ Files ignored due to path filters (3)
- go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
- go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
- go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (14)
- go/inventory/v1/node.go
- go/inventory/v1/node_test.go
- go/inventory/v1/resources.go
- go/inventory/v1/resources_test.go
- go/node/deployment/v1beta4/rdma_commit_audit_test.go
- go/sdl/gpu.go
- go/sdl/groupBuilder_v2.go
- go/sdl/groupBuilder_v2_1.go
- go/sdl/rdma_gpu_test.go
- go/sdl/rdma_validation_test.go
- go/sdl/v2.go
- proto/provider/akash/inventory/v1/node.proto
- proto/provider/akash/inventory/v1/resources.proto
- proto/provider/akash/manifest/v2beta3/service.proto
6024641 to 5252ecc
c0301cb to 4c19e6b
Implements the akash-chain-sdk slice of the RDMA spec at provider/_docs/infiniband-implementation-spec.md, covering Linear tickets AKT-401..AKT-403, AKT-405, AKT-406 (CS-1, CS-2, CS-3, CS-5, CS-6). The shared-storage track (CS-4) is intentionally dropped: per spec decision #3 every service gets its own RWO PVC and the provider does not produce ReadWriteMany volumes.

Everything in this commit is off-chain: no on-chain proto messages change, no validator upgrade required.

CS-1: Inventory v1 (proto/provider/akash/inventory/v1/{node,resources}.proto + go/inventory/v1/{node,resources}.pb.go regenerated via `make proto-gen-go`):
* Add `ResourcePair rdma = 7` to `NodeResources`.
* Add `rdma_resource_name`, `rdma_fabric`, `nccl_hca_prefix` strings to `NodeCapabilities` with gogoproto.customname annotations (`RDMAResourceName`, `RDMAFabric`, `NCCLHCAPrefix`).
* Extend `Dup()` helpers in node.go and resources.go.
* Tests: `node_test.go`, `resources_test.go` round-trip the new fields through Dup().

CS-2: SDL parser: `gpu.attributes.rdma: true` (go/sdl/gpu.go):
* Accept `rdma: true|false` under `gpu.attributes`. When true, emit a flat on-chain GPU attribute `rdma=true` so providers that advertise `capabilities/gpu/rdma=true` match.
* Tests: `rdma_gpu_test.go::TestV2ResourceGPU_RDMAFlag` plus the existing `TestV2ResourceGPU` regression continues to pass.

CS-3: SDL parser: `gpu.attributes.rdma_group: <name>` → off-chain manifest field (go/sdl/gpu.go, proto/provider/akash/manifest/v2beta3/service.proto regenerated to go/manifest/v2beta3/service.pb.go, go/sdl/groupBuilder_v2{,_1}.go):
* Parser captures rdma_group via an internal sentinel attribute key (`__rdma_group__`) inside `v2GPUAttributes.UnmarshalYAML`. The parent `v2ResourceGPU.UnmarshalYAML` strips the sentinel before the slice ever reaches on-chain `Resources.GPU.attributes` and surfaces the value on a dedicated `v2ResourceGPU.RDMAGroup` field. `Validate()` is deferred to the parent so the sentinel can be removed before the attribute-key regex runs.
* Manifest `Service.proto` gains `rdma_group = 11` with `gogoproto.customname = "RDMAGroup"`. Bindings regenerated via `make proto-gen-go`.
* Both v2 / v2.1 group builders now read `compute.Resources.GPU.RDMAGroup` and propagate it onto `manifest.Service.RDMAGroup`.
* Tests: `TestV2ResourceGPU_RDMAGroupRoutedOffChain` (asserts the sentinel never escapes) and `TestV2ResourceGPU_RDMAGroupOmitted`.

CS-5: Parser cross-field validations (go/sdl/v2.go, validate() → new validateRDMA()):
1. Any compute profile with `gpu.attributes.rdma: true` requires its placement attributes to include `capabilities/rdma=true`.
2. Any compute profile with `gpu.attributes.rdma_group` set must also declare `gpu.attributes.rdma: true` on the same profile.
3. Within one placement, no implicit-default-plus-explicit mixing: if any profile sets rdma_group, every RDMA-using profile must.
Helpers `gpuAttributesHaveRDMA` and `placementRequiresRDMA` are kept package-local so the SDL parser owns the policy.
* Tests: `rdma_validation_test.go` (6 positive + negative fixtures).

CS-6: Reservation commit path audit (go/node/deployment/v1beta4/rdma_commit_audit_test.go):
Table-driven regression test pinning down that `GroupSpec.Dup()` and the four concrete `ResourceGroup`-shaped values the provider's reservation path can hold (`*Group`, `Group`, `*GroupSpec`, `GroupSpec`) all preserve `Requirements.Attributes` (carrying `capabilities/rdma=true`) AND each resource's `GPU.Attributes` (carrying `rdma=true`). A future change that silently drops either slice (the exact failure mode the spec calls out) fails this test loudly.

All `.pb.go` files were regenerated via `make proto-gen-go` (buf v1.47.2, protoc v29.1, gogoproto v1.7.2). Running `make proto-gen-go` against this tree should be a no-op.

Tests (go test ./...):
- pkg.akt.dev/go/inventory/v1 PASS
- pkg.akt.dev/go/manifest/v2beta3 PASS
- pkg.akt.dev/go/node/deployment/v1beta4 PASS (+ new CS-6 audit)
- pkg.akt.dev/go/sdl PASS (+ new CS-2/CS-3/CS-5)

Follow-ups for reviewers:
- TypeScript bindings (`ts/`) are not touched here and need a separate `make proto-gen-ts` pass before the TS SDK consumes the new fields.

Linear: AKT-401, AKT-402, AKT-403, AKT-405, AKT-406 (AKT-404 / CS-4 cancelled; see spec decision #3)

Fix make/setup-cache.mk: defer GOLANGCI_LINT_MAJOR computation to recipe-execute time and depend on $(SEMVER), so the install isn't given an empty major and the broken module path .../golangci-lint/v/cmd/golangci-lint. This bug had `lint/go` red on main for several commits prior.
4c19e6b to 5745d21
Actionable comments posted: 1
🧹 Nitpick comments (1)
go/inventory/v1/resourcepair_zero_test.go (1)
38-49: ⚡ Quick win: Add a regression test for partial-nil `ResourcePair` duplication.
Please add a case where one of `Capacity`/`Allocatable`/`Allocated` is nil and verify `Dup()` keeps that field nil after copy (instead of converting it to zero). This will prevent future contract regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go/inventory/v1/resourcepair.go`:
- Around line 70-85: The Dup implementation in resourcepair.go currently always
returns non-nil pointers (&capacity etc.), losing nil presence; change it so
Capacity, Allocatable and Allocated are set to nil when the source
m.Capacity/m.Allocatable/m.Allocated are nil, and only allocate and assign a
DeepCopy pointer when the corresponding source field is non-nil (e.g., check
m.Capacity != nil then copy and set Capacity to that pointer, otherwise set
Capacity to nil); keep Attributes copied via m.Attributes.Dup() as is.
---
Nitpick comments:
In `@go/inventory/v1/resourcepair_zero_test.go`:
- Around line 38-49: Add a regression test that ensures ResourcePair.Dup()
preserves nil fields: in resourcepair_zero_test.go (e.g., extend or add a test
alongside TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.
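A nil-preserving `Dup()` of the kind requested above could look roughly like this. It is a sketch under stated assumptions: `quantity` is a hypothetical stand-in for the real quantity type, `resourcePair` omits `Attributes`, and `dupQuantity` is an illustrative helper, not a function from the SDK.

```go
package main

import "fmt"

// quantity is a hypothetical stand-in for the real resource quantity type.
type quantity struct{ Value int64 }

// resourcePair mirrors only the three pointer fields discussed in the review.
type resourcePair struct {
	Capacity    *quantity
	Allocatable *quantity
	Allocated   *quantity
}

// dupQuantity copies the pointed-to value, or returns nil when the source is
// nil, so field presence is preserved across the copy.
func dupQuantity(q *quantity) *quantity {
	if q == nil {
		return nil
	}
	c := *q
	return &c
}

// Dup returns a deep copy. A zero-value receiver yields a zero-value copy,
// and partially-nil receivers keep their nil fields nil.
func (m resourcePair) Dup() resourcePair {
	return resourcePair{
		Capacity:    dupQuantity(m.Capacity),
		Allocatable: dupQuantity(m.Allocatable),
		Allocated:   dupQuantity(m.Allocated),
	}
}

func main() {
	orig := resourcePair{Capacity: &quantity{Value: 8}} // Allocatable, Allocated stay nil
	cp := orig.Dup()
	cp.Capacity.Value = 16 // mutate the copy only

	fmt.Println(orig.Capacity.Value, cp.Allocatable == nil) // 8 true
}
```

The same shape also covers the earlier zero-value panic: duplicating `resourcePair{}` dereferences nothing and returns `resourcePair{}`.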
ℹ️ Review info
⛔ Files ignored due to path filters (7)
- go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
- go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
- go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
- ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
- ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
- ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
- ts/src/sdl/manifest/__snapshots__/generateManifestVersion.spec.ts.snap is excluded by !**/*.snap
📒 Files selected for processing (29)
- go/inventory/v1/node.go
- go/inventory/v1/node_test.go
- go/inventory/v1/resourcepair.go
- go/inventory/v1/resourcepair_zero_test.go
- go/inventory/v1/resources.go
- go/inventory/v1/resources_test.go
- go/node/deployment/v1beta4/rdma_commit_audit_test.go
- go/sdl/gpu.go
- go/sdl/groupBuilder_v2.go
- go/sdl/groupBuilder_v2_1.go
- go/sdl/rdma_gpu_test.go
- go/sdl/rdma_validation_test.go
- go/sdl/v2.go
- proto/provider/akash/inventory/v1/node.proto
- proto/provider/akash/inventory/v1/resources.proto
- proto/provider/akash/manifest/v2beta3/service.proto
- testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
- testdata/sdl/output-fixtures/v2.0/http-options/manifest.json
- testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
- testdata/sdl/output-fixtures/v2.0/multiple-services/manifest.json
- testdata/sdl/output-fixtures/v2.0/persistent-storage/manifest.json
- testdata/sdl/output-fixtures/v2.0/placement/manifest.json
- testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
- testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
- testdata/sdl/output-fixtures/v2.0/simple/manifest.json
- testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
- testdata/sdl/output-fixtures/v2.1/credentials/manifest.json
- testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
- testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
✅ Files skipped from review due to trivial changes (10)
- testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
- go/inventory/v1/resources.go
- testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
- testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
- testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
- testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
- testdata/sdl/output-fixtures/v2.0/placement/manifest.json
- testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
- testdata/sdl/output-fixtures/v2.0/simple/manifest.json
- testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
🚧 Files skipped from review as they are similar to previous changes (12)
- go/sdl/groupBuilder_v2_1.go
- go/inventory/v1/node.go
- proto/provider/akash/inventory/v1/node.proto
- proto/provider/akash/manifest/v2beta3/service.proto
- proto/provider/akash/inventory/v1/resources.proto
- go/inventory/v1/resources_test.go
- go/sdl/groupBuilder_v2.go
- go/node/deployment/v1beta4/rdma_commit_audit_test.go
- go/inventory/v1/node_test.go
- go/sdl/v2.go
- go/sdl/gpu.go
- go/sdl/rdma_validation_test.go
Six review items from chalabi2 + CodeRabbit:

- proto/service.proto + regenerated .pb.go/.ts: add omitempty to Service.RDMAGroup JSON/YAML tags. The on-chain manifest version hash is a SHA over the JSON-serialized off-chain manifest; without omitempty every pre-RDMA service serialized "rdmaGroup": "" and shifted the hash for existing leases, breaking send-manifest validation. Fixtures regenerated; diff vs main is zero on every non-RDMA testdata manifest, confirming hash stability.
- sdl/v2.go + v2_1.go: extract validateRDMA() into a free function taking (profiles, deployments) and call it from both v2 and v2.1 validate(). v2.1 inherits the full RDMA SDL grammar from v2 (parser promotes gpu.attributes.rdma + rdma_group through to the manifest), so the cross-field invariants (rule 1: rdma-required workload reaches rdma-capable provider; rule 2: rdma_group ⇒ rdma=true; rule 3: no implicit/explicit group mixing within one deployment) must apply symmetrically. Without this, invalid v2.1 SDLs bypass the new rules.
- sdl/sdl-input.schema.yaml: declare rdma (boolean) and rdma_group (string) under gpu.attributes. The parser accepts them but the schema had additionalProperties: false and only allowed `vendor`, rejecting valid RDMA SDLs at schema-validation time.
- inventory/v1/resourcepair.go: ResourcePair.Dup() now preserves the nil/non-nil shape of Capacity/Allocatable/Allocated rather than always returning &zeroQuantity. Returning non-nil pointers for originally-nil fields changes protobuf field-presence semantics and shifts the JSON serialization (which feeds the manifest hash). Regression-pinned by TestResourcePair_Dup_PreservesNilPointers.
- sdl/v2.go validateRDMA: defense-in-depth gate on gpu.Units > 0. The parser already rejects rdma/rdma_group on zero-GPU profiles, but the validator should not classify a zero-GPU profile as RDMA-enabled if that parser path is ever bypassed.

The TS proto-regen as part of #6 incidentally addresses the TS bindings gap: ts/src/generated/protos/akash/manifest/v2beta3/service.ts now parses and emits rdma_group with the same omitempty semantics.
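The hash-stability argument for `omitempty` can be reproduced in isolation. The sketch below is a simplified illustration, not the SDK's manifest hashing: the three struct types and `digest` are hypothetical, and plain SHA-256 over `json.Marshal` output is assumed in place of the real serialization pipeline.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// servicePreRDMA is a hypothetical service shape from before the RDMA field existed.
type servicePreRDMA struct {
	Name string `json:"name"`
}

// serviceAlways adds the field without omitempty, so "" is always serialized.
type serviceAlways struct {
	Name      string `json:"name"`
	RDMAGroup string `json:"rdmaGroup"`
}

// serviceOmitEmpty adds the field with omitempty, so "" is dropped from the JSON.
type serviceOmitEmpty struct {
	Name      string `json:"name"`
	RDMAGroup string `json:"rdmaGroup,omitempty"`
}

// digest returns the hex SHA-256 of the JSON encoding of v.
func digest(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	before := digest(servicePreRDMA{Name: "web"})

	// Without omitempty the JSON gains "rdmaGroup":"" and the digest shifts.
	fmt.Println(before == digest(serviceAlways{Name: "web"})) // false

	// With omitempty a non-RDMA service serializes exactly as before.
	fmt.Println(before == digest(serviceOmitEmpty{Name: "web"})) // true
}
```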
CI sdl-parity job failed on the prior commit because the Go fixtures (with omitempty) drop rdmaGroup for non-RDMA services, while TS's manifestReplacer kept emitting "rdmaGroup":"". Mismatch on every v2.0 fixture.

manifestReplacer already had OMITTED_MANIFEST_KEYS for empty arrays / zero numbers. Adding a parallel OMITTED_WHEN_EMPTY_STRING_KEYS set so fields that use Go's `omitempty` semantics for string values can be declared the same way on the TS side. rdmaGroup is the only entry for now; future "omitempty string" fields just add their key here.

Updated the generateManifestVersion snapshot to match the corrected serialization (no more "rdmaGroup":"" leaking into the hash input).
Prep for AKT-443 (provider bid-engine group-aware Adjust). The provider needs to enforce per-rdma_group node separation at fit time; today the workload builder's hostname pod anti-affinity is the only gate, which fires after the bid has been accepted. The chain-SDK previously stripped rdma_group from the GPU attribute slice and surfaced it only on the off-chain Service.RDMAGroup field, so the bid engine had no signal.

Change: emit `rdma_group=<value>` as a regular on-chain GPU attribute alongside `rdma=true`, while still lifting the value into v2ResourceGPU.RDMAGroup → Service.RDMAGroup for the workload builder. Both consumers see the same value via their respective paths.

- go/sdl/gpu.go: drop the gpuAttributeRDMAGroupSentinel layer. Emit rdma_group directly (the key matches the attribute key regex; no underscore prefix needed). Lift-but-keep instead of lift-and-strip.
- go/sdl/rdma_gpu_test.go: update assertions: the on-chain slice now CONTAINS rdma_group when set, AND v2ResourceGPU.RDMAGroup holds the same value. New name reflects the end-to-end flow.
- ts/src/sdl/manifest/manifestUtils.ts: transformGpuAttributes now emits both rdma and rdma_group keys, and sorts the result to match Go's sort.Sort(res) byte ordering.
- ts/src/sdl/manifest/generateManifest.ts: pass rdmaGroup from compute.resources.gpu.attributes.rdma_group into Service.fromPartial so it surfaces on the off-chain Service.rdmaGroup field.
- ts/src/sdl/validateSDL/validateSDLInput.ts: regenerated from the updated input schema (already had rdma + rdma_group properties from the prior commit).
- testdata/sdl/input/v2.0/gpu-rdma-group/: new fixture exercising the end-to-end path. Each service uses its own profile to sidestep an orthogonal Go-vs-TS resource-ID-assignment parity gap.

Rollout note: providers running pre-skip chain-sdk versions will reject orders carrying the new rdma_group attribute (ParseGPUAttributes returns "invalid GPU attribute"). The provider PR already lands the skip; any provider wanting to bid on RDMA orders must upgrade past it.
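The "lift-but-keep" flow described in this commit can be illustrated with a small sketch. The names here (`attribute`, `buildGPUAttributes`) are hypothetical and the real parser handles more keys; the only behavior taken from the commit message is that `rdma_group` stays in the sorted attribute slice while the same value is also returned for the off-chain field.

```go
package main

import (
	"fmt"
	"sort"
)

// attribute is a hypothetical key/value pair standing in for an on-chain GPU attribute.
type attribute struct {
	Key   string
	Value string
}

// buildGPUAttributes keeps rdma_group in the attribute slice (for the bid
// engine) and also returns it separately (for the off-chain manifest field).
func buildGPUAttributes(rdma bool, rdmaGroup string) ([]attribute, string) {
	var attrs []attribute
	if rdmaGroup != "" {
		attrs = append(attrs, attribute{Key: "rdma_group", Value: rdmaGroup})
	}
	if rdma {
		attrs = append(attrs, attribute{Key: "rdma", Value: "true"})
	}
	// Sort by key using Go's bytewise string ordering so every
	// implementation that sorts the same way emits identical output.
	sort.Slice(attrs, func(i, j int) bool { return attrs[i].Key < attrs[j].Key })
	return attrs, rdmaGroup
}

func main() {
	attrs, lifted := buildGPUAttributes(true, "ring-a")
	fmt.Println(attrs, lifted) // [{rdma true} {rdma_group ring-a}] ring-a
}
```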
Actionable comments posted: 1
🧹 Nitpick comments (1)
go/sdl/gpu.go (1)
156-160: ⚡ Quick win: Update stale sentinel comments to match current `rdma_group` flow.
Line 156 and Line 209 still describe a sentinel that gets stripped, but the code now keeps `rdma_group` in on-chain GPU attributes. Please align comments with actual behavior to avoid future mis-implementation.
Suggested comment-only diff:
- // gpu.attributes.rdma_group: string (peer group name). Captured
- // here and emitted into the slice as a sentinel attribute that
- // v2ResourceGPU.UnmarshalYAML strips before it reaches chain
- // state. See gpuAttributeRDMAGroupSentinel.
+ // gpu.attributes.rdma_group: string (peer group name). Captured
+ // here and emitted directly as an on-chain GPU attribute
+ // (GPUAttributeRDMAGroup). v2ResourceGPU.UnmarshalYAML also lifts
+ // the same value into v2ResourceGPU.RDMAGroup for manifest routing.

- // Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
- // rdma_group sentinel can be stripped from the slice before the
- // attribute-key regex runs against it.
+ // Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
+ // final assembled attribute slice is validated once in one place.
Also applies to: 209-211
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ts/src/sdl/manifest/manifestUtils.ts`:
- Around line 78-87: The code currently pushes attributes.rdma_group into the
on-chain GPU attributes (via the result.push({ key: "rdma_group", ... }) call),
but rdma_group must remain off-chain; remove the block that emits rdma_group
into Resources.GPU.Attributes in manifestUtils (i.e., delete or disable the if
(attributes.rdma_group && attributes.rdma_group.length > 0) result.push(...)
branch) and keep only the rdma boolean emission (the if (attributes.rdma ===
true) result.push({ key: "rdma", value: "true" }) branch); ensure any callers or
type definitions still treat attributes.rdma_group as an off-chain-only field
and do not rely on it being serialized into the result array.
---
Nitpick comments:
In `@go/sdl/gpu.go`:
- Around line 156-160: The comments near the GPU YAML decode (around the
node.Content[i+1].Decode(&rdmaGroup) block) and the comment referencing
gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are stale: they
state rdma_group is emitted as a sentinel that v2ResourceGPU.UnmarshalYAML
strips, but the code now preserves rdma_group in on-chain GPU attributes. Update
those comments to reflect the current flow (rdma_group is decoded and retained
in the GPU attributes on-chain) and remove or reword any mention of a sentinel
being stripped; keep references to gpuAttributeRDMAGroupSentinel and
v2ResourceGPU.UnmarshalYAML only if describing their actual behavior in the
current implementation.
ℹ️ Review info
⛔ Files ignored due to path filters (2)
- go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
- ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
📒 Files selected for processing (15)
- go/inventory/v1/resourcepair.go
- go/inventory/v1/resourcepair_zero_test.go
- go/sdl/gpu.go
- go/sdl/rdma_gpu_test.go
- go/sdl/sdl-input.schema.yaml
- go/sdl/v2.go
- go/sdl/v2_1.go
- proto/provider/akash/manifest/v2beta3/service.proto
- testdata/sdl/input/v2.0/gpu-rdma-group/input.yaml
- testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
- testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/manifest.json
- ts/src/sdl/manifest/generateManifest.ts
- ts/src/sdl/manifest/generateManifestVersion.ts
- ts/src/sdl/manifest/manifestUtils.ts
- ts/src/sdl/validateSDL/validateSDLInput.ts
✅ Files skipped from review due to trivial changes (1)
- testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
🚧 Files skipped from review as they are similar to previous changes (2)
- go/inventory/v1/resourcepair.go
- go/sdl/v2.go
TS validateSDL had no RDMA semantic checks; tenants using the TS SDK
could broadcast SDLs that the Go parser would reject outright (chain
doesn't validate SDL semantics, so the failure would surface later as
broken pods).
Adds a #validateRDMA method on SDLValidator, called from validate()
after the per-service loop. Mirrors the Go-side validateRDMA in
go/sdl/v2.go and enforces the same three cross-field rules:
1. A profile with gpu.attributes.rdma=true must be deployed under a
placement whose attributes require capabilities/rdma=true.
2. A profile with gpu.attributes.rdma_group set must also have
gpu.attributes.rdma=true on the same profile.
3. Within one deployment (placement), if any profile sets
rdma_group, every rdma=true profile under that placement must
also set rdma_group — no implicit-default-plus-explicit mixing.
The units==0 + rdma/rdma_group case is already rejected by the
schema-level gpuAttributesRequireUnitsGt0 rule (any attribute requires
units > 0), so no semantic check needed there. Pinned by two new tests
that assert the schema path, so a future schema relaxation can't
silently reopen the hole.
Five new tests cover the rejection paths plus a happy path. Existing
SDL parity tests stay at 34/34.
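The three cross-field rules listed above can be sketched for a single placement as follows. Go is used to match the Go-side validator the TS method mirrors; this is an illustration under assumptions, not the code in go/sdl/v2.go. The `profile` and `placement` types are hypothetical simplifications, and the `GPUUnits == 0` skip reflects the defense-in-depth gate from the earlier commit.

```go
package main

import "fmt"

// profile is a hypothetical, flattened view of one compute profile.
type profile struct {
	Name      string
	GPUUnits  uint64
	RDMA      bool   // gpu.attributes.rdma
	RDMAGroup string // gpu.attributes.rdma_group
}

// placement is a hypothetical view of one placement and the profiles deployed under it.
type placement struct {
	Attributes map[string]string
	Profiles   []profile
}

// validateRDMA applies the three cross-field rules to one placement.
func validateRDMA(p placement) error {
	// Rule 3 needs to know up front whether any profile names a group.
	anyGroup := false
	for _, pr := range p.Profiles {
		if pr.GPUUnits > 0 && pr.RDMAGroup != "" {
			anyGroup = true
		}
	}
	for _, pr := range p.Profiles {
		if pr.GPUUnits == 0 {
			continue // zero-GPU profiles are never treated as RDMA-enabled
		}
		// Rule 2: rdma_group implies rdma=true on the same profile.
		if pr.RDMAGroup != "" && !pr.RDMA {
			return fmt.Errorf("profile %q: rdma_group requires rdma: true", pr.Name)
		}
		if !pr.RDMA {
			continue
		}
		// Rule 1: the placement must require an RDMA-capable provider.
		if p.Attributes["capabilities/rdma"] != "true" {
			return fmt.Errorf("profile %q: placement must set capabilities/rdma=true", pr.Name)
		}
		// Rule 3: no mixing of implicit and explicit groups.
		if anyGroup && pr.RDMAGroup == "" {
			return fmt.Errorf("profile %q: rdma_group required when any profile sets one", pr.Name)
		}
	}
	return nil
}

func main() {
	ok := placement{
		Attributes: map[string]string{"capabilities/rdma": "true"},
		Profiles: []profile{
			{Name: "trainer", GPUUnits: 8, RDMA: true, RDMAGroup: "ring-a"},
			{Name: "worker", GPUUnits: 8, RDMA: true, RDMAGroup: "ring-a"},
		},
	}
	fmt.Println(validateRDMA(ok)) // <nil>

	mixed := ok
	mixed.Profiles = []profile{ok.Profiles[0], {Name: "worker", GPUUnits: 8, RDMA: true}}
	fmt.Println(validateRDMA(mixed) != nil) // true
}
```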