Skip to content

feat: RDMA multinode GPU support (chain SDK pieces)#315

Open
Zblocker64 wants to merge 6 commits into
mainfrom
feat/rdma-multinode
Open

feat: RDMA multinode GPU support (chain SDK pieces)#315
Zblocker64 wants to merge 6 commits into
mainfrom
feat/rdma-multinode

Conversation

@Zblocker64

Copy link
Copy Markdown
Contributor

Implements the akash-chain-sdk slice of the RDMA spec at provider/_docs/infiniband-implementation-spec.md, covering Linear tickets AKT-401..AKT-403, AKT-405, AKT-406 (CS-1, CS-2, CS-3, CS-5, CS-6). The shared-storage track (CS-4) is intentionally dropped: per spec decision #3 every service gets its own RWO PVC and the provider does not produce ReadWriteMany volumes.

Everything in this commit is off-chain — no on-chain proto messages change, no validator upgrade required.

CS-1 — Inventory v1
(proto/provider/akash/inventory/v1/{node,resources}.proto +
go/inventory/v1/{node,resources}.pb.go regenerated via make proto-gen-go):

  • Add ResourcePair rdma = 7 to NodeResources.
  • Add rdma_resource_name, rdma_fabric, nccl_hca_prefix strings to NodeCapabilities with gogoproto.customname annotations (RDMAResourceName, RDMAFabric, NCCLHCAPrefix).
  • Extend Dup() helpers in node.go and resources.go.
  • Tests: node_test.go, resources_test.go round-trip the new fields through Dup().

CS-2 — SDL parser: gpu.attributes.rdma: true (go/sdl/gpu.go):

  • Accept rdma: true|false under gpu.attributes. When true, emit a flat on-chain GPU attribute rdma=true so providers that advertise capabilities/gpu/rdma=true match.
  • Tests: rdma_gpu_test.go::TestV2ResourceGPU_RDMAFlag plus the existing TestV2ResourceGPU regression continues to pass.

CS-3 — SDL parser: gpu.attributes.rdma_group: <name>
off-chain manifest field
(go/sdl/gpu.go,
proto/provider/akash/manifest/v2beta3/service.proto regenerated to
go/manifest/v2beta3/service.pb.go,
go/sdl/groupBuilder_v2{,_1}.go):

  • Parser captures rdma_group via an internal sentinel attribute key (__rdma_group__) inside v2GPUAttributes.UnmarshalYAML. The parent v2ResourceGPU.UnmarshalYAML strips the sentinel before the slice ever reaches on-chain Resources.GPU.attributes and surfaces the value on a dedicated v2ResourceGPU.RDMAGroup field. Validate() is deferred to the parent so the sentinel can be removed before the attribute-key regex runs.
  • Manifest Service.proto gains rdma_group = 11 with gogoproto.customname = "RDMAGroup". Bindings regenerated via make proto-gen-go.
  • Both v2 / v2.1 group builders now read compute.Resources.GPU.RDMAGroup and propagate it onto manifest.Service.RDMAGroup.
  • Tests: TestV2ResourceGPU_RDMAGroupRoutedOffChain (asserts the sentinel never escapes) and TestV2ResourceGPU_RDMAGroupOmitted.

CS-5 — Parser cross-field validations
(go/sdl/v2.go, validate() → new validateRDMA()):

  1. Any compute profile with gpu.attributes.rdma: true requires its placement attributes to include capabilities/rdma=true.
  2. Any compute profile with gpu.attributes.rdma_group set must also declare gpu.attributes.rdma: true on the same profile.
  3. Within one placement, no implicit-default-plus-explicit mixing: if any profile sets rdma_group, every RDMA-using profile must. Helpers gpuAttributesHaveRDMA and placementRequiresRDMA are kept package-local so the SDL parser owns the policy.
  • Tests: rdma_validation_test.go (6 positive + negative fixtures).

CS-6 — Reservation commit path audit
(go/node/deployment/v1beta4/rdma_commit_audit_test.go):
Table-driven regression test pinning down that GroupSpec.Dup() and the
four concrete ResourceGroup-shaped values the provider's reservation
path can hold (*Group, Group, *GroupSpec, GroupSpec) all preserve
Requirements.Attributes (carrying capabilities/rdma=true) AND each
resource's GPU.Attributes (carrying rdma=true). A future change that
silently drops either slice — the exact failure mode the spec calls
out — fails this test loudly.

All .pb.go files were regenerated via make proto-gen-go (buf v1.47.2, protoc v29.1, gogoproto v1.7.2). Running make proto-gen-go against this tree should be a no-op.

Tests (go test ./...):

  • pkg.akt.dev/go/inventory/v1 PASS
  • pkg.akt.dev/go/manifest/v2beta3 PASS
  • pkg.akt.dev/go/node/deployment/v1beta4 PASS (+ new CS-6 audit)
  • pkg.akt.dev/go/sdl PASS (+ new CS-2/CS-3/CS-5)

Follow-ups for reviewers:

  • TypeScript bindings (ts/) are not touched here and need a separate make proto-gen-ts pass before the TS SDK consumes the new fields.

Linear: AKT-401, AKT-402, AKT-403, AKT-405, AKT-406
(AKT-404 / CS-4 cancelled — see spec decision #3)

📝 Description

[Explain what this PR does in 2-3 sentences. Include context about the feature or problem being solved]

🔧 Purpose of the Change

  • New feature implementation
  • Bug fix
  • Documentation update
  • Code refactoring
  • Dependency upgrade
  • Other: [specify]

📌 Related Issues

  • Closes #ISSUE_NUMBER
  • References #ISSUE_NUMBER

✅ Checklist

  • I've updated relevant documentation
  • Code follows Akash Network's style guide
  • I've added/updated relevant unit tests
  • Dependencies have been properly updated
  • I agree and adhered to the Contribution Guidelines

📎 Notes for Reviewers

[Include any additional context, architectural decisions, or specific areas to focus on]

@Zblocker64 Zblocker64 requested a review from a team as a code owner June 2, 2026 19:16
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a2d76be8-8850-447e-8dd9-8c24fafbcfd9

📥 Commits

Reviewing files that changed from the base of the PR and between c1bbce4 and 5bf26e0.

📒 Files selected for processing (2)
  • ts/src/sdl/validateSDL/validateSDL.spec.ts
  • ts/src/sdl/validateSDL/validateSDL.ts

Walkthrough

This PR adds comprehensive RDMA (Remote Direct Memory Access) support throughout the Akash chain SDK. Changes span protobuf schema extensions for RDMA capabilities, safe deep-copy operations in inventory services, SDL YAML parsing with cross-field validation, manifest propagation, and extensive test coverage validating signal preservation and validation invariants.

Changes

RDMA Support Implementation

Layer / File(s) Summary
Protobuf Schema Extensions
proto/provider/akash/inventory/v1/node.proto, proto/provider/akash/inventory/v1/resources.proto, proto/provider/akash/manifest/v2beta3/service.proto
NodeCapabilities adds three RDMA capability fields (rdma_resource_name, rdma_fabric, nccl_hca_prefix). NodeResources adds RDMA ResourcePair. Service adds rdma_group field with omitempty serialization for off-chain peer grouping.
Inventory Duplication & Deep-Copy Safety
go/inventory/v1/node.go, go/inventory/v1/node_test.go, go/inventory/v1/resources.go, go/inventory/v1/resources_test.go, go/inventory/v1/resourcepair.go, go/inventory/v1/resourcepair_zero_test.go
ResourcePair.IsZero() detects uninitialized state. Dup() safely handles nil/zero receivers and deep-copies quantities/attributes. NodeCapabilities.Dup() and NodeResources.Dup() now copy RDMA fields. Tests verify nil-safety, zero-value handling, deep-copy semantics, and pointer preservation.
Provider Deployment RDMA Audit
go/node/deployment/v1beta4/rdma_commit_audit_test.go
Tests validate RDMA opt-in attributes survive GroupSpec.Dup(), remain accessible through four ResourceGroup concrete shapes via provider accessor patterns, and GetResourceUnits() preserves GPU RDMA attributes.
SDL GPU RDMA Parsing & Routing
go/sdl/gpu.go, go/sdl/rdma_gpu_test.go, go/sdl/sdl-input.schema.yaml
v2ResourceGPU gains RDMAGroup. v2GPUAttributes.UnmarshalYAML recognizes rdma (bool) and rdma_group (string) and emits on-chain attributes. v2ResourceGPU.UnmarshalYAML extracts rdma_group into RDMAGroup, defers validation to parent, and rejects units: 0 when RDMA config is present. Tests validate flag emission, off-chain/on-chain routing, zero-units rejection, and attribute key strictness.
Manifest Building: RDMAGroup Propagation
go/sdl/groupBuilder_v2.go, go/sdl/groupBuilder_v2_1.go, ts/src/sdl/manifest/generateManifest.ts
Go builders conditionally copy GPU.RDMAGroup into generated manifest.Service.RDMAGroup. TypeScript buildManifestService propagates rdma_group from compute GPU attributes into the protobuf Service.rdmaGroup field.
SDL RDMA Cross-Field Validation
go/sdl/v2.go, go/sdl/v2_1.go, go/sdl/rdma_validation_test.go
validateRDMA() enforces three invariants: RDMA profiles require placement capabilities/rdma=true; rdma_group requires rdma:true on the same profile; placements must not mix explicit and implicit rdma_group. Tests cover negative and positive cases for all rules.
Manifest Serialization
ts/src/sdl/manifest/generateManifestVersion.ts, ts/src/sdl/manifest/manifestUtils.ts
rdmaGroup omitted from JSON when empty string. transformGpuAttributes refactored to imperative builder consolidating vendor and RDMA attributes and sorting by key for deterministic output.
Test Fixtures
testdata/sdl/input/v2.0/gpu-rdma-group/input.yaml, testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json, testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/manifest.json
SDL v2.0 fixture defines multi-service GPU/RDMA peer group with two services sharing rdma_group: pair0. Output fixtures include group-specs and manifest JSON with RDMA configuration, GPU attributes, and network mappings.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • troian

Poem

🐰 A rabbit hops through RDMA fields,
Deep-copying safely, never yields,
Three validation rules so clean,
Peer groups cluster—a distributed dream,
Fast paths align with manifest gleam! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: RDMA multinode GPU support (chain SDK pieces)' accurately describes the main objective of implementing RDMA support in the SDK.
Description check ✅ Passed The description is comprehensive, covering objectives, implementation details across CS-1 through CS-6, technical changes, testing, and follow-ups. However, the template sections are not all completed—checkboxes remain unchecked and some sections (Related Issues, etc.) use placeholder text.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/rdma-multinode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resources.go`:
- Line 11: NodeResources.Dup() currently calls s.RDMA.Dup() unconditionally
which panics for zero-value RDMA; fix by making ResourcePair.Dup() zero-safe: in
ResourcePair.Dup() detect the zero/uninitialized receiver and return a
zero-value ResourcePair (or a safe initialized copy) instead of dereferencing
nil/internal fields, or alternatively change NodeResources.Dup() to guard the
call (e.g., if s.RDMA.IsZero() { copy.RDMA = ResourcePair{} } else { copy.RDMA =
s.RDMA.Dup() }); update ResourcePair.Dup() and/or add an IsZero helper so RDMA
is safe to duplicate without panics.

In `@go/sdl/v2.go`:
- Around line 561-566: The RDMA validation currently triggers based solely on
gpu.Attributes/rdma_group even for profiles with zero GPUs; update the logic in
the sdl.Profiles.Compute loop (the compute.Resources.GPU handling) to only
consider RDMA attributes or RDMAGroup when the GPU actually has Units > 0, or
alternatively reject rdma/rdma_group when gpu.Units == 0 in
v2ResourceGPU.UnmarshalYAML; specifically change the condition that calls
gpuAttributesHaveRDMA and checks gpu.RDMAGroup to also require gpu.Units > 0 (or
add a guard earlier to return an error if rdma/rdma_group is present while
gpu.Units == 0) so zero-unit GPU profiles do not appear RDMA-enabled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7b5b662f-91e0-4b06-b1ef-7a719b1a350d

📥 Commits

Reviewing files that changed from the base of the PR and between bf983bc and 5ce7643.

⛔ Files ignored due to path filters (3)
  • go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
  • go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (14)
  • go/inventory/v1/node.go
  • go/inventory/v1/node_test.go
  • go/inventory/v1/resources.go
  • go/inventory/v1/resources_test.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/sdl/gpu.go
  • go/sdl/groupBuilder_v2.go
  • go/sdl/groupBuilder_v2_1.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/rdma_validation_test.go
  • go/sdl/v2.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • proto/provider/akash/manifest/v2beta3/service.proto

Comment thread go/inventory/v1/resources.go
Comment thread go/sdl/v2.go Outdated
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from 6024641 to 5252ecc Compare June 2, 2026 19:31
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch 2 times, most recently from c0301cb to 4c19e6b Compare June 2, 2026 20:03
@github-actions github-actions Bot added the C:ts label Jun 2, 2026
Implements the akash-chain-sdk slice of the RDMA spec at
provider/_docs/infiniband-implementation-spec.md, covering Linear tickets
AKT-401..AKT-403, AKT-405, AKT-406 (CS-1, CS-2, CS-3, CS-5, CS-6). The
shared-storage track (CS-4) is intentionally dropped: per spec decision #3
every service gets its own RWO PVC and the provider does not produce
ReadWriteMany volumes.

Everything in this commit is off-chain — no on-chain proto messages change,
no validator upgrade required.

CS-1 — Inventory v1
       (proto/provider/akash/inventory/v1/{node,resources}.proto +
        go/inventory/v1/{node,resources}.pb.go regenerated via `make proto-gen-go`):
  * Add `ResourcePair rdma = 7` to `NodeResources`.
  * Add `rdma_resource_name`, `rdma_fabric`, `nccl_hca_prefix` strings to
    `NodeCapabilities` with gogoproto.customname annotations
    (`RDMAResourceName`, `RDMAFabric`, `NCCLHCAPrefix`).
  * Extend `Dup()` helpers in node.go and resources.go.
  * Tests: `node_test.go`, `resources_test.go` round-trip the new fields
    through Dup().

CS-2 — SDL parser: `gpu.attributes.rdma: true` (go/sdl/gpu.go):
  * Accept `rdma: true|false` under `gpu.attributes`. When true, emit a
    flat on-chain GPU attribute `rdma=true` so providers that advertise
    `capabilities/gpu/rdma=true` match.
  * Tests: `rdma_gpu_test.go::TestV2ResourceGPU_RDMAFlag` plus the existing
    `TestV2ResourceGPU` regression continues to pass.

CS-3 — SDL parser: `gpu.attributes.rdma_group: <name>` →
       off-chain manifest field
       (go/sdl/gpu.go,
        proto/provider/akash/manifest/v2beta3/service.proto regenerated to
        go/manifest/v2beta3/service.pb.go,
        go/sdl/groupBuilder_v2{,_1}.go):
  * Parser captures rdma_group via an internal sentinel attribute key
    (`__rdma_group__`) inside `v2GPUAttributes.UnmarshalYAML`. The parent
    `v2ResourceGPU.UnmarshalYAML` strips the sentinel before the slice ever
    reaches on-chain `Resources.GPU.attributes` and surfaces the value on
    a dedicated `v2ResourceGPU.RDMAGroup` field. `Validate()` is deferred
    to the parent so the sentinel can be removed before the
    attribute-key regex runs.
  * Manifest `Service.proto` gains `rdma_group = 11` with
    `gogoproto.customname = "RDMAGroup"`. Bindings regenerated via
    `make proto-gen-go`.
  * Both v2 / v2.1 group builders now read `compute.Resources.GPU.RDMAGroup`
    and propagate it onto `manifest.Service.RDMAGroup`.
  * Tests: `TestV2ResourceGPU_RDMAGroupRoutedOffChain` (asserts the
    sentinel never escapes) and `TestV2ResourceGPU_RDMAGroupOmitted`.

CS-5 — Parser cross-field validations
       (go/sdl/v2.go, validate() → new validateRDMA()):
  1. Any compute profile with `gpu.attributes.rdma: true` requires its
     placement attributes to include `capabilities/rdma=true`.
  2. Any compute profile with `gpu.attributes.rdma_group` set must also
     declare `gpu.attributes.rdma: true` on the same profile.
  3. Within one placement, no implicit-default-plus-explicit mixing: if
     any profile sets rdma_group, every RDMA-using profile must.
  Helpers `gpuAttributesHaveRDMA` and `placementRequiresRDMA` are kept
  package-local so the SDL parser owns the policy.
  * Tests: `rdma_validation_test.go` (6 positive + negative fixtures).

CS-6 — Reservation commit path audit
       (go/node/deployment/v1beta4/rdma_commit_audit_test.go):
  Table-driven regression test pinning down that `GroupSpec.Dup()` and the
  four concrete `ResourceGroup`-shaped values the provider's reservation
  path can hold (`*Group`, `Group`, `*GroupSpec`, `GroupSpec`) all preserve
  `Requirements.Attributes` (carrying `capabilities/rdma=true`) AND each
  resource's `GPU.Attributes` (carrying `rdma=true`). A future change that
  silently drops either slice — the exact failure mode the spec calls
  out — fails this test loudly.

All `.pb.go` files were regenerated via `make proto-gen-go` (buf v1.47.2,
protoc v29.1, gogoproto v1.7.2). Running `make proto-gen-go` against this
tree should be a no-op.

Tests (go test ./...):
  - pkg.akt.dev/go/inventory/v1                  PASS
  - pkg.akt.dev/go/manifest/v2beta3              PASS
  - pkg.akt.dev/go/node/deployment/v1beta4       PASS (+ new CS-6 audit)
  - pkg.akt.dev/go/sdl                           PASS (+ new CS-2/CS-3/CS-5)

Follow-ups for reviewers:
  - TypeScript bindings (`ts/`) are not touched here and need a separate
    `make proto-gen-ts` pass before the TS SDK consumes the new fields.

Linear: AKT-401, AKT-402, AKT-403, AKT-405, AKT-406
        (AKT-404 / CS-4 cancelled — see spec decision #3)

Fix make/setup-cache.mk: defer GOLANGCI_LINT_MAJOR computation to
recipe-execute time and depend on $(SEMVER), so the install isn't
given an empty major and the broken module path
.../golangci-lint/v/cmd/golangci-lint. This bug had `lint/go` red on
main for several commits prior.
@Zblocker64 Zblocker64 force-pushed the feat/rdma-multinode branch from 4c19e6b to 5745d21 Compare June 2, 2026 20:14

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
go/inventory/v1/resourcepair_zero_test.go (1)

38-49: ⚡ Quick win

Add a regression test for partial-nil ResourcePair duplication.

Please add a case where one of Capacity/Allocatable/Allocated is nil and verify Dup() keeps that field nil after copy (instead of converting it to zero). This will prevent future contract regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/inventory/v1/resourcepair_zero_test.go` around lines 38 - 49, Add a
regression test that ensures ResourcePair.Dup() preserves nil fields: in
resourcepair_zero_test.go (e.g., extend or add a test alongside
TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/inventory/v1/resourcepair.go`:
- Around line 70-85: The Dup implementation in resourcepair.go currently always
returns non-nil pointers (&capacity etc.), losing nil presence; change it so
Capacity, Allocatable and Allocated are set to nil when the source
m.Capacity/m.Allocatable/m.Allocated are nil, and only allocate and assign a
DeepCopy pointer when the corresponding source field is non-nil (e.g., check
m.Capacity != nil then copy and set Capacity to that pointer, otherwise set
Capacity to nil); keep Attributes copied via m.Attributes.Dup() as is.

---

Nitpick comments:
In `@go/inventory/v1/resourcepair_zero_test.go`:
- Around line 38-49: Add a regression test that ensures ResourcePair.Dup()
preserves nil fields: in resourcepair_zero_test.go (e.g., extend or add a test
alongside TestResourcePair_Dup_PopulatedRoundTrips) construct a ResourcePair via
NewResourcePair or literal where one of Capacity/Allocatable/Allocated is nil,
call rp.Dup(), then assert the corresponding field on the returned copy is still
nil (not zero), and also mutate the copy and assert the original remains
unchanged to confirm deep-copy semantics for partially-nil ResourcePair fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 89166079-2d14-4230-b011-b2b98830a3cb

📥 Commits

Reviewing files that changed from the base of the PR and between 5ce7643 and 060c02f.

⛔ Files ignored due to path filters (7)
  • go/inventory/v1/node.pb.go is excluded by !**/*.pb.go
  • go/inventory/v1/resources.pb.go is excluded by !**/*.pb.go
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
  • ts/src/generated/protos/akash/inventory/v1/node.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/inventory/v1/resources.ts is excluded by !**/generated/**
  • ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
  • ts/src/sdl/manifest/__snapshots__/generateManifestVersion.spec.ts.snap is excluded by !**/*.snap
📒 Files selected for processing (29)
  • go/inventory/v1/node.go
  • go/inventory/v1/node_test.go
  • go/inventory/v1/resourcepair.go
  • go/inventory/v1/resourcepair_zero_test.go
  • go/inventory/v1/resources.go
  • go/inventory/v1/resources_test.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/sdl/gpu.go
  • go/sdl/groupBuilder_v2.go
  • go/sdl/groupBuilder_v2_1.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/rdma_validation_test.go
  • go/sdl/v2.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • proto/provider/akash/manifest/v2beta3/service.proto
  • testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
  • testdata/sdl/output-fixtures/v2.0/http-options/manifest.json
  • testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.0/multiple-services/manifest.json
  • testdata/sdl/output-fixtures/v2.0/persistent-storage/manifest.json
  • testdata/sdl/output-fixtures/v2.0/placement/manifest.json
  • testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
  • testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
  • testdata/sdl/output-fixtures/v2.0/simple/manifest.json
  • testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
  • testdata/sdl/output-fixtures/v2.1/credentials/manifest.json
  • testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
✅ Files skipped from review due to trivial changes (10)
  • testdata/sdl/output-fixtures/v2.0/gpu-basic/manifest.json
  • go/inventory/v1/resources.go
  • testdata/sdl/output-fixtures/v2.0/port-ranges/manifest.json
  • testdata/sdl/output-fixtures/v2.0/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.1/shared-ip/manifest.json
  • testdata/sdl/output-fixtures/v2.0/storage-classes/manifest.json
  • testdata/sdl/output-fixtures/v2.0/placement/manifest.json
  • testdata/sdl/output-fixtures/v2.1/ip-endpoint/manifest.json
  • testdata/sdl/output-fixtures/v2.0/simple/manifest.json
  • testdata/sdl/output-fixtures/v2.0/pricing/manifest.json
🚧 Files skipped from review as they are similar to previous changes (12)
  • go/sdl/groupBuilder_v2_1.go
  • go/inventory/v1/node.go
  • proto/provider/akash/inventory/v1/node.proto
  • proto/provider/akash/manifest/v2beta3/service.proto
  • proto/provider/akash/inventory/v1/resources.proto
  • go/inventory/v1/resources_test.go
  • go/sdl/groupBuilder_v2.go
  • go/node/deployment/v1beta4/rdma_commit_audit_test.go
  • go/inventory/v1/node_test.go
  • go/sdl/v2.go
  • go/sdl/gpu.go
  • go/sdl/rdma_validation_test.go

Comment thread go/inventory/v1/resourcepair.go Outdated
Comment thread go/sdl/groupBuilder_v2_1.go
Comment thread proto/provider/akash/manifest/v2beta3/service.proto Outdated
Comment thread go/sdl/gpu.go
Comment thread ts/src/generated/protos/akash/manifest/v2beta3/service.ts
Six review items from chalabi2 + CodeRabbit:

- proto/service.proto + regenerated .pb.go/.ts: add omitempty to
  Service.RDMAGroup JSON/YAML tags. The on-chain manifest version hash
  is a SHA over the JSON-serialized off-chain manifest; without
  omitempty every pre-RDMA service serialized "rdmaGroup": "" and
  shifted the hash for existing leases, breaking send-manifest
  validation. Fixtures regenerated — diff vs main is zero on every
  non-RDMA testdata manifest, confirming hash stability.

- sdl/v2.go + v2_1.go: extract validateRDMA() into a free function
  taking (profiles, deployments) and call it from both v2 and v2.1
  validate(). v2.1 inherits the full RDMA SDL grammar from v2 (parser
  promotes gpu.attributes.rdma + rdma_group through to the manifest),
  so the cross-field invariants (rule 1: rdma-required workload reaches
  rdma-capable provider; rule 2: rdma_group ⇒ rdma=true; rule 3: no
  implicit/explicit group mixing within one deployment) must apply
  symmetrically. Without this, invalid v2.1 SDLs bypass the new rules.

- sdl/sdl-input.schema.yaml: declare rdma (boolean) and rdma_group
  (string) under gpu.attributes. The parser accepts them but the schema
  had additionalProperties: false and only allowed `vendor`, rejecting
  valid RDMA SDLs at schema-validation time.

- inventory/v1/resourcepair.go: ResourcePair.Dup() now preserves the
  nil/non-nil shape of Capacity/Allocatable/Allocated rather than
  always returning &zeroQuantity. Returning non-nil pointers for
  originally-nil fields changes protobuf field-presence semantics and
  shifts the JSON serialization (which feeds the manifest hash).
  Regression-pinned by TestResourcePair_Dup_PreservesNilPointers.

- sdl/v2.go validateRDMA: defense-in-depth gate on gpu.Units > 0.
  The parser already rejects rdma/rdma_group on zero-GPU profiles, but
  the validator should not classify a zero-GPU profile as RDMA-enabled
  if that parser path is ever bypassed.

The TS proto-regen as part of #6 incidentally addresses the TS bindings
gap — ts/src/generated/protos/akash/manifest/v2beta3/service.ts now
parses and emits rdma_group with the same omitempty semantics.
CI sdl-parity job failed on the prior commit because the Go fixtures
(with omitempty) drop rdmaGroup for non-RDMA services, while TS's
manifestReplacer kept emitting "rdmaGroup":"". Mismatch on every v2.0
fixture.

manifestReplacer already had OMITTED_MANIFEST_KEYS for empty arrays /
zero numbers. Adding a parallel OMITTED_WHEN_EMPTY_STRING_KEYS set so
fields that use Go's `omitempty` semantics for string values can be
declared the same way on the TS side. rdmaGroup is the only entry for
now; future "omitempty string" fields just add their key here.

Updated the generateManifestVersion snapshot to match the corrected
serialization (no more "rdmaGroup":"" leaking into the hash input).
Prep for AKT-443 (provider bid-engine group-aware Adjust). The provider
needs to enforce per-rdma_group node separation at fit time — today the
workload builder's hostname pod anti-affinity is the only gate, which
fires after the bid has been accepted. The chain-SDK previously stripped
rdma_group from the GPU attribute slice and surfaced it only on the
off-chain Service.RDMAGroup field, so the bid engine had no signal.

Change: emit `rdma_group=<value>` as a regular on-chain GPU attribute
alongside `rdma=true`, while still lifting the value into
v2ResourceGPU.RDMAGroup → Service.RDMAGroup for the workload builder.
Both consumers see the same value via their respective paths.

- go/sdl/gpu.go: drop the gpuAttributeRDMAGroupSentinel layer. Emit
  rdma_group directly (the key matches the attribute key regex; no
  underscore prefix needed). Lift-but-keep instead of lift-and-strip.
- go/sdl/rdma_gpu_test.go: update assertions — the on-chain slice now
  CONTAINS rdma_group when set, AND v2ResourceGPU.RDMAGroup holds the
  same value. New name reflects the end-to-end flow.
- ts/src/sdl/manifest/manifestUtils.ts: transformGpuAttributes now
  emits both rdma and rdma_group keys, and sorts the result to match
  Go's sort.Sort(res) byte ordering.
- ts/src/sdl/manifest/generateManifest.ts: pass rdmaGroup from
  compute.resources.gpu.attributes.rdma_group into Service.fromPartial
  so it surfaces on the off-chain Service.rdmaGroup field.
- ts/src/sdl/validateSDL/validateSDLInput.ts: regenerated from the
  updated input schema (already had rdma + rdma_group properties from
  the prior commit).
- testdata/sdl/input/v2.0/gpu-rdma-group/: new fixture exercising the
  end-to-end path. Each service uses its own profile to sidestep an
  orthogonal Go-vs-TS resource-ID-assignment parity gap.

Rollout note: providers running pre-skip chain-sdk versions will reject
orders carrying the new rdma_group attribute (ParseGPUAttributes returns
"invalid GPU attribute"). The provider PR already lands the skip; any
provider wanting to bid on RDMA orders must upgrade past it.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
go/sdl/gpu.go (1)

156-160: ⚡ Quick win

Update stale sentinel comments to match current rdma_group flow.

Line 156 and Line 209 still describe a sentinel that gets stripped, but the code now keeps rdma_group in on-chain GPU attributes. Please align comments with actual behavior to avoid future mis-implementation.

Suggested comment-only diff
-			// gpu.attributes.rdma_group: string (peer group name). Captured
-			// here and emitted into the slice as a sentinel attribute that
-			// v2ResourceGPU.UnmarshalYAML strips before it reaches chain
-			// state. See gpuAttributeRDMAGroupSentinel.
+			// gpu.attributes.rdma_group: string (peer group name). Captured
+			// here and emitted directly as an on-chain GPU attribute
+			// (GPUAttributeRDMAGroup). v2ResourceGPU.UnmarshalYAML also lifts
+			// the same value into v2ResourceGPU.RDMAGroup for manifest routing.

-	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
-	// rdma_group sentinel can be stripped from the slice before the
-	// attribute-key regex runs against it.
+	// Validate() is deferred to v2ResourceGPU.UnmarshalYAML so the
+	// final assembled attribute slice is validated once in one place.

Also applies to: 209-211

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/sdl/gpu.go` around lines 156 - 160, The comments near the GPU YAML decode
(around the node.Content[i+1].Decode(&rdmaGroup) block) and the comment
referencing gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are
stale: they state rdma_group is emitted as a sentinel that
v2ResourceGPU.UnmarshalYAML strips, but the code now preserves rdma_group in
on-chain GPU attributes. Update those comments to reflect the current flow
(rdma_group is decoded and retained in the GPU attributes on-chain) and remove
or reword any mention of a sentinel being stripped; keep references to
gpuAttributeRDMAGroupSentinel and v2ResourceGPU.UnmarshalYAML only if describing
their actual behavior in the current implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ts/src/sdl/manifest/manifestUtils.ts`:
- Around line 78-87: The code currently pushes attributes.rdma_group into the
on-chain GPU attributes (via the result.push({ key: "rdma_group", ... }) call),
but rdma_group must remain off-chain; remove the block that emits rdma_group
into Resources.GPU.Attributes in manifestUtils (i.e., delete or disable the if
(attributes.rdma_group && attributes.rdma_group.length > 0) result.push(...)
branch) and keep only the rdma boolean emission (the if (attributes.rdma ===
true) result.push({ key: "rdma", value: "true" }) branch); ensure any callers or
type definitions still treat attributes.rdma_group as an off-chain-only field
and do not rely on it being serialized into the result array.

---

Nitpick comments:
In `@go/sdl/gpu.go`:
- Around line 156-160: The comments near the GPU YAML decode (around the
node.Content[i+1].Decode(&rdmaGroup) block) and the comment referencing
gpuAttributeRDMAGroupSentinel / v2ResourceGPU.UnmarshalYAML are stale: they
state rdma_group is emitted as a sentinel that v2ResourceGPU.UnmarshalYAML
strips, but the code now preserves rdma_group in on-chain GPU attributes. Update
those comments to reflect the current flow (rdma_group is decoded and retained
in the GPU attributes on-chain) and remove or reword any mention of a sentinel
being stripped; keep references to gpuAttributeRDMAGroupSentinel and
v2ResourceGPU.UnmarshalYAML only if describing their actual behavior in the
current implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a3e93caa-5b29-4589-9819-2890ceac7099

📥 Commits

Reviewing files that changed from the base of the PR and between 060c02f and c1bbce4.

⛔ Files ignored due to path filters (2)
  • go/manifest/v2beta3/service.pb.go is excluded by !**/*.pb.go
  • ts/src/generated/protos/akash/manifest/v2beta3/service.ts is excluded by !**/generated/**
📒 Files selected for processing (15)
  • go/inventory/v1/resourcepair.go
  • go/inventory/v1/resourcepair_zero_test.go
  • go/sdl/gpu.go
  • go/sdl/rdma_gpu_test.go
  • go/sdl/sdl-input.schema.yaml
  • go/sdl/v2.go
  • go/sdl/v2_1.go
  • proto/provider/akash/manifest/v2beta3/service.proto
  • testdata/sdl/input/v2.0/gpu-rdma-group/input.yaml
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/manifest.json
  • ts/src/sdl/manifest/generateManifest.ts
  • ts/src/sdl/manifest/generateManifestVersion.ts
  • ts/src/sdl/manifest/manifestUtils.ts
  • ts/src/sdl/validateSDL/validateSDLInput.ts
✅ Files skipped from review due to trivial changes (1)
  • testdata/sdl/output-fixtures/v2.0/gpu-rdma-group/group-specs.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • go/inventory/v1/resourcepair.go
  • go/sdl/v2.go

Comment thread ts/src/sdl/manifest/manifestUtils.ts
TS validateSDL had no RDMA semantic checks; tenants using the TS SDK
could broadcast SDLs that the Go parser would reject outright (chain
doesn't validate SDL semantics, so the failure would surface later as
broken pods).

Adds a #validateRDMA method on SDLValidator, called from validate()
after the per-service loop. Mirrors the Go-side validateRDMA in
go/sdl/v2.go and enforces the same three cross-field rules:

  1. A profile with gpu.attributes.rdma=true must be deployed under a
     placement whose attributes require capabilities/rdma=true.
  2. A profile with gpu.attributes.rdma_group set must also have
     gpu.attributes.rdma=true on the same profile.
  3. Within one deployment (placement), if any profile sets
     rdma_group, every rdma=true profile under that placement must
     also set rdma_group — no implicit-default-plus-explicit mixing.

The units==0 + rdma/rdma_group case is already rejected by the
schema-level gpuAttributesRequireUnitsGt0 rule (any attribute requires
units > 0), so no semantic check needed there. Pinned by two new tests
that assert the schema path, so a future schema relaxation can't
silently reopen the hole.

Five new tests cover the rejection paths plus a happy path. Existing
SDL parity tests stay at 34/34.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants