feat: add network segmentation #1754

Draft
micpapal wants to merge 90 commits into main from 1729-add-network-segmentation

Conversation

micpapal (Member) commented Jun 18, 2026

Description

ref: #1729

Type of Change

  • Bugfix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

msardara and others added 30 commits June 9, 2026 12:07
Signed-off-by: Mauro Sardara <msardara@cisco.com>
…onization

Implement Phase 3 of distributed SLIM replicas: the PeerSyncManager that
handles peer-to-peer subscription synchronization within a deployment.

Key components:

- PeerSyncManager (peer_sync/manager.rs): Main event loop that coordinates
  peer discovery events with subscription forwarding. Uses deterministic
  tie-breaking (lower peer ID dials) to ensure exactly one connection per
  peer pair.

- PeerState (peer_sync/state.rs): Tracks connected peers with bidirectional
  lookup (peer_id ↔ conn_id). Handles both incoming and outgoing connections.

- Sync protocol (peer_sync/sync.rs): Builds Subscribe/Unsubscribe messages
  for peers. Supports full sync (snapshot of all local subscriptions on join)
  and incremental broadcast (aggregate transitions only).

- Subscription event broadcast: MessageProcessor emits SubscriptionEvent on
  aggregate local transitions (0→1 first subscriber, 1→0 last unsubscriber).
  PeerSyncManager subscribes to these events and forwards to all peers.
  On broadcast lag, triggers full resync to recover consistency.

- ConnectionType-aware connections: register_remote_connection now reads
  ClientConfig.connection_type to use ConnType::Peer for peer connections,
  ensuring correct routing (EXCLUDE_PEER filter for 1-hop rule).
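
The tie-breaking rule and the bidirectional lookup described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `should_dial`, `PeerStateSketch`, and the `ConnId` alias are hypothetical names, and it assumes peer IDs are compared as plain strings.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the connection identifier type.
pub type ConnId = u64;

/// Returns true when the local node is the side that should dial.
/// Because exactly one side of any pair has the lower ID, each pair
/// ends up with exactly one connection.
pub fn should_dial(local_id: &str, remote_id: &str) -> bool {
    local_id < remote_id
}

/// Tracks connected peers with lookup in both directions.
#[derive(Default)]
pub struct PeerStateSketch {
    by_peer: HashMap<String, ConnId>,
    by_conn: HashMap<ConnId, String>,
}

impl PeerStateSketch {
    /// Registers a peer, replacing any stale mapping so both maps stay consistent.
    pub fn insert(&mut self, peer_id: &str, conn_id: ConnId) {
        self.remove_peer(peer_id);
        if let Some(old_peer) = self.by_conn.insert(conn_id, peer_id.to_string()) {
            self.by_peer.remove(&old_peer);
        }
        self.by_peer.insert(peer_id.to_string(), conn_id);
    }

    /// Removes a peer and returns the connection it was using, if any.
    pub fn remove_peer(&mut self, peer_id: &str) -> Option<ConnId> {
        let conn = self.by_peer.remove(peer_id)?;
        self.by_conn.remove(&conn);
        Some(conn)
    }

    pub fn conn_for(&self, peer_id: &str) -> Option<ConnId> {
        self.by_peer.get(peer_id).copied()
    }

    pub fn peer_for(&self, conn_id: ConnId) -> Option<&str> {
        self.by_conn.get(&conn_id).map(String::as_str)
    }
}
```

With this rule, `node-a` dials `node-b` and never the reverse, and a node never dials itself.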

Signed-off-by: Mauro Sardara <msardara@cisco.com>
…ntConfig

PeerConfig now has a static_peers: Vec<ClientConfig> field instead of
reusing the dataplane.clients section. Each peer entry supports the full
ClientConfig (TLS, auth, keepalive, etc.). The connection_type field is
forced to Peer regardless of what is configured in the entry.

StaticPeerDiscovery gains a from_client_configs() constructor that
builds PeerInfo entries from the ClientConfig list, filtering out self.
- Add connection_type field to LinkNegotiationPayload proto
- Server-side peer detection: upgrade connection type on negotiation
- Add IncomingPeerEvent channel from MessageProcessor to PeerSyncManager
- PeerSyncManager handles incoming peers with full sync
- Restore tie-breaking: only smaller-ID node dials out
- Refactor node_id from Option<String> to String with UUID default
- Remove self_id from PeerConfig; derive identity from server endpoints
- Add ConnectionTable::update() for post-creation type mutation
- Fix bindings tests for new node_id semantics
- Add PeerTopology enum (FullMesh default, HubAndSpoke) to PeerConfig
- In HubAndSpoke: only hub (smallest lexicographic ID) dials out
- Hub relays subscriptions between spokes via PeerRelayEvent channel
- Hub uses MatchFilter::ALL for publish routing (relays data between spokes)
- Hub full sync includes peer-learned subscriptions for new spokes
- Add set_peer_hub() / set_peer_relay_channel() to MessageProcessor
- Emit PeerRelayEvent when subscriptions arrive on peer connections
- Add topology deserialization tests
- test_hub_spoke_topology_connections: verifies hub has 2 conns, spokes have 1
- test_hub_spoke_subscription_relay: subscription propagates spoke→hub→spoke
- test_hub_spoke_message_delivery: publish routes spoke_a→hub→spoke_b
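
As a rough sketch of how the two topologies differ in who dials whom: the enum below mirrors the `PeerTopology` variants named above, while `hub_id` and `dial_targets` are hypothetical helpers written for illustration, assuming node IDs are ordered as plain strings.

```rust
/// Mirrors the variants named in the commit message; FullMesh is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum PeerTopology {
    #[default]
    FullMesh,
    HubAndSpoke,
}

/// The hub is the node with the smallest lexicographic ID,
/// considering the local node together with its known peers.
pub fn hub_id<'a>(local_id: &'a str, peers: &[&'a str]) -> &'a str {
    peers
        .iter()
        .copied()
        .fold(local_id, |min, p| if p < min { p } else { min })
}

/// Lists the peers the local node should dial under the given topology.
pub fn dial_targets<'a>(
    topology: PeerTopology,
    local_id: &'a str,
    peers: &[&'a str],
) -> Vec<&'a str> {
    match topology {
        // Full mesh: the lower ID of each pair dials.
        PeerTopology::FullMesh => peers.iter().copied().filter(|p| local_id < *p).collect(),
        // Hub and spoke: only the hub dials; spokes wait for the hub.
        PeerTopology::HubAndSpoke => {
            if hub_id(local_id, peers) == local_id {
                peers.iter().copied().filter(|p| *p != local_id).collect()
            } else {
                Vec::new()
            }
        }
    }
}
```

In a three-node deployment this gives the hub two connections and each spoke one, which is the shape `test_hub_spoke_topology_connections` checks.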
…ests

Refactor integration tests to use the higher-level create_app + subscribe
API for subscription-only tests, keeping raw register_local_connection for
message delivery tests where we need to capture received messages directly.

- test_subscription_propagation_to_peers: uses create_app + subscribe
- test_subscription_not_propagated_to_remote: uses create_app + subscribe
- test_hub_spoke_subscription_relay: uses create_app + subscribe
- Message delivery tests retain raw connections for verifiable receipt
- Clean up imports (add ApplicationPayload, Message at top level)

Signed-off-by: Mauro Sardara <msardara@cisco.com>
… node-

Instead of using the link_id (which is identical on both sides of a
connection) as the peer identifier in IncomingPeerEvent, exchange the
node_id during link negotiation. This allows the server side to know
exactly which peer connected.

Changes:
- Add node_id field (tag 6) to LinkNegotiationPayload proto
- Include service_id (node identity) in both client request and server reply
- Change IncomingPeerEvent from link_id to node_id
- PeerSyncManager uses the remote node_id for peer state registration
- Default node_id now has 'node-' prefix (e.g. node-<uuid>) for clarity
Replace lazily-initialized unbounded channels with bounded channels
(capacity 64) created eagerly in MessageProcessor::new_internal().
Senders are stored directly (no Option/Mutex on the send path), and
receivers are taken once via take_incoming_peer_rx() / take_peer_relay_rx().

This eliminates lock contention on every send and provides backpressure.
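
The pattern can be illustrated with the standard library alone. The sketch below swaps in `std::sync::mpsc::sync_channel` as the bounded channel to stay self-contained; the channel type used in the PR is not shown here, and `ProcessorSketch` and `notify_incoming_peer` are hypothetical names. Only the capacity of 64 and the `take_incoming_peer_rx()` name come from the commit message.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Mutex;

/// Capacity quoted in the commit message.
pub const PEER_CHANNEL_CAPACITY: usize = 64;

pub struct IncomingPeerEvent {
    pub node_id: String,
}

pub struct ProcessorSketch {
    // The sender is stored directly: no Option or Mutex on the send path.
    incoming_peer_tx: SyncSender<IncomingPeerEvent>,
    // The receiver is handed out once; the lock is only touched at startup.
    incoming_peer_rx: Mutex<Option<Receiver<IncomingPeerEvent>>>,
}

impl ProcessorSketch {
    /// Creates the channel eagerly at construction time.
    pub fn new(capacity: usize) -> Self {
        let (tx, rx) = sync_channel(capacity);
        Self {
            incoming_peer_tx: tx,
            incoming_peer_rx: Mutex::new(Some(rx)),
        }
    }

    /// Returns the receiver on the first call and None afterwards.
    pub fn take_incoming_peer_rx(&self) -> Option<Receiver<IncomingPeerEvent>> {
        self.incoming_peer_rx.lock().unwrap().take()
    }

    /// Non-blocking send; a full buffer surfaces as TrySendError::Full,
    /// which is where backpressure becomes visible to the caller.
    pub fn notify_incoming_peer(
        &self,
        node_id: &str,
    ) -> Result<(), TrySendError<IncomingPeerEvent>> {
        self.incoming_peer_tx.try_send(IncomingPeerEvent {
            node_id: node_id.to_string(),
        })
    }
}
```

A caller would construct the processor with `PEER_CHANNEL_CAPACITY`, take the receiver once, and move it into the consumer task.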
…tion events

The subscription_event channel had a single consumer (PeerSyncManager),
so broadcast was unnecessary overhead. Converted to a bounded mpsc
channel (capacity 1024) created at construction, consistent with the
other peer sync channels.

Removed full_resync_all_peers() which was only reachable from the
now-removed broadcast Lagged error handler.
…verification

- Include peer_group in both client and server negotiation messages
- Server rejects peer upgrade when peer_group is configured but doesn't match
- Pass peer_group from PeerConfig through to MessageProcessor constructor
- Update all test payloads with peer_group field
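
A minimal sketch of the server-side check, assuming that a server with no `peer_group` configured accepts any peer (the commit message only states the rejection case). The function name `accept_peer_upgrade` is hypothetical.

```rust
/// Decides whether an incoming connection may be upgraded to a peer connection.
/// `local_group` is the server's configured peer_group, if any;
/// `remote_group` is the value received during link negotiation.
pub fn accept_peer_upgrade(local_group: Option<&str>, remote_group: Option<&str>) -> bool {
    match local_group {
        // Assumption: nothing configured means nothing to verify.
        None => true,
        // Configured: the remote must present exactly the same group.
        Some(expected) => remote_group == Some(expected),
    }
}
```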
…transitions

The subscription table's add/remove methods now return a bool indicating
whether a 0→1 (first subscriber) or 1→0 (last subscriber gone) transition
occurred for the given (name, category) pair. This eliminates the need for
a separate local_sub_counts HashMap in MessageProcessorInternal.
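
The idea of reporting aggregate transitions from the table itself can be sketched as below. `SubscriptionTableSketch` is a hypothetical, simplified stand-in that keeps a plain counter per (name, category) pair; the real table presumably stores more than a count.

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct SubscriptionTableSketch {
    counts: HashMap<(String, String), usize>,
}

impl SubscriptionTableSketch {
    /// Adds a subscriber; returns true only on the 0→1 transition.
    pub fn add(&mut self, name: &str, category: &str) -> bool {
        let count = self
            .counts
            .entry((name.to_string(), category.to_string()))
            .or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Removes a subscriber; returns true only on the 1→0 transition.
    /// Removing an unknown pair is a no-op that returns false.
    pub fn remove(&mut self, name: &str, category: &str) -> bool {
        let key = (name.to_string(), category.to_string());
        match self.counts.get(&key).copied() {
            None => false,
            Some(n) if n <= 1 => {
                self.counts.remove(&key);
                true
            }
            Some(n) => {
                self.counts.insert(key, n - 1);
                false
            }
        }
    }
}
```

The caller forwards a Subscribe to peers only when `add` returns true and an Unsubscribe only when `remove` returns true, so no second counter is needed.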
Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
The field better reflects its semantic meaning — it identifies which
deployment a set of replicas belongs to, not just a peer grouping.

Signed-off-by: Mauro Sardara <msardara@cisco.com>
Unsubscribes bypass the has_seen_sub_id loop check because they must
be relayed to cancel the prior subscribe. Loops are bounded by TTL
and remove operations are idempotent, so no state corruption occurs.
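
One way to picture the relay decision: subscribes are deduplicated by subscription ID, unsubscribes skip that check, and both stop when the TTL runs out. The sketch assumes the TTL is decremented once per relay hop; `RelayFilterSketch`, `Kind`, and `next_ttl` are hypothetical names.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Subscribe,
    Unsubscribe,
}

#[derive(Default)]
pub struct RelayFilterSketch {
    seen_sub_ids: HashSet<u64>,
}

impl RelayFilterSketch {
    /// Returns the TTL to use when relaying, or None when the message is dropped.
    pub fn next_ttl(&mut self, kind: Kind, sub_id: u64, ttl: u8) -> Option<u8> {
        // TTL bounds every loop, including unsubscribe loops.
        if ttl == 0 {
            return None;
        }
        // Only subscribes go through the seen-ID loop check;
        // `insert` returns false when the ID was already recorded.
        if kind == Kind::Subscribe && !self.seen_sub_ids.insert(sub_id) {
            return None;
        }
        Some(ttl - 1)
    }
}
```

A repeated unsubscribe is relayed again rather than dropped, which is safe here because removing an absent subscription changes nothing.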

Signed-off-by: Mauro Sardara <msardara@cisco.com>
Add ServiceConfiguration::with_node_id() builder method. Update
control-plane integration tests to set node_id explicitly so the
controller sees the expected node identity rather than an auto-generated
UUID.

Signed-off-by: Mauro Sardara <msardara@cisco.com>
The server-with-controller.yaml config lacked a node_id field, causing
the service to use an auto-generated UUID. The Go integration test
expects deterministic node names (slim/node-a, slim/node-b), so add a
placeholder node_id that gets replaced per instance.
The server-a-config-cp.yaml and server-b-config-cp.yaml configs lacked
a node_id field, causing data-plane nodes to register with the control
plane using auto-generated UUIDs. Tests that query routes by node name
(slimctl controller route list -n "slim/a") then fail because the
controller only knows the UUID-based names.

Add explicit node_id matching the service key so nodes register with
the expected identity.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
micpapal and others added 26 commits June 11, 2026 16:51
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Mauro Sardara <msardara@cisco.com>
…tation

Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
…ns' into feat/segmentation

Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
…subscriptions

Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
…ns' into feat/segmentation

Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
# Description

Major control-plane refactor to support routing between deployments with
configurable topology. The CP now manages inter-group connectivity only
(intra-group is handled by the data plane), uses Shortest Path Tree
(SPT) routing for multi-hop route expansion, and supports gateway
failover for high availability.

Key Changes

Group-Based Routing & Topology

- CP operates on groups (deployments), not individual nodes — one
inter-group link per group pair
- Configurable topology via adjacency list
- SPT-based route expansion computes loop-free forwarding trees across
groups
- Wildcard route templates (source=*) auto-expand across all reachable
groups
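
To make the SPT idea concrete, here is a small sketch that builds a shortest path tree over a group adjacency list with breadth-first search. It assumes every inter-group link has the same cost, so shortest means fewest hops; the `Topology` alias and both function names are hypothetical and do not come from the control-plane code.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Group name mapped to the groups it is directly linked to.
pub type Topology = BTreeMap<String, Vec<String>>;

/// Builds a shortest path tree rooted at `root`.
/// The result maps each reachable group to its parent on the way back to the root.
pub fn shortest_path_tree(topology: &Topology, root: &str) -> BTreeMap<String, String> {
    let mut parent: BTreeMap<String, String> = BTreeMap::new();
    let mut queue = VecDeque::from([root.to_string()]);
    while let Some(group) = queue.pop_front() {
        for next in topology.get(&group).into_iter().flatten() {
            // The first visit wins, so each group gets exactly one parent
            // and the result cannot contain a loop.
            if next.as_str() != root && !parent.contains_key(next) {
                parent.insert(next.clone(), group.clone());
                queue.push_back(next.clone());
            }
        }
    }
    parent
}

/// Walks the tree from `dest` back to `root` and returns the hop sequence,
/// or None when `dest` is not reachable.
pub fn path_to(parent: &BTreeMap<String, String>, root: &str, dest: &str) -> Option<Vec<String>> {
    let mut path = vec![dest.to_string()];
    let mut cur = dest;
    while cur != root {
        cur = parent.get(cur)?.as_str();
        path.push(cur.to_string());
    }
    path.reverse();
    Some(path)
}
```

For a square topology a-b, a-c, b-d, c-d rooted at `a`, group `d` is attached through exactly one of `b` or `c`, never both, which is what keeps the forwarding tree loop-free.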

Gateway Selection & Failover

- Gateway node (holds inter-group links) is chosen randomly to
distribute load
- On gateway crash: outgoing links reassigned to random sibling;
incoming links recreated via claim mechanism
- Single-node group departures: links soft-deleted, rebuilt when new
node joins
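
The outgoing-link half of the failover rule might look roughly like this. It is a sketch under stated assumptions: `LinkState` and `reassign_outgoing` are hypothetical, and the random choice is injected as a closure so the example needs no external crate. The incoming-link path (recreation via claim) is not modelled.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LinkState {
    /// The link is held by the named gateway node.
    Active { gateway: String },
    /// No node is left in the group; the link waits for a new node to join.
    SoftDeleted,
}

/// Chooses what happens to an outgoing inter-group link when its gateway crashes.
/// `pick` receives the number of candidates and returns an index; a real caller
/// would back it with a random number generator.
pub fn reassign_outgoing(
    crashed: &str,
    members: &[String],
    mut pick: impl FnMut(usize) -> usize,
) -> LinkState {
    let siblings: Vec<&String> = members.iter().filter(|m| m.as_str() != crashed).collect();
    if siblings.is_empty() {
        LinkState::SoftDeleted
    } else {
        // The modulo guards against an out-of-range index from `pick`.
        let idx = pick(siblings.len()) % siblings.len();
        LinkState::Active {
            gateway: siblings[idx].clone(),
        }
    }
}
```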

Link Claim Mechanism

- Two-phase link creation for ingress/load-balancer scenarios where dest
node is unknown
- Source connects to shared endpoint → dest node claims link via
ConfigCommand → routes expanded
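
The two phases can be modelled as a small state machine. This is an illustrative sketch only: `LinkSketch`, `open`, and `claim` are hypothetical names, and the ConfigCommand exchange itself is reduced to a method call.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LinkSketch {
    /// Phase 1: the source has connected to a shared endpoint
    /// (for example an ingress); the destination node is not known yet.
    Pending { source: String, endpoint: String },
    /// Phase 2: a concrete destination node has claimed the link.
    Claimed { source: String, dest: String },
}

impl LinkSketch {
    /// Phase 1: record a link towards a shared endpoint.
    pub fn open(source: &str, endpoint: &str) -> Self {
        LinkSketch::Pending {
            source: source.to_string(),
            endpoint: endpoint.to_string(),
        }
    }

    /// Phase 2: the destination node claims the link.
    /// A link that is already claimed is returned unchanged as an error.
    pub fn claim(self, dest: &str) -> Result<Self, Self> {
        match self {
            LinkSketch::Pending { source, .. } => Ok(LinkSketch::Claimed {
                source,
                dest: dest.to_string(),
            }),
            already_claimed => Err(already_claimed),
        }
    }
}
```

Route expansion would run only after `claim` succeeds, since the destination node is unknown before that point.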

Route Lifecycle

- Removed dead restore_route code — apps re-subscribe naturally via
wildcard expansion
- expand_all_wildcard_routes triggered after both node_registered and
claim_link
- Route cleanup on crash disconnect (not just graceful deregistration)

Integration Tests

- 13 end-to-end tests covering: link creation/claim, route expansion,
subscription routing, gateway failover (source + dest), node crash
recovery, wildcard deletion, app disconnect, multicast dedup

Documentation

- Comprehensive README.md for control-plane architecture

ref: #1728

## Type of Change

- [ ] Bugfix
- [x] New Feature
- [x] Breaking Change
- [x] Refactor
- [x] Documentation
- [ ] Other (please describe)

## Checklist

- [x] I have read the [contributing
guidelines](/agntcy/repo-template/blob/main/CONTRIBUTING.md)
- [x] Existing issues have been referenced (where applicable)
- [x] I have verified this change is not present in other open pull
requests
- [x] Functionality is documented
- [x] All code style checks pass
- [x] New code contribution is covered by automated tests
- [x] All new and existing tests pass

---------

Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Co-authored-by: Mauro Sardara <msardara@cisco.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
github-actions Bot (Contributor) commented Jun 18, 2026

The latest Buf updates on your PR. Results from workflow ci-buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 18, 2026, 3:58 PM |

codecov Bot commented Jun 18, 2026

Codecov Report

❌ Patch coverage is 91.07856% with 67 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ta-plane/control-plane/src/route_service/routes.rs | 79.92% | 51 Missing ⚠️ |
| data-plane/control-plane/src/config.rs | 96.09% | 16 Missing ⚠️ |

micpapal added 2 commits June 18, 2026 17:57
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>