feat: add network segmentation#1754
Draft
micpapal wants to merge 90 commits into
Signed-off-by: Mauro Sardara <msardara@cisco.com>
…onization

Implement Phase 3 of distributed SLIM replicas: the PeerSyncManager that handles peer-to-peer subscription synchronization within a deployment.

Key components:

- PeerSyncManager (peer_sync/manager.rs): Main event loop that coordinates peer discovery events with subscription forwarding. Uses deterministic tie-breaking (lower peer ID dials) to ensure exactly one connection per peer pair.
- PeerState (peer_sync/state.rs): Tracks connected peers with bidirectional lookup (peer_id ↔ conn_id). Handles both incoming and outgoing connections.
- Sync protocol (peer_sync/sync.rs): Builds Subscribe/Unsubscribe messages for peers. Supports full sync (snapshot of all local subscriptions on join) and incremental broadcast (aggregate transitions only).
- Subscription event broadcast: MessageProcessor emits SubscriptionEvent on aggregate local transitions (0→1 first subscriber, 1→0 last unsubscriber). PeerSyncManager subscribes to these events and forwards to all peers. On broadcast lag, triggers full resync to recover consistency.
- ConnectionType-aware connections: register_remote_connection now reads ClientConfig.connection_type to use ConnType::Peer for peer connections, ensuring correct routing (EXCLUDE_PEER filter for 1-hop rule).

Signed-off-by: Mauro Sardara <msardara@cisco.com>
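The tie-breaking rule above (the lower peer ID dials) can be sketched as a single comparison. This is only an illustration: `should_dial` is a hypothetical name, and comparing IDs as plain strings is an assumption, not something the commit message states.

```rust
/// Decide whether the local node opens the connection to `remote_id`.
/// Hypothetical helper; assumes peer IDs are ordered as plain strings.
pub fn should_dial(local_id: &str, remote_id: &str) -> bool {
    // Exactly one side of every distinct pair sees `true`, so each pair
    // ends up with a single connection. A node never dials itself.
    local_id < remote_id
}
```

Because both sides evaluate the same comparison on the same two IDs, they agree on who dials without exchanging any extra messages.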
…ntConfig

PeerConfig now has a static_peers: Vec<ClientConfig> field instead of reusing the dataplane.clients section. Each peer entry supports the full ClientConfig (TLS, auth, keepalive, etc.). The connection_type field is forced to Peer regardless of what is configured in the entry. StaticPeerDiscovery gains a from_client_configs() constructor that builds PeerInfo entries from the ClientConfig list, filtering out self.
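A rough sketch of what such a constructor might do, under stated assumptions: the types below are simplified stand-ins (the real ClientConfig carries TLS, auth, keepalive and more), `peers_from_client_configs` is a hypothetical free function, and identifying "self" by endpoint string is a guess.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionType {
    Remote,
    Peer,
}

// Simplified stand-in for the real ClientConfig (assumption).
#[derive(Debug, Clone)]
pub struct ClientConfig {
    pub endpoint: String,
    pub connection_type: ConnectionType,
}

// Simplified stand-in for the real PeerInfo (assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PeerInfo {
    pub endpoint: String,
    pub connection_type: ConnectionType,
}

/// Build peer entries from the static list, skipping entries that point at
/// this node and forcing the connection type to Peer.
pub fn peers_from_client_configs(
    static_peers: &[ClientConfig],
    self_endpoints: &[&str],
) -> Vec<PeerInfo> {
    static_peers
        .iter()
        .filter(|cfg| !self_endpoints.iter().any(|e| *e == cfg.endpoint))
        .map(|cfg| PeerInfo {
            endpoint: cfg.endpoint.clone(),
            // Whatever the entry configured is ignored here.
            connection_type: ConnectionType::Peer,
        })
        .collect()
}
```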
- Add connection_type field to LinkNegotiationPayload proto
- Server-side peer detection: upgrade connection type on negotiation
- Add IncomingPeerEvent channel from MessageProcessor to PeerSyncManager
- PeerSyncManager handles incoming peers with full sync
- Restore tie-breaking: only smaller-ID node dials out
- Refactor node_id from Option<String> to String with UUID default
- Remove self_id from PeerConfig; derive identity from server endpoints
- Add ConnectionTable::update() for post-creation type mutation
- Fix bindings tests for new node_id semantics
- Add PeerTopology enum (FullMesh default, HubAndSpoke) to PeerConfig
- In HubAndSpoke: only hub (smallest lexicographic ID) dials out
- Hub relays subscriptions between spokes via PeerRelayEvent channel
- Hub uses MatchFilter::ALL for publish routing (relays data between spokes)
- Hub full sync includes peer-learned subscriptions for new spokes
- Add set_peer_hub() / set_peer_relay_channel() to MessageProcessor
- Emit PeerRelayEvent when subscriptions arrive on peer connections
- Add topology deserialization tests
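The dial rules for the two topologies could look roughly like the following. The enum variants come from the commit message; `hub_id` and `dials_out` are hypothetical helpers, and passing the full ID list as a slice is an assumption made for the sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum PeerTopology {
    #[default]
    FullMesh,
    HubAndSpoke,
}

/// The hub is the node with the smallest lexicographic ID.
pub fn hub_id<'a>(all_ids: &[&'a str]) -> Option<&'a str> {
    all_ids.iter().copied().min()
}

/// Whether `local_id` should open a connection to `remote_id`.
pub fn dials_out(
    topology: PeerTopology,
    local_id: &str,
    remote_id: &str,
    all_ids: &[&str],
) -> bool {
    match topology {
        // Full mesh: the smaller ID of each pair dials.
        PeerTopology::FullMesh => local_id < remote_id,
        // Hub and spoke: only the hub dials, and it dials every other node.
        PeerTopology::HubAndSpoke => {
            hub_id(all_ids) == Some(local_id) && local_id != remote_id
        }
    }
}
```

With three nodes this yields two connections at the hub and one at each spoke, which matches the counts checked by test_hub_spoke_topology_connections.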
- test_hub_spoke_topology_connections: verifies hub has 2 conns, spokes have 1
- test_hub_spoke_subscription_relay: subscription propagates spoke→hub→spoke
- test_hub_spoke_message_delivery: publish routes spoke_a→hub→spoke_b
…ests

Refactor integration tests to use the higher-level create_app + subscribe API for subscription-only tests, keeping raw register_local_connection for message delivery tests where we need to capture received messages directly.

- test_subscription_propagation_to_peers: uses create_app + subscribe
- test_subscription_not_propagated_to_remote: uses create_app + subscribe
- test_hub_spoke_subscription_relay: uses create_app + subscribe
- Message delivery tests retain raw connections for verifiable receipt
- Clean up imports (add ApplicationPayload, Message at top level)

Signed-off-by: Mauro Sardara <msardara@cisco.com>
… node-

Instead of using the link_id (which is identical on both sides of a connection) as the peer identifier in IncomingPeerEvent, exchange the node_id during link negotiation. This allows the server side to know exactly which peer connected.

Changes:

- Add node_id field (tag 6) to LinkNegotiationPayload proto
- Include service_id (node identity) in both client request and server reply
- Change IncomingPeerEvent from link_id to node_id
- PeerSyncManager uses the remote node_id for peer state registration
- Default node_id now has 'node-' prefix (e.g. node-<uuid>) for clarity
Replace lazily-initialized unbounded channels with bounded channels (capacity 64) created eagerly in MessageProcessor::new_internal(). Senders are stored directly (no Option/Mutex on the send path), and receivers are taken once via take_incoming_peer_rx() / take_peer_relay_rx(). This eliminates lock contention on every send and provides backpressure.
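The pattern described here (bounded channel created eagerly, sender stored directly, receiver handed out once) can be sketched with the standard library. This is an assumption-laden stand-in: the real code most likely uses an async channel, and `PeerChannels` and `take_rx` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Mutex;

/// Hypothetical holder for one bounded event channel.
pub struct PeerChannels<T> {
    // Stored directly: the send path takes no lock and unwraps no Option.
    tx: SyncSender<T>,
    // The receiver is handed out once; the Mutex guards only that hand-off.
    rx: Mutex<Option<Receiver<T>>>,
}

impl<T> PeerChannels<T> {
    /// Create the channel eagerly with a fixed capacity.
    pub fn new(capacity: usize) -> Self {
        let (tx, rx) = sync_channel(capacity);
        Self { tx, rx: Mutex::new(Some(rx)) }
    }

    /// Non-blocking send; a full buffer is reported to the caller,
    /// which is where backpressure becomes visible.
    pub fn try_send(&self, event: T) -> Result<(), TrySendError<T>> {
        self.tx.try_send(event)
    }

    /// Returns the receiver on the first call and None afterwards.
    pub fn take_rx(&self) -> Option<Receiver<T>> {
        self.rx.lock().unwrap().take()
    }
}
```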
…tion events

The subscription_event channel had a single consumer (PeerSyncManager), so broadcast was unnecessary overhead. Converted to a bounded mpsc channel (capacity 1024) created at construction, consistent with the other peer sync channels. Removed full_resync_all_peers() which was only reachable from the now-removed broadcast Lagged error handler.
…verification

- Include peer_group in both client and server negotiation messages
- Server rejects peer upgrade when peer_group is configured but doesn't match
- Pass peer_group from PeerConfig through to MessageProcessor constructor
- Update all test payloads with peer_group field
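A minimal sketch of the server-side check, assuming the group is an optional string on both sides. `accept_peer_upgrade` is a hypothetical name, and accepting any peer when no group is configured is an assumption read from the phrase "when peer_group is configured".

```rust
/// Decide whether an incoming connection may be upgraded to a peer connection.
/// `configured` is the server's own peer_group; `offered` is the value
/// carried in the client's negotiation message.
pub fn accept_peer_upgrade(configured: Option<&str>, offered: Option<&str>) -> bool {
    match configured {
        // Nothing configured locally: nothing to verify (assumption).
        None => true,
        // Configured: the client must present exactly the same group.
        Some(expected) => offered == Some(expected),
    }
}
```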
…transitions

The subscription table's add/remove methods now return a bool indicating whether a 0→1 (first subscriber) or 1→0 (last subscriber gone) transition occurred for the given (name, category) pair. This eliminates the need for a separate local_sub_counts HashMap in MessageProcessorInternal.
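The transition-reporting behaviour can be illustrated with a small counting table. This is a sketch only: the real subscription table tracks connections rather than bare counts, and the struct and key types here are assumptions.

```rust
use std::collections::HashMap;

/// Toy subscription table keyed by (name, category).
#[derive(Default)]
pub struct SubscriptionTable {
    counts: HashMap<(String, String), usize>,
}

impl SubscriptionTable {
    /// Returns true only on the 0→1 transition (first subscriber).
    pub fn add(&mut self, name: &str, category: &str) -> bool {
        let count = self
            .counts
            .entry((name.to_string(), category.to_string()))
            .or_insert(0);
        *count += 1;
        *count == 1
    }

    /// Returns true only on the 1→0 transition (last subscriber gone).
    /// Removing an unknown subscription is a no-op that returns false.
    pub fn remove(&mut self, name: &str, category: &str) -> bool {
        let key = (name.to_string(), category.to_string());
        let remaining = match self.counts.get_mut(&key) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return false,
        };
        if remaining == 0 {
            self.counts.remove(&key);
            true
        } else {
            false
        }
    }
}
```

A caller can then forward a Subscribe or Unsubscribe to peers only when the returned bool is true, without keeping a second counter alongside the table.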
The field better reflects its semantic meaning — it identifies which deployment a set of replicas belongs to, not just a peer grouping. Signed-off-by: Mauro Sardara <msardara@cisco.com>
Unsubscribes bypass the has_seen_sub_id loop check because they must be relayed to cancel the prior subscribe. Loops are bounded by TTL and remove operations are idempotent, so no state corruption occurs.
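The relay decision described above might be sketched as follows. All names are hypothetical, and the `u64` subscription ID and `u8` TTL are assumptions; the only parts taken from the commit message are that subscribes go through a seen-ID check, unsubscribes skip it, and TTL bounds any loop.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubKind {
    Subscribe,
    Unsubscribe,
}

/// Hypothetical relay filter holding the subscription IDs already seen.
#[derive(Default)]
pub struct RelayFilter {
    seen_sub_ids: HashSet<u64>,
}

impl RelayFilter {
    /// Returns true if the message should be relayed further.
    pub fn should_relay(&mut self, kind: SubKind, sub_id: u64, ttl: u8) -> bool {
        // TTL bounds every loop, including loops of unsubscribes.
        if ttl == 0 {
            return false;
        }
        match kind {
            // Unsubscribes skip the seen-ID check so they can always cancel
            // the earlier subscribe; applying one twice is harmless.
            SubKind::Unsubscribe => true,
            // `insert` returns false when the ID was already present,
            // which drops a looping subscribe.
            SubKind::Subscribe => self.seen_sub_ids.insert(sub_id),
        }
    }
}
```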
Add ServiceConfiguration::with_node_id() builder method. Update control-plane integration tests to set node_id explicitly so the controller sees the expected node identity rather than an auto-generated UUID. Signed-off-by: Mauro Sardara <msardara@cisco.com>
The server-with-controller.yaml config lacked a node_id field, causing the service to use an auto-generated UUID. The Go integration test expects deterministic node names (slim/node-a, slim/node-b), so add a placeholder node_id that gets replaced per instance.
The server-a-config-cp.yaml and server-b-config-cp.yaml configs lacked a node_id field, causing data-plane nodes to register with the control plane using auto-generated UUIDs. Tests that query routes by node name (slimctl controller route list -n "slim/a") then fail because the controller only knows the UUID-based names. Add explicit node_id matching the service key so nodes register with the expected identity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
…tation Signed-off-by: Michele Papalini <micpapal@cisco.com>
…ns' into feat/segmentation Signed-off-by: Michele Papalini <micpapal@cisco.com>
…subscriptions Signed-off-by: Michele Papalini <micpapal@cisco.com>
…ns' into feat/segmentation Signed-off-by: Michele Papalini <micpapal@cisco.com>
# Description

Major control-plane refactor to support routing between deployments with configurable topology. The CP now manages inter-group connectivity only (intra-group is handled by the data plane), uses Shortest Path Tree (SPT) routing for multi-hop route expansion, and supports gateway failover for high availability.

## Key Changes

### Group-Based Routing & Topology

- CP operates on groups (deployments), not individual nodes: one inter-group link per group pair
- Configurable topology via adjacency list
- SPT-based route expansion computes loop-free forwarding trees across groups
- Wildcard route templates (source=*) auto-expand across all reachable groups

### Gateway Selection & Failover

- Gateway node (holds inter-group links) is chosen randomly to distribute load
- On gateway crash: outgoing links reassigned to random sibling; incoming links recreated via claim mechanism
- Single-node group departures: links soft-deleted, rebuilt when new node joins

### Link Claim Mechanism

- Two-phase link creation for ingress/load-balancer scenarios where dest node is unknown
- Source connects to shared endpoint → dest node claims link via ConfigCommand → routes expanded

### Route Lifecycle

- Removed dead restore_route code; apps re-subscribe naturally via wildcard expansion
- expand_all_wildcard_routes triggered after both node_registered and claim_link
- Route cleanup on crash disconnect (not just graceful deregistration)

### Integration Tests

- 13 end-to-end tests covering: link creation/claim, route expansion, subscription routing, gateway failover (source + dest), node crash recovery, wildcard deletion, app disconnect, multicast dedup

### Documentation

- Comprehensive README.md for control-plane architecture

ref: #1728

## Type of Change

- [ ] Bugfix
- [x] New Feature
- [x] Breaking Change
- [x] Refactor
- [x] Documentation
- [ ] Other (please describe)

## Checklist

- [x] I have read the [contributing guidelines](/agntcy/repo-template/blob/main/CONTRIBUTING.md)
- [x] Existing issues have been referenced (where applicable)
- [x] I have verified this change is not present in other open pull requests
- [x] Functionality is documented
- [x] All code style checks pass
- [x] New code contribution is covered by automated tests
- [x] All new and existing tests pass

---------

Signed-off-by: Mauro Sardara <msardara@cisco.com>
Signed-off-by: Michele Papalini <micpapal@cisco.com>
Co-authored-by: Mauro Sardara <msardara@cisco.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
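To make the SPT idea concrete, here is a small sketch that builds a shortest path tree over a group adjacency list with breadth-first search. It is not the control plane's implementation: treating every inter-group link as cost 1, the function names, and the string group IDs are all assumptions.

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

/// For every group reachable from `root`, record its parent on a
/// shortest path tree (unit link cost assumed, so plain BFS suffices).
pub fn shortest_path_tree(
    adjacency: &BTreeMap<String, Vec<String>>,
    root: &str,
) -> HashMap<String, String> {
    let mut parent: HashMap<String, String> = HashMap::new();
    let mut queue = VecDeque::new();
    queue.push_back(root.to_string());
    while let Some(group) = queue.pop_front() {
        for next in adjacency.get(&group).into_iter().flatten() {
            // Each group gets exactly one parent, so the result is loop-free.
            if next.as_str() != root && !parent.contains_key(next) {
                parent.insert(next.clone(), group.clone());
                queue.push_back(next.clone());
            }
        }
    }
    parent
}

/// Expand a route from `root` to `dest` into the list of groups it crosses.
/// Returns None when `dest` is not reachable.
pub fn path_to(
    parent: &HashMap<String, String>,
    root: &str,
    dest: &str,
) -> Option<Vec<String>> {
    let mut path = vec![dest.to_string()];
    let mut current = dest;
    while current != root {
        current = parent.get(current)?.as_str();
        path.push(current.to_string());
    }
    path.reverse();
    Some(path)
}
```

A wildcard template (source=*) could then be expanded by running the same tree construction from each source group and keeping only the destinations that `path_to` can reach.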
Description
ref: #1729
Type of Change
Checklist