From 0ecffb8732c549869a56ccce3ac749c40933549c Mon Sep 17 00:00:00 2001 From: Yue Gao Date: Tue, 5 May 2026 14:29:02 -0700 Subject: [PATCH 1/4] Create vlan-bvi HLD Signed-off-by: Yue Gao --- docs/vlan-bvi-hld.md | 355 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 355 insertions(+) create mode 100644 docs/vlan-bvi-hld.md diff --git a/docs/vlan-bvi-hld.md b/docs/vlan-bvi-hld.md new file mode 100644 index 0000000..ed1eab0 --- /dev/null +++ b/docs/vlan-bvi-hld.md @@ -0,0 +1,355 @@ +# VLAN Bridge Domain on VPP — High-Level Design + +## 1. Problem Statement + +SONiC on VPP needs to support L2 bridging with L3 SVI (VLAN interfaces) for both +tagged and untagged VLAN members. The standard SONiC data model creates: + +- A VLAN (e.g., Vlan10) +- An SVI with an IP address (e.g., 10.0.0.1/24 on Vlan10) +- Member ports with tagging mode (tagged or untagged) + +VPP's bridge-domain (BD) and BVI (Bridge Virtual Interface) constructs map to this +model, but require explicit wiring to connect VPP's data plane to the Linux control +plane (for ARP, DHCP, LLDP, LACP, routing protocols, etc.). + +### Challenges + +1. **BVI ↔ Linux connectivity**: VPP's BVI has no automatic path to the Linux + network stack. Punted traffic (ARP, DHCP) from the BD/BVI must reach the + kernel `Vlan10` interface. + +2. **Tagged member dispatch**: VPP must create dot1q sub-interfaces for tagged + members, strip the tag on ingress to the BD, and push it back on egress. + The host must also see a corresponding VLAN sub-interface. + +3. **Untagged member handling**: The physical interface itself joins the BD + directly. Control protocols (LLDP, LACP, ARP) arriving untagged must be + punted to the host without being consumed by the BD flood path. + +4. **Promiscuous mode**: VPP's virtio/DPDK backend filters tagged frames at + the device level unless promiscuous mode is enabled on the parent interface. + +--- + +## 2. 
SONiC Configuration Example + +```json +// config_db.json excerpts + +// Create VLAN +"VLAN": { + "Vlan10": { + "vlanid": "10" + } +} + +// Assign IP to VLAN interface (SVI) +"VLAN_INTERFACE": { + "Vlan10": {}, + "Vlan10|10.0.0.1/24": {} +} + +// Add members +"VLAN_MEMBER": { + "Vlan10|Ethernet0": { + "tagging_mode": "tagged" + }, + "Vlan10|Ethernet4": { + "tagging_mode": "untagged" + } +} +``` + +CLI equivalents: +```bash +config vlan add 10 +config vlan member add 10 Ethernet0 # tagged by default +config vlan member add -u 10 Ethernet4 # untagged +config interface ip add Vlan10 10.0.0.1/24 +``` + +--- + +## 3. Design Principles + +### 3.1 BVI LCP Pair + +The BVI (`bvi10`) is the L3 endpoint of the bridge domain. An LCP (Linux Control +Plane) pair is created between `bvi10` and a Linux tap device (`tap_Vlan10`). A tc +filter redirects traffic between `tap_Vlan10` and the kernel `Vlan10` netdev. + +This allows: +- **ARP/DHCP punted via BVI** reach the kernel as untagged Ethernet frames +- **Routing decisions** made by the kernel are sent back through the tap → BVI → BD +- The BD floods BUM (Broadcast/Unknown-unicast/Multicast) to all members +- Known unicast is forwarded based on the MAC table + +### 3.2 Frames in the BD Are Always Untagged + +All frames inside the bridge domain are **untagged**: +- Tagged members: VPP sub-interface pops the outer tag on ingress (`pop 1`) + and pushes it back on egress (symmetric VTR) +- Untagged members: frames arrive without a tag and stay untagged +- BVI: exchanges untagged frames with the BD (no VTR on BVI) + +This matches the Linux kernel model where the `Vlan10` SVI sees untagged frames. 
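The symmetric tag handling described in 3.2 can be illustrated with a minimal sketch. It is a byte-level model of "pop 1 on ingress, push on egress" for a single 802.1Q tag, not VPP's VTR implementation; the function names `push_dot1q` and `pop_dot1q` are hypothetical and exist only for this illustration.

```python
import struct
from typing import Tuple

TPID_DOT1Q = 0x8100  # 802.1Q tag protocol identifier


def push_dot1q(frame: bytes, vlan_id: int, pcp: int = 0) -> bytes:
    """Insert one 802.1Q tag after the destination and source MACs (egress side)."""
    if not 0 < vlan_id < 4095:
        raise ValueError("VLAN ID must be in 1..4094")
    tci = (pcp << 13) | vlan_id  # priority bits, DEI left at 0, 12-bit VLAN ID
    return frame[:12] + struct.pack("!HH", TPID_DOT1Q, tci) + frame[12:]


def pop_dot1q(frame: bytes) -> Tuple[int, bytes]:
    """Remove the outer 802.1Q tag (ingress side); return (vlan_id, untagged frame)."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_DOT1Q:
        raise ValueError("frame does not carry an 802.1Q tag")
    return tci & 0x0FFF, frame[:12] + frame[16:]


# Illustrative ARP frame: broadcast dst MAC, made-up src MAC, ethertype 0x0806.
untagged = bytes.fromhex("ffffffffffff" "020000000001" "0806") + b"\x00" * 28
tagged = push_dot1q(untagged, 10)
```

Popping the tag from `tagged` yields VLAN 10 and the original `untagged` bytes, which is the property the bridge domain relies on: every frame inside the BD has the same untagged layout regardless of which member it arrived on.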
+ +### 3.3 Control Protocol Punt (`lcp-punt-l2-ethertype`) + +#### Problem + +When a physical interface has an LCP pair (e.g., `bobm0 → Ethernet0`) and +operates in **L3 mode**, VPP's existing `linux-cp-punt-xc` node handles +control traffic naturally: packets arriving on `bobm0` are cross-connected +to the host tap `Ethernet0` via the LCP punt path. The kernel receives +LLDP, LACP, ARP, etc. directly. + +However, when the interface is placed into a **bridge domain** as an untagged +member (`set interface l2 bridge bobm0 10`), the interface transitions from +L3 mode to L2 mode. At this point: + +1. VPP's `ip4-input`/`ip6-input` arc is **no longer in the path** — the + interface enters `l2-input` instead. +2. `linux-cp-punt-xc` operates on the **L3 punt path** (ip4-punt, ip6-punt), + not the L2 path. It never sees L2-only protocols. +3. The L2 bridge treats LLDP/LACP destination MACs (`01:80:c2:00:00:0e`, + `01:80:c2:00:00:02`) as regular multicast and **floods them** to all BD + members plus the BVI — but these are link-local protocols that must NOT + be forwarded beyond the directly-connected link. +4. Even if the frame reaches the BVI and gets punted to the host, it arrives + on `Vlan10` (the SVI) instead of `Ethernet0` (the physical port) — the + control plane (lldpd, teamd) cannot associate the protocol message with + the correct physical interface. + +#### Solution: `lcp-punt-l2-ethertype` + +A new VPP feature node is inserted on the **L2 input arc** (between +`l2-input` and `l2-learn`) that matches frames by ethertype: + +``` +l2-input → [lcp-punt-l2-ethertype] → l2-learn → l2-fwd → ... +``` + +When a frame matches a configured ethertype, the node: +- **Punt mode (1)**: Redirects the frame to the interface's LCP host tap + and drops it from the L2 flood path. Used for LLDP and LACP — these must + reach the host but never be flooded to other BD members. 
+- **Punt-copy mode (2)**: Clones the frame — the copy goes to the LCP host + tap while the original continues through the L2 bridge for normal flooding. + Used for ARP — the host needs it for the control plane, but other BD + members also need the broadcast for L2 reachability. + +The punt delivers the frame to the **LCP tap of the ingress interface** +(e.g., `Ethernet0` for untagged member `bobm0`, or `Ethernet0.10` for tagged +sub-interface `bobm0.10`), preserving the correct per-port association that +the control plane expects. + +#### Configuration + +Ethertypes are configured globally at VPP client init: + +| Ethertype | Protocol | Mode | Behavior | +|-----------|----------|------|----------| +| 0x88cc | LLDP | punt (1) | To host only, not flooded | +| 0x8809 | LACP | punt (1) | To host only, not flooded | + +Per-interface enable/disable controls which BD members participate: +- Untagged members: explicitly enabled via `vpp_lcp_punt_l2_ethertype_set()` +- Tagged members (sub-interfaces): inherit from parent's LCP pair + +#### Why Not Use `group_fwd_mask` on Linux Bridge? + +The Linux kernel bridge's `group_fwd_mask` only applies to the **kernel** +bridge (`Bridge` in SONiC). In the VPP data path, bridging happens entirely +in VPP's bridge domain — frames never traverse the kernel bridge for +forwarding decisions. The kernel bridge only sees traffic that has already +been punted/redirected via LCP taps. Therefore, `group_fwd_mask` cannot +solve the VPP-side LLDP/LACP flooding problem. + +### 3.4 Promiscuous Mode on Physical Interfaces + +VPP's virtio/DPDK backend implements `VIRTIO_NET_F_CTRL_VLAN` which filters +tagged frames at the device level. Promiscuous mode is enabled on every +physical interface at LCP pair creation time so that tagged frames pass +through to VPP's `ethernet-input` for sub-interface dispatch. + +### 3.5 Auto Sub-Interface (lcp-auto-subint) + +VPP's linux-cp plugin `lcp-auto-subint` feature is enabled. 
When SAI creates +a VPP sub-interface (e.g., `bobm0.10`), linux-cp automatically creates the +corresponding Linux VLAN device (`Ethernet0.10`) on the host side. This +eliminates the need for explicit `configure_lcp_interface()` calls for +sub-interfaces. + +--- + +## 4. Implementation Changes + +### 4.1 BVI LCP Pair + TC Filter + +**File**: `SwitchVppFdb.cpp` — `vpp_create_bvi_interface()` + +When a VLAN SVI is created (SAI `ROUTER_INTERFACE_TYPE_VLAN`): +1. Create BVI: `create_bvi_interface(mac, vlan_id)` +2. Add BVI to BD: `set_sw_interface_l2_bridge(bvi10, vlan_id, true, BVI)` +3. Enable arp-term on BD: `set_bridge_domain_flags(bd_id, ARP_TERM, true)` +4. Create LCP pair: `configure_lcp_interface("bvi10", "tap_Vlan10", true)` +5. Bring up the tap: `interface_set_state("tap_Vlan10", true)` +6. TC redirect: `add_tc_filter_redirect("tap_Vlan10", "Vlan10")` + +Teardown mirrors creation in reverse order. + +### 4.2 Tagged Member: Sub-Interface + VTR Pop-1 + +**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (TAGGED path) + +1. Create VPP sub-interface: `create_sub_interface(bobm0, 10, 10)` + - `lcp-auto-subint` automatically creates `Ethernet0.10` on the host +2. Add sub-interface to BD: `set_sw_interface_l2_bridge(bobm0.10, 10, true, NORMAL)` +3. Set VTR pop-1: `set_l2_interface_vlan_tag_rewrite(bobm0.10, 10, ~0, DOT1Q, POP_1)` +4. Admin up: `interface_set_state(bobm0.10, true)` + +### 4.3 Punt L2 Ethertype for BD Members + +**File**: `SaiVppXlate.c` — `init_vpp_client()` + +Global mode-set at VPP client initialization: +- `vpp_lcp_punt_l2_ethertype_mode_set(0x8809, 1)` — LACP: punt (no flood) +- `vpp_lcp_punt_l2_ethertype_mode_set(0x88cc, 1)` — LLDP: punt (no flood) +- `vpp_lcp_punt_l2_ethertype_mode_set(0x0806, 2)` — ARP: punt-copy (clone + flood) + +**VPP patch**: `0009-linux-cp-punt-l2-ethertype.patch` adds the L2-arc punt +node that intercepts matching ethertypes on interfaces in L2 mode. 
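The difference between punt mode (1) and punt-copy mode (2) can be sketched as a small dispatch table. This is only a model of the behavior described in 3.3 and 4.3, using the ethertype and mode values listed there; the names `PUNT_MODES` and `dispatch` are hypothetical and do not correspond to the VPP patch's API.

```python
from typing import Dict, List

MODE_PUNT = 1       # to host only, removed from the L2 path
MODE_PUNT_COPY = 2  # copy to host, original continues through the bridge

# Ethertype -> mode, mirroring the global mode-set calls in 4.3.
PUNT_MODES: Dict[int, int] = {
    0x8809: MODE_PUNT,       # LACP
    0x88CC: MODE_PUNT,       # LLDP
    0x0806: MODE_PUNT_COPY,  # ARP
}


def dispatch(ethertype: int, punt_enabled: bool = True) -> List[str]:
    """Return where a frame goes: 'host' (LCP tap), 'bridge' (l2-learn/l2-fwd), or both."""
    mode = PUNT_MODES.get(ethertype) if punt_enabled else None
    if mode == MODE_PUNT:
        return ["host"]
    if mode == MODE_PUNT_COPY:
        return ["host", "bridge"]
    return ["bridge"]  # unmatched ethertype, or punt disabled on this interface
```

With this table, an LLDP frame (`0x88cc`) is delivered only to the host tap, an ARP frame (`0x0806`) reaches both the host tap and the bridge, and IPv4 (`0x0800`) stays on the bridge path.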
+ +### 4.4 Untagged Member: Parent Interface in BD + +**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (UNTAGGED path) + +1. Add parent phy directly to BD: `set_sw_interface_l2_bridge(bobm0, 10, true, NORMAL)` +2. Enable per-interface punt: `vpp_lcp_punt_l2_ethertype_set(bobm0, true)` + - Ensures LLDP/LACP/ARP reach the host tap for the parent + +No sub-interface or VTR needed — wire frames are already untagged. + +### 4.5 Promiscuous Mode on Every Physical Interface + +**File**: `SwitchVppHostif.cpp` + +At LCP pair creation for each physical port: +```cpp +configure_lcp_interface(hwif_name, dev, true); +interface_set_promiscuous(hwif_name, true); // <-- added +``` + +**File**: `SaiVppXlate.c` — new `interface_set_promiscuous()` wrapper using +VPP's `sw_interface_set_promisc` API. + +--- + +## 5. Packet Flow + +### 5.1 Ingress: Tagged Member (e.g., Ethernet0 in Vlan10, tagged) + +``` +Wire (802.1Q tag=10) → bobm0 (dpdk-input, promisc on) + → ethernet-input: matches dot1q → dispatches to bobm0.10 + → l2-input (BD 10) + → l2-input-vtr: POP outer tag (frame now untagged in BD) + → l2-learn: learn src MAC on bobm0.10 + → l2-fwd: + ├─ Known unicast → output to destination member + └─ BUM → l2-flood → all BD members + BVI + → BVI (bvi10) → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) +``` + +### 5.2 Ingress: Untagged Member (e.g., Ethernet4 in Vlan10, untagged) + +``` +Wire (no tag) → bobm0 (dpdk-input, promisc on) + → ethernet-input: no tag → parent sw_if_index + → l2-input (BD 10) + → l2-input-vtr: no-op (no rewrite configured) + → l2-learn: learn src MAC on bobm0 + → [LLDP/LACP?] 
→ lcp-punt-l2-ethertype: punt to host (no flood) + → l2-fwd: + ├─ Known unicast → output to destination member + └─ BUM → l2-flood → all BD members + BVI + → BVI (bvi10) → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) +``` + +### 5.3 Egress: Kernel → BD (e.g., ARP reply from Vlan10 SVI) + +``` +Kernel (Vlan10) → tc redirect → tap_Vlan10 + → linux-cp-punt-xc → bvi10 + → l2-input (BD 10, BVI port) + → l2-fwd: + ├─ Known unicast MAC on bobm0.10 (tagged member): + │ → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) + ├─ Known unicast MAC on bobm0 (untagged member): + │ → l2-output → no VTR → bobm0 → wire (untagged) + └─ BUM (broadcast ARP request): + → l2-flood → all members: + → bobm0.10: PUSH tag 10 → wire (tagged) + → bobm0: no VTR → wire (untagged) +``` + +### 5.4 Egress: Unicast Forwarding Between Members + +``` +bobm0.10 (ingress, tagged member) → l2-input → POP tag + → l2-learn + l2-fwd + → destination MAC known on bobm0 (untagged member): + → l2-output → no VTR → bobm0 → wire (untagged) + +bobm0 (ingress, untagged member) → l2-input → no VTR + → l2-learn + l2-fwd + → destination MAC known on bobm0.10 (tagged member): + → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) +``` + +--- + +## 6. VPP Configuration Summary (Runtime State) + +After SONiC applies the configuration above, VPP state looks like: + +``` +# Bridge domain 10 with BVI, tagged member, untagged member +vppctl show bridge-domain 10 detail + BD-ID 10, flood, learn, arp-term + bvi10 (BVI, sw_if_index 25) + bobm0.10 (tagged member, sw_if_index 23, vtr pop-1) + bobm0 (untagged member, sw_if_index 1) + +# LCP pairs +vppctl show lcp + bvi10 → tap_Vlan10 + bobm0 → Ethernet0 (physical) + bobm0.10 → Ethernet0.10 (auto-subint) + +# Promiscuous mode +vppctl show interface bobm0 + flags: ... promisc ... + +# Punt L2 ethertypes (global) + LACP (0x8809): punt + LLDP (0x88cc): punt +``` + +--- + +## 7. 
Files Modified + +| File | Change | +|------|--------| +| `platform/vpp/docker-syncd-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | +| `platform/vpp/docker-sonic-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | +| `platform/vpp/vppbld/patches/0009-linux-cp-punt-l2-ethertype.patch` | VPP plugin: L2 punt by ethertype | +| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.c` | `interface_set_promiscuous()`, punt mode-set API, sub-if cache | +| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.h` | Extern declarations | +| `src/sonic-sairedis/vslib/vpp/SwitchVppFdb.cpp` | BVI LCP pair, tagged/untagged member handling, delete fix | +| `src/sonic-sairedis/vslib/vpp/SwitchVppRif.cpp` | SUB_PORT RIF: rely on auto-subint | +| `src/sonic-sairedis/vslib/vpp/SwitchVppHostif.cpp` | Promisc on every phy at LCP creation | +| `src/sonic-sairedis/vslib/vpp/SwitchVpp.h` | `m_bvi_vlan_lcp_map` member | From 8a4f06eb3cbed3fdf043250f6269dfc476f74caa Mon Sep 17 00:00:00 2001 From: Yue Gao Date: Wed, 6 May 2026 14:45:38 -0700 Subject: [PATCH 2/4] Update vlan-bvi-hld 1. Move LLDP/LACP punt via punt-l2-ether-type to a separate HLD 2. Add packet flow for unicast packet to and from sonic Signed-off-by: Yue Gao --- docs/HLD/vlan-bvi-hld.md | 387 +++++++++++++++++++++++++++++++++++++++ docs/vlan-bvi-hld.md | 355 ----------------------------------- 2 files changed, 387 insertions(+), 355 deletions(-) create mode 100644 docs/HLD/vlan-bvi-hld.md delete mode 100644 docs/vlan-bvi-hld.md diff --git a/docs/HLD/vlan-bvi-hld.md b/docs/HLD/vlan-bvi-hld.md new file mode 100644 index 0000000..f7f75aa --- /dev/null +++ b/docs/HLD/vlan-bvi-hld.md @@ -0,0 +1,387 @@ +# VLAN Bridge Domain with BVI — High-Level Design + +## 1. Problem Statement + +SONiC on VPP needs to support L2 bridging with L3 SVI (VLAN interfaces) for both +tagged and untagged VLAN members. 
The standard SONiC data model creates: + +- A VLAN (e.g., Vlan10) +- An SVI with an IP address (e.g., 10.0.0.1/24 on Vlan10) +- Member ports with tagging mode (tagged or untagged) + +VPP's bridge-domain (BD) and BVI (Bridge Virtual Interface) constructs map to this +model, but require explicit wiring to connect VPP's data plane to the Linux control +plane for ARP resolution and IP forwarding. + +### Challenges + +1. **BVI ↔ Linux connectivity**: VPP's BVI has no automatic path to the Linux + network stack. ARP requests/replies and IP packets destined to the SVI must + reach the kernel `Vlan10` interface. + +2. **ARP resolution**: Hosts in the BD send ARP requests (broadcast) to resolve + the gateway (BVI IP). The BD must flood these to the BVI, and the BVI must + punt them to the kernel so that its ARP stack can reply. + +3. **Tagged member dispatch**: VPP must create dot1q sub-interfaces for tagged + members, strip the tag on ingress to the BD, and push it back on egress. + +4. **Untagged member handling**: The physical interface itself joins the BD + directly; no sub-interface or VLAN tag rewrite is needed. + +5. **Promiscuous mode**: VPP's virtio/DPDK backend filters tagged frames at + the device level unless promiscuous mode is enabled on the parent interface. + +--- + +## 2. SONiC Configuration Example + +```json +// config_db.json excerpts + +"VLAN": { + "Vlan10": { "vlanid": "10" } +} + +"VLAN_INTERFACE": { + "Vlan10": {}, + "Vlan10|10.0.0.1/24": {} +} + +"VLAN_MEMBER": { + "Vlan10|Ethernet0": { "tagging_mode": "tagged" }, + "Vlan10|Ethernet4": { "tagging_mode": "untagged" } +} +``` + +CLI equivalents: +```bash +config vlan add 10 +config vlan member add 10 Ethernet0 # tagged by default +config vlan member add -u 10 Ethernet4 # untagged +config interface ip add Vlan10 10.0.0.1/24 +``` + +--- + +## 3. Design Principles + +### 3.1 BVI LCP Pair + +The BVI (`bvi10`) is the L3 endpoint of the bridge domain. 
An LCP (Linux Control +Plane) pair is created between `bvi10` and a Linux tap device (`tap_Vlan10`). A tc +filter redirects traffic between `tap_Vlan10` and the kernel `Vlan10` netdev. + +This allows: +- **ARP requests** flooded in the BD reach the BVI, which punts them to the kernel +- **ARP replies** from the kernel travel back through tap → BVI → BD → member +- **IP packets** destined to the SVI IP are routed to `ip4-local` and punted +- **Routed traffic** from the kernel exits through the BVI into the BD + +### 3.2 Frames in the BD Are Always Untagged + +All frames inside the bridge domain are **untagged**: +- Tagged members: VPP sub-interface pops the outer tag on ingress (`pop 1`) + and pushes it back on egress (symmetric VTR) +- Untagged members: frames arrive without a tag and stay untagged +- BVI: exchanges untagged frames with the BD (no VTR on BVI) + +This matches the Linux kernel model where the `Vlan10` SVI sees untagged frames. + +### 3.3 ARP Handling via BVI + +ARP in a bridge domain with a BVI works as follows: + +**BUM flood to BVI**: ARP requests (broadcast, dst ff:ff:ff:ff:ff:ff) are + flooded to all BD members including the BVI. When the frame reaches the BVI: + - `bvi-input` transitions from L2 to L3 + - The ARP request is punted to the host via + `linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10` + - The kernel's ARP stack generates a reply + +ARP is **not** punted directly from the member interface. The broadcast ARP naturally +floods to the BVI through the bridge domain's normal BUM flooding path. + +### 3.4 Promiscuous Mode on Physical Interfaces + +VPP's virtio/DPDK backend implements `VIRTIO_NET_F_CTRL_VLAN`, which filters +tagged frames at the device level. Promiscuous mode is enabled on every +physical interface at LCP pair creation time so that tagged frames pass +through to VPP's `ethernet-input` for sub-interface dispatch. + +### 3.5 Auto Sub-Interface (lcp-auto-subint) + +VPP's linux-cp plugin `lcp-auto-subint` feature is enabled. 
When SAI creates +a VPP sub-interface (e.g., `bobm0.10`), linux-cp automatically creates the +corresponding Linux VLAN device (`Ethernet0.10`) on the host side. This +eliminates the need for explicit `configure_lcp_interface()` calls for +sub-interfaces. + +--- + +## 4. Implementation Changes + +### 4.1 BVI LCP Pair + TC Filter + +**File**: `SwitchVppFdb.cpp` — `vpp_create_bvi_interface()` + +When a VLAN SVI is created (SAI `ROUTER_INTERFACE_TYPE_VLAN`): +1. Create BVI: `create_bvi_interface(mac, vlan_id)` +2. Add BVI to BD: `set_sw_interface_l2_bridge(bvi10, vlan_id, true, BVI)` +3. Create LCP pair: `configure_lcp_interface("bvi10", "tap_Vlan10", true)` +4. Bring up the tap: `interface_set_state("tap_Vlan10", true)` +5. TC redirect: `add_tc_filter_redirect("tap_Vlan10", "Vlan10")` + +Teardown mirrors creation in reverse order. + +### 4.2 Tagged Member: Sub-Interface + VTR Pop-1 + +**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (TAGGED path) + +1. Create VPP sub-interface: `create_sub_interface(bobm0, 10, 10)` + - `lcp-auto-subint` automatically creates `Ethernet0.10` on the host +2. Add sub-interface to BD: `set_sw_interface_l2_bridge(bobm0.10, 10, true, NORMAL)` +3. Set VTR pop-1: `set_l2_interface_vlan_tag_rewrite(bobm0.10, 10, ~0, DOT1Q, POP_1)` +4. Admin up: `interface_set_state(bobm0.10, true)` + +### 4.3 Untagged Member: Parent Interface in BD + +**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (UNTAGGED path) + +1. Add parent phy directly to BD: `set_sw_interface_l2_bridge(bobm0, 10, true, NORMAL)` + +No sub-interface or VTR needed — wire frames are already untagged. + +### 4.4 Promiscuous Mode on Every Physical Interface + +**File**: `SwitchVppHostif.cpp` + +At LCP pair creation for each physical port: +```cpp +configure_lcp_interface(hwif_name, dev, true); +interface_set_promiscuous(hwif_name, true); // <-- added +``` + +**File**: `SaiVppXlate.c` — new `interface_set_promiscuous()` wrapper using +VPP's `sw_interface_set_promisc` API. 
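Section 4.1 states that teardown mirrors creation in reverse order. A minimal sketch of that pattern follows, assuming each bring-up step has a matching undo action. The helper names `run_with_rollback` and `make_step` are hypothetical, and the step labels are plain strings chosen for illustration, not calls into the SAI VPP code.

```python
from typing import Callable, List, Tuple

# A step is (label, do, undo); labels are used only for readability.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> List[Callable[[], None]]:
    """Run every 'do' in order and return the undo actions in teardown order.

    If a step raises, the steps that already completed are undone in reverse
    order and the exception is re-raised.
    """
    undo_stack: List[Callable[[], None]] = []
    try:
        for _label, do, undo in steps:
            do()
            undo_stack.append(undo)
    except Exception:
        for undo in reversed(undo_stack):
            undo()
        raise
    return list(reversed(undo_stack))


def make_step(label: str, log: List[str]) -> Step:
    """Build a step that only records what would have happened."""
    return (label,
            lambda: log.append("do:" + label),
            lambda: log.append("undo:" + label))


log: List[str] = []
labels = ["create_bvi", "add_bvi_to_bd", "create_lcp_pair", "tap_up", "tc_redirect"]
teardown = run_with_rollback([make_step(name, log) for name in labels])
for undo in teardown:
    undo()
```

After the loop, `log` holds the five `do:` entries in creation order followed by the five `undo:` entries in the opposite order, so the tc redirect is removed first and the BVI is deleted last.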
+ +--- + +## 5. Packet Flow + +### 5.1 ARP Request from Tagged Member + +A host on tagged member Ethernet0 (bobm0.10) sends an ARP request to resolve +the gateway 10.0.0.1 (BVI IP): + +``` +Wire (802.1Q tag=10, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) + → bobm0 (dpdk-input, promisc on) + → ethernet-input: dot1q tag=10 → dispatches to bobm0.10 + → l2-input (BD 10) + → l2-input-vtr: POP tag (frame now untagged in BD) + → l2-learn: learn src MAC on bobm0.10 + → l2-fwd: dst ff:ff:ff:ff:ff:ff → broadcast + → l2-flood: flood to all BD members except source interface + BVI + └─ bvi10 (BVI): + → bvi-input: L2→L3 transition + → linux-cp-punt-xc + → tap_Vlan10 → tc → Vlan10 (kernel) + → kernel ARP stack sends reply +``` + +### 5.2 ARP Request from Untagged Member + +A host on untagged member Ethernet4 (bobm0) sends an ARP request: + +``` +Wire (no tag, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) + → bobm0 (dpdk-input, promisc on) + → ethernet-input: no tag → parent sw_if_index (bobm0) + → l2-input (BD 10) + → l2-input-vtr: no-op + → l2-learn: learn src MAC on bobm0 + → l2-fwd: dst ff:ff:ff:ff:ff:ff → broadcast + → l2-flood: flood to all BD members except source interface + BVI + └─ bvi10 (BVI): + → bvi-input: L2→L3 transition + → ip4-punt → linux-cp-punt-xc + → tap_Vlan10 → tc → Vlan10 (kernel) + → kernel ARP stack sends reply +``` + +### 5.3 ARP Reply from Kernel to Member + +The kernel ARP stack generates a unicast reply on `Vlan10`. 
The reply does +**not** go through VPP's bridge domain; it exits through the Linux bridge +directly to the member's LCP tap: + +``` +Kernel ARP reply (src MAC = BVI MAC, dst MAC = requester's MAC) + → Vlan10 → Linux bridge FDB lookup for dst MAC: + ├─ If learned on Ethernet8: unicast → Ethernet8 tap + └─ If unknown: flood to all bridge member taps + → Ethernet8 tap (tap-input, hw_if_index 18) + → linux-cp-punt-xc: 18 → 2 (bobm1) + → bobm1-output → bobm1-tx → wire (untagged) +``` + +For a **tagged** member: +``` + → Ethernet0.10 tap (tap-input) + → linux-cp-punt-xc → bobm0.10 + → bobm0.10-output → wire (tagged, 802.1Q tag added by sub-if) +``` + +> **Note (Linux bridge FDB flooding issue):** +> +> The Linux bridge does not learn remote MACs on the correct member port +> because all L2 forwarding happens in VPP's bridge domain, so no traffic from +> remote hosts ever arrives on member LCP taps to trigger kernel FDB learning. +> As a result, the first ARP reply and all subsequent kernel-originated unicast +> frames are **flooded** by the Linux bridge to all member taps as unknown +> unicast. This is functionally correct (the requester receives the reply) but +> wasteful. +> +> **Planned fix:** Implement SAI FDB event notifications (`SAI_FDB_EVENT_LEARNED`, +> `SAI_FDB_EVENT_AGED`, `SAI_FDB_EVENT_MOVE`) in VPP SAI. When VPP's `l2-learn` +> adds a MAC to its l2fib, the SAI layer generates a learned event. SONiC's +> fdborch processes the event and fdbsyncd programs the entry into the kernel +> bridge FDB. After that, the Linux bridge forwards unicast frames directly to +> the correct member tap without flooding. + +### 5.4 Unicast Destined to BVI (e.g., ping to Vlan10 SVI IP) + +Once ARP is resolved, the host sends IP packets with dst-MAC = BVI MAC. 
The +L2 forwarding table has learned the BVI MAC on the BVI port: + +``` +Wire (tagged, dst-MAC = BVI MAC, dst-IP = 10.0.0.1) → bobm0 (dpdk-input) + → ethernet-input → bobm0.10 + → l2-input (BD 10) + → l2-input-vtr: POP tag + → l2-learn (src MAC on bobm0.10) + → l2-fwd: dst MAC known on bvi10 (BVI port) + → l2-output → bvi10 + → bvi-input: L2→L3 transition + → ip4-input → ip4-lookup → ip4-local (dest is local SVI IP) + → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) + → kernel processes (ICMP reply, TCP, etc.) +``` + +For an **untagged** member the path is the same except no VTR pop: + +``` +Wire (untagged, dst-MAC = BVI MAC) → bobm0 (dpdk-input) + → ethernet-input → bobm0 (parent sw_if_index) + → l2-input (BD 10) + → l2-learn (src MAC on bobm0) + → l2-fwd: dst MAC known on bvi10 + → l2-output → bvi10 + → bvi-input: L2→L3 transition + → ip4-input → ip4-lookup → ip4-local + → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) +``` + +The key insight: unlike BUM traffic (which reaches the BVI via `l2-flood`), +known-unicast to the BVI MAC reaches it via normal `l2-fwd` lookup. In both +cases, once the packet enters `bvi-input`, it is treated as an L3 packet. + +### 5.5 Routed Traffic with BVI as Next-Hop (Inter-VLAN / External) + +When a host in BD 10 sends IP traffic to a destination on a **different +subnet** (e.g., another VLAN or an external route), the host's default +gateway is the BVI IP (Vlan10 SVI). The host resolves the gateway MAC via +ARP and sends the frame with dst-MAC = BVI MAC. 
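The forwarding decision that sections 5.1 through 5.4 rely on (flood broadcast to every other port including the BVI, deliver known unicast to the single learned port) can be modeled in a few lines. This is a simplified sketch that follows this document's description; it is not VPP's `l2-fwd`/`l2-flood` code, and real VPP may treat unknown unicast toward the BVI differently. The function name `l2_forward`, the MAC addresses, and the port list are illustrative assumptions.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


def is_multicast(mac: str) -> bool:
    """The group bit is the least significant bit of the first octet."""
    return int(mac.split(":")[0], 16) & 1 == 1


def l2_forward(dst_mac: str, ingress: str,
               fib: Dict[str, str], ports: List[str]) -> List[str]:
    """Return the egress ports for a frame received on 'ingress'."""
    dst = dst_mac.lower()
    if not is_multicast(dst) and dst in fib:
        out = fib[dst]
        # Model assumption: never send a frame back out of its ingress port.
        return [] if out == ingress else [out]
    # Broadcast, multicast, or unknown unicast: flood to every other port.
    return [p for p in ports if p != ingress]


# Illustrative bridge domain 10: the BVI plus one tagged and one untagged member.
ports = ["bvi10", "bobm0.10", "bobm1"]
fib = {
    "02:00:00:00:00:01": "bvi10",     # BVI MAC (made-up value)
    "02:00:00:00:00:aa": "bobm0.10",  # host behind the tagged member
}
```

In this model a broadcast ARP request received on `bobm0.10` is flooded to `bvi10` and `bobm1`, while a frame addressed to the BVI MAC is delivered only to `bvi10`, where the L2 to L3 transition described above takes over.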
+ +**Ingress (L2 → BVI → L3 routing):** + +``` +Wire (dst-MAC = BVI MAC, dst-IP = 10.0.20.1) → bobm0.10 + → l2-input (BD 10) → l2-input-vtr: POP tag + → l2-learn → l2-fwd: dst MAC = BVI MAC → bvi10 + → bvi-input: L2→L3 transition + → ip4-input → ip4-lookup: + dst 10.0.20.1 → next-hop via bvi20 (another VLAN SVI) + → or next-hop via bobm2 (L3 routed port) + → or next-hop via default route (upstream) +``` + +**Case A: Inter-VLAN routing (destination in BD 20):** + +``` + → ip4-lookup → next-hop 10.0.20.1 reachable via bvi20 + → ip4-rewrite: rewrite dst-MAC to target host MAC, src-MAC to bvi20 MAC + → bvi-output (bvi20): L3→L2 transition into BD 20 + → l2-input (BD 20, from BVI port) + → l2-fwd: dst MAC known on member in BD 20 + → l2-output → [VTR if tagged member] → wire +``` + +**Case B: Routing to an L3 port (no bridge domain):** + +``` + → ip4-lookup → next-hop via bobm2 (L3 interface) + → ip4-rewrite: rewrite MACs + → interface-output → bobm2 → wire +``` + +### 5.6 L2 Unicast Forwarding Between Members + +``` +bobm0.10 (ingress, tagged member) → l2-input → POP tag + → l2-learn + l2-fwd + → destination MAC known on bobm0 (untagged member): + → l2-output → no VTR → bobm0 → wire (untagged) + +bobm0 (ingress, untagged member) → l2-input → no VTR + → l2-learn + l2-fwd + → destination MAC known on bobm0.10 (tagged member): + → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) +``` + +--- + +## 6. VPP Configuration Summary (Runtime State) + +After SONiC applies the configuration above, VPP state looks like: + +``` +# Bridge domain 10 with BVI, tagged member, untagged member +vppctl show bridge-domain 10 detail + BD-ID 10, flood, learn + bvi10 (BVI, sw_if_index 25) + bobm0.10 (tagged member, sw_if_index 23, vtr pop-1) + bobm0 (untagged member, sw_if_index 1) + +# LCP pairs +vppctl show lcp + bvi10 → tap_Vlan10 + bobm0 → Ethernet0 (physical) + bobm0.10 → Ethernet0.10 (auto-subint) + +# Promiscuous mode +vppctl show interface bobm0 + flags: ... promisc ... 
+``` + +--- + +## 7. Related Documents + +- [VLAN BVI L2 Punt HLD](vlan-bvi-l2-punt-hld.md) — LLDP/LACP punt via + `lcp-punt-l2-ethertype` for bridged members + +--- + +## 8. Files Modified + +| File | Change | +|------|--------| +| `platform/vpp/docker-syncd-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | +| `platform/vpp/docker-sonic-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | +| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.c` | `interface_set_promiscuous()` wrapper | +| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.h` | Extern declarations | +| `src/sonic-sairedis/vslib/vpp/SwitchVppFdb.cpp` | BVI LCP pair, tagged/untagged member handling | +| `src/sonic-sairedis/vslib/vpp/SwitchVppRif.cpp` | SUB_PORT RIF: rely on auto-subint | +| `src/sonic-sairedis/vslib/vpp/SwitchVppHostif.cpp` | Promisc on every phy at LCP creation | +| `src/sonic-sairedis/vslib/vpp/SwitchVpp.h` | `m_bvi_vlan_lcp_map` member | diff --git a/docs/vlan-bvi-hld.md b/docs/vlan-bvi-hld.md deleted file mode 100644 index ed1eab0..0000000 --- a/docs/vlan-bvi-hld.md +++ /dev/null @@ -1,355 +0,0 @@ -# VLAN Bridge Domain on VPP — High-Level Design - -## 1. Problem Statement - -SONiC on VPP needs to support L2 bridging with L3 SVI (VLAN interfaces) for both -tagged and untagged VLAN members. The standard SONiC data model creates: - -- A VLAN (e.g., Vlan10) -- An SVI with an IP address (e.g., 10.0.0.1/24 on Vlan10) -- Member ports with tagging mode (tagged or untagged) - -VPP's bridge-domain (BD) and BVI (Bridge Virtual Interface) constructs map to this -model, but require explicit wiring to connect VPP's data plane to the Linux control -plane (for ARP, DHCP, LLDP, LACP, routing protocols, etc.). - -### Challenges - -1. **BVI ↔ Linux connectivity**: VPP's BVI has no automatic path to the Linux - network stack. Punted traffic (ARP, DHCP) from the BD/BVI must reach the - kernel `Vlan10` interface. - -2. 
**Tagged member dispatch**: VPP must create dot1q sub-interfaces for tagged - members, strip the tag on ingress to the BD, and push it back on egress. - The host must also see a corresponding VLAN sub-interface. - -3. **Untagged member handling**: The physical interface itself joins the BD - directly. Control protocols (LLDP, LACP, ARP) arriving untagged must be - punted to the host without being consumed by the BD flood path. - -4. **Promiscuous mode**: VPP's virtio/DPDK backend filters tagged frames at - the device level unless promiscuous mode is enabled on the parent interface. - ---- - -## 2. SONiC Configuration Example - -```json -// config_db.json excerpts - -// Create VLAN -"VLAN": { - "Vlan10": { - "vlanid": "10" - } -} - -// Assign IP to VLAN interface (SVI) -"VLAN_INTERFACE": { - "Vlan10": {}, - "Vlan10|10.0.0.1/24": {} -} - -// Add members -"VLAN_MEMBER": { - "Vlan10|Ethernet0": { - "tagging_mode": "tagged" - }, - "Vlan10|Ethernet4": { - "tagging_mode": "untagged" - } -} -``` - -CLI equivalents: -```bash -config vlan add 10 -config vlan member add 10 Ethernet0 # tagged by default -config vlan member add -u 10 Ethernet4 # untagged -config interface ip add Vlan10 10.0.0.1/24 -``` - ---- - -## 3. Design Principles - -### 3.1 BVI LCP Pair - -The BVI (`bvi10`) is the L3 endpoint of the bridge domain. An LCP (Linux Control -Plane) pair is created between `bvi10` and a Linux tap device (`tap_Vlan10`). A tc -filter redirects traffic between `tap_Vlan10` and the kernel `Vlan10` netdev. 
- -This allows: -- **ARP/DHCP punted via BVI** reach the kernel as untagged Ethernet frames -- **Routing decisions** made by the kernel are sent back through the tap → BVI → BD -- The BD floods BUM (Broadcast/Unknown-unicast/Multicast) to all members -- Known unicast is forwarded based on the MAC table - -### 3.2 Frames in the BD Are Always Untagged - -All frames inside the bridge domain are **untagged**: -- Tagged members: VPP sub-interface pops the outer tag on ingress (`pop 1`) - and pushes it back on egress (symmetric VTR) -- Untagged members: frames arrive without a tag and stay untagged -- BVI: exchanges untagged frames with the BD (no VTR on BVI) - -This matches the Linux kernel model where the `Vlan10` SVI sees untagged frames. - -### 3.3 Control Protocol Punt (`lcp-punt-l2-ethertype`) - -#### Problem - -When a physical interface has an LCP pair (e.g., `bobm0 → Ethernet0`) and -operates in **L3 mode**, VPP's existing `linux-cp-punt-xc` node handles -control traffic naturally: packets arriving on `bobm0` are cross-connected -to the host tap `Ethernet0` via the LCP punt path. The kernel receives -LLDP, LACP, ARP, etc. directly. - -However, when the interface is placed into a **bridge domain** as an untagged -member (`set interface l2 bridge bobm0 10`), the interface transitions from -L3 mode to L2 mode. At this point: - -1. VPP's `ip4-input`/`ip6-input` arc is **no longer in the path** — the - interface enters `l2-input` instead. -2. `linux-cp-punt-xc` operates on the **L3 punt path** (ip4-punt, ip6-punt), - not the L2 path. It never sees L2-only protocols. -3. The L2 bridge treats LLDP/LACP destination MACs (`01:80:c2:00:00:0e`, - `01:80:c2:00:00:02`) as regular multicast and **floods them** to all BD - members plus the BVI — but these are link-local protocols that must NOT - be forwarded beyond the directly-connected link. -4. 
Even if the frame reaches the BVI and gets punted to the host, it arrives - on `Vlan10` (the SVI) instead of `Ethernet0` (the physical port) — the - control plane (lldpd, teamd) cannot associate the protocol message with - the correct physical interface. - -#### Solution: `lcp-punt-l2-ethertype` - -A new VPP feature node is inserted on the **L2 input arc** (between -`l2-input` and `l2-learn`) that matches frames by ethertype: - -``` -l2-input → [lcp-punt-l2-ethertype] → l2-learn → l2-fwd → ... -``` - -When a frame matches a configured ethertype, the node: -- **Punt mode (1)**: Redirects the frame to the interface's LCP host tap - and drops it from the L2 flood path. Used for LLDP and LACP — these must - reach the host but never be flooded to other BD members. -- **Punt-copy mode (2)**: Clones the frame — the copy goes to the LCP host - tap while the original continues through the L2 bridge for normal flooding. - Used for ARP — the host needs it for the control plane, but other BD - members also need the broadcast for L2 reachability. - -The punt delivers the frame to the **LCP tap of the ingress interface** -(e.g., `Ethernet0` for untagged member `bobm0`, or `Ethernet0.10` for tagged -sub-interface `bobm0.10`), preserving the correct per-port association that -the control plane expects. - -#### Configuration - -Ethertypes are configured globally at VPP client init: - -| Ethertype | Protocol | Mode | Behavior | -|-----------|----------|------|----------| -| 0x88cc | LLDP | punt (1) | To host only, not flooded | -| 0x8809 | LACP | punt (1) | To host only, not flooded | - -Per-interface enable/disable controls which BD members participate: -- Untagged members: explicitly enabled via `vpp_lcp_punt_l2_ethertype_set()` -- Tagged members (sub-interfaces): inherit from parent's LCP pair - -#### Why Not Use `group_fwd_mask` on Linux Bridge? - -The Linux kernel bridge's `group_fwd_mask` only applies to the **kernel** -bridge (`Bridge` in SONiC). 
In the VPP data path, bridging happens entirely -in VPP's bridge domain — frames never traverse the kernel bridge for -forwarding decisions. The kernel bridge only sees traffic that has already -been punted/redirected via LCP taps. Therefore, `group_fwd_mask` cannot -solve the VPP-side LLDP/LACP flooding problem. - -### 3.4 Promiscuous Mode on Physical Interfaces - -VPP's virtio/DPDK backend implements `VIRTIO_NET_F_CTRL_VLAN` which filters -tagged frames at the device level. Promiscuous mode is enabled on every -physical interface at LCP pair creation time so that tagged frames pass -through to VPP's `ethernet-input` for sub-interface dispatch. - -### 3.5 Auto Sub-Interface (lcp-auto-subint) - -VPP's linux-cp plugin `lcp-auto-subint` feature is enabled. When SAI creates -a VPP sub-interface (e.g., `bobm0.10`), linux-cp automatically creates the -corresponding Linux VLAN device (`Ethernet0.10`) on the host side. This -eliminates the need for explicit `configure_lcp_interface()` calls for -sub-interfaces. - ---- - -## 4. Implementation Changes - -### 4.1 BVI LCP Pair + TC Filter - -**File**: `SwitchVppFdb.cpp` — `vpp_create_bvi_interface()` - -When a VLAN SVI is created (SAI `ROUTER_INTERFACE_TYPE_VLAN`): -1. Create BVI: `create_bvi_interface(mac, vlan_id)` -2. Add BVI to BD: `set_sw_interface_l2_bridge(bvi10, vlan_id, true, BVI)` -3. Enable arp-term on BD: `set_bridge_domain_flags(bd_id, ARP_TERM, true)` -4. Create LCP pair: `configure_lcp_interface("bvi10", "tap_Vlan10", true)` -5. Bring up the tap: `interface_set_state("tap_Vlan10", true)` -6. TC redirect: `add_tc_filter_redirect("tap_Vlan10", "Vlan10")` - -Teardown mirrors creation in reverse order. - -### 4.2 Tagged Member: Sub-Interface + VTR Pop-1 - -**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (TAGGED path) - -1. Create VPP sub-interface: `create_sub_interface(bobm0, 10, 10)` - - `lcp-auto-subint` automatically creates `Ethernet0.10` on the host -2. 
Add sub-interface to BD: `set_sw_interface_l2_bridge(bobm0.10, 10, true, NORMAL)` -3. Set VTR pop-1: `set_l2_interface_vlan_tag_rewrite(bobm0.10, 10, ~0, DOT1Q, POP_1)` -4. Admin up: `interface_set_state(bobm0.10, true)` - -### 4.3 Punt L2 Ethertype for BD Members - -**File**: `SaiVppXlate.c` — `init_vpp_client()` - -Global mode-set at VPP client initialization: -- `vpp_lcp_punt_l2_ethertype_mode_set(0x8809, 1)` — LACP: punt (no flood) -- `vpp_lcp_punt_l2_ethertype_mode_set(0x88cc, 1)` — LLDP: punt (no flood) -- `vpp_lcp_punt_l2_ethertype_mode_set(0x0806, 2)` — ARP: punt-copy (clone + flood) - -**VPP patch**: `0009-linux-cp-punt-l2-ethertype.patch` adds the L2-arc punt -node that intercepts matching ethertypes on interfaces in L2 mode. - -### 4.4 Untagged Member: Parent Interface in BD - -**File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (UNTAGGED path) - -1. Add parent phy directly to BD: `set_sw_interface_l2_bridge(bobm0, 10, true, NORMAL)` -2. Enable per-interface punt: `vpp_lcp_punt_l2_ethertype_set(bobm0, true)` - - Ensures LLDP/LACP/ARP reach the host tap for the parent - -No sub-interface or VTR needed — wire frames are already untagged. - -### 4.5 Promiscuous Mode on Every Physical Interface - -**File**: `SwitchVppHostif.cpp` - -At LCP pair creation for each physical port: -```cpp -configure_lcp_interface(hwif_name, dev, true); -interface_set_promiscuous(hwif_name, true); // <-- added -``` - -**File**: `SaiVppXlate.c` — new `interface_set_promiscuous()` wrapper using -VPP's `sw_interface_set_promisc` API. - ---- - -## 5. 
Packet Flow - -### 5.1 Ingress: Tagged Member (e.g., Ethernet0 in Vlan10, tagged) - -``` -Wire (802.1Q tag=10) → bobm0 (dpdk-input, promisc on) - → ethernet-input: matches dot1q → dispatches to bobm0.10 - → l2-input (BD 10) - → l2-input-vtr: POP outer tag (frame now untagged in BD) - → l2-learn: learn src MAC on bobm0.10 - → l2-fwd: - ├─ Known unicast → output to destination member - └─ BUM → l2-flood → all BD members + BVI - → BVI (bvi10) → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) -``` - -### 5.2 Ingress: Untagged Member (e.g., Ethernet4 in Vlan10, untagged) - -``` -Wire (no tag) → bobm0 (dpdk-input, promisc on) - → ethernet-input: no tag → parent sw_if_index - → l2-input (BD 10) - → l2-input-vtr: no-op (no rewrite configured) - → l2-learn: learn src MAC on bobm0 - → [LLDP/LACP?] → lcp-punt-l2-ethertype: punt to host (no flood) - → l2-fwd: - ├─ Known unicast → output to destination member - └─ BUM → l2-flood → all BD members + BVI - → BVI (bvi10) → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) -``` - -### 5.3 Egress: Kernel → BD (e.g., ARP reply from Vlan10 SVI) - -``` -Kernel (Vlan10) → tc redirect → tap_Vlan10 - → linux-cp-punt-xc → bvi10 - → l2-input (BD 10, BVI port) - → l2-fwd: - ├─ Known unicast MAC on bobm0.10 (tagged member): - │ → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) - ├─ Known unicast MAC on bobm0 (untagged member): - │ → l2-output → no VTR → bobm0 → wire (untagged) - └─ BUM (broadcast ARP request): - → l2-flood → all members: - → bobm0.10: PUSH tag 10 → wire (tagged) - → bobm0: no VTR → wire (untagged) -``` - -### 5.4 Egress: Unicast Forwarding Between Members - -``` -bobm0.10 (ingress, tagged member) → l2-input → POP tag - → l2-learn + l2-fwd - → destination MAC known on bobm0 (untagged member): - → l2-output → no VTR → bobm0 → wire (untagged) - -bobm0 (ingress, untagged member) → l2-input → no VTR - → l2-learn + l2-fwd - → destination MAC known on bobm0.10 (tagged member): - → l2-output → l2-output-vtr: 
PUSH tag 10 → bobm0.10 → wire (tagged) -``` - ---- - -## 6. VPP Configuration Summary (Runtime State) - -After SONiC applies the configuration above, VPP state looks like: - -``` -# Bridge domain 10 with BVI, tagged member, untagged member -vppctl show bridge-domain 10 detail - BD-ID 10, flood, learn, arp-term - bvi10 (BVI, sw_if_index 25) - bobm0.10 (tagged member, sw_if_index 23, vtr pop-1) - bobm0 (untagged member, sw_if_index 1) - -# LCP pairs -vppctl show lcp - bvi10 → tap_Vlan10 - bobm0 → Ethernet0 (physical) - bobm0.10 → Ethernet0.10 (auto-subint) - -# Promiscuous mode -vppctl show interface bobm0 - flags: ... promisc ... - -# Punt L2 ethertypes (global) - LACP (0x8809): punt - LLDP (0x88cc): punt -``` - ---- - -## 7. Files Modified - -| File | Change | -|------|--------| -| `platform/vpp/docker-syncd-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | -| `platform/vpp/docker-sonic-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | -| `platform/vpp/vppbld/patches/0009-linux-cp-punt-l2-ethertype.patch` | VPP plugin: L2 punt by ethertype | -| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.c` | `interface_set_promiscuous()`, punt mode-set API, sub-if cache | -| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.h` | Extern declarations | -| `src/sonic-sairedis/vslib/vpp/SwitchVppFdb.cpp` | BVI LCP pair, tagged/untagged member handling, delete fix | -| `src/sonic-sairedis/vslib/vpp/SwitchVppRif.cpp` | SUB_PORT RIF: rely on auto-subint | -| `src/sonic-sairedis/vslib/vpp/SwitchVppHostif.cpp` | Promisc on every phy at LCP creation | -| `src/sonic-sairedis/vslib/vpp/SwitchVpp.h` | `m_bvi_vlan_lcp_map` member | From b553997c3d16dcfcfcc93f24716bc103adf3fd3c Mon Sep 17 00:00:00 2001 From: Yue Gao Date: Tue, 12 May 2026 13:56:39 -0700 Subject: [PATCH 3/4] Update HLD to use classifier Signed-off-by: Yue Gao --- docs/HLD/vlan-bvi-hld.md | 505 +++++++++++++++++++++++++++------------ 1 file changed, 347 insertions(+), 158 deletions(-) diff --git 
a/docs/HLD/vlan-bvi-hld.md b/docs/HLD/vlan-bvi-hld.md index f7f75aa..fdcced6 100644 --- a/docs/HLD/vlan-bvi-hld.md +++ b/docs/HLD/vlan-bvi-hld.md @@ -10,26 +10,39 @@ tagged and untagged VLAN members. The standard SONiC data model creates: - Member ports with tagging mode (tagged or untagged) VPP's bridge-domain (BD) and BVI (Bridge Virtual Interface) constructs map to this -model, but require explicit wiring to connect VPP's data plane to the Linux control -plane for ARP resolution and IP forwarding. +model, but require explicit wiring for two distinct traffic classes: + +1. **Control-plane protocols (ARP, LLDP, LACP, DHCP)** — must be punted directly + from BD member interfaces to the Linux control plane so that per-port services + (LLDP/LACP agents, DHCP relay) receive frames on the correct interface. + +2. **L3 traffic over the VLAN SVI IP** — IP unicast destined to the SVI address + (e.g., ping, SSH, routing protocols) must traverse the BVI and be punted to + the kernel `Vlan10` interface via the BVI LCP pair. ### Challenges 1. **BVI ↔ Linux connectivity**: VPP's BVI has no automatic path to the Linux - network stack. ARP requests/replies and IP packets destined to the SVI must - reach the kernel `Vlan10` interface. + network stack. IP packets destined to the SVI must reach the kernel `Vlan10` + interface. + +2. **Per-port punt for control protocols**: ARP, LLDP, LACP, and DHCP must be + punted directly from the ingress BD member interface so that Linux daemons + (lldpd, teamd, dhcrelay) receive frames on the correct port. Relying on + BVI-path flooding would deliver these frames on `Vlan10` without port context. -2. **ARP resolution**: Hosts in the BD send ARP requests (broadcast) to resolve - the gateway (BVI IP). The BD must flood these to the BVI, and the BVI must - punt them to the kernel for the ARP daemon to reply. +3. **Clone vs. 
consume**: ARP and DHCP require clone-on-hit — a copy is punted + to the control plane while the original continues through the BD for normal + L2 flooding and BVI processing. LLDP and LACP are consumed (punt only, no + clone) because they are link-local and must not be forwarded in the BD. -3. **Tagged member dispatch**: VPP must create dot1q sub-interfaces for tagged +4. **Tagged member dispatch**: VPP must create dot1q sub-interfaces for tagged members, strip the tag on ingress to the BD, and push it back on egress. -4. **Untagged member handling**: The physical interface itself joins the BD +5. **Untagged member handling**: The physical interface itself joins the BD directly — no sub-interface or VLAN tag rewrite needed. -5. **Promiscuous mode**: VPP's virtio/DPDK backend filters tagged frames at +6. **Promiscuous mode**: VPP's virtio/DPDK backend filters tagged frames at the device level unless promiscuous mode is enabled on the parent interface. --- @@ -66,19 +79,156 @@ config interface ip add Vlan10 10.0.0.1/24 ## 3. Design Principles -### 3.1 BVI LCP Pair +### 3.1 BVI LCP Pair (L3 SVI Access) The BVI (`bvi10`) is the L3 endpoint of the bridge domain. An LCP (Linux Control Plane) pair is created between `bvi10` and a Linux tap device (`tap_Vlan10`). A tc filter redirects traffic between `tap_Vlan10` and the kernel `Vlan10` netdev. -This allows: -- **ARP requests** flooded in the BD reach the BVI, which punts to kernel -- **ARP replies** from kernel travel back through tap → BVI → BD → member -- **IP packets** destined to the SVI IP are routed to `ip4-local` and punted -- **Routed traffic** from kernel exits through BVI into the BD +The BVI LCP pair handles **L3 traffic destined to the SVI IP**: +- **IP unicast** to the SVI IP (ping, SSH, routing) reaches the BVI via L2 + forwarding, transitions to L3 via `bvi-input`, and is punted to Vlan10. +- **Routed traffic** from the kernel exits through Vlan10 → tap → BVI → BD. 
+- **Inter-VLAN routing** uses the BVI as the L3 gateway. + +### 3.2 L2 Classifier-Based Punt (Control Protocols) + +Control-plane protocols (ARP, LLDP, LACP, DHCP) are punted **directly from BD +member interfaces** using VPP's `l2-input-classify` feature with the +`linux-cp-punt` node as the target. This ensures per-port delivery to the +Linux control plane. + +Key design decisions: + +- **No new VPP graph nodes** — the implementation uses existing VPP + infrastructure: `l2-input-classify` (built-in L2 feature, bit 18) with + `linux-cp-punt` as hit-next target. + +- **Clone-on-hit** — a VPP patch adds clone support to `l2-input-classify`. + When a classify session has `opaque_index = 1`, the buffer is cloned via + `vlib_buffer_copy()`: the clone is sent to `linux-cp-punt` and the original + continues through the L2 feature chain (VTR → learn → fwd → flood). + +- **Punt vs. clone per protocol**: + - **LLDP/LACP** (`opaque_index=0`): punt/consume — the original is + redirected to `linux-cp-punt` and does not continue in the BD. These are + link-local protocols that must not be L2-forwarded. Note: this only + applies to **untagged** members (see below). + - **ARP/DHCP** (`opaque_index=1`): clone-on-hit — a copy is punted for + the control plane while the original continues through the BD for normal + flooding/forwarding/BVI processing. + +- **LLDP/LACP on tagged members follow the regular LCP path** — + `ethernet-input` dispatches frames by outer ethertype. LLDP (0x88CC) and + LACP (0x8809) are **not** 0x8100, so they are **not** dispatched to the + dot1q sub-interface. Instead, they stay on the parent physical interface + (which is not in the BD) and follow the normal LCP punt path: + `dpdk-input → ethernet-input → linux-cp-punt-xc → tap`. The classifier + on the sub-interface never sees these frames. Therefore, the tagged + classifier tables only need sessions for ARP and DHCP. 
+ +- **Classifier runs before VTR** — `l2-input-classify` (bit 18) has higher + priority than `l2-input-vtr` (bit 11) in the L2 feature bitmap. For tagged + members, the 802.1Q header is still present when the classifier matches. + +### 3.3 Classifier Table Design + +Four shared classify tables are created lazily on the first BD member add and +persist for the lifetime of the process. A single `linux-cp-punt` next-index +is resolved via `vpp_add_node_next("l2-input-classify", "linux-cp-punt")`. + +#### 3.3.1 Untagged Member Tables + +Untagged frames have the ethertype at byte offset 12 (standard Ethernet). + +**Table: untag_other** — matches ethertype directly (skip=0, match=1) + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 12–13 | Ethertype | `0xFFFF` | + +Sessions: + +| Protocol | Ethertype | opaque_index | Action | +|----------|-----------|-------------|--------| +| LLDP | `0x88CC` | 0 | punt (consume) | +| LACP | `0x8809` | 0 | punt (consume) | +| ARP | `0x0806` | 1 | clone + punt | + +Miss: continue L2 feature chain (to VTR, learn, fwd). + +**Table: untag_ip4** — matches IP protocol + UDP dport (skip=1, match=2) + +This table catches DHCP among IPv4 packets (ethertype `0x0800`). + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 23 | IP Protocol | `0xFF` | +| 36–37 | UDP Dest Port | `0xFFFF` | -### 3.2 Frames in the BD Are Always Untagged +Sessions: + +| Protocol | Proto | DPort | opaque_index | Action | +|----------|-------|-------|-------------|--------| +| DHCP | 0x11 (UDP) | 67 (0x0043) | 1 | clone + punt | + +Miss: continue L2 feature chain. + +VPP session match data for skip=1 tables includes 16 bytes of skip padding +at the start: `match_len = (skip + match) × 16 = 48 bytes`. + +#### 3.3.2 Tagged Member Tables + +Tagged frames have the outer ethertype `0x8100` at byte 12, so VPP's +`l2-input-classify` dispatches all tagged frames to the **other** table slot. +The inner (payload) ethertype is at byte offset 16. 
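The byte offsets quoted for the untagged and tagged tables all follow from the same header arithmetic, with tagged frames shifted by the 4-byte 802.1Q tag. The sketch below recomputes them and lays out a DHCP session match buffer of `(skip + match) × 16` bytes. It is a minimal illustration, not the HLD's implementation: the constants and the helpers `field_offsets`, `put_be16`, and `dhcp_session_match` are hypothetical names, and an IPv4 header without options is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants and helpers only; none of these names come from VPP
// or SaiVppXlate.
constexpr std::size_t kVecBytes    = 16;  // classifier skip/match unit
constexpr std::size_t kEthHdrLen   = 14;  // dst MAC + src MAC + ethertype
constexpr std::size_t kDot1qTagLen = 4;   // TPID + TCI
constexpr std::size_t kIp4ProtoOff = 9;   // Protocol field inside the IPv4 header
constexpr std::size_t kIp4HdrLen   = 20;  // assumes an IPv4 header without options
constexpr std::size_t kUdpDportOff = 2;   // destination port inside the UDP header

struct FieldOffsets {
    std::size_t ethertype;  // payload ethertype (inner ethertype when tagged)
    std::size_t ip_proto;
    std::size_t udp_dport;
};

// Offsets from the start of the frame; a dot1q tag shifts every field by 4.
inline FieldOffsets field_offsets(bool tagged) {
    const std::size_t shift = tagged ? kDot1qTagLen : 0;
    return FieldOffsets{
        kEthHdrLen - 2 + shift,
        kEthHdrLen + shift + kIp4ProtoOff,
        kEthHdrLen + shift + kIp4HdrLen + kUdpDportOff};
}

// Write a 16-bit value in network byte order.
inline void put_be16(std::vector<std::uint8_t>& buf, std::size_t off,
                     std::uint16_t value) {
    buf.at(off)     = static_cast<std::uint8_t>(value >> 8);
    buf.at(off + 1) = static_cast<std::uint8_t>(value & 0xFF);
}

// Session match bytes for the DHCP entry (skip=1, match=2 in both layouts).
// The buffer is indexed from the start of the frame, so the first 16 bytes
// are skip padding and stay zero.
inline std::vector<std::uint8_t> dhcp_session_match(bool tagged) {
    const FieldOffsets off = field_offsets(tagged);
    std::vector<std::uint8_t> buf((1 + 2) * kVecBytes, 0);
    if (tagged) {
        put_be16(buf, off.ethertype, 0x0800);  // tag_dhcp also matches inner ethertype
    }
    buf.at(off.ip_proto) = 0x11;               // UDP
    put_be16(buf, off.udp_dport, 67);          // DHCP server port
    return buf;
}
```

Calling `field_offsets(false)` reproduces the untagged offsets (12, 23, 36) and `field_offsets(true)` the tagged ones (16, 27, 40), which is a quick way to cross-check the tables by hand.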
+ +**Table: tag_other** — matches inner ethertype (skip=1, match=1) + +Chained to `tag_dhcp` table on miss. + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 16–17 | Inner Ethertype | `0xFFFF` | + +Sessions: + +| Protocol | Inner Etype | opaque_index | Action | +|----------|-------------|-------------|--------| +| ARP | `0x0806` | 1 | clone + punt | + +Note: LLDP/LACP sessions are **not** needed in the tagged tables because +`ethernet-input` does not dispatch LLDP (0x88CC) or LACP (0x8809) frames +to the dot1q sub-interface — they stay on the parent and follow the regular +LCP punt path. + +Miss: chain to `tag_dhcp` table. + +**Table: tag_dhcp** — matches inner etype + IP protocol + UDP dport (skip=1, match=2) + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 16–17 | Inner Ethertype | `0xFFFF` | +| 27 | IP Protocol | `0xFF` | +| 40–41 | UDP Dest Port | `0xFFFF` | + +Sessions: + +| Protocol | Inner Etype | Proto | DPort | opaque_index | Action | +|----------|-------------|-------|-------|-------------|--------| +| DHCP | `0x0800` | 0x11 | 67 | 1 | clone + punt | + +Miss: continue L2 feature chain. + +#### 3.3.3 Table Attachment + +When a BD member is added: +- **Tagged member** (sub-interface, e.g., `bobm0.10`): + `classify_set_interface_l2_tables(bobm0.10, ip4=~0, ip6=~0, other=tag_other)` +- **Untagged member** (parent phy, e.g., `bobm0`): + `classify_set_interface_l2_tables(bobm0, ip4=untag_ip4, ip6=~0, other=untag_other)` + +### 3.4 Frames in the BD Are Always Untagged All frames inside the bridge domain are **untagged**: - Tagged members: VPP sub-interface pops the outer tag on ingress (`pop 1`) @@ -88,28 +238,14 @@ All frames inside the bridge domain are **untagged**: This matches the Linux kernel model where the `Vlan10` SVI sees untagged frames. 
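The "always untagged inside the BD" rule amounts to a symmetric pop on ingress and push on egress for tagged members. The sketch below models that on a raw byte vector. It is only an illustration of the frame transformation, not VPP's VTR code: `Frame`, `pop_outer_tag`, and `push_tag` are hypothetical names, and the pushed tag assumes priority and DEI bits of zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration; VPP's l2-input-vtr / l2-output-vtr
// nodes operate on vlib buffers, not on std::vector.
using Frame = std::vector<std::uint8_t>;

constexpr std::size_t kTagOffset = 12;  // tag follows dst MAC + src MAC
constexpr std::size_t kTagLen    = 4;   // TPID (0x8100) + TCI

// Ingress on a tagged member: strip the outer 802.1Q tag ("pop 1").
// Frames without a dot1q tag are returned unchanged.
inline Frame pop_outer_tag(Frame f) {
    if (f.size() >= kTagOffset + kTagLen &&
        f[kTagOffset] == 0x81 && f[kTagOffset + 1] == 0x00) {
        f.erase(f.begin() + kTagOffset, f.begin() + kTagOffset + kTagLen);
    }
    return f;
}

// Egress on a tagged member: push the tag back (symmetric to the pop).
// Assumes PCP/DEI = 0; only the low 12 bits of vlan_id are used.
inline Frame push_tag(Frame f, std::uint16_t vlan_id) {
    if (f.size() < kTagOffset) {
        return f;  // too short to be an Ethernet frame; leave untouched
    }
    const std::uint8_t tag[kTagLen] = {
        0x81, 0x00,
        static_cast<std::uint8_t>((vlan_id >> 8) & 0x0F),
        static_cast<std::uint8_t>(vlan_id & 0xFF)};
    f.insert(f.begin() + kTagOffset, tag, tag + kTagLen);
    return f;
}
```

For a frame that arrived with tag 10, `push_tag(pop_outer_tag(frame), 10)` yields the original bytes, which is the symmetry the tagged-member path relies on.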
-### 3.3 ARP Handling via BVI - -ARP in a bridge domain with BVI works like below: - -**BUM flood to BVI**: ARP requests (broadcast, dst ff:ff:ff:ff:ff:ff) are - flooded to all BD members including the BVI. When the frame reaches the BVI: - - `bvi-input` transitions from L2 to L3 - - The ARP request is punted to the host via - `linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10` - - The kernel's ARP daemon generates a reply - -ARP is **not** punted directly from member interface. The broadcast ARP naturally -floods to the BVI through the bridge domain's normal BUM flooding path. - -### 3.4 Promiscuous Mode on Physical Interfaces +### 3.5 Promiscuous Mode on Physical Interfaces VPP's virtio/DPDK backend implements `VIRTIO_NET_F_CTRL_VLAN` which filters tagged frames at the device level. Promiscuous mode is enabled on every physical interface at LCP pair creation time so that tagged frames pass through to VPP's `ethernet-input` for sub-interface dispatch. -### 3.5 Auto Sub-Interface (lcp-auto-subint) +### 3.6 Auto Sub-Interface (lcp-auto-subint) VPP's linux-cp plugin `lcp-auto-subint` feature is enabled. When SAI creates a VPP sub-interface (e.g., `bobm0.10`), linux-cp automatically creates the @@ -134,7 +270,7 @@ When a VLAN SVI is created (SAI `ROUTER_INTERFACE_TYPE_VLAN`): Teardown mirrors creation in reverse order. -### 4.2 Tagged Member: Sub-Interface + VTR Pop-1 +### 4.2 Tagged Member: Sub-Interface + VTR Pop-1 + Classifier **File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (TAGGED path) @@ -143,16 +279,49 @@ Teardown mirrors creation in reverse order. 2. Add sub-interface to BD: `set_sw_interface_l2_bridge(bobm0.10, 10, true, NORMAL)` 3. Set VTR pop-1: `set_l2_interface_vlan_tag_rewrite(bobm0.10, 10, ~0, DOT1Q, POP_1)` 4. Admin up: `interface_set_state(bobm0.10, true)` +5. 
Attach classifier: `l2_punt_classify_apply(bobm0.10, true /*tagged*/)` -### 4.3 Untagged Member: Parent Interface in BD +### 4.3 Untagged Member: Parent Interface in BD + Classifier **File**: `SwitchVppFdb.cpp` — `vpp_create_vlan_member()` (UNTAGGED path) 1. Add parent phy directly to BD: `set_sw_interface_l2_bridge(bobm0, 10, true, NORMAL)` +2. Attach classifier: `l2_punt_classify_apply(bobm0, false /*untagged*/)` No sub-interface or VTR needed — wire frames are already untagged. -### 4.4 Promiscuous Mode on Every Physical Interface +### 4.4 Classifier Initialization + +**File**: `SwitchVppFdb.cpp` — `l2_punt_classify_init()` + +Lazily creates the four shared classify tables and sessions described in +Section 3.3. Called automatically on the first BD member add. + +### 4.5 Clone-on-Hit VPP Patch + +**File**: `platform/vpp/vppbld/patches/0010-l2-input-classify-clone-on-hit.patch` + +Adds clone support to `l2-input-classify`. When a session match has +`opaque_index == L2_CLASSIFY_OPAQUE_CLONE (1)`: +1. Clone the buffer via `vlib_buffer_copy()` +2. Send clone to `hit_next_index` (linux-cp-punt) +3. Original buffer continues through the L2 feature chain + +### 4.6 VPP API Wrappers + +**File**: `SaiVppXlate.c` + +New functions for the classify binary API: +- `vpp_classify_table_create()` — create table with mask +- `vpp_classify_table_delete()` — delete table +- `vpp_classify_session_add()` — add session with match/opaque +- `vpp_classify_session_del()` — delete session +- `vpp_classify_set_interface_l2_tables()` — attach/detach tables on interface +- `vpp_add_node_next()` — resolve next-node index + +All use the `M`/`M22` macros for socket-aware message allocation. 
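The punt versus clone decision described for the clone-on-hit patch can be summarized as a small truth table keyed on `opaque_index`. The sketch below models it with the `untag_other` sessions from Section 3.3.1. It is a simplified model under stated assumptions, not the patch itself: `Outcome`, `classify_outcome`, and `untag_other_lookup` are hypothetical names, and the real node works on vlib buffers and next-node indices rather than booleans.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Values taken from the HLD: 0 = punt (consume), 1 = clone-on-hit.
constexpr std::uint32_t kOpaquePunt  = 0;
constexpr std::uint32_t kOpaqueClone = 1;

// Hypothetical summary of what happens to a frame after classification.
struct Outcome {
    bool sent_to_linux_cp_punt;  // a copy (or the original) reaches the host tap
    bool continues_l2_chain;     // the original goes on to VTR/learn/fwd/flood
};

inline Outcome classify_outcome(bool hit, std::uint32_t opaque_index) {
    if (!hit) {
        return Outcome{false, true};   // miss: normal L2 processing
    }
    if (opaque_index == kOpaqueClone) {
        return Outcome{true, true};    // clone punted, original continues
    }
    return Outcome{true, false};       // original consumed by the punt
}

// Sessions of the untag_other table as listed in Section 3.3.1.
inline const std::map<std::uint16_t, std::uint32_t>& untag_other_sessions() {
    static const std::map<std::uint16_t, std::uint32_t> sessions = {
        {0x88CC, kOpaquePunt},   // LLDP
        {0x8809, kOpaquePunt},   // LACP
        {0x0806, kOpaqueClone},  // ARP
    };
    return sessions;
}

inline Outcome untag_other_lookup(std::uint16_t ethertype) {
    const auto& sessions = untag_other_sessions();
    const auto it = sessions.find(ethertype);
    if (it == sessions.end()) {
        return classify_outcome(false, 0);
    }
    return classify_outcome(true, it->second);
}
```

In this model an ARP frame is both punted and forwarded, an LLDP frame is punted only, and any ethertype without a session passes through untouched.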
+ +### 4.7 Promiscuous Mode on Every Physical Interface **File**: `SwitchVppHostif.cpp` @@ -162,181 +331,178 @@ configure_lcp_interface(hwif_name, dev, true); interface_set_promiscuous(hwif_name, true); // <-- added ``` -**File**: `SaiVppXlate.c` — new `interface_set_promiscuous()` wrapper using -VPP's `sw_interface_set_promisc` API. - --- ## 5. Packet Flow ### 5.1 ARP Request from Tagged Member -A host on tagged member Ethernet0 (bobm0.10) sends an ARP request to resolve -the gateway 10.0.0.1 (BVI IP): +A host on tagged member Ethernet0 (bobm0.10) sends an ARP broadcast: ``` Wire (802.1Q tag=10, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) → bobm0 (dpdk-input, promisc on) → ethernet-input: dot1q tag=10 → dispatches to bobm0.10 → l2-input (BD 10) - → l2-input-vtr: POP tag (frame now untagged in BD) - → l2-learn: learn src MAC on bobm0.10 - → l2-fwd: dst ff:ff:ff:ff:ff:ff → broadcast - → l2-flood: flood to all BD members except source intf + BVI - └─ bvi10 (BVI): - → linux-cp-punt-xc - → tap_Vlan10 → tc → Vlan10 (kernel) - → kernel ARP daemon sends reply + → l2-input-classify: outer etype=0x8100 → "other" table + → tag_other: inner etype=0x0806 (ARP) → HIT, opaque=1 + → clone-on-hit: + CLONE → linux-cp-punt → Ethernet0 tap (with 802.1Q tag) + → kernel receives ARP on Ethernet0 with VLAN tag + → kernel ARP daemon replies (if IP is local) + ORIGINAL → continues L2 feature chain: + → l2-input-vtr: POP tag + → l2-learn → l2-fwd → l2-flood → bvi10 + → bvi-input → arp-input → linux-cp (Vlan10) ``` +ARP reaches the kernel **twice**: once via the clone (on `Ethernet0` with +VLAN tag, for per-port visibility) and once via the BVI path (on `Vlan10`, +untagged). 
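The first two classifier steps in the flow above (slot selection on the outer ethertype, then a match on the inner ethertype) can be sketched on raw frame bytes. This is an illustrative model only: `TableSlot`, `be16_at`, `table_slot`, and `inner_ethertype` are hypothetical names, and the slot rule assumes the ip4/ip6/other split that `l2-input-classify` applies by ethertype.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the per-interface table slots (ip4 / ip6 / other).
enum class TableSlot { Ip4, Ip6, Other };

// Read a 16-bit network-order field from the frame.
inline std::uint16_t be16_at(const std::vector<std::uint8_t>& frame,
                             std::size_t off) {
    return static_cast<std::uint16_t>((frame.at(off) << 8) | frame.at(off + 1));
}

// Slot selection looks only at the ethertype at byte 12. A dot1q frame
// carries 0x8100 there, so it always lands in the "other" slot.
inline TableSlot table_slot(const std::vector<std::uint8_t>& frame) {
    switch (be16_at(frame, 12)) {
        case 0x0800: return TableSlot::Ip4;
        case 0x86DD: return TableSlot::Ip6;
        default:     return TableSlot::Other;
    }
}

// For a tagged frame, the payload ethertype sits after the 4-byte tag.
inline std::uint16_t inner_ethertype(const std::vector<std::uint8_t>& frame) {
    return be16_at(frame, 16);
}
```

A tagged ARP request therefore resolves to the "other" slot with inner ethertype `0x0806`, which is the `tag_other` hit shown in the flow.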
+ ### 5.2 ARP Request from Untagged Member -A host on untagged member Ethernet4 (bobm0) sends an ARP request: +A host on untagged member Ethernet4 (bobm1) sends an ARP broadcast: ``` Wire (no tag, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) - → bobm0 (dpdk-input, promisc on) - → ethernet-input: no tag → parent sw_if_index (bobm0) + → bobm1 (dpdk-input, promisc on) + → ethernet-input: no tag → parent sw_if_index (bobm1) → l2-input (BD 10) - → l2-input-vtr: no-op - → l2-learn: learn src MAC on bobm0 - → l2-fwd: dst ff:ff:ff:ff:ff:ff → broadcast - → l2-flood: flood to all BD members except source intf + BVI - └─ bvi10 (BVI): - → bvi-input: L2→L3 transition - → ip4-punt → linux-cp-punt-xc - → tap_Vlan10 → tc → Vlan10 (kernel) - → kernel ARP daemon sends reply + → l2-input-classify: etype=0x0806 → "other" table + → untag_other: etype=0x0806 (ARP) → HIT, opaque=1 + → clone-on-hit: + CLONE → linux-cp-punt → Ethernet4 tap (untagged) + → kernel receives ARP on Ethernet4 + ORIGINAL → continues L2 feature chain: + → l2-learn → l2-fwd → l2-flood → bvi10 + → bvi-input → arp-input → linux-cp (Vlan10) ``` -### 5.3 ARP Reply from Kernel to Member +### 5.3 LLDP/LACP from Tagged Member -The kernel ARP daemon generates a unicast reply on `Vlan10`. The reply does -**not** go through VPP's bridge domain — it exits through the Linux bridge -directly to the member's LCP tap: +LLDP and LACP use link-local ethertypes (0x88CC, 0x8809) — **not** 0x8100. +When a tagged member port receives an LLDP frame, `ethernet-input` sees +ethertype 0x88CC and does **not** dispatch it to the dot1q sub-interface. +The frame stays on the parent physical interface, which has an LCP pair +but is not in the BD. 
It follows the regular LCP punt path: ``` -Kernel ARP reply (src MAC = BVI MAC, dst MAC = requester's MAC) - → Vlan10 → Linux bridge FDB lookup for dst MAC: - ├─ If learned on Ethernet8: unicast → Ethernet8 tap - └─ If unknown: flood to all bridge member taps - → Ethernet8 tap (tap-input, hw_if_index 18) - → linux-cp-punt-xc: 18 → 2 (bobm1) - → bobm1-output → bobm1-tx → wire (untagged) +Wire (LLDP 0x88CC, no 802.1Q encapsulation — LLDP is always untagged) + → bobm0 (dpdk-input, promisc on) + → ethernet-input: etype=0x88CC, hw-if-index=1, sw-if-index=1 (parent) + → linux-cp-punt-xc: sw_if_index 1 → tap (Ethernet0) + → kernel lldpd/teamd processes the frame on Ethernet0 ``` -For a **tagged** member: +The classifier on the sub-interface (`bobm0.10`) never sees LLDP/LACP. +This is the standard behavior — LLDP is a link-layer protocol that is +not VLAN-tagged on the wire. + +### 5.4 LLDP/LACP from Untagged Member + ``` - → Ethernet0.10 tap (tap-input) - → linux-cp-punt-xc → bobm0.10 - → bobm0.10-output → wire (tagged, 802.1Q tag added by sub-if) +Wire (no tag, LLDP 0x88CC) + → bobm1 → ethernet-input → bobm1 + → l2-input (BD 10) + → l2-input-classify: etype=0x88CC → "other" table + → untag_other: etype=0x88CC → HIT, opaque=0 + → punt (consume): → linux-cp-punt → Ethernet4 tap + → kernel lldpd/teamd processes the frame ``` -> **Note — Linux bridge FDB flooding issue:** -> -> The Linux bridge does not learn remote MACs on the correct member port -> because all L2 forwarding happens in VPP's bridge domain — no traffic from -> remote hosts ever arrives on member LCP taps to trigger kernel FDB learning. -> As a result, the first ARP reply (and all subsequent kernel-originated unicast -> frames) are **flooded** by the Linux bridge to all member taps as unknown -> unicast. This is functionally correct (the requester receives the reply) but -> wasteful. 
-> -> **Planned fix:** Implement SAI FDB event notifications (`SAI_FDB_EVENT_LEARNED`, -> `SAI_FDB_EVENT_AGED`, `SAI_FDB_EVENT_MOVE`) in VPP SAI. When VPP's `l2-learn` -> adds a MAC to its l2fib, the SAI layer generates a learned event. SONiC's -> fdborch processes the event and fdbsyncd programs the entry into the kernel -> bridge FDB. After that, the Linux bridge forwards unicast frames directly to -> the correct member tap without flooding. - -### 5.4 Unicast Destined to BVI (e.g., ping to Vlan10 SVI IP) - -Once ARP is resolved, the host sends IP packets with dst-MAC = BVI MAC. The -L2 forwarding table has learned the BVI MAC on the BVI port: +### 5.5 DHCP Request from Untagged Member + +DHCP discover from a client on untagged member Ethernet4: ``` -Wire (tagged, dst-MAC = BVI MAC, dst-IP = 10.0.0.1) → bobm0 (dpdk-input) - → ethernet-input → bobm0.10 +Wire (no tag, IPv4, UDP src=68 dst=67, 0.0.0.0 → 255.255.255.255) + → bobm1 → ethernet-input → bobm1 → l2-input (BD 10) - → l2-input-vtr: POP tag - → l2-learn (src MAC on bobm0.10) - → l2-fwd: dst MAC known on bvi10 (BVI port) - → l2-output → bvi10 - → bvi-input: L2→L3 transition - → ip4-input → ip4-lookup → ip4-local (dest is local SVI IP) - → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) - → kernel processes (ICMP reply, TCP, etc.) + → l2-input-classify: etype=0x0800 → "ip4" table + → untag_ip4: proto=0x11, dport=67 → HIT, opaque=1 + → clone-on-hit: + CLONE → linux-cp-punt → Ethernet4 tap (untagged) + → kernel receives DHCP on Ethernet4 + → Linux bridge floods to all member taps + → dhcrelay on Vlan10 picks up the broadcast + → relay adds option 82, forwards to DHCP server + ORIGINAL → continues L2 feature chain: + → l2-learn → l2-fwd → l2-flood → bvi10 + → (IP broadcast to 255.255.255.255, handled by BD flood) ``` -For an **untagged** member the path is the same except no VTR pop: +The DHCP relay process (`dhcrelay -id Vlan10 -iu docker0 `) +listens for DHCP broadcasts. 
The clone delivers the packet into the Linux +network stack where it reaches the relay via the Linux bridge that connects +member taps and `Vlan10`. + +### 5.6 DHCP Request from Tagged Member ``` -Wire (untagged, dst-MAC = BVI MAC) → bobm0 (dpdk-input) - → ethernet-input → bobm0 (parent sw_if_index) +Wire (802.1Q tag=10, IPv4, UDP src=68 dst=67) + → bobm0 → ethernet-input → bobm0.10 → l2-input (BD 10) - → l2-learn (src MAC on bobm0) - → l2-fwd: dst MAC known on bvi10 + → l2-input-classify: outer etype=0x8100 → "other" table + → tag_other: inner etype=0x0800 → MISS → chain to tag_dhcp + → tag_dhcp: inner etype=0x0800, proto=0x11, dport=67 → HIT, opaque=1 + → clone-on-hit: + CLONE → linux-cp-punt → Ethernet0 tap (with 802.1Q tag) + → kernel receives tagged DHCP on Ethernet0 + ORIGINAL → continues L2 feature chain: + → l2-input-vtr: POP tag → l2-learn → l2-flood → bvi10 +``` +Note: SONiC control plane doesn't support DHCP over tagged member interface. + +### 5.7 IP Unicast to SVI (e.g., ping to Vlan10 IP) + +L3 traffic destined to the SVI IP does **not** use the classifier — it flows +through normal L2 forwarding to the BVI, then through the BVI LCP pair. + +``` +Wire (tagged, dst-MAC = BVI MAC, dst-IP = 10.0.0.1) → bobm0 (dpdk-input) + → ethernet-input → bobm0.10 + → l2-input (BD 10) + → l2-input-classify: etype=0x0800, ip4 → untag_ip4 or tag_dhcp + → no session match (not DHCP) → MISS → continue + → l2-input-vtr: POP tag + → l2-learn → l2-fwd: dst MAC known on bvi10 → l2-output → bvi10 → bvi-input: L2→L3 transition → ip4-input → ip4-lookup → ip4-local → linux-cp-punt-xc → tap_Vlan10 → tc → Vlan10 (kernel) ``` -The key insight: unlike BUM traffic (which reaches the BVI via `l2-flood`), -known-unicast to the BVI MAC reaches it via normal `l2-fwd` lookup. In both -cases, once the packet enters `bvi-input`, it is treated as an L3 packet. +For an **untagged** member the path is the same except no VTR pop. 
-### 5.5 Routed Traffic with BVI as Next-Hop (Inter-VLAN / External) +### 5.8 Routed Traffic with BVI as Next-Hop (Inter-VLAN / External) When a host in BD 10 sends IP traffic to a destination on a **different -subnet** (e.g., another VLAN or an external route), the host's default -gateway is the BVI IP (Vlan10 SVI). The host resolves the gateway MAC via -ARP and sends the frame with dst-MAC = BVI MAC. - -**Ingress (L2 → BVI → L3 routing):** +subnet**, the host's default gateway is the BVI IP. The frame has dst-MAC = +BVI MAC and is L2-forwarded to the BVI. ``` Wire (dst-MAC = BVI MAC, dst-IP = 10.0.20.1) → bobm0.10 - → l2-input (BD 10) → l2-input-vtr: POP tag + → l2-input (BD 10) → classifier miss → VTR POP → l2-learn → l2-fwd: dst MAC = BVI MAC → bvi10 - → bvi-input: L2→L3 transition + → bvi-input: L2→L3 → ip4-input → ip4-lookup: - dst 10.0.20.1 → next-hop via bvi20 (another VLAN SVI) + dst 10.0.20.1 → next-hop via bvi20 (inter-VLAN) → or next-hop via bobm2 (L3 routed port) - → or next-hop via default route (upstream) ``` -**Case A: Inter-VLAN routing (destination in BD 20):** - -``` - → ip4-lookup → next-hop 10.0.20.1 reachable via bvi20 - → ip4-rewrite: rewrite dst-MAC to target host MAC, src-MAC to bvi20 MAC - → bvi-output (bvi20): L3→L2 transition into BD 20 - → l2-input (BD 20, from BVI port) - → l2-fwd: dst MAC known on member in BD 20 - → l2-output → [VTR if tagged member] → wire -``` +### 5.9 L2 Unicast Forwarding Between Members -**Case B: Routing to an L3 port (no bridge domain):** +Normal L2 forwarding is unaffected by the classifier (miss path): ``` - → ip4-lookup → next-hop via bobm2 (L3 interface) - → ip4-rewrite: rewrite MACs - → interface-output → bobm2 → wire -``` - -### 5.6 L2 Unicast Forwarding Between Members +bobm0.10 (ingress, tagged) → l2-input → classifier miss → POP tag + → l2-learn + l2-fwd → dst MAC on bobm1 (untagged): + → l2-output → no VTR → bobm1 → wire (untagged) -``` -bobm0.10 (ingress, tagged member) → l2-input → POP tag - → 
l2-learn + l2-fwd - → destination MAC known on bobm0 (untagged member): - → l2-output → no VTR → bobm0 → wire (untagged) - -bobm0 (ingress, untagged member) → l2-input → no VTR - → l2-learn + l2-fwd - → destination MAC known on bobm0.10 (tagged member): - → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) +bobm1 (ingress, untagged) → l2-input → classifier miss + → l2-learn + l2-fwd → dst MAC on bobm0.10 (tagged): + → l2-output → l2-output-vtr: PUSH tag 10 → bobm0.10 → wire (tagged) ``` --- @@ -351,13 +517,26 @@ vppctl show bridge-domain 10 detail BD-ID 10, flood, learn bvi10 (BVI, sw_if_index 25) bobm0.10 (tagged member, sw_if_index 23, vtr pop-1) - bobm0 (untagged member, sw_if_index 1) + bobm1 (untagged member, sw_if_index 3) # LCP pairs vppctl show lcp bvi10 → tap_Vlan10 bobm0 → Ethernet0 (physical) bobm0.10 → Ethernet0.10 (auto-subint) + bobm1 → Ethernet4 (physical) + +# Classifier tables +vppctl show classify tables + Table 0 (untag_other): skip=0 match=1 sessions=3 + Table 1 (untag_ip4): skip=1 match=2 sessions=1 + Table 2 (tag_dhcp): skip=1 match=2 sessions=1 + Table 3 (tag_other): skip=1 match=1 sessions=3, next_table=2 + +# Classifier attachment +vppctl show classify interface + bobm0.10: ip4=~0 ip6=~0 other=3 (tag_other) + bobm1: ip4=1 (untag_ip4) ip6=~0 other=0 (untag_other) # Promiscuous mode vppctl show interface bobm0 @@ -366,10 +545,19 @@ vppctl show interface bobm0 --- -## 7. Related Documents +## 7. IPv6 Neighbor Discovery (Future Work) + +IPv6 Neighbor Discovery (ND) punt is not covered in this design and will be +addressed in a separate document. Two options are under consideration: + +1. **VPP built-in ND handling**: VPP's `ip6-neighbor-discovery` and linux-cp + plugin can handle ND natively without involving the SONiC control plane. + VPP would respond to NS/NA on the BVI and program neighbor entries directly. 
-- [VLAN BVI L2 Punt HLD](vlan-bvi-l2-punt-hld.md) — LLDP/LACP punt via - `lcp-punt-l2-ethertype` for bridged members +2. **Classifier-based punt**: Extend the classifier tables to match IPv6 ND + packets (Next Header=0x3A ICMPv6, Type=0x87 NS / 0x88 NA) and punt/clone + them to the Linux control plane, similar to the ARP approach. This would + require additional ip6 table slots. --- @@ -379,9 +567,10 @@ vppctl show interface bobm0 |------|--------| | `platform/vpp/docker-syncd-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | | `platform/vpp/docker-sonic-vpp/conf/startup.conf.tmpl` | Enable `lcp-auto-subint` | -| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.c` | `interface_set_promiscuous()` wrapper | +| `platform/vpp/vppbld/patches/0010-l2-input-classify-clone-on-hit.patch` | Clone-on-hit for `l2-input-classify` | +| `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.c` | Classify API wrappers, `interface_set_promiscuous()` | | `src/sonic-sairedis/vslib/vpp/vppxlate/SaiVppXlate.h` | Extern declarations | -| `src/sonic-sairedis/vslib/vpp/SwitchVppFdb.cpp` | BVI LCP pair, tagged/untagged member handling | +| `src/sonic-sairedis/vslib/vpp/SwitchVppFdb.cpp` | BVI LCP pair, classifier init/apply/remove, tagged/untagged member handling | | `src/sonic-sairedis/vslib/vpp/SwitchVppRif.cpp` | SUB_PORT RIF: rely on auto-subint | | `src/sonic-sairedis/vslib/vpp/SwitchVppHostif.cpp` | Promisc on every phy at LCP creation | | `src/sonic-sairedis/vslib/vpp/SwitchVpp.h` | `m_bvi_vlan_lcp_map` member | From e1fa5a167f016aebe0f5130e48095d1f3e525b89 Mon Sep 17 00:00:00 2001 From: Yue Gao Date: Thu, 28 May 2026 11:36:08 -0400 Subject: [PATCH 4/4] Update HLD to address comments - disable flooding to bvi Signed-off-by: Yue Gao --- docs/HLD/vlan-bvi-hld.md | 256 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 246 insertions(+), 10 deletions(-) diff --git a/docs/HLD/vlan-bvi-hld.md b/docs/HLD/vlan-bvi-hld.md index fdcced6..8f5feb9 100644 --- 
a/docs/HLD/vlan-bvi-hld.md +++ b/docs/HLD/vlan-bvi-hld.md @@ -75,6 +75,113 @@ config vlan member add -u 10 Ethernet4 # untagged config interface ip add Vlan10 10.0.0.1/24 ``` +### 2.1 ASIC_DB Representation + +The CONFIG_DB configuration above flows through orchagent and surfaces in ASIC_DB as the following SAI objects. The sonic-vpp SAI layer translates each of these into VPP bridge-domain, sub-interface, BVI, LCP, and classify operations. + +#### Bridge object (SAI 1Q bridge) + +``` +ASIC_STATE:SAI_OBJECT_TYPE_BRIDGE:oid:0x39000000000001 + SAI_BRIDGE_ATTR_TYPE = SAI_BRIDGE_TYPE_1Q +``` + +Created implicitly at boot. The default 1Q bridge is the container for VLANs. + +#### VLAN object + +``` +ASIC_STATE:SAI_OBJECT_TYPE_VLAN:oid:0x2600000000064f + SAI_VLAN_ATTR_VLAN_ID = 10 +``` + +→ sonic-vpp creates VPP bridge-domain id 10. The BVI `bvi10` is created later, when the VLAN router interface object (see below) is processed. + +#### VLAN member objects (one per port) + +``` +ASIC_STATE:SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x270000000007a0 + SAI_VLAN_MEMBER_ATTR_VLAN_ID = oid:0x2600000000064f + SAI_VLAN_MEMBER_ATTR_BRIDGE_PORT_ID = oid:0x3a00000000079f # bridge port for Ethernet0 + SAI_VLAN_MEMBER_ATTR_VLAN_TAGGING_MODE = SAI_VLAN_TAGGING_MODE_TAGGED + +ASIC_STATE:SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x270000000007a1 + SAI_VLAN_MEMBER_ATTR_VLAN_ID = oid:0x2600000000064f + SAI_VLAN_MEMBER_ATTR_BRIDGE_PORT_ID = oid:0x3a00000000079e # bridge port for Ethernet4 + SAI_VLAN_MEMBER_ATTR_VLAN_TAGGING_MODE = SAI_VLAN_TAGGING_MODE_UNTAGGED +``` + +→ sonic-vpp action: +- **Tagged** (`Ethernet0`): create `GigabitEthernet0/8/0.10` dot1q sub-interface, set `l2 tag-rewrite pop 1`, add to BD 10. +- **Untagged** (`Ethernet4`): add the parent hardware interface `GigabitEthernet0/8/1` directly to BD 10. +- Install l2-input-classify sessions on the member's `sw_if_index` (per the protocol matrix in §3.3). 
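The tagged/untagged split above can be sketched as a small Python helper that emits the equivalent `vppctl` CLI lines for one VLAN member. This is a sketch under stated assumptions, not the sonic-vpp code path: the SAI layer uses binary API wrappers rather than CLI strings, the function name `member_commands` is hypothetical, the `exact-match` keyword is an assumption about what the LCP sub-interface needs, and the command order simply follows the bullet list.

```python
def member_commands(parent_if: str, vlan_id: int, tagging_mode: str) -> list:
    """Return vppctl-equivalent commands for adding one member to BD <vlan_id>.

    The bridge-domain id is taken to be the VLAN id, as in the VLAN object
    mapping described above.
    """
    if tagging_mode == "tagged":
        sub_if = f"{parent_if}.{vlan_id}"
        return [
            # dot1q sub-interface; exact-match is an assumption (see lead-in)
            f"create sub-interfaces {parent_if} {vlan_id} dot1q {vlan_id} exact-match",
            # pop the outer tag on ingress; VTR pushes it back on egress
            f"set interface l2 tag-rewrite {sub_if} pop 1",
            f"set interface l2 bridge {sub_if} {vlan_id}",
        ]
    if tagging_mode == "untagged":
        # the parent hardware interface joins the bridge domain directly
        return [f"set interface l2 bridge {parent_if} {vlan_id}"]
    raise ValueError(f"unknown tagging mode: {tagging_mode!r}")


for line in member_commands("GigabitEthernet0/8/0", 10, "tagged"):
    print(line)
for line in member_commands("GigabitEthernet0/8/1", 10, "untagged"):
    print(line)
```

Classifier table attachment is left out here because it is covered separately in §3.3.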
+ +#### Bridge port objects (one per port that participates in any bridge) + +``` +ASIC_STATE:SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a00000000079f + SAI_BRIDGE_PORT_ATTR_TYPE = SAI_BRIDGE_PORT_TYPE_PORT + SAI_BRIDGE_PORT_ATTR_PORT_ID = oid:0x10000000000004 # Ethernet0 + SAI_BRIDGE_PORT_ATTR_BRIDGE_ID = oid:0x39000000000001 + SAI_BRIDGE_PORT_ATTR_ADMIN_STATE = true + SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE = SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW +``` + +→ sonic-vpp records the port-to-BD relationship; actual `set int l2 bridge` is deferred until VLAN_MEMBER object resolves which BD and which sub-interface to use. + +#### Router interface object (the SVI / BVI) + +``` +ASIC_STATE:SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x600000000064d + SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID = oid:0x3000000000040 + SAI_ROUTER_INTERFACE_ATTR_TYPE = SAI_ROUTER_INTERFACE_TYPE_VLAN + SAI_ROUTER_INTERFACE_ATTR_VLAN_ID = oid:0x2600000000064f + SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS = 1C:23:CD:51:EB:00 + SAI_ROUTER_INTERFACE_ATTR_MTU = 9100 +``` + +→ sonic-vpp action: This is the trigger for BVI creation. +- Create `bvi10` (loopback interface with `bvi 1`) in BD 10. +- Set MAC address from `SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS`. +- Create LCP pair `bvi10` ↔ `tap_Vlan10`; set up tc filter `tap_Vlan10` ↔ kernel `Vlan10`. +- Enable IP on the BVI; subsequent `SAI_OBJECT_TYPE_ROUTER_INTERFACE` IP-address attribute updates and `SAI_OBJECT_TYPE_NEIGHBOR_ENTRY` / `SAI_OBJECT_TYPE_ROUTE_ENTRY` operations use this BVI's `sw_if_index`. 
+ +#### Hostif object (for the SVI Linux netdev) + +``` +ASIC_STATE:SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000000a5e + SAI_HOSTIF_ATTR_TYPE = SAI_HOSTIF_TYPE_NETDEV + SAI_HOSTIF_ATTR_OBJ_ID = oid:0x600000000064d # the VLAN RIF + SAI_HOSTIF_ATTR_NAME = "Vlan10" + SAI_HOSTIF_ATTR_OPER_STATUS = true +``` + +→ sonic-vpp action: The `Vlan10` netdev already exists in the kernel (created by orchagent's host-interface manager); sonic-vpp ensures the LCP tap is wired to it via tc redirect. + +#### Putting it together + +``` +config_db.json + │ + ▼ +orchagent (VlanMgr, IntfMgr, ...) + │ + ▼ +ASIC_DB objects (above) + │ + ▼ +sonic-vpp SAI layer + ├── SAI_OBJECT_TYPE_VLAN → create BD 10 + ├── SAI_OBJECT_TYPE_VLAN_MEMBER (tagged) → dot1q sub-if + tag-rewrite + add to BD + classify sessions + ├── SAI_OBJECT_TYPE_VLAN_MEMBER (untagged) → parent hwif + add to BD + classify sessions + ├── SAI_OBJECT_TYPE_ROUTER_INTERFACE → bvi10 + LCP pair + tc redirect to Vlan10 + └── SAI_OBJECT_TYPE_HOSTIF → ensure Vlan10 netdev ↔ tap_Vlan10 tc link +``` + +ASIC_DB does **not** explicitly model: +- Which L2 protocols are punted (ARP/LLDP/LACP/DHCP) — that comes from CoPP / hostif-trap objects (`SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP`, `SAI_OBJECT_TYPE_HOSTIF_TRAP`) which orchagent installs once at boot; sonic-vpp translates them into the `l2-input-classify` sessions described in §3.3. +- BVI vs. non-BVI semantics — that is a property of `SAI_ROUTER_INTERFACE_TYPE_VLAN` (BVI) vs. `SAI_ROUTER_INTERFACE_TYPE_PORT` / `_SUB_PORT` (no BVI). + --- ## 3. Design Principles @@ -177,6 +284,33 @@ Miss: continue L2 feature chain. VPP session match data for skip=1 tables includes 16 bytes of skip padding at the start: `match_len = (skip + match) × 16 = 48 bytes`. +**Table: untag_ip6** — matches IPv6 next-header + ICMPv6 type (skip=1, match=3) + +This table catches IPv6 Neighbor Discovery (NS/NA/RS/RA/Redirect) frames +among IPv6 packets (ethertype `0x86DD`). 
Without this, IPv6 SVI operation +breaks once the BVI is removed from the BD flood group (see §3.7). + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 20 | IPv6 Next-Header | `0xFF` | +| 54 | ICMPv6 Type | `0xFF` | + +Sessions (all `opaque_index=1`, clone + punt): + +| Protocol | Next-Hdr | ICMPv6 Type | +|----------|----------|-------------| +| RS | 0x3A | 133 | +| RA | 0x3A | 134 | +| NS | 0x3A | 135 | +| NA | 0x3A | 136 | +| Redirect | 0x3A | 137 | + +Miss: continue L2 feature chain. + +DHCPv6 (UDP 546/547) is **not** included in v1 — DHCPv6 relay/server on the +SVI is out of scope; if enabled later, add UDP-dport sessions analogous to +the DHCPv4 entry in `untag_ip4`. + #### 3.3.2 Tagged Member Tables Tagged frames have the outer ethertype `0x8100` at byte 12, so VPP's @@ -220,13 +354,48 @@ Sessions: Miss: continue L2 feature chain. +**Table: tag_ip6** — matches IPv6 next-header + ICMPv6 type with a 4-byte +802.1Q shift (skip=1, match=3) + +Same fields as `untag_ip6` but shifted by 4 bytes to account for the outer +802.1Q header (the classifier runs before VTR, so the tag is still present). + +| Byte Offset | Field | Mask | +|-------------|-------|------| +| 24 | IPv6 Next-Header | `0xFF` | +| 58 | ICMPv6 Type | `0xFF` | + +Sessions: same five ND types as `untag_ip6`, `opaque_index=1`. 
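A quick way to cross-check a table's `skip`/`match` parameters against its byte offsets is to compute which 16-byte vectors the offsets fall into. The Python sketch below assumes the 16-byte unit implied by the `match_len = (skip + match) × 16` convention in §3.3.1, and that offsets are counted from the start of the Ethernet header (and, for IPv6, that no extension headers precede ICMPv6). The helper names `skip_and_match` and `build_mask` are hypothetical.

```python
VECTOR_BYTES = 16  # assumed classifier unit for both skip and match


def skip_and_match(byte_offsets):
    """Return the smallest (skip, match) vector counts covering all offsets."""
    first = min(byte_offsets) // VECTOR_BYTES
    last = max(byte_offsets) // VECTOR_BYTES
    return first, last - first + 1


def build_mask(byte_offsets, skip, match):
    """Build a (skip + match) * 16 byte mask with 0xFF at each offset.

    The leading skip * 16 bytes stay zero, mirroring the skip padding
    described for session match data in section 3.3.1.
    """
    mask = bytearray((skip + match) * VECTOR_BYTES)
    for off in byte_offsets:
        if not skip * VECTOR_BYTES <= off < len(mask):
            raise ValueError(f"offset {off} is outside the matched window")
        mask[off] = 0xFF
    return bytes(mask)


for name, offsets in [("untag_ip4", [23, 36]),
                      ("untag_ip6", [20, 54]),
                      ("tag_ip6", [24, 58])]:
    skip, match = skip_and_match(offsets)
    mask = build_mask(offsets, skip, match)
    print(name, "skip =", skip, "match =", match, "mask_len =", len(mask))
```

Under these assumptions the IPv4 offsets 23 and 36 give `(1, 2)`, while both IPv6 offset pairs span vectors 1 through 3 and give `(1, 3)`, so any table whose window does not cover its offsets raises `ValueError` in `build_mask`.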
+ #### 3.3.3 Table Attachment When a BD member is added: - **Tagged member** (sub-interface, e.g., `bobm0.10`): - `classify_set_interface_l2_tables(bobm0.10, ip4=~0, ip6=~0, other=tag_other)` + `classify_set_interface_l2_tables(bobm0.10, ip4=~0, ip6=tag_ip6, other=tag_other)` - **Untagged member** (parent phy, e.g., `bobm0`): - `classify_set_interface_l2_tables(bobm0, ip4=untag_ip4, ip6=~0, other=untag_other)` + `classify_set_interface_l2_tables(bobm0, ip4=untag_ip4, ip6=untag_ip6, other=untag_other)` + +#### 3.3.4 Adding a New Punted L2-Multicast / Broadcast Protocol + +Because the BVI is excluded from the BD flood group (§3.7), **every** new +control protocol whose frames are L2-multicast or broadcast must be punted +by an explicit classifier session. There is no implicit fallback through +the BVI. To add a protocol (e.g., IGMP, MLD, DHCPv6, VRRP): + +1. Pick the table slot based on ethertype: + - `0x0800` (IPv4) → `untag_ip4` / future `tag_ip4` + - `0x86DD` (IPv6) → `untag_ip6` / `tag_ip6` + - other / link-local → `untag_other` / `tag_other` +2. If the table's existing mask covers your match fields, just add a session. + If not, create a new chained table (`next_table_index`) with the right + mask and link the prior table's miss to it. +3. Choose `opaque_index`: + - `0` = consume (frame stops in classifier, only the punted copy survives) + — use only for link-local protocols that must not be L2-forwarded. + - `1` = clone-on-hit — use for everything else so BD members still see + the frame. +4. Register the session in `l2_punt_classify_init()` so it is installed + before any BD member is attached. ### 3.4 Frames in the BD Are Always Untagged @@ -253,6 +422,53 @@ corresponding Linux VLAN device (`Ethernet0.10`) on the host side. This eliminates the need for explicit `configure_lcp_interface()` calls for sub-interfaces. 
+### 3.7 BVI Excluded from BD Flood Group + +By default, VPP adds every BD member — including the BVI — to the BD's +flood group, so broadcast / unknown-unicast / L2-multicast (BUM) frames +from any member are replicated to the BVI and then handed to `ip4-input` / +`ip6-input`. For an SVI port this is wasted work: + +- ARP broadcasts (`who-has 10.0.0.1`) are already cloned to the kernel via + the `l2-input-classify` punt path described in §3.3. The kernel's ARP + stack on `Vlan10` answers; the BVI does not need to see the flood copy. +- DHCP broadcasts are handled the same way (clone + punt). +- Unknown-unicast and L2-multicast inside the BD have no business reaching + the L3 SVI — they are pure CPU cost and can confuse the IP input path + (martian source logs, unwanted multicast joins, etc.). + +To prevent this, the BVI is added to the BD with **flood and +unknown-unicast flood disabled**: + +| BD-member flag | Normal member | BVI member | +|----------------|---------------|------------| +| `port_type` | `NORMAL` | `BVI` | +| `enable_flood` (broadcast + L2-mcast) | `true` | **`false`** | +| `enable_uu_flood` (unknown unicast) | `true` | **`false`** | +| `enable_bvi` | n/a | `true` | + +Known-unicast frames whose destination MAC matches the BVI's MAC are still +delivered to the BVI via FDB lookup — those are L3-bound packets and must +reach `ip4-input`. Only the flood and UU-flood replications are suppressed. + +This is configured via the `sw_interface_set_l2_bridge` binary API with the +`enable_flood` and `enable_uu_flood` bits cleared for the BVI's `sw_if_index`. + +**Implication — every kernel-visible BUM protocol needs an explicit +classifier session.** With the BVI no longer in the flood group, the BD has +no catch-all path to the host stack. Any L2-broadcast or L2-multicast frame +the kernel needs to see (ARP, DHCPv4, IPv6 ND, IGMP, MLD, DHCPv6, VRRP, +etc.) must be punted by a session in one of the `l2-input-classify` tables +(§3.3). 
Frames not matched by a session are flooded only to the other BD +members and **never reach** `Vlan10` or any other SVI tap. + +v1 ships sessions for: ARP, DHCPv4, LLDP, LACP, and IPv6 ND +(RS/RA/NS/NA/Redirect). Other protocols are out of scope for v1 and follow +the recipe in §3.3.4 when they are enabled. Link-local L2 protocols (LLDP, +LACP, STP) that travel without an 802.1Q header on tagged members continue +to use the parent-phy LCP path and do not need classifier sessions on the +sub-interface (see §3.2). + --- ## 4. Implementation Changes @@ -263,7 +479,9 @@ sub-interfaces. When a VLAN SVI is created (SAI `ROUTER_INTERFACE_TYPE_VLAN`): 1. Create BVI: `create_bvi_interface(mac, vlan_id)` -2. Add BVI to BD: `set_sw_interface_l2_bridge(bvi10, vlan_id, true, BVI)` +2. Add BVI to BD **with flood disabled** (see §3.7): + `set_sw_interface_l2_bridge(bvi10, vlan_id, port_type=BVI, + enable=true, enable_flood=false, enable_uu_flood=false)` 3. Create LCP pair: `configure_lcp_interface("bvi10", "tap_Vlan10", true)` 4. Bring up the tap: `interface_set_state("tap_Vlan10", true)` 5. TC redirect: `add_tc_filter_redirect("tap_Vlan10", "Vlan10")` @@ -352,13 +570,27 @@ Wire (802.1Q tag=10, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) → kernel ARP daemon replies (if IP is local) ORIGINAL → continues L2 feature chain: → l2-input-vtr: POP tag - → l2-learn → l2-fwd → l2-flood → bvi10 - → bvi-input → arp-input → linux-cp (Vlan10) + → l2-learn → l2-fwd → l2-flood + (flood group excludes bvi10 — see §3.7 — + so the BVI does NOT receive a copy) + → replicated only to other BD members ``` -ARP reaches the kernel **twice**: once via the clone (on `Ethernet0` with -VLAN tag, for per-port visibility) and once via the BVI path (on `Vlan10`, -untagged). +ARP reaches the kernel **once**, via the classifier clone on the per-port +`Ethernet0` tap. The BVI is excluded from the BD flood group (§3.7), so the +broadcast is not replicated to `bvi10` and does **not** appear on `Vlan10`. 
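The effect of excluding the BVI from the flood group can be illustrated with a toy Python model of flood replication. It is a conceptual sketch only: the `Member` class and `flood_targets` function are hypothetical and model just the per-member `enable_flood` flag from §3.7, not VPP's actual `l2-flood` node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    flood: bool = True  # mirrors the enable_flood flag from the table in 3.7


def flood_targets(members, ingress):
    """Return the members that receive a broadcast arriving on `ingress`."""
    return [m.name for m in members if m.flood and m.name != ingress]


bd10 = [
    Member("bobm0.10"),            # tagged member
    Member("bobm1"),               # untagged member
    Member("bvi10", flood=False),  # BVI joined with flooding disabled
]

# ARP broadcast from the tagged member: replicated to bobm1 only, so the
# kernel's single copy is the classifier clone on the per-port tap.
print(flood_targets(bd10, ingress="bobm0.10"))
```

Flipping `flood` back to `True` for `bvi10` adds it to the result, which corresponds to option (a) in the note on ARP replies for the SVI IP.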
+ +> **Note on ARP replies for the SVI IP (`10.0.0.1`).** Because the cloned +> ARP request lands on `Ethernet0` rather than `Vlan10`, the kernel must +> still be able to answer for `Vlan10`'s IP. With default sysctls +> (`arp_ignore=0`), Linux will reply for any locally-owned IP regardless +> of the receiving interface; the reply egresses `Ethernet0` and is then +> bridged by VPP back into BD 10 via the LCP path. Implementations should +> validate this path end-to-end and, if needed, either (a) keep the BVI in +> the flood group for broadcast (set `enable_flood=true`, `enable_uu_flood=false`) +> so the kernel sees the request directly on `Vlan10`, or (b) install +> static BD `arp-entry` records and let VPP answer ARP for the SVI IP +> from `bvi10`'s MAC without involving the kernel. ### 5.2 ARP Request from Untagged Member @@ -375,10 +607,14 @@ Wire (no tag, ARP who-has 10.0.0.1, dst ff:ff:ff:ff:ff:ff) CLONE → linux-cp-punt → Ethernet4 tap (untagged) → kernel receives ARP on Ethernet4 ORIGINAL → continues L2 feature chain: - → l2-learn → l2-fwd → l2-flood → bvi10 - → bvi-input → arp-input → linux-cp (Vlan10) + → l2-learn → l2-fwd → l2-flood + (flood group excludes bvi10 — see §3.7) + → replicated only to other BD members ``` +As in §5.1, the kernel sees the ARP exactly once (via the classifier clone +on `Ethernet4`). The BVI does not receive the flood copy. + ### 5.3 LLDP/LACP from Tagged Member LLDP and LACP use link-local ethertypes (0x88CC, 0x8809) — **not** 0x8100.