
[bfd] Add SAI-level BFD-coupled NHG fast switchover HLD #2371

Open
yuezhoujk wants to merge 4 commits into sonic-net:master from yuezhoujk:bfd-sai-nhg-fast-switchover

@yuezhoujk

Collaborator

Add High Level Design document for SAI-level BFD-coupled Next Hop Group fast switchover. This feature enables ASIC hardware to autonomously remove or add ECMP group members based on BFD session state changes without control plane involvement, achieving microsecond-level convergence.

Key design points:

  • New SAI attribute SAI_NEXT_HOP_ATTR_BFD_DISCRIMINATOR
  • ASIC behavioral spec for autonomous ECMP member management
  • Primary/standby role-based make-before-break switchover
  • SONiC control plane changes (fpmsyncd, NhgOrch, Srv6Orch, BfdOrch)
  • SAI SDK implementation reference

Signed-off-by: yuezhou.jk <yuezhou.jk@alibaba-inc.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.


### 4.3.4 BFD UP — Restore Forwarding

Contributor


Is this a safe operation? Routing protocols may implement route flap damping or hold-off timers, and might enforce, for example, a 30-second wait before declaring a link viable again. If the hardware autonomously re-adds the member to the ECMP group the millisecond BFD goes UP, it will blackhole traffic on a flapping link and bypass the control plane's stability policies.

Collaborator Author


Thanks for raising this. The current implementation already handles BFD UP through the control plane path (pathd → zebra → fpmsyncd → NhgOrch) — there is no autonomous restore in the SAI SDK. Section 4.3.4 has been updated to clarify this.
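
The split described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK code: `BfdState`, `Action`, and `on_bfd_state_change` are hypothetical names, and the only behaviour taken from this thread is that BFD DOWN is handled locally while BFD UP is left to the control plane path.

```cpp
// Hypothetical BFD session states; names are illustrative only.
enum class BfdState { Down, Up };

// What the fast path decides to do for a given notification.
enum class Action {
    DeactivateMemberLocally,  // fast path: take the member out of forwarding
    DeferToControlPlane       // slow path: pathd -> zebra -> fpmsyncd -> NhgOrch
};

// Assumed decision logic: only BFD DOWN triggers local action.
// BFD UP is never acted on here, so flap damping and hold-off
// timers in the routing stack stay in charge of restoration.
Action on_bfd_state_change(BfdState new_state) {
    if (new_state == BfdState::Down) {
        return Action::DeactivateMemberLocally;
    }
    return Action::DeferToControlPlane;
}
```

Keeping the UP branch free of side effects is what addresses the flapping-link concern: a session that bounces back up changes nothing in forwarding until the control plane reprograms the group.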

```
        }
    }
    if (!members.empty()) {
        removeMembers(members);
```

Contributor


Instead of removing the member, would it be better to dynamically set its weight to 0, or to update an admin-state/active attribute? That preserves the object, so NhgOrch state remains intact and the ASIC still holds a reference to it. The members can then be removed by pathd later, which would be a cleaner approach.

Collaborator Author


Yes, the current implementation already takes this approach — on BFD DOWN we use weight reduction and mark the member inactive, preserving the SAI object. Updated the HLD to reflect this more clearly in Section 7.2.
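
A minimal sketch of the preserve-the-object approach, for illustration only. The `NhgMember` struct, its fields, and `deactivate_members` are hypothetical and not taken from the SDK. The sketch also assumes the weight is reduced all the way to 0, which this thread does not confirm.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of one next hop group member.
struct NhgMember {
    uint32_t bfd_discriminator;  // BFD session this member is coupled to (assumed field)
    uint32_t weight;             // weight currently used for forwarding
    uint32_t configured_weight;  // weight requested by the control plane
    bool active;                 // false once BFD DOWN has been seen
};

// On BFD DOWN, stop forwarding to every member coupled to the session
// but keep the entries, so the control plane still sees the same set.
// Returns the number of members that were deactivated.
std::size_t deactivate_members(std::vector<NhgMember>& members,
                               uint32_t discriminator) {
    std::size_t changed = 0;
    for (auto& m : members) {
        if (m.bfd_discriminator == discriminator && m.active) {
            m.weight = 0;      // assumption: reduced to zero, not removed
            m.active = false;  // mirrors the "mark inactive" step
            ++changed;
        }
    }
    return changed;
}
```

Because the vector never shrinks, a later control plane update can still address the same member, and `configured_weight` records the value to go back to if the control plane decides to restore it.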

The previous version incorrectly described BFD UP as triggering
autonomous ASIC-level NHG member restoration. The actual SAI SDK
code only handles BFD DOWN in bfd_notification_handler() — there
is no BFD UP auto-restore logic. BFD UP recovery is driven entirely
by the control plane (pathd → zebra → fpmsyncd → NhgOrch), which
respects routing protocol stability policies such as flap damping.

Signed-off-by: yuezhou.jk <yuezhou.jk@alibaba-inc.com>

@yuezhoujk yuezhoujk marked this pull request as draft June 10, 2026 02:58
processBfdDown()/onBfdDown() do not exist in actual code. BfdOrch
only calls removeBfdPeer() on BFD DOWN, it does not call NhgOrch.
The fast switchover is handled entirely in SAI SDK layer via
reduce_member_weight() + m_active=false (preserving SAI objects).

Also add English presentation document for 15-min feature overview.

Signed-off-by: yuezhou.jk <yuezhou.jk@alibaba-inc.com>

- Event 2 in Section 4.3.6: NH-B is still in m_valid_members when
  switch_to(STANDBY) is called, so Step 2 does remove NH-B
- Section 1.2: bfdsyncd uses TCP socket (PF_INET), not Unix socket

Signed-off-by: yuezhou.jk <yuezhou.jk@alibaba-inc.com>

@yuezhoujk yuezhoujk force-pushed the bfd-sai-nhg-fast-switchover branch from aaae97b to 0797db9 Compare June 10, 2026 03:17

@yuezhoujk yuezhoujk force-pushed the bfd-sai-nhg-fast-switchover branch from 0797db9 to e5e72bd Compare June 10, 2026 04:00

@yuezhoujk yuezhoujk marked this pull request as ready for review June 10, 2026 09:04