Sponsor: Dell Technologies
Dell PowerSwitch SN5610 NVIDIA Spectrum with Dell SONiC AI Fabric Congestion Mitigation Evaluation

Abstract

This Tolly report evaluates how the Dell PowerSwitch SN5610, built on NVIDIA Spectrum silicon and running Dell SONiC, handles congestion in AI Ethernet fabrics where traffic is both bursty and highly latency-sensitive. The test focused on RDMA over Converged Ethernet (RoCE) traffic, which is commonly used in AI training environments and can suffer significant performance degradation when congestion causes delay or packet loss across synchronized flows. 


Tolly built a rail-optimized three-switch fabric consisting of one spine and two leaf switches, with 400GbE inter-switch links and Dell PowerEdge servers generating both synthetic AI traffic and competing background traffic. The switches ran Dell SONiC Enterprise software version SONiC_OS_SN_4.5.1_Enterprise. Background traffic was generated with iPerf3, while simulated AI traffic was generated with the ib_send microbenchmark and tagged as RoCE so the fabric could identify and prioritize it appropriately.   


The evaluation deliberately oversubscribed a fabric link to show the effect of Dell’s dynamic congestion mitigation. With default settings and no AI prioritization enabled, the 400GbE inter-switch link was split almost evenly between background and AI traffic, with each receiving about 50% of available bandwidth. After dynamic prioritization was enabled, the switch cut competing background traffic from roughly 50% utilization to about 0.001% and reallocated essentially the full link to AI flows. Tolly measured AI throughput roughly doubling, from 199,048.06 Mbps to 398,113.5 Mbps, while background traffic dropped from 199,554.27 Mbps to just 6.74 Mbps.
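
The percentages quoted above can be cross-checked against the measured throughput figures. The short Python sketch below is illustrative only and is not part of the Tolly test tooling. It assumes the link capacity is the nominal 400GbE line rate (400,000 Mbps), and the names `LINK_CAPACITY_MBPS`, `link_share`, and `summarize` are hypothetical helpers introduced for this example.

```python
# Throughput figures quoted in the abstract, in Mbps.
# Assumption: shares are computed against the nominal 400GbE line rate.
LINK_CAPACITY_MBPS = 400_000.0

measurements = {
    "default": {"ai": 199_048.06, "background": 199_554.27},
    "prioritized": {"ai": 398_113.5, "background": 6.74},
}


def link_share(throughput_mbps, capacity_mbps=LINK_CAPACITY_MBPS):
    """Return throughput as a percentage of link capacity."""
    return 100.0 * throughput_mbps / capacity_mbps


def summarize(results):
    """Compute per-class and total link share for each test scenario."""
    summary = {}
    for scenario, flows in results.items():
        row = {name: link_share(mbps) for name, mbps in flows.items()}
        row["total"] = link_share(sum(flows.values()))
        summary[scenario] = row
    return summary


shares = summarize(measurements)

# Ratio of AI throughput with prioritization to AI throughput without it.
ai_gain = measurements["prioritized"]["ai"] / measurements["default"]["ai"]

for scenario, row in shares.items():
    print(
        f"{scenario:>11}: AI {row['ai']:.2f}%  "
        f"background {row['background']:.4f}%  total {row['total']:.2f}%"
    )
print(f"AI throughput gain: {ai_gain:.2f}x")
```

Under that assumption, AI traffic rises from about 49.8% to about 99.5% of the link, a gain of roughly 2x, while total utilization stays above 99% in both scenarios. Background traffic works out to about 0.0017% of nominal line rate, the same order of magnitude as the "about 0.001%" quoted above. The exact figure depends on the capacity value and rounding used.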


Overall, the report shows that Dell’s SONiC-based congestion mitigation can dynamically identify and prioritize RoCE-marked AI traffic, allowing the fabric to preserve bandwidth for critical AI workloads during oversubscription events. For AI training networks, this behavior helps reduce congestion-related disruption and improves the likelihood of consistent, high-performance collective operations across the fabric.