64-GPU AI Computing Performance Comparison Test: H3C DDC-based RoCE Switch Network vs. InfiniBand Network
Abstract
New H3C Technologies commissioned Tolly to evaluate AI-computing performance in a 64-GPU environment. The evaluation compared an H3C DDC-based RoCE switch fabric with an InfiniBand (IB) network, and H3C’s DDC architecture with traditional ECMP-based RoCE. The project set out to determine whether DDC-based Ethernet could deliver NCCL collective-communication and Llama3 training performance comparable to IB, while handling AI-workload traffic better than conventional ECMP load balancing.
The test bed used eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, for a total of 64 GPUs. In the H3C DDC RoCE design, the spine layer used H3C S12500AI-128EP-NCFN switches in the NCF role and the leaf layer used H3C S12500AI-36DH20EP-NCPN switches in the NCP role. The InfiniBand comparison network used NVIDIA QM9700 switches. The software environment included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.
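The test-bed totals follow directly from the per-server counts. The sketch below works them out, assuming every NIC runs at its nominal 400 Gb/s line rate; the variable names are illustrative and do not come from the report.

```python
# Per-server counts as stated for the test bed.
SERVERS = 8
GPUS_PER_SERVER = 8
NICS_PER_SERVER = 8
NIC_GBPS = 400  # assumption: nominal line rate per NIC, in Gb/s

total_gpus = SERVERS * GPUS_PER_SERVER
total_nics = SERVERS * NICS_PER_SERVER
per_server_gbps = NICS_PER_SERVER * NIC_GBPS
aggregate_tbps = total_nics * NIC_GBPS / 1000

print(f"GPUs: {total_gpus}, NICs: {total_nics}")
print(f"Per-server NIC capacity: {per_server_gbps} Gb/s")
print(f"Aggregate NIC capacity: {aggregate_tbps} Tb/s")
```

These are nominal capacities only; measured NCCL bandwidth depends on the collective, the message size, and the fabric.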
In NCCL Ring-AllReduce testing, the H3C DDC-based RoCE fabric achieved an average bus bandwidth of 231.802GB/s, compared with 230.698GB/s for InfiniBand, effectively equal overall with a slight average RoCE advantage of 0.48%. Results were closely matched across the tested message sizes, with H3C RoCE outperforming IB at several larger sizes including 512MB, 2GB, 8GB, and 16GB. In NCCL Ring-Alltoall testing, the H3C DDC RoCE network posted an average bus bandwidth of 37.4611GB/s versus 36.532GB/s for InfiniBand, a 2.54% advantage. The RoCE fabric showed especially strong gains at larger message sizes, reaching 11.66% above IB at 16GB.
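The percentage advantages quoted above can be reproduced from the average bus-bandwidth figures. The sketch below does that and also shows how bus bandwidth relates to algorithm bandwidth, assuming the figures follow the nccl-tests convention (a factor of 2(n-1)/n for AllReduce and (n-1)/n for Alltoall); the summary does not state which tool produced them, and the function names are illustrative.

```python
def relative_advantage_pct(candidate: float, baseline: float) -> float:
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0

def allreduce_bus_factor(n_ranks: int) -> float:
    """busbw / algbw for AllReduce under the nccl-tests convention (assumed)."""
    return 2 * (n_ranks - 1) / n_ranks

def alltoall_bus_factor(n_ranks: int) -> float:
    """busbw / algbw for Alltoall under the nccl-tests convention (assumed)."""
    return (n_ranks - 1) / n_ranks

# Average bus bandwidths in GB/s, as quoted in the abstract.
allreduce_adv = relative_advantage_pct(231.802, 230.698)
alltoall_adv = relative_advantage_pct(37.4611, 36.532)

print(f"AllReduce RoCE advantage: {allreduce_adv:.2f}%")
print(f"Alltoall RoCE advantage:  {alltoall_adv:.2f}%")
print(f"AllReduce bus factor at 64 ranks: {allreduce_bus_factor(64)}")
print(f"Alltoall bus factor at 64 ranks:  {alltoall_bus_factor(64)}")
```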
Tolly also evaluated actual large-language-model training using Llama3 70B. Here again, the two fabrics were nearly identical. The InfiniBand network delivered an average iteration time of 17,724ms and 14.44 samples per second, while the H3C DDC RoCE network achieved 17,678ms and 14.48 samples per second, a difference of 46ms per iteration (about 0.26%). These results supported Tolly’s conclusion that DDC-based RoCE can provide AI-training performance comparable to IB in the same 64-GPU scenario.
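The two Llama3 metrics can be cross-checked against each other: samples per second multiplied by iteration time gives the number of samples processed per iteration. The sketch below runs that check. The resulting figure of roughly 256 samples per iteration is an inference from the quoted numbers, not a value stated in this summary.

```python
# Quoted Llama3 70B results: (iteration time in ms, samples per second).
results = {
    "InfiniBand": (17_724, 14.44),
    "H3C DDC RoCE": (17_678, 14.48),
}

# Samples per iteration implied by each pair of figures.
implied_batch = {
    name: rate * (iter_ms / 1000.0) for name, (iter_ms, rate) in results.items()
}

ib_ms = results["InfiniBand"][0]
roce_ms = results["H3C DDC RoCE"][0]
iteration_gain_pct = (ib_ms - roce_ms) / ib_ms * 100.0

for name, batch in implied_batch.items():
    print(f"{name}: about {batch:.1f} samples per iteration")
print(f"RoCE iteration time is {iteration_gain_pct:.2f}% shorter than IB")
```

Both fabrics imply the same per-iteration sample count, which is consistent with the two runs using the same training configuration.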
The clearest difference emerged in the comparison between H3C DDC and ECMP-based RoCE. In NCCL Ring-Alltoall testing, DDC achieved average bus bandwidth of 37.4611GB/s versus 29.6646GB/s for ECMP, an advantage of 26.28%. Gains became particularly large at higher message sizes, including 58.98% at 1GB, 107.26% at 2GB, and 63.08% at 4GB. Overall, the report presents H3C’s DDC architecture as a high-performance Ethernet AI fabric design that can match InfiniBand-class results in these tests while substantially outperforming traditional ECMP-based RoCE for demanding collective-communication workloads.
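One common explanation for ECMP's weaker Alltoall results is hash collisions: each flow is pinned to a single uplink, so several large flows can land on the same link while others sit idle. The toy model below illustrates that effect. It is a sketch under loud assumptions: this summary does not describe DDC's internal forwarding, so the even-spreading case is a hypothetical ideal rather than H3C's implementation, the random link choice is a stand-in for a real 5-tuple hash, and the flow and link counts are illustrative.

```python
import random
from collections import Counter

def ecmp_max_link_load(n_flows: int, n_links: int, rng: random.Random) -> int:
    """Busiest-link load (in flows) when each flow is pinned to one uplink."""
    # Assumption: a uniform random choice stands in for the ECMP hash.
    loads = Counter(rng.randrange(n_links) for _ in range(n_flows))
    return max(loads.values())

def sprayed_max_link_load(n_flows: int, n_links: int) -> float:
    """Busiest-link load when every flow is split evenly over all uplinks."""
    return n_flows / n_links

rng = random.Random(0)   # fixed seed so repeated runs give the same result
n_flows, n_links = 8, 8  # illustrative sizes, not taken from the report

trials = [ecmp_max_link_load(n_flows, n_links, rng) for _ in range(1000)]
mean_ecmp_peak = sum(trials) / len(trials)
even_peak = sprayed_max_link_load(n_flows, n_links)

print(f"Mean busiest-link load, hashed per flow: {mean_ecmp_peak:.2f} flows")
print(f"Busiest-link load, evenly spread:        {even_peak:.2f} flows")
```

With equal-sized flows, completion time is governed by the busiest link, so any collision under per-flow hashing stretches the collective even when total capacity is sufficient.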
The switches used in this test:
- H3C S12500AI-128EP-NCFN — Spine switch used in the H3C DDC-based RoCE fabric, serving in the NCF role of the DDC architecture. Each spine-to-leaf connection in this topology used 10 x 800G links.
- H3C S12500AI-36DH20EP-NCPN — Leaf switch used in the H3C DDC-based RoCE fabric, serving in the NCP role of the DDC architecture. Each leaf-to-server connection in this topology used 4 x 400G links.
- NVIDIA QM9700 — InfiniBand switch used in both spine and leaf roles in the comparison IB fabric. In that topology, each spine-to-leaf connection used 8 x 400G links and each leaf-to-server connection used 2 x 400G links.
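The link bundles listed above translate into the following nominal capacities per connection. This is a minimal sketch using only the link counts and speeds given in the list. It does not compute oversubscription ratios, because the number of spine and leaf switches is not stated in this summary.

```python
# (fabric, connection type) -> (number of links, speed per link in Gb/s)
bundles = {
    ("H3C DDC RoCE", "spine-to-leaf"): (10, 800),
    ("H3C DDC RoCE", "leaf-to-server"): (4, 400),
    ("InfiniBand", "spine-to-leaf"): (8, 400),
    ("InfiniBand", "leaf-to-server"): (2, 400),
}

# Nominal capacity of each bundle in Gb/s.
capacity_gbps = {key: links * speed for key, (links, speed) in bundles.items()}

for (fabric, connection), gbps in capacity_gbps.items():
    print(f"{fabric:13s} {connection:15s} {gbps:5d} Gb/s")
```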