64 GPU AI Computing Performance Comparison Test: H3C DDC-based RoCE Switch Network vs. InfiniBand Network
Abstract
New H3C Technologies commissioned Tolly to compare AI-computing performance across two network approaches in a 64-GPU environment: an H3C DDC-based RoCE (RDMA over Converged Ethernet) switch fabric versus InfiniBand (IB), with an additional comparison of H3C’s DDC architecture against traditional RoCE that relies on ECMP (equal-cost multipath) load balancing. The project’s main goal was to determine whether DDC-based RoCE could deliver AI training and collective-communication performance comparable to IB while improving traffic distribution and bus bandwidth relative to conventional Ethernet load balancing.
The test environment used eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, for a total of 64 GPUs. Tolly evaluated NVIDIA NCCL collective operations and Llama3 70B training performance across two topologies: an InfiniBand network built with NVIDIA QM9700 spine and leaf switches, and an H3C RoCE fabric using S12500AI-96B-NCFK spine switches and S12500AI-18D48B-NCPK leaf switches. Software components included Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.
In NCCL Ring-AllReduce testing, the H3C DDC-based RoCE network delivered an average bus bandwidth of 231.01 GB/s versus 230.698 GB/s for InfiniBand, a difference of about 0.14% in RoCE’s favor and effectively a tie. Results were also close across most message sizes, with RoCE marginally outperforming IB at several larger sizes, including 512 MB, 2 GB, 4 GB, 8 GB, and 16 GB. In NCCL Ring-Alltoall testing, H3C DDC-based RoCE again produced nearly identical average performance, posting 36.7648 GB/s versus 36.532 GB/s for IB, an average advantage of 0.64%. At message sizes from 2 GB through 16 GB, the H3C RoCE fabric consistently outperformed the IB configuration in bus bandwidth.
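For readers unfamiliar with the bus-bandwidth metric, the sketch below shows how the nccl-tests benchmark suite derives it from algorithm bandwidth (data size divided by elapsed time): AllReduce uses a correction factor of 2(n-1)/n and Alltoall uses (n-1)/n, where n is the number of ranks. The abstract does not say how its figures were computed, so treating them as following this convention is an assumption, and the function names here are illustrative only.

```python
# Sketch of the nccl-tests bus-bandwidth convention.
# Assumption: the report's "bus bandwidth" follows this convention;
# the abstract itself does not state the formula.

def busbw_factor(collective: str, n_ranks: int) -> float:
    """Correction factor that converts algorithm bandwidth to bus bandwidth."""
    if n_ranks < 2:
        raise ValueError("need at least two ranks")
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective == "alltoall":
        return (n_ranks - 1) / n_ranks
    raise ValueError(f"unsupported collective: {collective}")


def bus_bandwidth(alg_bw: float, collective: str, n_ranks: int) -> float:
    """Bus bandwidth in the same unit as alg_bw (for example GB/s)."""
    return alg_bw * busbw_factor(collective, n_ranks)


def algorithm_bandwidth(bus_bw: float, collective: str, n_ranks: int) -> float:
    """Inverse of bus_bandwidth: recover algorithm bandwidth."""
    return bus_bw / busbw_factor(collective, n_ranks)


N_RANKS = 64  # eight servers with eight GPUs each, as in the test bed

print(busbw_factor("all_reduce", N_RANKS))  # 1.96875
print(busbw_factor("alltoall", N_RANKS))    # 0.984375

# Algorithm bandwidth implied by the reported AllReduce average,
# valid only under the assumption stated above.
implied_alg_bw = algorithm_bandwidth(231.01, "all_reduce", N_RANKS)
print(f"{implied_alg_bw:.2f} GB/s")
```

Because the factor depends only on the collective and the rank count, it cancels out when two fabrics are compared at the same scale, so the percentage differences quoted in the abstract hold for either metric.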
Tolly also tested actual large-language-model training using Llama3 70B. Here the two fabrics were effectively tied: InfiniBand delivered an average iteration time of 17,724 ms and 14.44 samples per second, while the H3C DDC RoCE network delivered 17,676 ms and 14.48 samples per second, a difference of under 0.3% on both measures. This result supported Tolly’s conclusion that DDC-based RoCE can provide AI training performance and user experience comparable to InfiniBand in the same 64-GPU operational scenario.
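The two training metrics are related: throughput in samples per second is the global batch size divided by the iteration time. The abstract does not state the batch size, so the sketch below infers it from the reported figures. The resulting value of roughly 256 is an inference, not a number from the report, and the function names are illustrative.

```python
# Relating iteration time to training throughput.
# Assumption: throughput = global batch size / iteration time. The global
# batch size is inferred from the reported numbers, not stated in the abstract.

def implied_global_batch(iteration_ms: float, samples_per_sec: float) -> float:
    """Batch size implied by an iteration time and a throughput figure."""
    return samples_per_sec * iteration_ms / 1000.0


def samples_per_second(global_batch: int, iteration_ms: float) -> float:
    """Throughput for a given global batch size and iteration time."""
    return global_batch / (iteration_ms / 1000.0)


reported = {
    "InfiniBand": {"iteration_ms": 17_724, "samples_per_sec": 14.44},
    "H3C DDC RoCE": {"iteration_ms": 17_676, "samples_per_sec": 14.48},
}

for fabric, r in reported.items():
    batch = implied_global_batch(r["iteration_ms"], r["samples_per_sec"])
    print(f"{fabric}: implied global batch of about {batch:.1f}")

# Both fabrics point to a batch of roughly 256; recompute throughput from it.
INFERRED_BATCH = 256  # inferred value, see the assumption above
for fabric, r in reported.items():
    sps = samples_per_second(INFERRED_BATCH, r["iteration_ms"])
    print(f"{fabric}: {sps:.2f} samples/s")
```

Under that inferred batch size, the recomputed throughput rounds to the same 14.44 and 14.48 samples per second quoted above, which suggests the two reported metrics are internally consistent.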
The clearest differentiation appeared in the comparison between H3C DDC and conventional ECMP-based RoCE. In NCCL Ring-Alltoall testing, H3C DDC achieved an average bus bandwidth of 36.7468 GB/s versus 29.6646 GB/s for ECMP, an average advantage of 23.87%. Gains were especially large at larger message sizes, reaching 60.52% at 1 GB, 107.75% at 2 GB, and more than 35% at 16 GB. Overall, the report presents H3C’s DDC architecture as an Ethernet-based AI fabric design that can match InfiniBand-class results in these tests while materially outperforming traditional ECMP RoCE in demanding collective-communication workloads.
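The percentage advantages quoted in this abstract can be reproduced from the reported averages as a relative gain over the baseline fabric. The sketch below is a minimal check using only the figures given above; the helper name `relative_gain_pct` is illustrative. Note that the abstract gives the DDC Alltoall average as 36.7648 GB/s in the InfiniBand comparison and 36.7468 GB/s in the ECMP comparison. Each figure reproduces its own stated percentage, so the sketch keeps both as written rather than guessing whether they come from separate runs or a transposition.

```python
# Reproducing the quoted percentage advantages from the reported averages.
# All bandwidth values are copied from the abstract; nothing is measured here.

def relative_gain_pct(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


comparisons = [
    # (label, DDC RoCE average GB/s, baseline average GB/s)
    ("AllReduce, DDC vs. InfiniBand", 231.01, 230.698),
    ("Alltoall, DDC vs. InfiniBand", 36.7648, 36.532),
    ("Alltoall, DDC vs. ECMP RoCE", 36.7468, 29.6646),
]

for label, ddc, baseline in comparisons:
    print(f"{label}: {relative_gain_pct(ddc, baseline):+.2f}%")
```

The same helper applies to per-message-size results such as the 1 GB and 2 GB gains, but the abstract reports only the percentages for those sizes, not the underlying bandwidths.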