64-GPU AI Computing Performance Comparison Test: H3C RoCE Network (S12500CR & S9855-G Series Switches) vs. InfiniBand Network
Abstract
New H3C Technologies commissioned Tolly to evaluate AI-computing performance in a 64-GPU environment, comparing an H3C RoCE fabric built with S12500CR and S9855-G series switches against an InfiniBand network using NVIDIA QM9700 switches. The main focus of the project was to determine whether this Ethernet-based RoCE architecture could deliver NCCL collective-communication and Llama3 training performance comparable to InfiniBand under the same large-scale AI workload conditions.
The H3C S12500CR is positioned as a flagship switch family for AI computing and large-model deployments, using a CLOS+ orthogonal hardware architecture intended to provide 100% lossless data channels and high-density, non-blocking server access. The H3C S9855-G series is presented as a high-performance 400GE/100GE Ethernet platform for high-end data centers and AIGC scenarios, with redundant hot-swappable power supplies and fans. In the tested RoCE fabric, H3C S12504CR and S12508CR switches served as spine devices, while H3C S9855-32DH-G switches served as leaf devices connecting the servers. The comparison InfiniBand (IB) network used NVIDIA QM9700 switches. Both environments used eight servers, each equipped with eight NVIDIA H20 GPUs and eight 400G NICs, running Ubuntu 22.04.4, CUDA 12.4, and NCCL 2.22.3.
In NCCL Ring-AllReduce testing, the H3C S12500CR and S9855-G RoCE fabric achieved an average bus bandwidth of 231.15GB/s, compared with 230.698GB/s for the InfiniBand network, giving the H3C fabric a slight average advantage of 0.20%. Results were closely matched across all tested message sizes, with RoCE ahead at 8MB, 64MB, 128MB, 512MB, 2GB, and 8GB, and nearly identical to IB at 16GB. Tolly characterizes this as essentially equivalent collective-communication performance in the tested 64-GPU environment.
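The 0.20% figure follows directly from the two averages quoted above. The following is a minimal Python sketch of that arithmetic, using only numbers from this abstract; the function and variable names are illustrative. The second helper shows the AllReduce bus-bandwidth convention used by NVIDIA's nccl-tests (bus bandwidth = algorithm bandwidth × 2(n−1)/n). That the report's figures follow this convention is an assumption, since the abstract does not say how bus bandwidth was derived.

```python
def relative_gain_pct(candidate: float, baseline: float) -> float:
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0


def allreduce_bus_bandwidth(alg_bw: float, n_ranks: int) -> float:
    """Bus bandwidth under the nccl-tests AllReduce convention.

    Assumption: the report uses this convention; the abstract does not say.
    """
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Average bus bandwidth in GB/s, as quoted in the abstract.
roce_avg_gb_per_s = 231.15
ib_avg_gb_per_s = 230.698

gain = relative_gain_pct(roce_avg_gb_per_s, ib_avg_gb_per_s)
print(f"RoCE vs. InfiniBand average bus bandwidth: {gain:+.2f}%")

# Scaling factor between algorithm and bus bandwidth for 64 ranks.
print(f"busbw/algbw factor at 64 GPUs: {allreduce_bus_bandwidth(1.0, 64):.5f}")
```

Under that convention, bus bandwidth at 64 ranks is just under twice the algorithm bandwidth, which is why it is the usual figure for comparing fabrics independently of rank count.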
The evaluation also included Llama3 70B training to assess a real AI workload beyond synthetic collectives. Here again, the two networks were nearly identical, differing by about 0.2% on both metrics. The InfiniBand fabric delivered an average iteration time of 17,724ms and throughput of 14.44 samples per second, while the H3C S12500CR and S9855-G RoCE fabric achieved 17,693ms and 14.47 samples per second. Overall, the report concludes that the H3C RoCE architecture provides performance and user experience comparable to InfiniBand for 64-GPU AI training, positioning the H3C Ethernet fabric as a viable alternative for large-scale AI deployments that need RoCE-based networking with performance close to InfiniBand.
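The training figures can be cross-checked in the same way. The sketch below, again in Python with illustrative names, computes the relative differences between the two fabrics and multiplies throughput by iteration time to get the samples processed per iteration. For both fabrics that product comes out at roughly 256, which suggests the same global batch size was used in both runs. This is an inference from the quoted numbers, not a figure stated in the abstract.

```python
# Llama3 70B training results as quoted in the abstract.
results = {
    "InfiniBand": {"iter_ms": 17_724, "samples_per_s": 14.44},
    "H3C RoCE": {"iter_ms": 17_693, "samples_per_s": 14.47},
}


def samples_per_iteration(iter_ms: float, samples_per_s: float) -> float:
    """Samples processed in one iteration (throughput x iteration time)."""
    return samples_per_s * iter_ms / 1000.0


ib, roce = results["InfiniBand"], results["H3C RoCE"]

# Positive values mean the RoCE fabric was faster than InfiniBand.
iter_time_reduction_pct = (ib["iter_ms"] - roce["iter_ms"]) / ib["iter_ms"] * 100.0
throughput_gain_pct = (
    (roce["samples_per_s"] - ib["samples_per_s"]) / ib["samples_per_s"] * 100.0
)

print(f"Iteration time reduction: {iter_time_reduction_pct:.2f}%")
print(f"Throughput gain:          {throughput_gain_pct:.2f}%")

for name, r in results.items():
    batch = samples_per_iteration(r["iter_ms"], r["samples_per_s"])
    print(f"{name}: ~{batch:.1f} samples per iteration")
```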
List of switches tested:
- H3C S12504CR: Spine switch used in the H3C RoCE fabric for the 64-GPU test environment. Part of the S12500CR series, it is positioned as a flagship AI-computing switch built on H3C’s CLOS+ orthogonal architecture for high-density, non-blocking data center fabrics.
- H3C S12508CR: Spine switch used alongside the S12504CR in the H3C RoCE fabric. It is part of the same S12500CR family and is designed for AI computing and large-scale model scenarios requiring high-bandwidth, lossless transport.
- H3C S9855-32DH-G: Leaf switch used in the H3C RoCE fabric to connect the servers. It is part of the H3C S9855-G series of high-density 400GE/100GE Ethernet switches for high-end data centers and AIGC computing environments.
- NVIDIA QM9700: InfiniBand switch used in both spine and leaf roles in the comparison IB network for the 64-GPU benchmark tests.