
AI Model Training and Inference Performance with NVIDIA GPUs: Huawei Xinghe AI Data Center Network vs. Industry Ethernet Network

Sponsor: Huawei Technologies Co., Ltd.

Abstract

Huawei Technologies commissioned Tolly to evaluate AI model training and inference performance on the Huawei Xinghe AI Data Center Network versus an industry Ethernet network, with both fabrics serving the same NVIDIA GPU-based compute environment. The project compared network behavior and application-level AI performance across NCCL collective communication, Llama 2 model training, DeepSeek inference, and concurrent training-plus-inference workloads, with particular attention to Huawei’s AI accelerator NSLB load-balancing algorithm.


Tolly built an eight-server test bed arranged in a two-spine, four-leaf topology. The Huawei fabric used CE9866-128DQ and XH9230-128DQ switches, while the comparison environment used 400GE Ethernet switches from another industry vendor. Each server was equipped with eight NVIDIA H100 80GB HBM3 GPUs and eight MCX75310AAS-NEAT NICs. According to the topology diagram on page 2, each server connected to the leaf layer over 8 x 400GE links, and the spine-leaf interconnect also used 8 x 400GE links.
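
As a rough sizing aid, the test-bed figures above can be turned into totals with a few lines of Python. This is a minimal sketch, not part of the report: the function name `testbed_totals` is hypothetical, it assumes one 400GE port per NIC (consistent with eight NICs and 8 x 400GE links per server), and it uses raw line rate with no allowance for protocol overhead.

```python
# Test-bed sizes taken from the abstract; one 400GE port per NIC is an assumption.
SERVERS = 8
GPUS_PER_SERVER = 8
NICS_PER_SERVER = 8
LINK_GBPS = 400  # nominal 400GE line rate in gigabits per second


def testbed_totals(servers=SERVERS, gpus_per_server=GPUS_PER_SERVER,
                   nics_per_server=NICS_PER_SERVER, link_gbps=LINK_GBPS):
    """Return GPU/NIC counts and raw (overhead-free) network line rates."""
    per_server_gbps = nics_per_server * link_gbps
    return {
        "gpus": servers * gpus_per_server,
        "nics": servers * nics_per_server,
        "per_server_gbps": per_server_gbps,
        # 8 bits per byte: gigabits/s -> gigabytes/s
        "per_server_gbytes_per_s": per_server_gbps / 8,
        "aggregate_tbps": servers * per_server_gbps / 1000,
    }


if __name__ == "__main__":
    for key, value in testbed_totals().items():
        print(f"{key}: {value}")
```

Under these assumptions the test bed has 64 GPUs and 64 NICs, with 3200 Gb/s (400 GB/s) of raw server-to-leaf line rate per server.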


In NCCL Ring AllReduce testing, the Huawei fabric delivered higher effective bandwidth in both load-balancing modes. Under per-flow load balancing with a 4GB message size, effective bandwidth reached 389.06 GB/s versus 253.32 GB/s for the other Ethernet vendor, a 53.58% advantage. Under per-packet load balancing with simultaneous background tasks, Huawei achieved 374.63 GB/s versus 334.05 GB/s, a 12.15% improvement.


Application-level AI tests also favored Huawei. In Llama2-13B training under per-flow load balancing, average performance on the Huawei fabric reached 35.99 TFLOPs versus 32.96 TFLOPs for the comparison fabric, a 9.19% gain. With background NCCL Ring tasks under per-packet load balancing, Huawei achieved 36.89 TFLOPs versus 35.62 TFLOPs, improving training performance by 3.57%.
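
The percentage gains quoted for the NCCL and Llama2-13B tests appear to be relative improvements over the comparison fabric, that is, (Huawei / other - 1) x 100. The sketch below recomputes them from the paired measurements in the abstract. The helper name `percent_gain` and the dictionary labels are illustrative only, and the formula is an assumption about how the report derived its figures.

```python
def percent_gain(candidate: float, baseline: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate / baseline - 1.0) * 100.0


# (Huawei, comparison fabric) pairs as quoted in the abstract.
RESULTS = {
    "NCCL AllReduce, per-flow (GB/s)": (389.06, 253.32),
    "NCCL AllReduce, per-packet + background (GB/s)": (374.63, 334.05),
    "Llama2-13B, per-flow (TFLOPs)": (35.99, 32.96),
    "Llama2-13B, per-packet + background (TFLOPs)": (36.89, 35.62),
}

if __name__ == "__main__":
    for label, (huawei, other) in RESULTS.items():
        print(f"{label}: {percent_gain(huawei, other):.2f}%")
```

Rounded to two decimals, the four computed values match the 53.58%, 12.15%, 9.19%, and 3.57% figures quoted above.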


Inference and mixed-workload results were similarly strong. In DeepEP multi-Prefill testing, Huawei improved Prefill throughput by 32.88% and 31.39%, depending on node arrangement. In Decode plus background-task testing, Huawei improved Decode throughput by 13.6%. In the integrated DeepSeek Prefill plus NCCL AllReduce scenario, Huawei improved Prefill throughput by 33.86% and increased AllReduce throughput by 31.15%. Overall, the report concludes that Huawei Xinghe AI Data Center Network delivered consistently higher AI training and inference performance than the comparison Ethernet environment across the tested workloads.