Nvidia has released new benchmark data showing its GB200 NVL72 rack-scale server can deliver up to a tenfold improvement in performance-per-watt for certain mixture-of-experts (MoE) inference workloads compared with prior-generation systems.

MoE architectures split work across many small “expert” submodels and activate only a subset for each token. That design cuts compute and energy use compared with dense models that fire all parameters every time, but it is harder to run efficiently: tokens must be routed to experts that may sit on different chips, which demands very fast links between them. DeepSeek’s open-source releases helped make MoE mainstream in 2025, and frontier models from OpenAI, Mistral and Moonshot AI now use the same pattern.
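
To make the routing pattern concrete, here is a minimal sketch of top-k expert routing in Python with NumPy. The expert count, top-k value and layer widths are illustrative assumptions, not the configuration of any model named above, and real systems shard the experts across many GPUs rather than looping over them on one device.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- not the configuration of any model named above.
D_MODEL = 64    # token embedding width
N_EXPERTS = 8   # number of small "expert" MLPs
TOP_K = 2       # experts activated per token

# Each expert is a tiny two-layer MLP; a dense model would run one big MLP for every token.
experts = [
    (rng.standard_normal((D_MODEL, 4 * D_MODEL)) * 0.02,
     rng.standard_normal((4 * D_MODEL, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02  # gating network


def moe_forward(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its TOP_K highest-scoring experts and mix their outputs."""
    logits = tokens @ router_w                          # (n_tokens, N_EXPERTS) routing scores
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]   # chosen experts per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        chosen = logits[t, top_idx[t]]
        weights = np.exp(chosen) / np.exp(chosen).sum()  # softmax over the chosen experts only
        for w, e in zip(weights, top_idx[t]):
            w1, w2 = experts[e]
            out[t] += w * (np.maximum(token @ w1, 0.0) @ w2)  # ReLU MLP expert
    return out


tokens = rng.standard_normal((4, D_MODEL))
print(moe_forward(tokens).shape)  # (4, 64): only 2 of the 8 experts ran for each token
```

In a multi-GPU deployment the experts chosen for a token often live on other devices, which is why the routing step turns into all-to-all traffic over the interconnect.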

Nvidia’s GB200 NVL72 system packs 72 Blackwell GPUs into one server with 30TB of shared memory and an NVLink fabric delivering about 130TB/s of bandwidth, so the chips behave like a single large processor. The company positions the design as enabling higher utilization and lower energy per token for large-scale inference.
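
For a sense of where the aggregate fabric figure comes from: assuming the roughly 1.8TB/s of NVLink bandwidth Nvidia publicly quotes per Blackwell GPU (a number not stated in this article), the rack-level total falls out of simple multiplication.

```python
# Rough sanity check on the fabric figure. The per-GPU NVLink bandwidth is an assumption
# drawn from Nvidia's public Blackwell specs, not a number given in this article.
GPUS = 72
NVLINK_PER_GPU_TBPS = 1.8

aggregate_tbps = GPUS * NVLINK_PER_GPU_TBPS
print(f"{aggregate_tbps:.1f} TB/s aggregate NVLink bandwidth")  # 129.6 TB/s, i.e. the ~130TB/s cited
```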

Nvidia said the resulting efficiency improvements can translate into substantially more throughput within the same power and rack footprint, though real-world gains will vary with model architecture, software optimizations and workload patterns. The company highlighted MoE models, including Moonshot AI’s Kimi K2 Thinking from China and DeepSeek’s models, as among those showing the largest measured improvements.
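
The arithmetic behind that claim is simple: at a fixed power budget, throughput scales directly with performance per watt. The sketch below uses invented placeholder numbers for the rack power and baseline efficiency; only the tenfold factor comes from Nvidia’s stated upper bound.

```python
# Every number here is an illustrative placeholder except the "up to tenfold" factor;
# the rack power budget and baseline efficiency are invented for the example.
RACK_POWER_W = 120_000           # hypothetical fixed rack power budget (120 kW)
BASELINE_TOKENS_PER_JOULE = 0.5  # hypothetical prior-generation efficiency
PERF_PER_WATT_GAIN = 10.0        # upper bound Nvidia claims for some MoE workloads

baseline_tps = BASELINE_TOKENS_PER_JOULE * RACK_POWER_W   # tokens/s = (tokens/J) * (J/s)
new_tps = baseline_tps * PERF_PER_WATT_GAIN
print(f"Same {RACK_POWER_W / 1000:.0f} kW rack: {baseline_tps:,.0f} -> {new_tps:,.0f} tokens/s")
```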

The data underscores the strategic tension around MoE: the architecture can reduce total GPU demand for training, potentially threatening Nvidia’s core business. But the company argues that its tightly integrated stack still offers a clear advantage when serving those models at scale, even as AMD prepares a competing multichip server expected next year.