
Your GPUs Aren't Slow. Your Topology Is.

Isreal Urephu · 2 min read

Most discussions about AI infrastructure focus on GPUs, model architectures, and FLOPs. But the difference between average and exceptional GPU utilization often comes down to the system around the GPU, and most of that is invisible until you go looking for it.

Last week I ran nvidia-smi topo -m and numactl --hardware on one of our GPU nodes, expecting a clean topology. Instead, I found this:

GPU0 sits on NUMA node 0. GPU1 sits on NUMA node 1. The relationship between them is "SYS", meaning any communication between the two GPUs has to cross the UPI link between the CPU sockets. The node distance table confirmed it: remote memory access costs roughly 2.1x a local access on this hardware.
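To make the 2.1x figure concrete, here is a minimal Python sketch that reads the "node distances" table out of numactl --hardware text and computes the worst remote-to-local ratio. The sample text is illustrative, with values chosen to match the ratio above; it is not the output from the node in this post. Keep in mind that these distances are relative costs reported by firmware, not measured latencies.

```python
# Illustrative `numactl --hardware` output for a two-socket node.
# These numbers are assumptions picked to match the ~2.1x ratio in the text.
SAMPLE_NUMACTL = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 128000 MB
node 1 cpus: 4 5 6 7
node 1 size: 128000 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""


def parse_node_distances(numactl_output: str) -> list[list[int]]:
    """Extract the square distance matrix from `numactl --hardware` text."""
    lines = numactl_output.splitlines()
    # The matrix rows begin two lines after the "node distances:" marker;
    # the line in between is the column header ("node   0   1 ...").
    start = next(
        i for i, line in enumerate(lines)
        if line.strip().startswith("node distances")
    ) + 2
    matrix = []
    for line in lines[start:]:
        label, sep, values = line.partition(":")
        if not sep or not label.strip().isdigit():
            break  # reached the end of the table
        matrix.append([int(v) for v in values.split()])
    return matrix


def worst_remote_ratio(matrix: list[list[int]]) -> float:
    """Largest distance in each row divided by that row's local distance."""
    return max(max(row) / row[i] for i, row in enumerate(matrix))


distances = parse_node_distances(SAMPLE_NUMACTL)
print(worst_remote_ratio(distances))  # 2.1 for the sample above
```

On a real node you would feed this the captured output of numactl --hardware instead of the sample string.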

What that means in practice: a workload split across both GPUs pays a cross-socket tax on every gradient sync, every memory copy, and every collective operation. It pays silently: no error, no log line, nothing in nvidia-smi -l that screams "this is the problem." The job just runs slower than the hardware suggests it should.

This is the part that doesn't show up in spec sheets. A server with two A100s and NVLink is a very different machine from two A100s that happen to be in the same chassis but on opposite sockets with no NVLink between them. From the outside, both look identical: "2x A100, 80GB."
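One way to tell those two machines apart programmatically is to read the GPU-to-GPU cells of the nvidia-smi topo -m matrix. The Python sketch below does that for two hand-written sample matrices; both samples are illustrative assumptions, not captured output, and the parser assumes plain text with no terminal escape codes. The link codes (NV#, SYS, and the PCIe variants) are the ones nvidia-smi documents in its legend.

```python
# Two illustrative matrices: same "2x A100" on paper, different wiring.
# Both are made-up samples in the general shape of `nvidia-smi topo -m`.
SAMPLE_NVLINK = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV12    0-15            0
GPU1    NV12     X      0-15            0
"""

SAMPLE_CROSS_SOCKET = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-15            0
GPU1    SYS      X      16-31           1
"""


def gpu_links(topo_output: str) -> dict[tuple[str, str], str]:
    """Map each unordered GPU pair to its link code (e.g. 'NV12', 'SYS')."""
    lines = [line for line in topo_output.splitlines() if line.strip()]
    # Only the GPU columns matter here; affinity columns are ignored.
    gpus = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        if tokens[0] not in gpus:
            continue  # skip NIC rows and the legend
        src = tokens[0]
        for dst, code in zip(gpus, tokens[1:]):
            if gpus.index(src) < gpus.index(dst):
                links[(src, dst)] = code
    return links


def describe(code: str) -> str:
    """Translate a link code into a rough plain-language category."""
    if code.startswith("NV"):
        return "NVLink"
    if code == "SYS":
        return "PCIe plus the inter-socket link"
    return "PCIe within one NUMA node"


for name, sample in [("nvlink box", SAMPLE_NVLINK),
                     ("cross-socket box", SAMPLE_CROSS_SOCKET)]:
    for pair, code in gpu_links(sample).items():
        print(name, pair, code, describe(code))
```

A check like this is cheap enough to run at node provisioning time, so a "SYS" pair is flagged before any job lands on it.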

The fix here isn't exotic; it's about awareness and placement. Pin single-GPU workloads to the CPU cores and memory on the same NUMA node as that GPU. For multi-GPU jobs, understand whether you're paying the UPI penalty before you blame the model, the dataloader, or NCCL. Kubernetes' Topology Manager (single-numa-node policy) can enforce this alignment automatically, but only if you know to turn it on. By default nothing enforces it, so a pod's GPU can land on one socket and its CPU and memory on the other, and you'll never know unless you check.
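Outside Kubernetes, the single-GPU pinning described above can be sketched in a few lines of Python on Linux. This is a rough sketch under stated assumptions: the helper names are hypothetical, the PCI address in the usage comment is a placeholder, and the sysfs files it reads are the standard Linux ones for a device's NUMA node and a node's CPU list. It sets CPU affinity only; strict memory binding needs something like numactl --cpunodebind=N --membind=N when launching the process.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel CPU list such as '0-15,32-47' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def gpu_numa_node(pci_address: str) -> int:
    """NUMA node of a PCI device, read from sysfs.

    `pci_address` must be in sysfs form (lowercase, 4-digit domain), e.g.
    '0000:3b:00.0'. Returns -1 when the kernel has no NUMA info for it.
    """
    path = Path(f"/sys/bus/pci/devices/{pci_address}/numa_node")
    return int(path.read_text())


def pin_to_numa_node(node: int) -> set[int]:
    """Restrict the current process to the CPUs of one NUMA node."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    cpus = parse_cpulist(path.read_text())
    os.sched_setaffinity(0, cpus)  # 0 means "this process"; Linux only
    return cpus


# Usage sketch (placeholder address; run before creating any CUDA context):
#   node = gpu_numa_node("0000:3b:00.0")
#   if node >= 0:
#       pin_to_numa_node(node)
```

Pinning before the framework initializes matters because worker threads and early allocations inherit whatever placement the process had when they were created.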

The broader point: buying faster GPUs is easy. Understanding the topology you already have (NUMA placement, PCIe paths, interconnect type) is where you find performance that's sitting on the table for free.
