Kubernetes · GPU · AI Infrastructures

Why Quantization Improves LLM Inference

Isreal Urephu · June 19, 2026 · 2 min read

Before you tune vLLM, understand what actually governs inference speed. It comes down to two things:

  1. How fast weights can move from GPU memory (HBM/VRAM) to the on-chip compute buffer (SRAM)
  2. How fast the tensor cores can compute once the data arrives

Everything else is downstream of these two constraints.
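As a rough illustration of these two constraints, the sketch below estimates a lower bound on the time for one decode step as the slower of data movement and arithmetic. The function name and the specific numbers are assumptions for illustration: an 8B-parameter model in bf16 at batch size 1, about 2 TB/s of HBM bandwidth, and about 312 TFLOPS of dense BF16 compute (A100-class figures). It ignores KV cache reads and kernel overheads.

```python
def step_time_s(bytes_moved, flops, mem_bw_bytes_s, peak_flops_s):
    """Lower-bound time for one step: the slower of data movement and math."""
    t_memory = bytes_moved / mem_bw_bytes_s
    t_compute = flops / peak_flops_s
    bound = "memory" if t_memory >= t_compute else "compute"
    return max(t_memory, t_compute), bound

# Assumed workload: 8B parameters, bf16 weights, one sequence being decoded.
params = 8e9
bytes_moved = params * 2   # every weight (2 bytes) is read once per token
flops = params * 2         # one multiply and one add per weight

t, bound = step_time_s(bytes_moved, flops,
                       mem_bw_bytes_s=2e12,    # assumed ~2 TB/s HBM
                       peak_flops_s=312e12)    # assumed ~312 TFLOPS BF16
print(f"{t * 1e3:.1f} ms per token, {bound}-bound")
```

Under these assumptions the memory term is more than a hundred times larger than the compute term, which is why the first constraint tends to dominate single-stream decoding.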

Three layers of memory, three very different speeds

CPU DRAM is your server's main RAM, used for the OS, tokenisation, and I/O. Slow but large.

GPU HBM/VRAM is the memory on the GPU card itself, off-chip but connected to the GPU over a very wide bus. An A100 delivers around 2 TB/s here (the 80 GB model; the 40 GB model is closer to 1.6 TB/s). This is where model weights live for as long as the server is running.

GPU SRAM is the on-chip memory (registers, shared memory, caches) sitting right next to the tensor cores. An A100 delivers close to 19 TB/s here, but it only has tens of megabytes of it. This is where data sits while the actual computation happens.

What happens when an inference pod starts up

At startup, weights load from disk into CPU RAM, then transfer into GPU HBM via PCIe. This happens once and the weights stay in HBM for the lifetime of the server. For every token generated, those weights stream from HBM into SRAM for computation, and the on-chip copies are discarded once used, so the same weights must be read again for the next token. Only the KV cache grows and persists in HBM across tokens.

At small batch sizes, decode is memory-bound. Tensor cores sit idle waiting on data, not the other way around.
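One way to see why is to compare arithmetic intensity (FLOPs performed per byte read from HBM) with the hardware's ridge point (peak FLOPs divided by memory bandwidth). The sketch below is a simplified model, not a measurement: the function names are illustrative, the hardware figures (312 TFLOPS BF16, 2 TB/s) are assumed A100-class numbers, and KV cache traffic is ignored.

```python
def arithmetic_intensity(batch_size, bytes_per_weight):
    # Each weight read from HBM is reused for every sequence in the batch:
    # 2 FLOPs (multiply + add) per weight per sequence.
    return 2 * batch_size / bytes_per_weight

def ridge_point(peak_flops_s, mem_bw_bytes_s):
    # FLOPs per byte needed before compute becomes the bottleneck.
    return peak_flops_s / mem_bw_bytes_s

ridge = ridge_point(312e12, 2e12)  # assumed A100-class figures

for batch in (1, 16, 64, 256):
    ai = arithmetic_intensity(batch, bytes_per_weight=2)  # bf16 weights
    state = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOPs/byte vs ridge {ridge:.0f} -> {state}")
```

In this model a single decoding sequence performs about 1 FLOP per byte of bf16 weights, far below the ridge point, and only large batches get close to keeping the tensor cores busy.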

This is where quantisation helps, through two mechanisms:

Weight only (W8, W4): smaller weights move faster from HBM to SRAM. This reduces both the memory footprint and the time spent moving data.

Weights and activations (W8A8, FP8): tensor cores run native low-precision arithmetic. An A100 does INT8 matmuls at roughly 2x the throughput of BF16 (FP8 needs newer hardware such as Hopper or Ada GPUs). This makes computation faster too.
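To make the weight-only case concrete, here is a minimal sketch of symmetric absmax INT8 quantisation in plain Python. It is an illustration of the general idea only, not how vLLM or any particular quantisation library implements it; real schemes typically work per channel or per group and use fused GPU kernels. The function names and sample values are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric absmax quantisation: map floats to integers in [-127, 127]."""
    # The `or 1.0` guards against an all-zero input, where the scale would be 0.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

row = [0.12, -0.5, 0.33, 0.01]   # hypothetical weight values
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))

print("int8 values:", q)
print("max round-trip error:", max_err)
```

Each weight is now stored in one byte plus a shared scale, which is where the reduction in HBM traffic comes from; the cost is a rounding error of at most half a scale step per weight.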

The math for Llama 3.1 8B on an RTX 4090 (24 GB VRAM):

Precision  Weights                  Left on a 24 GB card
fp32       8B x 4 bytes   = 32 GB   does not fit
bf16       8B x 2 bytes   = 16 GB   8 GB left for KV cache
int8       8B x 1 byte    =  8 GB   16 GB left for KV cache
int4       8B x 0.5 bytes =  4 GB   20 GB left for KV cache
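The same arithmetic can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: it treats 1 GB as 10^9 bytes, rounds the model to exactly 8 billion parameters, and ignores activations, quantisation scales, and runtime overhead. The names are illustrative.

```python
VRAM_GB = 24          # RTX 4090
PARAMS_BILLION = 8    # Llama 3.1 8B, rounded
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_budget(params_billion, bytes_per_param, vram_gb):
    """Return (weight size in GB, VRAM left over in GB)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    return weights_gb, vram_gb - weights_gb

for name, nbytes in BYTES_PER_PARAM.items():
    weights_gb, left_gb = weight_budget(PARAMS_BILLION, nbytes, VRAM_GB)
    note = "does not fit" if left_gb < 0 else f"{left_gb:g} GB left for KV cache"
    print(f"{name}  {weights_gb:g} GB  {note}")
```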

Almost all of the memory saved on weights can go to KV cache (activations and runtime overhead take a small share). More KV cache means more concurrent users or longer contexts.
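To put a number on that, the sketch below estimates how many tokens of KV cache fit in the leftover memory. It assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and a 16-bit KV cache; treat these, the helper names, and the 10^9-byte gigabyte as assumptions, and expect real servers to reserve some memory for other things.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def max_cached_tokens(free_gb, per_token_bytes):
    """Whole tokens of KV cache that fit in free_gb (1 GB = 1e9 bytes)."""
    return int(free_gb * 1e9 // per_token_bytes)

# Assumed Llama 3.1 8B configuration with a bf16 KV cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8,
                               head_dim=128, bytes_per_value=2)

for precision, free_gb in (("bf16", 8), ("int8", 16), ("int4", 20)):
    tokens = max_cached_tokens(free_gb, per_token)
    print(f"{precision} weights: {free_gb} GB free -> ~{tokens} cached tokens")
```

With these assumptions each cached token costs 128 KiB, so going from bf16 to int8 weights roughly doubles the number of tokens that can be held across all active requests.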

Quantisation is a memory and throughput decision, not just a model optimisation.

#LLMInference #vLLM #GPU #MLInfrastructure #PlatformEngineering #Quantization
