Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels

The Stride Swap: When computing P · W^T, we don’t actually have to physically transpose the large W matrix in memory. Instead, we swap the shapes and strides in W’s block pointer so that the rows of W are read as columns of W^T. This results in a “free” transpose that saves both time and VRAM.
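To make the idea concrete outside of Triton, here is a minimal plain-Python sketch of the same trick on a flat buffer. The `load` helper and the tiny 2x3 matrix are hypothetical and only for illustration; the actual kernel does this through the block pointer's shape and stride arguments.

```python
def load(buf, shape, strides, i, j):
    """Read element (i, j) of a 2-D view laid over a flat buffer."""
    rows, cols = shape
    if not (0 <= i < rows and 0 <= j < cols):
        raise IndexError((i, j))
    return buf[i * strides[0] + j * strides[1]]

# W is a 2x3 row-major matrix stored flat: [[1, 2, 3], [4, 5, 6]]
w_buf = [1, 2, 3, 4, 5, 6]
w_shape, w_strides = (2, 3), (3, 1)

# The "transpose": swap shape and strides, leave the buffer untouched.
wt_shape, wt_strides = w_shape[::-1], w_strides[::-1]

w_t = [
    [load(w_buf, wt_shape, wt_strides, i, j) for j in range(wt_shape[1])]
    for i in range(wt_shape[0])
]
print(w_t)  # [[1, 4], [2, 5], [3, 6]]
```

No element of `w_buf` is moved or copied; only the indexing arithmetic changes, which is why the transpose costs nothing extra in memory.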
Numerical Precision: It’s worth noting that while X and W might be in bfloat16, the accumulation of dW and dX via atomic_add is usually performed in float32 to prevent tiny rounding errors from building up across hundreds of rows.
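A quick way to see why the accumulator's precision matters is to simulate it. The sketch below emulates bfloat16 rounding in plain Python with `struct` (an illustration only, not how the kernel stores values) and sums the same small gradient 4,096 times in each precision. The gradient value and row count are made up for the demonstration.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round to bfloat16 (round-to-nearest-even); positive finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def accumulate(values, round_fn):
    """Sum values, rounding the accumulator after every add."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc

grad = to_bf16(1e-3)   # hypothetical per-row gradient contribution
rows = 4096            # hypothetical number of rows

bf16_sum = accumulate([grad] * rows, to_bf16)
f32_sum = accumulate([grad] * rows, to_f32)
exact = grad * rows

print(f"exact: {exact}, float32: {f32_sum}, bfloat16: {bf16_sum}")
```

Once the bfloat16 accumulator grows large enough that the gradient is smaller than half its spacing, every further add is rounded away and the sum stops growing. The float32 accumulator has enough mantissa bits to keep absorbing the small updates.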
Contention Note: While atomic_add is necessary for dW (because every program updates the same weights), dX is private to each program, meaning there is zero contention between program IDs for that specific tensor.
Atomic Add Masking: atomic_add doesn’t support block pointers. Therefore, we implement the pointer and mask logic for dW explicitly.
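The explicit pointer and mask logic can be sketched in plain Python. This is a sequential simulation, not Triton code: the loop over `program_id` stands in for concurrent programs, `+=` stands in for the atomic add, and `BLOCK`, `atomic_add_tile`, and the 3x5 shape are hypothetical names and sizes chosen for illustration.

```python
BLOCK = 4  # hypothetical tile edge, deliberately not a divisor of the matrix shape

def atomic_add_tile(dw_buf, n_rows, n_cols, row_start, col_start, tile):
    """Add one BLOCK x BLOCK tile into a flat, row-major dW buffer."""
    stride_row, stride_col = n_cols, 1
    for r in range(BLOCK):
        for c in range(BLOCK):
            row, col = row_start + r, col_start + c
            in_bounds = row < n_rows and col < n_cols  # the mask
            if in_bounds:
                # Explicit "pointer" arithmetic: base offset from strides.
                dw_buf[row * stride_row + col * stride_col] += tile[r][c]

n_rows, n_cols = 3, 5
dw = [0.0] * (n_rows * n_cols)
ones = [[1.0] * BLOCK for _ in range(BLOCK)]

# Two simulated programs each contribute a full gradient of ones to the same dW.
for program_id in range(2):
    for row_start in range(0, n_rows, BLOCK):
        for col_start in range(0, n_cols, BLOCK):
            atomic_add_tile(dw, n_rows, n_cols, row_start, col_start, ones)

print(dw)
```

Without the mask, a lane at column 5 of row 0 would compute offset 5 and silently land on the first element of row 1, which is exactly the kind of bug the explicit bounds check prevents.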

If you’ve trained or fine-tuned an LLM, you’ve probably hit a wall on the final step: the Cross-Entropy Loss.

The culprit is the logit bottleneck. To predict the next token, we project a hidden state into a massive vocabulary space. For Llama 3 (128,256 tokens), the weight matrix alone is over 525 million parameters. While that’s only ~1GB in bfloat16, the intermediate logit tensor is the real issue. For large batches, it can easily exceed 80GB of VRAM just to compute a single scalar loss.
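The arithmetic behind these numbers is easy to reproduce. In the sketch below, the vocabulary size comes from the text, the hidden size of 4,096 is an assumption matching Llama 3 8B, and the batch size and sequence length are hypothetical values picked only to show how the logit tensor scales.

```python
VOCAB_SIZE = 128_256   # Llama 3 vocabulary size
HIDDEN_SIZE = 4_096    # assumption: Llama 3 8B hidden dimension
BYTES_BF16 = 2
BYTES_FP32 = 4

def tensor_gb(rows, cols, bytes_per_element):
    """Size of a dense rows x cols tensor in gigabytes (1 GB = 1e9 bytes)."""
    return rows * cols * bytes_per_element / 1e9

weight_params = VOCAB_SIZE * HIDDEN_SIZE
weight_gb = tensor_gb(VOCAB_SIZE, HIDDEN_SIZE, BYTES_BF16)

# Hypothetical training batch: 16 sequences of 8,192 tokens each.
batch_size, seq_len = 16, 8_192
n_tokens = batch_size * seq_len

logits_bf16_gb = tensor_gb(n_tokens, VOCAB_SIZE, BYTES_BF16)
logits_fp32_gb = tensor_gb(n_tokens, VOCAB_SIZE, BYTES_FP32)

print(f"weight: {weight_params:,} params, {weight_gb:.2f} GB in bfloat16")
print(f"logits: {logits_bf16_gb:.1f} GB in bfloat16, {logits_fp32_gb:.1f} GB in float32")
```

The weight matrix size is fixed, but the logit tensor grows linearly with the number of tokens in the batch, so it quickly dwarfs the weights it was computed from.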

Optimising this layer is how libraries like Unsloth and Liger-Kernel achieve such massive memory reductions. In this article, we’ll build a fused Linear + Cross Entropy kernel from scratch in Triton. We will derive the math and implement a tiled forward and backward pass that slashes peak memory usage by 84%.
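As a preview of the core idea, here is a minimal plain-Python sketch of a tiled forward pass for a single token: the logits are computed one vocabulary chunk at a time and folded into a running log-sum-exp, so the full logit row never exists at once. This is not the Triton kernel itself. The function name, the `[vocab, hidden]` weight layout, and the chunk size are assumptions made for illustration.

```python
import math

def chunked_cross_entropy(hidden, weight, target, chunk=4):
    """Cross-entropy loss for one token without materialising all logits.

    hidden: list of D floats; weight: list of V rows of D floats (assumed layout);
    target: index of the correct vocabulary entry.
    """
    running_max = -math.inf
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(weight), chunk):
        rows = weight[start:start + chunk]
        logits = [sum(h * w for h, w in zip(hidden, row)) for row in rows]
        if start <= target < start + len(logits):
            target_logit = logits[target - start]
        # Online softmax update: rescale the old sum to the new maximum.
        new_max = max(running_max, max(logits))
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(l - new_max) for l in logits)
        running_max = new_max
    return running_max + math.log(running_sum) - target_logit

# Small deterministic example: V = 10, D = 3.
hidden = [0.5, -1.0, 2.0]
weight = [[math.sin(v + d) for d in range(3)] for v in range(10)]
target = 7

# Reference: materialise every logit, then take log-sum-exp directly.
full_logits = [sum(h * w for h, w in zip(hidden, row)) for row in weight]
reference = math.log(sum(math.exp(l) for l in full_logits)) - full_logits[target]

loss = chunked_cross_entropy(hidden, weight, target)
print(abs(loss - reference) < 1e-9)  # True
```

Only one chunk of logits is alive at any moment, so peak memory depends on the chunk size rather than on the vocabulary size.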

Note on Performance: This implementation is primarily educational. We prioritise mathematical clarity and readable Triton code by using global atomic operations. While it solves the memory bottleneck, matching production-grade speeds would require significantly more complex implementations, which are out of scope for this article.

This post is part of my Triton series. We’ll be using concepts like tiling and online softmax that we’ve covered previously. If these sound unfamiliar, I recommend catching up there first!