    Learning Triton One Kernel at a Time: Matrix Multiplication

    By ProfitlyAI | October 14, 2025


    Matrix multiplication is undoubtedly the most common operation carried out by GPUs. It is the fundamental building block of linear algebra and shows up across a wide spectrum of fields such as graphics, physics simulations and scientific computing, while being ubiquitous in machine learning.

    In today's article, we'll break down the conceptual implementation of general matrix-matrix multiplication (GEMM) while introducing several optimisation concepts such as tiling and memory coalescing. Finally, we'll implement GEMM in Triton!

    This article is the second in a series on Triton and GPU kernels. If you're not familiar with Triton or need a refresher on GPU fundamentals, check out the previous article! All of the code showcased in this article is available on GitHub.

    Disclaimer: all of the following figures and animations were made by the author unless stated otherwise.

    Naive GEMM

    Let's start simple: we want to multiply two matrices X and Y with shapes (M,N) and (N,K) respectively. The output matrix Z=X@Y will therefore have shape (M,K).

    This operation involves computing the dot products of all pairs of rows and columns in X and Y respectively. A straightforward NumPy implementation might look something like this:
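    Since the full listing lives in the GitHub repository, here is a minimal sketch of what such a naive implementation might look like (the function name and exact loop structure are illustrative assumptions):

    import numpy as np

    def naive_matmul(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        M, N = X.shape
        N2, K = Y.shape
        assert N == N2, "inner dimensions must match"
        Z = np.zeros((M, K), dtype=X.dtype)
        for m in range(M):          # for every row of X ...
            for k in range(K):      # ... and every column of Y ...
                Z[m, k] = np.dot(X[m, :], Y[:, k])  # ... compute one dot product
        return Z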

    While easy to write, read and understand, this implementation is highly inefficient in terms of memory access and caching. As mentioned in the first article of this series, a fundamental aspect of GPU optimisation is minimising data transfers.

    However, our current implementation starts by loading a row from X, iteratively loads all K columns of Y, computes their dot products and repeats the process for every row in X. This results in a total of M(K+1) loading operations.

    Naive Matrix Multiplication: purple and blue tiles represent the vectors involved in dot products at each time step, and green cells the computed output values.

    As seen in the animation, the memory access pattern is wasteful, as every column of Y is loaded M times. As an analogy: this is like running to the grocery store (global memory) every time you need a new ingredient for a dish instead of preparing all of the ingredients on your kitchen counter (shared memory). Ideally, we want to minimise the number of times each chunk of data is loaded and maximise its reusability once loaded. This leaves us with two main axes of optimisation:

    1. How can we improve the access pattern to minimise redundant loads?
    2. How much data can we load at once, and where should it be stored on the GPU?

    Tiled GEMM

    As mentioned previously, the naive approach to GEMM results in many redundant loads, which induces unnecessary overhead. Ideally, we'd like to load each segment of data only once and perform all of the operations in which it is used before dropping it from memory.

    An elegant approach to this problem is tiling, which involves dividing large matrices into smaller "tiles" or sub-matrices. Consider two matrices X and Y with shapes (4,6) and (6,4) respectively; X@Y results in a matrix Z with shape (4,4).

    In order to compute the first element of Z, Z[0,0], we need to compute the dot product between the first row of X and the first column of Y: Z[0,0] = dot(X[0, :], Y[:, 0]). We can also break the dot product down into smaller chunks, for instance in groups of three elements: Z[0,0] = dot(X[0,0:3], Y[0:3, 0]) + dot(X[0,3:6], Y[3:6, 0]).

    Alternatively, we can extend this approach to two dimensions and compute a whole (2,2) block of Z at a time: Z[0:2, 0:2] = dot(X[0:2, 0:2], Y[0:2, 0:2]) + dot(X[0:2, 2:4], Y[2:4, 0:2]) + dot(X[0:2, 4:6], Y[4:6, 0:2]). A NumPy sketch of this blocked decomposition follows.
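    To make the accumulation concrete, here is a small NumPy sketch of the same blocked decomposition (a hypothetical helper, assuming every dimension is a multiple of the block size):

    import numpy as np

    def tiled_matmul(X: np.ndarray, Y: np.ndarray, block: int = 2) -> np.ndarray:
        M, N = X.shape
        _, K = Y.shape
        Z = np.zeros((M, K), dtype=X.dtype)
        for m in range(0, M, block):            # output row-blocks
            for k in range(0, K, block):        # output column-blocks
                acc = np.zeros((block, block), dtype=X.dtype)
                for n in range(0, N, block):    # accumulate over the shared dimension
                    acc += X[m:m+block, n:n+block] @ Y[n:n+block, k:k+block]
                Z[m:m+block, k:k+block] = acc
        return Z

    Each (block, block) tile of Z is built from N/block partial products, which is exactly the accumulation shown in the animation below.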

    Here's a visual illustration of tiled matrix multiplication:

    Tiled Matrix Multiplication. The computation is split into several "tiles" of X and Y (highlighted in pale blue and purple), each containing several blocks (dark blue and purple). In each block, we compute dot products (green cells in X and Y). These dot products are accumulated across the blocks of a tile to compute the output values in Z (the accumulation is represented by colours going from orange to green).

    The above animation illustrates how data is reused in tiled GEMM. For each 2×2 block in X and Y, we compute 4 dot products, which results in a (2,2) output matrix in Z. Since each tile contains 3 blocks, we need to accumulate 3 of these matrices to compute the final (2,2) output in Z. This accumulation is represented by coloured cells in Z.

    In the kitchen analogy, this is like fetching ingredients from the store and preparing them on the kitchen counter (i.e. the small shared memory), reusing them several times before going back to the store.

    Importantly, reusing loaded data over several steps allows this approach to drastically reduce the number of load operations. For (2,2) blocks, each X row and Y column is used in two dot products. Therefore, we perform twice as many operations with each block of loaded data, roughly halving the number of load operations! Note that this generalises to larger blocks as well: using a (32,32) block would reduce the number of loads by a factor of around 32.

    Now you're probably wondering: "how large can these blocks be?" To answer this question, let's recall how memory is managed in modern GPUs.

    GPU Memory Hierarchy

    We distinguish four main types of memory in Nvidia GPUs. Here, we take the example of an A100:

    • Registers: The fastest and smallest type of memory on the GPU, residing directly within each Streaming Multiprocessor (SM). On the A100, each SM provides 256 KB of register file space (65,536 × 32-bit registers), distributed among its threads. Each thread gets its own private 32-bit registers for storing temporary variables and intermediate results, avoiding memory traffic altogether. However, register usage per thread directly impacts occupancy, as using too many registers per thread limits how many threads can run concurrently.
    • L1/Shared Memory: On an A100, each SM has 192 KB of SRAM that can be flexibly configured as either a hardware-managed L1 cache or programmer-managed shared memory. For performance-critical kernels like matrix multiplication, we explicitly use this space as shared memory to stage data tiles close to the compute units, bypassing the L1 cache entirely. This gives us fine-grained control over data reuse.
    • L2 cache: This cache is slower than L1 but much larger, with around 40 MB shared across all SMs on the A100. It serves as a global cache for both data and instructions, reducing the number of accesses to high-latency HBM memory. The L2 cache is coherent across SMs, meaning that updates from one SM are visible to others, enabling synchronisation between thread blocks. Its bandwidth can reach several terabytes per second, acting as a buffer between the fast on-chip SRAM and the slower HBM.
    • High Bandwidth Memory (HBM): This is the device memory; it has a capacity of either 40 GB or 80 GB depending on the A100 model. It provides extremely high bandwidth (up to 2 TB/s on the 80 GB variant) but with much higher latency than on-chip caches. HBM is where large tensors, model weights, and datasets reside during execution. Since accessing HBM is expensive, efficient kernels aim to minimise data movement and maximise on-chip data reuse via registers and shared memory.

    As you can see, the memory hierarchy generally trades off capacity against latency. Therefore, maximising performance boils down to loading data from HBM into shared memory efficiently and reusing it as much as possible.

    GPU Memory Hierarchy, from fastest/smallest (top) to slowest/largest (bottom).

    Choosing our block size is critical. We want blocks to be large enough to create lots of parallel work, but small enough that their data fits in the SM's shared memory and registers. A BLOCK_SIZE of 64 is a common starting point because it is a multiple of the warp size (32 threads), guaranteeing full hardware utilisation. A quick back-of-the-envelope check is shown below.
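    As a rough sanity check (the exact footprint depends on how the compiler stages and pads the tiles), two (64, 64) float32 tiles fit comfortably within an SM's shared memory:

    BLOCK_SIZE = 64
    bytes_per_float32 = 4
    # one tile of X and one tile of Y staged per iteration of the accumulation loop
    tile_bytes = 2 * BLOCK_SIZE * BLOCK_SIZE * bytes_per_float32
    print(tile_bytes / 1024)  # 32.0 KB, well under the A100's 192 KB of SRAM per SM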

    Parallel Tiled GEMM

    With these considerations in mind, a natural follow-up to our tiled GEMM is to parallelise the computation of each pair of tiles over several thread blocks, as depicted in the following animation.

    Parallel Tiled Matrix Multiplication. The iteration over tiles is replaced by a parallel operation over several thread blocks.

    Memory Coalescing

    Before writing tiled GEMM in Triton, we need to consider one last detail: memory coalescing, a technique that enables optimal use of global memory bandwidth. Memory coalescing is achieved when consecutive threads in a warp access consecutive memory addresses. Imagine a librarian needing to fetch books for a client: if all the books are side by side on a shelf, they can grab them all at once. In contrast, if the books are lying on different shelves, they'll have to grab them one by one, which takes significantly longer.

    To understand how this applies to our case, note that matrices are stored linearly in memory; in other words, a (2,2) matrix is stored as a sequence of 4 consecutive elements. Frameworks like PyTorch adopt a row-major format, meaning that the elements of a matrix are per-row contiguous in memory. For instance, the elements of our (2,2) matrix would be stored as follows: [(0,0), (0,1), (1,0), (1,1)]. Notice that elements of the same row are contiguous (touching) while elements of the same column have a stride of 1 (separated by one element).

    PyTorch stores matrices in row-major format. Elements of a row are contiguous in memory while elements of a column are strided.

    This implies that we can load rows using coalesced loads, but columns do not fulfil this condition. However, we need to access columns of Y to compute the dot products. In order to maximise performance, a good practice is to transpose Y so that we iterate over its rows rather than its columns.

    However, transposing Y isn't enough to alter its layout in memory. As mentioned previously, PyTorch stores matrices in a flat array. Each matrix dimension is associated with a stride attribute, denoting the jump necessary to go from one element to the next along that dimension. For instance, a (10,10) matrix would have strides=(10,1). Indeed, starting from element [0,0], element [1,0] is 10 memory slots (i.e. one row) away, while element [0,1] is adjacent.

    When transposing a tensor, PyTorch doesn't modify the layout in memory but simply recomputes the strides. In order to make the transpose effective from a memory standpoint, we need to call Y.T.contiguous(). The snippet below illustrates the difference.
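    A small illustrative check (the shape is arbitrary) showing that .T only swaps the strides, while .contiguous() actually rewrites the data in row-major order:

    import torch

    Y = torch.randn(6, 4)
    print(Y.stride())                 # (4, 1): row-major layout
    print(Y.T.stride())               # (1, 4): same memory, only the strides are swapped
    print(Y.T.contiguous().stride())  # (6, 1): the (4, 6) transpose is rewritten in row-major order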

    These are the steps required to load the columns of Y efficiently; however, we'll need to transpose the loaded blocks within the kernel to perform the dot product correctly: z_block = tl.dot(X_block, Y_block.T).

    Illustration of Y, Y.T and Y.T.contiguous() in their block representation and memory layout. The transpose operation changes the behaviour of the matrix but does not modify its memory layout. This is why we need to add .contiguous() to enable coalesced reads on rows.

    Triton Implementation

    From here on, we first describe the kernel without memory coalescing to simplify the logic and pointer arithmetic, before summarising the changes required to make the load operations coalesced on the columns of Y.

    Let's start by focusing on the PyTorch wrapper around the kernel. We need to read M, N, K from the input matrices and compute their strides, since these constants will be useful later in the kernel. Then, we define the BLOCK_SIZE and declare the grid, as sketched below.
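    The full wrapper is available on GitHub; a minimal sketch consistent with the coalesced variant listed further down could look like this (the kernel name block_matmul_kernel, the grid shape and the BLOCK_SIZE value are assumptions):

    import torch
    import triton
    import triton.language as tl

    def block_matmul(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        M, N = X.shape
        _, K = Y.shape
        Z = torch.empty((M, K), device="cuda", dtype=X.dtype)

        # strides (in elements) along each dimension
        x_stride_m, x_stride_n = X.stride()
        y_stride_n, y_stride_k = Y.stride()
        z_stride_m, z_stride_k = Z.stride()

        BLOCK_SIZE = 64
        # one program per (BLOCK_SIZE, BLOCK_SIZE) block of the output Z
        grid = (triton.cdiv(M, BLOCK_SIZE), triton.cdiv(K, BLOCK_SIZE))

        block_matmul_kernel[grid](
            X, x_stride_m, x_stride_n,
            Y, y_stride_n, y_stride_k,
            Z, z_stride_m, z_stride_k,
            M, N, K,
            BLOCK_SIZE,
        )
        return Z

    Each program in the 2D grid is responsible for one (BLOCK_SIZE, BLOCK_SIZE) block of Z.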

    Now let's dive into the actual kernel code. We are going to make use of Triton's make_block_ptr utility, which simplifies the pointer arithmetic. We create one block pointer per matrix and pass the matrix shape, its strides, and the size of the block as inputs. Additionally, we specify the offsets, i.e. the coordinates of the top-left element of the current block. For X, this corresponds to (m_idx * BLOCK_SIZE, 0), where m_idx is the index of the current block along the M dimension.

    From there, we define z_acc, a zero matrix that will accumulate the partial dot products as we iterate through tiles. We then iterate along the shared dimension N, loading blocks of size (BLOCK_SIZE, BLOCK_SIZE), and accumulate their dot products in z_acc. At each step, we move the block pointers along the shared dimension using .advance. A sketch of the kernel body is shown below.
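    The full kernel is on GitHub; the following sketch (continuing from the wrapper above) is reconstructed from the description and from the coalesced variant shown later, so the kernel name, the order hints and some argument names are assumptions:

    @triton.jit
    def block_matmul_kernel(
        X_ptr, X_m_stride, X_n_stride,
        Y_ptr, Y_n_stride, Y_k_stride,
        Z_ptr, Z_m_stride, Z_k_stride,
        M, N, K,
        BLOCK_SIZE: tl.constexpr,
    ):
        # indices of the current output block along M and K
        m_idx = tl.program_id(axis=0)
        k_idx = tl.program_id(axis=1)

        x_block_ptr = tl.make_block_ptr(
            base=X_ptr,
            shape=(M, N),
            strides=(X_m_stride, X_n_stride),
            offsets=(m_idx * BLOCK_SIZE, 0),
            block_shape=(BLOCK_SIZE, BLOCK_SIZE),
            order=(1, 0),
        )
        y_block_ptr = tl.make_block_ptr(
            base=Y_ptr,
            shape=(N, K),
            strides=(Y_n_stride, Y_k_stride),
            offsets=(0, k_idx * BLOCK_SIZE),
            block_shape=(BLOCK_SIZE, BLOCK_SIZE),
            order=(1, 0),
        )
        z_block_ptr = tl.make_block_ptr(
            base=Z_ptr,
            shape=(M, K),
            strides=(Z_m_stride, Z_k_stride),
            offsets=(m_idx * BLOCK_SIZE, k_idx * BLOCK_SIZE),
            block_shape=(BLOCK_SIZE, BLOCK_SIZE),
            order=(1, 0),
        )

        # accumulator for the partial dot products of this output block
        z_acc = tl.zeros((BLOCK_SIZE, BLOCK_SIZE), dtype=tl.float32)

        for _ in range(0, N, BLOCK_SIZE):
            x = tl.load(x_block_ptr, boundary_check=(0, 1), padding_option="zero")
            y = tl.load(y_block_ptr, boundary_check=(0, 1), padding_option="zero")
            z_acc += tl.dot(x, y)
            # move both block pointers along the shared dimension N
            x_block_ptr = tl.advance(x_block_ptr, offsets=(0, BLOCK_SIZE))
            y_block_ptr = tl.advance(y_block_ptr, offsets=(BLOCK_SIZE, 0))

        tl.store(pointer=z_block_ptr, value=z_acc, boundary_check=(0, 1))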

    You might have noticed that when loading data, we use boundary_check and padding_option instead of mask and other as in the previous article. These arguments are specific to the use of block pointers; they specify which axes to check for out-of-bounds accesses (here (0,1) for both x and y) and how to handle the invalid values. Here we set them to zero so that they are ignored in the dot product.

    We can now take a look at the performance of this kernel using the following function:

    import numpy as np
    import torch
    import triton
    from tqdm import tqdm

    def bench(fn: callable, x: torch.Tensor, y: torch.Tensor, repeat: int):
      flops = []
      med_latency = []

      for _ in tqdm(range(repeat), desc=f"Benchmarking {fn.__name__}"):
        latency_ms = triton.testing.do_bench(
          lambda: fn(x, y),
          quantiles=[0.5],  # get the median latency
          return_mode="all",
          )
        n_flops = 2 * M * N * K  # matmul requires roughly 2*M*N*K operations
        tflops = n_flops / (latency_ms / 1e3) / 1e12

        med_latency.append(latency_ms)
        flops.append(tflops)

      flops = np.array(flops)
      med_latency = np.array(med_latency)
      print(f"Absolute Error: {torch.sum(torch.abs(x @ y - fn(x, y)))}")
      print(f"Median Latency: {med_latency.mean():.4f} ± {med_latency.std():.3f} ms")
      print(f"Throughput: {flops.mean():.4f} ± {flops.std():.3f} TeraFLOPS")

    M = 8192
    N = 6144
    K = 4096

    X = torch.randn((M, N), device="cuda", dtype=torch.float32)
    Y = torch.randn((N, K), device="cuda", dtype=torch.float32)

    bench(block_matmul, X, Y, repeat=10)

    We get the following output (using a T4 GPU on Colab):

    Absolute Error: 0.0 # the kernel outputs the correct result!
    Median Latency: 130.7831 ± 1.794 ms
    Throughput: 3.1533 ± 0.043 TeraFLOPS

    Now let's review the changes required for coalesced loads on Y: we mainly need to flip the shape, strides and offsets when defining the block pointer for Y. Additionally, we update the block pointer to move along the column dimension (previously the row dimension). The full code for this implementation is available on GitHub.

    @triton.jit
    def coalesced_block_matmul_kernel(
        X_ptr, X_m_stride, X_n_stride,
        Y_ptr, Y_k_stride, Y_n_stride,
        Z_ptr, Z_m_stride, Z_k_stride,
        M, N, K,
        BLOCK_SIZE: tl.constexpr,
    ):
        ...
        y_block_ptr = tl.make_block_ptr(
            base=Y_ptr,
            # flip the shape, strides and offsets to match Y.T
            shape=(K, N),
            strides=(Y_k_stride, Y_n_stride),
            offsets=(k_idx * BLOCK_SIZE, 0),
            block_shape=(BLOCK_SIZE, BLOCK_SIZE),
            order=(0, 1),
        )
        ...

        for _ in range(0, N, BLOCK_SIZE):
            ...  # load x and y blocks
            z_acc += tl.dot(x, y.T)  # transpose Y back for the dot product
            x_block_ptr = tl.advance(x_block_ptr, offsets=(0, BLOCK_SIZE))
            # advance the block pointer along the columns of Y.T (i.e. rows of Y)
            y_block_ptr = tl.advance(y_block_ptr, offsets=(0, BLOCK_SIZE))

        tl.store(pointer=z_block_ptr, value=z_acc, boundary_check=(0, 1))

    def coalesced_block_matmul(X, Y):
        Y = Y.T.contiguous()  # Y is now (K, N)
        M, N = X.shape
        K, _ = Y.shape
        Z = torch.empty((M, K), device="cuda")

        x_stride_m, x_stride_n = X.stride()
        y_stride_k, y_stride_n = Y.stride()
        z_stride_m, z_stride_k = Z.stride()

        ...  # define BLOCK_SIZE and grid

        coalesced_block_matmul_kernel[grid](
            X, x_stride_m, x_stride_n,
            Y, y_stride_k, y_stride_n,  # strides passed in the kernel's (K, N) order
            Z, z_stride_m, z_stride_k,
            M, N, K,
            BLOCK_SIZE,
        )

        return Z

    Here are the results of our benchmark for the kernel with coalesced loads on Y:

    Absolute Error: 0.0 # Again, the kernel is correct!
    Median Latency: 261.9420 ± 0.858 ms
    Throughput: 1.5741 ± 0.005 TeraFLOPS

    Surprisingly, the throughput of this second kernel is only half of what we obtained with the first one, despite improving the efficiency of the load operations 🤔

    A quick inspection using Nsight (Nvidia's kernel profiler, more on that in a future article) reveals that the transpose operation within the kernel creates a "traffic jam". Specifically, the transpose causes bank conflicts, leaving threads idle most of the time. Notably, the warp scheduler has no eligible warp to dispatch 87.6% of the time, as warps are waiting for the bank conflicts to resolve. Additionally, the report reads:

    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    ...
    DRAM Throughput         %                   8.20
    Compute (SM) Throughput %                  21.14
    ...

    This indicates that the kernel is latency bound (i.e. neither memory nor compute bound; refer to the previous article for more details). In contrast, the first kernel is compute bound (i.e. increasing compute would improve performance), since its compute throughput is high compared to the DRAM throughput:

    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    ...
    DRAM Throughput         %                  29.35
    Compute (SM) Throughput %                  74.39
    ...

    Conclusion

    This experiment highlights the importance of profiling and empirical validation. Even well-intentioned optimisations like coalescing memory accesses can introduce new bottlenecks if not evaluated carefully. The first kernel, though simpler, was compute-bound and better matched the hardware's characteristics.

    In the next articles of this series, we'll implement a softmax kernel, paying particular attention to integrating Triton with PyTorch's autograd and to profiling kernels with Nsight.

    Until next time! 👋

    Useful Resources



