
    Learning Triton One Kernel At a Time: Vector Addition

    By ProfitlyAI | September 27, 2025 | 10 Mins Read


    A little optimisation goes a long way. Models like GPT-4 cost more than $100 million to train, which makes a 1% efficiency gain worth over a million dollars. A powerful way to optimise the efficiency of machine learning models is to write some of their components directly on the GPU. Now if you’re anything like me, the simple mention of CUDA kernels is enough to send chills down your spine, as they’re notoriously complex to write and debug.

    Fortunately, OpenAI released Triton in 2021, a new language and compiler abstracting away much of CUDA’s complexity and allowing less experienced practitioners to write performant kernels. A notable example is Unsloth, an LLM-training service that promises 30x faster training with 60% less memory usage, all thanks to replacing layers written in PyTorch with Triton kernels.

    In this tutorial series, we’ll learn the basics of GPU architecture and implement high-performance Triton kernels! All the code presented in this series will be available at https://github.com/RPegoud/Triton-Kernels.

    GPU Architecture Basics

    In this section, we’ll go through the very basics of (Nvidia) GPUs to get us started and write our first Triton kernel by the end of this article.

    Starting from the smallest software unit, we can describe the hierarchy of execution units as follows:

    • Threads: The smallest unit of work; they run the user-defined kernel code.
    • Warps: The smallest scheduling unit; they’re always composed of 32 parallel threads, each with its own instruction address counter and register state. Threads in a warp start together but are free to branch and execute independently.
    • Thread Blocks: Groups of warps, where all threads can cooperate via shared memory and sync barriers. It’s required that thread blocks can execute independently and in any order, in parallel or sequentially. This independence allows thread blocks to be scheduled in any order across any number of cores, so that GPU programs scale efficiently with the number of cores. We can synchronise the threads within a block at specific points in the kernel if needed, for example to synchronise memory access.
    • Streaming Multiprocessor (SM): A unit in charge of executing many warps in parallel; it owns shared memory and an L1 cache (which holds the most recent global-memory lines that the SM has accessed). An SM has a dedicated warp scheduler that pulls warps from the thread blocks that are ready to run.
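To make the relationship between threads, warps and thread blocks concrete, here is a minimal sketch in plain Python. The helper `warps_per_block` is a hypothetical name used only for illustration (it is not part of CUDA or Triton); it simply applies the fixed warp size of 32 threads described above.

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs, as described above


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to hold a block's threads (hypothetical helper)."""
    # A partially filled warp still occupies a whole warp, hence the ceiling division.
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


for threads in (32, 100, 256, 1024):
    print(f"{threads:>4} threads -> {warps_per_block(threads)} warps")
```

A block of 100 threads, for instance, needs 4 warps, and the last warp is only partly filled.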

    On the hardware side, the smallest unit of work is a CUDA core, the physical Arithmetic Logic Unit (ALU) which performs arithmetic operations for a thread (or parts of it).

    To summarise this section with an analogy, we could see CUDA cores as individual workers, while a warp is a squad of 32 workers given the same instruction at once. They may or may not execute this task the same way (branching) and can potentially complete it at a different point in time (independence). A thread block is composed of several squads sharing a common workspace (i.e. they have shared memory); workers from all squads in the workspace can wait for each other to get lunch at the same time. A streaming multiprocessor is a factory floor with many squads working together and sharing tools and storage. Finally, the GPU is a whole plant, with many floors.

    Hierarchy of an Nvidia GPU architecture. Dotted rectangles represent memory blocks (made by author)

    Optimisation Basics

    When optimising deep learning models, we’re juggling three main components:

    1. Compute: Time spent by the GPU computing floating point operations (FLOPs).
    2. Memory: Time spent transferring tensors within a GPU.
    3. Overhead: All other operations (Python interpreter, PyTorch dispatch, …).
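As a rough illustration of how the compute and memory components compare, the sketch below estimates both times for a given operation and reports which one dominates. The peak throughput and bandwidth figures are hypothetical values picked for the example, not the specifications of any real GPU, and overhead is ignored entirely.

```python
def bottleneck(flops: float, bytes_moved: float,
               peak_flops: float, peak_bandwidth: float) -> str:
    """Return which component dominates under a simple two-term time model."""
    compute_time = flops / peak_flops            # seconds spent on arithmetic
    memory_time = bytes_moved / peak_bandwidth   # seconds spent moving data
    return "compute" if compute_time > memory_time else "memory"


# Hypothetical hardware figures, chosen only for illustration.
PEAK_FLOPS = 10e12      # 10 TFLOP/s
PEAK_BANDWIDTH = 1e12   # 1 TB/s

# float32 vector addition: 1 FLOP per element, 2 reads + 1 write of 4 bytes each.
n = 1_000_000
vector_add = bottleneck(n, 12 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# float32 square matmul: 2 * N^3 FLOPs, assuming each of the 3 matrices is moved once.
N = 4096
matmul = bottleneck(2 * N**3, 3 * 4 * N**2, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"vector addition is {vector_add}-bound, matmul is {matmul}-bound")
```

Under these assumptions, vector addition is limited by memory while a large matrix multiplication is limited by compute.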

    Keeping these components in mind helps in figuring out the right way to resolve a bottleneck. For instance, increasing compute (e.g. using a more powerful GPU) doesn’t help if most of the time is spent doing memory transfers. Ideally though, most of the time should be spent on compute, more precisely on matrix multiplications, the precise operation GPUs are optimised for.

    This implies minimising the cost paid to move data around, either from the CPU to the GPU (“data transfer cost”), from one node to the other (“network cost”) or from CUDA global memory (DRAM, cheap but slow) to CUDA shared memory (SRAM, expensive but the fastest on-device memory). The latter is called bandwidth cost and is going to be our main focus for now. Common ways to reduce bandwidth costs include:

    1. Reusing data loaded in shared memory for multiple steps. A prime example of this is tiled matrix multiplication, which we’ll cover in a future post.
    2. Fusing multiple operations in a single kernel (since every kernel launch implies moving data from DRAM to SRAM); for instance, we can fuse a matrix multiplication with an activation function. Generally, operator fusion can provide a massive performance boost since it prevents a lot of global memory reads/writes, and any two operators present an opportunity for fusion.

    Matrix multiplication followed by a ReLU activation without operator fusion. (made by author)

    In this example, we perform a matrix multiplication x@W and store the result in an intermediate variable a. We then apply a ReLU to a and store the result in a variable y. This requires the GPU to read x and W from global memory, write the result to a, read a again and finally write to y. Instead, operator fusion would allow us to halve the number of read and write steps to global memory by performing the matrix multiplication and applying the ReLU in a single kernel.

    Fused matrix multiplication and ReLU activation. (made by author)
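The saving from fusion can be sketched by counting how many elements travel to and from global memory in each case. This is a deliberately simple model: it assumes every tensor is read or written exactly once per kernel, and the function name `global_memory_traffic` is hypothetical.

```python
def global_memory_traffic(m: int, k: int, n: int, fused: bool) -> int:
    """Elements moved to/from global memory for y = relu(x @ W).

    x has shape (m, k), W has shape (k, n); a and y have shape (m, n).
    Simplified model: each tensor is touched once per kernel.
    """
    reads = m * k + k * n   # x and W are always read
    writes = m * n          # y is always written
    if not fused:
        writes += m * n     # the matmul kernel writes the intermediate a
        reads += m * n      # the ReLU kernel reads a back
    return reads + writes


unfused = global_memory_traffic(1024, 1024, 1024, fused=False)
fused = global_memory_traffic(1024, 1024, 1024, fused=True)
print(f"unfused: {unfused} elements, fused: {fused} elements")
```

In this model the fused kernel skips the round trip through `a`, which removes `2 * m * n` element transfers; the relative saving therefore depends on the shapes involved.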

    Triton

    We’ll now write our first Triton kernel, a simple vector addition. First, let’s walk through how this operation is broken down and executed on a GPU.

    Suppose we want to sum the entries of two vectors X and Y, each with 7 elements (n_elements=7).

    We’ll instruct the GPU to tackle this problem in chunks of 3 elements at a time (BLOCK_SIZE=3). Therefore, to cover all 7 elements of the input vectors, the GPU will launch 3 parallel “programs”, independent instances of our kernel, each with a unique program ID, pid:

    • Program 0 is assigned elements 0, 1, 2.
    • Program 1 is assigned elements 3, 4, 5.
    • Program 2 is assigned element 6.

    Then, these programs will write back the results to a vector Z stored in global memory.

    An important detail is that a kernel doesn’t receive an entire vector X; instead, it receives a pointer to the memory address of the first element, X[0]. In order to access the actual values of X, we need to load them from global memory manually.

    We can access the data for each block by using the program ID: block_start = pid * BLOCK_SIZE. From there, we can get the remaining element addresses for that block by computing offsets = block_start + range(0, BLOCK_SIZE) and load them into memory.

    However, remember that program 2 is only assigned element 6, but its offsets are [6, 7, 8]. To avoid any indexing error, Triton lets us define a mask to identify valid target elements, here mask = offsets < n_elements.

    We can now safely load X and Y and add them together before writing the result back to an output variable Z in global memory in a similar way.

    Per-block vector indexing. Slices of X, Y and Z are sent to independent thread blocks, each indexed by a unique ID. (Image by author)
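The indexing logic described above can be emulated on the CPU in plain Python before moving to Triton. This is only a sketch of the offsets-and-mask idea: the function `add_program` is a hypothetical stand-in for one program instance, and the loop runs sequentially where a GPU would run the programs in parallel.

```python
def add_program(pid, x, y, z, n_elements, block_size):
    """Emulate one program instance of the vector addition on the CPU."""
    block_start = pid * block_size
    offsets = [block_start + i for i in range(block_size)]
    mask = [offset < n_elements for offset in offsets]
    for offset, valid in zip(offsets, mask):
        if valid:  # masked-out positions are neither loaded nor stored
            z[offset] = x[offset] + y[offset]
    return offsets, mask


X = [1, 2, 3, 4, 5, 6, 7]
Y = [10, 20, 30, 40, 50, 60, 70]
Z = [0] * len(X)
BLOCK_SIZE = 3
num_programs = (len(X) + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceil(7 / 3) = 3

for pid in range(num_programs):  # sequential here, parallel on a GPU
    offsets, mask = add_program(pid, X, Y, Z, len(X), BLOCK_SIZE)
    print(pid, offsets, mask)

print(Z)
```

The last program prints offsets `[6, 7, 8]` with mask `[True, False, False]`, so only element 6 is touched and `Z` ends up as `[11, 22, 33, 44, 55, 66, 77]`.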

    Let’s take a closer look at the code. Here’s the Triton kernel:

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(
        x_ptr,  # pointer to the first memory entry of x
        y_ptr,  # pointer to the first memory entry of y
        output_ptr,  # pointer to the first memory entry of the output
        n_elements,  # size of x and y
        BLOCK_SIZE: tl.constexpr,  # size of a single block
    ):
        # --- Compute offsets and mask ---
        pid = tl.program_id(axis=0)  # block index
        block_start = pid * BLOCK_SIZE  # start index for the current block
        offsets = block_start + tl.arange(0, BLOCK_SIZE)  # index range
        mask = offsets < n_elements  # mask out-of-bound elements

        # --- Load variables from global memory ---
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)

        # --- Operation ---
        output = x + y

        # --- Save results to global memory ---
        tl.store(output_ptr + offsets, output, mask=mask)

    Let’s break down some of the Triton-specific syntax:

    • First, a Triton kernel is always decorated by @triton.jit.
    • Second, some arguments need to be declared as static, meaning that they’re known at compile-time. This is required for BLOCK_SIZE and is achieved by adding the tl.constexpr type annotation. Also note that we don’t annotate other variables, since they are not proper Python variables.
    • We use tl.program_id to access the ID of the current block; tl.arange behaves similarly to NumPy’s np.arange.
    • Loading and storing variables is achieved by calling tl.load and tl.store with arrays of pointers. Notice that there is no return statement; this role is delegated to tl.store.

    To use our kernel, we now need to write a PyTorch-level wrapper that provides memory pointers and defines a kernel grid. Generally, the kernel grid is a 1D, 2D or 3D tuple containing the number of thread blocks allocated to the kernel along each axis. In our previous example, we used a 1D grid of 3 thread blocks: grid = (3, ).

    To handle varying array sizes, we default to grid = (ceil(n_elements / BLOCK_SIZE), ).
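The ceiling division behind this grid size can be sketched with integers only. The local `cdiv` below is an illustrative stand-in written for this example, computing the ceil div that the wrapper obtains from `triton.cdiv`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers, without going through floats."""
    return (a + b - 1) // b


BLOCK_SIZE = 1024
grids = {}
for n_elements in (7, 1024, 1025, 4096):
    grids[n_elements] = (cdiv(n_elements, BLOCK_SIZE),)  # 1D grid
    print(n_elements, grids[n_elements])
```

An input of 1025 elements needs two blocks: the second one handles a single valid element and masks out the rest.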

    import torch

    def add(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        """PyTorch wrapper for `add_kernel`."""
        output = torch.zeros_like(X)  # allocate memory for the output
        n_elements = output.numel()  # size of X and Y

        # cdiv = ceil div, computes the number of blocks to use
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        # calling the kernel will automatically store `BLOCK_SIZE` in `meta`
        # and update `output`
        add_kernel[grid](X, Y, output, n_elements, BLOCK_SIZE=1024)

        return output

    Here are two final notes about the wrapper:

    1. You might have noticed that grid is defined as a lambda function. This allows Triton to compute the number of thread blocks to launch at launch time. Therefore, we compute the grid size based on the block size, which is stored in meta, a dictionary of compile-time constants that are exposed to the kernel.
    2. When calling the kernel, the value of output will be modified in-place, so we don’t need to reassign output = add_kernel[…].

    We can conclude this tutorial by verifying that our kernel works properly:

    x, y = torch.randn((2, 2048), device="cuda")

    print(add(x, y))
    # >> tensor([ 1.8022, 0.6780, 2.8261, ..., 1.5445, 0.2563, -0.1846], device='cuda:0')

    abs_difference = torch.abs((x + y) - add(x, y))
    print(f"Max absolute difference: {torch.max(abs_difference)}")
    # >> Max absolute difference: 0.0

    That’s it for this introduction. In the following posts, we’ll learn to implement more interesting kernels such as tiled matrix multiplication and see how to integrate Triton kernels in PyTorch models using autograd.

    Until next time! 👋
