    Why Your Next LLM Might Not Have A Tokenizer

By ProfitlyAI · June 24, 2025


In my last article, we dove into Google's Titans, a model that pushes the boundaries of long-context recall by introducing a dynamic memory module that adapts on the fly, a bit like how our own memory works.

It's a strange paradox. We have AI that can analyze a 10-million-word document, yet it still fumbles questions like: "How many 'r's are in the word strawberry?"

The problem isn't the AI's brain; it's the eyes. The first step in how these models read, tokenization, essentially pre-processes language for them. In doing so, it strips away the rich, messy details of how letters form words; an entire world of sub-word information simply vanishes.


1. Lost in Tokenization: Where Subword Semantics Die

Language, for humans, begins as sound, spoken long before it is written. Yet it is through writing and spelling that we begin to grasp the compositional structure of language. Letters form syllables, syllables form words, and from there we build conversations. This character-level understanding lets us correct, interpret, and infer even when the text is noisy or ambiguous. Language models, in contrast, skip this stage entirely. They are never exposed to characters or raw text as-is; instead, their entire perception of language is mediated by a tokenizer.

This tokenizer, ironically, is the one component in the entire pipeline that is not learned. It is dumb, fixed, and entirely heuristic, despite sitting at the entry point of a model designed to be deeply adaptive. In effect, tokenization sets the stage for learning without doing any learning of its own.

Moreover, tokenization is extraordinarily brittle. A minor typo, say "strawverry" instead of "strawberry", can yield a completely different token sequence, even though the semantic intent remains obvious to any human reader. This sensitivity, instead of being handled right then and there, is passed downstream, forcing the model to interpret a corrupted input. Worse still, optimal tokenizations are highly domain-dependent. A tokenizer trained on everyday English text may perform beautifully on natural language but fail miserably on source code, producing long and semantically awkward token chains for variable names like user_id_to_name_map.
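To see this brittleness for yourself, here is a minimal sketch using the tiktoken library; the cl100k_base vocabulary is an illustrative choice, and the exact splits depend on whichever tokenizer you load.

```python
# A minimal sketch of tokenizer brittleness, assuming the `tiktoken`
# library is installed; exact token splits depend on the vocabulary used.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", "strawverry", "user_id_to_name_map"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # A one-character typo can produce a completely different split,
    # and identifiers often shatter into awkward fragments.
    print(f"{text!r:24} -> {len(ids)} tokens: {pieces}")
```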

The language pipeline is like a spinal cord: the higher up it is compromised, the more it cripples everything downstream. Sitting right at the top, a flawed tokenizer distorts the input before the model even begins reasoning. No matter how clever the architecture is, it is working with corrupted signals from the start.

(Source: Author)
How a simple typo can waste an LLM's "thinking power" just to rectify it

    2. Behold! Byte Latent Transformer

If tokenization is the brittle foundation holding modern LLMs back, the natural question follows: why not eliminate it entirely? That is precisely the radical path taken by researchers at Meta AI with the Byte Latent Transformer (BLT) (Pagnoni et al. 2024) [1]. Rather than operating on words, subwords, or even characters, BLT models language from raw bytes, the most elementary representation of digital text. This lets the LLM learn the language from the ground up, without a tokenizer eating away at the subword semantics.

But modeling bytes directly is far from trivial. A naïve byte-level Transformer would choke on input lengths several times longer than tokenized text: a million words become nearly five million bytes (one word is about 4.7 characters on average, and one character is one byte), making attention computation infeasible due to its quadratic scaling. BLT circumvents this by introducing a dynamic two-tiered system: easy-to-predict byte segments are compressed into latent "patches," significantly shortening the sequence length. The full, high-capacity model is then applied selectively, focusing its computational resources only where linguistic complexity demands it.
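A quick back-of-the-envelope sketch of why that matters; the 4.7 characters-per-word average is the figure assumed above, and the roughly four-characters-per-BPE-token rate is an illustrative assumption:

```python
# Back-of-the-envelope sketch of sequence-length inflation and the
# quadratic attention cost it implies. The 4.7 chars/word average and
# the ~4 chars/token BPE rate are illustrative assumptions.
words = 1_000_000
byte_len = int(words * 4.7)           # ~4.7 characters ≈ bytes per word
bpe_tokens = int(words * 4.7 / 4)     # assuming ~4 characters per BPE token

# Self-attention cost grows with the square of the sequence length.
relative_cost = (byte_len / bpe_tokens) ** 2
print(f"bytes: {byte_len:,}, BPE tokens: {bpe_tokens:,}")
print(f"byte-level attention cost ≈ {relative_cost:.0f}x the token-level cost")
```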

(Source: Adapted from Pagnoni et al. 2024, Figure 2)
Zoomed-out view of the entire Byte Latent Transformer architecture

    2.1 How does it work?

The model can be conceptually divided into three major components, each with a distinct responsibility:

2.1.1 The Local Encoder

The primary function of the Local Encoder is to transform a long input sequence of N_bytes raw bytes, b = (b_1, b_2, …, b_{N_bytes}), into a much shorter sequence of N_patches latent patch representations, p = (p_1, p_2, …, p_{N_patches}).

Step 1: Input Segmentation and Initial Byte Embedding

The input sequence is segmented into patches based on a pre-defined strategy, such as entropy-based patching (a rough sketch of the entropy rule follows the figure below). This produces patch boundary information but does not alter the input sequence itself. That boundary information will come in handy later.

(Source: Pagnoni et al. 2024, Figure 3)
Different strategies for patching, visualized
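As a rough illustration of the entropy-based rule, the sketch below starts a new patch wherever the next-byte entropy of a small byte-level language model crosses a threshold. The small_byte_lm function and the threshold value are placeholders, not the paper's actual model or settings.

```python
# Minimal sketch of entropy-based patching, assuming a small byte-level
# language model `small_byte_lm(prefix) -> probs` (a placeholder here)
# that returns a probability distribution over the next byte (0-255).
import math

def patch_boundaries(byte_seq, small_byte_lm, threshold=2.0):
    """Return the indices where a new patch starts."""
    boundaries = [0]
    for i in range(1, len(byte_seq)):
        probs = small_byte_lm(byte_seq[:i])
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        # Hard-to-predict positions (high entropy) open a new patch,
        # so predictable stretches get grouped into long, cheap patches.
        if entropy > threshold:
            boundaries.append(i)
    return boundaries
```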

The first operation within the encoder is to map each discrete byte value (0-255) into a continuous vector representation. This is achieved via a learnable embedding matrix, E_byte (shape: [256, h_e]), where h_e is the hidden dimension of the local module.
Input: a tensor of byte IDs of shape [B, N_bytes], where B is the batch size.
Output: a tensor of byte embeddings, X (shape: [B, N_bytes, h_e]).
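In PyTorch terms, this step is just an embedding lookup; the sketch below only mirrors the shapes described above and is not the paper's code.

```python
# Minimal sketch of the initial byte embedding, assuming PyTorch.
import torch
import torch.nn as nn

B, N_bytes, h_e = 2, 512, 256                    # illustrative sizes
byte_ids = torch.randint(0, 256, (B, N_bytes))   # [B, N_bytes]

E_byte = nn.Embedding(num_embeddings=256, embedding_dim=h_e)
X = E_byte(byte_ids)                             # [B, N_bytes, h_e]
print(X.shape)                                   # torch.Size([2, 512, 256])
```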

Step 2: Contextual Augmentation via N-gram Hashing

To enrich each byte representation with local context beyond its individual identity, the researchers employ a hash-based n-gram embedding technique. For each byte b_i at position i, a set of preceding n-grams, g_{i,n} = {b_{i-n+1}, …, b_i}, is constructed for multiple values of n ∈ {3, …, 8}.

These n-grams are mapped via a hash function to indices in a second, separate embedding table, E_hash (shape: [V_hash, h_e]), where V_hash is a fixed, large vocabulary size (i.e., the number of hash buckets).

The resulting n-gram embeddings are summed with the original byte embedding to produce an augmented representation, e_i. This operation is defined as:

e_i = x_i + Σ_{n=3}^{8} E_hash(Hash(g_{i,n})),
where x_i is the initial embedding for byte b_i.
(Source: Author)
Explanation: look up the hash of each n-gram in the embedding table and add it to the corresponding byte embedding, for all n ∈ [3, 8].
The shape of the tensor E = {e_1, e_2, …, e_{N_bytes}} remains [B, N_bytes, h_e].
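A minimal sketch of this augmentation, assuming PyTorch; Python's built-in hash and the bucket count stand in for the paper's actual hash function and vocabulary size, and the loops are deliberately naive for readability.

```python
# Minimal sketch of hash-based n-gram augmentation, assuming PyTorch.
# The hash function and bucket count are illustrative stand-ins.
import torch
import torch.nn as nn

V_hash, h_e = 500_000, 256
E_hash = nn.Embedding(V_hash, h_e)

def augment(byte_ids, X):
    """byte_ids: [B, N] byte IDs, X: [B, N, h_e] initial byte embeddings."""
    B, N = byte_ids.shape
    E = X.clone()
    for n in range(3, 9):                                   # n-gram sizes 3..8
        for i in range(n - 1, N):
            grams = byte_ids[:, i - n + 1 : i + 1]          # [B, n]
            for b in range(B):
                bucket = hash(tuple(grams[b].tolist())) % V_hash
                # Add the hashed n-gram embedding to this byte's embedding.
                E[b, i] += E_hash.weight[bucket]
    return E                                                # still [B, N, h_e]
```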

Step 3: Iterative Refinement with Transformer and Cross-Attention Layers

The core of the Local Encoder consists of a stack of l_e identical layers. Each layer performs a two-stage process to refine byte representations and distill them into patch representations.

Step 3a: Local Self-Attention:
The input is processed by a standard Transformer block. This block uses a causal self-attention mechanism with a restricted attention window, meaning each byte representation is updated by attending only to a fixed number of preceding byte representations. This keeps the computation efficient while still allowing contextual refinement.

Input: if it is the first layer, the input is the context-augmented byte embedding E; otherwise, it receives the output of the previous local Self-Attention layer.

(Source: Author)
In symbols, H_l = E for the first layer and H_l = H'_{l-1} thereafter, where:
H_l: input to the current Self-Attention layer
E: context-augmented byte embedding from Step 2
H'_{l-1}: output of the previous Self-Attention layer

Output: more contextually aware byte representations, H'_l (shape: [B, N_bytes, h_e]).
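A minimal sketch of what such a windowed causal mask looks like; the window size and the use of PyTorch's built-in multi-head attention are illustrative assumptions.

```python
# Minimal sketch of local (windowed) causal self-attention, assuming PyTorch.
# The window size is illustrative; the real value is a hyperparameter.
import torch
import torch.nn as nn

def windowed_causal_mask(n: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions that must NOT be attended to."""
    i = torch.arange(n).unsqueeze(1)          # query positions
    j = torch.arange(n).unsqueeze(0)          # key positions
    return (j > i) | (j < i - window + 1)     # future, or too far in the past

n, window, h_e = 512, 128, 256
attn = nn.MultiheadAttention(embed_dim=h_e, num_heads=8, batch_first=True)
H = torch.randn(2, n, h_e)
mask = windowed_causal_mask(n, window)
H_refined, _ = attn(H, H, H, attn_mask=mask)  # [2, n, h_e]
```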

Step 3b: Multi-Headed Cross-Attention:
The goal of the cross-attention is to distill the fine-grained, contextual information captured in the byte representations and inject it into the more abstract patch representations, giving them a rich awareness of their constituent sub-word structures. This is achieved via a cross-attention mechanism in which patches "query" the bytes they contain.

Queries (Q): The patch embeddings are projected using a simple linear layer to form the queries.
For any subsequent layer (l > 0), the patch embeddings are simply the refined patch representations output by the cross-attention block of the previous layer, P(l−1).
However, for the very first layer (l = 0), these patch embeddings must be created from scratch. This initialization is a three-step process:

1. Gathering: Using the patch boundary information obtained in Step 1, the model gathers the byte representations from H0 that belong to each patch. For a single patch, this yields a tensor of shape (N_bytes_per_patch, h_e). After padding every patch representation to the same length, if there are J patches, the shape of the entire concatenated tensor becomes:
  (B, J, N_bytes_per_patch, h_e).
2. Pooling: To summarize the vector for each patch, a pooling operation (e.g., max-pooling) is applied across the N_bytes_per_patch dimension. This effectively captures the most salient byte-level features within the patch.
  • Input shape: (B, J, N_bytes_per_patch, h_e)
  • Output shape: (B, J, h_e)
3. Projection: This summarized patch vector, still in the small local dimension h_e, is then passed through a dedicated linear layer to the global dimension h_g, where h_e ≪ h_g. This projection is what bridges the local and global modules (a code sketch of this initialization follows the figure below).
  • Input shape: (B, J, h_e)
  • Output shape: (B, J, h_g)
(Source: Author)
Summary of the three-step process to get the first patch embeddings:
1. Gather and pool the bytes for each respective patch.
2. Concatenate the patches into a single tensor.
3. Project the patch embedding tensor to the global dimension.
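Here is a minimal sketch of that gather-pool-project initialization, assuming PyTorch; the shapes follow the notation above, and padding is sidestepped by pooling each patch separately.

```python
# Minimal sketch of initializing patch embeddings (gather -> pool -> project),
# assuming PyTorch; shapes follow the notation in the text.
import torch
import torch.nn as nn

def init_patch_embeddings(H0, boundaries, proj):
    """
    H0:         [B, N_bytes, h_e] byte representations from Step 3a
    boundaries: list of patch start indices, e.g. [0, 7, 19, ...]
    proj:       nn.Linear(h_e, h_g) bridging the local and global dimensions
    returns:    [B, J, h_g] initial patch embeddings
    """
    B, N, h_e = H0.shape
    starts, ends = boundaries, boundaries[1:] + [N]
    pooled = []
    for s, e in zip(starts, ends):
        # Gather this patch's bytes and max-pool over them -> [B, h_e]
        pooled.append(H0[:, s:e, :].max(dim=1).values)
    P = torch.stack(pooled, dim=1)      # [B, J, h_e]
    return proj(P)                      # [B, J, h_g]

h_e, h_g = 256, 2048
proj = nn.Linear(h_e, h_g)
P0 = init_patch_embeddings(torch.randn(2, 512, h_e), [0, 100, 300], proj)
print(P0.shape)   # torch.Size([2, 3, 2048])
```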

The patch representations, obtained either from the previous cross-attention block's output or initialized from scratch as above, are then fed into a linear projection layer to form the queries.

    • Input shape: (B, J, h_g)
    • Output shape: (B, J, d_a), where d_a is the "attention dimension".

Keys and Values: these are derived from the byte representations H_l from Step 3a. They are projected from dimension h_e to an intermediate attention dimension d_a via independent linear layers:

(Source: Author)
Projection of the Self-Attention output from Step 3a into Keys and Values.
(Source: Author)
Overview of the information flow in the Local Encoder
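Putting Step 3b together, here is a minimal sketch of the patches-query-bytes cross-attention; a single head is used for clarity, the sizes are illustrative, and the per-patch masking used by the real model is only noted in a comment.

```python
# Minimal sketch of the patch-to-byte cross-attention in Step 3b,
# assuming PyTorch; a single attention head is used for clarity.
import torch
import torch.nn as nn

h_e, h_g, d_a = 256, 2048, 512
W_q = nn.Linear(h_g, d_a)   # patches -> queries
W_k = nn.Linear(h_e, d_a)   # bytes   -> keys
W_v = nn.Linear(h_e, d_a)   # bytes   -> values

def cross_attend(P, H):
    """P: [B, J, h_g] patch reps, H: [B, N_bytes, h_e] byte reps."""
    Q, K, V = W_q(P), W_k(H), W_v(H)
    scores = Q @ K.transpose(-2, -1) / d_a ** 0.5   # [B, J, N_bytes]
    # In the real model, each patch would be masked so it only queries
    # its own bytes; that mask is omitted in this sketch.
    weights = scores.softmax(dim=-1)
    return weights @ V                              # [B, J, d_a]

out = cross_attend(torch.randn(2, 3, h_g), torch.randn(2, 512, h_e))
print(out.shape)   # torch.Size([2, 3, 512])
```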

2.1.2 The Latent Global Transformer

The sequence of patch representations generated by the Local Encoder is passed to the Latent Global Transformer. This module serves as the primary reasoning engine of the BLT model. It is a standard, high-capacity autoregressive Transformer composed of l_g self-attention layers, where l_g is significantly larger than the number of layers in the local modules.

Operating on patch vectors (shape: [B, J, h_g]), this transformer performs full self-attention across all patches, enabling it to model complex, long-range dependencies efficiently. Its sole function is to predict the representation of the next patch, o_j (shape: [B, 1, h_g]), based on all preceding ones. The output is a sequence of predicted patch vectors, O_j (shape: [B, J, h_g]), which encode the model's high-level predictions.

(Source: Author)
o_j is the patch vector that carries the information for the next prediction
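At the shape level, this module behaves like any causal Transformer over patch vectors; the sketch below uses small illustrative sizes rather than the paper's configuration.

```python
# Minimal sketch of the Latent Global Transformer, assuming PyTorch:
# a standard causal Transformer stack over patch vectors.
# Sizes are illustrative; the real global model is far larger.
import torch
import torch.nn as nn

h_g, l_g = 512, 6
layer = nn.TransformerEncoderLayer(d_model=h_g, nhead=8, batch_first=True)
global_model = nn.TransformerEncoder(layer, num_layers=l_g)

P = torch.randn(2, 3, h_g)                             # [B, J, h_g] patch inputs
causal = nn.Transformer.generate_square_subsequent_mask(P.size(1))
O = global_model(P, mask=causal)                       # [B, J, h_g] predicted patch vectors
o_next = O[:, -1:, :]                                  # [B, 1, h_g] drives the next patch
```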

2.1.3 The Local Decoder

The final architectural component is the Local Decoder, a lightweight Transformer that decodes the predicted patch vector o_j, the last element of the global model's output O_j, back into a sequence of raw bytes. It operates autoregressively, producing one byte at a time.

The generation process, designed to be the inverse of the encoder, begins with the hidden state of the last byte in the encoder's output, H_l. Then, for each subsequent byte generated by the decoder (d'_k), in typical autoregressive fashion, it uses the predicted byte's hidden state as the input to guide the generation.

Cross-Attention: The last byte's state from the encoder's output, H_l[:, -1, :] (acting as the Query, with shape [B, 1, h_e]), attends to the target patch vector o_j (acting as Key and Value). This step injects the high-level semantic instruction from the patch into the byte stream.

The query vectors are projected to the attention dimension d_a, while the patch vector is projected to create the key and value. This alignment ensures the generated bytes stay contextually tied to the global prediction.

(Source: Author)
The general equations defining the Query, Key, and Value.
d'_k: the (k+1)-th predicted byte's hidden state from the decoder.

Local Self-Attention: The resulting patch-aware byte representations are then processed by a causal self-attention mechanism. This lets the model take into account the sequence of bytes already generated within the current patch, enforcing local sequential coherence and correct character ordering.

After passing through all l_d layers, each containing the two stages above, the hidden state of the last byte in the sequence is projected by a final linear layer to a 256-dimensional logit vector. A softmax converts these logits into a probability distribution over the byte vocabulary, from which the next byte is sampled. This new byte is then embedded and appended to the input sequence for the next generation step, continuing until the patch is fully decoded.

(Source: Author)
Overview of the information flow in the Local Decoder
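To make the loop concrete, here is a minimal sketch of one decoding step: a single cross-attention from the current byte state to the patch vector, followed by the 256-way projection and sampling. Layer stacking, the self-attention stage, and trained weights are omitted, and all sizes are illustrative.

```python
# Minimal sketch of one Local Decoder step, assuming PyTorch.
# Single cross-attention from the current byte state to the patch vector,
# then projection to 256 byte logits and sampling; the causal self-attention
# over previously generated bytes is omitted for brevity.
import torch
import torch.nn as nn

h_e, h_g, d_a = 256, 512, 256
W_q = nn.Linear(h_e, d_a)            # byte state -> query
W_k = nn.Linear(h_g, d_a)            # patch o_j  -> key
W_v = nn.Linear(h_g, d_a)            # patch o_j  -> value
to_logits = nn.Linear(d_a, 256)      # project to the byte vocabulary

def decode_one_byte(byte_state, o_j):
    """byte_state: [B, 1, h_e], o_j: [B, 1, h_g] -> next byte ids [B]."""
    Q, K, V = W_q(byte_state), W_k(o_j), W_v(o_j)
    attn = (Q @ K.transpose(-2, -1) / d_a ** 0.5).softmax(dim=-1)
    patch_aware = attn @ V                            # [B, 1, d_a]
    probs = to_logits(patch_aware).softmax(dim=-1)    # [B, 1, 256]
    return torch.multinomial(probs.squeeze(1), num_samples=1).squeeze(-1)

next_byte = decode_one_byte(torch.randn(2, 1, h_e), torch.randn(2, 1, h_g))
```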

3. The Verdict: Bytes Are Better Than Tokens!

The Byte Latent Transformer could genuinely be an alternative to the usual, vanilla tokenization-based Transformers at scale. Here are a few convincing reasons for that argument:

1. Byte-Level Models Can Match Token-Based Ones.
One of the main contributions of this work is showing that byte-level models can, for the first time, match the scaling behavior of state-of-the-art token-based architectures such as LLaMA 3 (Grattafiori et al. 2024) [2]. When trained under compute-optimal regimes, the Byte Latent Transformer (BLT) exhibits performance scaling trends comparable to those of models using byte pair encoding (BPE). This finding challenges the long-standing assumption that byte-level processing is inherently inefficient, showing instead that, with the right architectural design, tokenizer-free models genuinely have a shot.

(Source: Adapted from Pagnoni et al. 2024, Figure 6)
BLT showing competitive BPB (a perplexity equivalent for byte models) and scaling laws comparable to those of the tokenizer-based LLaMA models

2. A New Scaling Dimension: Trading Patch Size for Model Size.
The BLT architecture decouples model size from sequence length in a way that token-based models cannot. By dynamically grouping bytes into patches, BLT can use longer average patches to save compute. That saved compute can be reallocated to increase the size and capacity of the main Latent Global Transformer while keeping the total inference cost (FLOPs) constant. The paper shows this new trade-off is highly beneficial: larger models operating on longer patches consistently outperform smaller models operating on shorter tokens/patches at a fixed inference budget.
In other words, you can have a larger and more capable model at no extra compute cost (a rough sketch of the arithmetic follows the figure below).

(Source: Adapted from Pagnoni et al. 2024, Figure 1)
The steeper scaling curves of the larger BLT models allow them to surpass the performance of the token-based LLaMA models after the crossover point.
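As a rough, illustrative calculation (using the common approximation that a forward pass costs about 2 FLOPs per parameter per element processed; the parameter counts and patch sizes below are made up for the example):

```python
# Back-of-the-envelope sketch of the patch-size vs. model-size trade-off.
# Approximation: global-model FLOPs per byte ≈ 2 * params / avg_patch_size.
# All numbers are illustrative, not the paper's configurations.
def global_flops_per_byte(params: float, avg_patch_size: float) -> float:
    return 2 * params / avg_patch_size

small = global_flops_per_byte(params=3e9, avg_patch_size=4.5)
large = global_flops_per_byte(params=6e9, avg_patch_size=9.0)
# Identical cost per byte: doubling the patch length pays for a 2x larger model.
print(f"{small:.3e} vs {large:.3e}")
```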

3. Subword Awareness Through Byte-Level Modeling
By processing raw bytes directly, BLT avoids the information loss typically introduced by tokenization, gaining access to the internal structure of words: their spelling, morphology, and character-level composition. This results in a heightened sensitivity to subword patterns, which the model demonstrates across multiple benchmarks.
On CUTE (Character-level Understanding and Text Evaluation) (Edman et al., 2024) [3], BLT excels at tasks involving fine-grained edits such as character swaps or substitutions, achieving near-perfect accuracy on spelling tasks where models like LLaMA 3 fail entirely.
Similarly, on noised HellaSwag (Zellers et al., 2019) [4], where inputs are perturbed with typos and case variations, BLT retains its reasoning ability far more effectively than token-based models. These results point to an inherent robustness that token-based models cannot attain even with significantly more data.

(Source: Pagnoni et al. 2024, Table 3)
The model's direct byte-level processing leads to large gains on character manipulation (CUTE) and noise robustness (HellaSwag Noise Avg.), tasks that challenge token-based architectures.

4. BLT Shows Stronger Performance on Low-Resource Languages.
Fixed tokenizers, usually trained on predominantly English or other high-resource language data, can be inefficient and inequitable for low-resource languages, often breaking words down into individual bytes (a phenomenon known as "byte-fallback"). Because BLT is inherently byte-based, it treats all languages equally from the start. The results show this leads to improved machine translation performance, particularly for languages whose scripts and morphologies are poorly represented in standard BPE vocabularies.

(Source: Pagnoni et al. 2024, Table 4)
Machine translation performance on the FLORES-101 benchmark (Goyal et al., 2022) [5]: comparable on high-resource languages, but superior on low-resource languages, outperforming the LLaMA 3 model.

5. Dynamic Allocation of Compute: Not Every Word Is Equally Deserving
A key strength of the BLT architecture lies in its ability to dynamically allocate computation based on input complexity. Unlike traditional models, which spend a fixed amount of compute per token, treating simple words like "the" and complex ones like "antidisestablishmentarianism" at equal cost, BLT ties its computational effort to the structure of its learned patches. The high-capacity Global Transformer runs only at the patch level, allowing BLT to form longer patches over predictable, low-complexity stretches and shorter patches over regions requiring deeper reasoning. This lets the model focus its most powerful components where they are needed most, while offloading routine byte-level decoding to the lighter local decoder, yielding a far more efficient and adaptive allocation of resources.


4. Closing Thoughts and Conclusion

For me, what makes BLT exciting isn't just the benchmarks or the novelties; it's the idea that a model can move past the superficial wrappers we call "languages" (English, Japanese, even Python) and start learning directly from raw bytes, the fundamental substrate of all communication. I like that. A model that doesn't rely on a fixed vocabulary, but instead learns structure from the ground up? That feels like a real step toward something more universal.

Of course, something this different won't be embraced with open arms overnight. Tokenizers have become baked into everything: our models, our tools, our intuition. Ditching them means rethinking the very foundational block of the entire AI ecosystem. But the upside is hard to ignore. Maybe, rather than the whole architecture, we will see some of its ideas integrated into the systems that come next.


    5. References

[1] Pagnoni, Artidoro, et al. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv preprint arXiv:2412.09871 (2024).
[2] Grattafiori, Aaron, et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783 (2024).
[3] Edman, Lukas, Helmut Schmid, and Alexander Fraser. "CUTE: Measuring LLMs' Understanding of Their Tokens." arXiv preprint arXiv:2409.15452 (2024).
[4] Zellers, Rowan, et al. "HellaSwag: Can a Machine Really Finish Your Sentence?" arXiv preprint arXiv:1905.07830 (2019).
[5] Goyal, Naman, et al. "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation." Transactions of the Association for Computational Linguistics 10 (2022): 522-538.


