    Beyond Prompt Caching: 5 More Things You Should Cache in RAG Pipelines

By ProfitlyAI · March 19, 2026 · 13 min read


In a previous post, we talked in detail about what Prompt Caching is in LLMs and how it can save you a lot of money and time when running AI-powered apps with high traffic. But apart from Prompt Caching, the concept of a cache can also be applied in several other parts of AI applications, such as RAG retrieval caching or caching of entire query-response pairs, providing further cost and time savings. In this post, we are going to take a closer look at which other components of an AI app can benefit from caching mechanisms. So, let's take a look at caching in AI beyond Prompt Caching.


Why does it make sense to cache other things?

So, Prompt Caching makes sense because we expect system prompts and instructions to be passed as input to the LLM in exactly the same format every time. But beyond this, we can also expect user queries to be repeated, or to look alike to some extent. Especially when talking about deploying RAG or other AI apps within an organization, we expect a large portion of the queries to be semantically similar, or even identical. Naturally, groups of users within an organization are going to be interested in similar things most of the time, like 'how many days of annual leave is an employee entitled to according to the HR policy', or 'what is the process for submitting travel expenses'. But, statistically, it's highly unlikely that multiple users will ask the exact same query (the exact same words, allowing for an exact match), unless we provide them with proposed, standardized queries within the UI of the app. However, there is a very high probability that users ask queries with different words that are semantically very similar. Thus, it makes sense to also consider a semantic cache apart from the standard cache.

In this way, we can further distinguish between two kinds of cache:

• Exact-Match Caching, that is, when we cache the original text or some normalized version of it. Then we hit the cache only with exact, word-for-word matches of the text. Exact-match caching can be implemented using a KV store like Redis.
• Semantic Caching, that is, creating an embedding of the text. Then we hit the cache with any text that is semantically similar to it and exceeds a predefined similarity score threshold (like cosine similarity above ~0.95). Since we are interested in the semantics of the texts and we perform a similarity search, a vector database, such as ChromaDB, would need to be used as the cache store.

Unlike Prompt Caching, where we get to use a cache built into the API service of the LLM, to implement caching in other stages of a RAG pipeline we have to use an external cache store, like the Redis or ChromaDB mentioned above. While this is a bit of a hassle, as we need to set up these cache stores ourselves, it also provides us with more control over the parametrization of the cache. For instance, we get to decide on our Cache Expiration policies, meaning how long a cached item stays valid and can be reused. This parameter of the cache is defined as Time-To-Live (TTL).
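To make the TTL idea concrete, here is a minimal in-memory sketch of an exact-match cache with per-item expiration. The class name and structure are illustrative only; in production you would let Redis handle expiry for you (e.g., `SET key value EX ttl`).

```python
import time


class ExactMatchCache:
    """Minimal in-memory key-value cache with a per-item TTL.

    Illustrative sketch: a real deployment would use an external
    store such as Redis, which supports TTLs natively.
    """

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and treat as a miss
            return None
        return value

    def set(self, key, value):
        # Each item gets its own expiration timestamp on insertion.
        self._store[key] = (value, time.monotonic() + self.ttl)
```

The same TTL knob applies to every cache layer discussed below, and, as we will see, different layers often want different TTL values.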

As illustrated in my previous posts, a very simple RAG pipeline looks something like this:

Even in the simplest form of a RAG pipeline, we already use a caching-like mechanism without even realizing it. That is, storing the embeddings in a vector database and retrieving them from there, instead of making requests to an embedding model every time and recalculating the embeddings. This is very straightforward and essentially a non-negotiable part (it would be silly of us not to do it) even of a very simple RAG pipeline, because the embeddings of the documents usually stay the same (we need to recalculate an embedding only when a document in the knowledge base is altered), so it makes sense to calculate them once and store them somewhere.

But apart from storing the knowledge base embeddings in a vector database, other parts of the RAG pipeline can also be reused, and we can benefit from applying caching to them. Let's see what these are in more detail!

    . . .

1. Query Embedding Cache

The first thing that happens in a RAG system when a query is submitted is that the query is transformed into an embedding vector, so that we can perform semantic search and retrieval against the knowledge base. Admittedly, this step is very lightweight in comparison to calculating the embeddings of the entire knowledge base. Nonetheless, in high-traffic applications, it can still add unnecessary latency and cost, and in any case, recalculating the same embeddings for the same queries over and over is wasteful.

So, instead of computing the query embedding every time from scratch, we can first check whether we have already computed the embedding for the same query before. If yes, we simply reuse the cached vector. If not, we generate the embedding once, store it in the cache, and make it available for future reuse.

In this case, our RAG pipeline would look something like this:

The most straightforward way to implement query embedding caching is by looking for an exact match of the raw user query. For example:

What area codes correspond to Athens, Greece?

However, we can also use a normalized version of the raw user query by performing some simple operations, like making it lowercase or stripping punctuation. In this way, the following queries…

What area codes correspond to athens greece?
What area codes correspond to Athens, Greece
what area codes correspond to Athens // Greece?

    … would all map to …

what area codes correspond to athens greece?
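A normalization like this is a few lines of standard-library Python. This sketch is slightly stricter than the example above in that it strips all punctuation, including the trailing question mark; exactly which characters to keep is a design choice.

```python
import re


def normalize_query(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    superficially different phrasings map to the same cache key."""
    text = raw.lower()
    text = re.sub(r"[^\w\s]", " ", text)    # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text
```

With this function, all three variants above collapse to the single key "what area codes correspond to athens greece".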

We then search for this normalized query in the KV store, and if we get a cache hit, we can directly use the embedding that is stored in the cache, without needing to make a request to the embedding model again. That's going to be an embedding looking something like this, for example:

    [0.12, -0.33, 0.88, ...]

In general, for the query embedding cache, the key-value pairs have the following format:

    question → embedding
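The whole lookup can be wrapped in a small helper. This is a sketch: `embed_fn` is a stand-in for whatever call you make to your embedding model, and a plain dict stands in for the external KV store.

```python
# query -> embedding; a real system would use Redis with a TTL instead.
_embedding_cache: dict = {}


def embed_query(query: str, embed_fn) -> list:
    """Return a cached embedding if we have one for this (already
    normalized) query; otherwise compute it once via embed_fn and cache it."""
    if query in _embedding_cache:
        return _embedding_cache[query]  # cache hit: no model call
    vector = embed_fn(query)            # cache miss: one embedding model call
    _embedding_cache[query] = vector
    return vector
```

Note that the function assumes normalization already happened upstream, so the dict key is stable across phrasing variants.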

As you may already imagine, the hit rate for this cache can increase significantly if we propose standardized queries to users within the app's UI, in addition to letting them type their own queries in free text.

    . . .

    2. Retrieval Cache

Caching can also be applied at the retrieval step of a RAG pipeline. This means that we can cache the retrieved results for a particular query, reducing the need to perform a full retrieval for similar queries. In this case, the key of the cache will be the raw or normalized user query, or the query embedding. The value we get back from the cache is the retrieved document chunks. So, our RAG pipeline with retrieval caching, either exact-match or semantic, would look something like this:

So, for our normalized query…

what area codes correspond to athens greece?

or from the query embedding…

    [0.12, -0.33, 0.88, ...]

we would directly get the retrieved chunks back from the cache:

    [
     chunk_12,
     chunk_98,
     chunk_42
    ]

In this way, when an identical or even sufficiently similar query is submitted, we already have the relevant chunks and documents in the cache, so there is no need to perform the retrieval step. In other words, even for queries that are only moderately similar (for example, cosine similarity above ~0.85), the exact response may not exist in the cache, but the relevant chunks and documents needed to answer the query often do.

In general, for the retrieval cache, the key-value pairs have the following format:

    question → retrieved_chunks
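A semantic variant of this cache can be sketched with plain Python: store (embedding, chunk ids) pairs and return the chunks of the most similar cached query when it clears the threshold. The class name and the ~0.85 default are illustrative; a real deployment would do this similarity search inside a vector database such as ChromaDB.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


class SemanticRetrievalCache:
    """Maps query embeddings to retrieved chunk ids. A query hits the
    cache when its embedding is close enough to a cached one."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self._entries = []  # list of (embedding, chunk_ids)

    def get(self, embedding):
        best = max(self._entries,
                   key=lambda e: cosine(embedding, e[0]),
                   default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # semantic hit
        return None         # miss: caller performs full retrieval

    def set(self, embedding, chunk_ids):
        self._entries.append((embedding, chunk_ids))
```

The linear scan over `_entries` is fine for a sketch but is exactly the part a vector store replaces with an approximate nearest-neighbor index.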

One might wonder how this is different from the query embedding cache. After all, if the query is the same, why not directly hit the retrieval cache, and why also include a query embedding cache? The answer is that in practice, the query embedding cache and the retrieval cache may have different TTL policies. That's because the documents in the knowledge base may change, and even if we have the same query or the same query embedding, the corresponding chunks may be different. This explains the usefulness of having the query embedding cache exist separately.

    . . .

    3. Reranking Cache

Another way to utilize caching in the context of RAG is by caching the results of the reranker model (if we use one). More specifically, this means that instead of passing the retrieved ranked results to a reranker model and getting back the reranked results, we directly get the reranked order from the cache, for a particular query and set of retrieved chunks. In this case, our RAG pipeline would look something like this:

In our Athens area codes example, for our normalized query:

what area codes correspond to athens greece?

and hypothetical retrieved and ranked chunks

    [
     chunk_12,
     chunk_98,
     chunk_42
    ]

we would directly get the reranked chunks as the output of the cache:

    [
    chunk_98,
    chunk_12,
    chunk_42
    ]

In general, for the reranking cache, the keys and values have the following format:

    (question + retrieved_chunks) → reranked_chunks
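Since the key here is a composite of the query and the ordered list of retrieved chunks, one simple option is to hash them together into a fixed-length string key. This helper is a sketch; its name and the choice of SHA-256 are illustrative.

```python
import hashlib


def rerank_cache_key(query: str, retrieved_chunks: list) -> str:
    """Build a deterministic cache key from the query plus the ordered
    list of retrieved chunk ids, hashed so keys stay short and uniform."""
    payload = query + "|" + "|".join(retrieved_chunks)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the chunk list is part of the key, the same query with a different retrieved set (or even a different retrieval order) produces a different key, which is exactly the behavior we want for a reranking cache.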

Again, one might wonder: if we hit the reranking cache, shouldn't we also always hit the retrieval cache? At first glance, this might seem true, but in practice, it's not necessarily the case.

One reason is that, as explained already, different caches may have different TTL policies. Even if the reranking result is still cached, the retrieval cache may have already expired, requiring the retrieval step to be performed from scratch.

But beyond this, in a complex RAG system, we are most likely going to use more than one retrieval mechanism (e.g., semantic search, BM25, etc.). As a result, we may hit the retrieval cache for one of the retrieval mechanisms, but not for all, and thus not hit the cache for reranking. Vice versa, we may hit the cache for reranking, but miss on the individual caches of the various retrieval mechanisms: we may end up with the same set of documents overall, but by retrieving different documents from each individual retrieval mechanism. For these reasons, the retrieval and reranking caches are conceptually and practically different.

    . . .

4. Prompt Assembly Cache

Another useful place to apply caching in a RAG pipeline is during the prompt assembly stage. That is, once retrieval and reranking are complete, the relevant chunks are combined with the system prompt and the user query to form the final prompt that is sent as input to the LLM. So, if the query, system prompt, and reranked chunks all match, then we hit the cache. This means that we don't need to reconstruct the final prompt again; we can get parts of it (the context) or even the entire final prompt directly from the cache.

Caching the prompt assembly step in a RAG pipeline would look something like this:

Continuing with our Athens example, suppose the user submits the query…

what area codes correspond to athens greece?

and after retrieval and reranking, we get the following chunks (either from the reranker or the reranking cache):

    [
    chunk_98,
    chunk_12,
    chunk_42
    ]

During the prompt assembly step, these chunks are combined with the system prompt and the user query to construct the final prompt that will be sent to the LLM. For example, the assembled prompt might look something like:

System: You are a helpful assistant that answers questions using the provided context.
    
Context:
[chunk_98]
[chunk_12]
[chunk_42]
    
User: what area codes correspond to athens greece?

In general, for the prompt assembly cache, the key-value pairs have the following format:

    (question + system_prompt + retrieved_chunks) → assembled_prompt
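A minimal sketch of this layer: the template string and the tuple key below are assumptions for illustration, and a plain dict again stands in for the external cache store.

```python
def assemble_prompt(system_prompt: str, chunks: list, user_query: str) -> str:
    """Combine system prompt, context chunks, and the user query into
    the final prompt string sent to the LLM."""
    context = "\n".join(chunks)
    return f"System: {system_prompt}\n\nContext:\n{context}\n\nUser: {user_query}"


# (query, system_prompt, chunks) -> assembled_prompt
_assembly_cache: dict = {}


def assemble_with_cache(system_prompt: str, chunks: list, user_query: str) -> str:
    """Reuse the assembled prompt when query, system prompt, and
    reranked chunks all match a previous request."""
    key = (user_query, system_prompt, tuple(chunks))
    if key not in _assembly_cache:
        _assembly_cache[key] = assemble_prompt(system_prompt, chunks, user_query)
    return _assembly_cache[key]
```

For a pure string concatenation like this, the savings are tiny; the cache earns its keep when assembly also runs guardrails or other per-prompt processing, as noted below.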

Admittedly, the computational savings here are smaller compared to the other caching layers mentioned above. Nonetheless, prompt assembly caching can still reduce latency and simplify prompt construction in high-traffic systems. In particular, it makes sense to implement in systems where prompt assembly is complex and consists of more operations than a simple concatenation, like inserting guardrails.

    . . .

5. Query-Response Caching

Last but not least, we can cache pairs of entire queries and responses. Intuitively, when we talk about caching, the first thing that comes to mind is caching query and response pairs. And this can be the ultimate jackpot for our RAG pipeline, as in this case, we don't need to run any of it, and we can provide a response to the user's query using only the cache.

More specifically, in this case, we store entire query-final response pairs in the cache, and completely avoid any retrieval (in the case of RAG) and regeneration of a response. In this way, instead of retrieving relevant chunks and generating a response from scratch, we directly get a precomputed response, which was generated at some earlier time for the same or a similar query.

To safely implement query-response caching, we either need to use exact matches in the form of a key-value cache, or use semantic caching with a very strict threshold (like 0.99 cosine similarity between the user query and the cached query).

So, our RAG pipeline with query-response caching would look something like this:

Continuing with our Athens example, suppose a user asks the query:

what area codes correspond to athens greece?

Assume that earlier, the system already processed this query through the full RAG pipeline, retrieving relevant chunks, reranking them, assembling the prompt, and generating the final answer with the LLM. The generated response might look something like:

The main telephone area code for Athens, Greece is 21. 
Numbers in the Athens metropolitan area typically start with the prefix 210, 
followed by the local subscriber number.

The next time an identical or extremely similar query appears, the system doesn't need to run the retrieval, reranking, or generation steps again. Instead, it can immediately return the cached response.

In general, for the query-response cache, the key-value pairs have the following format:

    question → final_response
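Wiring it all together, the exact-match variant of this top-level cache is a few lines. Here `run_pipeline` is a hypothetical stand-in for the full chain (retrieval, reranking, prompt assembly, and generation), and a dict stands in for the KV store.

```python
def answer_with_cache(query: str, qr_cache: dict, run_pipeline) -> str:
    """Return a cached final response for an exact-match query;
    otherwise run the full RAG pipeline once and cache its answer."""
    if query in qr_cache:
        return qr_cache[query]   # hit: the entire pipeline is skipped
    response = run_pipeline(query)  # miss: pay the full cost once
    qr_cache[query] = response
    return response
```

The semantic variant works the same way, except the dict lookup is replaced by a vector-store similarity search with the strict (~0.99) threshold mentioned above.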

    . . .

On my mind

Apart from the Prompt Caching directly provided in the API services of the various LLMs, several other caching mechanisms can be applied in a RAG application to achieve cost and latency savings. More specifically, we can utilize caching mechanisms in the form of a query embedding cache, retrieval cache, reranking cache, prompt assembly cache, and query-response cache. In practice, in a real-world RAG application, many or all of these cache stores can be used in combination to provide improved performance in terms of cost and time as the users of the app scale.


Loved this post? Let's be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!

All images by the author, unless mentioned otherwise.



