    How to Context Engineer to Optimize Question Answering Pipelines

By ProfitlyAI · September 5, 2025 · 9 min read


Context engineering is among the most relevant topics in machine learning today, which is why I'm writing my third article on the subject. My goal is both to broaden my own understanding of engineering contexts for LLMs and to share that knowledge through my articles.

In today's article, I'll discuss improving the context you feed into your LLMs for question answering. Usually, this context is based on retrieval-augmented generation (RAG); however, in today's ever-shifting environment, this approach needs updating.

The co-founder of Chroma (a vector database provider) tweeted that RAG is dead. I don't fully agree that we won't use RAG anymore, but his tweet highlights that there are different options for filling the context of your LLM.

You can also read my previous context engineering articles:

    1. Basic Context engineering techniques
    2. Advanced context engineering techniques


Why you should care about context engineering

First, let me highlight three key reasons why you should care about context engineering:

    • Better output quality by avoiding context rot. Fewer unnecessary tokens improve output quality; you can read more details about it in this article
    • Cheaper (don't send unnecessary tokens; they cost money)
    • Speed (fewer tokens = faster response times)

These are three core metrics for most question answering systems. Output quality is naturally of utmost priority, considering users won't want to use a low-performing system.

Furthermore, price should always be a consideration, and if you can lower it (without too much engineering cost), it's a simple decision to do so. Lastly, a faster question answering system provides a better user experience. You don't want users waiting numerous seconds for a response when ChatGPT will answer much faster.

The traditional question-answering approach

Traditional, in this sense, means the most common question answering approach in systems built after the release of ChatGPT. This approach is conventional RAG, which works as follows:

    1. Fetch the most relevant documents for the user's question, using vector similarity retrieval
    2. Feed the relevant documents, together with the question, into an LLM, and receive a response (see the sketch below)
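
Below is a minimal sketch of this two-step pipeline. The embedding model, chat model, and prompt wording are all assumptions; any embedding API and LLM would work the same way.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    # Any embedding provider works here; OpenAI is used as an example.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question: str, docs: list[str], doc_vecs: np.ndarray, k: int = 5) -> str:
    # Step 1: fetch the k most relevant documents via cosine similarity.
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])

    # Step 2: feed the documents and the question into an LLM.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer using the context.\n\n"
                              f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content
```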

Considering its simplicity, this approach works incredibly well. Interestingly enough, we also see this with another traditional approach: BM25 has been around since 1994 and was, for example, recently used by Anthropic when they launched Contextual Retrieval, proving how effective even simple information retrieval techniques are.

Nonetheless, you can still vastly improve your question answering system by updating your RAG with some of the techniques I describe in the next section.

Improving RAG context fetching

Even though RAG works relatively well, you can likely achieve better performance by introducing the techniques I discuss in this section. They all focus on improving the context you feed to the LLM, which you can do with two main approaches:

    1. Use fewer tokens on irrelevant context (for example, removing, or using less material from, relevant documents)
    2. Add documents that are relevant

Thus, you should focus on achieving one of the points above. If you think in terms of precision and recall, the two approaches correspond to:

    1. Increasing precision (at the cost of recall)
    2. Increasing recall (at the cost of precision)

This is a tradeoff you must make while working on context engineering for your question answering system. The toy example below makes it concrete.
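
Here is how retrieval precision and recall are computed against a hand-labeled set of relevant document IDs (the IDs are made up for illustration):

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Fetching more documents tends to raise recall but lower precision:
print(precision_recall({"d1"}, {"d1", "d3"}))                    # (1.0, 0.5)
print(precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3"}))  # (0.5, 1.0)
```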

Reducing the number of irrelevant tokens

In this section, I highlight three main approaches to reduce the number of irrelevant tokens you feed into the LLM's context:

    • Reranking
    • Summarization
    • Prompting GPT

When you fetch documents with vector similarity search, they are returned in order from most relevant to least relevant, according to the vector similarity score. However, this similarity score might not accurately represent which documents are most relevant.

    Reranking

You can thus use a reranking model, for example, the Qwen reranker, to reorder the document chunks. You can then choose to keep only the top X most relevant chunks (according to the reranker), which should remove some irrelevant documents from your context.
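
Here is a minimal sketch of that reranking step. I use a small cross-encoder from the sentence-transformers library for illustration; the Qwen reranker mentioned above would slot in the same way with its own loading code.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[str], keep_top: int = 5) -> list[str]:
    # Score every (question, chunk) pair, then keep the top X chunks.
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep_top]]
```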

    Summarization

You can also choose to summarize documents, reducing the number of tokens used per document. For example, you can keep the full text of the top 10 most relevant documents fetched, summarize the documents ranked 11-20, and discard the rest.

This approach increases the likelihood that you keep the full context of relevant documents, while at least maintaining some context (the summary) from documents that are less likely to be relevant.
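
A sketch of that tiering might look as follows; the model choice and summarization prompt are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def summarize(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Summarize this document in 3 sentences:\n{doc}"}],
    )
    return resp.choices[0].message.content

def tier_documents(ranked_docs: list[str]) -> list[str]:
    kept = ranked_docs[:10]                                 # full text
    summaries = [summarize(d) for d in ranked_docs[10:20]]  # compressed
    return kept + summaries                                 # ranks 21+ dropped
```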

    Prompting GPT

Finally, you can also prompt GPT to judge whether the fetched documents are relevant to the user query. For example, if you fetch 15 documents, you can make 15 individual LLM calls to assess whether each document is relevant, and then discard the documents deemed irrelevant. Keep in mind that these LLM calls should be parallelized to keep the response time within an acceptable limit.
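
A sketch of this relevance filter is below. The model and prompt wording are assumptions; the important part is the thread pool, which keeps total latency close to that of a single call.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def is_relevant(question: str, doc: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Question: {question}\n\nDocument:\n{doc}\n\n"
                              "Is this document relevant to the question? "
                              "Answer only YES or NO."}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def filter_relevant(question: str, docs: list[str]) -> list[str]:
    # One parallel LLM call per fetched document.
    with ThreadPoolExecutor(max_workers=15) as pool:
        flags = list(pool.map(lambda d: is_relevant(question, d), docs))
    return [d for d, keep in zip(docs, flags) if keep]
```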

Adding relevant documents

Before or after removing irrelevant documents, you should also ensure you include relevant ones. I cover two main approaches in this subsection:

    • Better embedding models
    • Searching through more documents (at the cost of lower precision)

Better embedding models

To find the best embedding models, you can visit the HuggingFace embedding model leaderboard, where Gemini and Qwen are in the top 3 as of the writing of this article. Updating your embedding model is usually a cheap way to fetch more relevant documents, because running and storing embeddings is typically inexpensive: for example, embedding through the Gemini API and storing the vectors in Pinecone.
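
Swapping the embedding model is then mostly a re-embedding job. A sketch, assuming a sentence-transformers-compatible model from the Qwen family (the exact model name is an example, not a recommendation):

```python
from sentence_transformers import SentenceTransformer

new_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

corpus = ["first document ...", "second document ..."]
new_vectors = new_model.encode(corpus, normalize_embeddings=True)

# Upsert `new_vectors` into a fresh index in your vector store (e.g.
# Pinecone), point retrieval at it, and compare answer quality before
# switching over.
```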

Search more documents

Another (relatively simple) way to fetch more relevant documents is simply to fetch more documents in general. This naturally increases the likelihood that you add relevant ones. However, you have to balance this against avoiding context rot and keeping the number of irrelevant documents to a minimum. Every unnecessary token in an LLM call is, as stated earlier, likely to:

    • Reduce output quality
    • Increase cost
    • Lower speed

These are all important aspects of a question-answering system.

Agentic search approach

I've discussed agentic search approaches in previous articles, for example, when I covered Scaling your AI Search. In this section, however, I'll dive deeper into setting up an agentic search, which replaces some or all of the vector retrieval step in your RAG.

The first step is that the user poses their question against a given set of data points, for example, a set of documents. You then set up an agentic system consisting of an orchestra agent and a list of sub-agents.

Figure: An orchestra system of LLM agents. The main agent receives the user query and assigns tasks to subagents. Image by ChatGPT.

This is an example of the pipeline the agents could follow (though there are many ways to set it up):

    1. The orchestra agent tells two subagents to iterate over all document filenames and return the relevant documents
    2. The relevant documents are fed back to the orchestra agent, which then dispatches a subagent to each relevant document to fetch the subparts (chunks) of the document that are relevant to the user's question. These chunks are fed back to the orchestra agent
    3. The orchestra agent answers the user's question, given the provided chunks

Another flow you could implement is to store document embeddings and replace step one with vector similarity between the user question and each document.
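
Here is a compressed sketch of the orchestrator flow, with every agent reduced to a single prompted LLM call. The model and prompts are assumptions, and in practice the per-document calls in step two should run in parallel:

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def agentic_answer(question: str, docs: dict[str, str]) -> str:
    # Step 1: a sub-agent scans filenames and flags relevant documents.
    names = "\n".join(docs)
    relevant = llm(f"Question: {question}\nFiles:\n{names}\n"
                   "Return the relevant filenames, one per line.").splitlines()

    # Step 2: one sub-agent per relevant document extracts relevant chunks.
    chunks = [llm(f"Question: {question}\nDocument:\n{docs[n]}\n"
                  "Return only the passages relevant to the question.")
              for n in relevant if n in docs]

    # Step 3: the orchestrator answers from the collected chunks.
    return llm(f"Question: {question}\nContext:\n" + "\n\n".join(chunks))
```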

This agentic approach has upsides and downsides.

    Upsides:

    • A better chance of fetching relevant chunks than with traditional RAG
    • More control over the RAG system. You can update system prompts, etc., while RAG is relatively static with its embedding similarities

Downside:

    • Higher cost, since the system makes many more LLM calls than a single vector lookup

In my opinion, building such an agent-based retrieval system is a super powerful approach that can lead to amazing results. The consideration you have to make when building such a system is whether the increased quality you will (likely) see is worth the increase in cost.

Other context engineering aspects

In this article, I've primarily covered context engineering for the documents we fetch in a question answering system. However, there are other aspects you should be aware of as well, primarily:

    • The system/user prompt you are using
    • Other information fed into the prompt

The prompt you write for your question answering system should be precise, structured, and free of irrelevant information. You can read plenty of other articles on structuring prompts, and you can often ask an LLM to improve these aspects of your prompt.

Sometimes, you also feed other information into your prompt. A typical example is feeding in metadata, for instance, information about the user, such as:

    • Name
    • Job role
    • What they usually search for
    • etc.

Whenever you add such information, you should always ask yourself:

Does adding this information help my question answering system answer the question?

Sometimes the answer is yes, other times it's no. The most important part is that you make a rational decision about whether the information is required in the prompt. If you can't justify having the information in the prompt, it should usually be removed.
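
One way to enforce that decision in code is a prompt builder that only includes metadata fields you can justify; the field names here are illustrative:

```python
def build_prompt(question: str, context: str, user: dict | None = None) -> str:
    parts = ["You are a question answering assistant.",
             f"Context:\n{context}"]
    if user and user.get("job_role"):
        # Only included when justified, e.g. role can shape terminology.
        parts.append(f"The user is a {user['job_role']}.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```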

    Conclusion

In this article, I've discussed context engineering for your question answering system, and why it's important. Question answering systems usually consist of an initial step that fetches relevant information. The focus of this step should be to reduce the number of irrelevant tokens to a minimum, while also including as many relevant pieces of information as possible.

👉 Find me on socials:

    🧑‍💻 Get in touch

    🔗 LinkedIn

    🐦 X / Twitter

    ✍️ Medium

You can also read my in-depth article on Anthropic's contextual retrieval.



