
    RAG Explained: Reranking for Better Answers

    By ProfitlyAI · September 24, 2025 · 11 min read


    In the previous post, we took a look at how the retrieval mechanism of a RAG pipeline works. In a RAG pipeline, relevant documents from a knowledge base are identified and retrieved based on how similar they are to the user's query. More specifically, the similarity of each text chunk is quantified using a retrieval metric such as cosine similarity, L2 distance, or dot product; the text chunks are then ranked by their similarity scores, and finally we pick the top text chunks that are most similar to the user's query.
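
    As a minimal sketch of that ranking step (illustrative only – the array names and the embedding step are assumed here, and this is not the pipeline built later in this post), the top-k selection boils down to something like this:

    import numpy as np

    def top_k_chunks(query_vec, chunk_vecs, k=2):
        # normalize so that the dot product equals cosine similarity
        q = query_vec / np.linalg.norm(query_vec)
        C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
        scores = C @ q                          # one cosine similarity per chunk
        top = np.argsort(scores)[::-1][:k]      # indices of the k most similar chunks
        return top, scores[top]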

    Unfortunately, high similarity scores don't always guarantee perfect relevance. In other words, the retriever may retrieve a text chunk that has a high similarity score but is in fact not that useful – simply not what we need to answer our user's question 🤷🏻‍♀️. This is where re-ranking comes in, as a way to refine the results before feeding them into the LLM.

    As in my previous posts, I'll once again be using the War and Peace text as an example, licensed as Public Domain and easily accessible through Project Gutenberg.

    • • •

    What about Reranking?

    Text chunks retrieved solely based on a retrieval metric – that is, raw retrieval – may not be that useful, for several different reasons:

    • The retrieved chunks we end up with may vary widely with the chosen number of top chunks k – depending on how many top chunks we retrieve, we may get very different results.
    • We may retrieve chunks that are semantically close to what we're looking for, but still off-topic and, in reality, not appropriate for answering the user's query.
    • We may get partial matches to specific words included in the user's query, leading to chunks that contain those specific words but are actually irrelevant.

    Back to my favorite question from the 'War and Peace' example: if we ask 'Who is Anna Pávlovna?' and use a very small k (like k = 2), the retrieved chunks may not contain enough information to comprehensively answer the question. Conversely, if we allow many chunks to be retrieved (say k = 20), we're likely to also retrieve some irrelevant text chunks where 'Anna Pávlovna' is merely mentioned but isn't the topic of the chunk. Thus, the meaning of some of these chunks will be unrelated to the user's query and useless for answering it. Therefore, we need a way to distinguish the truly relevant text chunks out of all the retrieved chunks.

    Here, it's worth clarifying that one seemingly simple solution would be to just retrieve everything and pass everything to the generation step (to the LLM). Unfortunately, this can't be done, for a bunch of reasons: LLMs have finite context windows, and their performance degrades when we overstuff them with information.

    So, this is the issue we try to tackle by introducing the reranking step. In essence, reranking means re-evaluating the chunks that were retrieved based on the cosine similarity scores with a more accurate, but also more expensive and slower, method.

    Image by author – trying to fit everything I've mentioned so far into a single diagram 😅

    There are various methods for doing this, such as cross-encoders, using an LLM to do the reranking, or using heuristics. Ultimately, by introducing this extra reranking step, we essentially implement what is called two-stage retrieval with reranking, which is a common industry approach. This improves the relevance of the retrieved text chunks and, consequently, the quality of the generated responses.
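
    For instance, the LLM-as-reranker variant mentioned above could look roughly like the sketch below. This is not the approach implemented later in this post (a cross-encoder is used instead), and the prompt, model choice, and score parsing are assumptions for illustration only:

    # hypothetical sketch of LLM-based reranking (assumes OPENAI_API_KEY is set)
    from langchain.chat_models import ChatOpenAI

    llm_judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def llm_rerank(query, chunks, top_n=2):
        scored = []
        for chunk in chunks:
            # ask the LLM to grade each chunk's relevance on a 0-10 scale
            prompt = (
                "Rate from 0 to 10 how useful the passage is for answering the question. "
                "Reply with a single number.\n\n"
                f"Question: {query}\n\nPassage: {chunk}"
            )
            reply = llm_judge.invoke(prompt).content.strip()
            try:
                score = float(reply)
            except ValueError:
                score = 0.0  # fall back if the reply is not a bare number
            scored.append((score, chunk))
        scored.sort(key=lambda pair: pair[0], reverse=True)  # higher score first
        return scored[:top_n]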

    So, let's take a more detailed look… 🔍

    • • •

    Reranking with a Cross-Encoder

    Cross-encoders are the standard models used for reranking in a RAG framework. Unlike the retrieval functions used in the initial retrieval step, which only consider the similarity scores of the different text chunks, cross-encoders are able to perform a more in-depth comparison of each retrieved text chunk with the user's query. More specifically, a cross-encoder jointly encodes a document and the user's query and produces a relevance score. On the flip side, in cosine similarity-based retrieval, the document and the user's query are embedded separately from one another, and then their similarity is calculated. As a result, some information from the original texts is lost when the embeddings are created separately, while more information is preserved when the texts are encoded jointly. Consequently, a cross-encoder can better assess the relevance between two texts (that is, the user's query and a document).
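
    To make the distinction concrete, here is a minimal sketch (assuming the sentence-transformers library is installed; the checkpoint names below are just common public models, not the ones used later in this post) that scores the same query–passage pair both ways:

    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    query = "Who is Anna Pávlovna?"
    passage = "Anna Pávlovna Schérer was maid of honor to the Empress Márya Fëdorovna."

    # bi-encoder: query and passage are embedded independently, then compared
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb, p_emb = bi_encoder.encode([query, passage])
    print("bi-encoder cosine similarity:", util.cos_sim(q_emb, p_emb).item())

    # cross-encoder: query and passage are encoded together as a single input,
    # producing one relevance score directly
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    print("cross-encoder score:", cross_encoder.predict([(query, passage)])[0])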

    So why not use a cross-encoder in the first place? The answer is that cross-encoders are very slow. For instance, a cosine similarity search over about 1,000 passages takes less than a millisecond. In contrast, using only a cross-encoder (like ms-marco-MiniLM-L-6-v2) to search the same set of 1,000 passages for a single query would be orders of magnitude slower!

    This is to be expected if you think about it, since using a cross-encoder means that we have to pair every chunk of the knowledge base with the user's query and encode them on the spot, for every new query. In contrast, with cosine similarity-based retrieval, we get to create all the embeddings of the knowledge base beforehand, and just once; then, once the user submits a query, we only need to embed the query and calculate the pairwise cosine similarities.
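
    A rough way to see this asymmetry (a sketch, not a benchmark – the dummy corpus, the models, and the resulting timings are all assumptions and will vary with hardware) is to time the per-query work of each approach over the same set of passages:

    import time
    from sentence_transformers import SentenceTransformer, CrossEncoder

    passages = [f"passage number {i}" for i in range(1000)]   # dummy corpus
    bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
    cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    # one-off cost: embed the whole corpus once, normalized for cosine similarity
    corpus = bi_encoder.encode(passages, normalize_embeddings=True)

    query = "Who is Anna Pávlovna?"

    start = time.time()
    q = bi_encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus @ q                                       # cosine similarities
    print(f"bi-encoder per-query work: {time.time() - start:.4f}s")

    start = time.time()
    ce_scores = cross_encoder.predict([(query, p) for p in passages])
    print(f"cross-encoder over 1,000 pairs: {time.time() - start:.4f}s")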

    For that reason, we adjust our RAG pipeline accordingly and get the best of both worlds: first, we narrow down the candidate relevant chunks with the cosine similarity search, and then, in a second step, we assess the relevance of the retrieved chunks more accurately with a cross-encoder.

    • • •

    Back to the 'War and Peace' Example

    So now let's see how all this plays out in the 'War and Peace' example by answering, one more time, my favorite question – 'Who is Anna Pávlovna?'.

    My code so far looks something like this:

    import os
    from langchain.chat_models import ChatOpenAI
    from langchain.document_loaders import TextLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.docstore.document import Document
    
    import faiss
    
    api_key = "my_api_key"
    
    # initialize the LLM
    llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)
    
    # initialize the embeddings model
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)
    
    # load the documents to be used for RAG
    text_folder = "RAG files"
    
    documents = []
    for filename in os.listdir(text_folder):
        if filename.lower().endswith(".txt"):
            file_path = os.path.join(text_folder, filename)
            loader = TextLoader(file_path)
            documents.extend(loader.load())
    
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    split_docs = []
    for doc in documents:
        chunks = splitter.split_text(doc.page_content)
        for chunk in chunks:
            split_docs.append(Document(page_content=chunk))
            
    documents = split_docs
    
    # normalize the knowledge base embeddings
    import numpy as np
    def normalize(vectors):
        vectors = np.array(vectors)
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / norms
    
    doc_texts = [doc.page_content for doc in documents]
    doc_embeddings = embeddings.embed_documents(doc_texts)
    doc_embeddings = normalize(doc_embeddings)
    
    # FAISS index with inner product
    dimension = doc_embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)  # inner product index
    index.add(doc_embeddings.astype(np.float32))  # FAISS expects float32
    
    # create the vector database with FAISS
    vector_store = FAISS(embedding_function=embeddings, index=index, docstore=None, index_to_docstore_id=None)
    vector_store.docstore = {i: doc for i, doc in enumerate(documents)}
    
    def main():
        print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
        
        while True:
            user_input = input("You: ").strip()
            if user_input.lower() == "exit":
                print("Exiting…")
                break
    
            # embed + normalize the query
            query_embedding = embeddings.embed_query(user_input)
            query_embedding = normalize([query_embedding]).astype(np.float32)
    
            # search the FAISS index
            D, I = index.search(query_embedding, k=2)
            
            # get the relevant documents
            relevant_docs = [vector_store.docstore[i] for i in I[0]]
            retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])
            
            # D contains inner product scores == cosine similarities (since normalized)
            print("\nTop chunks and their cosine similarity scores:\n")
            for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
                print(f"Chunk {rank}:")
                print(f"Cosine similarity: {score:.4f}")
                print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")
                    
            # system prompt
            system_prompt = (
                "You are a helpful assistant. "
                "Use ONLY the following knowledge base context to answer the user. "
                "If the answer is not in the context, say you don't know.\n\n"
                f"Context:\n{retrieved_context}"
            )
    
            # messages for the LLM
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input}
            ]
    
            # generate the response
            response = llm.invoke(messages)
            assistant_message = response.content.strip()
            print(f"\nAssistant: {assistant_message}\n")
    
    if __name__ == "__main__":
        main()
        

    For k = 2, we get the following top chunks retrieved.

    But if we set k = 6, we get the following chunks retrieved, and a somewhat more informative answer, containing more facts relevant to our question, like the fact that she is 'maid of honor and favorite of the Empress Márya Fëdorovna'.

    Now, let's adjust our code to rerank these 6 chunks and see if the top 2 remain the same. To do this, we will be using a cross-encoder model to re-rank the top-k retrieved documents before passing them to the LLM. More specifically, I will be using cross-encoder/ms-marco-TinyBERT-L-2, a simple, pre-trained cross-encoder model running on top of PyTorch. To do so, we also need to import torch and the CrossEncoder class from the sentence-transformers library.

    import torch
    from sentence_transformers import CrossEncoder

    Then we can initialize the cross-encoder and define a function for reranking the top k chunks retrieved from the vector search:

    # initialize the cross-encoder model
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2', device='cuda' if torch.cuda.is_available() else 'cpu')
    
    def rerank_with_cross_encoder(query, relevant_docs):
        
        pairs = [(query, doc.page_content) for doc in relevant_docs]  # (query, document) pairs for the cross-encoder
        scores = cross_encoder.predict(pairs)  # relevance scores from the cross-encoder model
        
        ranked_indices = np.argsort(scores)[::-1]  # sort documents by cross-encoder score (higher is better)
        ranked_docs = [relevant_docs[i] for i in ranked_indices]
        ranked_scores = [scores[i] for i in ranked_indices]
        
        return ranked_docs, ranked_scores
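
    As a quick sanity check, the function can be called on its own with a couple of toy Document objects (the texts below are made up for illustration) before wiring it into the pipeline:

    from langchain.docstore.document import Document
    
    toy_docs = [
        Document(page_content="Anna Pávlovna Schérer hosted a soirée in St. Petersburg."),
        Document(page_content="The Rostóvs were preparing for a name-day celebration."),
    ]
    
    docs, scores = rerank_with_cross_encoder("Who is Anna Pávlovna?", toy_docs)
    for doc, score in zip(docs, scores):
        print(f"{score:.4f}  {doc.page_content}")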

    … and also adjust the main function as follows:

        ...
    
        # search the FAISS index
        D, I = index.search(query_embedding, k=6)
        
        # get the relevant documents
        relevant_docs = [vector_store.docstore[i] for i in I[0]]
        
        # rerank with our function
        reranked_docs, reranked_scores = rerank_with_cross_encoder(user_input, relevant_docs)
        
        # keep only the top reranked chunks as the context
        retrieved_context = "\n\n".join([doc.page_content for doc in reranked_docs[:2]])
        
        # D contains inner product scores == cosine similarities (since normalized)
        print("\nTop 6 Retrieved Chunks:\n")
        for rank, (idx, score) in enumerate(zip(I[0], D[0]), start=1):
            print(f"Chunk {rank}:")
            print(f"Similarity: {score:.4f}")
            print(f"Content:\n{vector_store.docstore[idx].page_content}\n{'-'*40}")
    
        # display the top reranked chunks
        print("\nTop 2 Re-ranked Chunks:\n")
        for rank, (doc, score) in enumerate(zip(reranked_docs[:2], reranked_scores[:2]), start=1):
            print(f"Rank {rank}:")
            print(f"Reranker Score: {score:.4f}")
            print(f"Content:\n{doc.page_content}\n{'-'*40}")
               
        ...

    … and finally, these are the top 2 chunks, and the respective answer we get, after re-ranking with the cross-encoder:

    Notice how these 2 chunks are different from the top 2 chunks we got from the vector search.

    Thus, the importance of the reranking step becomes clear. We use the vector search to narrow down the possibly relevant chunks out of all the available documents in the knowledge base, and then use the reranking step to accurately identify the most relevant chunks among them.

    Image by author

    We can think of the two-step retrieval as a funnel: the first stage pulls in a wide set of candidate chunks, and the reranking stage filters out the irrelevant ones. What's left is the most useful context, leading to clearer and more accurate answers.

    • • •

    On my mind

    So, it becomes apparent that reranking is a crucial step for building a robust RAG pipeline. Essentially, it allows us to bridge the gap between the fast but not so precise vector search and context-aware answers. By performing a two-step retrieval, with the vector search as the first step and the reranking as the second, we get the best of both worlds: efficiency at scale and higher-quality responses. In practice, this two-stage approach is what makes modern RAG pipelines both practical and powerful.

    • • •

    Loved this post? Let's be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!

    • • •

    What about pialgorithms?

    Looking to bring the power of RAG into your organization?

    pialgorithms can do it for you 👉 book a demo today!


