In my previous posts, I walked through building a simple RAG pipeline using OpenAI’s API, LangChain, and local files, as well as efficiently chunking large text files. Those posts cover the basics of setting up a RAG pipeline that can generate responses based on the content of local files.
So far, we’ve talked about reading the documents from wherever they are stored, splitting them into text chunks, and then creating an embedding for each chunk. After that, we somehow magically select the embeddings that are appropriate for the user’s query and generate a relevant response. But it’s important to understand in more depth how the retrieval step of RAG actually works.
Thus, in this post, we’ll take things a step further by taking a closer look at how the retrieval mechanism works and analyzing it in more detail. As in my previous posts, I will be using the War and Peace text as an example, licensed as Public Domain and easily accessible through Project Gutenberg.
What about the embeddings?
In order to understand how the retrieval step of the RAG framework works, it’s essential to first understand how text is transformed into and represented by embeddings. For LLMs to handle any text, it must be in the form of a vector, and to perform this transformation we need to use an embedding model.
An embedding is a vector representation of data (in our case, text) that captures its semantic meaning. Each word or sentence of the original text is mapped to a high-dimensional vector. Embedding models used to perform this transformation are designed in such a way that similar meanings result in vectors that are close to one another in the vector space. For example, the vectors for the words happy and joyful would be close to one another in the vector space, while the vector for the word sad would be far from them.
To create high-quality embeddings that work effectively in a RAG pipeline, one needs to use pretrained embedding models, like BERT and GPT. There are various kinds of embeddings one can create, and corresponding models available for each; a short code sketch follows the list below. For instance:
- Word Embeddings: In word embeddings, every word has a fixed vector regardless of context. Popular models for creating this type of embedding are Word2Vec and GloVe.
- Contextual Embeddings: Contextual embeddings take into account that the meaning of a word can change based on context. Take, for instance, the bank of a river versus opening a bank account. Some models that can be used for generating contextual embeddings are BERT, RoBERTa, and GPT.
- Sentence Embeddings: These are embeddings capturing the meaning of full sentences. Respective models that can be used are Sentence-BERT or USE.
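To make the happy/joyful/sad example concrete, here is a minimal sketch using the same OpenAIEmbeddings wrapper that appears in the full script later in this post. It assumes an OpenAI API key is available in the environment, and the exact numbers and dimensionality will depend on the embedding model used.
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

# assumes the OPENAI_API_KEY environment variable is set
embeddings = OpenAIEmbeddings()

# embed three words and compare their positions in the vector space
vectors = {w: np.array(embeddings.embed_query(w)) for w in ["happy", "joyful", "sad"]}

print(len(vectors["happy"]))  # dimensionality of the embedding (model-dependent)

# smaller distance = closer in the vector space
print(np.linalg.norm(vectors["happy"] - vectors["joyful"]))  # expected to be relatively small
print(np.linalg.norm(vectors["happy"] - vectors["sad"]))     # expected to be larger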
In any case, text must be transformed into vectors to be usable in computations. These vectors are merely representations of the text. In other words, the vectors and numbers have no inherent meaning on their own. Instead, they are useful because they capture similarities and relationships between words or phrases in a mathematical form.
For instance, we could imagine a tiny vocabulary consisting of the words king, queen, woman, and man, and assign each of them an arbitrary vector.
king = [0.25, 0.75]
queen = [0.23, 0.77]
man = [0.15, 0.80]
woman = [0.13, 0.82]
Then, we could try some vector operations like:
king - man + woman
= [0.25, 0.75] - [0.15, 0.80] + [0.13, 0.82]
= [0.23, 0.77]
≈ queen 👑
Notice how the semantics of the words and the relationships between them are preserved after mapping them into vectors, allowing us to perform such operations.
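We can verify this little calculation with a few lines of NumPy; the vectors here are the same made-up 2-D toy values as above, not real embeddings.
import numpy as np

# toy 2-D "embeddings" from the example above
king = np.array([0.25, 0.75])
queen = np.array([0.23, 0.77])
man = np.array([0.15, 0.80])
woman = np.array([0.13, 0.82])

result = king - man + woman
print(result)                      # [0.23 0.77]
print(np.allclose(result, queen))  # True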
So, an embedding is just that: a mapping of words to vectors that aims to preserve meaning and the relationships between words, allowing us to perform computations with them. We can even visualize these dummy vectors in a vector space to see how related words cluster together.

The difference between these simple vector examples and the real vectors produced by embedding models is that actual embedding models generate vectors with hundreds of dimensions. Two-dimensional vectors are useful for building intuition about how meaning can be mapped into a vector space, but they are far too low-dimensional to capture the complexity of real language and vocabulary. That’s why real embedding models work with much higher dimensions, typically in the hundreds or even thousands. For example, Word2Vec produces 300-dimensional vectors, while BERT Base produces 768-dimensional vectors. This higher dimensionality allows embeddings to capture the many facets of real language, like meaning, usage, syntax, and the context of words and phrases.
Assessing the similarity of embeddings
After the text is transformed into embeddings, inference becomes vector math. This is exactly what allows us to identify and retrieve relevant documents in the retrieval step of the RAG framework. Once we turn both the user’s query and the knowledge base documents into vectors using an embedding model, we can compute how similar they are using cosine similarity.
Cosine similarity is a measure of how similar two vectors (embeddings) are. Given two vectors A and B, cosine similarity is calculated as follows:
cos(A, B) = (A · B) / (‖A‖ ‖B‖)
Simply put, cosine similarity is the cosine of the angle between the two vectors, and it ranges from -1 to 1. More specifically:
- 1 indicates that the vectors are semantically identical (e.g., car and automobile).
- 0 indicates that the vectors have no semantic relationship (e.g., banana and justice).
- -1 indicates that the vectors are semantically opposite (e.g., hot and cold).
In practice, however, values near -1 are extremely rare in embedding models. This is because even semantically opposite words (like hot and cold) often occur in similar contexts (e.g., it’s getting hot and it’s getting cold). For cosine similarity to reach -1, the words themselves and their contexts would both have to be completely opposite, something that doesn’t really happen in natural language. As a result, even opposite words typically have embeddings that are still somewhat close in meaning.
Other similarity metrics besides cosine similarity do exist, such as the dot product or Euclidean distance, but these are not normalized and are magnitude-dependent, making them less suitable for comparing text embeddings. As a result, cosine similarity is the dominant metric for quantifying the similarity between embeddings.
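As an illustration, here is a minimal NumPy sketch of the formula above, alongside the two magnitude-dependent alternatives; the vectors are just the toy 2-D values from earlier.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between a and b: (a · b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.25, 0.75])  # "king"
b = np.array([0.23, 0.77])  # "queen"

print(cosine_similarity(a, b))  # close to 1: nearly the same direction
print(np.dot(a, b))             # dot product: depends on vector magnitudes
print(np.linalg.norm(a - b))    # Euclidean distance: also magnitude-dependent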
Back in our RAG pipeline, by calculating the cosine similarity between the user’s query embedding and the knowledge base embeddings, we can identify the chunks of text that are most similar, and therefore contextually relevant, to the user’s question, retrieve them, and then use them to generate the answer.
Finding the top k similar chunks
So, after getting the embeddings of the knowledge base and the embedding(s) for the user’s query text, this is where the magic happens. What we essentially do is calculate the cosine similarity between the user query embedding and each of the knowledge base embeddings. Thus, for every text chunk of the knowledge base, we get a score between -1 and 1 indicating the chunk’s similarity to the user’s query.
Once we have the similarity scores, we sort them in descending order and select the top k chunks. These top k chunks are then passed into the generation step of the RAG pipeline, allowing it to answer using the information most relevant to the user’s query.
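Conceptually, this selection is just a sort over similarity scores. Below is a minimal sketch, assuming the query embedding and the chunk embeddings are already available as NumPy arrays; in practice the vector store does this for us.
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 4) -> list[tuple[int, float]]:
    # cosine similarity between the query and every chunk embedding (one row per chunk)
    sims = chunk_embs @ query_emb / (np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb))
    # indices of the k highest scores, in descending order
    top_idx = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in top_idx]

# usage with random stand-in data
chunk_embs = np.random.rand(1000, 1536)  # 1000 chunks, 1536-dimensional embeddings
query_emb = np.random.rand(1536)
print(top_k_chunks(query_emb, chunk_embs, k=4))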
To speed up this process, Approximate Nearest Neighbor (ANN) search is often used. ANN finds vectors that are nearly the most similar, delivering results close to the true top-N but at a much faster rate than exact search methods. Of course, exact search is more accurate; however, it is also more computationally expensive and may not scale well in real-world applications, especially when dealing with massive datasets.
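For example, with FAISS (the same library used in the script below), an exact flat index can be swapped for an approximate inverted-file index. This is only a sketch with random stand-in data; the dimensionality, the number of clusters (nlist), and the nprobe value are arbitrary.
import numpy as np
import faiss

d = 1536                                          # embedding dimensionality (model-dependent)
xb = np.random.rand(10_000, d).astype("float32")  # stand-in for the chunk embeddings
xq = np.random.rand(1, d).astype("float32")       # stand-in for a query embedding

# exact search: compares the query against every stored vector
exact_index = faiss.IndexFlatL2(d)
exact_index.add(xb)
d_exact, i_exact = exact_index.search(xq, 4)

# approximate search: vectors are grouped into clusters, and only a few
# clusters (nprobe) are scanned per query
quantizer = faiss.IndexFlatL2(d)
ann_index = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters
ann_index.train(xb)
ann_index.add(xb)
ann_index.nprobe = 10
d_ann, i_ann = ann_index.search(xq, 4)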
On top of this, a threshold may be applied to the similarity scores to filter out chunks that don’t meet a minimum relevance score. For example, in some cases, a chunk might only be considered if its similarity score exceeds a certain threshold (e.g., cosine similarity > 0.3).
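Such a threshold is just an extra filter on the (index, score) pairs before they are handed to the generation step; the 0.3 cut-off below is an arbitrary example value.
# (chunk index, cosine similarity) pairs, e.g. as returned by top_k_chunks above
candidates = [(12, 0.62), (3, 0.41), (87, 0.28), (5, 0.12)]

MIN_SIMILARITY = 0.3  # arbitrary example threshold
filtered = [(idx, score) for idx, score in candidates if score > MIN_SIMILARITY]
print(filtered)  # [(12, 0.62), (3, 0.41)]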
So, who’s Anna Pávlovna?
In the ‘War and Peace’ example, as demonstrated in my previous post, we split the entire text into chunks and then create the respective embeddings for each chunk. Then, when the user submits a query, like ‘Who is Anna Pávlovna?’, we also create the respective embedding(s) for the user’s query text.
import os
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

api_key = 'your_api_key'

# initialize LLM
llm = ChatOpenAI(openai_api_key=api_key, model="gpt-4o-mini", temperature=0.3)

# initialize embeddings model
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

# load documents to be used for RAG
text_folder = "RAG files"
documents = []
for filename in os.listdir(text_folder):
    if filename.lower().endswith(".txt"):
        file_path = os.path.join(text_folder, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

# split documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.page_content)
    for chunk in chunks:
        split_docs.append(Document(page_content=chunk))
documents = split_docs

# create vector database with FAISS
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # get relevant documents
        relevant_docs = retriever.invoke(user_input)
        retrieved_context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # system prompt
        system_prompt = (
            "You are a helpful assistant. "
            "Use ONLY the following knowledge base context to answer the user. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{retrieved_context}"
        )

        # messages for the LLM
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input}
        ]

        # generate response
        response = llm.invoke(messages)
        assistant_message = response.content.strip()
        print(f"\nAssistant: {assistant_message}\n")

if __name__ == "__main__":
    main()
In this script, I used LangChain’s retriever object retriever = vector_store.as_retriever(), which by default uses cosine similarity to assess the relevance of the document embeddings to the user’s query, and by default retrieves the k=4 most similar documents. Thus, in essence, what we are doing there is retrieving the top k chunks most relevant to the user’s query based on cosine similarity.
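If we want a different number of chunks while still using the retriever interface, we can pass search_kwargs when creating it, reusing the vector_store defined in the script above, for example:
# retrieve the 2 most similar chunks instead of the default 4
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
relevant_docs = retriever.invoke("Who is Anna Pávlovna?")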
In any case, LangChain’s .as_retriever() method doesn’t let us display the cosine similarity values; we just get the top k relevant chunks. So, in order to inspect the cosine similarities, I’m going to adjust our script a little and use .similarity_search_with_score() instead of .as_retriever(). We can easily do that by adding the following part to our main() function:
# REMOVE THIS LINE
retriever = vector_store.as_retriever()

def main():
    print("Welcome to the RAG Assistant. Type 'exit' to quit.\n")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == "exit":
            print("Exiting…")
            break

        # ADD THIS SECTION
        # similarity search with scores
        results = vector_store.similarity_search_with_score(user_input, k=2)

        # print the retrieved documents and their cosine similarity scores
        print("\nCosine similarities for the top chunks:\n")
        for idx, (doc, sim_score) in enumerate(results):
            print(f"Chunk {idx + 1}:")
            print(f"Cosine similarity: {sim_score:.4f}")
            print(f"Content:\n{doc.page_content}\n")

        # CONTINUE WITH THE REST OF THE CODE...
        # build the context for the LLM from the retrieved documents
        retrieved_context = "\n\n".join([doc.page_content for doc, _ in results])
Notice how we can now explicitly define the number of retrieved chunks k, here set to k=2.
Finally, we can once again ask a question and receive an answer:

… but now we can also see the text chunks on which this answer is based, along with their respective cosine similarity scores…

Apparently, different parameters can lead to different answers. For instance, we get slightly different answers when retrieving the top k=2, k=4, or k=10 results. Taking into account the additional parameters used in the chunking step, like chunk size and chunk overlap, it becomes apparent that these parameters play a crucial role in getting good results from a RAG pipeline.
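An easy way to see this effect is to rerun the similarity search for several values of k and compare what comes back; this short loop reuses the vector_store from the script above.
query = "Who is Anna Pávlovna?"
for k in (2, 4, 10):
    results = vector_store.similarity_search_with_score(query, k=k)
    scores = [round(score, 4) for _, score in results]
    print(f"k={k}: {len(results)} chunks retrieved, scores: {scores}")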
• • •
Loved this post? Let’s be friends! Join me on:
📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!
• • •
What about pialgorithms?
Looking to bring the power of RAG into your organization?
pialgorithms can do it for you 👉 book a demo today!
