    Chunk Size as an Experimental Variable in RAG Systems

    By ProfitlyAI | December 31, 2025


    User: “What does the green highlighting mean in this document?”
    RAG system: “Green highlighted text is interpreted as configuration settings.”

    These are the kinds of answers we expect today from Retrieval-Augmented Generation (RAG) systems.

    Over the past few years, RAG has become one of the central architectural building blocks for knowledge-based language models: Instead of relying exclusively on the knowledge stored in the model, RAG systems combine language models with external document sources.

    The term was introduced by Lewis et al. and describes an approach that is widely used to reduce hallucinations, improve the traceability of answers, and enable language models to work with proprietary data.

    I wanted to understand why a system selects one specific answer instead of a very similar alternative. This decision is often made at the retrieval stage, long before an LLM comes into play.

    For that reason, I conducted three experiments in this article to investigate how different chunk sizes (80, 220, 500 characters) influence retrieval behavior.

    Table of Contents
    1 – Why Chunk Size Is More Than Just a Parameter
    2 – How Does Chunk Size Influence the Stability of Retrieval Results in Small RAG Systems?
    3 – Minimal RAG System Without Output Generation
    4 – Three Experiments: Chunk Size as a Variable
    5 – Final Thoughts

    1 – Why Chunk Size Is More Than Just a Parameter

    In a typical RAG pipeline, documents are first split into smaller text segments, embedded into vectors, and stored in an index. When a query is issued, semantically similar text segments are retrieved and then processed into an answer. This final step is usually carried out together with a language model.

    Typical components of a RAG system include:

    • Document preprocessing
    • Chunking
    • Embedding
    • Vector index
    • Retrieval logic
    • Optional: Generation of the output

    In this article, I focus on the retrieval step. This step depends on several parameters:

    • Choice of the embedding model:
      The embedding model determines how text is converted into numerical vectors. Different models capture meaning at different levels of granularity and are trained on different objectives. For example, lightweight sentence-transformer models are often sufficient for semantic search, while larger models may capture more nuance but come with higher computational cost.
    • Distance or similarity metric:
      The distance or similarity metric defines how the closeness between two vectors is measured. Common choices include cosine similarity, dot product, or Euclidean distance. For normalized embeddings, cosine similarity is often used.
    • Number of retrieved results (Top-k):
      The number of retrieved results specifies how many text segments are returned by the retrieval step. A small Top-k can miss relevant context, while a large Top-k increases recall but may introduce noise.
    • Overlap between text segments:
      Overlap defines how much text is shared between consecutive chunks. It is typically used to avoid losing important information at chunk boundaries. A small overlap reduces redundancy but risks cutting explanations in half, while a larger overlap increases robustness at the cost of storing and processing more similar chunks.
    • Chunk size:
      Describes the size of the text pieces that are extracted from a document and stored as individual vectors. Depending on the implementation, chunk size can be defined based on characters, words, or tokens. The size determines how much context a single vector represents.

    Small chunks contain very little context and are highly specific. Large chunks include more surrounding information, but at a much coarser level. As a result, chunk size determines which parts of the meaning are actually compared when a query is matched against a chunk.

    Chunk size implicitly reflects assumptions about how much context is required to capture meaning, how strongly information may be fragmented, and how clearly semantic similarity can be measured.
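
    To make these two parameters concrete, the sketch below shows what a simple character-based chunking step with overlap could look like. It is a minimal illustration under my own assumptions (the function name and exact splitting rule are mine), not necessarily the logic used in the repository linked later in this article.

    def chunk_text(text: str, chunk_size: int = 220, overlap: int = 40) -> list[str]:
        # Split a document into character-based chunks; consecutive chunks share
        # `overlap` characters so explanations at chunk boundaries are not lost.
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        chunks = []
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size]
            if piece.strip():
                chunks.append(piece)
        return chunks

    With chunk_size 80 and overlap 10, the same document yields many short, highly specific segments; with chunk_size 500 and overlap 50, it yields fewer but much more context-rich ones.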

    With this article, I wanted to explore exactly this through a small RAG system experiment and asked myself:

    How do different chunk sizes affect retrieval behavior?

    The focus is not on a system intended for production use. Instead, I wanted to learn how different chunk sizes affect the retrieval results.

    2 – How Does Chunk Size Influence the Stability of Retrieval Results in Small RAG Systems?

    I therefore asked myself the following questions:

    • How does chunk size change retrieval results in a small, controlled RAG system?
    • Which text segments make it to the top of the ranking when the queries are identical but the chunk sizes differ?

    To investigate this, I deliberately defined a simple setup in which all conditions (except chunk size) remain the same:

    • Three Markdown documents as the knowledge base
    • Three identical, fixed questions
    • The same embedding model for vectorizing the texts

    The text used in the three Markdown files is based on the documentation of a real application called OneLatex. To keep the experiment focused on retrieval behavior, the content was slightly simplified and reduced to the core explanations relevant for the questions.

    The three questions I used were:

    "Q1: What's the essential benefit of separating content material creation from formatting in OneLatex?"
    "Q2: How does OneLatex interpret textual content highlighted in inexperienced in OneNote?"
    "Q3: How does OneLatex interpret textual content highlighted in yellow in OneNote?"

    In addition, I deliberately omitted an LLM for output generation.

    The reason for this is simple: I did not want an LLM to turn incomplete or poorly matched text segments into a coherent answer. This makes it much clearer what actually happens in the retrieval step, how the parameters of the retrieval interact, and what role the sentence transformer plays.

    3 – Minimal RAG System Without Output Generation

    For the experiments, I therefore used a small RAG system with the following components: Markdown documents as the knowledge base, a simple chunking logic with overlap, a sentence transformer model to generate embeddings, and a ranking of text segments using cosine similarity.

    As the embedding model, I used all-MiniLM-L6-v2 from the Sentence-Transformers library. This model is lightweight and therefore well suited for running locally on a personal laptop (I ran it locally on my Lenovo laptop with 64 GB of RAM). The similarity between a query and a text segment is calculated using cosine similarity. Because the vectors are normalized, the dot product can be compared directly.
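
    As a rough sketch of this step (building on the chunking sketch from Section 1; the variable names are my own, only the model name comes from the article): with normalized embeddings, the plain dot product between two vectors is exactly their cosine similarity.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # `chunks` is the list of text segments produced by a chunking step as sketched earlier.
    # normalize_embeddings=True returns unit-length vectors, so the dot product
    # below is identical to the cosine similarity.
    chunk_vectors = model.encode(chunks, normalize_embeddings=True)
    query_vector = model.encode(
        "How does OneLatex interpret text highlighted in green in OneNote?",
        normalize_embeddings=True,
    )
    scores = chunk_vectors @ query_vector  # one similarity score per chunk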

    I deliberately kept the system small and therefore did not include any chat history, memory or agent logic, or LLM-based answer generation.

    As the “answer,” the system simply returns the highest-ranked text segment. This makes it much clearer which content is actually identified as relevant by the retrieval step.
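
    Continuing the sketch above, the retrieval step then reduces to ranking all chunks by similarity and returning the best ones. This is a simplified illustration, not the exact code from the repository.

    def retrieve(query: str, chunks: list[str], chunk_vectors: np.ndarray,
                 model: SentenceTransformer, top_k: int = 3) -> list[tuple[str, float]]:
        # Rank all chunks by cosine similarity to the query and return
        # the top_k (chunk text, score) pairs, best first.
        query_vector = model.encode(query, normalize_embeddings=True)
        scores = chunk_vectors @ query_vector
        best = np.argsort(scores)[::-1][:top_k]
        return [(chunks[i], float(scores[i])) for i in best]

    # The "answer" of this minimal system is simply the Top-1 chunk:
    results = retrieve("How does OneLatex interpret text highlighted in green in OneNote?",
                       chunks, chunk_vectors, model)
    answer, top_score = results[0]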

    The full code for the mini RAG system can be found in my GitHub repository:

    → 🤓 Find the full code in the GitHub Repo 🤓 ←

    4 – Three Experiments: Chunk Size as a Variable

    For the analysis, I ran the three commands below via the command line:

    # Experiment 1 - Baseline
    python main.py --chunk-size 220 --overlap 40 --top-k 3

    # Experiment 2 - Small chunk size
    python main.py --chunk-size 80 --overlap 10 --top-k 3

    # Experiment 3 - Large chunk size
    python main.py --chunk-size 500 --overlap 50 --top-k 3
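
    The full script is linked in the GitHub repository above; purely as an illustration of how such a command-line interface could be wired up, here is a sketch in which only the flag names and default values follow the commands above, and everything else is my assumption.

    import argparse

    def parse_args() -> argparse.Namespace:
        # Flags mirror the commands above; defaults correspond to the baseline run.
        parser = argparse.ArgumentParser(description="Minimal RAG retrieval experiment")
        parser.add_argument("--chunk-size", type=int, default=220,
                            help="chunk size in characters")
        parser.add_argument("--overlap", type=int, default=40,
                            help="character overlap between consecutive chunks")
        parser.add_argument("--top-k", type=int, default=3,
                            help="number of retrieved chunks per query")
        return parser.parse_args()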

    The setup from Section 3 stays exactly the same: The same three documents, the same three questions, and the same embedding model.

    Chunk size defines the number of characters per text segment. In addition, I used an overlap in each experiment to reduce information loss at chunk boundaries. For each experiment, I computed the semantic similarity scores between the query and all chunks and ranked the highest-scoring segments.

    Small Chunks (80 Characters) – Lack of Context

    With very small chunks (chunk-size 80), a strong fragmentation of the content becomes apparent: Individual text segments often contain only sentence fragments or isolated statements without sufficient context. Explanations are split across multiple chunks, so that individual segments contain only parts of the original content.

    Formally, the retrieval still works correctly: Semantically similar fragments are found and ranked highly.

    However, when we look at the actual content, we see that the results are hardly usable:

    Screenshot taken by the author.

    The returned chunks are thematically related, but they do not provide a self-contained answer. The system roughly recognizes what the topic is about, but it breaks the content down so strongly that the individual results do not say much on their own.

    Medium Chunks (220 Characters) – Apparent Stability

    With the medium chunks (chunk-size 220), the results already improved clearly. Most of the returned text segments contained complete explanations and were plausible in terms of content. At first glance, the retrieval seemed stable and reliable: It usually returned exactly the information one would expect.

    However, a concrete problem became apparent when distinguishing between green and yellow highlighted text. Regardless of whether I asked about the meaning of the green or the yellow highlighting, the system returned the chunk about the yellow highlighting as the top result in both cases. The correct chunk was present, but it was not selected as Top-1.

    Shows the results of the retrieval experiment with chunk size 220.
    Screenshot taken by the author.

    The reason lies in the very similar similarity scores of the two top results:

    • Score for Top-1: 0.873
    • Score for Top-2: 0.774

    The system can hardly distinguish between the two candidates semantically and ultimately selects the chunk with the slightly higher score.

    The problem? It does not match the question in terms of content and is simply incorrect.

    For us as humans, this is very easy to recognize. For a sentence transformer like all-MiniLM-L6-v2, it appears to be a challenge.

    What matters here is this: If we only look at the Top-1 result, this error remains invisible. Only by comparing the scores can we see that the system is uncertain in this situation. Since it is forced to make a clear decision in our setup, it returns the Top-1 chunk as the answer.

    Large Chunks (500 Characters) – Strong Contexts

    With the larger chunks (chunk-size 500), the text segments contain far more coherent context. There is also hardly any fragmentation anymore: Explanations are no longer split across multiple chunks.

    And indeed, the error in distinguishing between green and yellow no longer occurs. The questions about green and yellow highlighting are now correctly distinguished, and the respective matching chunk is clearly ranked as the top result. We can also see that the similarity scores of the relevant chunks are now more clearly separated.

    Shows the result of the retrieval experiment with chunk size 500.
    Screenshot taken by the author.

    This makes the ranking more stable and easier to understand. The downside of this setting, however, is the coarser granularity: Individual chunks contain more information and are less finely tailored to specific aspects.

    In our setup with three Markdown files, where the content is already thematically well separated, this downside hardly plays a role. With differently structured documentation, such as long continuous texts with multiple topics per section, an excessively large chunk size could lead to irrelevant information being retrieved along with relevant content.


    On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech, made for curious minds like yours.

    Take a look and subscribe on Medium or on Substack if you want to stay in the loop.


    5 – Final Thoughts

    The results of the three very simple experiments can be traced back to how retrieval works. Each chunk is represented as a vector, and its proximity to the query is calculated using cosine similarity. The resulting score indicates how similar the question and the text segment are in the semantic space.

    What is important here is that the score is not a measure of correctness. It is a measure of relative comparison across the available chunks for a given question in a single run.

    When several segments are semantically very similar, even minimal differences in the scores can determine which chunk is returned as Top-1. One example of this was the incorrect distinction between green and yellow at the medium chunk size.

    One possible extension would be to allow the system to explicitly signal uncertainty. If the scores of the Top-1 and Top-2 chunks are very close, the system could return an "I don't know" or "I'm not sure" response instead of forcing a decision.
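
    A minimal sketch of what such a check could look like, building on the retrieve() sketch from Section 3; the margin of 0.05 is an arbitrary illustrative value, not something derived from the experiments.

    def answer_or_abstain(results: list[tuple[str, float]], margin: float = 0.05) -> str:
        # `results` holds (chunk text, score) pairs sorted best first, as returned
        # by retrieve(). If Top-1 and Top-2 are nearly tied, abstain instead of guessing.
        if len(results) > 1 and (results[0][1] - results[1][1]) < margin:
            return "I'm not sure - the top candidates are almost equally similar."
        return results[0][0]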

    Based on this small RAG system experiment, it is not really possible to derive a "best chunk size" conclusion.

    But what we can observe instead is the following:

    • Small chunks lead to high variance: Retrieval reacts very precisely to individual terms but quickly loses the overall context.
    • Medium-sized chunks: Appear stable at first glance, but can create dangerous ambiguities when several candidates are scored almost equally.
    • Large chunks: Provide more robust context and clearer rankings, but they are coarser and less precisely tailored.

    Chunk size therefore determines how sharply retrieval can distinguish between similar pieces of content.

    In this small setup, this did not play a major role. However, when we think about larger RAG systems in production environments, this kind of retrieval instability could become a real problem: As the number of documents grows, the number of semantically similar chunks increases as well. This means that many situations with very small score differences are likely to occur. I can also imagine that such effects are often masked by downstream language models, when an LLM turns incomplete or only partially matching text segments into plausible answers.

    Where Can You Continue Reading?



