with a vast corpus of text data, where they, during their pre-training stage, essentially consume the entire internet. LLMs thrive when they have access to all the relevant data needed to answer user questions correctly. However, in many cases, we limit the capabilities of our LLMs by not providing them with enough data. In this article, I'll discuss why you should care about feeding your LLM more data, how to fetch that data, and specific applications where it helps.
I'll also start with a new feature in my articles: writing out my main goal, what I want to achieve with the article, and what you should know after reading it. If it works well, I'll start including it in each of my articles:
My goal for this article is to highlight the importance of providing LLMs with relevant data, and to show how you can feed that data into your LLMs for improved performance.
You can also read my articles on How to Analyze and Optimize Your LLMs in 3 Steps and Document QA using Multimodal LLMs.
Table of contents
Why add more data to LLMs?
I'll start the article by pointing out why this matters. LLMs are extremely data hungry, meaning they require a lot of data to work well. This is most visible in the pre-training corpus of LLMs, which consists of trillions of text tokens used to train the model.
However, the principle of using a lot of data also applies to LLMs at inference time (when you use the LLM in production). You need to provide the LLM with all the data necessary to answer a user's request.
In a lot of cases, you inadvertently reduce the LLM's performance by not providing relevant information.
For example, suppose you create a question answering system where users can upload files and talk to them. Naturally, you provide the text contents of each file so that the user can chat with the document; however, you might, for example, forget to add the filenames of the documents to the context the user is chatting with. This will hurt the LLM's performance whenever some information is only present in the filename, or the user references the filename in the chat. Some other LLM applications where extra data is helpful are:
- Classification
- Information extraction
- Keyword search for finding relevant documents to feed to the LLM
In the rest of the article, I'll discuss where you can find such data, techniques for retrieving more data, and some specific use cases for it.
In this section, I'll discuss data you likely already have available in your application. One example is my earlier analogy, where you have a question answering system for files but forget to add the filename to the context. Some other examples are:
- File extensions (.pdf, .docx, .xlsx)
- Folder path (if the user uploaded a folder)
- Timestamps (required, for example, if a user asks about the most recent document)
- Page numbers (the user might ask the LLM to fetch specific information located on page 5)

There are plenty of other such examples of data you likely already have available, or that you can quickly fetch and add to your LLM's context.
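To make this concrete, here is a minimal sketch of how readily available metadata could be prepended to a document's text before it enters the LLM's context. The Document dataclass and the build_context helper are illustrative assumptions for this example, not part of any specific framework:

from dataclasses import dataclass

@dataclass
class Document:
    filename: str       # e.g. "Q3_sales_report.pdf"
    folder_path: str    # e.g. "reports/2024/"
    uploaded_at: str    # ISO timestamp of the upload
    page_texts: list    # text content per page

def build_context(doc: Document) -> str:
    # Prepend metadata the application already has, so the LLM can answer
    # questions that reference the filename, folder, or upload time
    header = (
        f"Filename: {doc.filename}\n"
        f"Folder: {doc.folder_path}\n"
        f"Uploaded: {doc.uploaded_at}\n"
    )
    # Include page numbers so the LLM can answer "what is on page 5?"
    pages = "\n".join(
        f"[Page {i + 1}]\n{text}" for i, text in enumerate(doc.page_texts)
    )
    return header + "\n" + pages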
The type of data you have available will vary widely from application to application. A lot of the examples I've provided in this article are tailored to text-based AI, since that's the space I spend the most time in. However, if you work more on, for example, visual or audio-based AI, I urge you to find similar examples in your own space.
For visual AI, it could be:
- Location data for where the image/video was taken
- The filename of the image/video file
- The author of the image/video file
Or for audio AI, it could be:
- Metadata about who is speaking when
- Timestamps for each sentence
- Location data for where the audio was recorded
My point being: there's a plethora of data available out there; all you need to do is look for it and consider how it can be useful to your application.
Sometimes, the data you already have available is not enough. You want to provide your LLM with even more data to help it answer questions correctly. In that case, you need to retrieve additional data. Naturally, since we're in the age of LLMs, we'll use LLMs to fetch this data.
Retrieving information beforehand
The simplest approach is to retrieve extra data by fetching it before processing any live requests. For document AI, this means extracting specific information from documents while they are being processed. You might extract the type of document (legal document, tax document, or sales brochure) or specific information contained in the document (dates, names, locations, …).
The advantages of fetching the information beforehand are:
- Speed (in production, you only need to fetch the value from your database)
- You can take advantage of batch processing to reduce costs
Today, fetching this kind of information is relatively straightforward. You set up an LLM with a specific system prompt for extracting information, and feed the prompt along with the text into the LLM. The LLM will then process the text and extract the relevant information for you. You might want to evaluate the performance of your information extraction, in which case you can read my article on Evaluating 5 Million LLM Requests with Automated Evals.
You likely also want to map out all the information points to retrieve, for example:
Once you have created this list, you can retrieve all your metadata and store it in a database.
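As a rough sketch of what this upfront extraction step could look like: the list of data points, the db client, and the call_llm helper below are all assumptions for illustration (call_llm stands in for whatever function sends a prompt to your LLM and returns its text response):

import json

# Hypothetical list of data points you decided to extract for every document
DATA_POINTS = ["document_type", "dates_mentioned", "people_mentioned"]

EXTRACTION_PROMPT = (
    "Extract the following fields from the document below and "
    "return them as a single JSON object: " + ", ".join(DATA_POINTS)
)

def extract_metadata(document_text: str) -> dict:
    # call_llm is assumed to send the prompt to your LLM and return its text response
    response = call_llm(EXTRACTION_PROMPT + "\n\nDocument:\n" + document_text)
    return json.loads(response)

def process_document(doc_id: str, document_text: str) -> None:
    metadata = extract_metadata(document_text)
    # Store the metadata so live requests only need a database lookup
    db.save(doc_id, metadata)  # db is a placeholder for your database client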
However, the main downside of fetching information beforehand is that you have to predetermine which information to extract. That's difficult in a lot of scenarios, in which case you can do live information retrieval, which I cover in the next section.
On-demand information retrieval
When you can't determine beforehand which information to retrieve, you can fetch it on demand. This means setting up a generic function that takes in a data point to extract and the text to extract it from. For example:
import json

def retrieve_info(data_point: str, text: str) -> dict:
    # call_llm is assumed to be a helper that sends the prompt to your
    # LLM of choice and returns its raw text response
    prompt = f"""
    Extract the following data point from the text below and return it in a JSON object.
    Data Point: {data_point}
    Text: {text}
    Example JSON Output: {{"result": "example value"}}
    """
    return json.loads(call_llm(prompt))
You define this function as a tool your LLM has access to, which it can call whenever it needs information. This is essentially how Anthropic has set up their deep research system, where they create one orchestrator agent that can spawn sub-agents to fetch more information. Note that giving your LLM access to additional prompts can lead to a lot of token usage, so you should keep an eye on your LLM's token spend.
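How you expose retrieve_info as a tool depends on your LLM provider, but most function-calling APIs accept a JSON-schema-style tool description roughly like the sketch below; the dispatch function is a simplified assumption of how an agent loop might invoke it, not any provider's actual API:

# A JSON-schema-style tool description; the exact wrapper format
# varies between LLM providers, so treat this as an illustration
RETRIEVE_INFO_TOOL = {
    "name": "retrieve_info",
    "description": "Extract a single data point from a piece of text.",
    "parameters": {
        "type": "object",
        "properties": {
            "data_point": {"type": "string", "description": "The data point to extract"},
            "text": {"type": "string", "description": "The text to extract it from"},
        },
        "required": ["data_point", "text"],
    },
}

def handle_tool_call(tool_name: str, arguments: dict):
    # Simplified dispatch: the agent loop calls this whenever the LLM requests
    # a tool, and feeds the result back into the conversation
    if tool_name == "retrieve_info":
        return retrieve_info(arguments["data_point"], arguments["text"])
    raise ValueError(f"Unknown tool: {tool_name}")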
So far, I've discussed why you should use more data and how to get hold of it. However, to round off the article, I'll also show specific applications where this data improves LLM performance.
Metadata filtering search

My first example is that you can perform a search with metadata filtering. Providing information such as:
- File type (pdf, xlsx, docx, …)
- File size
- Filename
can help your application when fetching relevant information. This could, for example, be information fetched to feed into your LLM's context, like when performing RAG. You can use the additional metadata to filter out irrelevant files.
A user might have asked a question pertaining only to Excel documents. Using RAG to fetch chunks from files other than Excel documents is, therefore, a poor use of the LLM's context window. You should instead filter the available chunks down to Excel documents only, and use chunks from those documents to best answer the user's query. You can learn more about handling LLM contexts in my article on building effective AI agents.
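A minimal sketch of what that filtering step could look like, assuming each chunk carries the metadata of its source file; vector_search is a hypothetical retriever standing in for your actual vector store:

def filter_chunks_by_file_type(chunks: list[dict], file_type: str) -> list[dict]:
    # Each chunk is assumed to carry the metadata of its source file,
    # e.g. {"text": "...", "metadata": {"file_type": "xlsx", "filename": "..."}}
    return [c for c in chunks if c["metadata"]["file_type"] == file_type]

def answer_excel_question(query: str, all_chunks: list[dict]) -> str:
    # Only spend context window on chunks from Excel documents
    excel_chunks = filter_chunks_by_file_type(all_chunks, "xlsx")
    relevant = vector_search(query, excel_chunks, top_k=5)  # hypothetical retriever
    context = "\n\n".join(c["text"] for c in relevant)
    return call_llm(
        f"Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )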
AI agent web search
Another example is when you ask your AI agent questions about recent events that occurred after the LLM's pre-training cutoff. LLMs typically have a training data cutoff for their pre-training data, because the data needs to be carefully curated, and keeping it fully up to date is hard.
This presents a problem when users ask questions about recent history, for example, recent events in the news. In this case, the AI agent answering the query needs access to a web search (essentially performing information extraction on the internet). This is an example of on-demand information extraction.
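A web search tool can follow the same pattern as the retrieve_info tool above; search_web below is a placeholder for whichever search API you integrate, not a real library call:

def web_search(query: str) -> str:
    # Placeholder: call your search provider of choice and return the top
    # results as plain text for the agent to read and cite in its answer
    results = search_web(query, num_results=5)
    return "\n\n".join(r["title"] + "\n" + r["snippet"] for r in results)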
Conclusion
In this article, I've discussed how to significantly enhance your LLM by providing it with more data. You can either find this data in your existing metadata (filenames, file size, location data), or you can retrieve it through information extraction (document type, names mentioned in a document, and so on). This information is often essential to an LLM's ability to answer user queries successfully, and in many instances, the lack of it practically guarantees that the LLM will fail to answer a question correctly.
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium
