Vision language models (VLMs) are powerful models that take images as input, instead of text like traditional LLMs. This opens up a lot of possibilities, since we can directly process the contents of a document, instead of using OCR to extract text and then feeding that text into an LLM.
In this article, I'll discuss how you can apply VLMs to long-context document understanding tasks. This means applying VLMs to either very long documents of over 100 pages, or very dense documents that contain a lot of information, such as drawings. I'll discuss what to consider when applying VLMs, and which kinds of tasks you can perform with them.
Why do we need VLMs?
I've discussed VLMs a lot in my previous articles, and covered why they're so important for understanding the contents of some documents. The main reason VLMs are required is that a lot of the information in documents requires visual input to understand.
The alternative to VLMs is to use OCR and then an LLM. The problem here is that you're only extracting the text from the document, and not including the visual information, such as:
- Where different text is located relative to other text
- Non-text information (essentially everything that isn't a letter, such as symbols or drawings)
- Where text is located relative to other information
This information is often essential to truly understand the document, so you're usually better off using VLMs directly: you feed in the image itself, and the model can therefore also interpret the visual information.
For long documents, using VLMs is a challenge, since you need a lot of tokens to represent visual information. Processing hundreds of pages is thus a big challenge. However, with a lot of recent developments in VLM technology, the models have become better and better at compressing visual information into reasonable context lengths, making it feasible to apply VLMs to long documents for document understanding tasks.

OCR using VLMs
One good option for processing long documents while still including the visual information is to use VLMs to perform OCR. Traditional OCR engines like Tesseract only extract the text directly from documents, together with the bounding box of the text. However, VLMs are also trained to perform OCR, and can perform more advanced text extraction (a minimal prompting sketch follows the list below), such as:
- Extracting Markdown
- Explaining purely visual information (e.g. if there's a drawing, describe the drawing in text)
- Adding missing information (e.g. if there's a box saying Date with a blank space after it, you can tell the OCR to extract Date <empty>)
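Here is a minimal sketch of such a prompt, assuming an OpenAI-compatible API; the model name, prompt wording, and file names are placeholders rather than a specific recommendation:

```python
import base64
from openai import OpenAI

# Any OpenAI-compatible VLM endpoint works here; the model name is a placeholder.
client = OpenAI()

OCR_PROMPT = (
    "Extract all text on this page as Markdown, preserving headers, tables and bold text. "
    "If the page contains drawings or pictures, describe them inside <image>...</image> tags. "
    "If a labeled field is blank, output the label followed by <empty>."
)

def ocr_page(image_path: str, model: str = "gpt-5") -> str:
    """Send one page image to the VLM and return its Markdown OCR output."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(ocr_page("page_1.png"))
```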
Recently, DeepSeek released a powerful VLM-based OCR model, which has received a lot of attention and traction lately, making VLMs for OCR more popular.
Markdown
Markdown is very powerful, since you extract formatted text. This allows the model to:
- Show headers and subheaders
- Represent tables accurately
- Mark bold text
This allows the model to extract more representative text, which more accurately depicts the text contents of the document. If you then apply LLMs to this text, they will perform far better than if you applied them to plain text extracted with traditional OCR.
LLMs perform better on formatted text like Markdown than on plain text extracted using traditional OCR.
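For example, a small table that traditional OCR would flatten into a loose stream of words can be preserved by VLM OCR as an actual Markdown table (illustrative output):
| Item | Amount |
| --- | --- |
| Hosting | 120 |
| Support | 80 |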
Explain visual information
Another thing you can use VLM OCR for is explaining visual information. For example, if you have a drawing with no text in it, traditional OCR wouldn't extract any information, since it's only trained to extract text characters. However, you can use VLMs to explain the visual contents of the image.
Imagine you have the following document:
This is the introduction text of the document
<image showing the Eiffel Tower>
This is the conclusion of the document
If you applied traditional OCR like Tesseract, you would get the following output:
This is the introduction text of the document
This is the conclusion of the document
This is clearly an issue, since you're not including any information about the image showing the Eiffel Tower. Instead, you should use a VLM, which could output something like:
This is the introduction text of the document
<image>
This image depicts the Eiffel Tower during the day
</image>
This is the conclusion of the document
If you used an LLM on the first text, it of course wouldn't know the document contains an image of the Eiffel Tower. However, if you used an LLM on the second text, extracted with a VLM, the LLM would naturally be better at answering questions about the document.
Add missing information
You can also prompt VLMs to output content where information is missing. To understand this concept, look at the image below:

If you applied traditional OCR to this image, you would get:
Address Road 1
Date
Company Google
However, it would be more representative if you used a VLM, which, if instructed, could output:
Address Road 1
Date <empty>
Company Google
This is more informative, because we're telling any downstream model that the date field is empty. If we don't provide this information, it's impossible to know later whether the date is simply missing, the OCR wasn't able to extract it, or something else went wrong.
However, OCR using VLMs still suffers from some of the issues that traditional OCR struggles with, because you're not processing the visual information directly. You've probably heard the saying that an image is worth a thousand words, which often holds true for processing visual information in documents. Yes, you can produce a text description of a drawing with a VLM used for OCR, but this description will never be as descriptive as the drawing itself. Thus, I argue that in a lot of cases you're better off processing the documents directly with VLMs, as I'll cover in the following sections.
Open-source vs closed-source models
There are a lot of VLMs available. I follow the HuggingFace VLM leaderboard to stay aware of any new high-performing models. According to this leaderboard, you should go for either Gemini 2.5 Pro or GPT-5 if you want to use closed-source models through an API. From my experience, these are great options that work well for long document understanding and handling complex documents.
However, you might also want to use open-source models, for reasons of privacy, cost, or to have more control over your own application. In this case, SenseNova-V6-5-Pro tops the leaderboard. I haven't tried this model personally, but I've used Qwen 3 VL a lot, and I've had good experience with it. Qwen has also released a cookbook specifically for long document understanding.
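If you host an open-source VLM yourself behind an OpenAI-compatible endpoint (for example with vLLM), the client code from the OCR sketch earlier carries over unchanged. A minimal sketch, where the local URL and the model name are assumptions about your own deployment:

```python
from openai import OpenAI

# Point the same OpenAI-compatible client at a locally hosted open-source VLM
# (for example one served with vLLM). The base_url and model name are
# assumptions about your own deployment; the image-message format from the
# OCR sketch above works against this client in the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"

# Quick smoke test that the endpoint responds before sending document pages.
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Reply with OK if you received this."}],
)
print(response.choices[0].message.content)
```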
VLMs on long documents
In this section I'll talk about applying VLMs to long documents, and the considerations you should make when doing so.
Processing power considerations
If you're running an open-source model, one of your main considerations is how large a model you can run, and how long it takes. You're relying on access to a larger GPU, at least an A100 in most cases. Luckily this is widely available, and relatively cheap (typically costing 1.5 to 2 USD per hour at a lot of cloud providers now). However, you should also consider the latency you can accept. Running VLMs requires a lot of processing, and you should consider the following factors:
- How long you can spend processing one request
- What image resolution you need
- How many pages you need to process
If you have a live chat, for example, you need fast processing; however, if you're simply processing in the background, you can allow for longer processing times.
Image resolution is also an important consideration. If you need to be able to read the text in documents, you need high-resolution images, typically over 2048×2048, though it naturally depends on the document. Detailed drawings with small text in them, for example, require even higher resolution. Increasing resolution greatly increases processing time, so you should aim for the lowest possible resolution that still allows you to perform all the tasks you want to perform. The number of pages is a similar consideration. Adding more pages is often necessary to have access to all the information in a document. However, the most important information is often contained early in the document, so you may get away with only processing the first 10 pages, for example.
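Here is a minimal sketch of controlling both knobs when rendering a PDF to page images, assuming pdf2image (which requires poppler installed); the DPI and page limit below are arbitrary starting points, not recommendations:

```python
from pdf2image import convert_from_path

# Render only the first pages of the PDF at a chosen resolution.
# 200 DPI and 10 pages are assumptions; tune both to the lowest values
# that still let the VLM read the text you care about.
MAX_PAGES = 10
DPI = 200

pages = convert_from_path("document.pdf", dpi=DPI, first_page=1, last_page=MAX_PAGES)

for i, page in enumerate(pages, start=1):
    page.save(f"page_{i}.png")
    print(f"page {i}: {page.width}x{page.height} pixels")
```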
Answer-dependent processing
Something you can try in order to lower the required processing power is to start off simple, and only advance to heavier processing if you don't get the desired answers.
For example, you might start off only looking at the first 10 pages, and see if you're able to properly solve the task at hand, such as extracting a piece of information from a document. Only if we're not able to extract that piece of data do we start looking at more pages. You can apply the same concept to the resolution of your images: start with lower-resolution images, and move to higher resolution if required.
This kind of hierarchical processing reduces the required processing power, since most tasks can be solved by only looking at the first 10 pages, or by using lower-resolution images. Then, only if necessary, do we move on to processing more pages, or higher-resolution images.
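A minimal sketch of this escalation loop, assuming the pdf2image rendering shown earlier and a hypothetical ask_vlm(images, question) helper that returns None when the model cannot find the answer:

```python
from pdf2image import convert_from_path

def ask_vlm(images, question):
    """Placeholder for a real VLM call (e.g. the chat client shown earlier).
    Assumed to return the answer string, or None if the model cannot find it."""
    raise NotImplementedError

def answer_with_escalation(pdf_path: str, question: str):
    # Try cheap settings first, and escalate only when no answer is found.
    attempts = [
        {"pages": 10, "dpi": 150},   # cheap: few pages, low resolution
        {"pages": 10, "dpi": 300},   # same pages, higher resolution
        {"pages": 50, "dpi": 300},   # expensive: more pages, high resolution
    ]
    for attempt in attempts:
        images = convert_from_path(
            pdf_path, dpi=attempt["dpi"], first_page=1, last_page=attempt["pages"]
        )
        answer = ask_vlm(images, question)
        if answer is not None:
            return answer
    return None  # nothing found even at the most expensive setting
```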
Cost
Cost is an important consideration when using VLMs. I've processed a lot of documents, and I typically see around a 10x increase in the number of tokens when using images (VLMs) instead of text (LLMs). Since input tokens are typically the main driver of costs in long document tasks, using VLMs usually increases cost significantly. Note that for OCR, the point about input tokens dominating output tokens doesn't apply, since OCR naturally produces a lot of output tokens when outputting all the text in the images.
Thus, when using VLMs, it's extremely important to maximize your use of cached tokens, a topic I discussed in my recent article about optimizing LLMs for cost and latency.
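As a rough back-of-the-envelope sketch of why the 10x matters (every number below is an assumption for illustration; actual tokens per page and prices vary a lot between models and providers):

```python
# Rough cost estimate for one long-document request.
# All numbers are illustrative assumptions; check your model's actual
# image tokenization and your provider's pricing.
pages = 100
tokens_per_page_text = 800        # assumed tokens per page with OCR text + LLM
tokens_per_page_image = 8_000     # assumed ~10x more tokens per page as an image
price_per_million_input = 2.0     # assumed USD per 1M input tokens

text_cost = pages * tokens_per_page_text / 1e6 * price_per_million_input
image_cost = pages * tokens_per_page_image / 1e6 * price_per_million_input

print(f"text input:  {pages * tokens_per_page_text:,} tokens ~ ${text_cost:.2f}")
print(f"image input: {pages * tokens_per_page_image:,} tokens ~ ${image_cost:.2f}")
```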
Conclusion
In this article I discussed how you can apply vision language models (VLMs) to long documents, to handle complex document understanding tasks. I discussed why VLMs are so important, and approaches to using VLMs on long documents. You can, for example, use VLMs for more advanced OCR, or directly apply VLMs to long documents, though with precautions regarding required processing power, cost, and latency. I think VLMs are becoming more and more important, as highlighted by the recent release of DeepSeek OCR. I thus think VLMs for document understanding is a topic you should get involved with, and you should learn how to use VLMs for document processing applications.
👉 Find me on socials:
🧑‍💻 Get in touch
✍️ Medium
You can also read my other articles:
