    Using Vision Language Models to Process Millions of Documents

By ProfitlyAI · September 26, 2025 · 10 Mins Read


Vision language models (VLMs) are highly effective machine-learning models that can process both visual and textual information. With the recent release of Qwen 3 VL, I want to take a deep dive into how you can use these powerful VLMs to process documents.


Why you need to use VLMs

To highlight why some tasks require VLMs, I want to start off with an example task, where we need to interpret both the text itself and the visual placement of that text.

Imagine you look at the image below. The checkboxes represent whether a document should be included in a report or not, and you now need to determine which documents to include.

This figure highlights a suitable problem for VLMs. You have an image containing text about documents, together with checkboxes. You now need to determine which documents have been checked off. This is difficult to solve with LLMs, since you first need to apply OCR to the image. The text then loses its visual position, which is required to properly solve the task. With VLMs, you can both read the text in the document and use its visual position (whether the text is above a checked checkbox or not), and successfully solve the task. Image by the author.

For a human, this is a simple task; clearly, documents 1 and 3 should be included, while document 2 should be excluded. However, if you tried to solve this problem with a pure LLM, you would run into issues.

To run a pure LLM, you would first have to OCR the image. The OCR output would look something like the snippet below if you use Google's Tesseract, for example, which extracts the text line by line.

    Doc 1  Doc 2  Doc 3  X   X 
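For reference, a minimal sketch of producing such line-by-line output with Tesseract (via the pytesseract wrapper; the filename checkboxes.png is a hypothetical name for the example image) could look like this:

# Minimal sketch: OCR an image line by line with Tesseract via pytesseract.
# Assumes pytesseract, Pillow, and the Tesseract binary are installed.
from PIL import Image
import pytesseract

image = Image.open("checkboxes.png")  # hypothetical filename for the example image
text = pytesseract.image_to_string(image)
print(text)  # extracted lines of text; the 2D layout (which X sits under which document) is lost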

As you might have already discovered, the LLM will have issues deciding which documents to include, because it is impossible to know which documents the Xs belong to. This is just one of many scenarios where VLMs are extremely effective at solving a problem.

The main point here is that understanding which documents have a checked X requires both visual and textual information. You have to know the text and the visual position of that text in the image. I summarize this in the quote below:

VLMs are required when the meaning of text depends on its visual position

Application areas

There are a plethora of areas you can apply VLMs to. In this section, I'll cover some different areas where VLMs have proven useful, and where I have also successfully applied them.

Agentic use cases

Agents are all the rage these days, and VLMs also play a role here. I'll highlight two main areas where VLMs can be used in an agentic context, though there are of course many other such areas.

Computer use

Computer use is an interesting use case for VLMs. By computer use, I mean feeding a VLM a frame from your computer screen and having it decide which action to take next. One example of this is OpenAI's Operator. The frame could, for example, be of the article you are reading right now, with the next action being to scroll down to read more.

VLMs are useful for computer use because LLMs alone are not enough to decide which actions to take. When operating a computer, you often have to interpret the visual position of buttons and information, which, as I described at the start, is one of the prime areas of use for VLMs.

    Debugging

Debugging code is also a very useful agentic application area for VLMs. Imagine that you are developing a web application and discover a bug.

One option is to start logging to the console, copy the logs, describe to Cursor what you did, and prompt Cursor to fix it. This is naturally time-consuming, since it requires a lot of manual steps from the user.

Another option is thus to use VLMs to solve the problem more efficiently. Ideally, you describe how to reproduce the issue, and a VLM can go into your application, recreate the flow, look at the issue, and thus debug what is going wrong. Applications are being built for areas like this, though most have not come far in development from what I've seen.

Question answering

Using VLMs for visual question answering is one of the fundamental ways of using VLMs. Question answering is the use case I described earlier in this article about figuring out which checkbox belongs to which document. You feed the VLM a user question and an image (or several images) for the VLM to process. The VLM then provides an answer in text format. You can see how this process works in the figure below.

This figure highlights a question answering task where I have applied a VLM to solve the problem. You feed in the image containing the problem and the question containing the task to solve. The VLM then processes this information and outputs the expected information. Image by the author.
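To make this concrete, below is a minimal sketch of visual question answering through an OpenAI-compatible chat API. The model name, endpoint, and image filename are placeholders I chose for illustration; many hosted VLMs, as well as self-hosted Qwen VL models served behind an OpenAI-compatible endpoint, expose this same interface.

# Minimal sketch: visual question answering via an OpenAI-compatible chat API.
# The model name, base URL, and image filename are placeholders for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1", api_key="...") for a self-hosted VLM

with open("checkboxes.png", "rb") as f:  # hypothetical screenshot from the checkbox example
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # swap in whichever VLM your endpoint serves
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which documents have a checked checkbox?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)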

You should, however, weigh the trade-offs of using VLMs vs LLMs. Naturally, when a task requires textual and visual information, you need to use a VLM to get a proper result. However, VLMs are also usually much more expensive to run, as they have to process more tokens. This is because images contain a lot of information, which leads to many input tokens to process.

Furthermore, if the VLM is to process text, you also need high-resolution images, allowing the VLM to interpret the pixels making up the letters. With lower resolutions, the VLM struggles to read the text in the images, and you'll receive low-quality results.

    Classification

This figure covers how you can apply VLMs to classification tasks. You feed the VLM an image of a document and a question asking it to classify the document into one of a predefined set of categories. These categories should be included in the question, but are not included in the figure due to space limitations. The VLM then outputs the predicted classification label. Image by the author.

Another interesting application area for VLMs is classification. By classification, I refer to the scenario where you have a predetermined set of categories and want to determine which category an image belongs to.

You can use VLMs for classification with the same approach as with LLMs. You create a structured prompt containing all relevant information, including the possible output categories. Furthermore, you should ideally cover the different edge cases, for example, scenarios where two categories are both very likely and the VLM has to decide between the two.

You can, for example, use a prompt such as:

def get_prompt():
    return """
        ## General instructions
        You should determine which category a given document belongs to.
        The available categories are "legal", "technical", "financial".

        ## Edge case handling
        - In the scenario where you have a legal document covering financial information, the document belongs to the financial category
        - ...

        ## Return format
        Respond only with the corresponding category, and no other text
    """
    

Information extraction

You can also effectively use VLMs for information extraction, and there are plenty of information extraction tasks that require visual information. You create a prompt similar to the classification prompt above, and typically prompt the VLM to answer in a structured format, such as a JSON object.

When performing information extraction, you need to consider how many data points you want to extract. For example, if you need to extract 20 different data points from a document, you probably don't want to extract all of them at once, because the model will likely struggle to accurately extract that much information in a single pass.

Instead, you should consider splitting up the task, for example, extracting 10 data points in each of two separate requests, simplifying the task for the model. On the other side of the argument, you'll often find that some data points are related to each other, meaning they should be extracted in the same request. Furthermore, sending multiple requests increases the inference cost.

This figure highlights how you can use VLMs to perform information extraction. You again feed the VLM the image of the document, and also prompt the VLM to extract specific data points. In this figure, I prompt the VLM to extract the date of the document, the location mentioned in the document, and the document type. The VLM then analyzes the prompt and the document image, and outputs a JSON object containing the requested information. Image by the author.
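As a sketch of what such an extraction request could look like, the prompt below asks for the same three fields as in the figure and expects a JSON object back. The field names and the parsing helper are illustrative assumptions, not a fixed schema; the prompt would be sent together with the document image in the same way as the question answering sketch earlier.

# Minimal sketch: prompt a VLM to extract specific data points as a JSON object.
# Field names follow the figure above; parsing assumes the model obeys the return format.
import json

EXTRACTION_PROMPT = """
    ## General instructions
    Extract the following data points from the document image:
    - date: the date of the document, in ISO 8601 format
    - location: the location mentioned in the document
    - document_type: the type of the document

    ## Return format
    Respond only with a JSON object containing the keys above, and no other text
"""

def parse_extraction(response_text: str) -> dict:
    """Parse the VLM's JSON answer, raising an error if the model did not follow the format."""
    return json.loads(response_text)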

    When VLMs are problematic

VLMs are excellent models that can perform tasks that were impossible to solve with AI just a few years ago. However, they also have their limitations, which I'll cover in this section.

Cost of running VLMs

The first limitation is the cost of running VLMs, which I have briefly discussed earlier in this article. VLMs process images, which consist of a lot of pixels. These pixels represent a lot of information, which is encoded into tokens that the VLM can process. The issue is that since images contain so much information, you need to create a lot of tokens per image, which in turn increases the cost of running VLMs.

Furthermore, you often need high-resolution images, since the VLM is required to read text in the images, leading to even more tokens to process. VLMs are thus expensive to run, both over an API and in compute costs if you decide to self-host the VLM.

Cannot process long documents

The number of tokens contained in images also limits the number of pages a VLM can process at once. VLMs are limited by their context windows, just like traditional LLMs. This is a problem if you want to process long documents containing hundreds of pages. Naturally, you could split the document into chunks, but you might encounter problems where the VLM doesn't have access to all the contents of the document in a single pass.

For example, if you have a 100-page document, you could first process pages 1-50 and then pages 51-100. However, if some information on page 53 needs context from page 1 (for example, the title or date of the document), this will lead to issues.
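One simple mitigation, sketched below under my own assumptions (the chunk size, overlap, and prepending of the first page are illustrative choices, not the approach from Qwen's cookbook mentioned next), is to chunk pages with some overlap and carry the first page along in every chunk so document-level context such as the title or date is always available:

# Minimal sketch: split page indices into overlapping chunks, always including page 0
# so document-level context (title, date) is present in every request.
def chunk_pages(num_pages: int, chunk_size: int = 50, overlap: int = 2) -> list[list[int]]:
    chunks = []
    start = 0
    while start < num_pages:
        end = min(start + chunk_size, num_pages)
        pages = list(range(start, end))
        if 0 not in pages:
            pages = [0] + pages  # prepend the first page for shared context
        chunks.append(pages)
        start = end - overlap if end < num_pages else end
    return chunks

# For a 100-page document: pages 0-49, then page 0 plus 48-97, then page 0 plus 96-99.
print([len(chunk) for chunk in chunk_pages(100)])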

To learn how to deal with this problem, I read Qwen 3's cookbook, where they have a page on how to utilize Qwen 3 for ultra-long documents. I'll be sure to test this out and discuss how well it works in a future article.

    Conclusion

In this article, I have discussed vision language models and how you can apply them to different problem areas. I first described how to integrate VLMs into agentic systems, for example, as a computer-use agent or to debug web applications. Continuing, I covered areas such as question answering, classification, and information extraction. Finally, I covered some limitations of VLMs, discussing the computational cost of running VLMs and how they struggle with long documents.

👉 Find me on socials:

    🧑‍💻 Get in touch

    🔗 LinkedIn

    🐦 X / Twitter

    ✍️ Medium


