    How to Use Frontier Vision LLMs: Qwen3-VL

    By ProfitlyAI · October 20, 2025


    Vision language models (VLMs) are highly capable models that take both images and text as input and respond with text. This allows us to perform visual information extraction on documents and images. In this article, I’ll discuss the newly released Qwen 3 VL and the powerful capabilities VLMs possess.

    Qwen 3 VL was released a few weeks ago, initially with the 235B-A22B model, which is quite a large model. They then released the 30B-A3B, and just now released the dense 4B and 8B versions. My goal for this article is to highlight the capabilities of vision language models and describe them at a high level. I’ll use Qwen 3 VL as a specific example in this article, though there are many other high-quality VLMs available. I’m not affiliated with Qwen in any way when writing this article.

    This infographic covers the main topics of this article. I’ll discuss vision language models and how they are, in many scenarios, better than using OCR + LLMs to understand documents. Furthermore, I’ll discuss using VLMs for OCR and information extraction with Qwen 3 VL, and finally I’ll cover some downsides of VLMs. Image by ChatGPT.

    Why do we need vision language models

    Vision language models are necessary because the alternative is to rely on OCR and feed the OCR-ed text into an LLM. This has several issues:

    • OCR isn’t perfect, and the LLM has to cope with imperfect text extraction
    • You lose the information contained in the visual position of the text

    Traditional OCR engines like Tesseract have long been central to document processing. OCR has allowed us to input images and extract the text from them, enabling further processing of the document’s contents. However, traditional OCR is far from perfect, and it can struggle with issues like small text, skewed images, vertical text, and so on. If you have poor OCR output, you’ll struggle with all downstream tasks, whether you’re using regex or an LLM. Feeding images directly to VLMs, instead of OCR-ed text to LLMs, is far more effective at utilizing the information.
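    To make the comparison concrete, here is a minimal sketch of the traditional OCR + LLM pipeline described above, assuming pytesseract and Tesseract are installed. The file name, language codes, and prompt are placeholders for illustration. Note how the layout information is already gone by the time an LLM would see the text.

    # Minimal sketch of the OCR + LLM alternative (assumes pytesseract and
    # Tesseract are installed; file name and prompt are illustrative).
    import pytesseract
    from PIL import Image
    
    # Step 1: OCR the page into plain text. Any layout information (columns,
    # checkboxes, positions) is flattened into a single string at this point.
    page = Image.open("example-doc.jpg")
    ocr_text = pytesseract.image_to_string(page, lang="nor+eng")
    
    # Step 2: hand the flattened text to an LLM. The model never sees where
    # each word appeared on the page, which is exactly the limitation
    # discussed above.
    llm_prompt = f"Extract the address mentioned in this document:\n\n{ocr_text}"
    print(llm_prompt)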

    The visual position of text is often essential to understanding its meaning. Consider the example in the image below, where checkboxes highlight which text is relevant, and some checkboxes are ticked while others are not. You then have some text corresponding to each checkbox, where only the text beside the ticked checkbox is relevant. Extracting this information using OCR + LLMs is challenging, because you can’t know which text a ticked checkbox belongs to. Solving this task with a vision language model, however, is trivial.

    This example highlights a situation where vision language models are required. If you simply OCR the text, you lose the visual position of the ticked checkboxes, and it’s thus challenging to know which of the three documents is relevant. Solving this task with a vision language model, however, is simple. Image by the author.

    I fed the image above to Qwen 3 VL, and it replied with the response shown below:

    Based on the image provided, the documents that are checked off are:
    
    - **Document 1** (marked with an "X")
    - **Document 3** (marked with an "X")
    
    **Document 2** is not checked (it is blank).

    As you can see, Qwen 3 VL easily solved the problem correctly.


    Another reason we need VLMs is that we also get video understanding. Truly understanding video clips would be immensely challenging using OCR, as much of the information in videos is not displayed as text, but rather shown directly as imagery, so OCR is simply not effective. However, the new generation of VLMs lets you input hundreds of images, for example representing a video, allowing you to perform video understanding tasks.
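    As a small illustration, the sketch below samples frames from a video with OpenCV and passes them to the VLM as a list of images, using the inference helper defined later in this article. The file names, sampling rate, and prompt are assumptions for illustration only, and long videos will quickly run into the memory limits discussed at the end of the article.

    # Minimal sketch: sample frames from a video and pass them to the VLM
    # via the inference() helper defined later in this article. File names,
    # sampling rate, and the prompt are illustrative assumptions.
    import cv2
    
    def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list[str]:
        """Save one frame every `every_n_seconds` to disk and return the paths."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        step = max(1, int(fps * every_n_seconds))
        paths, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                path = f"frame_{idx:06d}.jpg"
                cv2.imwrite(path, frame)
                paths.append(path)
            idx += 1
        cap.release()
        return paths
    
    frame_paths = sample_frames("example-video.mp4")
    summary = inference(
        system_prompt="You are a helpful assistant.",
        user_prompt="Describe what happens in this video.",
        image_paths=frame_paths,
        max_image_size=1024,
    )
    print(summary)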

    Vision language model tasks

    There are many tasks you can apply vision language models to. I’ll discuss two of the most relevant ones:

    • OCR
    • Information extraction

    The data

    I’ll use the image below as the example image for my testing.

    I’ll use this image for my testing of Qwen 3 VL. The image is an openly available document from the planning authority of Oslo Municipality in Norway (“Plan og bygningsetaten”). I’m using this image because it’s an example of a real document you might want to apply a vision language model to. Note that I’ve cropped the image, as it originally contained a drawing as well. Unfortunately, my local computer is not powerful enough to process that large an image, so I decided to crop it. This allows me to run the image through Qwen 3 VL at high resolution. The image has a resolution of (768, 136), which is enough in this instance to perform OCR. It was cropped from a JPG, fetched from a PDF at 600 DPI.

    I’ll use this image because it’s an example of a real document, very relevant to apply Qwen 3 VL to. Furthermore, I’ve cropped the image to its current shape so that I can feed it into Qwen 3 VL at a high resolution on my local computer. Maintaining a high resolution is essential if you want to perform OCR on the image. I’ve extracted the JPG from a PDF using 600 DPI. Usually, 300 DPI is enough for OCR, but I kept the higher DPI just to be sure, which works for this small image.
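    If you want to reproduce this kind of preprocessing, here is a minimal sketch using the pdf2image library (which wraps Poppler). The file names, page index, and crop box are assumptions for illustration; the point is simply that the DPI is chosen at render time.

    # Minimal sketch: render a PDF page to JPG at 600 DPI and crop it.
    # Assumes pdf2image (and Poppler) are installed; file names and the
    # crop box are illustrative.
    from pdf2image import convert_from_path
    
    pages = convert_from_path("example-doc-site-plan.pdf", dpi=600)
    page = pages[0]  # first page as a PIL Image
    
    # Crop away the drawing and keep only the text panel (left, upper, right, lower).
    cropped = page.crop((0, 0, 3000, 550))
    cropped.save("example-doc-site-plan-cropped.jpg", "JPEG")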

    Prepare Qwen 3 VL

    I need the following packages to run Qwen 3 VL:

    torch
    accelerate
    pillow
    torchvision
    git+https://github.com/huggingface/transformers

    You have to install Transformers from source (GitHub), as Qwen 3 VL is not yet available in the latest Transformers release.

    The following code loads the imports, model, and processor, and creates an inference function:

    from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
    from PIL import Image
    import os
    import time
    
    # default: load the model on the available device(s)
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
    )
    
    processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
    
    
    
    def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
        """Resize the image if needed to a maximum dimension of max_size. Keeps the aspect ratio."""
        img = Image.open(image_path)
        width, height = img.size
        
        if width <= max_size and height <= max_size:
            return image_path
        
        ratio = min(max_size / width, max_size / height)
        new_width = int(width * ratio)
        new_height = int(height * ratio)
        
        img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        
        base_name = os.path.splitext(image_path)[0]
        ext = os.path.splitext(image_path)[1]
        resized_path = f"{base_name}_resized{ext}"
        
        img_resized.save(resized_path)
        return resized_path
    
    
    def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
        messages = [
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        ]
        
        user_content = []
        if image_paths:
            if max_image_size is not None:
                processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
            else:
                processed_paths = image_paths
            user_content.extend([
                {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
                for image_path in processed_paths
            ])
        user_content.append({"type": "text", "text": user_prompt})
        
        messages.append({
            "role": "user",
            "content": user_content,
        })
        
        return messages
    
    
    def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
        messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
        
        inputs = processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        )
        inputs = inputs.to(model.device)
        
        start_time = time.time()
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Strip the prompt tokens so only the newly generated tokens are decoded
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        end_time = time.time()
        print(f"Time taken: {end_time - start_time} seconds")
        
        return output_text[0]
    

    OCR

    OCR is a task that most VLMs are trained for. You can, for example, read the technical reports of the Qwen VL models, where they mention how OCR data is part of the training set. To train VLMs to perform OCR, they give the model a series of images and the text contained in those images. The model then learns to extract the text from the images.

    I’ll apply OCR to the image with the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen 3 VL cookbook.

    user_prompt = "Learn all of the textual content within the picture."

    Now I’ll run the model. I named the test image we’re working on example-doc-site-plan-cropped.jpg.

    system_prompt = """
    You are a helpful assistant that can answer questions and help with tasks.
    """
    
    user_prompt = "Learn all of the textual content within the picture."
    max_new_tokens = 1024
    
    image_paths = ["example-doc-site-plan-cropped.jpg"]
    output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
    print(output)
    

    Which outputs:

    Plan- og
    bygningsetaten
    
    Dato: 23.01.2014
    Bruker: HKN
    Målestokk 1:500
    Ekvidistanse 1m
    Høydegrunnlag: Oslo lokal
    Koordinatsystem: EUREF89 - UTM sone 32
    © Plan- og bygningsetaten,
    Oslo kommune
    Originalformat A3
    
    Adresse:
    Camilla Colletts vei 15
    
    Gnr/Bnr:
    .
    
    Kartet er sammenstilt for:
    .
    
    PlotID: / Best.nr.:
    27661 /
    
    Deres ref: Camilla Colletts vei 15
    
    Kommentar:
    Gjeldende kommunedelplaner:
    KDP-BB, KDP-13, KDP-5
    
    Kartutsnittet gjelder vertikalinvå 2.
    I tillegg finnes det regulering i
    følgende vertikalinvå:
    (Hvis blank: Ingen øvrige.)
    
    Det er ikke registrert
    naturn mangfold innenfor
    Se tegnforklaring på eget ark.
    
    Beskrivelse:
    NR:
    Dato:
    Revidert dato:

    This output is, from my testing, completely correct: it covers all the text in the image and extracts every character correctly.

    Information extraction

    You can also perform information extraction using vision language models. This can, for example, be used to extract important metadata from images. You usually also want to extract this metadata into a JSON format, so it’s easily parsable and can be used for downstream tasks. In this example, I’ll extract:

    • Date – 23.01.2014 in this example
    • Address – Camilla Colletts vei 15 in this example
    • Gnr (street number) – which in the test image is a blank field
    • Målestokk (scale) – 1:500

    I run the following code:

    user_prompt = """
    Extract the following information from the image, and answer in JSON format:
    {
        "date": "The date of the document. In format YYYY-MM-DD.",
        "address": "The address mentioned in the document.",
        "gnr": "The street number (Gnr) mentioned in the document.",
        "scale": "The scale (målestokk) mentioned in the document.",
    }
    If you cannot find the information, answer with None. The return object must be a valid JSON object. Answer only with the JSON object, no other text.
    """
    max_new_tokens = 1024
    
    image_paths = ["example-doc-site-plan-cropped.jpg"]
    output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
    print(output)
    
    

    Which outputs:

    {
        "date": "2014-01-23",
        "tackle": "Camilla Colletts vei 15",
        "gnr": "15",
        "scale": "1:500"
    }
    

    The JSON object is in a valid format, and Qwen has successfully extracted the date, address, and scale fields. However, Qwen has actually returned a gnr as well. When I first saw this result, I thought it was a hallucination, since the Gnr field in the test image is blank. However, Qwen has made the natural assumption that the Gnr is present in the address, which happens to be correct in this instance.
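    Since the prompt asks the model to answer with None for missing fields, the raw string is not always strictly valid JSON, so a little defensive parsing helps before using the result downstream. Below is a minimal sketch of how I would parse the returned string; the fence-stripping and None-handling are assumptions about what the model might return, not guaranteed behaviour.

    # Minimal sketch: parse the model's JSON-ish answer into a Python dict.
    # The cleanup steps are defensive assumptions (markdown fences, bare None),
    # not guaranteed model behaviour.
    import json
    
    def parse_extraction(raw: str) -> dict:
        text = raw.strip()
        # Strip markdown code fences if the model wrapped its answer in them.
        if text.startswith("```"):
            text = text.strip("`")
            text = text.removeprefix("json").strip()
        # The prompt says "answer with None", which is not valid JSON, so map it to null.
        text = text.replace(": None", ": null")
        return json.loads(text)
    
    fields = parse_extraction(output)
    print(fields.get("address"), fields.get("gnr"))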

    To be sure of its ability to answer None when it can’t find something, I asked Qwen to extract the Bnr (building number), which is not available in this example. Running the code below:

    user_prompt = """
    Extract the following information from the image, and answer in JSON format:
    {
        "date": "The date of the document. In format YYYY-MM-DD.",
        "address": "The address mentioned in the document.",
        "Bnr": "The building number (Bnr) mentioned in the document.",
        "scale": "The scale (målestokk) mentioned in the document.",
    }
    If you cannot find the information, answer with None. The return object must be a valid JSON object. Answer only with the JSON object, no other text.
    """
    max_new_tokens = 1024
    
    image_paths = ["example-doc-site-plan-cropped.jpg"]
    output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
    print(output)
    

    I get:

    {
        "date": "2014-01-23",
        "tackle": "Camilla Colletts vei 15",
        "Bnr": None,
        "scale": "1:500"
    }

    So, as you can see, Qwen does manage to tell us when information is not present in the document.

    Vision language models’ downsides

    I’d also like to note that vision language models have some issues as well. The image I tested OCR and information extraction on is relatively simple. To truly test the capabilities of Qwen 3, I’d have to expose it to more challenging tasks, for example extracting more text from a longer document or making it extract more metadata fields.

    The main current downsides of VLMs, from what I’ve seen, are:

    • Sometimes missing text with OCR
    • Inference is slow

    VLMs missing text when performing OCR is something I’ve observed several times. When it happens, the VLM typically just misses a section of the document and completely ignores that text. This is naturally very problematic, as it could miss text that is essential for downstream tasks like keyword searches. The reason this happens is a complicated topic that’s out of scope for this article, but it’s a problem you should be aware of if you’re performing OCR with VLMs.
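    One cheap mitigation, if you know terms that must appear in the document, is to cross-check the VLM’s OCR output against that list and flag pages for review when something is missing. The sketch below is a simple illustration of that idea using the `output` string from the OCR example above; the keyword list is an assumption for illustration, not part of the Qwen workflow.

    # Minimal sketch: flag pages where expected keywords are missing from the
    # VLM's OCR output. The keyword list is an illustrative assumption.
    expected_keywords = ["Målestokk", "Adresse", "Gnr"]
    
    missing = [kw for kw in expected_keywords if kw.lower() not in output.lower()]
    if missing:
        print(f"Warning: OCR output may be incomplete, missing: {missing}")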

    Furthermore, VLMs require a lot of processing power. I’m running locally on my PC, and even with a very small model I started experiencing memory issues when I simply wanted to process an image with dimensions of 2048×2048, which is problematic if I want to perform text extraction from larger documents. You can thus imagine how resource-intensive it is to apply VLMs to any of the following:

    • More images at once (for example, processing a 10-page document; a page-by-page sketch follows this list)
    • Processing documents at higher resolutions
    • Using a larger VLM with more parameters
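    To keep memory usage bounded on a multi-page document, one option is to render and process one page at a time with the inference helper defined earlier, rather than passing all pages in a single call. The sketch below assumes pdf2image is installed and that per-page OCR is acceptable for your use case; it trades speed for a smaller memory footprint.

    # Minimal sketch: OCR a multi-page PDF one page at a time to bound memory.
    # Assumes pdf2image is installed; file names are illustrative.
    from pdf2image import convert_from_path
    
    pages = convert_from_path("example-doc-site-plan.pdf", dpi=300)
    
    page_texts = []
    for i, page in enumerate(pages):
        page_path = f"page_{i}.jpg"
        page.save(page_path, "JPEG")
        # One inference call per page keeps the vision token count small.
        text = inference(
            system_prompt="You are a helpful assistant that can answer questions and help with tasks.",
            user_prompt="Read all the text in the image.",
            image_paths=[page_path],
            max_image_size=1536,
        )
        page_texts.append(text)
    
    full_text = "\n\n".join(page_texts)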

    Conclusion

    In this article, I’ve discussed VLMs. I started by discussing why we need them, highlighting how some tasks require both the text and its visual position. I then highlighted some tasks you can perform with VLMs and showed how Qwen 3 VL was able to perform them. I think the vision modality will become more and more important in the coming years. Up until a year ago, almost all focus was on pure text models. However, to reach even more powerful models, we need to make use of the vision modality, which is where I believe VLMs will be highly important.


