Vision language models (VLMs) are powerful models capable of taking both images and text as input, and responding with text. This allows us to perform visual information extraction on documents and images. In this article, I’ll discuss the newly released Qwen 3 VL, and the powerful capabilities VLMs possess.
Qwen 3 VL was released a few weeks ago, initially with the 235B-A22B model, which is quite a large model. They then released the 30B-A3B, and just now released the dense 4B and 8B versions. My goal for this article is to highlight the capabilities of vision language models and describe them at a high level. I’ll use Qwen 3 VL as a specific example in this article, though there are many other high-quality VLMs available. I’m not affiliated in any way with Qwen when writing this article.
Why do we need vision language models
Vision language models are necessary because the alternative is to rely on OCR and feed the OCR-ed text into an LLM. This has several issues:
- OCR isn’t perfect, and the LLM has to deal with imperfect text extraction
- You lose the information contained in the visual position of the text
Traditional OCR engines like Tesseract have long been super important to document processing. OCR has allowed us to input images and extract the text from them, enabling further processing of the contents of the document. However, traditional OCR is far from perfect, and it can struggle with issues like small text, skewed images, vertical text, and so on. If you have poor OCR output, you’ll struggle with all downstream tasks, whether you’re using regex or an LLM. Feeding images directly to VLMs, instead of OCR-ed text to LLMs, is thus far more effective at utilizing the information.
The visual position of text is often critical to understanding the meaning of the text. Consider the example in the image below, where you have checkboxes highlighting which text is relevant, where some checkboxes are ticked off, and some are not. You might then have some text corresponding to each checkbox, where only the text beside the ticked-off checkbox is relevant. Extracting this information using OCR + LLMs is challenging, because you can’t know which text the ticked checkbox belongs to. However, solving this task using vision language models is trivial.

I fed the image above to Qwen 3 VL, and it replied with the response shown below:
Based on the image provided, the documents that are checked off are:
- **Document 1** (marked with an "X")
- **Document 3** (marked with an "X")
**Document 2** is not checked (it's blank).
As you can see, Qwen 3 VL easily solved the problem correctly.
Another reason we need VLMs is that we also get video understanding. Truly understanding video clips would be immensely challenging using OCR, as a lot of the information in videos is not displayed as text, but rather shown directly as an image. OCR is thus not effective. However, the new generation of VLMs allows you to input hundreds of images, for example, representing a video, allowing you to perform video understanding tasks.
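To make the idea of feeding a video as a series of images more concrete, here is a minimal sketch of picking evenly spaced frames from a clip. The function name `sample_frame_indices` and the frame counts are my own illustrative assumptions, not part of Qwen 3 VL. Decoding the frames themselves would need a video library and is left out.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each equally sized segment of the clip.
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical clip: 10 seconds at 30 fps, reduced to 8 frames.
frame_indices = sample_frame_indices(300, 8)
```

Each selected frame could then be passed to the model as one image, so the number of samples directly controls how much the model has to process.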
Vision language model tasks
There are many tasks you can apply vision language models to. I’ll discuss a few of the most relevant tasks.
- OCR
- Information extraction
The data
I’ll use the image below as an example image for my testing.

I’ll use this image because it’s an example of a real document, very relevant to apply Qwen 3 VL to. Furthermore, I’ve cropped the image to its current shape, so that I can feed the image at a high resolution into Qwen 3 VL on my local computer. Maintaining a high resolution is critical if you want to perform OCR on the image. I’ve extracted the JPG from a PDF using 600 DPI. Usually, 300 DPI is enough for OCR, but I kept a higher DPI just to be sure, which works for this small image.
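To see why cropping matters at 600 DPI, the arithmetic can be sketched as follows. I am assuming the full page is A3 (297 mm × 420 mm), based on the "Originalformat A3" line in the document. The helper name `pixels_at_dpi` is hypothetical.

```python
MM_PER_INCH = 25.4


def pixels_at_dpi(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page of the given physical size rendered at dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


# Assumed page size: A3, 297 mm x 420 mm.
a3_at_300 = pixels_at_dpi(297, 420, 300)  # (3508, 4961)
a3_at_600 = pixels_at_dpi(297, 420, 600)  # (7016, 9921)
```

Under that assumption, a full A3 page at 600 DPI is roughly 70 megapixels, which is far more than a small local setup can comfortably feed to a VLM. A crop keeps the text sharp without the full page cost.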
Prepare Qwen 3 VL
I need the following packages to run Qwen 3 VL:
torch
accelerate
pillow
torchvision
git+https://github.com/huggingface/transformers
You need to install Transformers from source (GitHub), as Qwen 3 VL is not yet available in the latest Transformers release.
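If you are unsure whether your installed Transformers build already includes Qwen 3 VL, a small check like the one below can help. This is a sketch. The helper `has_symbol` is my own name, and it only tests whether a module can be imported and exposes a given attribute.

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if module_name can be imported and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed at all.
        return False
    return hasattr(module, symbol)
```

Calling `has_symbol("transformers", "Qwen3VLForConditionalGeneration")` should return `True` once a build that includes the model class is installed, and `False` otherwise.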
The following code loads the imports, model, and processor, and creates an inference function:
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import os
import time

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")


def _resize_image_if_needed(image_path: str, max_size: int = 1024) -> str:
    """Resize image if needed to a maximum size of max_size. Keep the aspect ratio."""
    img = Image.open(image_path)
    width, height = img.size
    if width <= max_size and height <= max_size:
        return image_path
    # Scale by the more restrictive side so both sides fit within max_size
    ratio = min(max_size / width, max_size / height)
    new_width = int(width * ratio)
    new_height = int(height * ratio)
    img_resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
    # Save the resized copy next to the original, with a "_resized" suffix
    base_name = os.path.splitext(image_path)[0]
    ext = os.path.splitext(image_path)[1]
    resized_path = f"{base_name}_resized{ext}"
    img_resized.save(resized_path)
    return resized_path


def _build_messages(system_prompt: str, user_prompt: str, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
    ]
    user_content = []
    if image_paths:
        if max_image_size is not None:
            processed_paths = [_resize_image_if_needed(path, max_image_size) for path in image_paths]
        else:
            processed_paths = image_paths
        # Images go first in the user turn, followed by the text prompt
        user_content.extend([
            {"type": "image", "min_pixels": 512*32*32, "max_pixels": 2048*32*32, "image": image_path}
            for image_path in processed_paths
        ])
    user_content.append({"type": "text", "text": user_prompt})
    messages.append({
        "role": "user",
        "content": user_content,
    })
    return messages


def inference(system_prompt: str, user_prompt: str, max_new_tokens: int = 1024, image_paths: list[str] | None = None, max_image_size: int | None = None):
    messages = _build_messages(system_prompt, user_prompt, image_paths, max_image_size)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    start_time = time.time()
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")
    return output_text[0]
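The resize arithmetic and the `min_pixels` / `max_pixels` values in the code above can be explored without loading a model. The sketch below repeats the same calculations on plain numbers. I am assuming that the factor 32 refers to the side length, in pixels, of one image block seen by the model. That is my reading of the `*32*32` factors, not something stated above. The names `fit_within`, `block_count`, and `within_budget` are hypothetical.

```python
import math

BLOCK = 32  # assumed block side length in pixels (inferred from "*32*32")
MIN_PIXELS = 512 * BLOCK * BLOCK   # 524,288 pixels
MAX_PIXELS = 2048 * BLOCK * BLOCK  # 2,097,152 pixels


def fit_within(width: int, height: int, max_size: int) -> tuple[int, int]:
    """Same arithmetic as _resize_image_if_needed, without touching any file."""
    if width <= max_size and height <= max_size:
        return width, height
    ratio = min(max_size / width, max_size / height)
    return int(width * ratio), int(height * ratio)


def block_count(width: int, height: int, block: int = BLOCK) -> int:
    """Number of block x block tiles needed to cover the image."""
    return math.ceil(width / block) * math.ceil(height / block)


def within_budget(width: int, height: int) -> bool:
    """True if the pixel count lies between MIN_PIXELS and MAX_PIXELS."""
    return MIN_PIXELS <= width * height <= MAX_PIXELS
```

For example, `fit_within(4096, 2048, 1024)` gives `(1024, 512)`, which is 512 blocks and sits exactly at the lower pixel bound. A 2048×2048 image has about 4.2 million pixels, which is above the upper bound. The sketch only does the arithmetic. How the processor treats images outside these bounds is not covered here.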
OCR
OCR is a task that most VLMs are trained for. You can, for example, read the technical reports of the Qwen VL models, where they mention how OCR data is a part of the training set. To train VLMs to perform OCR, they give the model a series of images, and the text contained in those images. The model then learns to extract the text from the images.
I’ll apply OCR to the image with the prompt below, which is the same prompt the Qwen team uses to perform OCR according to the Qwen 3 VL cookbook.
user_prompt = "Read all the text in the image."
Now I’ll run the model. I named the test image we’re running on example-doc-site-plan-cropped.jpg.
system_prompt = """
You are a helpful assistant that can answer questions and help with tasks.
"""
user_prompt = "Read all the text in the image."
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
Plan- og
bygningsetaten
Dato: 23.01.2014
Bruker: HKN
Målestokk 1:500
Ekvidistanse 1m
Høydegrunnlag: Oslo lokal
Koordinatsystem: EUREF89 - UTM sone 32
© Plan- og bygningsetaten,
Oslo kommune
Originalformat A3
Adresse:
Camilla Colletts vei 15
Gnr/Bnr:
.
Kartet er sammenstilt for:
.
PlotID: / Best.nr.:
27661 /
Deres ref: Camilla Colletts vei 15
Kommentar:
Gjeldende kommunedelplaner:
KDP-BB, KDP-13, KDP-5
Kartutsnittet gjelder vertikalinvå 2.
I tillegg finnes det regulering i
følgende vertikalinvå:
(Hvis blank: Ingen øvrige.)
Det er ikke registrert
naturn mangfold innenfor
Se tegnforklaring på eget ark.
Beskrivelse:
NR:
Dato:
Revidert dato:
From my testing, this output is completely correct: it covers all the text in the image and extracts every character correctly.
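If you have a hand-typed transcription of the document, you can replace "looks correct" with a number. The sketch below compares the model output against such a reference using Python’s standard `difflib`. The reference text is an assumption on my part (you would have to create it yourself), and the function names are hypothetical.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse all whitespace so line breaks do not count as differences."""
    return " ".join(text.split())


def similarity(ocr_text: str, reference_text: str) -> float:
    """Similarity ratio in [0, 1] between two texts after whitespace normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(ocr_text), normalize(reference_text), autojunk=False
    )
    return matcher.ratio()


# Hypothetical usage with a short excerpt of the output above.
score = similarity("Dato: 23.01.2014\nBruker: HKN", "Dato: 23.01.2014 Bruker: HKN")
```

A ratio of 1.0 means the two texts are identical after whitespace normalization. Lower values indicate missing or wrong characters and are worth inspecting by hand.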
Information extraction
You can also perform information extraction using vision language models. This can, for example, be used to extract important metadata from images. You often also want to extract this metadata into a JSON format, so it’s easily parsable and can be used for downstream tasks. In this example, I’ll extract:
- Date – 23.01.2014 in this example
- Address – Camilla Colletts vei 15 in this example
- Gnr (street number) – which in the test image is a blank field
- Målestokk (scale) – 1:500
I’m running the following code:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
"date": "The date of the document. In format YYYY-MM-DD.",
"address": "The address mentioned in the document.",
"gnr": "The street number (Gnr) mentioned in the document.",
"scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
Which outputs:
{
"date": "2014-01-23",
"address": "Camilla Colletts vei 15",
"gnr": "15",
"scale": "1:500"
}
The JSON object is in a valid format, and Qwen has successfully extracted the date, address, and scale fields. However, Qwen has also returned a gnr. Initially, when I saw this result, I thought it was a hallucination, since the Gnr field in the test image is blank. However, Qwen has made the natural assumption that the Gnr is available in the address, which is correct in this instance.
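The date is the one field the model had to reformat (23.01.2014 in the document, 2014-01-23 in the JSON), so it is a good candidate for an independent check. The sketch below converts the date string from the OCR output using the standard library and compares it with the extracted value. The function names are my own.

```python
from datetime import datetime


def norwegian_date_to_iso(date_text: str) -> str:
    """Convert a DD.MM.YYYY date string to YYYY-MM-DD. Raises ValueError if malformed."""
    return datetime.strptime(date_text.strip(), "%d.%m.%Y").strftime("%Y-%m-%d")


def date_matches(ocr_date: str, extracted_iso_date: str) -> bool:
    """True if the extracted ISO date equals the date printed in the document."""
    return norwegian_date_to_iso(ocr_date) == extracted_iso_date
```

With the values from this example, `date_matches("23.01.2014", "2014-01-23")` returns `True`.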
To be sure of its ability to answer None if it can’t find anything, I asked Qwen to extract the Bnr (building number), which isn’t available in this example. Running the code below:
user_prompt = """
Extract the following information from the image, and reply in JSON format:
{
"date": "The date of the document. In format YYYY-MM-DD.",
"address": "The address mentioned in the document.",
"Bnr": "The building number (Bnr) mentioned in the document.",
"scale": "The scale (målestokk) mentioned in the document.",
}
If you cannot find the information, reply with None. The return object must be a valid JSON object. Reply only the JSON object, no other text.
"""
max_new_tokens = 1024
image_paths = ["example-doc-site-plan-cropped.jpg"]
output = inference(system_prompt, user_prompt, max_new_tokens, image_paths, max_image_size=1536)
print(output)
I get:
{
"date": "2014-01-23",
"address": "Camilla Colletts vei 15",
"Bnr": None,
"scale": "1:500"
}
So, as you can see, Qwen does manage to tell us when information is not present in the document.
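One practical detail: `None` is Python syntax, not JSON (JSON uses `null`), so `json.loads` would reject the output above as it stands. A minimal sketch of a tolerant parser is shown below. The function name `parse_model_output` is hypothetical, and the fallback to `ast.literal_eval` is my own suggestion rather than part of the workflow above.

```python
import ast
import json


def parse_model_output(text: str) -> dict:
    """Parse a JSON-like object from model output, tolerating Python-style None."""
    cleaned = text.strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax, which accepts None and trailing commas.
        result = ast.literal_eval(cleaned)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object")
    return result


raw_output = """{
"date": "2014-01-23",
"address": "Camilla Colletts vei 15",
"Bnr": None,
"scale": "1:500"
}"""
parsed = parse_model_output(raw_output)
```

After parsing, `parsed["Bnr"]` is a real Python `None`, so downstream code can test for missing values without string comparisons. Asking the model for `null` instead of `None` in the prompt would be another option, though I have not tested how that changes the output.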
Vision language models’ downsides
I’d also like to note that there are some issues with vision language models as well. The image I tested OCR and information extraction on is a relatively simple image. To truly test the capabilities of Qwen 3 VL, I would have to expose it to more challenging tasks, for example, extracting more text from a longer document or making it extract more metadata fields.
The main current downsides with VLMs, from what I’ve seen, are:
- Sometimes missing text with OCR
- Inference is slow
VLMs missing text when performing OCR is something I’ve observed several times. When it happens, the VLM typically just skips a section of the document and completely ignores the text. This is naturally very problematic, as it could miss text that is critical for downstream tasks like performing keyword searches. The reason this happens is a complicated topic that is out of scope for this article, but it’s a problem you should be aware of if you’re performing OCR with VLMs.
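When you know which field labels a document type should contain (for example, from a form template), a cheap guard against silently skipped sections is to check that those labels appear in the OCR output. This is only a sketch of that idea. The label list is an assumption you would supply yourself, and the function name is hypothetical.

```python
def missing_labels(ocr_text: str, expected_labels: list[str]) -> list[str]:
    """Return the expected labels that do not appear in the OCR text (case-insensitive)."""
    haystack = ocr_text.casefold()
    return [label for label in expected_labels if label.casefold() not in haystack]


# Hypothetical usage: the excerpt lacks the coordinate system line.
excerpt = "Dato: 23.01.2014\nMålestokk 1:500\nAdresse:\nCamilla Colletts vei 15"
missing = missing_labels(excerpt, ["Dato", "Målestokk", "Adresse", "Koordinatsystem"])
```

A non-empty result does not prove the model skipped a section, but it is a useful signal for rerunning the page or flagging it for manual review.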
Furthermore, VLMs require a lot of processing power. I’m running locally on my PC, though I’m also running a very small model. I started experiencing memory issues when I simply wanted to process an image with dimensions of 2048×2048, which is problematic if I want to perform text extraction from larger documents. You can thus imagine how resource-intensive it is to apply VLMs to:
- More images at once (for example, processing a 10-page document)
- Documents of higher resolutions
- A larger VLM, with more parameters
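One possible workaround for large images, which I have not used in the experiments above, is to split the image into smaller crops and process them one at a time. The sketch below only computes the crop boxes. The function name `tile_boxes` and the tile and overlap sizes are illustrative assumptions. The boxes use the (left, upper, right, lower) convention that Pillow’s `Image.crop` expects.

```python
def tile_boxes(width: int, height: int, tile: int, overlap: int = 0) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) crop boxes that cover a width x height image."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("tile must be positive and overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(length: int) -> list[int]:
        # Start offsets along one axis so that the tiles cover [0, length).
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # final tile sits flush with the edge
        return positions

    return [
        (left, top, min(left + tile, width), min(top + tile, height))
        for top in starts(height)
        for left in starts(width)
    ]
```

For a 2048×2048 image and a tile size of 1024, this gives four crops of 1024×1024. Tiling trades memory for more inference calls, and text that crosses a tile border can be cut in half, which is why an overlap is usually worth adding. It also removes some of the global layout information that makes VLMs attractive in the first place.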
Conclusion
In this article, I’ve discussed VLMs. I started off by discussing why we need VLMs, highlighting how some tasks require both the text and the visual position of the text. Furthermore, I highlighted some tasks you can perform with VLMs and how Qwen 3 VL was able to perform these tasks. I think the vision modality will become more and more important in the coming years. Up until a year ago, almost all focus was on pure text models. However, to achieve even more powerful models, we need to utilize the vision modality, which is where I believe VLMs will be incredibly important.