Before we begin:
- I’m a developer at Google Cloud. Ideas and opinions expressed here are entirely my own.
- The complete source code for this article, including future updates, is available in this notebook under the Apache 2.0 license.
- All new images in this article were generated with Gemini Nano Banana using the explored proof of concept. All source images are either in the public domain or free to use (reference links are provided in the code output).
- You can experiment with Gemini models for free in Google AI Studio. For programmatic API access, please note that while a free tier is available for some models (e.g., you can perform object detection), image generation is a pay-as-you-go service.
✨ Overview
Traditional computer vision models are typically trained to detect a fixed set of object classes, like “person”, “cat”, or “car”. If you want to detect something specific that wasn’t in the training set, such as an “illustration” in a book photograph, you usually need to gather a dataset, label it manually, and train a custom model, which can take hours or even days.
In this exploration, we’ll take a different approach using Gemini. We’ll leverage its spatial understanding capabilities to perform open-vocabulary object detection. This lets us find objects based solely on a natural language description, without any training.
Once the visual objects are detected, we’ll extract them and then use Gemini’s image editing capabilities (specifically the Nano Banana models) to restore and creatively transform them.
🔥 Problem
We’re dealing with unstructured data: pictures of books, magazines, and objects in the wild. These photos present several difficulties for traditional computer vision:
- Variety: The objects we want to find (illustrations, engravings, and visuals in general) vary wildly in style and content.
- Distortion: Pages are curved, pictures are taken at angles, and lighting is uneven.
- Noise: Old books have stains, paper grain, and text bleeding through from the other side.
Our challenge is to build a robust pipeline that can detect these objects despite the distortions, extract them cleanly, and edit them to look like high-quality digital assets… all using simple text prompts.
🏁 Setup
🐍 Python packages
We’ll use the following packages:
- google-genai: the Google Gen AI Python SDK lets us call Gemini with a few lines of code
- pillow: for image management
- matplotlib: for result visualization
We’ll also use these packages (dependencies of google-genai):
- pydantic: for data management
- tenacity: for request management
pip install --quiet "google-genai>=1.63.0" "pillow>=11.3.0" "matplotlib>=3.10.0"
🔗 Gemini API
To use the Gemini API, we have two main options:
- Via Vertex AI with a Google Cloud project
- Via Google AI Studio with a Gemini API key
The Google Gen AI SDK provides a unified interface to these APIs, and we can use environment variables for the configuration. 🔽
🛠️ Option 1 – Gemini API via Vertex AI
Requirements:
Gen AI SDK environment variables:
GOOGLE_GENAI_USE_VERTEXAI="True"
GOOGLE_CLOUD_PROJECT="<PROJECT_ID>"
GOOGLE_CLOUD_LOCATION="<LOCATION>"
💡 For preview models, the location must be set to global. For generally available models, we can choose the closest location among the Google model endpoint locations.
ℹ️ Learn more about setting up a project and a development environment.
🛠️ Option 2 – Gemini API via Google AI Studio
Requirement:
Gen AI SDK environment variables:
GOOGLE_GENAI_USE_VERTEXAI="False"
GOOGLE_API_KEY="<API_KEY>"
ℹ️ Learn more about getting a Gemini API key from Google AI Studio.
💡 You can store your environment configuration outside of the source code:
| Environment | Method |
|---|---|
| IDE | .env file (or equivalent) |
| Colab | Colab Secrets (🗝️ icon in left panel, see code below) |
| Colab Enterprise | Google Cloud project and location are automatically defined |
| Vertex AI Workbench | Google Cloud project and location are automatically defined |
Define the following environment detection functions. You can also define your configuration manually if needed. 🔽
import os
import sys
from collections.abc import Callable

from google import genai

# Manual setup (leave unchanged if setup is environment-defined)
# @markdown **Which API: Vertex AI or Google AI Studio?**
GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: "boolean"}
# @markdown **Option A - Google Cloud project [+location]**
GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}
GOOGLE_CLOUD_LOCATION = "global"  # @param {type: "string"}
# @markdown **Option B - Google AI Studio API key**
GOOGLE_API_KEY = ""  # @param {type: "string"}


def check_environment() -> bool:
    check_colab_user_authentication()
    return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()


def check_manual_setup() -> bool:
    return check_define_env_vars(
        GOOGLE_GENAI_USE_VERTEXAI,
        GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with a line return
        GOOGLE_CLOUD_LOCATION,
        GOOGLE_API_KEY,
    )


def check_vertex_ai() -> bool:
    # Workbench and Colab Enterprise
    match os.getenv("VERTEX_PRODUCT", ""):
        case "WORKBENCH_INSTANCE":
            pass
        case "COLAB_ENTERPRISE":
            if not running_in_colab_env():
                return False
        case _:
            return False
    return check_define_env_vars(
        True,
        os.getenv("GOOGLE_CLOUD_PROJECT", ""),
        os.getenv("GOOGLE_CLOUD_REGION", ""),
        "",
    )


def check_colab() -> bool:
    if not running_in_colab_env():
        return False
    # Colab Enterprise was checked before, so this is Colab only
    from google.colab import auth as colab_auth  # type: ignore

    colab_auth.authenticate_user()
    # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables
    # Secrets are private, visible only to you and the notebooks that you select
    # - Vertex AI: Store your settings as secrets
    # - Google AI: Directly import your Gemini API key from the UI
    vertexai, project, location, api_key = get_vars(get_colab_secret)
    return check_define_env_vars(vertexai, project, location, api_key)


def check_local() -> bool:
    vertexai, project, location, api_key = get_vars(os.getenv)
    return check_define_env_vars(vertexai, project, location, api_key)


def running_in_colab_env() -> bool:
    # Colab or Colab Enterprise
    return "google.colab" in sys.modules


def check_colab_user_authentication() -> None:
    if running_in_colab_env():
        from google.colab import auth as colab_auth  # type: ignore

        colab_auth.authenticate_user()


def get_colab_secret(secret_name: str, default: str) -> str:
    from google.colab import errors, userdata  # type: ignore

    try:
        return userdata.get(secret_name)
    except errors.SecretNotFoundError:
        return default


def disable_colab_cell_scrollbar() -> None:
    if running_in_colab_env():
        from google.colab import output  # type: ignore

        output.no_vertical_scroll()


def get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:
    # Limit getenv calls to the minimum (may trigger UI confirmation for secret access)
    vertexai_str = getenv("GOOGLE_GENAI_USE_VERTEXAI", "")
    if vertexai_str:
        vertexai = vertexai_str.lower() in ["true", "1"]
    else:
        vertexai = bool(getenv("GOOGLE_CLOUD_PROJECT", ""))
    project = getenv("GOOGLE_CLOUD_PROJECT", "") if vertexai else ""
    location = getenv("GOOGLE_CLOUD_LOCATION", "") if project else ""
    api_key = getenv("GOOGLE_API_KEY", "") if not project else ""
    return vertexai, project, location, api_key


def check_define_env_vars(
    vertexai: bool,
    project: str,
    location: str,
    api_key: str,
) -> bool:
    match (vertexai, bool(project), bool(location), bool(api_key)):
        case (True, True, _, _):
            # Vertex AI - Google Cloud project [+location]
            location = location or "global"
            define_env_vars(vertexai, project, location, "")
        case (True, False, _, True):
            # Vertex AI - API key
            define_env_vars(vertexai, "", "", api_key)
        case (False, _, _, True):
            # Google AI Studio - API key
            define_env_vars(vertexai, "", "", api_key)
        case _:
            return False
    return True


def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -> None:
    os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = str(vertexai)
    os.environ["GOOGLE_CLOUD_PROJECT"] = project
    os.environ["GOOGLE_CLOUD_LOCATION"] = location
    os.environ["GOOGLE_API_KEY"] = api_key


def check_configuration(client: genai.Client) -> None:
    service = "Vertex AI" if client.vertexai else "Google AI Studio"
    print(f"✅ Using the {service} API", end="")
    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make sure it was created for {service})")


print("✅ Environment functions defined")
🤖 Gen AI SDK
To send Gemini requests, create a google.genai client:
from google import genai

check_environment()
client = genai.Client()
check_configuration(client)
🖼️ Image test suite
Let’s define a list of images for our tests: 🔽
from dataclasses import dataclass
from enum import StrEnum

Url = str


class Source(StrEnum):
    incunable = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"
    engravings = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"
    museum_guidebook = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"
    denver_illustrated = "https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"
    physics_textbook = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg"
    portrait_miniatures = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg"
    wizard_of_oz_drawings = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"
    paintings = "https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"
    alice_drawing = "https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"
    book = "https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800"
    manual = "https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800"
    electronics = "https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"


@dataclass
class SourceMetadata:
    title: str
    webpage_url: Url
    credit_line: str


LOC = "Library of Congress"
LOC_RARE_BOOKS = "Library of Congress, Rare Book and Special Collections Division"
LOC_MEETING_FRONTIERS = "Library of Congress, Meeting of Frontiers"

metadata_by_source: dict[Source, SourceMetadata] = {
    Source.incunable: SourceMetadata(
        "Vergaderinge der historien van Troy (1485)",
        "https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165",
        LOC_RARE_BOOKS,
    ),
    Source.engravings: SourceMetadata(
        "Harper's illustrated catalogue (1847)",
        "https://www.loc.gov/resource/gdcscd.00340766921/?sp=121",
        LOC,
    ),
    Source.museum_guidebook: SourceMetadata(
        "Barnum's American Museum illustrated (1850)",
        "https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33",
        LOC_RARE_BOOKS,
    ),
    Source.denver_illustrated: SourceMetadata(
        "Denver illustrated (1893)",
        "https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51",
        LOC_MEETING_FRONTIERS,
    ),
    Source.physics_textbook: SourceMetadata(
        "Lessons in physics (1916)",
        "https://www.loc.gov/resource/gdcscd.00036487318/?sp=103",
        LOC,
    ),
    Source.portrait_miniatures: SourceMetadata(
        "The history of portrait miniatures (1904)",
        "https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249",
        LOC_RARE_BOOKS,
    ),
    Source.wizard_of_oz_drawings: SourceMetadata(
        "The wonderful Wizard of Oz (1899)",
        "https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48",
        LOC_RARE_BOOKS,
    ),
    Source.paintings: SourceMetadata(
        "Open book showing paintings by Vincent van Gogh",
        "https://unsplash.com/photos/9hD7qrxICag",
        "Photo by Trung Manh cong on Unsplash",
    ),
    Source.alice_drawing: SourceMetadata(
        "Open book showing an illustration and text from Alice's Adventures in Wonderland",
        "https://unsplash.com/photos/bewzr_Q9u2o",
        "Photo by Brett Jordan on Unsplash",
    ),
    Source.book: SourceMetadata(
        "Open book showing two botanical illustrations",
        "https://unsplash.com/photos/4IDqcNj827I",
        "Photo by Ranurte on Unsplash",
    ),
    Source.manual: SourceMetadata(
        "Open user manual for vintage camera",
        "https://unsplash.com/photos/aaFU96eYASk",
        "Photo by Annie Spratt on Unsplash",
    ),
    Source.electronics: SourceMetadata(
        "Circuit board with electronic components",
        "https://unsplash.com/photos/Aqa1pHQ57pw",
        "Photo by Albert Stoynov on Unsplash",
    ),
}

print("✅ Test images defined")
🧠 Gemini models
Gemini comes in different versions. We can currently use the following models:
- For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.
- For object editing: Gemini 2.5 Flash Image or Gemini 3 Pro Image, also known as Nano Banana and Nano Banana Pro.
🛠️ Helpers
Now, let’s add core helper classes and functions: 🔽
from enum import auto
from pathlib import Path
from typing import Any, cast

import IPython.display
import matplotlib.pyplot as plt
import pydantic
import tenacity
from google.genai.errors import ClientError
from google.genai.types import (
    FinishReason,
    GenerateContentConfig,
    GenerateContentResponse,
    PIL_Image,
    ThinkingConfig,
    ThinkingLevel,
)


# Multimodal models with spatial understanding and structured outputs
class MultimodalModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH = "gemini-2.5-flash"
    GEMINI_2_5_PRO = "gemini-2.5-pro"
    # Preview
    GEMINI_3_FLASH_PREVIEW = "gemini-3-flash-preview"
    GEMINI_3_1_PRO_PREVIEW = "gemini-3.1-pro-preview"
    # Default model used for object detection
    DEFAULT = GEMINI_3_FLASH_PREVIEW


# Image generation and editing models
class ImageModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH_IMAGE = "gemini-2.5-flash-image"  # Nano Banana 🍌
    # Preview
    GEMINI_3_PRO_IMAGE_PREVIEW = "gemini-3-pro-image-preview"  # Nano Banana Pro 🍌
    # Default model used for image editing
    DEFAULT = GEMINI_2_5_FLASH_IMAGE


Model = MultimodalModel | ImageModel


def generate_content(
    contents: list[Any],
    model: Model,
    config: GenerateContentConfig | None,
    should_display_response_info: bool = False,
) -> GenerateContentResponse | None:
    response = None
    client = check_client_for_model(model)
    for attempt in get_retrier():
        with attempt:
            response = client.models.generate_content(
                model=model.value,
                contents=contents,
                config=config,
            )
    if should_display_response_info:
        display_response_info(response, config)
    return response


def check_client_for_model(model: Model) -> genai.Client:
    if (
        model.value.endswith("-preview")
        and client.vertexai
        and client._api_client.location != "global"
    ):
        # Preview models are only available in the "global" location
        return genai.Client(location="global")
    return client


def display_response_info(
    response: GenerateContentResponse | None,
    config: GenerateContentConfig | None,
) -> None:
    if response is None:
        print("❌ No response")
        return
    if usage_metadata := response.usage_metadata:
        if usage_metadata.prompt_token_count:
            print(f"Input tokens  : {usage_metadata.prompt_token_count:9,d}")
        if usage_metadata.candidates_token_count:
            print(f"Output tokens : {usage_metadata.candidates_token_count:9,d}")
        if usage_metadata.thoughts_token_count:
            print(f"Thought tokens: {usage_metadata.thoughts_token_count:9,d}")
    if (
        config is not None
        and config.response_mime_type == "application/json"
        and response.parsed is None
    ):
        print("❌ Could not parse the JSON response")
        return
    if not response.candidates:
        print("❌ No `response.candidates`")
        return
    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:
        print(f"❌ {finish_reason = }")
    if not response.text:
        print("❌ No `response.text`")
        return


def generate_image(
    sources: list[PIL_Image],
    prompt: str,
    model: ImageModel,
    config: GenerateContentConfig | None = None,
) -> PIL_Image | None:
    contents = [*sources, prompt.strip()]
    response = generate_content(contents, model, config)
    return check_get_output_image_from_response(response)


def check_get_output_image_from_response(
    response: GenerateContentResponse | None,
) -> PIL_Image | None:
    if response is None:
        print("❌ No `response`")
        return None
    if not response.candidates:
        print("❌ No `response.candidates`")
        if response.prompt_feedback:
            if block_reason := response.prompt_feedback.block_reason:
                print(f"{block_reason = :s}")
            if block_reason_message := response.prompt_feedback.block_reason_message:
                print(f"{block_reason_message = }")
        return None
    if not (content := response.candidates[0].content):
        print("❌ No `response.candidates[0].content`")
        return None
    if not (parts := content.parts):
        print("❌ No `response.candidates[0].content.parts`")
        return None
    output_image: PIL_Image | None = None
    for part in parts:
        if part.text:
            display_markdown(part.text)
            continue
        sdk_image = part.as_image()
        assert sdk_image is not None
        output_image = sdk_image._pil_image
        assert output_image is not None
        break  # There should be a single image
    return output_image


def get_thinking_config(model: Model) -> ThinkingConfig | None:
    match model:
        case MultimodalModel.GEMINI_2_5_FLASH:
            return ThinkingConfig(thinking_budget=0)
        case MultimodalModel.GEMINI_2_5_PRO:
            return ThinkingConfig(thinking_budget=128, include_thoughts=False)
        case MultimodalModel.GEMINI_3_FLASH_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)
        case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.LOW)
        case _:
            return None  # Default


def display_markdown(markdown: str) -> None:
    IPython.display.display(IPython.display.Markdown(markdown))


def display_image(image: PIL_Image) -> None:
    IPython.display.display(image)


def get_retrier() -> tenacity.Retrying:
    return tenacity.Retrying(
        stop=tenacity.stop_after_attempt(7),
        wait=tenacity.wait_incrementing(start=10, increment=1),
        retry=should_retry_request,
        reraise=True,
    )


def should_retry_request(retry_state: tenacity.RetryCallState) -> bool:
    if not retry_state.outcome:
        return False
    err = retry_state.outcome.exception()
    if not isinstance(err, ClientError):
        return False
    print(f"❌ ClientError {err.code}: {err.message}")
    retry = False
    match err.code:
        case 400 if err.message is not None and " try again " in err.message:
            # Workshop: first-time access to Cloud Storage (service agent provisioning)
            retry = True
        case 429:
            # Workshop: temporary project with 1 QPM quota
            retry = True
    print(f"🔄 Retry: {retry}")
    return retry


print("✅ Helpers defined")
🔍 Detecting visual objects
To perform visual object detection, craft the prompt to indicate what you’d like to detect and how results should be returned. In the same request, it’s also possible to extract additional information about each detected object. This can be almost anything, from labels such as “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or to contextual data such as captions, colors, shapes, etc.
For the next tests, we’ll experiment with detecting illustrations within book pictures. Here’s a possible prompt:
OBJECT_DETECTION_PROMPT = """
Detect every illustration within the book photo and extract the following data for each:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""
Notes:
- Bounding boxes are very useful for locating or extracting the detected objects.
- Typically, for Gemini models, a `box_2d` bounding box represents coordinates normalized to a (0, 0, 1000, 1000) frame for a (0, 0, width, height) input image.
- We’re also requesting to extract captions (metadata typically present in reference books) and labels (dynamic metadata).
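As a minimal sketch of the normalization arithmetic described above (values are illustrative; Gemini returns `box_2d` as [y1, x1, y2, x2] normalized to 0-1000, while Pillow expects (x1, y1, x2, y2) in pixels):

```python
# Convert a normalized Gemini `box_2d` into pixel coordinates for Pillow.
def to_pixels(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    def scale(coord: int, dim: int) -> int:
        # Map a 0-1000 normalized coordinate onto the image dimension
        return round(coord * dim / 1000)

    y1, x1, y2, x2 = box_2d
    return scale(x1, width), scale(y1, height), scale(x2, width), scale(y2, height)


print(to_pixels([100, 250, 400, 750], width=2000, height=1600))  # (500, 160, 1500, 640)
```

The helpers defined below perform this exact conversion once per workflow, so later steps can work directly in source-image coordinates.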
To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, such as:
class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str

DetectedObjects: TypeAlias = list[DetectedObject]
Then, request a structured output with the config fields response_mime_type and response_schema:
config = GenerateContentConfig(
    # …,
    response_mime_type="application/json",
    response_schema=DetectedObjects,
    # …,
)
This generates a JSON response which the SDK can parse automatically, letting us directly use object instances:
detected_objects = cast(DetectedObjects, response.parsed)
Let’s add a few object-detection-specific classes and functions: 🔽
import io
import urllib.request
from collections.abc import Iterator
from dataclasses import field
from datetime import datetime

import PIL.Image
from google.genai.types import Part, PartMediaResolutionLevel
from PIL.PngImagePlugin import PngInfo

OBJECT_DETECTION_PROMPT = """
Detect every illustration within the book photo and extract the following data for each:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""

# Margin added to detected/cropped objects, giving more context for a better understanding of spatial distortions
CROP_MARGIN_PX = 10

# Set to True to save each generated image
SAVE_GENERATED_IMAGES = False
OUTPUT_IMAGES_PATH = Path("./object_detection_and_editing")


# Matching class for structured output generation
class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


# Misc data classes
InputImage = Path | Url
DetectedObjects = list[DetectedObject]
WorkflowStepImages = list[PIL_Image]


class WorkflowStep(StrEnum):
    SOURCE = auto()
    CROPPED = auto()
    RESTORED = auto()
    COLORIZED = auto()
    CINEMATIZED = auto()


@dataclass
class VisualObjectWorkflow:
    source_image: PIL_Image
    detected_objects: DetectedObjects
    images_by_step: dict[WorkflowStep, WorkflowStepImages] = field(default_factory=dict)

    def __post_init__(self) -> None:
        denormalize_bounding_boxes(self)


workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}


def denormalize_bounding_boxes(self: VisualObjectWorkflow) -> None:
    """Convert the box_2d coordinates.

    - Before: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini
    - After: [x1, y1, x2, y2] in source_image coordinates, as used in Pillow
    """

    def to_image_coord(coord: int, dim: int) -> int:
        return int(coord * dim / 1000 + 0.5)

    w, h = self.source_image.size
    for obj in self.detected_objects:
        y1, x1, y2, x2 = obj.box_2d
        x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)
        y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)
        obj.box_2d = [x1, y1, x2, y2]


def detect_objects(
    image: InputImage,
    prompt: str = OBJECT_DETECTION_PROMPT,
    model: MultimodalModel = MultimodalModel.DEFAULT,
    config: GenerateContentConfig | None = None,
    media_resolution: PartMediaResolutionLevel | None = None,
    display_results: bool = True,
) -> None:
    display_image_source_info(image)
    pil_image, content_part = get_pil_image_and_part(image, model, media_resolution)
    prompt = prompt.strip()
    contents = [content_part, prompt]
    config = config or get_object_detection_config(model)
    response = generate_content(contents, model, config)
    if response is not None and response.parsed is not None:
        detected_objects = cast(DetectedObjects, response.parsed)
    else:
        detected_objects = DetectedObjects()
    workflow = VisualObjectWorkflow(pil_image, detected_objects)
    workflow_by_image[image] = workflow
    add_cropped_objects(workflow, image, prompt)
    if display_results:
        display_detected_objects(workflow)


def get_pil_image_and_part(
    image: InputImage,
    model: MultimodalModel,
    media_resolution: PartMediaResolutionLevel | None,
) -> tuple[PIL_Image, Part]:
    if isinstance(image, Path):
        image_bytes = image.read_bytes()
    else:
        headers = {"User-Agent": "Mozilla/5.0"}
        req = urllib.request.Request(image, headers=headers)
        with urllib.request.urlopen(req, timeout=10) as response:
            image_bytes = response.read()
    pil_image = PIL.Image.open(io.BytesIO(image_bytes))
    content_part = Part.from_bytes(
        data=image_bytes,
        mime_type="image/*",
        media_resolution=media_resolution,
    )
    return pil_image, content_part


def get_object_detection_config(model: Model) -> GenerateContentConfig:
    # Low randomness for more determinism
    return GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        thinking_config=get_thinking_config(model),
    )


def add_cropped_objects(
    workflow: VisualObjectWorkflow,
    input: InputImage,
    prompt: str,
    crop_margin: int = CROP_MARGIN_PX,
) -> None:
    cropped_images: list[PIL_Image] = []
    obj_count = len(workflow.detected_objects)
    for obj_order, obj in enumerate(workflow.detected_objects, 1):
        cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)
        cropped_images.append(cropped_image)
        save_workflow_image(
            WorkflowStep.SOURCE,
            WorkflowStep.CROPPED,
            input,
            obj_order,
            obj_count,
            cropped_image,
            dict(prompt=prompt, crop_margin=str(crop_margin)),
        )
    workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images


def extract_object_image(
    image: PIL_Image,
    obj: DetectedObject,
    margin: int = 0,
) -> tuple[PIL_Image, tuple[int, int, int, int]]:
    def clamp(coord: int, dim: int) -> int:
        return min(max(coord, 0), dim)

    x1, y1, x2, y2 = obj.box_2d
    w, h = image.size
    if margin != 0:
        x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)
        y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)
    box = (x1, y1, x2, y2)
    object_image = image.crop(box)
    return object_image, box


def save_workflow_image(
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    input_image: InputImage,
    obj_order: int,
    obj_count: int,
    target_image: PIL_Image | None,
    image_info: dict[str, str] | None = None,
) -> None:
    if not SAVE_GENERATED_IMAGES or target_image is None:
        return
    if not OUTPUT_IMAGES_PATH.is_dir():
        OUTPUT_IMAGES_PATH.mkdir(parents=True)
    time_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    try:
        filename = f"{Source(input_image).name}_"
    except ValueError:
        filename = ""
    filename += f"{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png"
    image_path = OUTPUT_IMAGES_PATH.joinpath(filename)
    params = {}
    if image_info:
        png_info = PngInfo()
        for k, v in image_info.items():
            png_info.add_text(k, v)
        params.update(pnginfo=png_info)
    target_image.save(image_path, **params)


# Matplotlib
FIGURE_FG_COLOR = "#F1F3F4"
FIGURE_BG_COLOR = "#202124"
EDGE_COLOR = "#80868B"
rcParams = {
    "figure.dpi": 300,
    "text.color": FIGURE_FG_COLOR,
    "figure.edgecolor": FIGURE_FG_COLOR,
    "axes.titlecolor": FIGURE_FG_COLOR,
    "axes.edgecolor": FIGURE_FG_COLOR,
    "xtick.color": FIGURE_FG_COLOR,
    "ytick.color": FIGURE_FG_COLOR,
    "figure.facecolor": FIGURE_BG_COLOR,
    "axes.edgecolor": EDGE_COLOR,
    "xtick.bottom": False,
    "xtick.top": False,
    "ytick.left": False,
    "ytick.right": False,
    "xtick.labelbottom": False,
    "ytick.labelleft": False,
}
plt.rcParams.update(rcParams)


def display_image_source_info(image: InputImage) -> None:
    def get_image_info_md() -> str:
        if image not in Source:
            return f"[[Source Image]({image})]"
        source = Source(image)
        metadata = metadata_by_source.get(source)
        if not metadata:
            return f"[[Source Image]({source.value})]"
        parts = [
            f"[Source Image]({source.value})",
            f"[Source Page]({metadata.webpage_url})",
            metadata.title,
            metadata.credit_line,
        ]
        separator = "•"
        inner_info = f" {separator} ".join(parts)
        return f"{separator} {inner_info} {separator}"

    def yield_md_rows() -> Iterator[str]:
        horizontal_line = "---"
        image_info = get_image_info_md()
        yield horizontal_line
        yield f"_{image_info}_"
        yield horizontal_line

    display_markdown(f"{chr(10)}{chr(10)}".join(yield_md_rows()))


def display_detected_objects(workflow: VisualObjectWorkflow) -> None:
    source_image = workflow.source_image
    detected_objects = PIL.Image.new("RGB", source_image.size, "white")
    for obj in workflow.detected_objects:
        obj_image, box = extract_object_image(source_image, obj)
        detected_objects.paste(obj_image, (box[0], box[1]))
    _, (ax1, ax2) = plt.subplots(1, 2, layout="compressed")
    ax1.imshow(source_image)
    ax2.imshow(detected_objects)
    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object detection helpers defined")
🧪 Let’s start simple: can we detect the single illustration in this incunable from 1485?
detect_objects(Source.incunable)
• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

💡 This works well. The bounding box is very precise, enclosing the hand-colored woodcut illustration very tightly.
🧪 Now, let’s check the detection of the multiple visuals in this museum guidebook:
detect_objects(Source.museum_guidebook)
• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

💡 Remarks:
- The bounding boxes are again very precise.
- The results are perfect: there are no false positives and no false negatives.
- The captions below the visuals are not enclosed within the bounding boxes, which was specifically requested. The bounding box granularity can be controlled by changing the prompt.
🧪 What about slightly warped visuals?
detect_objects(Source.paintings)
• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

💡 This doesn’t make a difference. Notice how the bottom-right painting is partially covered by the orange bookmark. We’ll try to fix that in the restoration step.
🧪 What about the tilted visuals in this book about the architecture in Denver?
detect_objects(Source.denver_illustrated)
• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

💡 Each visual is perfectly detected: spatial understanding covers tilted objects.
🧪 Finally, let’s check the detection in this somewhat warped book page from Alice’s Adventures in Wonderland:
detect_objects(Source.alice_drawing)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In fact, spatial understanding works at the pixel level, which explains this precision for warped objects. If you’d like to work at a lower level, you can also ask for a “segmentation mask” in the prompt and you’ll get a base64-encoded PNG (each pixel giving the 0-255 probability that it belongs to the object within the bounding box). See the segmentation doc for more details.
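As a minimal sketch of handling such a mask (the exact response layout depends on your prompt, so we synthesize the base64-encoded PNG here to keep the snippet self-contained):

```python
import base64
import io

import PIL.Image

# Assumption: the model returned a base64-encoded PNG where each pixel holds
# the 0-255 probability that it belongs to the object. We synthesize one here
# instead of making an API call.
buffer = io.BytesIO()
PIL.Image.new("L", (4, 4), 200).save(buffer, format="PNG")
mask_b64 = base64.b64encode(buffer.getvalue()).decode()

# Decode the mask and binarize it at a 50% probability threshold
mask = PIL.Image.open(io.BytesIO(base64.b64decode(mask_b64))).convert("L")
binary_mask = mask.point(lambda p: 255 if p >= 128 else 0)
print(binary_mask.getpixel((0, 0)))  # 255 → the pixel belongs to the object
```

A binary mask like this can then be resized to the bounding box and used, for example, as the `mask` argument of Pillow’s `Image.paste` to composite only the object’s pixels.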
🏷️ Text extraction and dynamic labeling
On top of localizing each object with its bounding box, our prompt asked to extract a verbatim caption and to assign a single-word label, when possible.
Let’s add a simple function to display the detection data in a table: 🔽
from collections import defaultdict
from collections.abc import Iterator


def display_detection_data(source: Source, show_consolidated: bool = False) -> None:
    def string_with_visible_linebreaks(s: str) -> str:
        return f'''"{s.replace(chr(10), "↩️")}"'''

    def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| label | count | captions |"
        yield "| :--- | ---: | :--- |"
        stats = defaultdict(list)
        for obj in workflow.detected_objects:
            stats[obj.label].append(string_with_visible_linebreaks(obj.caption))
        for label, captions in stats.items():
            count = len(captions)
            label_captions = " • ".join(sorted(captions))
            yield f"| {label} | {count} | {label_captions} |"

    def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| box_2d | label | caption |"
        yield "| :--- | :--- | :--- |"
        for obj in workflow.detected_objects:
            yield f"| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |"

    workflow = workflow_by_image.get(source)
    if workflow is None:
        print(f'❌ No detection for source "{source.name}"')
        return
    md_rows = list(
        yield_md_rows_consolidated(workflow)
        if show_consolidated
        else yield_md_rows_with_bbox(workflow)
    )
    display_image_source_info(source)
    display_markdown(chr(10).join(md_rows))
In the museum guidebook, the dynamic labeling is precise according to the context, and the captions below each illustration are perfectly extracted:
display_detection_data(Source.museum_guidebook)
• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •
| box_2d | label | caption |
|---|---|---|
| [954, 629, 1338, 1166] | beetle | “The Horned Beetle.” |
| [265, 984, 464, 1504] | armor | “Armor of a Man.” |
| [737, 984, 915, 1328] | armor | “Horse Armor.” |
| [1225, 1244, 1589, 1685] | beetle | “The Goliath Beetle.” |
| [264, 1766, 431, 2006] | mask | “The Mask.” |
| [937, 1769, 1260, 2087] | butterfly | “Painted Lady Butterfly.” |
| [1325, 2170, 1581, 2468] | butterfly | “The Lady Butterfly.” |
In the book photo showing four paintings, this is perfect too:
display_detection_data(Source.paintings)
• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •
| box_2d | label | caption |
|---|---|---|
| [378, 203, 837, 575] | painting | “Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]” |
| [913, 207, 1380, 563] | painting | “Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. [73 x 92 cm]” |
| [387, 596, 845, 978] | painting | “Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]” |
| [921, 611, 1397, 982] | painting | “Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]” |
In the Denver architecture book, the four captions are assigned to the correct illustrations, which was not an obvious task:
display_detection_data(Source.denver_illustrated)
• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •
| box_2d | label | caption |
|---|---|---|
| [203, 224, 741, 839] | building | “ERNEST AND CRANMER BUILDING.” |
| [743, 73, 1192, 758] | building | “PEOPLE’S BANK BUILDING.” |
| [1185, 211, 1787, 865] | building | “BOSTON BUILDING.” |
| [699, 754, 1238, 1203] | building | “COOPER BUILDING.” |
💡 If you take a closer look at the input image, it’s hard to tell which caption belongs to which illustration at a glance. Most of us would need to think about it (and might be wrong). Asking Gemini shows that the results are intentional and not pure luck: deciphering vintage layouts can feel a bit like a puzzle, but there’s usually a “reading-order” logic at play. In this specific case, the captions are arranged to correspond with the images in a clockwise or Z-pattern starting from the top left.
In the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. As expected, the caption is empty (i.e., no false positive):
display_detection_data(Source.alice_drawing)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •
| box_2d | label | caption |
|---|---|---|
| [111, 146, 1008, 593] | illustration | “” |
🔭 Generalizing object detection
We can use the same principles for other object types. We’ll generally keep requesting bounding boxes to identify object positions within images. Without changing our current output structure (i.e., no code change), we can use captions and labels to extract different object metadata depending on the input type.
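As a reminder of how a detected bounding box drives the cropping used throughout this workflow, here is a minimal sketch (`crop_box_2d` is a hypothetical helper; it assumes `box_2d` values in `[y_min, x_min, y_max, x_max]` order and in source-image pixels, as in the tables above, while Gemini natively returns coordinates normalized to a 0-1000 range that can be rescaled beforehand):

```python
import PIL.Image


def crop_box_2d(image: PIL.Image.Image, box_2d: list[int]) -> PIL.Image.Image:
    # Assumption: box_2d = [y_min, x_min, y_max, x_max] in source-image pixels.
    y_min, x_min, y_max, x_max = box_2d
    return image.crop((x_min, y_min, x_max, y_max))


page = PIL.Image.new("RGB", (100, 100), "white")
crop = crop_box_2d(page, [10, 20, 60, 80])
assert crop.size == (60, 50)  # width = x_max - x_min, height = y_max - y_min
```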
🧪 See how we can detect electronic components by adapting the prompt while keeping the exact same code and output structure:
ELECTRONIC_COMPONENT_DETECTION_PROMPT = """
Exhaustively detect all the individual electronic components in the image and provide the following data for each:
- `box_2d`: bounding box coordinates.
- `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or "" if no text is present.
- `label`: Specific type of component.
"""
detect_objects(
    Source.electronics,
    ELECTRONIC_COMPONENT_DETECTION_PROMPT,
    media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,
)
• Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •

💡 Remarks:
- Large and tiny components are detected, thanks to the specific instruction “exhaustively detect…”.
- By using the ultra-high media resolution, we ensure more details are tokenized and the “P” component (a visual outlier) gets detected.
Here’s a consolidated view of the detected components:
display_detection_data(Source.electronics, show_consolidated=True)
• Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •
| label | count | captions |
|---|---|---|
| integrated circuit | 3 | “49240↩️020S6K” • “8105↩️0:35” • “P4010↩️9NA0” |
| resistor | 4 | “” • “” • “105” • “R020” |
| inductor | 1 | “n1W” |
| diode | 3 | “K” • “L” • “P” |
| capacitor | 6 | “” • “” • “” • “” • “” • “” |
| transistor | 1 | “41” |
| connector | 1 | “” |
💡 Remarks:
- Components are detected along with their text markings, despite the three different text orientations (upright, sideways, and upside down), the blur, and the photo noise.
- We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” in the prompt: responses now consistently include the line breaks for the three integrated circuits (displayed with the ↩️ emoji for better visibility).
- The last degree of freedom lies in the labeling. While most components have been properly labeled, it’s unclear whether the “P” component is a diode, a resistor, or a fuse. Making the instructions more specific (e.g., listing the possible labels, using an enum for the `label` field in the Pydantic class, or providing guidelines and more details about the expected circuit boards) will make the prompt more “closed” and the results more deterministic and accurate.
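As a sketch of that “closing” technique with an enum-constrained label (assuming a Pydantic detection class similar to the notebook’s; the names `ComponentLabel` and `DetectedComponent` are hypothetical):

```python
from enum import Enum

from pydantic import BaseModel


class ComponentLabel(str, Enum):
    # Restricting the vocabulary removes the labeling degree of freedom:
    # the JSON schema sent to the model now lists the allowed values.
    CAPACITOR = "capacitor"
    CONNECTOR = "connector"
    DIODE = "diode"
    FUSE = "fuse"
    INDUCTOR = "inductor"
    INTEGRATED_CIRCUIT = "integrated circuit"
    RESISTOR = "resistor"
    TRANSISTOR = "transistor"


class DetectedComponent(BaseModel):
    box_2d: list[int]
    caption: str
    label: ComponentLabel


component = DetectedComponent.model_validate(
    {"box_2d": [10, 20, 30, 40], "caption": "R020", "label": "resistor"}
)
assert component.label is ComponentLabel.RESISTOR
```

Any out-of-vocabulary label would now fail validation instead of silently slipping through.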
It’s also possible to enable or update the `thinking_config` configuration, which will trigger a chain of thought before generating the final answer. In all the detections performed, our code used `ThinkingLevel.MINIMAL`, which didn’t consume any thinking tokens (with Gemini 3 Flash). Updating the parameter to `ThinkingLevel.LOW`, `ThinkingLevel.MEDIUM`, or `ThinkingLevel.HIGH` will use thinking tokens and can lead to better outputs in complex cases.
This demonstrates the versatility of the approach. Without retraining a model, we switched from detecting 15th-century woodcuts and illustrations with vintage layouts to identifying modern electronics just by changing the prompt. Such detections, along with caption and label metadata, could be used to auto-crop components for a parts catalog, verify assembly lines, or create interactive schematics… all without a single labeled training image.
🪄 Editing visual objects
Now that we can detect visual objects, we can envision an automation workflow to extract and reuse them. For this, we’ll use Gemini 2.5 Flash Image (also known as Nano Banana 🍌) by default, a state-of-the-art image generation and editing model.
Our object editing capabilities will follow the same template, taking one step as input and producing an edited image for the output step. Let’s define core helpers for this: 🔽
from typing import Protocol


class ObjectEditingFunction(Protocol):
    def __call__(
        self,
        image: InputImage,
        prompt: str | None = None,
        model: ImageModel | None = None,
        config: GenerateContentConfig | None = None,
        display_results: bool = True,
    ) -> None: ...


SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]
registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}
DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=["IMAGE"])
EMPTY_IMAGE = PIL.Image.new("1", (1, 1), "white")


def object_editing_function(
    default_prompt: str,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    default_model: ImageModel = ImageModel.DEFAULT,
    default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,
) -> ObjectEditingFunction:
    def editing_function(
        image: InputImage,
        prompt: str | None = default_prompt,
        model: ImageModel | None = default_model,
        config: GenerateContentConfig | None = default_config,
        display_results: bool = True,
    ) -> None:
        workflow, source_images = get_workflow_and_step_images(image, source_step)
        if prompt is None:
            prompt = default_prompt
        prompt = prompt.strip()
        if model is None:
            model = default_model
        # Note: "config is None" is valid and will use the model endpoint default config
        target_images: list[PIL_Image] = []
        display_image_source_info(image)
        obj_count = len(source_images)
        for obj_order, source_image in enumerate(source_images, 1):
            target_image = generate_image([source_image], prompt, model, config)
            save_workflow_image(
                source_step,
                target_step,
                image,
                obj_order,
                obj_count,
                target_image,
                dict(prompt=prompt),
            )
            target_images.append(target_image if target_image else EMPTY_IMAGE)
        workflow.images_by_step[target_step] = target_images
        if display_results:
            display_sources_and_targets(workflow, source_step, target_step)

    registered_functions[(source_step, target_step)] = editing_function
    return editing_function


def get_workflow_and_step_images(
    image: InputImage,
    step: WorkflowStep,
) -> tuple[VisualObjectWorkflow, list[PIL_Image]]:
    # Objects detected?
    if image not in workflow_by_image:
        detect_objects(image, display_results=False)
    workflow = workflow_by_image.get(image)
    assert workflow is not None
    # Workflow step objects? (single level, could be extended to a dynamic graph)
    operation = (WorkflowStep.CROPPED, step)
    if step not in workflow.images_by_step and operation in registered_functions:
        source_function = registered_functions[operation]
        source_function(image, display_results=False)
    # Source images
    source_images = workflow.images_by_step.get(step)
    assert source_images is not None
    return workflow, source_images


def display_sources_and_targets(
    workflow: VisualObjectWorkflow,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
) -> None:
    source_images = workflow.images_by_step[source_step]
    target_images = workflow.images_by_step[target_step]
    if not source_images:
        print("❌ No images to display")
        return
    fig = plt.figure(layout="compressed")
    if horizontal := (len(source_images) >= 2):
        rows, cols = 2, len(source_images)
    else:
        rows, cols = len(source_images), 2
    gs = fig.add_gridspec(rows, cols)
    for i, (source_image, target_image) in enumerate(
        zip(source_images, target_images, strict=True)
    ):
        for dim, image in enumerate([source_image, target_image]):
            grid_spec = gs[dim, i] if horizontal else gs[i, dim]
            ax = fig.add_subplot(grid_spec)
            ax.set_axis_off()
            ax.imshow(image)
    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object editing helpers defined")
Now, let’s define a first editing step to restore the detected objects, which can contain many real-life artifacts…
✨ Restoring visual objects
For this restoration step, we need to craft a prompt that’s generic enough (to cover most use cases) but also specific enough (to take restoration needs into account).
An image editing prompt is based on natural language, typically using imperative or declarative instructions. With an imperative prompt, you describe the actions to perform on the input, while with a declarative prompt, you describe the expected output. Both are possible and can provide equivalent results. Your choice is really a matter of preference, as long as the prompt makes sense.
Our test suite is mostly composed of book photos, which can contain various photographic and paper artifacts. The Nano Banana models understand these subtleties and can edit images accordingly, which simplifies the prompt.
Here’s a possible restoration function using an imperative prompt:
RESTORATION_PROMPT = """
- Isolate and straighten the visual on a pure white background, excluding any surrounding text.
- Clean up all physical artifacts and noise while preserving every original detail.
- Center the result and scale it to fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.
"""

# Default config with low randomness for more deterministic restoration outputs
RESTORATION_CONFIG = GenerateContentConfig(
    temperature=0.0,
    top_p=0.0,
    seed=42,
    response_modalities=["IMAGE"],
)

restore_objects = object_editing_function(
    RESTORATION_PROMPT,
    WorkflowStep.CROPPED,
    WorkflowStep.RESTORED,
    default_config=RESTORATION_CONFIG,
)

print("✅ Restoration function defined")
🧪 Let’s try to restore the illustration from the 1485 incunable:
restore_objects(Source.incunable)
• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

💡 We now have a nice restoration of the hand-colored woodcut illustration. Note that our prompt is generic (“clean up all physical artifacts”) and could be made more specific to remove more or fewer artifacts. In this example, there are remaining artifacts, such as the paper discoloration in the sword or the bleeding ink in the armor. We’ll see if we can fix these in the colorization step.
🧪 What about the illustrations from the museum guidebook?
restore_objects(Source.museum_guidebook)
• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

💡 All good!
🧪 What about the slightly warped visuals?
restore_objects(Source.paintings)
• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

💡 Remarks:
- Notice how, on the last painting, the orange bookmark is properly removed and the hidden part inpainted to complete the painting.
- We asked to fill the canvas “with minimal, symmetrical margins, ensuring no distortion or cropping”. Depending on the aspect ratio and type of visual, this degree of freedom can result in different white margins.
- This example shows famous paintings by Vincent van Gogh. Nano Banana doesn’t fetch any reference images and only uses the provided input. If these were photos of private paintings, they would be restored in the same way.
In the Denver architecture book, the illustrations can be tilted, which our generic prompt doesn’t fully take into account. When multiple geometric transformations are involved, it can be challenging to craft an imperative prompt that details all the operations to perform. Instead, a descriptive prompt can be more straightforward by directly describing the expected output.
🧪 Here’s an example of a descriptive prompt focusing on the restoration of tilted visuals:
tilted_visual_prompt = """
An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and free of physical artifacts.
"""

restore_objects(Source.denver_illustrated, tilted_visual_prompt)
• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

💡 Remarks:
- To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves easier to write than trying to account for all possible geometric corrections.
- The native visual understanding automatically identifies the content type (photo, illustration, etc.) and the different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.
- Notice how the consistency is preserved: the last visual is restored as an illustration, while the first visuals keep their photographic style.
- The results, with this rather generic prompt, are impressive. It’s, of course, possible to be more specific and request particular lighting, styles, colors…
In this last test, the input visual has distortions not only from the page curvature but also from the photo perspective.
🧪 Here’s an example of a descriptive prompt focusing on restoring warped illustrations:
warped_visual_prompt = """
An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, resulting in an image framed in a perfect rectangle, on a pure white canvas with minimal margins.
"""

restore_objects(Source.alice_drawing, warped_visual_prompt)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

💡 It’s really impressive that such a restoration can be performed in a single step. Note that this prompt is not stable and can generate less optimal results (it would benefit from being more precise). If you have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and you might be pleasantly surprised. In the worst case, it’s also possible to process the transformations in successive, simpler steps.
Now, let’s add a colorization step…
🎨 Colorization
Our restoration step respected the original styles of the input images. Recent image editing models excel at transforming image styles, starting with colors. This can often be done directly with a simple, precise instruction.
Here’s a possible colorization function using an imperative prompt:
COLORIZATION_PROMPT = """
Colorize this image in a modern book illustration style, maintaining all original details without any additions.
"""

colorize = object_editing_function(
    COLORIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.COLORIZED,
)

print("✅ Colorization function defined")
🧪 Let’s modernize our 1485 illustration:
colorize(Source.incunable)
• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

💡 All details are preserved, as requested in the prompt. Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration in the sword or the bleeding ink in the armor).
🧪 Let’s colorize our museum guidebook illustrations:
colorize(Source.museum_guidebook)
• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

💡 Our prompt is very open, as it only specifies a “modern book illustration style”. This can generate very creative colorizations, but they all seem to make perfect sense.
🧪 What about our Denver buildings?
colorize(Source.denver_illustrated)
• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

💡 As requested, they all look like modern illustrations, including the first visuals (originating from noisy photos).
It’s possible to go further by not only “colorizing” but also “transforming” the image into a significantly different one.
🧪 Let’s turn our “Alice’s Adventures in Wonderland” drawing into a watercolor painting:
watercolor_prompt = """
Transform this visual into a warm, watercolor painting.
"""

colorize(Source.alice_drawing, watercolor_prompt)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

🧪 What about making it a traditional painting?
painting_prompt = """
Transform this visual into a traditional painting.
"""

colorize(Source.alice_drawing, painting_prompt)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

We can also change image compositions. Depending on the context, some compositions are more or less implied by default. For example, illustrations usually have margins, while photos often have edge-to-edge (full-bleed in the printing world) compositions. When possible, it’s interesting to refer to a type of visual (which intrinsically brings a lot of semantics to the context) and adjust the instructions accordingly.
🧪 Let’s see how we can detect engravings in this 1847 book, restore them, and transform them into modern digital graphics:
detect_objects(Source.engravings)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

restore_objects(Source.engravings)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

visual_to_digital_graphic_prompt = """
Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.
"""

colorize(Source.engravings, visual_to_digital_graphic_prompt)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

🧪 We can also transform the same engravings into photos with a very simple prompt:
visual_to_photo_prompt = """
Transform this visual into a high-end, modern camera photograph.
"""

colorize(Source.engravings, visual_to_photo_prompt)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

💡 As photos are generally full-bleed, the prompt doesn’t need to specify a composition.
It’s really up to our imagination, as Nano Banana seems to grasp every aspect of the visual semantics.
Let’s add a final step to see how far we can go, reimagining images as cinematic movie stills…
🎞️ Cinematization
We’ve used rather “closed” prompts so far, crafting specific instructions and constraints to control the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. Notably, it can be interesting to refer to photographic or cinematographic terminology, as it encompasses many visual techniques.
Here’s a possible generic cinematization function to reimagine images as movie stills:
CINEMATIZATION_PROMPT = """
Reimagine this image as a joyful, modern live-action cinematic movie still featuring professional lighting and composition.
"""

cinematize = object_editing_function(
    CINEMATIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.CINEMATIZED,
)
🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:
cinematize(Source.alice_drawing)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

💡 This looks like a high-budget movie still. There are many degrees of freedom in the prompt, but you’re likely to get foreground figures in sharp focus, a gradual background blur, “golden hour” lighting (a magical ingredient for many cinematographers), and detailed textures. Such compositions really evoke different atmospheres compared to the photos generated in the previous test.
🧪 Let’s test the workflow on a page from The Wonderful Wizard of Oz containing three drawings:
detect_objects(Source.wizard_of_oz_drawings)
• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

restore_objects(Source.wizard_of_oz_drawings)
• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

cinematize(Source.wizard_of_oz_drawings)
• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

💡 The cast for a new movie is ready 😉
Cinematic images have various use cases:
- These cinematized stills can be perfect “reference images” for video generation models like Veo. See Generate Veo videos from reference images.
- As they’re photorealistic representations, they can also be a source for generating 2D or 3D visuals, in any style, with realistic figures, perfect proportions, advanced lighting, enhanced compositions…
- You can use them in many professional contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…
🏁 Conclusion
- Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.
- We tested the detection of illustrations in book photos, which traditional machine learning (ML) models usually miss, as they’re typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.
- We tested the detection of straight, tilted, and even significantly warped illustrations, and they were always precisely identified.
- The core implementation was simple, requiring minimal code using the Python SDK and customized prompts. By comparison, fine-tuning a traditional object detection model is time-consuming: it involves assembling an image dataset, labeling objects, and managing training jobs.
- This solution is very flexible: we could switch from detecting illustrations to electronic components by adapting the prompt, while keeping the code unchanged.
- Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code both easy to implement and ready to deploy to production.
- Then, Nano Banana allows editing these visual objects in almost any way imaginable.
- We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.
- The possibilities seem really limitless, and the principles in this exploration can be reused in various contexts.
➕ More!
Thanks for reading. Let me know if you create something cool!
