
    Detecting and Editing Visual Objects with Gemini

    By ProfitlyAI · February 26, 2026



    Before we begin:

    • I’m a developer at Google Cloud. Ideas and opinions expressed here are entirely my own.
    • The full source code for this article, including future updates, is available in this notebook under the Apache 2.0 license.
    • All new images in this article were generated with Gemini Nano Banana using the explored proof of concept. All source images are either in the public domain or free to use (reference links are provided in the code output).
    • You can experiment with Gemini models for free in Google AI Studio. For programmatic API access, note that while a free tier is available for some models (i.e., you can perform object detection), image generation is a pay-as-you-go service.

    ✨ Overview

    Traditional computer vision models are typically trained to detect a fixed set of object classes, like “person”, “cat”, or “car”. If you want to detect something specific that wasn’t in the training set, such as an “illustration” in a book photograph, you usually need to gather a dataset, label it manually, and train a custom model, which can take hours or even days.

    In this exploration, we’ll take a different approach using Gemini. We’ll leverage its spatial understanding capabilities to perform open-vocabulary object detection. This allows us to find objects based solely on a natural language description, without any training.

    Once the visual objects are detected, we’ll extract them and then use Gemini’s image editing capabilities (specifically the Nano Banana models) to restore and creatively transform them.


    🔥 Challenge

    We’re dealing with unstructured data: photos of books, magazines, and objects in the wild. These images present several difficulties for traditional computer vision:

    • Variety: The objects we want to find (illustrations, engravings, and visuals in general) vary wildly in style and content.
    • Distortion: Pages are curved, photos are taken at angles, and lighting is uneven.
    • Noise: Old books have stains, paper grain, and text bleeding through from the other side.

    Our challenge is to build a robust pipeline that can detect these objects despite the distortions, extract them cleanly, and edit them to look like high-quality digital assets… all using simple text prompts.


    🏁 Setup

    🐍 Python packages

    We’ll use the following packages:

    • google-genai: the Google Gen AI Python SDK lets us call Gemini with a few lines of code
    • pillow for image management
    • matplotlib for result visualization

    We’ll also use these packages (dependencies of google-genai):

    • pydantic for data management
    • tenacity for request management
    pip install --quiet "google-genai>=1.63.0" "pillow>=11.3.0" "matplotlib>=3.10.0"

    🔗 Gemini API

    To use the Gemini API, we have two main options:

    1. Via Vertex AI with a Google Cloud project
    2. Via Google AI Studio with a Gemini API key
    The Google Gen AI SDK provides a unified interface to these APIs, and we can use environment variables for the configuration. 🔽

    🛠️ Option 1 – Gemini API via Vertex AI

    Requirements:

    Gen AI SDK environment variables:

    • GOOGLE_GENAI_USE_VERTEXAI="True"
    • GOOGLE_CLOUD_PROJECT="<PROJECT_ID>"
    • GOOGLE_CLOUD_LOCATION="<LOCATION>"

    💡 For preview models, the location must be set to global. For generally available models, we can choose the closest location among the Google model endpoint locations.

    ℹ️ Learn more about setting up a project and a development environment.

    🛠️ Option 2 – Gemini API via Google AI Studio

    Requirement:

    Gen AI SDK environment variables:

    • GOOGLE_GENAI_USE_VERTEXAI="False"
    • GOOGLE_API_KEY="<API_KEY>"

    ℹ️ Learn more about getting a Gemini API key from Google AI Studio.

    💡 You can store your environment configuration outside of the source code:

    Environment             Method
    IDE                     .env file (or equivalent)
    Colab                   Colab Secrets (🗝️ icon in left panel, see code below)
    Colab Enterprise        Google Cloud project and location are automatically defined
    Vertex AI Workbench     Google Cloud project and location are automatically defined
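    For the IDE case, a minimal .env file for Option 2 might look like the sketch below (the key value is a placeholder, not a real credential):

```shell
# .env — example configuration for Option 2 (Google AI Studio)
GOOGLE_GENAI_USE_VERTEXAI="False"
GOOGLE_API_KEY="<API_KEY>"
```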
    Define the following environment detection functions. You can also define your configuration manually if needed. 🔽
    import os
    import sys
    from collections.abc import Callable
    
    from google import genai
    
    # Manual setup (leave unchanged if setup is environment-defined)
    
    # @markdown **Which API: Vertex AI or Google AI Studio?**
    GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: "boolean"}
    
    # @markdown **Option A - Google Cloud project [+location]**
    GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}
    GOOGLE_CLOUD_LOCATION = "global"  # @param {type: "string"}
    
    # @markdown **Option B - Google AI Studio API key**
    GOOGLE_API_KEY = ""  # @param {type: "string"}
    
    
    def check_environment() -> bool:
        check_colab_user_authentication()
        return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()
    
    
    def check_manual_setup() -> bool:
        return check_define_env_vars(
            GOOGLE_GENAI_USE_VERTEXAI,
            GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with a line return
            GOOGLE_CLOUD_LOCATION,
            GOOGLE_API_KEY,
        )
    
    
    def check_vertex_ai() -> bool:
        # Workbench and Colab Enterprise
        match os.getenv("VERTEX_PRODUCT", ""):
            case "WORKBENCH_INSTANCE":
                pass
            case "COLAB_ENTERPRISE":
                if not running_in_colab_env():
                    return False
            case _:
                return False
    
        return check_define_env_vars(
            True,
            os.getenv("GOOGLE_CLOUD_PROJECT", ""),
            os.getenv("GOOGLE_CLOUD_REGION", ""),
            "",
        )
    
    
    def check_colab() -> bool:
        if not running_in_colab_env():
            return False
    
        # Colab Enterprise was checked before, so this is Colab only
        from google.colab import auth as colab_auth  # type: ignore
    
        colab_auth.authenticate_user()
    
        # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables
        # Secrets are private, visible only to you and the notebooks that you select
        # - Vertex AI: Store your settings as secrets
        # - Google AI: Directly import your Gemini API key from the UI
        vertexai, project, location, api_key = get_vars(get_colab_secret)
    
        return check_define_env_vars(vertexai, project, location, api_key)
    
    
    def check_local() -> bool:
        vertexai, project, location, api_key = get_vars(os.getenv)
    
        return check_define_env_vars(vertexai, project, location, api_key)
    
    
    def running_in_colab_env() -> bool:
        # Colab or Colab Enterprise
        return "google.colab" in sys.modules
    
    
    def check_colab_user_authentication() -> None:
        if running_in_colab_env():
            from google.colab import auth as colab_auth  # type: ignore
    
            colab_auth.authenticate_user()
    
    
    def get_colab_secret(secret_name: str, default: str) -> str:
        from google.colab import errors, userdata  # type: ignore
    
        try:
            return userdata.get(secret_name)
        except errors.SecretNotFoundError:
            return default
    
    
    def disable_colab_cell_scrollbar() -> None:
        if running_in_colab_env():
            from google.colab import output  # type: ignore
    
            output.no_vertical_scroll()
    
    
    def get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:
        # Limit getenv calls to the minimum (may trigger UI confirmation for secret access)
        vertexai_str = getenv("GOOGLE_GENAI_USE_VERTEXAI", "")
        if vertexai_str:
            vertexai = vertexai_str.lower() in ["true", "1"]
        else:
            vertexai = bool(getenv("GOOGLE_CLOUD_PROJECT", ""))
    
        project = getenv("GOOGLE_CLOUD_PROJECT", "") if vertexai else ""
        location = getenv("GOOGLE_CLOUD_LOCATION", "") if project else ""
        api_key = getenv("GOOGLE_API_KEY", "") if not project else ""
    
        return vertexai, project, location, api_key
    
    
    def check_define_env_vars(
        vertexai: bool,
        project: str,
        location: str,
        api_key: str,
    ) -> bool:
        match (vertexai, bool(project), bool(location), bool(api_key)):
            case (True, True, _, _):
                # Vertex AI - Google Cloud project [+location]
                location = location or "global"
                define_env_vars(vertexai, project, location, "")
            case (True, False, _, True):
                # Vertex AI - API key
                define_env_vars(vertexai, "", "", api_key)
            case (False, _, _, True):
                # Google AI Studio - API key
                define_env_vars(vertexai, "", "", api_key)
            case _:
                return False
    
        return True
    
    
    def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -> None:
        os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = str(vertexai)
        os.environ["GOOGLE_CLOUD_PROJECT"] = project
        os.environ["GOOGLE_CLOUD_LOCATION"] = location
        os.environ["GOOGLE_API_KEY"] = api_key
    
    
    def check_configuration(client: genai.Client) -> None:
        service = "Vertex AI" if client.vertexai else "Google AI Studio"
        print(f"✅ Using the {service} API", end="")
    
        if client._api_client.project:
            print(f' with project "{client._api_client.project[:7]}…"', end="")
            print(f' in location "{client._api_client.location}"')
        elif client._api_client.api_key:
            api_key = client._api_client.api_key
            print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
            print(f" (in case of error, make sure it was created for {service})")
    
    
    print("✅ Environment functions defined")

    🤖 Gen AI SDK

    To send Gemini requests, create a google.genai client:

    from google import genai
    
    check_environment()
    
    client = genai.Client()
    
    check_configuration(client)

    🖼️ Image test suite

    Let’s define a list of images for our tests: 🔽
    from dataclasses import dataclass
    from enum import StrEnum
    
    Url = str
    
    
    class Source(StrEnum):
        incunable = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"
        engravings = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"
        museum_guidebook = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"
        denver_illustrated = "https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"
        physics_textbook = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg"
        portrait_miniatures = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg"
        wizard_of_oz_drawings = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"
        paintings = "https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"
        alice_drawing = "https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"
        book = "https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800"
        manual = "https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800"
        electronics = "https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"
    
    
    @dataclass
    class SourceMetadata:
        title: str
        webpage_url: Url
        credit_line: str
    
    
    LOC = "Library of Congress"
    LOC_RARE_BOOKS = "Library of Congress, Rare Book and Special Collections Division"
    LOC_MEETING_FRONTIERS = "Library of Congress, Meeting of Frontiers"
    
    metadata_by_source: dict[Source, SourceMetadata] = {
        Source.incunable: SourceMetadata(
            "Vergaderinge der historien van Troy (1485)",
            "https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165",
            LOC_RARE_BOOKS,
        ),
        Source.engravings: SourceMetadata(
            "Harper's illustrated catalogue (1847)",
            "https://www.loc.gov/resource/gdcscd.00340766921/?sp=121",
            LOC,
        ),
        Source.museum_guidebook: SourceMetadata(
            "Barnum's American Museum illustrated (1850)",
            "https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33",
            LOC_RARE_BOOKS,
        ),
        Source.denver_illustrated: SourceMetadata(
            "Denver illustrated (1893)",
            "https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51",
            LOC_MEETING_FRONTIERS,
        ),
        Source.physics_textbook: SourceMetadata(
            "Lessons in physics (1916)",
            "https://www.loc.gov/resource/gdcscd.00036487318/?sp=103",
            LOC,
        ),
        Source.portrait_miniatures: SourceMetadata(
            "The history of portrait miniatures (1904)",
            "https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249",
            LOC_RARE_BOOKS,
        ),
        Source.wizard_of_oz_drawings: SourceMetadata(
            "The wonderful Wizard of Oz (1899)",
            "https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48",
            LOC_RARE_BOOKS,
        ),
        Source.paintings: SourceMetadata(
            "Open book showing paintings by Vincent van Gogh",
            "https://unsplash.com/photos/9hD7qrxICag",
            "Photo by Trung Manh cong on Unsplash",
        ),
        Source.alice_drawing: SourceMetadata(
            "Open book showing an illustration and text from Alice's Adventures in Wonderland",
            "https://unsplash.com/photos/bewzr_Q9u2o",
            "Photo by Brett Jordan on Unsplash",
        ),
        Source.book: SourceMetadata(
            "Open book showing two botanical illustrations",
            "https://unsplash.com/photos/4IDqcNj827I",
            "Photo by Ranurte on Unsplash",
        ),
        Source.manual: SourceMetadata(
            "Open user manual for vintage camera",
            "https://unsplash.com/photos/aaFU96eYASk",
            "Photo by Annie Spratt on Unsplash",
        ),
        Source.electronics: SourceMetadata(
            "Circuit board with electronic components",
            "https://unsplash.com/photos/Aqa1pHQ57pw",
            "Photo by Albert Stoynov on Unsplash",
        ),
    }
    
    print("✅ Test images defined")

    🧠 Gemini models

    Gemini comes in different versions. We can currently use the following models:

    • For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.
    • For object editing: Gemini 2.5 Flash Image or Gemini 3 Pro Image, also known as Nano Banana and Nano Banana Pro.

    🛠️ Helpers

    Now, let’s add core helper classes and functions: 🔽
    from enum import auto
    from pathlib import Path
    from typing import Any, cast
    
    import IPython.display
    import matplotlib.pyplot as plt
    import pydantic
    import tenacity
    from google.genai.errors import ClientError
    from google.genai.types import (
        FinishReason,
        GenerateContentConfig,
        GenerateContentResponse,
        PIL_Image,
        ThinkingConfig,
        ThinkingLevel,
    )
    
    
    # Multimodal models with spatial understanding and structured outputs
    class MultimodalModel(StrEnum):
        # Generally Available (GA)
        GEMINI_2_5_FLASH = "gemini-2.5-flash"
        GEMINI_2_5_PRO = "gemini-2.5-pro"
        # Preview
        GEMINI_3_FLASH_PREVIEW = "gemini-3-flash-preview"
        GEMINI_3_1_PRO_PREVIEW = "gemini-3.1-pro-preview"
        # Default model used for object detection
        DEFAULT = GEMINI_3_FLASH_PREVIEW
    
    
    # Image generation and editing models
    class ImageModel(StrEnum):
        # Generally Available (GA)
        GEMINI_2_5_FLASH_IMAGE = "gemini-2.5-flash-image"  # Nano Banana 🍌
        # Preview
        GEMINI_3_PRO_IMAGE_PREVIEW = "gemini-3-pro-image-preview"  # Nano Banana Pro 🍌
        # Default model used for image editing
        DEFAULT = GEMINI_2_5_FLASH_IMAGE
    
    
    Model = MultimodalModel | ImageModel
    
    
    def generate_content(
        contents: list[Any],
        model: Model,
        config: GenerateContentConfig | None,
        should_display_response_info: bool = False,
    ) -> GenerateContentResponse | None:
        response = None
        model_client = check_client_for_model(model)
    
        for attempt in get_retrier():
            with attempt:
                response = model_client.models.generate_content(
                    model=model.value,
                    contents=contents,
                    config=config,
                )
        if should_display_response_info:
            display_response_info(response, config)
    
        return response
    
    
    def check_client_for_model(model: Model) -> genai.Client:
        if (
            model.value.endswith("-preview")
            and client.vertexai
            and client._api_client.location != "global"
        ):
            # Preview models are only available in the "global" location
            return genai.Client(location="global")
    
        return client
    
    
    def display_response_info(
        response: GenerateContentResponse | None,
        config: GenerateContentConfig | None,
    ) -> None:
        if response is None:
            print("❌ No response")
            return
    
        if usage_metadata := response.usage_metadata:
            if usage_metadata.prompt_token_count:
                print(f"Input tokens   : {usage_metadata.prompt_token_count:9,d}")
            if usage_metadata.candidates_token_count:
                print(f"Output tokens  : {usage_metadata.candidates_token_count:9,d}")
            if usage_metadata.thoughts_token_count:
                print(f"Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}")
    
        if (
            config is not None
            and config.response_mime_type == "application/json"
            and response.parsed is None
        ):
            print("❌ Could not parse the JSON response")
            return
        if not response.candidates:
            print("❌ No `response.candidates`")
            return
        if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:
            print(f"❌ {finish_reason = }")
        if not response.text:
            print("❌ No `response.text`")
            return
    
    
    def generate_image(
        sources: list[PIL_Image],
        prompt: str,
        model: ImageModel,
        config: GenerateContentConfig | None = None,
    ) -> PIL_Image | None:
        contents = [*sources, prompt.strip()]
    
        response = generate_content(contents, model, config)
    
        return check_get_output_image_from_response(response)
    
    
    def check_get_output_image_from_response(
        response: GenerateContentResponse | None,
    ) -> PIL_Image | None:
        if response is None:
            print("❌ No `response`")
            return None
        if not response.candidates:
            print("❌ No `response.candidates`")
            if response.prompt_feedback:
                if block_reason := response.prompt_feedback.block_reason:
                    print(f"{block_reason = :s}")
                if block_reason_message := response.prompt_feedback.block_reason_message:
                    print(f"{block_reason_message = }")
            return None
        if not (content := response.candidates[0].content):
            print("❌ No `response.candidates[0].content`")
            return None
        if not (parts := content.parts):
            print("❌ No `response.candidates[0].content.parts`")
            return None
    
        output_image: PIL_Image | None = None
        for part in parts:
            if part.text:
                display_markdown(part.text)
                continue
            sdk_image = part.as_image()
            assert sdk_image is not None
            output_image = sdk_image._pil_image
            assert output_image is not None
            break  # There should be a single image
    
        return output_image
    
    
    def get_thinking_config(model: Model) -> ThinkingConfig | None:
        match model:
            case MultimodalModel.GEMINI_2_5_FLASH:
                return ThinkingConfig(thinking_budget=0)
            case MultimodalModel.GEMINI_2_5_PRO:
                return ThinkingConfig(thinking_budget=128, include_thoughts=False)
            case MultimodalModel.GEMINI_3_FLASH_PREVIEW:
                return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)
            case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:
                return ThinkingConfig(thinking_level=ThinkingLevel.LOW)
            case _:
                return None  # Default
    
    
    def display_markdown(markdown: str) -> None:
        IPython.display.display(IPython.display.Markdown(markdown))
    
    
    def display_image(image: PIL_Image) -> None:
        IPython.display.display(image)
    
    
    def get_retrier() -> tenacity.Retrying:
        return tenacity.Retrying(
            stop=tenacity.stop_after_attempt(7),
            wait=tenacity.wait_incrementing(start=10, increment=1),
            retry=should_retry_request,
            reraise=True,
        )
    
    
    def should_retry_request(retry_state: tenacity.RetryCallState) -> bool:
        if not retry_state.outcome:
            return False
        err = retry_state.outcome.exception()
        if not isinstance(err, ClientError):
            return False
        print(f"❌ ClientError {err.code}: {err.message}")
    
        retry = False
        match err.code:
            case 400 if err.message is not None and " try again " in err.message:
                # Workshop: first-time access to Cloud Storage (service agent provisioning)
                retry = True
            case 429:
                # Workshop: temporary project with 1 QPM quota
                retry = True
        print(f"🔄 Retry: {retry}")
    
        return retry
    
    
    print("✅ Helpers defined")

    🔍 Detecting visual objects

    To perform visual object detection, craft the prompt to indicate what you’d like to detect and how results should be returned. In the same request, it’s also possible to extract additional information about each detected object. This can be almost anything, from labels such as “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or to contextual data such as captions, colors, shapes, etc.

    For the next tests, we’ll experiment with detecting illustrations within book photos. Here’s a possible prompt:

    OBJECT_DETECTION_PROMPT = """
    Detect every illustration within the book photo and extract the following data for each:
    - `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
    - `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
    - `label`: Single-word label describing the illustration. Use "" if not found.
    """

    Notes:

    • Bounding boxes are very useful for locating or extracting the detected objects.
    • Typically, for Gemini models, a box_2d bounding box represents coordinates normalized to a (0, 0, 1000, 1000) space for a (0, 0, width, height) input image.
    • We’re also requesting the extraction of captions (metadata typically present in reference books) and labels (dynamic metadata).
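    To make that coordinate convention concrete, here is a minimal, self-contained sketch of the conversion from a normalized box_2d to Pillow-style pixel coordinates (the sample box values are made up):

```python
# Convert a Gemini `box_2d` ([y1, x1, y2, x2], normalized to 0-1000)
# into a Pillow-style (x1, y1, x2, y2) box in pixel coordinates.
def to_pixel_box(box_2d: list[int], width: int, height: int) -> tuple[int, int, int, int]:
    y1, x1, y2, x2 = box_2d
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )


# A detection covering the central area of a 2000x1000 image
print(to_pixel_box([100, 250, 900, 750], 2000, 1000))  # → (500, 100, 1500, 900)
```

The axis order swap matters: Gemini returns [y_min, x_min, y_max, x_max], while Pillow’s crop() expects (left, upper, right, lower).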

    To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, such as:

    class DetectedObject(pydantic.BaseModel):
        box_2d: list[int]
        caption: str
        label: str
    
    DetectedObjects: TypeAlias = list[DetectedObject]

    Then, request a structured output with config fields response_mime_type and response_schema:

    config = GenerateContentConfig(
        # …,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        # …,
    )

    This will generate a JSON response, which the SDK can parse automatically, letting us directly use object instances:

    detected_objects = cast(DetectedObjects, response.parsed)
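    Under the hood, this parsing amounts to validating the JSON against the schema. A minimal sketch using Pydantic directly, on a made-up JSON payload shaped like a structured-output response:

```python
import pydantic


class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


# Hypothetical JSON payload, shaped like a Gemini structured-output response
raw = '[{"box_2d": [120, 80, 640, 540], "caption": "Figure 1", "label": "engraving"}]'
objects = pydantic.TypeAdapter(list[DetectedObject]).validate_json(raw)
print(objects[0].label)  # → engraving
```

Invalid or incomplete JSON raises a pydantic.ValidationError, which is why the helper code treats response.parsed as optional.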
    Let’s add a few object-detection-specific classes and functions: 🔽
    import io
    import urllib.request
    from collections.abc import Iterator
    from dataclasses import subject
    from datetime import datetime
    
    import PIL.Picture
    from google.genai.varieties import Half, PartMediaResolutionLevel
    from PIL.PngImagePlugin import PngInfo
    
    OBJECT_DETECTION_PROMPT = """
    Detect each illustration inside the ebook photograph and extract the next information for every:
    - `box_2d`: Bounding field coordinates of the illustration solely (ignoring any caption).
    - `caption`: Verbatim caption or legend reminiscent of "Determine 1". Use "" if not discovered.
    - `label`: Single-word label describing the illustration. Use "" if not discovered.
    """
    
    # Margin added to detected/cropped objects, giving extra context for a greater understanding of spatial distortions
    CROP_MARGIN_PX = 10
    
    # Set to True to avoid wasting every generated picture
    SAVE_GENERATED_IMAGES = False
    OUTPUT_IMAGES_PATH = Path("./object_detection_and_editing")
    
    
    # Matching class for structured output technology
    class DetectedObject(pydantic.BaseModel):
        box_2d: record[int]
        caption: str
        label: str
    
    
    # Misc information courses
    InputImage = Path | Url
    DetectedObjects = record[DetectedObject]
    WorkflowStepImages = record[PIL_Image]
    
    
    class WorkflowStep(StrEnum):
        SOURCE = auto()
        CROPPED = auto()
        RESTORED = auto()
        COLORIZED = auto()
        CINEMATIZED = auto()
    
    
    @dataclass
    class VisualObjectWorkflow:
        source_image: PIL_Image
        detected_objects: DetectedObjects
        images_by_step: dict[WorkflowStep, WorkflowStepImages] = subject(default_factory=dict)
    
        def __post_init__(self) -> None:
            denormalize_bounding_boxes(self)
    
    
    workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}
    
    
    def denormalize_bounding_boxes(self: VisualObjectWorkflow) -> None:
        """Convert the box_2d coordinates.
        - Earlier than: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini
        - After:  [x1, y1, x2, y2] in source_image coordinates, as utilized in Pillow
        """
    
        def to_image_coord(coord: int, dim: int) -> int:
            return int(coord * dim / 1000 + 0.5)
    
        w, h = self.source_image.dimension
        for obj in self.detected_objects:
            y1, x1, y2, x2 = obj.box_2d
            x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)
            y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)
            obj.box_2d = [x1, y1, x2, y2]
    
    
    def detect_objects(
        picture: InputImage,
        immediate: str = OBJECT_DETECTION_PROMPT,
        mannequin: MultimodalModel = MultimodalModel.DEFAULT,
        config: GenerateContentConfig | None = None,
        media_resolution: PartMediaResolutionLevel | None = None,
        display_results: bool = True,
    ) -> None:
        display_image_source_info(picture)
        pil_image, content_part = get_pil_image_and_part(picture, mannequin, media_resolution)
        immediate = immediate.strip()
        contents = [content_part, prompt]
        config = config or get_object_detection_config(mannequin)
    
        response = generate_content(contents, mannequin, config)
    
        if response shouldn't be None and response.parsed shouldn't be None:
            detected_objects = forged(DetectedObjects, response.parsed)
        else:
            detected_objects = DetectedObjects()
    
        workflow = VisualObjectWorkflow(pil_image, detected_objects)
        workflow_by_image[image] = workflow
        add_cropped_objects(workflow, picture, immediate)
    
        if display_results:
            display_detected_objects(workflow)
    
    
    def get_pil_image_and_part(
        picture: InputImage,
        mannequin: MultimodalModel,
        media_resolution: PartMediaResolutionLevel | None,
    ) -> tuple[PIL_Image, Part]:
        if isinstance(picture, Path):
            image_bytes = picture.read_bytes()
        else:
            headers = {"Person-Agent": "Mozilla/5.0"}
            req = urllib.request.Request(picture, headers=headers)
            with urllib.request.urlopen(req, timeout=10) as response:
                image_bytes = response.learn()
    
        pil_image = PIL.Picture.open(io.BytesIO(image_bytes))
        content_part = Half.from_bytes(
            information=image_bytes,
            mime_type="picture/*",
            media_resolution=media_resolution,
        )
    
        return pil_image, content_part
    
    
    def get_object_detection_config(model: Model) -> GenerateContentConfig:
        # Low randomness for more determinism
        return GenerateContentConfig(
            temperature=0.0,
            top_p=0.0,
            seed=42,
            response_mime_type="application/json",
            response_schema=DetectedObjects,
            thinking_config=get_thinking_config(model),
        )
    
    
    def add_cropped_objects(
        workflow: VisualObjectWorkflow,
        input_image: InputImage,
        prompt: str,
        crop_margin: int = CROP_MARGIN_PX,
    ) -> None:
        cropped_images: list[PIL_Image] = []
        obj_count = len(workflow.detected_objects)
        for obj_order, obj in enumerate(workflow.detected_objects, 1):
            cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)
            cropped_images.append(cropped_image)
            save_workflow_image(
                WorkflowStep.SOURCE,
                WorkflowStep.CROPPED,
                input_image,
                obj_order,
                obj_count,
                cropped_image,
                dict(prompt=prompt, crop_margin=str(crop_margin)),
            )
        workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images
    
    
    def extract_object_image(
        image: PIL_Image,
        obj: DetectedObject,
        margin: int = 0,
    ) -> tuple[PIL_Image, tuple[int, int, int, int]]:
        def clamp(coord: int, dim: int) -> int:
            return min(max(coord, 0), dim)

        x1, y1, x2, y2 = obj.box_2d
        w, h = image.size
        if margin != 0:
            x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)
            y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)

        box = (x1, y1, x2, y2)
        object_image = image.crop(box)

        return object_image, box
    
    
    def save_workflow_image(
        source_step: WorkflowStep,
        target_step: WorkflowStep,
        input_image: InputImage,
        obj_order: int,
        obj_count: int,
        target_image: PIL_Image | None,
        image_info: dict[str, str] | None = None,
    ) -> None:
        if not SAVE_GENERATED_IMAGES or target_image is None:
            return
        if not OUTPUT_IMAGES_PATH.is_dir():
            OUTPUT_IMAGES_PATH.mkdir(parents=True)
        time_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        try:
            filename = f"{Source(input_image).name}_"
        except ValueError:
            filename = ""
        filename += f"{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png"
        image_path = OUTPUT_IMAGES_PATH.joinpath(filename)
        params = {}
        if image_info:
            png_info = PngInfo()
            for k, v in image_info.items():
                png_info.add_text(k, v)
            params.update(pnginfo=png_info)
        target_image.save(image_path, **params)
    
    
    # Matplotlib
    FIGURE_FG_COLOR = "#F1F3F4"
    FIGURE_BG_COLOR = "#202124"
    EDGE_COLOR = "#80868B"
    rcParams = {
        "figure.dpi": 300,
        "text.color": FIGURE_FG_COLOR,
        "figure.edgecolor": FIGURE_FG_COLOR,
        "axes.titlecolor": FIGURE_FG_COLOR,
        "xtick.color": FIGURE_FG_COLOR,
        "ytick.color": FIGURE_FG_COLOR,
        "figure.facecolor": FIGURE_BG_COLOR,
        "axes.edgecolor": EDGE_COLOR,
        "xtick.bottom": False,
        "xtick.top": False,
        "ytick.left": False,
        "ytick.right": False,
        "xtick.labelbottom": False,
        "ytick.labelleft": False,
    }
    plt.rcParams.update(rcParams)
    
    
    def display_image_source_info(image: InputImage) -> None:
        def get_image_info_md() -> str:
            if image not in Source:
                return f"[[Source Image]({image})]"
            source = Source(image)
            metadata = metadata_by_source.get(source)
            if not metadata:
                return f"[[Source Image]({source.value})]"
            parts = [
                f"[Source Image]({source.value})",
                f"[Source Page]({metadata.webpage_url})",
                metadata.title,
                metadata.credit_line,
            ]
            separator = "•"
            inner_info = f" {separator} ".join(parts)
            return f"{separator} {inner_info} {separator}"

        def yield_md_rows() -> Iterator[str]:
            horizontal_line = "---"
            image_info = get_image_info_md()
            yield horizontal_line
            yield f"_{image_info}_"
            yield horizontal_line

        display_markdown(f"{chr(10)}{chr(10)}".join(yield_md_rows()))
    
    
    def display_detected_objects(workflow: VisualObjectWorkflow) -> None:
        source_image = workflow.source_image
        detected_objects = PIL.Image.new("RGB", source_image.size, "white")
        for obj in workflow.detected_objects:
            obj_image, box = extract_object_image(source_image, obj)
            detected_objects.paste(obj_image, (box[0], box[1]))

        _, (ax1, ax2) = plt.subplots(1, 2, layout="compressed")
        ax1.imshow(source_image)
        ax2.imshow(detected_objects)

        disable_colab_cell_scrollbar()
        plt.show()


    print("✅ Object detection helpers defined")

    🧪 Let’s start simple: can we detect the single illustration in this incunable from 1485?

    detect_objects(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects detected by Gemini

    💡 This works well. The bounding box is very precise, enclosing the hand-colored woodcut illustration very tightly.


    🧪 Now, let’s check the detection of the multiple visuals in this museum guidebook:

    detect_objects(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects detected by Gemini

    💡 Remarks:

    • The bounding boxes are again very precise.
    • The results are excellent: there are no false positives and no false negatives.
    • The captions below the visuals are not enclosed within the bounding boxes, which was specifically requested. The bounding box granularity can be controlled by changing the prompt.

    🧪 What about slightly warped visuals?

    detect_objects(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    Visual objects detected by Gemini

    💡 This doesn’t make a difference. Notice how the bottom-right painting is partially covered by the orange bookmark. We’ll try to fix that in the restoration step.


    🧪 What about the tilted visuals in this book about the architecture in Denver?

    detect_objects(Source.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    Visual objects detected by Gemini

    💡 Each visual is perfectly detected: spatial understanding covers tilted objects.


    🧪 Finally, let’s check the detection on this somewhat warped book page from Alice’s Adventures in Wonderland:

    detect_objects(Source.alice_drawing)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects detected by Gemini

    💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In fact, spatial understanding works at the pixel level, which explains this precision for warped objects. If you’d like to work at a lower level, you can also ask for a “segmentation mask” in the prompt and you’ll get a base64-encoded PNG (each pixel giving the 0-255 probability that it belongs to the object within the bounding box). See the segmentation doc for more details.
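    If you go down that route, the returned mask is an ordinary base64-encoded PNG. Here is a minimal, hypothetical sketch (the `png_size_from_base64` helper and the tiny sample payload are illustrative and not part of this notebook) that sanity-checks such a payload and reads its dimensions from the PNG header, using only the standard library:

```python
import base64
import struct


def png_size_from_base64(mask_b64: str) -> tuple[int, int]:
    """Return (width, height) of a base64-encoded PNG by parsing its IHDR chunk."""
    data = base64.b64decode(mask_b64)
    # PNG layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then width and height as big-endian uint32 values
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG payload"
    return struct.unpack(">II", data[16:24])


# Sample payload: a minimal 1x1 PNG (a real response would contain one mask per object)
tiny_png_b64 = (
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ"
    "AAAADUlEQVR42mNkYPhfDwAChwGA60e6kgAAAABJRU5ErkJggg=="
)
print(png_size_from_base64(tiny_png_b64))  # → (1, 1)
```

    From there, something like `PIL.Image.open(io.BytesIO(base64.b64decode(mask_b64)))` would load the mask as an image whose 0-255 pixel values you can threshold.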


    🏷️ Text extraction and dynamic labeling

    On top of localizing each object with its bounding box, our prompt asked to extract a verbatim caption and to assign a single-word label, when possible.

    Let’s add a simple function to display the detection data in a table: 🔽
    from collections import defaultdict
    
    
    def display_detection_data(source: Source, show_consolidated: bool = False) -> None:
        def string_with_visible_linebreaks(s: str) -> str:
            return f'''"{s.replace(chr(10), "↩️")}"'''

        def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -> Iterator[str]:
            yield "| label | count | captions |"
            yield "| :--- | ---: | :--- |"
            stats = defaultdict(list)
            for obj in workflow.detected_objects:
                stats[obj.label].append(string_with_visible_linebreaks(obj.caption))
            for label, captions in stats.items():
                count = len(captions)
                label_captions = " • ".join(sorted(captions))
                yield f"| {label} | {count} | {label_captions} |"

        def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -> Iterator[str]:
            yield "| box_2d | label | caption |"
            yield "| :--- | :--- | :--- |"
            for obj in workflow.detected_objects:
                yield f"| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |"

        workflow = workflow_by_image.get(source)
        if workflow is None:
            print(f'❌ No detection for source "{source.name}"')
            return
        md_rows = list(
            yield_md_rows_consolidated(workflow)
            if show_consolidated
            else yield_md_rows_with_bbox(workflow)
        )
        display_image_source_info(source)
        display_markdown(chr(10).join(md_rows))

    In the museum guidebook, the dynamic labeling is precise according to the context, and the captions below each illustration are perfectly extracted:

    display_detection_data(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    box_2d label caption
    [954, 629, 1338, 1166] beetle “The Horned Beetle.”
    [265, 984, 464, 1504] armor “Armor of a Man.”
    [737, 984, 915, 1328] armor “Horse Armor.”
    [1225, 1244, 1589, 1685] beetle “The Goliath Beetle.”
    [264, 1766, 431, 2006] mask “The Mask.”
    [937, 1769, 1260, 2087] butterfly “Painted Lady Butterfly.”
    [1325, 2170, 1581, 2468] butterfly “The Lady Butterfly.”

    In the book photo showing four paintings, this is perfect too:

    display_detection_data(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    box_2d label caption
    [378, 203, 837, 575] painting “Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]”
    [913, 207, 1380, 563] painting “Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. [73 x 92 cm]”
    [387, 596, 845, 978] painting “Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]”
    [921, 611, 1397, 982] painting “Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]”

    In the Denver architecture book, the four captions are assigned to the correct illustrations, which was not an obvious task:

    display_detection_data(Source.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    box_2d label caption
    [203, 224, 741, 839] building “ERNEST AND CRANMER BUILDING.”
    [743, 73, 1192, 758] building “PEOPLE’S BANK BUILDING.”
    [1185, 211, 1787, 865] building “BOSTON BUILDING.”
    [699, 754, 1238, 1203] building “COOPER BUILDING.”

    💡 If you have a closer look at the input image, it’s hard to tell which caption belongs to which illustration at a glance. Most of us would need to think about it (and might be wrong). Asking Gemini reveals that the results are intentional and not pure luck: Interpreting vintage layouts can feel a bit like a puzzle, but there’s usually a “reading-order” logic at play. In this specific case, the captions are arranged to correspond with the images in a clockwise or Z-pattern starting from the top left.


    In the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. As expected, the caption is empty (i.e., no false positive):

    display_detection_data(Source.alice_drawing)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    box_2d label caption
    [111, 146, 1008, 593] illustration “”

    🔭 Generalizing object detection

    We can use the same principles for other object types. We’ll generally keep requesting bounding boxes to identify object positions within images. Without changing our current output structure (i.e., no code change), we can use captions and labels to extract different object metadata depending on the input type.


    🧪 See how we can detect electronic components by adapting the prompt while keeping the exact same code and output structure:

    ELECTRONIC_COMPONENT_DETECTION_PROMPT = """
    Exhaustively detect all the individual electronic components in the image and provide the following data for each:
    - `box_2d`: bounding box coordinates.
    - `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or "" if no text is present.
    - `label`: Specific type of component.
    """
    
    detect_objects(
        Source.electronics,
        ELECTRONIC_COMPONENT_DETECTION_PROMPT,
        media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,
    )

    • Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •

    Visual objects detected by Gemini

    💡 Remarks:

    • Large and tiny components are detected, thanks to the specific instruction “exhaustively detect…”.
    • By using the ultra-high media resolution, we ensure more details are tokenized and the “P” component (a visual outlier) gets detected.

    Here’s a consolidated view of the detected components:

    display_detection_data(Source.electronics, show_consolidated=True)

    • Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •

    label count captions
    integrated circuit 3 “49240↩️020S6K” • “8105↩️0:35” • “P4010↩️9NA0”
    resistor 4 “” • “” • “105” • “R020”
    inductor 1 “n1W”
    diode 3 “K” • “L” • “P”
    capacitor 6 “” • “” • “” • “” • “” • “”
    transistor 1 “41”
    connector 1 “”

    💡 Remarks:

    • Components are detected along with their text markings, despite the three different text orientations (upright, sideways, and upside down), the blur, and the photo noise.
    • We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” in the prompt: responses now consistently include the line breaks for the three integrated circuits (displayed with the ↩️ emoji for better visibility).
    • The last degree of freedom lies in the labeling. While most components have been properly labeled, it’s unclear whether the “P” component is a diode, a resistor, or a fuse. Making the instructions more specific (e.g., listing the possible labels, using an enum for the label field in the Pydantic class, or providing guidelines and more details about the expected circuit boards) will make the prompt more “closed” and the results more deterministic and accurate.
      It’s also possible to enable/update the thinking_config configuration, which will trigger a chain of thought before generating the final answer. In all the detections performed, our code used ThinkingLevel.MINIMAL, which didn’t consume any thought tokens (with Gemini 3 Flash). Updating the parameter to ThinkingLevel.LOW, ThinkingLevel.MEDIUM, or ThinkingLevel.HIGH will use thought tokens and can lead to better outputs in complex cases.
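    For example, a hypothetical variant of the detection call with a higher thinking level might look like this (a sketch assuming the `ThinkingConfig` and `ThinkingLevel` types from the google-genai SDK used throughout this notebook):

```python
# Sketch only: trade thought tokens and latency for potentially better labels
config = GenerateContentConfig(
    temperature=0.0,
    top_p=0.0,
    seed=42,
    response_mime_type="application/json",
    response_schema=DetectedObjects,
    thinking_config=ThinkingConfig(thinking_level=ThinkingLevel.MEDIUM),
)

detect_objects(
    Source.electronics,
    ELECTRONIC_COMPONENT_DETECTION_PROMPT,
    config=config,
)
```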

    This demonstrates the versatility of the approach. Without retraining a model, we switched from detecting 15th-century woodcuts and illustrations with vintage layouts to identifying modern electronics just by changing the prompt. Such detections, along with caption and label metadata, could be used to auto-crop components for a parts catalog, verify assembly lines, or create interactive schematics… all without a single labeled training image.
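    As a toy illustration of the parts-catalog idea, here is a self-contained sketch with mocked `(label, caption)` pairs standing in for the notebook’s `workflow.detected_objects`:

```python
from collections import defaultdict

# Mocked detections: in the notebook, these would come from workflow.detected_objects
detections = [
    ("integrated circuit", "49240\n020S6K"),
    ("resistor", "105"),
    ("resistor", "R020"),
    ("capacitor", ""),
]

catalog: dict[str, list[str]] = defaultdict(list)
for label, caption in detections:
    if caption:  # keep only components with readable markings as part references
        catalog[label].append(caption.replace("\n", " "))

print(dict(catalog))
# → {'integrated circuit': ['49240 020S6K'], 'resistor': ['105', 'R020']}
```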


    🪄 Editing visual objects

    Now that we can detect visual objects, we can envision an automation workflow to extract and reuse them. For this, we’ll use Gemini 2.5 Flash Image (also known as Nano Banana 🍌) by default, a state-of-the-art image generation and editing model.

    Our object editing capabilities will follow the same template, taking one step as input and generating an edited image for the output step. Let’s define core helpers for this: 🔽
    from typing import Protocol
    
    
    class ObjectEditingFunction(Protocol):
        def __call__(
            self,
            image: InputImage,
            prompt: str | None = None,
            model: ImageModel | None = None,
            config: GenerateContentConfig | None = None,
            display_results: bool = True,
        ) -> None: ...
    
    
    SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]
    registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}
    
    DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=["IMAGE"])
    EMPTY_IMAGE = PIL.Image.new("1", (1, 1), "white")
    
    
    def object_editing_function(
        default_prompt: str,
        source_step: WorkflowStep,
        target_step: WorkflowStep,
        default_model: ImageModel = ImageModel.DEFAULT,
        default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,
    ) -> ObjectEditingFunction:
        def editing_function(
            image: InputImage,
            prompt: str | None = default_prompt,
            model: ImageModel | None = default_model,
            config: GenerateContentConfig | None = default_config,
            display_results: bool = True,
        ) -> None:
            workflow, source_images = get_workflow_and_step_images(image, source_step)
            if prompt is None:
                prompt = default_prompt
            prompt = prompt.strip()
            if model is None:
                model = default_model
            # Note: "config is None" is valid and will use the model endpoint default config

            target_images: list[PIL_Image] = []
            display_image_source_info(image)
            obj_count = len(source_images)
            for obj_order, source_image in enumerate(source_images, 1):
                target_image = generate_image([source_image], prompt, model, config)
                save_workflow_image(
                    source_step,
                    target_step,
                    image,
                    obj_order,
                    obj_count,
                    target_image,
                    dict(prompt=prompt),
                )
                target_images.append(target_image if target_image else EMPTY_IMAGE)

            workflow.images_by_step[target_step] = target_images
            if display_results:
                display_sources_and_targets(workflow, source_step, target_step)

        registered_functions[(source_step, target_step)] = editing_function

        return editing_function
    
    
    def get_workflow_and_step_images(
        image: InputImage,
        step: WorkflowStep,
    ) -> tuple[VisualObjectWorkflow, list[PIL_Image]]:
        # Objects detected?
        if image not in workflow_by_image:
            detect_objects(image, display_results=False)
        workflow = workflow_by_image.get(image)
        assert workflow is not None

        # Workflow step images? (single level, could be extended to a dynamic graph)
        operation = (WorkflowStep.CROPPED, step)
        if step not in workflow.images_by_step and operation in registered_functions:
            source_function = registered_functions[operation]
            source_function(image, display_results=False)

        # Source images
        source_images = workflow.images_by_step.get(step)
        assert source_images is not None

        return workflow, source_images
    
    
    def display_sources_and_targets(
        workflow: VisualObjectWorkflow,
        source_step: WorkflowStep,
        target_step: WorkflowStep,
    ) -> None:
        source_images = workflow.images_by_step[source_step]
        target_images = workflow.images_by_step[target_step]
        if not source_images:
            print("❌ No images to display")
            return

        fig = plt.figure(layout="compressed")
        if horizontal := (len(source_images) >= 2):
            rows, cols = 2, len(source_images)
        else:
            rows, cols = len(source_images), 2
        gs = fig.add_gridspec(rows, cols)

        for i, (source_image, target_image) in enumerate(
            zip(source_images, target_images, strict=True)
        ):
            for dim, image in enumerate([source_image, target_image]):
                grid_spec = gs[dim, i] if horizontal else gs[i, dim]
                ax = fig.add_subplot(grid_spec)
                ax.set_axis_off()
                ax.imshow(image)

        disable_colab_cell_scrollbar()
        plt.show()
    
    
    print("✅ Object editing helpers defined")

    Now, let’s define a first editing step to restore the detected objects, which can contain many real-life artifacts…


    ✨ Restoring visual objects

    For this restoration step, we need to craft a prompt that’s generic enough (to cover most use cases) but also specific enough (to take restoration needs into account).

    An image editing prompt is based on natural language, typically using imperative or declarative instructions. With an imperative prompt, you describe the actions to perform on the input, while with a declarative prompt, you describe the expected output. Both are possible and can provide equivalent results. Your choice is really a matter of preference, as long as the prompt makes sense.
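    To make the distinction concrete, here are two hypothetical prompts (not the ones used below) requesting the same edit, first imperatively, then declaratively:

```python
# Imperative: describe the actions to perform on the input
imperative_prompt = """
Remove the paper texture and stains, straighten the illustration,
and place it on a pure white background.
"""

# Declarative: describe the expected output
declarative_prompt = """
A clean, upright version of the illustration on a pure white background,
free of paper texture and stains.
"""
```

    Either string could be passed as the `prompt` argument of the editing functions defined in this notebook.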

    Our test suite is mostly composed of book pictures, which can contain various photographic and paper artifacts. The Nano Banana models understand these subtleties and can edit images accordingly, which simplifies the prompt.

    Here is a possible restoration function using an imperative prompt:

    RESTORATION_PROMPT = """
    - Isolate and straighten the visual on a pure white background, excluding any surrounding text.
    - Clean up all physical artifacts and noise while preserving every original detail.
    - Center the result and scale it to fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.
    """
    
    # Default config with low randomness for more deterministic restoration outputs
    RESTORATION_CONFIG = GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_modalities=["IMAGE"],
    )
    
    restore_objects = object_editing_function(
        RESTORATION_PROMPT,
        WorkflowStep.CROPPED,
        WorkflowStep.RESTORED,
        default_config=RESTORATION_CONFIG,
    )
    
    print("✅ Restoration function defined")

    🧪 Let’s try to restore the illustration from the 1485 incunable:

    restore_objects(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 We now have a nice restoration of the hand-colored woodcut illustration. Note that our prompt is generic (“clean up all physical artifacts”) and could be made more specific to remove more or fewer artifacts. In this example, there are remaining artifacts, such as the paper discoloration in the sword or the bleeding ink in the armor. We’ll see if we can fix these in the colorization step.


    🧪 What about the illustrations from the museum guidebook?

    restore_objects(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 All good!


    🧪 What about the slightly warped visuals?

    restore_objects(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    Visual objects edited by Nano Banana

    💡 Remarks:

    • Notice how, on the last painting, the orange bookmark is properly removed and the hidden part inpainted to complete the painting.
    • We asked to “fill the canvas with minimal uniform margins, without distortion or cropping”. Depending on the aspect ratio and type of the visual, this degree of freedom can result in different white margins.
    • This example shows well-known paintings by Vincent van Gogh. Nano Banana doesn’t fetch any reference images and only uses the provided input. If these were pictures of private paintings, they would be restored in the same way.

    In the Denver architecture book, the illustrations can be tilted, which our generic prompt doesn’t fully take into account. When several geometric transformations are involved, it can be challenging to craft an imperative prompt that details all the operations to perform. Instead, a descriptive prompt can be more straightforward by directly describing the expected output.

    🧪 Here’s an example of a descriptive prompt focusing on the restoration of tilted visuals:

    tilted_visual_prompt = """
    An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and free of physical artifacts.
    """
    
    restore_objects(Source.denver_illustrated, tilted_visual_prompt)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    Visual objects edited by Nano Banana

    💡 Remarks:

    • To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves more straightforward to write than trying to account for all possible geometric corrections.
    • The native visual understanding automatically identifies the content type (photo, illustration, etc.) and the different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.
    • Notice how the consistency is preserved: the last visual is restored as an illustration, while the first visuals keep their photographic style.
    • The results, with this rather generic prompt, are impressive. It’s, of course, possible to be more specific and request particular lighting, styles, colors…

    In this last test, the input visual has distortions not only from the page curvature but also from the photo perspective.

    🧪 Here’s an example of a descriptive prompt focusing on restoring warped illustrations:

    warped_visual_prompt = """
    An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, resulting in an image framed in a perfect rectangle, on a pure white canvas with minimal margins.
    """
    
    restore_objects(Source.alice_drawing, warped_visual_prompt)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

    💡 It’s really impressive that such a restoration can be performed in a single step. Note that this prompt is not stable and can generate less optimal results (it would benefit from being more precise). If you have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and you might be pleasantly surprised. In the worst case, it’s also possible to process the transformations in successive, simpler steps.

    Now, let’s add a colorization step…


    🎨 Colorization

    Our restoration step respected the original styles of the input images. Recent image editing models excel at transforming image styles, starting with colors. This can often be done directly with a simple, precise instruction.

    Here is a possible colorization function using an imperative prompt:

    COLORIZATION_PROMPT = """
    Colorize this image in a modern book illustration style, maintaining all original details without any additions.
    """
    
    colorize = object_editing_function(
        COLORIZATION_PROMPT,
        WorkflowStep.RESTORED,
        WorkflowStep.COLORIZED,
    )
    
    print("✅ Colorization function defined")

    🧪 Let’s modernize our 1485 illustration:

    colorize(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 All details are preserved, as requested in the prompt. Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration in the sword or the bleeding ink in the armor).


    🧪 Let’s colorize our museum guidebook illustrations:

    colorize(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 Our prompt is very open since it only specifies “modern book illustration style”. This can generate very creative colorizations, but they all seem to make perfect sense.


    🧪 What about our Denver buildings?

    colorize(Source.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    Visual objects edited by Nano Banana

    💡 As requested, they all look like modern illustrations, including the first visuals (originating from noisy photos).


It is possible to go further by not only “colorizing” but also “transforming” the image into a significantly different one.

🧪 Let’s turn our “Alice’s Adventures in Wonderland” drawing into a watercolor painting:

watercolor_prompt = """
Transform this visual into a warm, watercolor painting.
"""

colorize(Source.alice_drawing, watercolor_prompt)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

🧪 What about making it a traditional painting?

painting_prompt = """
Transform this visual into a traditional painting.
"""

colorize(Source.alice_drawing, painting_prompt)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

We can also change image compositions. Depending on the context, some compositions are more or less implied by default. For example, illustrations usually have margins, whereas photographs often have edge-to-edge (full-bleed in the printing world) compositions. When possible, it is interesting to refer to a type of visual (which intrinsically brings a lot of semantics to the context) and adjust the instructions accordingly.
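One way to apply this idea in code is to key the prompt on the detected object’s label. This is a minimal sketch under that assumption; the `COMPOSITION_HINTS` mapping and `build_edit_prompt` helper are illustrative, not part of the notebook’s actual API:

```python
# Hypothetical helper: pick a composition hint based on the detected
# object's label, so instructions match the visual type (illustrations
# keep margins, engravings and photographs can go full-bleed).
COMPOSITION_HINTS = {
    "illustration": "keeping the original margins",
    "engraving": "extending the content for a full-bleed effect",
    "photograph": "preserving the edge-to-edge composition",
}

def build_edit_prompt(label: str, target_style: str) -> str:
    # Fall back to a neutral instruction for labels we haven't mapped.
    hint = COMPOSITION_HINTS.get(label, "preserving the original composition")
    return f"Transform this visual into {target_style}, {hint}."

print(build_edit_prompt("engraving", "a full-color, flat digital graphic"))
# → Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.
```

The same pattern extends to any label-to-instruction mapping you find useful for your corpus.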

🧪 Let’s see how we can detect engravings in this 1847 book, restore them, and transform them into modern digital graphics:

detect_objects(Source.engravings)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana
restore_objects(Source.engravings)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana
visual_to_digital_graphic_prompt = """
Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.
"""

colorize(Source.engravings, visual_to_digital_graphic_prompt)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana

🧪 We can also transform the same engravings into photographs with a very simple prompt:

visual_to_photo_prompt = """
Transform this visual into a high-end, modern camera photograph.
"""

colorize(Source.engravings, visual_to_photo_prompt)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana

💡 As photographs are generally full-bleed, the prompt doesn’t need to specify a composition.

It’s really up to our imagination, as Nano Banana seems to grasp every aspect of the visual semantics.

Let’s add a final step to see how far we can go, reimagining images as cinematic movie stills…


    🎞️ Cinematization

We’ve used rather “closed” prompts so far, crafting specific instructions and constraints to control the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. Notably, it can be interesting to refer to photographic or cinematographic terminology, as it encompasses many visual techniques.

Here is a possible generic cinematization function to reimagine images as movie stills:

CINEMATIZATION_PROMPT = """
Reimagine this image as a joyful, modern live-action cinematic movie still featuring professional lighting and composition.
"""

cinematize = object_editing_function(
    CINEMATIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.CINEMATIZED,
)

    🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:

cinematize(Source.alice_drawing)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

💡 This looks like a high-budget movie still. There are many degrees of freedom in the prompt, but you’re likely to get foreground figures in sharp focus, a smooth background blur, “golden hour” lighting (a magical ingredient for many cinematographers), and detailed textures. Such compositions really evoke different atmospheres compared to the photographs generated in the previous test.


🧪 Let’s test the workflow on a page from The Wonderful Wizard of Oz containing three drawings:

detect_objects(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

Visual objects edited by Nano Banana
restore_objects(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

Visual objects edited by Nano Banana
cinematize(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

💡 The cast for a new movie is ready 😉


Cinematic images have various use cases:

• These cinematized stills can be perfect “reference images” for video generation models like Veo. See Generate Veo videos from reference images.
• As they are photorealistic representations, they can also be a source for generating 2D or 3D visuals, in any style, with lifelike figures, perfect proportions, advanced lighting, enhanced compositions…
• You can use them in many professional contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…

    🏁 Conclusion

• Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.
• We tested the detection of illustrations in book photographs, which traditional machine learning (ML) models usually miss, as they are typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.
• We tested the detection of straight, tilted, and even significantly warped illustrations, and they were always precisely identified.
• The core implementation was straightforward, requiring minimal code using the Python SDK and customized prompts. By comparison, fine-tuning a traditional object detection model is time-consuming: it involves assembling an image dataset, labeling objects, and managing training jobs.
• This solution is very flexible: we could switch from detecting illustrations to electronic components by adapting the prompt, while keeping the code unchanged.
• Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code both easy to implement and ready to deploy to production.
• Then, Nano Banana enables editing these visual objects in almost any way imaginable.
• We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.
• The possibilities seem truly limitless, and the ideas in this exploration can be reused in various contexts.
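To make the structured-outputs point concrete, here is a minimal, self-contained sketch of the kind of detection schema involved. Gemini’s spatial-understanding examples return bounding boxes as [y_min, x_min, y_max, x_max] normalized to a 0–1000 scale; the class name, field names, and pixel-conversion helper below are illustrative assumptions, not the notebook’s exact code:

```python
import json
from dataclasses import dataclass

# Illustrative schema for structured detection outputs (not the notebook's
# actual classes). Boxes follow the [y_min, x_min, y_max, x_max] convention,
# normalized to a 0-1000 scale.
@dataclass
class DetectedObject:
    label: str
    box_2d: list[int]  # [y_min, x_min, y_max, x_max] on a 0-1000 scale

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Convert normalized coordinates to pixel (left, top, right, bottom)."""
        y0, x0, y1, x1 = self.box_2d
        return (x0 * width // 1000, y0 * height // 1000,
                x1 * width // 1000, y1 * height // 1000)

# A model response constrained by such a schema parses directly into objects:
response_text = '[{"label": "illustration", "box_2d": [100, 250, 900, 750]}]'
objects = [DetectedObject(**o) for o in json.loads(response_text)]
print(objects[0].to_pixels(width=2000, height=3000))
# → (500, 300, 1500, 2700)
```

The pixel boxes can then be used to crop each detected visual before sending it to the editing steps.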

➕ More!

Thanks for reading. Let me know if you create something cool!


