When I wrote Grokking GenAI: Multimodal Reasoning with Gemini last year, multimodality felt like a breakthrough. An AI that could read text, look at images, listen to audio, and even understand code already felt futuristic.

But over the past year, something important has changed.

Multimodal AI is no longer about handling different inputs. It’s about reasoning across them at once, in context, the way humans naturally do. And that shift is most visible in how Gemini has evolved.

This second part isn’t about promises or demos. It’s about what multimodal reasoning actually looks like now, in real workflows, with real code.

From perception to understanding

Think about how humans solve problems.

If you’re reviewing a checkout screen, you don’t first describe the screen, then separately think about usability. You see everything at once: the layout, the buttons, the friction points. The reasoning happens with the visual context, not after it.

That’s the difference modern Gemini models bring.

Instead of treating images, text, or audio as separate steps, Gemini reasons over them jointly. You don’t “explain the image” anymore. You think with it.

Here’s a simple example.

Imagine uploading a checkout screen and asking Gemini to analyze it from a UX perspective.

from vertexai.generative_models import GenerativeModel, Part
import vertexai

vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")

image = Part.from_uri(
    uri="gs://my-bucket/mobile_checkout_screen.png",
    mime_type="image/png"
)

prompt = """
Analyze this checkout screen.
Identify UX issues and suggest improvements,
focusing on accessibility and conversion.
"""

response = model.generate_content(
    contents=[image, prompt],
    # Generation settings go in generation_config, not as direct kwargs
    generation_config={"temperature": 0.3}
)

print(response.text)

What’s happening here is subtle but important. Gemini isn’t generating a description and then reasoning. The image is part of the reasoning context itself. That’s what makes the feedback feel closer to how a human designer would think through the problem.
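
Because the image lives inside the reasoning context, you’re also not limited to one image per question. You can hand Gemini several screens and a question in a single call and let it reason across all of them. Here’s a minimal sketch, assuming hypothetical “before” and “after” screenshots of the same checkout flow:

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

# Hypothetical before/after screenshots of the same checkout flow
before = Part.from_uri(
    uri="gs://my-bucket/checkout_v1.png",
    mime_type="image/png"
)

after = Part.from_uri(
    uri="gs://my-bucket/checkout_v2.png",
    mime_type="image/png"
)

prompt = """
Compare these two versions of the checkout screen.
Which one is more likely to convert, and why?
"""

# Both images and the question share a single reasoning context
response = model.generate_content(contents=[before, after, prompt])

print(response.text)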

When context gets large, reasoning gets deeper

Another quiet revolution in Gemini’s evolution is context size.

Modern Gemini models can handle extremely large inputs: long documents, videos, transcripts, PDFs — all at once. And multimodality makes that even more powerful.
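
Before you throw a pile of files at the model, it’s worth checking how much of that context they’ll actually consume. The SDK’s count_tokens call does this without generating anything. A quick sketch, assuming a hypothetical long transcript sitting in the same bucket:

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

# A hypothetical long document to sanity-check against the context window
transcript = Part.from_uri(
    uri="gs://my-bucket/earnings_call_transcript.pdf",
    mime_type="application/pdf"
)

# Reports the token count of the request without running generation
usage = model.count_tokens([transcript, "Summarize the key risks discussed."])

print(usage.total_tokens)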

Imagine reviewing a product demo video while cross-checking it against the official specs. Normally, that’s a painful, manual process. With Gemini, it becomes a single reasoning task.

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

video = Part.from_uri(
    uri="gs://my-bucket/product_demo.mp4",
    mime_type="video/mp4"
)

spec_doc = Part.from_uri(
    uri="gs://my-bucket/product_specs.pdf",
    mime_type="application/pdf"
)

prompt = """
Summarize the key product capabilities shown in the video.
Cross-check them against the spec document
and highlight any inconsistencies.
"""

response = model.generate_content(
    contents=[video, spec_doc, prompt],
    # Output limits are passed via generation_config
    generation_config={"max_output_tokens": 500}
)

print(response.text)

This isn’t just summarization. It’s cross-modal verification. Gemini watches the video, reads the specs, and reasons about whether the claims align.

At this point, multimodal AI starts feeling less like a chatbot and more like a research assistant that never gets tired.

Design to code is no longer a leap

One of the most practical shifts in multimodal AI is how it collapses the distance between design and implementation.

If you’ve ever stared at a UI mockup and thought, “Now I have to translate all of this into code,” you’ll appreciate this.

With Gemini, the screenshot itself becomes the prompt.

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

ui_image = Part.from_uri(
    uri="gs://my-bucket/login_screen.png",
    mime_type="image/png"
)

prompt = """
Generate Jetpack Compose code for this screen.
Use Material 3 components and follow accessibility best practices.
"""

response = model.generate_content(
    contents=[ui_image, prompt]
)

print(response.text)

This doesn’t replace engineers. What it does is remove the tedious translation step. The model understands layout, hierarchy, and intent directly from the image.
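
In practice, you’ll usually want to pull the code out of the response before anyone reviews it. A small post-processing sketch, assuming the model wraps its answer in a Markdown code fence (typical, but not guaranteed):

import re

def extract_first_code_block(text: str) -> str:
    """Return the first fenced code block in text, or the text unchanged."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

# 'response' is the result of the generate_content call above
with open("LoginScreen.kt", "w") as f:
    f.write(extract_first_code_block(response.text))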

That’s multimodal reasoning turning into real developer leverage.

Audio isn’t just transcribed anymore

For a long time, “audio AI” mostly meant transcription. But listening and understanding are not the same thing. Gemini now reasons over audio the same way it reasons over text or images.

Imagine feeding it a meeting recording and asking for outcomes, not words.

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

audio = Part.from_uri(
    uri="gs://my-bucket/engineering_sync.wav",
    mime_type="audio/wav"
)

prompt = """
Summarize the meeting.
Extract key decisions, action items, and owners.
"""

response = model.generate_content(
    contents=[audio, prompt]
)

print(response.text)

This is a subtle but meaningful shift. Gemini isn’t just hearing what was said. It’s inferring what matters.
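
And if you want those decisions and action items in a form other tools can consume, you can ask for JSON directly. A sketch, assuming a recent SDK version that supports response_mime_type:

import json

from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

audio = Part.from_uri(
    uri="gs://my-bucket/engineering_sync.wav",
    mime_type="audio/wav"
)

prompt = """
Return a JSON object with three fields:
"summary" (string), "decisions" (list of strings),
and "action_items" (list of objects with "task" and "owner").
"""

response = model.generate_content(
    contents=[audio, prompt],
    # Ask for valid JSON instead of free-form prose
    generation_config=GenerationConfig(response_mime_type="application/json")
)

meeting = json.loads(response.text)
print(meeting["action_items"])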

Multimodality is becoming agentic

The most interesting change, though, is where multimodality is heading.

When Gemini reasons across images, text, and constraints at once, it starts to behave less like a reactive model and more like a planning system.

For example, give it a floor plan and ask it to design a seating arrangement.

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

floor_plan = Part.from_uri(
    uri="gs://my-bucket/office_floorplan.jpg",
    mime_type="image/jpeg"
)

prompt = """
Plan seating for a 20-person engineering team.
Maximize collaboration, minimize noise issues,
and highlight potential risks in the layout.
"""

response = model.generate_content(
    contents=[floor_plan, prompt]
)

print(response.text)

Here, Gemini isn’t answering a question. It’s evaluating trade-offs, reasoning spatially, and proposing a plan. This is where multimodal reasoning starts to blur into agentic behavior.
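
That planning behavior becomes more useful once you can push back on it. The SDK’s chat interface keeps the floor plan in context across turns, so follow-up constraints refine the same plan instead of starting over. A sketch building on the example above:

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

floor_plan = Part.from_uri(
    uri="gs://my-bucket/office_floorplan.jpg",
    mime_type="image/jpeg"
)

# A chat session keeps the floor plan and earlier turns in context
chat = model.start_chat()

plan = chat.send_message([
    floor_plan,
    "Plan seating for a 20-person engineering team. Maximize collaboration."
])
print(plan.text)

# A follow-up constraint refines the same plan rather than restarting it
revised = chat.send_message(
    "Two engineers need quiet desks away from the kitchen. Revise the plan."
)
print(revised.text)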

The quiet takeaway

What’s striking about Gemini’s recent evolution is that none of this feels flashy. There’s no single feature you can point to and say, “That’s the moment everything changed.” Instead, the shift is philosophical.

Multimodal AI is no longer about input diversity. It’s about context unity. Images, text, audio, and video are no longer separate channels. They’re all part of the same thinking space.

And once AI starts thinking in context, at scale, the question stops being “What can it see?” and becomes “What can it understand?” That’s where things start to get interesting.

The future of AI isn’t just multimodal. It’s context-first. And we’re already living in it.