Building Real-Time RAG Systems with Gemini & the Multimodal Live API
Hi 👋🏼 I'm Olayinka Peter, a Senior ML Engineer & Google Developer Expert for Machine Learning.
Most retrieval-augmented generation systems today feel a bit stiff. You ask a question. You wait. You get an answer. It works, but it doesn’t “feel” like a conversation.
When I wrote Grokking GenAI: Multimodal Reasoning with Gemini last year, multimodality felt like a breakthrough. An AI that could read text, look at images, listen to audio, and even understand code already felt futuristic. But over the past year, something important has changed.
Imagine you’re trying to plan a trip to Hawaii. You’ve got a few pictures of beautiful beaches, a list of things you want to see, and a rough budget in mind. How do you pull it all together? You might browse travel blogs, compare prices, and even watch videos of the islands. You’re using different kinds of information – pictures, text, and video – to make sense of your trip.
Do you read a lot? Either way, imagine you're at a library looking for information on a specific topic. Instead of browsing through every book on the shelves, you ask the librarian for help, and they hand you the handful of books that actually answer your question. That, in a nutshell, is the retrieval step of a RAG system: pull only the most relevant material, then let the model answer from it.
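To make the librarian analogy concrete, here's a minimal sketch of that retrieval step in Python. Everything in it is illustrative: the embed() helper stands in for a real embedding model, the tiny in-memory library stands in for a vector store, and retrieve() simply ranks documents by cosine similarity.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: in a real system this would call an
    embedding model (for example, a Gemini embedding endpoint)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)  # stand-in vector, not a real embedding

# The "library": a handful of documents we can retrieve from.
library = [
    "The Multimodal Live API streams audio, video, and text in real time.",
    "Retrieval-augmented generation grounds answers in your own documents.",
    "Hawaii's Big Island is known for volcanoes and black-sand beaches.",
]
doc_vectors = [embed(doc) for doc in library]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    """Play the librarian: rank every document by cosine similarity
    to the question and return the closest match(es)."""
    q = embed(question)
    scores = [
        float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        for d in doc_vectors
    ]
    ranked = sorted(zip(scores, library), reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# With the placeholder embed() the ranking isn't semantically meaningful;
# swap in a real embedding model to get actual semantic retrieval.
print(retrieve("How do I keep the model's answers grounded?"))
```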