heikohotz.bsky.social
AI Engineer @ Google 👨‍💻 — Educator 👨‍🏫 — Traveller ✈️ — Hobby photographer 📷 — Foodie 🌮 — Film fan 🍿 — Boardgamer 🎲 — Londoner 💂‍♂️ Medium: https://heiko-hotz.medium.com/ Github: https://github.com/heiko-hotz LI: https://www.linkedin.com/in/heikohotz/
447 posts 458 followers 626 following
Regular Contributor
Active Commenter
comment in response to post
i definitely hear you on that one 😅 out of curiosity - what are the benefits you are looking to gain from an agent framework (in general)?
comment in response to post
Not perfect by any means, but much better already than "traditional" voice assistants, and we are only at the beginning of this journey. You can try it yourself with the Developer Guide for Gemini's Multimodal Live API 🤗 github.com/heiko-hotz/g...
comment in response to post
I believe that multimodal AI models have the potential to change that. They allow me to speak much more freely about what I want them to do, and oftentimes they understand and execute the way I expect them to.
comment in response to post
But soon I realised that these voice assistants still required a rigid syntax: I would have to phrase commands in a very specific way for the voice assistant to understand what I meant.
comment in response to post
To me it was a magical moment when I got my first Amazon Echo in 2015 and could just shout words into the air and get a response.
comment in response to post
Check it out and start building your own voice assistant 🤗 github.com/heiko-hotz/g...
comment in response to post
It is a full-featured web application for real-time conversations with audio and video input, memory, and tool use! And it works great on mobile phones, too.
comment in response to post
But Gemini's Multimodal Live API actually lets you build a comparable experience today! I'm proud to share a step-by-step developer guide that will help you build Project Pastra.
comment in response to post
Google DeepMind's Project Astra takes a new stab at this, and if you have seen the video, it looks like it could be a big leap in the right direction. But currently you can only get waitlisted for it.
comment in response to post
I remember the magical moment when I uttered some words into the air and a voice assistant replied instantly. That was in 2014 and it promised a future where we would just talk to our devices and they would provide useful and insightful responses. It was a promise that, in my opinion, wasn't met.
comment in response to post
github.com/heiko-hotz/g...
comment in response to post
Check out the new chapter (link in comments below), which completes the app in the sense that it now covers all input modalities. But there are still more chapters to come, no worries 😌
comment in response to post
Which is great, because it means we can extend the length of a session by sending fewer images, or send more images if our application requires it.
comment in response to post
In the REST API, when videos are used as input, this is done automatically at 1 frame per second. In a live app, the choice of how many frames per second to capture actually lies with us.
comment in response to post
Because here is a little "secret": What is a video, anyway, if not a sequence of images? That means the trick is to send frame captures at a given rate to Gemini 💡
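To make the trick concrete, here is a minimal Python sketch (my own illustration, not the guide's actual code - the realtime-input message shape is an assumption based on my reading of the BidiGenerateContent schema, so check the official docs for the current format):

```python
# Capture webcam frames at a chosen rate, JPEG-encode them, and send each
# one to an already-open Live API WebSocket (e.g. from the `websockets` lib).
import asyncio
import base64
import json

import cv2  # pip install opencv-python

FRAMES_PER_SECOND = 1  # the REST API's default; raise or lower to taste

async def stream_frames(websocket) -> None:
    """Send webcam frames to Gemini at FRAMES_PER_SECOND."""
    capture = cv2.VideoCapture(0)  # default webcam
    try:
        while capture.isOpened():
            ok, frame = capture.read()
            if not ok:
                break
            # Encode the raw frame as JPEG, then base64 for the JSON payload.
            _, jpeg = cv2.imencode(".jpg", frame)
            message = {
                "realtime_input": {
                    "media_chunks": [{
                        "mime_type": "image/jpeg",
                        "data": base64.b64encode(jpeg.tobytes()).decode(),
                    }]
                }
            }
            await websocket.send(json.dumps(message))
            # Fewer frames -> longer sessions; more frames -> smoother "video".
            await asyncio.sleep(1 / FRAMES_PER_SECOND)
    finally:
        capture.release()
```

Lowering the frame rate stretches the session, raising it makes the "video" smoother - exactly the trade-off described in the comment above.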
comment in response to post
Link: github.com/heiko-hotz/g...
comment in response to post
Feel free to check it out (along with the previous chapters) and share it with folks who might be interested. Let's build our own Project Astra from scratch!
comment in response to post
But in the end I got there, and I'm proud to announce that Chapter 5 of my Gemini Multimodal Live API Developer Guide is now available on GitHub (link in comments below).
comment in response to post
I can now 100% confirm that (good) audio processing is by far the hardest component when building a live multimodal app (at least it was for me). Everything else should be smooth sailing from here on out (famous last words 😆).
comment in response to post
Merry Christmas and happy holidays!! 🎄
comment in response to post
github.com/heiko-hotz/g...
comment in response to post
I still have lots to learn, so I aim to expand this guide over time - but I believe it might already be useful in its current state. I have received some positive feedback on it, so please check it out if you're interested in building with this new API 🤗
comment in response to post
In the process of learning these concepts I started writing down my lessons learned, gotchas, etc., and compiled my notes into the developer guide I wish I had when starting this journey.
comment in response to post
Nevertheless, as someone who has traditionally worked mainly with REST APIs and Python SDKs, I had a few gaps I wanted to fill to better understand WebSockets, JS, and how to process audio and video.
comment in response to post
Want to build your own personal Project Astra with the new Multimodal Live API but don't know how to get started? Our teams have created some excellent examples and demos showcasing the new Gemini Live API.
comment in response to post
Despite the few lines of code for the app, there is quite a bit to it - see the system architecture below. The link to the repo and the README with a detailed explanation are in the comments to get you started 🛠️ Link to repo: github.com/heiko-hotz/g...
comment in response to post
The application captures audio input from the user's microphone, sends it to the Gemini API for processing, receives the model's audio response, and plays it back through the user's speakers. This creates an interactive and conversational experience, similar to talking to a voice assistant.
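For a sense of how that loop fits together, here is a rough Python sketch (assumptions on my part, not the repo's code - the actual guide is a JS web app using the browser's audio APIs, and the exact WebSocket message schema may differ from what I show):

```python
# Mic -> Gemini Live API -> speakers, as two concurrent tasks over one
# WebSocket. Raw 16-bit PCM chunks; rates follow the documented convention
# of 16 kHz input / 24 kHz output. Blocking PyAudio reads are fine for a sketch.
import asyncio
import base64
import json

import pyaudio  # pip install pyaudio

SEND_RATE, RECV_RATE, CHUNK = 16_000, 24_000, 1024

async def microphone_to_gemini(websocket, audio: pyaudio.PyAudio) -> None:
    """Read PCM chunks from the mic and stream them as realtime input."""
    mic = audio.open(format=pyaudio.paInt16, channels=1,
                     rate=SEND_RATE, input=True, frames_per_buffer=CHUNK)
    while True:
        pcm = mic.read(CHUNK, exception_on_overflow=False)
        await websocket.send(json.dumps({
            "realtime_input": {"media_chunks": [{
                "mime_type": "audio/pcm",
                "data": base64.b64encode(pcm).decode(),
            }]}
        }))

async def gemini_to_speakers(websocket, audio: pyaudio.PyAudio) -> None:
    """Play the model's audio chunks back as they arrive."""
    speaker = audio.open(format=pyaudio.paInt16, channels=1,
                         rate=RECV_RATE, output=True)
    async for raw in websocket:
        response = json.loads(raw)
        # Walk the server message for inline audio parts (schema may differ).
        parts = (response.get("serverContent", {})
                         .get("modelTurn", {})
                         .get("parts", []))
        for part in parts:
            data = part.get("inlineData", {}).get("data")
            if data:
                speaker.write(base64.b64decode(data))

async def run(websocket) -> None:
    audio = pyaudio.PyAudio()
    await asyncio.gather(microphone_to_gemini(websocket, audio),
                         gemini_to_speakers(websocket, audio))
```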
comment in response to post
Get well soon, and merry Christmas ☺️
comment in response to post
like so many areas of genai, this is one where it is deceptively easy to create a toy demo and incredibly hard to put something useful into prod. i will spend several months in 2025 trying to understand how to do it (and whether it's even possible), and i'm not a beginner by any means
comment in response to post
there are folks who find AI podcasts more enjoyable & engaging than reading papers, for instance. and it makes learning more accessible and inclusive.
comment in response to post
Another source I recommend to customers is github.com/google-gemin... The notebooks in there are fairly comprehensive and contain good information.
comment in response to post
nice one @jbarrow.bsky.social 🤗 and i see you're already working on the next part 😃 publish.obsidian.md/jbarrow/Misc... please keep them coming!
comment in response to post
Congrats @jhamrick.bsky.social and GDM team - the new model is fantastic 🤗 I already tested it against much larger models and it beat them in a math riddle 🚀 bsky.app/profile/heik...
comment in response to post
Here's the prompt: Use any mathematical signs wherever you need:
2 + 2 + 2 = 6
3 3 3 = 6
4 4 4 = 6
5 5 5 = 6
6 6 6 = 6
7 7 7 = 6
8 8 8 = 6
9 9 9 = 6
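For reference, one classic solution set to this riddle (my own working, not taken from the thread):

```latex
\begin{aligned}
2 + 2 + 2 &= 6 \\
3 \times 3 - 3 &= 6 \\
\sqrt{4} + \sqrt{4} + \sqrt{4} &= 6 \\
5 + 5 \div 5 &= 6 \\
6 + 6 - 6 &= 6 \\
7 - 7 \div 7 &= 6 \\
\sqrt[3]{8} + \sqrt[3]{8} + \sqrt[3]{8} &= 6 \\
\sqrt{9} \times \sqrt{9} - \sqrt{9} &= 6
\end{aligned}
```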
comment in response to post
Thanks for the feedback - do you mind sharing the car model?
comment in response to post
Nice one, thanks for testing 🤗
comment in response to post
Gemini 2.0 Flash Thinking is an experimental model that explicitly shows its thoughts (i.e. not hiding them 😉). It is built on 2.0 Flash's speed and performance and is trained to use thoughts to strengthen its reasoning. And users seem to like it 🤗
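Trying it out takes only a few lines with the google-genai Python SDK; a minimal sketch (the model id and SDK surface are my assumptions for the experimental release and may have changed since):

```python
from google import genai

# Placeholder key; use your own from AI Studio.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",  # experimental model id at the time
    contents="Use any mathematical signs wherever you need: 7 7 7 = 6",
)
print(response.text)
```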