heikohotz.bsky.social
AI Engineer @ Google 👨‍💻 — Educator 👨‍🏫 — Traveller ✈️ — Hobby photographer 📷 — Foodie 🌮 — Film fan 🍿 — Boardgamer 🎲 — Londoner 💂‍♂️ Medium: https://heiko-hotz.medium.com/ Github: https://github.com/heiko-hotz LI: https://www.linkedin.com/in/heikohotz/
447 posts 458 followers 626 following
Regular Contributor
Active Commenter
comment in response to post
i definitely hear you on that one 😅 out of curiosity - what are the benefits you are looking to gain from an agent framework (in general)?
comment in response to post
Not perfect by any means, but much better already than "traditional" voice assistants, and we are only at the beginning of this journey. You can try it yourself with the Developer Guide for Gemini's Multimodal Live API 🤗 github.com/heiko-hotz/g...
comment in response to post
I believe that multimodal AI models have the potential to change that. They allow me to speak much more freely about what I want them to do, and oftentimes they understand and execute the way I expect them to.
comment in response to post
But soon I realised that these voice assistants still required a rigid syntax: I would have to phrase commands in a very specific way for the voice assistant to understand what I meant.
comment in response to post
To me it was a magical moment when I got my first Amazon Echo in 2015 and could just shout words into the air and get a response.
comment in response to post
Check it out and start building your own voice assistant 🤗 github.com/heiko-hotz/g...
comment in response to post
It is a full-featured web application for real-time conversations with audio and video input, memory, and tool use! And it works great on mobile phones, too.
comment in response to post
But Gemini's Multimodal Live API actually lets you build a comparable experience today! I'm proud to share a step-by-step developer guide that will help you build Project Pastra.
comment in response to post
Google DeepMind's Project Astra takes a new stab at this, and if you have seen the video, it looks like it could be a big leap in the right direction. But currently you can only get waitlisted for it.
comment in response to post
I remember the magical moment when I uttered some words into the air and a voice assistant replied instantly. That was in 2014 and it promised a future where we would just talk to our devices and they would provide useful and insightful responses. It was a promise that, in my opinion, wasn't met.
comment in response to post
github.com/heiko-hotz/g...
comment in response to post
Check out the new chapter (link in comments below), which completes the app in the sense that it now covers all input modalities. But there are still more chapters to come, no worries 😌
comment in response to post
Which is great, because it means we can extend the length of a session by sending fewer images, or send more images if our application requires it.
comment in response to post
In the REST API, when videos are used as input, this is done automatically at 1 frame per second. In a live app, the choice of how many frames per second to capture actually lies with us.
comment in response to post
Because here is a little "secret": What is a video, anyway, if not a sequence of images? That means the trick is to send frame captures at a given rate to Gemini 💡
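To make the trick concrete, here is a minimal Python sketch (my own illustration, not the guide's actual code - the realtime-input message shape is an assumption based on my reading of the BidiGenerateContent schema, so check the official docs for the current format):

```python
# Capture webcam frames at a chosen rate, JPEG-encode them, and send each
# one to an already-open Live API WebSocket (e.g. from the `websockets` lib).
import asyncio
import base64
import json

import cv2  # pip install opencv-python

FRAMES_PER_SECOND = 1  # the REST API's default; raise or lower to taste

async def stream_frames(websocket) -> None:
    """Send webcam frames to Gemini at FRAMES_PER_SECOND."""
    capture = cv2.VideoCapture(0)  # default webcam
    try:
        while capture.isOpened():
            ok, frame = capture.read()
            if not ok:
                break
            # Encode the raw frame as JPEG, then base64 for the JSON payload.
            _, jpeg = cv2.imencode(".jpg", frame)
            message = {
                "realtime_input": {
                    "media_chunks": [{
                        "mime_type": "image/jpeg",
                        "data": base64.b64encode(jpeg.tobytes()).decode(),
                    }]
                }
            }
            await websocket.send(json.dumps(message))
            # Fewer frames -> longer sessions; more frames -> smoother "video".
            await asyncio.sleep(1 / FRAMES_PER_SECOND)
    finally:
        capture.release()
```

Lowering the frame rate stretches the session, raising it makes the "video" smoother - exactly the trade-off described in the comment above.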
comment in response to post
Link: github.com/heiko-hotz/g...
comment in response to post
Feel free to check it out (along with the previous chapters) and share it with folks who might be interested. Let's build our own Project Astra from scratch!
comment in response to post
But in the end I got there, and I'm proud to announce that Chapter 5 of my Gemini Multimodal Live API Developer Guide is now available on GitHub (link in comments below).
comment in response to post
I can now 100% confirm that (good) audio processing is by far the hardest component when building a live multimodal app (at least it was for me). Everything else should be smooth sailing from here on out (famous last words 😆).
comment in response to post
Merry Christmas and happy holidays!! 🎄
comment in response to post
github.com/heiko-hotz/g...
comment in response to post
I still have lots to learn, so I aim to expand this guide over time - but I believe it might already be useful in its current state. I have received some positive feedback on it, so please check it out if you're interested in building with this new API 🤗
comment in response to post
In the process of learning these concepts I started writing down my lessons learned, gotchas, etc., and compiled my notes into the developer guide I wish I had when starting this journey.
comment in response to post
Nevertheless, as someone who has traditionally worked mainly with REST APIs and Python SDKs, I had a few gaps I wanted to fill to better understand WebSockets, JS, and how to process audio and video.
comment in response to post
Want to build your own personal Project Astra with the new Multimodal Live API but don't know how to get started? Our teams have created some excellent examples and demos showcasing the new Gemini Live API.
comment in response to post
Despite the few lines of code for the app, there is quite a bit to it - see the system architecture below. The link to the repo and the README with a detailed explanation are in the comments to get you started 🛠️ Link to repo: github.com/heiko-hotz/g...
comment in response to post
The application captures audio input from the user's microphone, sends it to the Gemini API for processing, receives the model's audio response, and plays it back through the user's speakers. This creates an interactive and conversational experience, similar to talking to a voice assistant.
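For a sense of how that loop fits together, here is a rough Python sketch (assumptions on my part, not the repo's code - the actual guide is a JS web app using the browser's audio APIs, and the exact WebSocket message schema may differ from what I show):

```python
# Mic -> Gemini Live API -> speakers, as two concurrent tasks over one
# WebSocket. Raw 16-bit PCM chunks; rates follow the documented convention
# of 16 kHz input / 24 kHz output. Blocking PyAudio reads are fine for a sketch.
import asyncio
import base64
import json

import pyaudio  # pip install pyaudio

SEND_RATE, RECV_RATE, CHUNK = 16_000, 24_000, 1024

async def microphone_to_gemini(websocket, audio: pyaudio.PyAudio) -> None:
    """Read PCM chunks from the mic and stream them as realtime input."""
    mic = audio.open(format=pyaudio.paInt16, channels=1,
                     rate=SEND_RATE, input=True, frames_per_buffer=CHUNK)
    while True:
        pcm = mic.read(CHUNK, exception_on_overflow=False)
        await websocket.send(json.dumps({
            "realtime_input": {"media_chunks": [{
                "mime_type": "audio/pcm",
                "data": base64.b64encode(pcm).decode(),
            }]}
        }))

async def gemini_to_speakers(websocket, audio: pyaudio.PyAudio) -> None:
    """Play the model's audio chunks back as they arrive."""
    speaker = audio.open(format=pyaudio.paInt16, channels=1,
                         rate=RECV_RATE, output=True)
    async for raw in websocket:
        response = json.loads(raw)
        # Walk the server message for inline audio parts (schema may differ).
        parts = (response.get("serverContent", {})
                         .get("modelTurn", {})
                         .get("parts", []))
        for part in parts:
            data = part.get("inlineData", {}).get("data")
            if data:
                speaker.write(base64.b64decode(data))

async def run(websocket) -> None:
    audio = pyaudio.PyAudio()
    await asyncio.gather(microphone_to_gemini(websocket, audio),
                         gemini_to_speakers(websocket, audio))
```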
comment in response to post
Get well soon, and merry Christmas ☺️
comment in response to post
like so many areas of genai, this is one where it is deceptively easy to create a toy demo and incredibly hard to put something useful into prod. i will spend several months in 2025 trying to understand how to do it (and whether it's even possible), and i'm not a beginner by any means
comment in response to post
there are folks who find AI podcasts more enjoyable & engaging than reading papers, for instance. and it makes learning more accessible and inclusive.
comment in response to post
Another source I recommend to customers is github.com/google-gemin... The notebooks in there are fairly comprehensive and contain good information.
comment in response to post
nice one @jbarrow.bsky.social 🤗 and i see you're already working on the next part 😃 publish.obsidian.md/jbarrow/Misc... please keep them coming!
comment in response to post
Congrats @jhamrick.bsky.social and GDM team - the new model is fantastic 🤗 I already tested it against much larger models and it beat them in a math riddle 🚀 bsky.app/profile/heik...
comment in response to post
Here's the prompt: Use any mathematical signs wherever you need:
2 + 2 + 2 = 6
3 3 3 = 6
4 4 4 = 6
5 5 5 = 6
6 6 6 = 6
7 7 7 = 6
8 8 8 = 6
9 9 9 = 6
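For reference, one classic solution set to this riddle (my own working, not taken from the thread):

```latex
\begin{aligned}
2 + 2 + 2 &= 6 \\
3 \times 3 - 3 &= 6 \\
\sqrt{4} + \sqrt{4} + \sqrt{4} &= 6 \\
5 + 5 \div 5 &= 6 \\
6 + 6 - 6 &= 6 \\
7 - 7 \div 7 &= 6 \\
\sqrt[3]{8} + \sqrt[3]{8} + \sqrt[3]{8} &= 6 \\
\sqrt{9} \times \sqrt{9} - \sqrt{9} &= 6
\end{aligned}
```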
comment in response to post
Thanks for the feedback - do you mind sharing the car model?
comment in response to post
Nice one, thanks for testing 🤗
comment in response to post
Gemini 2.0 Flash Thinking is an experimental model that explicitly shows its thoughts (i.e. not hiding them 😉). It is built on 2.0 Flash's speed and performance and is trained to use thoughts to strengthen its reasoning. And users seem to like it 🤗
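Trying it out takes only a few lines with the google-genai Python SDK; a minimal sketch (the model id and SDK surface are my assumptions for the experimental release and may have changed since):

```python
from google import genai

# Placeholder key; use your own from AI Studio.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash-thinking-exp",  # experimental model id at the time
    contents="Use any mathematical signs wherever you need: 7 7 7 = 6",
)
print(response.text)
```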