Takeaways from OpenAI and Google's May announcements
This is a weekly newsletter about the business of the technology industry. To receive Tanay’s Newsletter in your inbox, subscribe here for free:
Hi friends,
A few weeks ago both OpenAI and Google announced a number of new AI features and products. Although it feels like forever ago given how quickly AI has been progressing, there were a few themes from the announcements that I felt were worth unpacking, which is what I’ll do in this piece.
I. True Multimodal Models are (almost) here
OpenAI, through their new flagship model, GPT-4o (“o” for omni), showcased a model that can reason across audio, vision, and text in real time. What this essentially means is that it understands all of those inputs directly, and can also output in those modalities directly.
Google also showcased Project Astra, a universal AI agent that can reason across these modalities in real time and respond directly across them.
As an example of how this changes things, voice agents today are typically built using three steps:
convert the audio of the speaker to text using a speech-to-text model
determine the right response to that text using an LLM
convert that response from text to audio using a text-to-speech model
These new models can do this natively, skipping the intermediate steps and directly outputting the right speech in response to what they hear. This is a game changer for speed/latency, but also for the ability to understand tone, emotion, etc. and output speech that sounds more natural.
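For readers who want to see the difference concretely, here’s a minimal sketch of the traditional three-step pipeline using OpenAI’s Python SDK. The model names and parameters here are illustrative assumptions, not a prescription:

```python
# Traditional three-step voice agent: speech-to-text -> LLM -> text-to-speech.
# Model names ("whisper-1", "gpt-4o", "tts-1") and parameters are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def voice_agent_reply(audio_path: str) -> bytes:
    # Step 1: transcribe the user's audio into text
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: determine the right response with an LLM
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a concise, friendly voice assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply_text = chat.choices[0].message.content

    # Step 3: synthesize the reply back into audio
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    return speech.read()  # audio bytes to play back to the user
```

A natively multimodal model collapses all three calls into a single audio-in, audio-out request, which is where the latency win and the preserved tone and emotion come from.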
GPT-4o and Project Astra are among the first to demonstrate this, but I doubt they will be the last, and I expect we’ll see many more similar models, or even more specialized ones such as conversational speech-to-speech models.
The biggest unlock from these in the short to medium term will be in two areas:
A much better AI assistant on phones/computers
More innovation in voice agents, both for consumers (education, companions, therapy) and for businesses (scheduling and booking, customer support, etc.)
II. Latency and Cost Continue to Improve
For developers using OpenAI’s and Google’s models, latency, cost, and the inherent trade-off between them and performance continue to be big issues for production use cases.
Fortunately, we’re continuing to see large improvements on the cost and latency side for a given level of performance, which was again the case in both sets of announcements.
OpenAI announced that GPT-4o would be 2x the speed of GPT-4 Turbo, at half the cost (see the rough cost comparison below)
Google announced Gemini 1.5 Flash, which is optimized for speed and cost, and is about 50% faster and 95% cheaper than GPT-4/Gemini Pro, albeit less performant.
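To put those numbers in context, a quick back-of-the-envelope comparison helps. The prices below are approximate launch-time list prices per 1M tokens; they change frequently, so treat both the prices and the workload as illustrative assumptions:

```python
# Rough cost comparison at assumed per-1M-token prices (USD, approximate launch pricing).
PRICES = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o":      {"input": 5.00,  "output": 15.00},  # roughly half the cost
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Hypothetical workload: 200M input tokens and 50M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 200_000_000, 50_000_000):,.0f}/month")
# gpt-4-turbo: $3,500/month vs gpt-4o: $1,750/month
```

For a production app, that kind of difference (on top of the latency gains) can determine whether a feature is economically viable at all.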
III. AI Everywhere All at Once
Another theme across both sets of announcements, and indeed in Microsoft’s announcements as well, was the idea of AI integrated everywhere: deeply at the operating system level and within every application.
Google announced that Gemini will be deeply integrated into practically all its products, including Photos, Gmail, Docs and of course Search (the execution of which still leaves a lot to be desired)
Google also announced deep integrations of Gemini within the Android OS
OpenAI similarly announced a desktop app for ChatGPT on Mac, and there were reports that its models may power Siri in the future
Microsoft announced their Copilot+ PCs with built-in AI features such as Recall, in addition to launching Copilot within every Microsoft Office product.
These launches continue to bring up a key question around startup opportunities: AI will soon be in every application from incumbents and likely accessible at the OS layer, so what does this mean for startups in some of these categories? Certainly, the AI version of an email/photo/docs/presentation app could be the incumbent’s email/photo/docs/presentation app itself. It continues to highlight the need for startups to specialize in workflows and verticals and/or stay laser-focused and execute at speed.
Another interesting area related to this is local models becoming available on phones and computers, which could make it easier for applications to use the models without issues around data privacy.
IV. OpenAI’s Consumer Ambitions
It’s rare to see a “startup” simultaneously execute on as many different initiatives across developers/B2B and consumer as OpenAI is doing. On one hand, critics may argue there’s some cookie-licking going on across too many different areas. On the other hand, OpenAI is continuing to execute on an ambitious and wide-ranging roadmap across B2B and consumer quite successfully.
While OpenAI’s consumer ambitions have been quite clear over the last year, with ChatGPT itself reported to be at >$1B in revenue run rate, the recent set of announcements was another reminder of them.
Between the launch of a desktop app on Mac, a “Her”-like all-encompassing assistant enabled by GPT-4o, and reports that OpenAI may soon partner with Apple on their next-generation assistant, it’s likely that OpenAI’s models will be used by 500 million to a billion people very soon, directly from an OpenAI product.
In addition, OpenAI is making their flagship GPT-4o model available on the ChatGPT free plan, which is partially to collect more multimodal data (and highlights the progress in the cost to serve these models), but also signals that they want to continue leading the race to be the de-facto AI assistant for consumers, amidst recent competition from Google and from Meta with the launch of Meta AI.
Thanks for reading! If you liked this post, give it a heart up above to help others find it or share it with your friends.
If you have any comments or thoughts, feel free to tweet at me.
If you’re not a subscriber, you can subscribe for free below. I write about things related to technology and business once a week on Mondays.