For years, investors and founders have told me that AI capable of talking and listening the way humans do is on the cusp of a breakout moment. We're still not there.
Audio models still lag behind their text-based brethren in intelligence. If you ask an audio model like the ChatGPT voice assistant a tough math question, for instance, it's much more likely to get it wrong than a text-based chatbot, according to people who work on such technology. And if you give the model a more conversational prompt, such as asking it to explain the causes of the Cold War, it will give a less detailed and less clear response than a text-based model would, they said.
Why is that? Compared to text-based models like Opus 4.6, audio models have to devote more of their computing power just to understanding and generating sound, which leaves less capacity for advanced reasoning or for producing longer, more detailed responses.
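To make that trade-off concrete, here is a back-of-the-envelope comparison in Python. The rates below are illustrative assumptions rather than figures from any particular model: neural audio codecs commonly represent speech at tens of tokens per second, while the same sentence written out is only a handful of text tokens, so a spoken exchange consumes far more of a model's token budget before any reasoning happens.

```python
# Illustrative comparison only: the rates below are assumptions, not
# measurements from any specific audio or text model.

SENTENCE = "Explain the main causes of the Cold War."
SPOKEN_DURATION_SEC = 3.0    # rough time to say the sentence aloud (assumption)
AUDIO_TOKENS_PER_SEC = 40    # assumed codec rate; real systems vary widely
TEXT_CHARS_PER_TOKEN = 4     # common rule of thumb for text tokenizers

audio_tokens = SPOKEN_DURATION_SEC * AUDIO_TOKENS_PER_SEC
text_tokens = len(SENTENCE) / TEXT_CHARS_PER_TOKEN

# Roughly 120 audio tokens versus about 10 text tokens for the same question:
# much of the audio model's budget goes to representing the sound itself.
print(f"~{audio_tokens:.0f} audio tokens vs ~{text_tokens:.0f} text tokens")
```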
Developing audio models also comes with its own challenges. Audio recordings used as training data often have background noise, overlapping speech or microphone-quality issues. The noise has to be cleaned up and the recordings standardized before they can form a usable training dataset, which makes training audio models slower and more difficult. (These issues similarly apply to so-called multimodal models, which can understand and produce both written and spoken language.)
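As a rough illustration of the standardization step, the sketch below resamples each clip to a single sample rate, trims silence and evens out loudness. It assumes the open-source librosa and soundfile libraries, and the paths and thresholds are made up for the example; removing background noise or separating overlapping speakers would require heavier-duty tools than this.

```python
# Minimal sketch of standardizing one audio clip for a training set.
# Assumes librosa and soundfile are installed; paths and thresholds are illustrative.
import librosa
import soundfile as sf

def clean_clip(in_path: str, out_path: str, target_sr: int = 16_000) -> None:
    # Load as mono and resample so every clip shares the same sample rate.
    audio, sr = librosa.load(in_path, sr=target_sr, mono=True)

    # Trim leading and trailing silence (anything 30 dB below peak).
    audio, _ = librosa.effects.trim(audio, top_db=30)

    # Peak-normalize so loudness is consistent across microphones.
    audio = librosa.util.normalize(audio)

    sf.write(out_path, audio, target_sr)

# Hypothetical file names, used only to show the call.
clean_clip("raw/interview_001.wav", "clean/interview_001.wav")
```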