OpenAI announces new multimodal desktop GPT with new voice and vision capabilities – Computerworld

Chirag Dekate, a Gartner vice president analyst, said that while he was impressed with OpenAI’s multimodal large language model (LLM), the company was clearly playing catch-up to competitors, in contrast to its earlier status as an industry leader in generative AI tech.

“You’re now starting to see GPT enter into the multimodal era,” Dekate said. “But they’re playing catch-up to where Google was three months ago when it announced Gemini 1.5, which is its native multimodal model with a one-million-token context window.”

Still, the capabilities demonstrated by GPT-4o and its accompanying ChatGPT chatbot are impressive for a natural language processing engine. It displayed stronger conversational ability — users can interrupt it mid-response and pose new or modified queries — and it is versed in 50 languages. In one live onstage demonstration, Voice Mode translated back and forth between Murati speaking Italian and Barret Zoph, OpenAI's head of post-training, speaking English.

During a live demonstration, Zoph also wrote out an algebraic equation on paper while ChatGPT watched through his phone’s camera lens. Zoph then asked the chatbot to talk him through the solution.

While the voice recognition and conversational interactions were strikingly human-like, there were also noticeable glitches: the bot occasionally cut out mid-conversation, only to pick up again moments later.

The chatbot was then asked to tell a bedtime story. The presenters were able to interrupt it and have it add more emotion to its voice intonation, and even switch to a robotic, computer-like rendition of the story.
