OpenAI Audio Models

jeffharris · 2025-03-20T17:55:24 1742493324

Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!

minimaxir · 2025-03-20T17:25:10 1742491510

This is an official OpenAI tool linked from the new model announcement (https://openai.com/index/introducing-our-next-generation-aud... ), despite the branding difference.

islewis · 2025-03-20T17:49:19 1742492959

Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with ElevenLabs' models.

Does anyone have any experience with the realtime latency of these OpenAI TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked the time to first token, but I've found their voices to be a bit less consistent than ElevenLabs'.
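
For readers unfamiliar with the "cache and replay" workaround mentioned above, here is a minimal sketch of the idea. It assumes audio for a given (voice, text) pair can be reused as-is; `SpeechCache` and `fake_synthesize` are hypothetical names for illustration and stand in for whatever TTS call is actually used.

```python
import hashlib
from typing import Callable, Dict


class SpeechCache:
    """In-memory cache of synthesized audio keyed by (voice, text)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize      # the slow TTS call (hypothetical)
        self._store: Dict[str, bytes] = {}
        self.misses = 0                    # how many times we hit the slow path

    @staticmethod
    def _key(voice: str, text: str) -> str:
        # Hash voice and text together; the NUL byte keeps the two fields apart.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._synthesize(voice, text)
        return self._store[key]


def fake_synthesize(voice: str, text: str) -> bytes:
    # Stand-in for a real TTS request; returns placeholder bytes, not audio.
    return f"{voice}:{text}".encode("utf-8")


cache = SpeechCache(fake_synthesize)
first = cache.get("onyx", "hello")   # slow path: calls the synthesizer
second = cache.get("onyx", "hello")  # fast path: replayed from the cache
```

This only helps when the same phrases recur (greetings, prompts, fixed menus); it does nothing for time to first token on new text.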

ComputerGuru · 2025-03-20T17:47:42 1742492862

It would be much more convenient to use if changing the voice model worked on the fly, without having to stop and start the audio.

amitport · 2025-03-20T17:50:24 1742493024

Louis CK | about airplane Wi Fi

https://www.youtube.com/watch?v=me4BZBsHwZs

danso · 2025-03-20T17:37:48 1742492268

The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when repeatedly stopping and restarting the samples without changing any of the settings.

carbocation · 2025-03-20T17:51:44 1742493104

Nova+Serene sounds very metallic at the beginning about 50% of the time for me.

jeffharris · 2025-03-20T17:56:46 1742493406

some of the older voices are definitely less steerable, more robotic

we put little stars in the bottom right corner for the newer voices, which should sound better

danso · 2025-03-20T17:45:30 1742492730

Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".

I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.

The text:

> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.

> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."

edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.

nazgulsenpai · 2025-03-20T17:50:08 1742493008

Glad I'm not the only one whose inner 12-year-old curiosity is immediately triggered by free-input TTS. Swear words and just raking my hands across the keyboard to insert gibberish in every possible accent.

Gracana · 2025-03-20T17:57:16 1742493436

I am immediately reminded of this: https://www.youtube.com/watch?v=Hv6RbEOlqRo

Etheryte · 2025-03-20T17:40:06 1742492406

Recommended input for anyone trying this out:

Voice: Onyx

Vibe: Heavy German accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
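
Outside the demo page, the same voice and vibe settings could plausibly be sent to the API directly. The sketch below is an assumption-laden illustration, not something stated in this thread: the endpoint URL, the `gpt-4o-mini-tts` model name, and the `instructions` field are taken from OpenAI's public speech API as best understood, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for OpenAI's speech API; not given anywhere in this thread.
API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, voice: str, vibe: str,
                         model: str = "gpt-4o-mini-tts") -> dict:
    """Map the demo's Voice/Vibe fields onto a request body (assumed field names)."""
    return {
        "model": model,          # assumed model name
        "voice": voice.lower(),  # e.g. "Onyx" in the demo becomes "onyx"
        "input": text,           # the text to be spoken
        "instructions": vibe,    # free-form description of how to speak
    }


def synthesize(payload: dict, api_key: str) -> bytes:
    """POST the payload and return the raw audio bytes (needs network and a key)."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


payload = build_speech_request(
    text="I'll be back.",
    voice="Onyx",
    vibe=("Heavy German accent, doing an Arnold Schwarzenegger impression, "
          "way over the top for comedic effect. Deep booming voice, "
          "uses pauses for dramatic effect."),
)
```

Calling `synthesize(payload, api_key)` with a valid key should return audio bytes that can be written to a file; check the current API reference before relying on any of the field names.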

varunneal · 2025-03-20T17:38:48 1742492328

One of the most novel demos I've seen OpenAI ship in a few years. I love how it looks almost like a synth. Fun to play around with!

jcmp · 2025-03-20T17:45:25 1742492725

What do you call this design/UI aesthetic? I like it

vyrotek · 2025-03-20T17:57:48 1742493468

"teenage engineering"

havefunbesafe · 2025-03-20T17:56:27 1742493387

by copying Teenage Engineering

randomcatuser · 2025-03-20T17:55:02 1742493302

neumorphism!

KuzeyAbi · 2025-03-20T17:52:41 1742493161

You can screenshot and ask chatgpt lol

nickthegreek · 2025-03-20T17:38:31 1742492311

Try the refresh button to get a new list of vibe styles.

stephenheron · 2025-03-20T17:38:57 1742492337

Quite disappointing that their speech-to-text models are not open source. Whisper was really good, and it was great that it was open to play around with. I guess this continues OpenAI's approach of not really being open!

nickthegreek · 2025-03-20T17:40:09 1742492409

Indeed. Right now I think our open choices are Piper, Kokoro and Orpheus.
