(Comments)

Original link: https://news.ycombinator.com/item?id=43426022

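OpenAI has released its latest state-of-the-art audio models: two speech-to-text models that outperform Whisper, and a new text-to-speech model that accepts custom speaking instructions and can be tried at openai.fm. The Agents SDK now supports audio, making voice agents possible. The announcement sparked discussion on Hacker News, where users praised the demo's format and voice quality, though some noticed a metallic sound in certain voices. Users tested the model's capabilities, including how it handles explicit language and various accents, and found that content filtering is inconsistent across voices. The demo's user interface reminded commenters of Teenage Engineering's design aesthetic. A key concern was that the new speech-to-text models are not open source, in contrast to Whisper. Users pointed to open-source alternatives such as Piper, Kokoro, and Orpheus.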


  • Original text
    OpenAI Audio Models (openai.fm)
    70 points by KuzeyAbi 41 minutes ago | 21 comments










    Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!


    This is an official OpenAI tool linked from the new model announcement (https://openai.com/index/introducing-our-next-generation-aud... ), despite the branding difference.


    Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with Eleven Labs' models.

    Does anyone have any experience with the realtime latency of these OpenAI TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked the time to first token, but I've found their voices to be a bit less consistent than ElevenLabs'.
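
The "cache and replay" workaround mentioned above could look roughly like the sketch below. It is not tied to any particular vendor: `TTSCache`, `fake_synthesize`, and the `(voice, text)` call shape are hypothetical names chosen for illustration, and a real setup would swap the stand-in function for an actual TTS request.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """In-memory cache for synthesized audio (hypothetical helper)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        # `synthesize(voice, text)` is an assumed signature, not a real API.
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.misses = 0

    @staticmethod
    def _key(voice: str, text: str) -> str:
        # Hash voice and text together so each pair gets its own entry.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key not in self._store:
            # Only the first request for a given pair pays the TTS latency.
            self.misses += 1
            self._store[key] = self._synthesize(voice, text)
        return self._store[key]


def fake_synthesize(voice: str, text: str) -> bytes:
    # Stand-in for a slow network call; returns placeholder bytes, not audio.
    return f"{voice}:{text}".encode("utf-8")


cache = TTSCache(fake_synthesize)
first = cache.get("onyx", "Welcome back!")
second = cache.get("onyx", "Welcome back!")  # replayed from the cache
```

This only helps when the same phrases recur (greetings, prompts, menu items); it does nothing for first-request latency on novel text.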



    It would be much more convenient to use if changing the voice model worked on the fly, without having to stop and start the audio.


    Louis CK | about airplane Wi Fi

    https://www.youtube.com/watch?v=me4BZBsHwZs



    The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when repeatedly stopping and restarting the samples without changing any of the settings.


    Nova+Serene sounds very metallic at the beginning about 50% of the time for me.


    some of the older voices are definitely less steerable, more robotic

    we put little stars in the bottom right corner for the newer voices, which should sound better



    Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".

    I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.

    The text:

    > In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.

    > "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."

    edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.



    Glad I'm not the only one whose inner 12 year old curiosity is immediately triggered by free input TTS. Swear words and just raking my hands across the keyboard to insert gibberish in every possible accent.


    I am immediately reminded of this: https://www.youtube.com/watch?v=Hv6RbEOlqRo


    Recommended input for anyone trying this out:

    Voice: Onyx

    Vibe: Heavy German accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
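
For anyone who wants to try the same settings outside the demo page, here is a minimal sketch of how the request body might be assembled. It assumes the demo's "Vibe" field corresponds to an `instructions` parameter on OpenAI's speech endpoint and that the model is named `gpt-4o-mini-tts`; neither detail is stated in this thread, and `build_speech_request` is a hypothetical helper. The sketch only builds the JSON payload and sends nothing.

```python
import json
from typing import Dict


def build_speech_request(
    text: str,
    voice: str = "onyx",
    instructions: str = "",
    model: str = "gpt-4o-mini-tts",  # assumed model name, not from the thread
) -> Dict[str, str]:
    """Assemble a JSON-serializable body for a text-to-speech request."""
    payload = {"model": model, "voice": voice, "input": text}
    if instructions:
        # Assumption: the demo's "Vibe" text maps to this field.
        payload["instructions"] = instructions
    return payload


payload = build_speech_request(
    "I'll be back.",
    voice="onyx",
    instructions=(
        "Heavy German accent, way over the top for comedic effect. "
        "Deep booming voice, uses pauses for dramatic effect."
    ),
)
print(json.dumps(payload, indent=2))
```

Check the official API reference for the current model names and parameters before relying on any of these values.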



    One of the most novel demos I've seen OpenAI ship in a few years. I love how it looks almost like a synth. Fun to play around with!


    What do you call this design/UI aesthetic? I like it


    "teenage engineering"


    by copying Teenage Engineering


    neumorphism!


    You can screenshot and ask chatgpt lol


    Try the refresh button to get a new list of vibe styles.


    Quite disappointing their speech to text models are not open source. Whisper was really good and it was great it was open to play around with. I guess this continues OpenAI's approach of not really being open!


    Indeed. Right now I think our open choices are Piper, Kokoro and Orpheus.





