AudioX: Diffusion Transformer for Anything-to-Audio Generation

Original link: https://zeyuet.github.io/AudioX/

AudioX proposes a unified approach to audio and music generation, addressing the limitations of today's isolated, modality-specific models. The model, a Diffusion Transformer designed for "anything-to-audio" generation, produces high-quality audio and music while seamlessly integrating diverse inputs such as text, video, images, and existing audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities, encouraging robust cross-modal representation learning. To address data scarcity, the authors curate two large datasets: vggsound-caps (190K audio captions) and V2M-caps (6 million music captions). Experiments show that AudioX matches or outperforms specialized models while exhibiting impressive versatility. By unifying audio and music generation in a single architecture and leveraging a novel training strategy, AudioX advances the state of the art in cross-modal audio generation.
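To make the multi-modal masked training idea concrete, here is a minimal PyTorch sketch of one plausible implementation: each modality is projected into a shared dimension, whole modalities are randomly replaced with a learned mask token during training, and the result is concatenated into a single conditioning sequence for the diffusion model. The module name, feature dimensions, masking probability, and modality set are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of multi-modal masked training conditioning.
# All dimensions and the masking probability are assumed for
# illustration; the paper's implementation may differ.
import torch
import torch.nn as nn


class MultiModalMaskedConditioner(nn.Module):
    """Embeds per-modality features, randomly masks whole modalities
    during training, and returns a unified condition sequence."""

    def __init__(self, d_model: int = 512, p_mask: float = 0.3):
        super().__init__()
        self.p_mask = p_mask
        # Per-modality projections into the shared model dimension
        # (input dims are placeholders, e.g. T5 / CLIP / audio latents).
        self.proj = nn.ModuleDict({
            "text": nn.Linear(768, d_model),
            "video": nn.Linear(1024, d_model),
            "audio": nn.Linear(128, d_model),
        })
        # One learned mask token per modality.
        self.mask_token = nn.ParameterDict({
            k: nn.Parameter(torch.zeros(1, 1, d_model)) for k in self.proj
        })

    def forward(self, feats: dict) -> torch.Tensor:
        # feats: {modality_name: tensor of shape (batch, seq_len, feat_dim)}
        parts = []
        for name, x in feats.items():
            h = self.proj[name](x)
            if self.training and torch.rand(()) < self.p_mask:
                # Mask the entire modality: the model must reconstruct
                # the target from the remaining modalities.
                h = self.mask_token[name].expand_as(h)
            parts.append(h)
        # Concatenate along the sequence axis into one condition stream.
        return torch.cat(parts, dim=1)


if __name__ == "__main__":
    cond = MultiModalMaskedConditioner()
    cond.train()
    feats = {
        "text": torch.randn(2, 16, 768),
        "video": torch.randn(2, 32, 1024),
        "audio": torch.randn(2, 64, 128),
    }
    print(cond(feats).shape)  # torch.Size([2, 112, 512])
```

In this sketch, masking at the modality level (rather than per token) is what forces the model to rely on whichever inputs remain, which is one way to obtain the unified cross-modal representations the summary describes.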

Hacker News discussion (submitted by gnabgib; 21 points, 1 comment):

Fauntleroy: The video-to-audio examples are really impressive! The video of the band playing shows some of the clear shortcomings of this approach (people have very precise expectations of what five trombones should sound like), but the tennis example shows its strengths: the timing of the ball hits is good, and the acoustics of the large indoor space are surprisingly accurate. I'm excited to see this technique improve over the next few papers!

Original text

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture.
