Omnilingual ASR: Advancing automatic speech recognition for 1600 languages

Original link: https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition/?_fb_noscript=1

FAIR has released a comprehensive suite of open source models and the Omnilingual ASR Corpus to advance speech technology for all languages. The release includes general-purpose automatic speech recognition (ASR) models built on wav2vec 2.0, ranging from lightweight versions to a high-accuracy 7B variant, under permissive licenses (Apache 2.0 & CC-BY). A key focus is extending ASR to underrepresented languages: the released dataset is the largest ultra-low-resource spontaneous ASR dataset created to date, covering hundreds of previously unsupported languages, made possible through collaboration with local organizations and native speakers in remote regions. Partnerships with organizations such as Mozilla's Common Voice and Lanfrica/NaijaVoices ensure linguistic accuracy and cultural relevance. The initiative enables researchers, developers, and language communities to build and customize speech solutions worldwide using the latest PyTorch tooling.

## Omnilingual ASR: Summary

Meta AI has released Omnilingual ASR, a speech recognition system that can handle **1,600 languages**. The project is available on [Hugging Face](https://huggingface.co/spaces/facebook/omniasr-transcription...) and [GitHub](https://github.com/facebookresearch/omnilingual-asr) and is designed to be community-driven, allowing users to add languages with minimal data.

Impressive as it is, discussion has highlighted potential challenges. Concerns have been raised about accuracy on **tonal languages** and **rare phonemes** (such as clicks). The demo currently focuses on "low-resource" languages with an error rate below 10%, though the definition of that rate (character vs. word error) is unclear.

Users have also pointed out inaccuracies in the project's language-endangerment classifications (for example, Swedish being labeled endangered). Despite these issues, the model shows promise, even outperforming Whisper-large-v3 on benchmarks, and represents a significant step toward universal speech recognition. Notably, the model stands out for its support of rare languages, but it may be less accurate on widely spoken ones.
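Since part of the debate above turns on character error rate (CER) versus word error rate (WER), here is a minimal sketch of how the two metrics differ; the Levenshtein helper and the sample strings are illustrative, not taken from the release.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if symbols match)
            )
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edits divided by reference length in characters."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edits divided by reference length in words."""
    ref_w, hyp_w = ref.split(), hyp.split()
    return edit_distance(ref_w, hyp_w) / max(len(ref_w), 1)

# The same transcript can score very differently under the two metrics.
print(cer("speech recognition", "speach recognitoin"))  # low: only a few characters wrong
print(wer("speech recognition", "speach recognitoin"))  # 1.0: both words are wrong
```

Because a single wrong character makes the whole word count as an error, WER is usually much higher than CER on the same output, which is why knowing which metric a "below 10%" claim refers to matters.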

## Original Article

We’re releasing a full suite of models and one dataset. Built on the foundation of FAIR’s previous research, Omnilingual ASR gives stakeholders everything they need to expand and improve speech technology for any language.

Both decoder variants, a CTC-based decoder and an LLM-inspired transformer decoder, are available as a versatile family of models, from lightweight 300M versions designed for low-power devices to powerful 7B models that offer top-tier accuracy for a variety of use cases. Our general-purpose speech foundation model wav2vec 2.0 is also made available at various sizes. It can be used by researchers and developers alike to enable speech-related tasks beyond ASR.
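As a minimal sketch of that reuse pattern, the snippet below extracts frame-level features with torchaudio's bundled wav2vec 2.0 checkpoint. The bundle is a stand-in for illustration, not the Omnilingual release, whose checkpoints ship through fairseq2.

```python
import torch
import torchaudio

# Stand-in checkpoint: torchaudio's bundled wav2vec 2.0, not the Omnilingual release.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Load a waveform and resample to the encoder's expected rate (16 kHz here).
waveform, sample_rate = torchaudio.load("utterance.wav")
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer, each of
    # shape (batch, frames, hidden_dim); these frame-level representations
    # can feed downstream tasks beyond ASR, e.g. language ID or diarization.
    features, _ = model.extract_features(waveform)

print(len(features), features[-1].shape)
```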

All model assets are released under a permissive Apache 2.0 license, while the data is provided under the CC-BY license. The models are built on FAIR's open source fairseq2 framework, empowering researchers, developers, and language advocates worldwide to advance and tailor speech solutions for their own use cases using the latest tools and technologies in the PyTorch ecosystem.

Omnilingual ASR also advances the state of multilingual ASR along more familiar dimensions. Its training corpus is one of the largest ever assembled for ASR in both volume and linguistic diversity, integrating publicly available datasets with community-sourced speech recordings collected through multiple partnerships.

To reach languages with little or no digital presence, we worked with local organizations that recruited and compensated native speakers, often in remote or under-documented regions. We’re releasing this commissioned part of our training corpus as Omnilingual ASR Corpus to further benefit the ASR research community. To date, it is the largest ultra-low-resource spontaneous ASR dataset ever made available, covering hundreds of languages never seen before by ASR systems. Explore the languages in the dataset here.
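Assuming the commissioned corpus is published as a Hugging Face dataset (the repository ID, configuration name, and field names below are guesses for illustration; consult the release for the real ones), loading one language split might look like this:

```python
from datasets import load_dataset

# Hypothetical dataset ID and language configuration; check the Omnilingual
# ASR release for the actual repository name and per-language config names.
ds = load_dataset("facebook/omnilingual-asr-corpus", "lug_Latn", split="train")

for example in ds.select(range(3)):
    # Typical ASR dataset fields: an audio dict (array + sampling_rate) and a
    # reference transcription; the actual field names may differ.
    audio = example["audio"]
    print(audio["sampling_rate"], example["text"][:60])
```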

Beyond commissioned partnerships, collaborations through the Language Technology Partner Program have brought together linguists, researchers, and language communities from around the world, providing essential expertise and resources. We joined forces with organizations such as Mozilla Foundation’s Common Voice and Lanfrica/NaijaVoices to work directly with local communities.

These partnerships have been instrumental in infusing Omnilingual ASR with deep linguistic knowledge and cultural understanding, ensuring that the technology meets local needs and empowers diverse language communities globally.
