Meta AI releases Omnilingual ASR: a suite of open-source multilingual speech recognition models covering more than 1,600 languages

How do you build a single speech recognition system that can understand more than 1,600 languages, including many that no previous ASR (automatic speech recognition) model has ever covered? Meta AI has released Omnilingual ASR, an open-source speech recognition suite that scales to more than 1,600 languages and extends to unseen languages with just a few speech-text examples, without retraining the model.

Data and language coverage

Supervised training data comes from a combined corpus called AllASR, which contains 120,710 hours of labeled speech and transcripts in 1,690 languages. The corpus draws on multiple sources, including open-source datasets, internal and licensed corpora, partner-created data, and a new collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech in 348 languages, collected through fieldwork with local organizations and speakers in regions such as Africa and South Asia. Prompts are open-ended, so speakers produce natural monologues in their own words rather than reading canned sentences, yielding more realistic acoustic and lexical variation.

For self-supervised pre-training, the wav2vec 2.0 encoder is trained on a large unlabeled speech corpus. The pre-training dataset contains 3.84 million hours of speech with language identification across 1,239 languages, plus an additional 460,000 hours without language identification, for a total of roughly 4.3 million hours of unlabeled audio. This is still significantly less than the 12 million hours used by Google's USM, which makes the reported results more interesting from a data-efficiency perspective.

Model family

Omnilingual ASR exposes three main model families, all sharing the same wav2vec 2.0 speech encoder backbone:

  1. SSL encoders (omniASR_W2V)
    Self-supervised wav2vec 2.0 encoders at four sizes:
    omniASR_W2V_300M with 317,390,592 parameters
    omniASR_W2V_1B with 965,514,752 parameters
    omniASR_W2V_3B with 3,064,124,672 parameters
    omniASR_W2V_7B with 6,488,487,168 parameters
    The encoders are trained with the standard wav2vec 2.0 contrastive objective. After training, the quantizer is discarded and the encoder serves as the speech representation backbone.
  2. CTC (Connectionist Temporal Classification) ASR models
    The CTC models add a simple linear layer on top of the encoder and are trained end-to-end with a character-level CTC loss. Published CTC models range from 325,494,996 to 6,504,786,132 parameters, with real-time factors as low as 0.001 for the 300M model on an A100 transcribing 30 seconds of audio at batch size 1. A minimal sketch of this setup appears after this list.
  3. LLM ASR models
    LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a Transformer language model that operates on character-level tokens plus special tokens that delimit the transcript. It is trained with standard next-token prediction on sequences that concatenate the encoded speech gs(x) and the embedded transcript gt(y), framed by the special tokens, where gs is the speech encoder and gt is the text embedding matrix. The LLM ASR family ranges from approximately 1.63B parameters (omniASR_LLM_300M) to 7,801,041,536 parameters (omniASR_LLM_7B). A separate checkpoint, omniASR_LLM_7B_ZS, with 7,810,900,608 parameters, is used for zero-shot ASR.
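
The following is a minimal sketch of the CTC variant described in item 2: a wav2vec 2.0 encoder with a single linear projection to a character vocabulary, trained end-to-end with CTC loss. It is not the released implementation; the encoder interface, vocabulary size, and blank index are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CTCASRHead(nn.Module):
    """Sketch: wav2vec 2.0 backbone + linear layer producing character logits per frame."""

    def __init__(self, encoder: nn.Module, encoder_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                          # assumed to map (batch, samples) -> (batch, time, encoder_dim)
        self.proj = nn.Linear(encoder_dim, vocab_size)  # character logits per frame

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(speech)                   # (batch, time, encoder_dim)
        return self.proj(frames)                        # (batch, time, vocab_size)


def ctc_training_step(model, speech, targets, input_lengths, target_lengths, blank_id=0):
    """One end-to-end training step with character-level CTC loss."""
    logits = model(speech)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # CTC expects (time, batch, vocab)
    loss = nn.functional.ctc_loss(
        log_probs, targets, input_lengths, target_lengths, blank=blank_id
    )
    return loss
```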

All LLM ASR models support optional language conditioning. The language is expressed as {language_code}_{script}, for example eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified Han script. A learned language-script identifier embedding is injected into the decoder input. During training, the language ID tag is sometimes dropped, so the model can also run without an explicit language tag at inference time.
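
As a rough illustration of this conditioning scheme (not the released code), the sketch below prepends a learned embedding for a {language_code}_{script} token to the decoder input and randomly drops the tag during training. The token list, drop probability, and embedding width are assumptions.

```python
import random
import torch
import torch.nn as nn

LANG_TOKENS = ["eng_Latn", "cmn_Hans", "swh_Latn"]      # illustrative subset
lang_to_id = {tok: i for i, tok in enumerate(LANG_TOKENS)}
lang_embedding = nn.Embedding(len(LANG_TOKENS), 1024)   # 1024 = assumed decoder width


def build_decoder_prefix(speech_embeddings: torch.Tensor,
                         lang_token: str | None,
                         p_drop_lang: float = 0.3,
                         training: bool = True) -> torch.Tensor:
    """Concatenate an optional language-script embedding with the speech embeddings."""
    parts = []
    # Drop the language tag with some probability during training, so the model
    # also learns to transcribe without explicit language conditioning.
    if lang_token is not None and not (training and random.random() < p_drop_lang):
        lang_id = torch.tensor([lang_to_id[lang_token]])
        parts.append(lang_embedding(lang_id))            # (1, 1024)
    parts.append(speech_embeddings)                      # (T, 1024) from the wav2vec 2.0 encoder
    return torch.cat(parts, dim=0)                       # decoder input prefix
```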

Zero-shot ASR with contextual examples and SONAR

Supervised models cover more than 1,600 languages, but many languages still have no transcribed ASR data at all. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero-shot mode trained on in-context examples.

During training of the zero-shot variant, the decoder consumes N + 1 speech-text pairs from the same language. The first N pairs act as context and the last pair is the target. All pairs are embedded with the speech encoder and text embedding matrix and then concatenated into a single decoder input sequence. The loss remains next-token prediction on the target transcript. This teaches the decoder to infer the speech-to-text mapping of a language from a small prompt of in-language examples.
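
A minimal sketch of how such an in-context training sequence could be assembled from N context pairs plus one target pair is shown below. The helpers speech_encoder and embed_text, and the masking convention, are assumptions for illustration rather than the released code.

```python
import torch


def build_in_context_sequence(context_pairs, target_audio, target_text,
                              speech_encoder, embed_text):
    """context_pairs: list of (audio_tensor, transcript_str) from one language."""
    segments = []
    for audio, text in context_pairs:                 # the N in-context examples
        segments.append(speech_encoder(audio))        # (T_i, d) speech embeddings
        segments.append(embed_text(text))             # (L_i, d) character embeddings
    segments.append(speech_encoder(target_audio))     # target utterance
    target_embeddings = embed_text(target_text)       # supervised only on this part
    segments.append(target_embeddings)

    decoder_input = torch.cat(segments, dim=0)        # single decoder input sequence
    loss_mask = torch.zeros(decoder_input.shape[0], dtype=torch.bool)
    loss_mask[-target_embeddings.shape[0]:] = True    # next-token loss on target transcript only
    return decoder_input, loss_mask
```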

At inference time, the omniASR_LLM_7B_ZS model can be given a few speech-text examples in any language, including languages not present in training, and then transcribe new utterances in that language without updating any weights. This is in-context learning for ASR.

The system includes an example retrieval mechanism based on SONAR, a multilingual, multimodal encoder that projects audio and text into a shared embedding space. The target audio is embedded once, then a nearest-neighbor search over a speech-text database selects the most relevant examples for the context window. This SONAR-based selection improves zero-shot performance compared with random example selection or simple text similarity.
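
As a rough sketch of this retrieval step under stated assumptions (precomputed SONAR-style embeddings, cosine similarity, top-k selection), the target utterance is embedded once and compared against the candidate database:

```python
import torch


def retrieve_context_examples(target_audio_embedding: torch.Tensor,
                              example_embeddings: torch.Tensor,
                              examples: list,
                              k: int = 4) -> list:
    """
    target_audio_embedding: (d,) embedding of the utterance to transcribe.
    example_embeddings:     (n, d) embeddings of candidate speech-text pairs.
    examples:               list of n (audio, transcript) pairs aligned with the embeddings.
    """
    query = torch.nn.functional.normalize(target_audio_embedding, dim=0)
    database = torch.nn.functional.normalize(example_embeddings, dim=1)
    scores = database @ query                         # cosine similarity, shape (n,)
    top_idx = scores.topk(k).indices
    return [examples[i] for i in top_idx]             # most relevant in-context examples
```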

Quality and Benchmarks

The omniASR_LLM_7B model achieves character error rates below 10% for 78% of the more than 1,600 supported languages.

The research team reports that the 7B LLM ASR model outperforms the 7B CTC model on multilingual benchmarks such as FLEURS-102, and also beats Google's USM variants in average character error rate, despite using roughly 4.3 million hours of unlabeled audio (versus 12 million) and a simpler pre-training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM-style decoder is an effective recipe for high-coverage multilingual ASR.

Main points

  1. Omnilingual ASR provides open-source ASR covering more than 1,600 languages and can generalize to more than 5,400 languages via zero-shot in-context learning.
  2. The models are built on a scaled-up wav2vec 2.0 encoder trained on roughly 4.3 million hours of unlabeled audio, covering 1,239 identified languages plus additional speech without language labels.
  3. The suite includes wav2vec 2.0 encoders, CTC ASR, LLM ASR, and a dedicated zero-shot LLM ASR model, with encoder sizes ranging from 300M to 7B parameters and LLM ASR models up to about 7.8B parameters.
  4. The 7B LLM ASR model achieves character error rates below 10% on 78% of the 1,600+ supported languages, which is competitive with or better than previous multilingual systems in low-resource settings.

Omnilingual ASR is a significant system-level contribution because it treats multilingual ASR as an extensible framework rather than a fixed list of languages. It combines the 7B wav2vec 2.0 encoder, CTC and LLM ASR decoders, and a zero-shot LLM ASR model that can adapt to new languages from a few in-context examples, achieves character error rates below 10% on 78% of the 1,600+ supported languages, and releases everything under Apache 2.0 and CC BY 4.0 licenses. Overall, this release positions Omnilingual ASR as the most language-scalable open-source speech recognition system currently available.


Check out the paper, repo, and technical details. Tutorials, code, and notebooks are available on our GitHub page.


Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.

