
Meet mmBERT: An encoder-only language model pretrained on multilingual text in more than 1,800 languages, running 2–4× faster than previous models

Why are new multilingual encoders needed?

XLM-RoBERTa (XLM-R) has dominated multilingual NLP for more than five years, an unusually long reign in AI research. Although encoder-only models like BERT and RoBERTa were at the heart of early advances, most research energy has since turned to decoder-based generative models. Yet encoders are more efficient for embedding, retrieval, and classification tasks and often outperform decoders there. Despite this, multilingual encoder development has stalled.

A team of researchers at Johns Hopkins University proposed mmBERT to address this gap, delivering a modern encoder that surpasses XLM-R and even outperforms recent large-scale models such as OpenAI’s o3 and Google’s Gemini 2.5 Pro on low-resource languages.

What is mmBERT’s architecture?

There are two main configurations of mmBERT:

  • Base model: 22 transformer layers, an intermediate (feed-forward) dimension of 1152, and about 307M parameters (110M non-embedding).
  • Small model: about 140M parameters (42M non-embedding).

It adopts the Gemma 2 tokenizer for token efficiency with a 256k vocabulary, rotary position embeddings (RoPE), and FlashAttention-2. Sequence length is extended from 1024 to 8192 tokens using unpadded embeddings and sliding-window attention. This allows mmBERT to handle contexts nearly an order of magnitude longer than XLM-R while maintaining faster inference.
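As a quick illustration of how such an encoder is typically used, here is a minimal sketch that loads a checkpoint with Hugging Face transformers and mean-pools sentence embeddings over a long context. The repository id jhu-clsp/mmBERT-base is an assumption, so substitute the actual released checkpoint name.

```python
# Minimal usage sketch (assumption: the checkpoint is published as
# "jhu-clsp/mmBERT-base" on Hugging Face; adjust to the actual repo id).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base").eval()

texts = ["The quick brown fox.", "Der schnelle braune Fuchs.", "こんにちは世界"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=8192,
                  return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one embedding per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```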

What training data and stages were used?

mmBERT was trained on 3 trillion tokens spanning 1,833 languages. Data sources include FineWeb2, Dolma, MegaWika v2, and StarCoder, among others. Depending on the stage, English accounts for only about 10–34% of the corpus.

The training is divided into three stages (sketched in code after the list):

  1. Pre-training: 2.3T tokens spanning 60 languages and code.
  2. Mid-training: 600B tokens across 110 languages, focusing on higher-quality sources.
  3. Decay phase: 100B tokens covering 1,833 languages, emphasizing low-resource adaptation.
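For concreteness, the staged budget can be written down as plain data. The token and language figures below come from the description above, while the structure and field names are only an illustrative assumption.

```python
# Illustrative summary of the three-stage recipe described above.
# Token budgets and language counts are taken from the article; the
# dictionary layout itself is only a sketch.
TRAINING_STAGES = [
    {"stage": "pre-training", "tokens": 2.3e12, "languages": 60,
     "focus": "broad web data plus code"},
    {"stage": "mid-training", "tokens": 6.0e11, "languages": 110,
     "focus": "higher-quality sources"},
    {"stage": "decay", "tokens": 1.0e11, "languages": 1833,
     "focus": "low-resource adaptation"},
]

total = sum(s["tokens"] for s in TRAINING_STAGES)
print(f"Total budget: {total / 1e12:.1f}T tokens")  # ~3.0T
```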

What new training strategies have been introduced?

Three major innovations drive mmBERT’s performance:

  • Annealed Language Learning (ALL): Languages are introduced gradually (60→110→1,833), and the sampling distribution is annealed from high-resource-biased toward uniform, so low-resource languages gain influence in the later stages without overfitting their limited data (see the sketch after this list).
  • Inverse masking schedule: The masking rate starts at 30% and decays to 5%, encouraging coarse-grained learning early and fine-grained learning later.
  • Model merging across decay variants: Several decay-phase models (English-focused, 110-language, and 1,833-language) are combined via model merging, exploiting their complementary strengths without retraining from scratch.
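To make the first two ideas concrete, here is a small sketch of temperature-based language sampling that anneals toward uniform, alongside the decaying mask rate. The linear schedules and temperature endpoints are illustrative assumptions, not the paper’s exact settings.

```python
import numpy as np

def language_sampling_probs(token_counts, tau):
    """Temperature-style sampling: tau=1.0 follows the raw corpus sizes
    (favoring high-resource languages); tau -> 0 approaches uniform."""
    weights = np.asarray(token_counts, dtype=float) ** tau
    return weights / weights.sum()

def annealed_tau(progress, tau_start=1.0, tau_end=0.1):
    """Linearly anneal the temperature over training progress in [0, 1].
    The endpoint values here are illustrative assumptions."""
    return tau_start + (tau_end - tau_start) * progress

def mask_rate(progress, start=0.30, end=0.05):
    """Inverse masking schedule: 30% masking early, decaying to 5% late."""
    return start + (end - start) * progress

# Toy corpus: one high-resource, one medium, one low-resource language.
counts = [1_000_000_000, 50_000_000, 1_000_000]
for progress in (0.0, 0.5, 1.0):
    probs = language_sampling_probs(counts, annealed_tau(progress))
    print(progress, np.round(probs, 3), round(mask_rate(progress), 2))
```

Early in training the distribution is dominated by the high-resource language; by the end it is close to uniform, which is how low-resource languages receive meaningful weight in the decay phase.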

How does mmBERT perform on benchmarks?

  • English NLU (GLUE): mmBERT base scores 86.3, surpassing XLM-R (83.3) and nearly matching ModernBERT (87.4), even though more than 75% of its training data is non-English.
  • Multilingual NLU (XTREME): mmBERT base scores 72.8 vs. 70.4 for XLM-R, with gains on classification and QA tasks.
  • Embedding tasks (MTEB v2): mmBERT base ties ModernBERT in English (53.9 vs. 53.8) and leads on multilingual tasks (54.1 vs. 52.4 for XLM-R).
  • Code retrieval (CoIR): mmBERT outperforms XLM-R by roughly 9 points, although EuroBERT remains stronger on proprietary data.

How does mmBERT handle low-resource languages?

The annealed learning schedule ensures that low-resource languages benefit during the later training phases. On benchmarks such as Faroese FoQA and Tigrinya TiQuAD, mmBERT performs significantly better than o3 and Gemini 2.5 Pro. These results show that, with careful training, encoder models can generalize effectively even in extremely low-resource scenarios.

What efficiency gains does mmBERT achieve?

mmBERT runs 2–4× faster than XLM-R and MiniLM while supporting 8192-token inputs. Notably, it is faster at 8192 tokens than older encoders were at much shorter sequence lengths. This speed comes from its modern training recipe, efficient attention mechanisms, and optimized embeddings.
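A rough way to sanity-check a throughput claim like this is to time forward passes directly. The probe below is only a sketch: the mmBERT repository id is assumed, random token ids stand in for real text, and absolute numbers will depend heavily on hardware and attention kernels.

```python
# Crude throughput probe (illustrative only; the mmBERT repo id is an assumption).
import time
import torch
from transformers import AutoTokenizer, AutoModel

def tokens_per_second(model_id: str, seq_len: int, repeats: int = 3) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()
    ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len))
    mask = torch.ones_like(ids)
    with torch.no_grad():
        model(input_ids=ids, attention_mask=mask)  # warm-up pass
        start = time.perf_counter()
        for _ in range(repeats):
            model(input_ids=ids, attention_mask=mask)
        elapsed = time.perf_counter() - start
    return repeats * seq_len / elapsed

# Long-context mmBERT vs. XLM-R at its 512-token limit.
print(tokens_per_second("jhu-clsp/mmBERT-base", seq_len=8192))
print(tokens_per_second("xlm-roberta-base", seq_len=512))
```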

Summary

mmBERT is the long-awaited successor to XLM-R, redefining what multilingual encoders can offer. It runs 2–4× faster, handles sequences up to 8K tokens, and outperforms previous models both on high-resource benchmarks and on low-resource languages where earlier encoders fell short. Its training recipe (3 trillion tokens combined with annealed language learning, an inverse masking schedule, and model merging) shows how careful design can unlock broad generalization without excessive redundancy. The result is an open, efficient, and scalable encoder that not only fills the six-year gap since XLM-R but also provides a strong foundation for the next generation of multilingual NLP systems.


Check out the paper, the models on Hugging Face, and the GitHub page for technical details.


Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and understandable to a wide audience. The platform draws over 2 million views per month, demonstrating its popularity with readers.
