
Meta AI researchers introduce AU-Net, a scalable byte-level autoregressive U-Net model that outperforms token-based transformers on language modeling benchmarks

Language modeling plays a fundamental role in natural language processing, enabling machines to predict and generate text that resembles human language. These models began with statistical methods and evolved through neural architectures to today’s large-scale transformer systems. At the center of many applications, such as chatbots, translation tools, and text-completion engines, language models interpret and generate sequences of words or bytes. Their effectiveness depends to a large extent on the underlying architecture and the data representation used. As the demand for more efficient and scalable models grows, researchers continue to explore new architectures and training methods to improve performance, handle longer contexts, and reduce computational load. Against this backdrop, combining convolutional architectures with autoregressive prediction is an appealing approach.

Challenges of tokenization and transformer-based language models

One of the main problems in language modeling is the field’s reliance on token-based transformer models, which are computationally expensive and often inefficient when processing text at the byte level or across languages. Techniques such as byte pair encoding keep sequence lengths manageable but introduce inconsistencies across languages and domains. Transformers are accurate, yet their quadratic attention complexity limits scalability to long sequences. Competing methods, such as sparse attention, try to address this but often sacrifice simplicity or performance. Byte-level modeling with plain transformers has shown only partial success, highlighting the need for new architectures that can handle raw byte input without tokenization while still delivering strong performance.
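
To make the contrast with tokenizers concrete, the minimal sketch below (not from the paper) shows how raw UTF-8 bytes already form a fixed, language-agnostic vocabulary of 256 symbols, at the cost of longer sequences than BPE would produce.

```python
# Minimal illustration (not from the paper): raw UTF-8 bytes as model inputs.
# Every string maps to integers in [0, 255], so the "vocabulary" is fixed at
# 256 symbols for any language or domain, with no learned tokenizer required.
text = "Héllo, 世界"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # [72, 195, 169, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140]
print(len(byte_ids))  # 14 bytes -- longer than a BPE tokenization of the same text
```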

Introduction to AU-Net: a token-free byte-level language model

Researchers from FAIR at Meta, TAU, INRIA, and LISN (CNRS & Université Paris-Saclay), together with INSA Rouen Normandy, LITIS, Rouen, France, have introduced a new Autoregressive U-Net (AU-Net). The model combines a convolutional U-Net design with an autoregressive decoding process. Unlike transformer systems, AU-Net requires no tokenization and works directly on bytes. The architecture is designed for parallel and efficient generation while preserving autoregressive capabilities. It achieves this by hierarchically encoding the input with downsampling convolutions and then upsampling in later stages to restore the original sequence length. Notably, AU-Net introduces a splitting mechanism that predicts sub-segments of the sequence in parallel, improving scalability. This design also means that the model’s complexity grows linearly with sequence length rather than quadratically. The researchers evaluated the model on several language modeling benchmarks and multilingual tasks to test its effectiveness in both low-resource and large-scale settings.
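
As a rough illustration of these ideas, the PyTorch sketch below embeds raw bytes, downsamples them with a strided causal convolution, and upsamples back to byte resolution before predicting the next byte at every position. The layer sizes, pooling scheme, and single-stage hierarchy are illustrative assumptions, not the configuration described in the paper.

```python
# A minimal sketch of a byte-level autoregressive U-Net-style model in PyTorch.
# Layer choices, dimensions, and the single down/up stage are illustrative
# assumptions; the actual AU-Net uses a deeper multi-stage hierarchy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees current and past positions (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                        # x: (batch, channels, length)
        return self.conv(F.pad(x, (self.pad, 0)))

class ByteAUNetSketch(nn.Module):
    """Embed bytes, downsample with a strided causal conv, process at the coarse
    scale, upsample back to byte resolution, and predict the next byte everywhere."""
    def __init__(self, dim=256, stride=4):
        super().__init__()
        self.embed = nn.Embedding(256, dim)      # raw bytes: fixed vocabulary of 256
        self.down = CausalConv1d(dim, dim, kernel_size=stride, stride=stride)
        self.middle = CausalConv1d(dim, dim, kernel_size=3)
        self.up = nn.Upsample(scale_factor=stride, mode="nearest")
        self.out = nn.Linear(dim, 256)           # logits over the 256 next-byte values

    def forward(self, byte_ids):                 # byte_ids: (batch, length)
        x = self.embed(byte_ids).transpose(1, 2) # -> (batch, dim, length)
        h = self.middle(self.down(x))            # coarse, downsampled stage
        h = self.up(h)[..., : x.shape[-1]]       # back to byte resolution
        h = (h + x).transpose(1, 2)              # skip connection -> (batch, length, dim)
        return self.out(h)                       # (batch, length, 256) next-byte logits

model = ByteAUNetSketch()
logits = model(torch.randint(0, 256, (2, 64)))   # e.g. a batch of 2 sequences of 64 bytes
print(logits.shape)                              # torch.Size([2, 64, 256])
```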

AU-Net architecture: multi-scale encoding and parallel inference

The AU-Net architecture is implemented through multiple scale stages that downsample and then reconstruct the input sequence using strided convolutions. During training, each segment of the input sequence is predicted in a masked manner, preserving the autoregressive property. The model uses a learned splitting function to divide the input sequence into non-overlapping groups, which are then predicted in parallel and combined into a complete output. It supports both shallow and deep configurations, with training compute budgets ranging from 3% to 75% of the standard baseline. For example, an 8-billion-parameter configuration trained on 200B tokens achieves competitive results. Another version, a 1-billion-parameter model trained on 60 billion tokens, earns a 35.7 BLEU score on standard translation tasks, better than baseline models trained on the same data. Furthermore, AU-Net shows faster generation speeds thanks to its parallel decoding, which brings significant benefits to latency-sensitive applications.
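
To illustrate why parallel decoding reduces latency, the sketch below assumes a hypothetical prediction head that emits logits for a fixed-size group of upcoming bytes from a single hidden state. AU-Net's learned splitting function is more sophisticated, so treat this as an assumption-laden simplification rather than the paper's implementation.

```python
# Hypothetical sketch of grouped parallel byte prediction. A fixed group_size
# stands in for AU-Net's learned splitting mechanism; illustration only.
import torch
import torch.nn as nn

class GroupedByteHead(nn.Module):
    def __init__(self, dim=256, group_size=4):
        super().__init__()
        self.group_size = group_size
        self.proj = nn.Linear(dim, group_size * 256)   # 256 byte values per slot

    def forward(self, hidden):                          # hidden: (batch, dim)
        logits = self.proj(hidden)                      # (batch, group_size * 256)
        return logits.view(-1, self.group_size, 256)    # one distribution per slot

def decode_step(head, hidden, generated):
    """Append group_size bytes in a single forward pass instead of one at a time."""
    next_bytes = head(hidden).argmax(dim=-1)            # greedy choice per slot
    return torch.cat([generated, next_bytes], dim=1)

head = GroupedByteHead()
hidden = torch.randn(2, 256)                            # stand-in for a model's final hidden state
generated = torch.zeros(2, 0, dtype=torch.long)
generated = decode_step(head, hidden, generated)        # four bytes emitted in one step
print(generated.shape)                                  # torch.Size([2, 4])
```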

Benchmark results show a competitive advantage over transformers

The experimental results show that AU-Net performs very well across a range of tasks. On enwik8, a byte-level compression benchmark, AU-Net reaches 1.01 bits per byte, exceeding a transformer baseline that reaches only 1.02 bits. On PG-19, a long-context modeling task, the model achieves 2.61 bits per byte, compared with 2.75 bits for the standard transformer. AU-Net also scales efficiently across compute budgets, with an 8B model trained on 200B tokens achieving 43.3 BLEU on FLORES-200 translation. In multilingual evaluation on FLORES-200, the model outperforms token-based transformers on low-resource language pairs. It also shows better cross-lingual generalization within language families, achieving BLEU scores of up to 33.0 in several configurations. When evaluated under equal compute and data budgets, AU-Net matches or outperforms transformers, and in some cases generation speed improves by 20% to 30%.
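
For readers unfamiliar with the metric, bits per byte (bpb) is the model's average next-byte cross-entropy expressed in bits. The short sketch below shows the conversion; the numbers are made up to land near the enwik8 figure and this is not the paper's evaluation code.

```python
# Converting a summed next-byte negative log-likelihood (in nats) into bits per
# byte. Illustrative numbers only; not the paper's evaluation pipeline.
import math

def bits_per_byte(total_nll_nats, num_bytes):
    return total_nll_nats / (num_bytes * math.log(2))

# e.g. a summed NLL of 7,000 nats over 10,000 predicted bytes:
print(round(bits_per_byte(7_000, 10_000), 2))   # ~1.01 bits per byte
```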

Key contributions and performance insights of AU-Net

  • AU-Net eliminates the need for tokenization by operating directly on raw byte input.
  • On enwik8, AU-Net scores 1.01 bpb, exceeding the transformer baseline of 1.02 bpb.
  • On PG-19, it reaches 2.61 bpb, improving on the standard transformer’s 2.75 bpb.
  • FLORES-200 multilingual evaluation shows up to 33.0 BLEU, outperforming token-based systems.
  • AU-Net’s byte-level models maintain strong performance in both high-resource and low-resource settings.
  • Generation speed improves by 20%–30%, enabling fast, parallel inference.
  • Follows predictable scaling laws: performance improves as model size and data increase.
  • The model shows better cross-lingual generalization and robustness to noisy input.
  • Uses compute efficiently: AU-Net matches or exceeds transformer performance at a lower compute budget.
  • AU-Net is a viable alternative for large-scale language modeling tasks, including multilingual and byte-level applications.

Conclusion: the practical benefits and scalability potential of AU-Net

In summary, the researchers provide a detailed scaling analysis showing that AU-Net follows predictable scaling laws. It benefits from increased model size and training tokens in a way consistent with trends observed for transformer models. For example, in compute-matched training settings, AU-Net’s performance improves steadily as the data-to-model ratio grows, matching the gains seen in transformer counterparts. Importantly, AU-Net scales to 8 billion parameters, showing that training remains effective and that the architecture can support high-capacity systems. In extended evaluations, the model remains efficient on downstream tasks, showing strong performance in language generation, translation, and byte-level prediction benchmarks. AU-Net is also easier to train and more robust under noisy input conditions than token-based models.
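
As a point of reference, scaling analyses of this kind typically fit a parametric loss curve in model size N and training tokens D. The sketch below uses the common form L(N, D) = E + A/N^alpha + B/D^beta with placeholder constants purely to show the qualitative trend the paragraph describes; these are not values fitted to AU-Net.

```python
# Illustrative parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta.
# The constants below are placeholders for illustration, not AU-Net's fitted values.
def predicted_loss(N, D, E=1.7, A=406.0, B=411.0, alpha=0.34, beta=0.28):
    """Predicted loss from model parameters N and training tokens D."""
    return E + A / N**alpha + B / D**beta

# Loss should fall as either model size or data grows, consistent with the trend above:
print(predicted_loss(1e9, 60e9))    # smaller model, less data -> higher predicted loss
print(predicted_loss(8e9, 200e9))   # larger model, more data  -> lower predicted loss
```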

Why is this research important?

This study is important because it challenges the long-standing reliance on token-based language models by introducing AU-Net, a byte-level autoregressive architecture that eliminates the overhead of tokenization while achieving competitive or superior performance. By processing raw bytes directly and scaling with linear complexity, AU-Net addresses key limitations of transformer models, namely their quadratic scaling and dependence on a fixed vocabulary. Its strong results on multilingual and long-text benchmarks, especially in low-resource settings, highlight its potential for building more efficient and broadly accessible NLP systems. This positions AU-Net as a promising alternative for future large-scale language modeling efforts.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that offers in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, demonstrating its popularity among audiences.