SuperBPE: Cross-Word Tokenization for Language Models

Language models (LMs) face a fundamental challenge in how they perceive text data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot bridge whitespace, adhering to an artificial constraint that treats spaces as semantic boundaries. This practice ignores the reality that meaning often spans more than a single word: multi-word expressions (such as "a lot of") function as single semantic units, and English speakers mentally store thousands of such phrases. Cross-lingually, the same concept may be expressed as one word or several depending on the language. Notably, some languages (such as Chinese and Japanese) do not use whitespace at all, and their tokenizers routinely produce tokens spanning multiple words or even sentences without significant performance degradation.

Previous research has explored several alternatives to traditional subword tokenization. Some studies investigated processing text at multiple levels of granularity or creating multi-word tokens through frequency-based n-gram identification. Other researchers have explored multi-token prediction (MTP), which lets language models predict several tokens in a single step and confirms that models can handle multiple subwords at once; however, these methods require architectural modifications and fix the number of tokens predicted per step. Still others have adopted tokenizer-free approaches that model text directly as byte sequences, but this greatly increases sequence length and compute requirements, leading to complex architectural workarounds.

Researchers at the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that builds a vocabulary containing both traditional subword tokens and novel "superword" tokens that span multiple words. The approach extends the popular byte-pair encoding (BPE) algorithm: it first enforces whitespace boundaries to learn subword tokens, then removes that constraint so superword tokens can form. Whereas standard BPE quickly hits diminishing returns and starts adding increasingly rare subwords as the vocabulary grows, SuperBPE keeps discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.
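To make the notion of encoding efficiency concrete, the short snippet below measures bytes encoded per token using GPT-2's off-the-shelf tokenizer as a stand-in for a standard BPE vocabulary (no particular SuperBPE checkpoint is assumed here). Because standard BPE never merges across whitespace, a frequent phrase still costs at least one token per word.

```python
# Illustration only: measuring encoding efficiency (bytes per token) with a
# standard BPE tokenizer. GPT-2's tokenizer is used as a stand-in; SuperBPE
# would encode common multi-word phrases as single tokens instead.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "By the way, a lot of multi-word expressions behave as single semantic units."

ids = tok.encode(text)
print(tok.convert_ids_to_tokens(ids))  # no token covers more than one word
print(f"bytes per token: {len(text.encode('utf-8')) / len(ids):.2f}")
```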

SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE. The first stage learns subword tokens as semantic building blocks within whitespace boundaries; the second stage combines them into common multi-word sequences for greater encoding efficiency. Setting t = T (where t is the transition point and T is the target vocabulary size) recovers standard BPE, while t = 0 yields a naive whitespace-free BPE. Training SuperBPE requires more compute than standard BPE, because without whitespace pretokenization the training data consists of very long "words" with minimal deduplication. However, this extra cost amounts to a few hours on 100 CPUs and is incurred only once, which is negligible compared to the resources required for language model pretraining.
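As a rough sketch of that two-stage idea, consider the toy reimplementation below. It is purely illustrative and not the authors' code: the function names, the tiny corpus, and the choice to keep spaces as explicit symbols in stage two are all assumptions made for the example.

```python
# Toy two-stage BPE in the spirit of SuperBPE (illustrative sketch, not the
# authors' implementation). Stage 1 runs ordinary BPE with whitespace
# pretokenization up to a transition point t; stage 2 lifts that constraint,
# so later merges may cross word boundaries and form superword tokens.
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent token pair in seq, or None if none exist."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_superbpe_toy(text, num_merges, t):
    """Learn t merges that respect whitespace (subwords only), then continue
    merging on the full sequence so superword tokens can appear."""
    # Stage 1: whitespace-pretokenized BPE -- merges never cross word boundaries.
    words = [list(w) for w in text.split()]
    merges = []
    while len(merges) < t:
        counts = Counter()
        for w in words:
            counts.update(zip(w, w[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]

    # Stage 2: drop the whitespace constraint -- keep spaces as explicit symbols
    # so later merges can glue whole words together into superword tokens.
    seq = []
    for w in words:
        seq.extend(w + [" "])
    while len(merges) < num_merges:
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges.append(pair)
        seq = merge_pair(seq, pair)
    return merges

if __name__ == "__main__":
    corpus = "by the way by the way in the end in the end in the end"
    merges = train_superbpe_toy(corpus, num_merges=20, t=8)
    print(merges[-5:])  # later merges typically contain spaces, i.e. superwords
```

Running it on the toy corpus shows the later merges absorbing whitespace, producing tokens that span whole phrases rather than single words.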

SuperBPE demonstrates impressive performance across 30 benchmarks covering knowledge, reasoning, coding, reading comprehension, and more. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and beating the baseline on 25 of the 30 individual tasks. Multiple-choice tasks show especially large gains, improving by 9.7%. The only statistically significant regression occurs on the LAMBADA task, where final accuracy drops from 75.8% to 70.6%. Furthermore, every reasonable transition point yields stronger results than the baseline, and the most encoding-efficient transition point delivers a +3.1% performance improvement while cutting inference compute by 35%.

In short, the researchers introduced SuperBPE, a more efficient tokenization method that adds superword tokens by extending the standard BPE algorithm. Although tokenization is the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve stronger performance across numerous downstream tasks while reducing inference compute costs. These advantages require no modification to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

