
This AI paper introduces the Kolmogorov-Test: a compression benchmark for evaluating code-generating language models

Compression is a cornerstone of computational intelligence, deeply rooted in Kolmogorov complexity theory, which defines the shortest program required to reproduce a given sequence. Unlike traditional compression methods that exploit repetition and redundancy, Kolmogorov's framework treats compression as a problem of discovering structured patterns through programmatic representations. Although the theory promises optimal compression, its uncomputability presents a significant obstacle. The emergence of large language models capable of generating code, however, opens an interesting opportunity to test how closely modern systems can approximate this theoretical ideal by reasoning over code rather than matching patterns.
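To make the idea concrete, here is a toy illustration (not from the paper): a long, structured sequence takes far more bytes to write out than the short program that regenerates it, and that short program acts as a Kolmogorov-style compressed representation.

```python
# Toy illustration: a sequence of 1,000 squares written out literally
# occupies far more bytes than the program that regenerates it.
sequence = [i * i for i in range(1000)]
raw_size = len(str(sequence).encode("utf-8"))

# A Kolmogorov-style "compressed" representation: a few dozen characters
# that deterministically reproduce the full data when executed.
program = "print([i * i for i in range(1000)])"
program_size = len(program.encode("utf-8"))

print(f"raw: {raw_size} bytes, program: {program_size} bytes")
# The program is a tiny fraction of the raw size, and its length
# stays constant no matter how long the sequence grows.
```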

The core problem stems from the limitations of current models at compressing data sequences into concise, executable code. Models often copy inputs verbatim rather than generating programs that reproduce them, indicating gaps in genuine pattern understanding. This becomes especially evident with real-world audio, text, or DNA sequences, where complex logical structures must be discovered to achieve effective compression. The main challenge is to ensure that the model both replicates the sequence exactly and uses a minimal, well-reasoned instruction set. Furthermore, although synthetic training data is useful for controlled evaluation, it often fails to support robust generalization to natural data, which is critical for practical applications.
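The copying failure mode can be illustrated with a minimal, hedged contrast (illustrative only, not taken from the paper) between a program that merely embeds the data and one that captures its generating rule:

```python
data = [2, 4, 6, 8, 10, 12, 14, 16]

# Degenerate "compression": the program embeds the data verbatim, so it
# grows linearly with the input and never achieves real compression.
copy_program = f"print({data})"

# Genuine compression: the program encodes the generating rule, so its
# length stays essentially constant as the sequence grows.
pattern_program = "print([2 * i for i in range(1, 9)])"

print(len(copy_program), len(pattern_program))
```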

Several compression tools exist, ranging from traditional algorithms such as GZIP to newer neural compression systems. GZIP remains a strong baseline, especially for long or repetitive sequences, thanks to its efficient encoding of statistical regularities. More recently, language-modeling approaches have been combined with arithmetic coding, using predicted token probabilities to compress input data. However, these methods typically require access to the full model weights at decoding time, limiting their efficiency and applicability. Code-generating models such as GPT-4 and Llama have also been evaluated in zero-shot setups, prompting them to generate Python programs that reproduce input sequences. Yet they often produce verbose, inaccurate code, especially when facing unseen or complex sequences.
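As a rough sketch of how such a GZIP baseline can be measured (the paper's exact baseline protocol may differ), one can compare compressed size against raw size directly with the standard library:

```python
import gzip
import os

def gzip_ratio(data: bytes) -> float:
    """Compressed-to-raw size ratio under GZIP (lower is better)."""
    return len(gzip.compress(data)) / len(data)

print(gzip_ratio(b"ACGT" * 256))     # highly regular: compresses well
print(gzip_ratio(os.urandom(1024)))  # random noise: barely compresses
```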

Researchers from Meta AI and Tel Aviv University introduced the Kolmogorov-Test (KT), a benchmark for evaluating the reasoning ability of code-generating language models. The test measures a model's ability to generate the shortest program that outputs a given input sequence. Unlike typical benchmarks, KT emphasizes logical composition and program generation rather than predictive text modeling. The sequences include natural data from audio (LibriSpeech), text (Wikipedia enwik9), and DNA (GRCh38), as well as synthetic sequences produced by a custom-designed domain-specific language (DSL). This DSL supports building structured sequences through operations such as range creation, sequence modification, merging, and filtering.
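The paper's exact DSL is not reproduced here; the following is a hypothetical miniature in the same spirit, where the operation names (make_range, modify, merge, keep) are illustrative rather than the benchmark's actual syntax:

```python
# Hypothetical mini-DSL sketch; operation names are illustrative only.
def make_range(start, stop, step=1):
    """Range creation."""
    return list(range(start, stop, step))

def modify(seq, fn):
    """Sequence modification: apply fn element-wise."""
    return [fn(x) for x in seq]

def merge(*seqs):
    """Merging: concatenate sequences."""
    return [x for s in seqs for x in s]

def keep(seq, pred):
    """Filtering: retain elements satisfying pred."""
    return [x for x in seq if pred(x)]

# A structured sequence composed from the four primitives:
seq = merge(make_range(0, 8),
            keep(modify(make_range(0, 8), lambda x: x * 3),
                 lambda x: x % 2 == 0))
print(seq)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 6, 12, 18]
```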

The researchers developed an automated framework that uses this DSL to generate millions of synthetic program-sequence pairs. These pairs are then used to train and evaluate models, including both large pre-trained models and specially trained ones such as SeqCoder. To measure performance, the team adopted two metrics: accuracy, whether the generated program reproduces the sequence exactly, and precision, how concise a correct program is compared with GZIP compression. The test covers sequences of varying lengths, with synthetic sequences averaging 76 bytes and real sequences capped at 128 bytes.
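A hedged sketch of how these two metrics might be computed is shown below; the `output` variable convention and the exact precision formula are assumptions for illustration, not the paper's definitions:

```python
import gzip

def evaluate(program: str, target: bytes):
    """Accuracy: does executing the program reproduce the target?
    Precision: GZIP-compressed target length divided by program length,
    so values above 1 beat the GZIP baseline.
    (Both conventions here are illustrative assumptions.)"""
    namespace = {}
    try:
        exec(program, namespace)            # run the candidate program
    except Exception:
        return False, None                  # crashing programs score zero
    if namespace.get("output") != target:   # assumed output convention
        return False, None
    precision = len(gzip.compress(target)) / len(program.encode("utf-8"))
    return True, precision

target = bytes(range(10)) * 5
print(evaluate("output = bytes(range(10)) * 5", target))
```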

The results show that even the most powerful models struggle. GPT-4 achieves 69.5% accuracy on high-quality audio but drops to 36.4% on 8-bit audio and 50.3% on DNA data. Llama-3.1-405B performs worse, with audio accuracy as low as 3.9% and DNA accuracy of only 24.8%. On synthetic data, SeqCoder-8B reaches 92.5% accuracy with a precision score of 0.56, outperforming traditional tools such as GZIP. However, its accuracy on real-world data remains close to zero. This discrepancy illustrates the difficulty of transferring success from synthetic benchmarks to more diverse and noisy real-world sequences, highlighting the limitations of the current training regime and motivating the need for new strategies.

Overall, this study clearly outlines the difficulty of compression via code generation. The KT benchmark provides a rigorous, diverse test of model reasoning and structure recognition, revealing a stark gap between synthetic training environments and real-world applications. The methods and evaluations introduced here set a high bar for future models that aim to unify reasoning with compression, but substantial innovation is still required to meet it.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.
