
Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across a range of programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to improve performance in languages such as C++ and Python, the broader application of LLMs to code optimization, particularly in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or on solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama are primarily designed to improve code generation quality rather than performance. However, some studies have begun to address optimization, including parallelization and code efficiency improvements, though many of these methods are constrained by the need for formal verification, which limits scalability.

In contrast, some newer approaches embrace test-based validation, allowing more complex programs with loops to be optimized. Learning-based compiler optimization strategies, such as Coreset, which applies graph neural networks, and AutoPhase, which uses reinforcement learning for pass ordering, have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically limited to small-scale problems. Additionally, frameworks such as AutoTVM and Ansor focus on optimizing GPU kernel code through statistical modeling and search. More recently, LLM-driven optimization has attracted attention, with reinforcement learning methods guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder use policy optimization methods to fine-tune models for better performance, even in resource-constrained programming languages such as Verilog.

Researchers from Stanford, UIUC, CMU, and Visa explored the use of LLMs to optimize assembly code performance, an area traditionally handled by compilers such as GCC. They introduced a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward that balances correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieved a 96.0% test pass rate and an average speedup of 1.47x, outperforming 20 other models, including Claude-3.7-Sonnet. Their results show that with RL training, LLMs can outperform conventional compiler optimization.

The method frames performance optimization of compiled C programs as a reinforcement learning problem. Given a C program C, it is compiled to an assembly program P using gcc -O3. The goal is to generate a new assembly program P' that is functionally equivalent but faster. Correctness is verified against a test suite, and speedup is measured by the improvement in execution time. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions, speedup-only and correctness-guided speedup, are used to guide training based on program validity, correctness, and performance.

The study evaluates various language models on assembly code optimization and shows that most struggle, with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms the others, reaching 96% accuracy and a 1.47x average speedup. Ablation studies show that conditioning on the gcc -O3 reference aids performance, while removing it leads to a sharp drop. Notably, models such as Claude-3.7-Sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing a bit-counting loop with a single popcnt instruction, demonstrating semantic-level code transformations beyond traditional compiler capabilities.

In summary, the study applies LLMs to assembly code optimization, an area where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tuned Qwen2.5-Coder-7B with PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduced a benchmark of 8,072 real-world C programs to evaluate performance. The model achieved a 96.0% test pass rate and an average speedup of 1.47x, outperforming 20 other models, including Claude-3.7-Sonnet. Although effective, the approach's limitations include the lack of formal correctness guarantees and variability in hardware performance across systems.


View the paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
