
Verina: Evaluating LLMs on end-to-end verifiable code generation with formal proofs

LLM-based code generation faces a verification gap

LLMs show strong performance in programming and are widely adopted in tools such as Cursor and GitHub Copilot to improve developer productivity. However, because of their probabilistic nature, LLMs cannot provide formal guarantees for the code they generate. Generated code often contains bugs, and these issues can become productivity bottlenecks as LLM-generated code is used at scale. Developing the right benchmarks to track progress in verifiable code generation is important but challenging, because it involves three interconnected tasks: code generation, specification generation, and proof generation. Current benchmarks fall short because they lack support for all three tasks, quality control, robust metrics, and modular design.
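To make the three tasks concrete, the following is a minimal Lean 4 sketch, not an actual Verina sample, with purely illustrative identifiers (myAdd, myAdd_post): an implementation, a formal specification relating inputs to the output, and a proof that the implementation satisfies the specification.

```lean
-- Illustrative sketch of the three artifacts; not taken from Verina itself.

-- Task 1, code generation: an implementation.
def myAdd (a b : Nat) : Nat :=
  a + b

-- Task 2, specification generation: a postcondition on inputs and output.
def myAdd_post (a b result : Nat) : Prop :=
  result = a + b ∧ a ≤ result

-- Task 3, proof generation: the implementation satisfies the specification.
theorem myAdd_meets_post (a b : Nat) : myAdd_post a b (myAdd a b) := by
  unfold myAdd_post myAdd
  exact ⟨rfl, Nat.le_add_right a b⟩
```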

Existing benchmarks lack comprehensive support for verifiability

Benchmarks such as HumanEval and MBPP have driven progress in LLM-based code generation, but they do not address formal specifications or proofs. Many verification-focused efforts target only one or two of the three tasks and assume that humans supply the remaining components. DafnyBench and miniCodeProps are designed for proof generation, while AutoSpec and SpecGen infer specifications and proofs for human-written code. Interactive theorem provers (e.g., Lean) offer a promising target for verifiable code generation with LLMs, as they support proof construction through intermediate steps. However, existing verification benchmarks in Lean, such as miniCodeProps and FVAPPS, are limited in task coverage and quality control.

Introducing Verina: a holistic benchmark for code, specification, and proof generation

Researchers from the University of California and Meta FAIR proposed Verina, a high-quality benchmark for evaluating verifiable code generation. It consists of 189 programming challenges, each with a detailed problem description, a code implementation, a formal specification, a proof, and a test suite, all expressed in Lean. Verina is built with quality control in mind, drawing problems from sources such as MBPP, LiveCodeBench, and LeetCode to cover a range of difficulty levels. All samples are manually reviewed and refined to ensure clear natural-language descriptions, precise formal specifications, and correct code implementations. Each sample includes a test suite covering both positive and negative cases, with the code achieving 100% line coverage and passing the ground-truth specifications.

Structure and composition of the Verina dataset

Verina consists of two subsets at different difficulty levels: Verina-Basic and Verina-Adv. Verina-Basic contains 108 problems translated from human-written Dafny code, including 49 problems from MBPP-DFY-50 and additional instances from the Clover benchmark; the translation was performed with OpenAI o3-mini using few-shot prompting and then manually checked. Verina-Adv contains 81 advanced coding problems contributed by students in a theorem-proving course, in which students sourced problems from platforms such as LeetCode and LiveCodeBench and then formalized solutions in Lean. In addition, Verina applies strict quality assurance, including detailed problem descriptions, 100% line coverage of each implementation by its positive tests, and ground-truth specifications that pass all test cases.
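As a rough illustration of how such test cases can exercise a ground-truth specification in Lean (again with made-up identifiers rather than an actual Verina sample), positive tests check that the implementation's outputs satisfy the postcondition, while negative tests confirm that the specification rejects incorrect outputs.

```lean
-- Illustrative only: a toy implementation, its postcondition, and tests.
def isEven (n : Nat) : Bool :=
  n % 2 == 0

def isEven_post (n : Nat) (result : Bool) : Prop :=
  result = true ↔ n % 2 = 0

-- Positive test: the implementation's output satisfies the postcondition.
example : isEven_post 4 (isEven 4) := by
  unfold isEven_post isEven; decide

-- Negative test: an incorrect output is rejected by the postcondition.
example : ¬ isEven_post 3 true := by
  unfold isEven_post; decide
```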

Performance Insights: Verina’s LLM Assessment highlights key challenges

An evaluation of nine state-of-the-art LLMs on Verina reveals a clear difficulty hierarchy. Code generation achieves the highest success rates, followed by specification generation, while proof generation remains the most challenging, with pass rates below 3.6% for all models. Verina-Adv is harder than Verina-Basic across all three tasks, highlighting how increased problem complexity substantially degrades verifiable code generation performance. Iterative proof refinement with o4-mini shows that, after 64 iterations, the proof pass rate on the simpler Verina-Basic problems rises from 7.41% to 22.22%, with only limited gains on Verina-Adv. Providing ground-truth specifications also improves code generation, indicating that formal specifications can effectively constrain and guide the generation process.
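The iterative refinement loop described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the paper's actual harness: query_llm is a hypothetical placeholder for the model API, and the check assumes a Lean 4 project where the real command `lake env lean <file>` is available.

```python
# Sketch of iterative proof refinement: ask a model for a Lean proof, check it
# with the Lean toolchain, and feed compiler errors back for another attempt.
import subprocess
import tempfile
from pathlib import Path

def query_llm(prompt: str) -> str:
    """Hypothetical stub: return a candidate Lean file for the given prompt."""
    raise NotImplementedError

def check_with_lean(lean_source: str) -> tuple[bool, str]:
    """Compile a candidate Lean file and return (success, compiler output)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(lean_source)
        # Assumes this runs inside a Lean 4 project so `lake env lean` works.
        proc = subprocess.run(
            ["lake", "env", "lean", str(path)],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr

def refine_proof(task_prompt: str, max_iters: int = 64) -> str | None:
    feedback = ""
    for _ in range(max_iters):
        candidate = query_llm(task_prompt + feedback)
        ok, output = check_with_lean(candidate)
        if ok:
            return candidate  # proof accepted by the Lean checker
        feedback = f"\n\nPrevious attempt failed with:\n{output}"
    return None  # no verified proof found within the iteration budget
```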

Conclusion: Verina sets new standards in verifiable code evaluation

In short, the researchers introduced Verina, an advance in benchmarking verifiable code generation. It offers 189 well-curated examples with detailed task descriptions, high-quality code and specifications in Lean, and extensive test suites with full line coverage. However, the dataset is still relatively small for fine-tuning purposes and would need to be extended, for example with LLM-assisted automatic annotation. Verina also emphasizes simple, self-contained tasks that are well suited to benchmarking but do not fully represent complex real-world verification projects. In the future, the specification generation metrics could be improved by incorporating more capable provers, including LLM-based ones and SMT solvers, to handle complex soundness and completeness relationships more effectively.


Check out the Paper, Dataset Card, and GitHub Page. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.
