Salesforce AI proposes ViUniT: an AI framework that improves the reliability of visual programs by automatically generating unit tests using LLMs and diffusion models.

Visual programming has emerged strongly in computer vision and AI, especially in image reasoning. Visual programming enables computers to generate executable code that interacts with visual content to produce correct responses. These systems form the backbone of object detection, image captioning, and visual question answering (VQA) applications. Their effectiveness stems from the ability to modularize complex reasoning tasks, but correctness remains a major problem. Unlike conventional programming, where logical errors can be caught during syntax checking and debugging, visual programs can produce seemingly correct results while being logically flawed. Improved unit testing methods therefore play a crucial role in making them more reliable.
A recurring problem with visual programming is that a model can give the correct answer for the wrong reasons. The inability to verify the logic behind these outputs has serious consequences, because a program that appears to perform well can fail unexpectedly when it encounters new data. A recent study of 100 visual programs generated by the CodeLlama-7B model for the GQA dataset found that only 33% of these programs were correct, while another 23% required substantial rewriting. Most models rely on statistical correlation rather than genuine understanding and are therefore susceptible to edge cases. Because visual programming lacks systematic testing procedures, errors often go unnoticed, which calls for a stronger verification framework.
Efforts to improve the reliability of visual programs have focused primarily on training with annotated datasets, but this approach has limitations. Annotating training data is expensive and may not cover all potential use cases. Some researchers have explored reinforcement learning strategies that reward programs for producing correct answers during training, but these approaches do not necessarily ensure logical soundness. Traditional unit testing, widely used for text-based programming, has been adapted to check whether program output falls within a predefined category. Although these methods provide a degree of verification, they do not confirm that the reasoning behind an answer is logically correct. Addressing these limitations requires a new approach that systematically evaluates program behavior.
Researchers at Salesforce AI Research and the University of Pennsylvania have introduced Visual Unit Testing (ViUniT), a framework designed to improve the reliability of visual programs by generating unit tests that evaluate logical correctness. Unlike conventional unit testing techniques, which are primarily used in text-based applications, ViUniT generates test cases as image–answer pairs. These unit tests allow researchers to verify that a model truly understands the relationships and properties in an image rather than relying on statistical shortcuts. The core idea behind the framework is to systematically evaluate a visual program by creating images that serve as test inputs, accompanied by the expected answers the program should generate. This process ensures that the model not only produces correct answers but also follows logical steps to reach them.
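The image–answer pair idea can be sketched in a few lines of Python. This is a minimal illustration, not code from the ViUniT repository: the class and function names (`VisualUnitTest`, `run_test`) are hypothetical, and a plain string stands in for the synthesized image.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VisualUnitTest:
    image_description: str  # text prompt used to synthesize the test image
    expected_answer: str    # answer a logically correct program must return

def run_test(program: Callable[[str], str], test: VisualUnitTest) -> bool:
    """Execute the visual program on the test input and compare answers."""
    # In practice the program would receive a rendered image; here the
    # description string stands in for the synthesized image.
    return program(test.image_description) == test.expected_answer

# Toy example: a "program" that checks whether the scene contains a dog.
test = VisualUnitTest("a dog sitting on a red couch", "yes")
program = lambda img: "yes" if "dog" in img else "no"
print(run_test(program, test))  # → True
```

A program that merely guessed "yes" for every input would pass this one test but fail on a companion test whose image contains no dog, which is why the framework generates a diverse suite of such pairs.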
The ViUniT framework uses LLMs to generate test cases. The process starts by creating candidate image descriptions and then converting them into synthetic images with a state-of-the-art text-to-image model. To maximize the effectiveness of unit testing, ViUniT incorporates an optimization criterion that selects image descriptions providing the best test coverage across different scenarios. The system then executes the visual program on these test images, comparing the program's responses to the expected answers. A scoring function evaluates each program's performance on the tests, and programs that fail can be refined or discarded. This structured approach ensures that unit testing is comprehensive and can surface a wide range of potential errors. The framework also introduces four key applications of visual unit testing: best program selection, answer rejection, re-prompting, and reinforcement learning reward design. These applications allow researchers to improve model reliability by selecting the best-performing program, refusing to answer when confidence is low, refining programs through iterative prompting, and training models with unit-test-driven reinforcement learning.
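The generate–synthesize–execute–score loop described above can be sketched as follows. This is a simplified outline under stated assumptions: `generate_descriptions` and `synthesize_image` are placeholders for the LLM and diffusion-model calls (here they return fixed strings), and all names are illustrative rather than the ViUniT API.

```python
def generate_descriptions(question: str, n: int) -> list[str]:
    # Placeholder for the LLM step: propose diverse candidate scenes
    # relevant to the question. (Here the question is ignored and
    # fixed scenes are returned.)
    return [f"an empty living room, variant {i}" for i in range(n)]

def synthesize_image(description: str) -> str:
    # Placeholder for the text-to-image diffusion step; the description
    # string stands in for the rendered image.
    return description

def score_program(program, tests) -> float:
    """Fraction of unit tests passed; low-scoring programs can be
    refined via re-prompting or discarded."""
    passed = sum(1 for image, expected in tests if program(image) == expected)
    return passed / len(tests)

# Toy usage: build a small test suite, then score a candidate program.
descriptions = generate_descriptions("is there a cat in the image?", 4)
tests = [(synthesize_image(d), "no") for d in descriptions]
program = lambda img: "yes" if "cat" in img else "no"
print(score_program(program, tests))  # → 1.0
```

The resulting score is what drives the four downstream applications: it ranks candidate programs, gates answer rejection, triggers re-prompting, and serves as a reward signal for reinforcement learning.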
To evaluate ViUniT, the researchers conducted extensive experiments on three datasets (GQA, SugarCrepe, and Winoground) that are common benchmarks for visual reasoning and image-text matching. The results show that ViUniT significantly improves model performance: across all datasets, it raised accuracy by an average of 11.4%. The framework also enabled open-source models with 7 billion parameters to outperform GPT-4o-mini by an average of 7.7%. Furthermore, ViUniT substantially reduced the number of programs that were correct for the wrong reasons. The reinforcement learning-based reward design implemented in ViUniT proved efficient, outperforming traditional correctness-based reward strategies. This improvement shows that unit testing can be used not only to detect errors but also to actively improve model performance. Introducing an answer rejection strategy also aids reliability, ensuring that the model does not provide misleading answers when confidence is low.
Several key points of research include:
- Only 33% of tested visual programs on GQA are completely correct, while 23% require substantial rewriting due to logical flaws.
- ViUniT reduces logically flawed programs by 40%, ensuring that the model relies on sound reasoning rather than statistical shortcuts.
- Across the three datasets, the framework improved model accuracy by an average of 11.4%.
- ViUniT enables 7B open-source models to outperform GPT-4o-mini by 7.7%.
- The framework introduces four novel applications: best program selection, answer rejection, re-prompting, and reinforcement learning reward design.
- ViUniT-based re-prompting improves performance by 7.5 percentage points over error-based re-prompting.
- The reinforcement learning reward strategy used in ViUniT outperforms correctness-based reward strategies by 1.3%.
- The system successfully identifies when a program is unreliable, improving the answer rejection strategy and reducing false confidence.
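Two of the applications listed above, best program selection and answer rejection, reduce to simple decision rules once unit-test scores are available. The sketch below is illustrative only; the function names and the threshold value are assumptions, not details from the paper.

```python
def select_best(candidates):
    """Best program selection: keep the candidate with the highest
    unit-test pass rate."""
    return max(candidates, key=lambda pair: pair[1])

def answer_or_reject(program, score, image, threshold=0.6):
    """Answer rejection: abstain (return None) when the program's
    unit-test score falls below the threshold."""
    return program(image) if score >= threshold else None

# Toy usage with (program name, unit-test score) pairs.
candidates = [("program_a", 0.9), ("program_b", 0.4)]
print(select_best(candidates))  # → ('program_a', 0.9)

# A low-scoring program abstains rather than risk a misleading answer.
prog = lambda img: "yes"
print(answer_or_reject(prog, 0.4, "some image"))  # → None
```

Abstaining on low-scoring programs is what drives the reported reduction in false confidence: a wrong answer is traded for an explicit "no answer."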
Check out the Paper and GitHub page. All credit for this research goes to the researchers on the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.