
UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework for Assessing AI Agents on Large-Scale Vulnerabilities Across Massive Codebases

Cybersecurity has become an important proving ground for artificial intelligence, driven by society's dependence on large software systems and by the expanding capabilities of AI tools. As threats grow more complex, securing software is no longer just a matter of traditional defenses; it now intersects with automated reasoning, vulnerability detection, and code-level understanding. Modern cybersecurity demands tools and methods that can simulate real-world scenarios, identify hidden flaws, and verify system integrity across diverse software infrastructures. In this environment, researchers have been developing benchmarks and methods to systematically evaluate whether AI agents can detect, and even exploit, vulnerabilities in a manner comparable to human security researchers. But bridging the gap between AI reasoning and the complexity of real-world cybersecurity remains a key challenge.

Problems with existing benchmarks

A pressing question is the lack of effective ways to assess whether AI systems can truly understand and handle security tasks under real-world conditions. Simplified benchmark tasks dominate current testing methods, and they rarely reflect the messy, layered reality of large software repositories. These environments involve complex input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without strong evaluation methods, it is difficult to determine whether an AI agent can be trusted with tasks such as vulnerability detection or exploit development. More importantly, current benchmarks fail to capture the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a key assessment gap.

Limitations of current tools

Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF benchmark. These focus on capture-the-flag (CTF) style tasks, which have limited complexity and often involve small codebases and constrained testing environments. Some benchmarks attempt to incorporate real-world vulnerabilities, but usually at a limited scale. Furthermore, many tools rely on synthetic test cases or narrow challenge problems that cannot represent the diversity of software inputs, execution paths, and error types found in real systems. Even specialized agents for security analysis have been tested on benchmarks with only dozens or hundreds of tasks, far from the complexity of real-world threat landscapes.

Introducing CyberGym

The researchers introduced CyberGym, a large-scale, comprehensive benchmark designed specifically to evaluate AI agents in realistic cybersecurity settings. Developed at the University of California, Berkeley, CyberGym comprises 1,507 distinct benchmark tasks derived from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the complete pre-patch codebase, executables, and a textual description of the vulnerability. The agent must generate a proof-of-concept (PoC) test that reproduces the vulnerability, and CyberGym judges success by whether the PoC triggers the vulnerability in the pre-patch version but not in the post-patch version. The benchmark uniquely emphasizes PoC generation, which requires the agent to traverse complex code paths and synthesize inputs that satisfy specific security conditions. CyberGym is modular and containerized, making it easily scalable and reproducible.
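Conceptually, that success condition reduces to running the candidate PoC against both builds. The sketch below is a minimal illustration of this logic under stated assumptions, not CyberGym's actual harness: the function name, the binary paths, and the convention of passing the PoC file as the target's argument are all hypothetical.

```python
import subprocess

def poc_reproduces_vulnerability(poc_path: str,
                                 pre_patch_binary: str,
                                 post_patch_binary: str,
                                 timeout_s: int = 60) -> bool:
    """Success condition as described above: the PoC crashes the pre-patch
    build but not the post-patch build."""

    def crashes(binary: str) -> bool:
        # Run the (sanitizer-instrumented) target on the PoC file; treat a
        # non-zero exit code as a triggered vulnerability.
        result = subprocess.run([binary, poc_path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode != 0

    return crashes(pre_patch_binary) and not crashes(post_patch_binary)
```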

CyberGym's evaluation levels

CyberGym's evaluation pipeline is built around four difficulty levels, each increasing the amount of input information provided. At level 0, only the codebase is given, with no hint about the vulnerability. Level 1 adds a natural language description. Level 2 introduces the ground-truth proof of concept (PoC) and the crash stack trace, while level 3 additionally includes the patch itself and the post-patch codebase. Each level poses a different reasoning challenge: at level 1, for example, the agent must infer the location and context of the vulnerability purely from its textual description and the codebase. To ensure benchmark quality, CyberGym applies filters such as patch-validity checks, PoC reproducibility verification, and deduplication by comparing stack traces. The final dataset contains codebases with a median of 1,117 files and 387,491 lines of code, reaching up to 40,000 files and 7 million lines of code. Patch sizes also vary, modifying a median of 1 file and seven lines but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities cover various crash types, with 30.4% related to heap-buffer-overflow reads and 19.0% due to use of uninitialized values.
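A compact way to express the level structure described above is a mapping from difficulty level to the artifacts an agent receives. The sketch below simply restates that mapping in Python; the key names are illustrative and not CyberGym's actual configuration schema.

```python
# Artifacts available to the agent at each CyberGym difficulty level, per the
# description above. Field names are illustrative, not the benchmark's schema.
CYBERGYM_LEVEL_INPUTS = {
    0: ["pre_patch_codebase"],
    1: ["pre_patch_codebase", "vulnerability_description"],
    2: ["pre_patch_codebase", "vulnerability_description",
        "ground_truth_poc", "crash_stack_trace"],
    3: ["pre_patch_codebase", "vulnerability_description",
        "ground_truth_poc", "crash_stack_trace",
        "patch", "post_patch_codebase"],
}
```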

Experimental results

Against this benchmark, existing agents achieve only limited success. Across four agent frameworks, OpenHands, Codex, EnIGMA, and Cybench, the best performer, paired with Claude-3.7-Sonnet, reproduced only 11.9% of the target vulnerabilities. Performance drops sharply for longer PoC inputs: PoCs under 10 bytes had the highest success rate (43.5%), while those exceeding 100 bytes succeeded far less often. Open-source models such as DeepSeek-V3 lagged behind with a success rate of only 3.6%. Even models specialized for code reasoning, such as SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring below 2%. Unsurprisingly, richer input information improved performance: level 3 reached 17.1% success, while level 0 managed only 3.5%. The analysis also shows that most successful PoC reproductions occurred within 20 to 40 execution steps, while many runs exceeding 90 steps ultimately failed. Despite these challenges, the agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched vulnerabilities in real-world software, demonstrating their potential to uncover new flaws.

Key Points

  • Benchmark Scale and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.
  • Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of the vulnerabilities, and many combinations scored below 5%.
  • Difficulty scaling: Providing additional inputs, such as stack traces or patches, significantly improves performance; level 3 tasks yield a 17.1% success rate.
  • Length sensitivity: Agents struggle with tasks involving long PoCs. PoCs over 100 bytes, which account for 65.7% of the dataset, have the lowest success rates.
  • Discovery Potential: Agent-generated PoCs uncovered 15 new zero-day vulnerabilities, validating their potential use in real-world security analysis.
  • Model Behavior: Most successful exploits are generated relatively early in task execution, with diminishing returns after 80 steps.
  • Tool interaction: Agents perform better when they use tools (e.g., "awk", "grep", or installing "xxd") and tune their PoCs based on runtime feedback, as in the sketch below.
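The tool-interaction point amounts to a simple feedback loop: run the instrumented target on the current PoC, inspect the diagnostics, and adjust the input bytes. The following is a hypothetical sketch of such a loop, not code from CyberGym or from any evaluated agent; `mutate_poc` and the binary invocation convention are placeholders.

```python
import subprocess
import tempfile

def refine_poc(binary: str, poc: bytes, mutate_poc, max_steps: int = 90):
    """Hypothetical refinement loop: keep adjusting the PoC based on runtime
    feedback until the target crashes or the step budget runs out."""
    for _ in range(max_steps):
        with tempfile.NamedTemporaryFile(suffix=".poc") as f:
            f.write(poc)
            f.flush()
            # A non-zero exit code from the sanitizer-instrumented target
            # indicates the vulnerability was triggered.
            result = subprocess.run([binary, f.name],
                                    capture_output=True, timeout=60)
        if result.returncode != 0:
            return poc
        # No crash: use the run's diagnostics (stderr) to pick the next input.
        poc = mutate_poc(poc, result.stderr)
    return None
```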

Conclusion

In summary, this study highlights a key point: evaluating AI for cybersecurity is not only challenging but also essential for understanding its limitations and capabilities. CyberGym stands out by providing a large-scale, real-world framework. The researchers address the problem with a practical, detailed benchmark that forces agents to reason across entire codebases, use tools effectively, and adapt through iteration. The results make clear that while current agents show promise, particularly in discovering new bugs, there is still a long way to go before AI can reliably contribute to cybersecurity.


Check out the Paper, GitHub page, and leaderboard. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent effort is the launch of Marktechpost, an artificial intelligence media platform that provides in-depth coverage of machine learning and deep learning news in a way that is both technically sound and accessible to a broad audience. The platform draws over 2 million monthly views, reflecting its popularity among readers.
