
AWS introduces SWE-PolyBench: a new open-source multilingual benchmark for evaluating AI coding agents

Recent advances in Large Language Models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, evaluation of these systems remains limited, often restricted to synthetic or narrow benchmarks, mostly in Python. These benchmarks rarely reflect the structural and semantic diversity of real-world codebases, so many agents overfit to benchmark-specific patterns rather than demonstrating transferable capabilities.

AWS introduces SWE-PolyBench: A more comprehensive evaluation framework

To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python) and comprises 2,110 tasks, including bug fixes, feature implementations, and code refactorings.

Unlike previous benchmarks, SWE-PolyBench incorporates real-world pull requests (PRs) that close actual issues and include relevant test cases, enabling verifiable evaluation. A smaller, stratified subset, SWE-PolyBench500, was also released to support faster experimentation while preserving task and language diversity.
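For readers who want to explore the tasks directly, the sketch below shows one way to load and inspect the benchmark with the Hugging Face datasets library. The dataset identifier, split name, and field names are assumptions made for illustration; check the official SWE-PolyBench page on Hugging Face for the exact ones.

from datasets import load_dataset

# Hypothetical dataset ID and split; verify against the SWE-PolyBench Hugging Face page.
tasks = load_dataset("AmazonScience/SWE-PolyBench", split="test")

for task in tasks.select(range(3)):
    # Each task pairs a repository snapshot with an issue-derived problem
    # statement and the tests used for verification (field names assumed).
    print(task.get("instance_id"), task.get("repo"), task.get("language"))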

Technical structure and evaluation metrics

SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground truth patches in containerized test environments configured for the respective language ecosystems (e.g., Maven for Java, npm for JS/TS). Outcomes are then assessed using two types of unit tests: fail-to-pass (F2P) and pass-to-pass (P2P).
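As a rough illustration of how the F2P/P2P scheme decides whether a task is resolved, consider the simplified sketch below. It is not the official harness, and the function and argument names are invented for illustration; it only captures the idea that a patch must make the previously failing tests pass without breaking the tests that already passed.

def is_resolved(test_results: dict[str, bool],
                f2p_tests: list[str],
                p2p_tests: list[str]) -> bool:
    """Return True only if the patch fixes the issue without regressions.

    test_results maps each executed test ID to True (passed) or False (failed).
    """
    # F2P tests failed before the patch and must pass after it (the fix works).
    fixes_issue = all(test_results.get(t, False) for t in f2p_tests)
    # P2P tests passed before the patch and must still pass (no regressions).
    no_regressions = all(test_results.get(t, False) for t in p2p_tests)
    return fixes_issue and no_regressions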

To provide a more fine-grained evaluation of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include file-level and node-level retrieval scores, which assess an agent's ability to locate and modify the relevant parts of the codebase. The metrics offer insights beyond binary pass/fail results, especially for complex, multi-file modifications.
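To make the retrieval metrics concrete, the following sketch computes file-level precision and recall by comparing the files an agent touches against those changed in the ground truth patch; node-level scores can be computed analogously over CST nodes (functions, classes) extracted with a parser such as tree-sitter. This is an illustrative approximation, not the benchmark's exact implementation.

def file_retrieval_scores(agent_files: set[str], gold_files: set[str]) -> tuple[float, float]:
    """Precision and recall of the files the agent located and modified."""
    hits = len(agent_files & gold_files)
    precision = hits / len(agent_files) if agent_files else 0.0
    recall = hits / len(gold_files) if gold_files else 0.0
    return precision, recall

# Example: the agent edited two files, one of which matches the gold patch.
print(file_retrieval_scores({"src/app.py", "src/utils.py"}, {"src/app.py"}))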

Empirical evaluation and observations

Three open-source coding agents (Aider, SWE-Agent, and Agentless) were adapted for SWE-PolyBench. All used Anthropic's Claude 3.5 as the base model and were modified to handle the benchmark's multilingual, repository-level requirements.

The evaluation shows large performance differences across languages and task types. For example, agents performed best on Python tasks (pass rates up to 24.1%) but struggled on TypeScript (as low as 4.7%). Java, despite having higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a crucial role in model performance.

Performance also varies with task complexity. Tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw significant drops. Interestingly, high retrieval precision and recall (especially for file and CST node identification) did not always translate into higher pass rates, suggesting that code localization is necessary but not sufficient for solving the problem.

Conclusion: toward robust evaluation of AI coding agents

SWE-PolyBench provides a robust and nuanced evaluation framework for coding agents, addressing key limitations of existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of agents' real-world applicability.

The benchmark shows that, despite promising capabilities, AI agents still exhibit inconsistent performance across languages and tasks. SWE-PolyBench lays the foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.


Check out the AWS DevOps blog, the SWE-PolyBench dataset on Hugging Face, and the SWE-PolyBench GitHub repository for more details.


