Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn, Enterprise-Level Benchmark for LLM Agents

AI agents powered by LLMs show great promise for handling complex business tasks, especially in areas such as customer relationship management (CRM). Evaluating their real-world effectiveness is challenging, however, because publicly available, realistic business data is scarce. Existing benchmarks often focus on simple single-turn interactions or narrow applications such as customer service, lacking coverage of broader domains including sales, configure-price-quote (CPQ) processes, and business-to-business (B2B) operations. They also fail to test how agents manage sensitive information. These limitations make it difficult to fully understand how LLM agents perform across diverse real-world business scenarios and modes of communication.
Previous benchmarks have focused on customer service tasks in B2C settings, overlooking key business operations such as sales and CPQ processes, as well as challenges unique to B2B interactions, including longer sales cycles. Many benchmarks also lack realism, often omitting multi-turn conversations or skipping expert validation of tasks and environments. Another key gap is the absence of confidentiality evaluation, which is crucial in workplace settings where AI agents routinely interact with sensitive business and customer data. Without assessing data-handling awareness, these benchmarks cannot address serious practical concerns such as privacy, legal risk, and trust.
Researchers at Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents, such as Gemini 2.5 Pro, in professional business settings. It features expert-validated tasks spanning customer service, sales, and CPQ, supports multi-turn dialogue, and evaluates confidentiality awareness. The findings show that even the best-performing model, Gemini 2.5 Pro, achieves only about 58% accuracy on single-turn tasks, and performance drops to about 35% in multi-turn settings. Workflow execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge for all evaluated models.
CRMArena-Pro is a new benchmark that rigorously tests LLM agents on customer service, sales, and CPQ scenarios in a realistic business environment. It is built on synthetic but structurally faithful enterprise data generated by GPT-4 and, following Salesforce schemas, simulates the business environment through a sandboxed Salesforce organization. It comprises 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests for confidentiality awareness. Expert evaluation confirms the realism of the data and environment, providing a reliable testbed for LLM agent performance.
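The multi-turn setup described above can be pictured as a simple evaluation loop: a simulated user feeds messages to the agent inside a sandboxed environment, and the episode succeeds if the agent produces the task's ground-truth answer within a turn budget. The sketch below is illustrative only; the class and function names (`Task`, `run_multi_turn_episode`, `toy_agent`) are assumptions for demonstration, not the benchmark's actual API.

```python
# Minimal sketch of a multi-turn benchmark episode: a scripted simulated
# user talks to an agent until the task is solved or the turn limit is hit.
# All names here are illustrative assumptions, not CRMArena-Pro's real code.

from dataclasses import dataclass, field

@dataclass
class Task:
    skill: str                # e.g. "database_querying", "workflow_execution"
    goal: str                 # ground-truth answer the agent must produce
    user_turns: list = field(default_factory=list)  # scripted user messages

def run_multi_turn_episode(task: Task, agent_fn, max_turns: int = 5) -> bool:
    """Drive one episode; succeed if any agent reply exactly matches the goal."""
    history = []
    for user_msg in task.user_turns[:max_turns]:
        history.append(("user", user_msg))
        reply = agent_fn(history)           # agent sees the full dialogue so far
        history.append(("agent", reply))
        if reply.strip().lower() == task.goal.strip().lower():
            return True                     # exact-match success
    return False

# Toy agent that only answers once the user has supplied enough detail,
# mimicking why multi-turn tasks are harder than single-turn ones.
def toy_agent(history):
    last_user = [m for role, m in history if role == "user"][-1]
    return "case-123" if "case number" in last_user else "Could you clarify?"

task = Task(
    skill="database_querying",
    goal="case-123",
    user_turns=["I need help with my order.",
                "The case number field should say case-123."],
)
print(run_multi_turn_episode(task, toy_agent))  # True: solved on turn 2
```

The toy agent needs a clarification turn before it can answer, which is exactly the kind of information-gathering behavior that single-turn benchmarks never exercise.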
The top LLM agents were evaluated and compared across the 19 business tasks, focusing on task completion and confidentiality awareness. Metrics vary by task type: exact match for structured outputs and F1 score for generated responses. A GPT-4o-based LLM judge assessed whether models appropriately refused to share sensitive information. Models with advanced reasoning, such as Gemini 2.5 Pro and o1, significantly outperform lighter or non-reasoning counterparts, especially on complex tasks. Although performance in B2B and B2C settings is similar, subtle trends emerge depending on model strength. Prompting for confidentiality awareness increases refusal rates but sometimes reduces task accuracy, highlighting a trade-off between privacy and performance.
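For the generated-response metric, a token-level F1 score is the standard way to reward partial overlap with a reference answer. The sketch below shows a conventional QA-style F1 (precision/recall over bag-of-words tokens, as popularized by reading-comprehension evaluations); it is a plausible stand-in for what the article describes, not necessarily the benchmark's exact implementation.

```python
# Token-level F1 between a model's generated response and a reference answer.
# A standard QA-style overlap metric; assumed here for illustration, not
# taken from the benchmark's own evaluation code.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the refund was approved", "refund approved"), 2))  # 0.67
```

Unlike exact match, which is all-or-nothing and suits structured outputs like query results, F1 gives credit for partially correct free-form answers, which is why the two metrics are split by task type.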
In short, CRMArena-Pro is a new benchmark designed to test how effectively LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-validated tasks for B2B and B2C scenarios, covering sales, service, and pricing operations. Although top agents performed reasonably on single-turn tasks (about 58% success), their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, but most other skills remained challenging. Confidentiality awareness was low, and improving it through prompting typically reduced task accuracy. These findings reveal a clear gap between current LLM capabilities and enterprise needs.
Check out the Paper, GitHub Page, Hugging Face Page, and Technical Blog. All credit for this research goes to the researchers of the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.