Evaluating enterprise-class AI assistants: A benchmark for complex, voice-driven workflows

As businesses increasingly integrate AI assistants, it is crucial to evaluate how effectively these systems perform real-world tasks, especially through voice-based interactions. Existing assessment methods focus either on broad dialogue skills or on a narrow set of task-specific tools, and are therefore insufficient for measuring an AI agent's ability to manage complex, professional workflows across different fields. This gap underscores the need for more comprehensive evaluation frameworks that reflect the challenges AI assistants face in practical enterprise settings and confirm they can genuinely support complex, voice-driven operations in real-world environments.
To address the limitations of existing benchmarks, Salesforce AI Research and Engineering has developed a tailored system for evaluating AI agents on complex enterprise tasks across both text and voice interfaces. This internal tool supports the development of products such as Agentforce. It provides a standardized framework for assessing AI assistant performance in four key business areas: managing healthcare appointments, handling financial transactions, processing inbound sales, and fulfilling e-commerce orders. Using carefully designed, human-verified test cases, the benchmark requires agents to complete multi-step operations, use domain-specific tools, and adhere to strict security protocols in both communication modes.
Traditional AI benchmarks usually focus on commonsense reasoning or basic instruction following, but enterprise settings demand more advanced capabilities. In these settings, AI agents must integrate with multiple tools and systems, follow strict security and compliance procedures, and understand domain-specific terminology and workflows. Voice-based interaction adds another layer of complexity, especially in multi-step tasks, because of potential speech recognition and synthesis errors. Benchmarks that capture these requirements guide AI development toward more reliable and effective enterprise assistants.
Salesforce's benchmark uses a modular framework built from four key components: domain-specific environments, predefined tasks with clear goals, simulated interactions that reflect real-world conversations, and measurable performance metrics. It evaluates agents in four enterprise areas: healthcare appointment management, financial services, inbound sales, and e-commerce order fulfillment. Tasks range from simple requests to complex operations involving conditional logic and multiple system calls. Human-verified test cases ensure the benchmark captures real-world challenges, testing agents' reasoning, accuracy, and tool handling across text and voice interfaces.
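To make the modular structure concrete, here is a minimal sketch of how a domain environment could bundle tools with verifiable tasks. The class names, fields, and the healthcare example below are hypothetical illustrations, not Salesforce's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Task:
    """A single benchmark task with a goal and a programmatic success check."""
    task_id: str
    instruction: str                      # the simulated customer's goal
    required_tools: list[str]             # tools the agent is expected to call
    verify: Callable[[dict], bool]        # checks the environment's final state

@dataclass
class DomainEnvironment:
    """One enterprise domain (healthcare, finance, sales, e-commerce) with its tools and tasks."""
    name: str
    tools: dict[str, Callable[..., Any]]  # domain-specific tool implementations
    tasks: list[Task] = field(default_factory=list)

# Hypothetical healthcare example: the agent must look up and reschedule an appointment.
healthcare = DomainEnvironment(
    name="healthcare_appointments",
    tools={
        "find_appointment": lambda patient_id: {"appt_id": "A-17", "slot": "2025-06-01T09:00"},
        "reschedule": lambda appt_id, new_slot: {"appt_id": appt_id, "slot": new_slot},
    },
    tasks=[
        Task(
            task_id="hc-001",
            instruction="Move my June 1st appointment to the afternoon.",
            required_tools=["find_appointment", "reschedule"],
            verify=lambda state: state.get("slot", "").endswith("T14:00"),
        )
    ],
)
```

Defining each task around a programmatic end-state check is one way multi-step operations can be scored objectively, independent of how the conversation unfolds.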
The evaluation framework measures agent performance on two main criteria: accuracy, i.e., whether the agent completes the task correctly, and efficiency, measured through conversation length and token usage. It evaluates both text and voice interactions and can optionally inject audio noise to test the system's resilience. The modular benchmark is implemented in Python and supports realistic customer-agent conversations, multiple AI model providers, and configurable voice processing with built-in speech-to-text and text-to-speech components. An open-source release is planned so developers can extend the framework to new use cases and communication formats.
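A minimal sketch of how those two criteria might be aggregated over a batch of runs, with a flag for whether audio noise was injected into the voice pipeline. The names and structure here are assumptions for illustration; the framework's real interfaces may differ:

```python
from dataclasses import dataclass

@dataclass
class ConversationResult:
    task_completed: bool       # did the agent reach the verified end state?
    num_turns: int             # conversation length
    tokens_used: int           # total prompt + completion tokens
    modality: str              # "text" or "voice"
    noise_added: bool = False  # whether audio noise was injected for robustness testing

def score(results: list[ConversationResult]) -> dict[str, float]:
    """Aggregate accuracy and efficiency metrics over a batch of benchmark runs."""
    n = len(results)
    return {
        "accuracy": sum(r.task_completed for r in results) / n,
        "avg_turns": sum(r.num_turns for r in results) / n,
        "avg_tokens": sum(r.tokens_used for r in results) / n,
    }

# Example: compare voice runs (with noise) against text runs on the same tasks.
runs = [
    ConversationResult(True, 6, 2400, "text"),
    ConversationResult(True, 8, 3100, "voice", noise_added=True),
    ConversationResult(False, 11, 4800, "voice", noise_added=True),
]
print(score([r for r in runs if r.modality == "voice"]))
print(score([r for r in runs if r.modality == "text"]))
```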
Preliminary testing across leading models, including GPT-4 variants and Llama, shows that financial tasks are the most error-prone due to their strict verification requirements. Voice-based tasks score 5-8% lower than their text counterparts, and accuracy drops further on multi-step tasks, especially those requiring conditional logic. These findings highlight ongoing challenges in chained tool use, protocol compliance, and voice processing. Despite its strengths, the benchmark currently lacks personalization, diversity in real-world user behavior, and multilingual coverage. Future work will address these gaps by expanding the domains, introducing user modeling, and incorporating more subjective and cross-language evaluations.
Check out the technical details. All credit for this research goes to the researchers on the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
