A coding implementation of a comprehensive enterprise AI benchmarking framework for evaluating rule-based, LLM-driven, and hybrid agentic AI systems across real-world tasks

In this tutorial, we build a comprehensive benchmarking framework for evaluating how different types of agentic AI systems perform on real-world enterprise software tasks. We design a diverse set of challenges, ranging from data transformation and API integration to workflow automation and performance optimization, and evaluate how various agents (rule-based, LLM-driven, and hybrid) handle them. By running structured benchmarks and visualizing key performance metrics such as accuracy, execution time, and success rate, we gain a deeper understanding of the benefits and tradeoffs of each agent in an enterprise environment.

import json
import time
import random
from typing import Dict, List, Any, Callable, Optional
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


@dataclass
class Task:
   id: str
   name: str
   description: str
   category: str
   complexity: int
   expected_output: Any


@dataclass
class BenchmarkResult:
   task_id: str
   agent_name: str
   success: bool
   execution_time: float
   accuracy: float
   error_message: str = ""


class EnterpriseTaskSuite:
   def __init__(self):
       self.tasks = self._create_tasks()


   def _create_tasks(self) -> List[Task]:
       return [
           Task("data_transform", "CSV Data Transformation",
                "Transform customer data by aggregating sales", "data_processing", 3,
                {"total_sales": 15000, "avg_order": 750}),
           Task("api_integration", "REST API Integration",
                "Parse API response and extract key metrics", "integration", 2,
                {"status": "success", "active_users": 1250}),
           Task("workflow_automation", "Multi-Step Workflow",
                "Execute data validation -> processing -> reporting", "automation", 4,
                {"validated": True, "processed": 100, "report_generated": True}),
           Task("error_handling", "Error Recovery",
                "Handle malformed data gracefully", "reliability", 3,
                {"errors_caught": 5, "recovery_success": True}),
           Task("optimization", "Query Optimization",
                "Optimize database query performance", "performance", 5,
                {"execution_time_ms": 45, "rows_scanned": 1000}),
           Task("data_validation", "Schema Validation",
                "Validate data against business rules", "validation", 2,
                {"valid_records": 95, "invalid_records": 5}),
           Task("reporting", "Executive Dashboard",
                "Generate KPI summary report", "analytics", 3,
                {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
           Task("integration_test", "System Integration",
                "Test end-to-end integration flow", "testing", 4,
                {"all_systems_connected": True, "latency_ms": 120}),
       ]


   def get_task(self, task_id: str) -> Optional[Task]:
       return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which contains multiple enterprise-related tasks such as data transformation, reporting, and integration. This lays the foundation for evaluating different types of agents on these tasks.
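
As a quick sanity check, here is a minimal sketch (assuming the classes above are defined in the same module) that instantiates the suite and inspects one task:

suite = EnterpriseTaskSuite()
print(len(suite.tasks))                    # 8 enterprise tasks
task = suite.get_task("data_transform")
print(task.name, task.complexity)          # CSV Data Transformation 3
print(task.expected_output)                # {'total_sales': 15000, 'avg_order': 750}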

class BaseAgent:
   def __init__(self, name: str):
       self.name = name


   def execute(self, task: Task) -> Dict[str, Any]:
       raise NotImplementedError


class RuleBasedAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.1, 0.3))
       if task.category == "data_processing":
           return {"total_sales": 15000 + random.randint(-500, 500),
                   "avg_order": 750 + random.randint(-50, 50)}
       elif task.category == "integration":
           return {"status": "success", "active_users": 1250}
       elif task.category == "automation":
           return {"validated": True, "processed": 98, "report_generated": True}
       else:
           return task.expected_output

We introduce the base agent interface and implement RuleBasedAgent, which uses predefined rules to mimic traditional automation logic. We simulate how such agents perform tasks deterministically while maintaining speed and reliability, providing a baseline for comparison with more advanced agents.
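
For illustration, a minimal sketch (assuming the suite and agent classes above) of running the rule-based agent on a single task:

suite = EnterpriseTaskSuite()
agent = RuleBasedAgent("Rule-Based Agent")
output = agent.execute(suite.get_task("api_integration"))
print(output)                              # {'status': 'success', 'active_users': 1250}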

class LLMAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.2, 0.5))
       accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
       result = {}
       for key, value in task.expected_output.items():
           # exclude booleans (bool is a subclass of int) so True/False are not perturbed
           if isinstance(value, (int, float)) and not isinstance(value, bool):
               variation = value * (1 - accuracy_boost)
               result[key] = value + random.uniform(-variation, variation)
           else:
               result[key] = value
       return result


class HybridAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       time.sleep(random.uniform(0.15, 0.35))
       if task.complexity >= 4:
           # For complex tasks, apply a small (±3%) LLM-style variation to numeric
           # fields, leaving booleans and strings untouched (assumed hybrid strategy).
           return {k: (v + random.uniform(-v * 0.03, v * 0.03)
                       if isinstance(v, (int, float)) and not isinstance(v, bool) else v)
                   for k, v in task.expected_output.items()}
       # For simpler tasks, fall back to the deterministic, rule-based output.
       return task.expected_output

We develop two intelligent agent types: LLMAgent, which simulates a reasoning-driven, LLM-based system, and HybridAgent, which combines rule-based determinism with LLM-style adaptability. We design these agents to demonstrate how learning-based approaches can improve task accuracy, especially for complex enterprise workflows.
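
As a quick, illustrative comparison (not part of the benchmark loop below), we can run both agents on the most complex task and inspect how far their numeric outputs drift from the expected values:

suite = EnterpriseTaskSuite()
task = suite.get_task("optimization")      # complexity 5
print("expected:", task.expected_output)
print("llm:     ", LLMAgent("LLM Agent").execute(task))
print("hybrid:  ", HybridAgent("Hybrid Agent").execute(task))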

class BenchmarkEngine:
   def __init__(self, task_suite: EnterpriseTaskSuite):
       self.task_suite = task_suite
       self.results: List[BenchmarkResult] = []


   def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
       print(f"n{'='*60}")
       print(f"Benchmarking Agent: {agent.name}")
       print(f"{'='*60}")
       for task in self.task_suite.tasks:
           print(f"nTask: {task.name} (Complexity: {task.complexity}/5)")
           for i in range(iterations):
               result = self._execute_task(agent, task, i+1)
               self.results.append(result)
               status = "βœ“ PASS" if result.success else "βœ— FAIL"
               print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")

Here, we build the core of the benchmarking engine that manages agent evaluation across the defined suite of tasks. We implement methods to run each agent multiple times per task, record the results, and measure key metrics such as execution time and accuracy. This creates a systematic and repeatable benchmarking cycle.
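
Because the engine only depends on the BaseAgent interface, plugging in a new agent is straightforward. As a sketch, a hypothetical EchoAgent (not part of the framework above) that simply returns the expected output could be benchmarked like this, assuming the full BenchmarkEngine (including the helper methods defined next) is in scope:

class EchoAgent(BaseAgent):
   def execute(self, task: Task) -> Dict[str, Any]:
       return task.expected_output         # perfect accuracy, near-zero latency

engine = BenchmarkEngine(EnterpriseTaskSuite())
engine.run_benchmark(EchoAgent("Echo Agent"), iterations=1)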

   def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
       start_time = time.time()
       try:
           output = agent.execute(task)
           execution_time = time.time() - start_time
           accuracy = self._calculate_accuracy(output, task.expected_output)
           success = accuracy >= 0.85
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                  execution_time=execution_time, accuracy=accuracy)
       except Exception as e:
           execution_time = time.time() - start_time
           return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                  execution_time=execution_time, accuracy=0.0, error_message=str(e))


   def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
       if not output:
           return 0.0
       scores = []
       for key, expected_val in expected.items():
           if key not in output:
               scores.append(0.0)
               continue
           actual_val = output[key]
           if isinstance(expected_val, bool):
               scores.append(1.0 if actual_val == expected_val else 0.0)
           elif isinstance(expected_val, (int, float)):
               diff = abs(actual_val - expected_val)
               tolerance = abs(expected_val * 0.1)
               score = max(0, 1 - (diff / (tolerance + 1e-9)))
               scores.append(score)
           else:
               scores.append(1.0 if actual_val == expected_val else 0.0)
       return np.mean(scores) if scores else 0.0

We define the task execution logic and accuracy calculation. We measure the performance of each agent by comparing its output to the expected results using a tolerance-based scoring mechanism. This step keeps the benchmarking process quantitative and fair, providing insight into how well each agent aligns with business expectations.
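
To make the 10% tolerance concrete, here is a small worked example (assuming the engine above): a numeric field off by 5 against a tolerance of 10 scores 0.5, a matching boolean scores 1.0, so the mean accuracy is 0.75.

engine = BenchmarkEngine(EnterpriseTaskSuite())
expected = {"revenue": 100, "valid": True}
output = {"revenue": 105, "valid": True}
# revenue: diff=5, tolerance=10 -> 1 - 5/10 = 0.5; valid: exact match -> 1.0
print(engine._calculate_accuracy(output, expected))   # ~0.75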

   def generate_report(self):
       df = pd.DataFrame([asdict(r) for r in self.results])
       print(f"n{'='*60}")
       print("BENCHMARK REPORT")
       print(f"{'='*60}n")
       for agent_name in df['agent_name'].unique():
           agent_df = df[df['agent_name'] == agent_name]
           print(f"{agent_name}:")
           print(f"  Success Rate: {agent_df['success'].mean():.1%}")
           print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
           print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}n")
       return df


   def visualize_results(self, df: pd.DataFrame):
       fig, axes = plt.subplots(2, 2, figsize=(14, 10))
       fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight="bold")
       success_rate = df.groupby('agent_name')['success'].mean()
       axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 0].set_title('Success Rate by Agent', fontweight="bold")
       axes[0, 0].set_ylabel('Success Rate')
       axes[0, 0].set_ylim(0, 1.1)
       for i, v in enumerate(success_rate.values):
           axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha="center", fontweight="bold")
       time_data = df.groupby('agent_name')['execution_time'].mean()
       axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
       axes[0, 1].set_title('Average Execution Time', fontweight="bold")
       axes[0, 1].set_ylabel('Time (seconds)')
       for i, v in enumerate(time_data.values):
           axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha="center", fontweight="bold")
       df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
       axes[1, 0].set_title('Accuracy Distribution', fontweight="bold")
       axes[1, 0].set_xlabel('Agent')
       axes[1, 0].set_ylabel('Accuracy')
       plt.sca(axes[1, 0])
       plt.xticks(rotation=15)
       task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
       df['complexity'] = df['task_id'].map(task_complexity)
       complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
       complexity_perf.plot(kind='line', ax=axes[1, 1], marker="o", linewidth=2)
       axes[1, 1].set_title('Accuracy by Task Complexity', fontweight="bold")
       axes[1, 1].set_xlabel('Task Complexity')
       axes[1, 1].set_ylabel('Accuracy')
       axes[1, 1].legend(title="Agent", loc="best")
       axes[1, 1].grid(True, alpha=0.3)
       plt.tight_layout()
       plt.show()


if __name__ == "__main__":
   print("Enterprise Software Benchmarking for Agentic Agents")
   print("="*60)
   task_suite = EnterpriseTaskSuite()
   benchmark = BenchmarkEngine(task_suite)
   agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
   for agent in agents:
       benchmark.run_benchmark(agent, iterations=3)
   results_df = benchmark.generate_report()
   benchmark.visualize_results(results_df)
   results_df.to_csv('agent_benchmark_results.csv', index=False)
   print("nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across task complexity levels. Finally, we export the results to a CSV file, completing the enterprise-level assessment workflow.
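
Because the results land in a plain CSV, they are easy to slice further in a separate session, for example (assuming the file produced above):

import pandas as pd

df = pd.read_csv('agent_benchmark_results.csv')
summary = df.groupby('agent_name').agg(
    success_rate=('success', 'mean'),
    avg_time=('execution_time', 'mean'),
    avg_accuracy=('accuracy', 'mean'),
)
print(summary.round(3))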

In summary, we implemented a robust and scalable benchmarking system that lets us measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures perform across levels of task complexity and how visual analytics highlight performance trends. This process allows us to evaluate existing agents and provides a solid foundation for the next generation of enterprise AI agents, optimized for reliability and intelligence.




