
How to Implement the LLM Arena-As-A-Judge Method to Evaluate Large Language Model Outputs

In this tutorial, we will explore how to implement the LLM Arena-As-A-Judge method to evaluate large language model outputs. Rather than assigning an isolated numeric score to each response, the method makes a direct pairwise comparison between outputs to determine which one is better, based on the criteria you define, such as helpfulness, clarity, or tone. Check out the full code here.

We will use OpenAI’s GPT-4.1 and Gemini 2.5 Pro to generate responses, and use GPT-5 as the judge to evaluate their outputs. For the demonstration, we will use a simple customer-support email scenario with the following context:

Dear Support,  
I ordered a wireless mouse last week, but I received a keyboard instead.  
Can you please resolve this as soon as possible?  
Thank you,  
John 

Install dependencies

pip install deepeval google-genai openai

For this tutorial, you will need API keys for both OpenAI and Google. Check out the full code here.

Since we are using DeepEval for evaluation, the OpenAI API key is required.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')
os.environ['GOOGLE_API_KEY'] = getpass('Enter Google API Key: ')

Define context

Next, we will define the context of the test case. In this example, we use a customer-support scenario in which a customer reports receiving the wrong item. We will create a context_email string containing the customer’s original message and then build a prompt that asks the model to write a reply based on that context. Check out the full code here.

from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

context_email = """
Dear Support,
I ordered a wireless mouse last week, but I received a keyboard instead. 
Can you please resolve this as soon as possible?
Thank you,
John
"""

prompt = f"""
{context_email}
--------

Q: Write a response to the customer email above.
"""

OpenAI Model Response

from openai import OpenAI
client = OpenAI()

def get_openai_response(prompt: str, model: str = "gpt-4.1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

openAI_response = get_openai_response(prompt=prompt)

Gemini Model Response

from google import genai
client = genai.Client()

def get_gemini_response(prompt, model="gemini-2.5-pro"):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text
geminiResponse = get_gemini_response(prompt=prompt)
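
Before comparing the two replies, it can help to print a short preview of each one to confirm that both API calls succeeded. This is an optional sanity check, not part of the evaluation itself; the small preview helper below is just an illustrative addition:

# Optional sanity check: print the first few lines of each generated reply
def preview(label: str, text: str, max_lines: int = 6) -> None:
    print(f"===== {label} =====")
    print("\n".join(text.strip().splitlines()[:max_lines]))
    print()

preview("GPT-4.1 response", openAI_response)
preview("Gemini 2.5 Pro response", geminiResponse)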

Define arena test cases

Here we set up the ArenaTestCase to compare the outputs of the two models – GPT-4.1 (labeled "GPT-4") and Gemini 2.5 Pro – prompted with the same input. Both models receive the same context_email, and their generated responses are stored in openAI_response and geminiResponse for evaluation. Check out the full code here.

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=openAI_response,
        ),
        "Gemini": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=geminiResponse,
        ),
    },
)

Set evaluation metrics

Here we define an ArenaGEval metric called “Support Email Quality”. The evaluation focuses on empathy, professionalism, and clarity, aiming to select the response that is understanding, polite, and succinct. The metric considers the context, the input, and the actual model output, and uses GPT-5 as the evaluator with verbose logging enabled for more detailed insights. Check out the full code here.

metric = ArenaGEval(
    name="Support Email Quality",
    criteria=(
        "Select the response that best balances empathy, professionalism, and clarity. "
        "It should sound understanding, polite, and be succinct."
    ),
    evaluation_params=[
        LLMTestCaseParams.CONTEXT,
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    model="gpt-5",  
    verbose_mode=True
)

Running evaluation

metric.measure(a_test_case)
**************************************************
Support Email Quality [Arena GEval] Verbose Logs
**************************************************
Criteria:
Select the response that best balances empathy, professionalism, and clarity. It should sound understanding, 
polite, and be succinct. 
 
Evaluation Steps:
[
    "From the Context and Input, identify the user's intent, needs, tone, and any constraints or specifics to be 
addressed.",
    "Verify the Actual Output directly responds to the Input, uses relevant details from the Context, and remains 
consistent with any constraints.",
    "Evaluate empathy: check whether the Actual Output acknowledges the user's situation/feelings from the 
Context/Input in a polite, understanding way.",
    "Evaluate professionalism and clarity: ensure respectful, blame-free tone and concise, easy-to-understand 
wording; choose the response that best balances empathy, professionalism, and succinct clarity."
] 
 
Winner: GPT-4
 
Reason: GPT-4 delivers a single, concise, and professional email that directly addresses the context (acknowledges 
receiving a keyboard instead of the ordered wireless mouse), apologizes, and clearly outlines next steps (send the 
correct mouse and provide return instructions) with a polite verification step (requesting a photo). This best 
matches the request to write a response and balances empathy and clarity. In contrast, Gemini includes multiple 
options with meta commentary, which dilutes focus and fails to provide one clear reply; while empathetic and 
detailed (e.g., acknowledging frustration and offering prepaid labels), the multi-option format and an over-assertive claim of already locating the order reduce professionalism and succinct clarity compared to GPT-4.
======================================================================
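
If you want to consume the result programmatically rather than reading the verbose log, the outcome can be read back from the metric object. This is a minimal sketch that assumes ArenaGEval exposes winner and reason attributes after measure(), as the verbose log above suggests; check the DeepEval documentation if the attribute names differ in your version:

# Hedged sketch: read the comparison result back from the metric object.
# Assumes ArenaGEval stores the winning contestant's name in `winner`
# and the judge's explanation in `reason` after calling measure().
print("Winner:", metric.winner)
print("Reason:", metric.reason)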

The evaluation results show that GPT-4.1 outperforms Gemini at generating a support email that balances empathy, professionalism, and clarity. GPT-4.1’s response stands out because it is concise, polite, and action-oriented: it acknowledges the shipping error, apologizes, and clearly explains the next steps to resolve the issue, such as sending the correct mouse and providing return instructions. The tone is respectful and understanding, fully aligned with the customer’s need for a clear and empathetic reply. By contrast, Gemini’s response, while empathetic and detailed, includes multiple response options and meta commentary, which reduces its clarity and professionalism. This result highlights GPT-4.1’s ability to deliver professional, considerate, customer-centric communication.
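
The same pattern extends to more than two contestants. As an illustrative (hypothetical) extension, you could generate a third reply – for example from gpt-4o, reusing the get_openai_response helper defined earlier – add it as another entry in the contestants dictionary, and re-run the same metric:

# Illustrative extension: compare three contestants with the same metric.
# The "GPT-4o" entry is hypothetical and reuses the helper defined earlier.
gpt4o_response = get_openai_response(prompt=prompt, model="gpt-4o")

three_way_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=openAI_response,
        ),
        "Gemini": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=geminiResponse,
        ),
        "GPT-4o": LLMTestCase(
            input="Write a response to the customer email above.",
            context=[context_email],
            actual_output=gpt4o_response,
        ),
    },
)

metric.measure(three_way_case)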




I am a civil engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in data science, especially neural networks and their applications in various fields.
