Getting started with MLflow for LLM evaluation

MLflow is a powerful open-source platform for managing the machine learning lifecycle. Traditionally it is used to track model experiments, log parameters, and manage deployments, but MLflow recently introduced support for evaluating large language models (LLMs).
In this tutorial, we explore how to evaluate the performance of an LLM (in our case, Google's Gemini model) on a set of fact-based prompts. We will use Gemini to generate responses to these prompts and evaluate their quality using several metrics supported directly by MLflow.
Set up dependencies
For this tutorial, we will use both the OpenAI and Gemini APIs. MLflow's built-in generative AI evaluation metrics currently rely on OpenAI models (such as GPT-4) to act as judges for metrics like answer similarity or faithfulness, so an OpenAI API key is required in addition to the Google API key. Make sure you have both keys ready before proceeding.
Install the libraries
pip install mlflow openai pandas google-genai
Set OpenAI and Google API keys as environment variables
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')
Prepare the evaluation data and get the output from Gemini
import mlflow
import openai
import os
import pandas as pd
from google import genai
Create evaluation data
In this step, we define a small evaluation dataset containing factual prompts and their correct ground truth answers. These prompts cover topics such as science, health, web development, and programming. This structured format allows us to objectively compare the Gemini-generated responses with known correct answers using MLflow's evaluation metrics.
eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)
eval_data
Generate responses from Gemini
This code block defines a helper function gemini_completion() that uses the Google Generative AI SDK to send a prompt to the Gemini 1.5 Flash model and return the generated response as plain text. We then apply this function to each prompt in the evaluation dataset and store the model's predictions in a new "predictions" column. These predictions will later be evaluated against the ground truth answers.
client = genai.Client()

def gemini_completion(prompt: str) -> str:
    # Send the prompt to Gemini 1.5 Flash and return the response as plain text
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data
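If you scale this beyond a handful of prompts, individual Gemini API calls can occasionally fail or hit rate limits. As a purely illustrative sketch (not part of the original tutorial), a small retry wrapper around gemini_completion might look like this; the function name, retry count, and delay are assumptions:

import time

def gemini_completion_with_retry(prompt: str, retries: int = 3, delay: float = 2.0) -> str:
    # Illustrative helper: retry the Gemini call a few times before giving up,
    # pausing between attempts to ride out transient errors or rate limits
    for attempt in range(retries):
        try:
            return gemini_completion(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)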
Evaluate Gemini output with MLflow
In this step, we start an MLflow run to evaluate the Gemini model's responses to our set of fact-based prompts. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measures the semantic similarity between the model's output and the ground truth), exact_match (checks for a verbatim match with the ground truth), latency (tracks how long each response takes to generate), and token_count (records the number of output tokens).
It is important to note that the answer_similarity metric internally uses an OpenAI GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to evaluate LLM output without writing custom evaluation logic. The final evaluation results are printed and saved to a CSV file for later inspection or visualization.
mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )
    print("Aggregated Metrics:")
    print(results.metrics)

    # Save detailed per-row results to CSV
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
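As a side note, MLflow's LLM-judged metrics let you choose which judge model to use. The sketch below assumes MLflow's "openai:/gpt-4" model URI convention for the genai metrics and uses mlflow.log_artifact() to attach the results CSV to the run; exact parameter support may vary by MLflow version.

# Sketch (assumptions noted above): pin the judge model for answer_similarity
# and keep the results table attached to the run as an artifact
similarity_judged_by_gpt4 = mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4")

with mlflow.start_run(run_name="gemini-eval-pinned-judge"):
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[similarity_judged_by_gpt4],
    )
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
    mlflow.log_artifact("gemini_eval_results.csv")  # store the CSV alongside the run's metrics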
To view the detailed evaluation results, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This lets us inspect individual prompts, Gemini-generated predictions, ground truth answers, and the associated metric scores without truncation, which is especially useful in notebook environments like Colab or Jupyter.
results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
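From here you can also slice the table programmatically, for example to surface the lowest-scoring answers for manual review. The column name used below ("answer_similarity/v1/score") follows the metric/version naming MLflow used at the time of writing and is an assumption; check results.columns if it differs in your version.

# Sketch: pull out the lowest-scoring rows for a closer look
score_col = "answer_similarity/v1/score"  # assumed column name; may vary by MLflow version
if score_col in results.columns:
    cols = [c for c in ["inputs", "predictions", "ground_truth", score_col] if c in results.columns]
    print(results.sort_values(score_col).head(3)[cols])
else:
    print("Score column not found; inspect results.columns for the exact names.")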
