Use Opik to build fully tracked and measured local LLM pipelines for transparent, measurable, and repeatable AI workflows

In this tutorial, we implement a complete workflow for building, tracking, and evaluating LLM pipelines with Opik. We build the system step by step: loading a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we walk through each snippet, we see how Opik tracks each function as a span, visualizes the pipeline's behavior, and measures output quality with clear, repeatable metrics. By the end, we have a fully instrumented QA system that we can easily scale, compare, and monitor.

!pip install -q opik transformers accelerate torch


import torch
from transformers import pipeline
import textwrap


import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio


device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")


opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up the environment by installing the required libraries and initializing Opik. We load the core modules, detect the available device, and configure the project name so that every trace flows into the correct workspace. This lays the foundation for the rest of the tutorial.
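
If we run this notebook non-interactively, we can skip the configuration prompt by setting credentials up front. This is a minimal sketch, assuming Opik reads the OPIK_API_KEY, OPIK_WORKSPACE, and OPIK_PROJECT_NAME environment variables; the placeholder values are illustrative and must be replaced.

import os

# Assumed environment variables for non-interactive setup; placeholders only.
os.environ["OPIK_API_KEY"] = "<your-comet-api-key>"      # placeholder
os.environ["OPIK_WORKSPACE"] = "<your-workspace-name>"   # placeholder
os.environ["OPIK_PROJECT_NAME"] = PROJECT_NAME

opik.configure()  # with the variables set, this should not prompt for input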

llm = pipeline(
   "text-generation",
   model="distilgpt2",
   device=device,
)


def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
   result = llm(
       prompt,
       max_new_tokens=max_new_tokens,
       do_sample=True,
       temperature=0.3,
       pad_token_id=llm.tokenizer.eos_token_id,
   )[0]["generated_text"]
   return result[len(prompt):].strip()

We load a lightweight Hugging Face model and write a small helper function that generates text and strips the prompt from the output. Generation runs locally, with no external APIs, giving us a reliable and repeatable generation layer for the rest of the pipeline.
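
For debugging, we can sanity-check the helper and optionally switch to greedy decoding so that repeated runs produce the same text. This is a small sketch built on the same pipeline; hf_generate_greedy is a name we introduce here, not part of the original tutorial code.

# Optional greedy-decoding variant of the helper (do_sample=False) for more
# repeatable outputs, plus a quick sanity check of hf_generate.
def hf_generate_greedy(prompt: str, max_new_tokens: int = 80) -> str:
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding: same prompt -> same completion
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    return result[len(prompt):].strip()


print(hf_generate("Opik is an open-source platform for", max_new_tokens=30))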

plan_prompt = Prompt(
   name="hf_plan_prompt",
   prompt=textwrap.dedent("""
       You are an assistant that creates a plan to answer a question
       using ONLY the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Return exactly 3 bullet points as a plan.
   """).strip(),
)


answer_prompt = Prompt(
   name="hf_answer_prompt",
   prompt=textwrap.dedent("""
       You answer based only on the given context.


       Context:
       {{context}}


       Question:
       {{question}}


       Plan:
       {{plan}}


        Answer the question in 2–4 concise sentences.
   """).strip(),
)

We define two structured prompts using Opik's Prompt class, separating the planning and answering phases into clear templates. This keeps the prompts consistent and lets us observe how structured instructions shape model behavior.
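
Because the templates use {{placeholder}} syntax, we can preview a rendered prompt before it ever reaches the model. A short sketch with illustrative inputs:

# Preview the rendered planning prompt; the context and question here are
# toy values used only to show how the {{...}} placeholders are substituted.
preview = plan_prompt.format(
    context="Opik logs traces, spans, and token usage for LLM calls.",
    question="What does Opik log?",
)
print(preview)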

DOCS = {
   "overview": """
       Opik is an open-source platform for debugging, evaluating,
       and monitoring LLM and RAG applications. It provides tracing,
       datasets, experiments, and evaluation metrics.
   """,
   "tracing": """
       Tracing in Opik logs nested spans, LLM calls, token usage,
       feedback scores, and metadata to inspect complex LLM pipelines.
   """,
   "evaluation": """
       Opik evaluations are defined by datasets, evaluation tasks,
       scoring metrics, and experiments that aggregate scores,
       helping detect regressions or issues.
   """,
}


@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
   q = question.lower()
   if "trace" in q or "span" in q:
       return DOCS["tracing"]
   if "metric" in q or "dataset" in q or "evaluate" in q:
       return DOCS["evaluation"]
   return DOCS["overview"]

We build a small document store and an Opik-tracked retrieval function that serves as a tool, letting the pipeline choose a context based on keywords in the user's question. This simulates a minimal RAG-style workflow without needing an actual vector database.
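
Before wiring the tool into the full pipeline, we can quickly verify the keyword routing with a few illustrative questions:

# Quick routing check: each sample question should map to the intended DOCS
# entry (tracing, evaluation, or the overview fallback).
for q in [
    "How are spans logged?",          # expected: tracing
    "Which metrics can I evaluate?",  # expected: evaluation
    "What is Opik?",                  # expected: overview
]:
    first_line = retrieve_context(q).strip().splitlines()[0]
    print(f"{q} -> {first_line}")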

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
   rendered = plan_prompt.format(context=context, question=question)
   return hf_generate(rendered, max_new_tokens=80)


@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
   rendered = answer_prompt.format(
       context=context,
       question=question,
       plan=plan,
   )
   return hf_generate(rendered, max_new_tokens=120)


@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
   context = retrieve_context(question)
   plan = plan_answer(context, question)
   answer = answer_from_plan(context, question, plan)
   return answer


print("Sample answer:n", qa_pipeline("What does Opik help developers do?"))

We combine retrieval, planning, and answering into a fully traceable LLM pipeline. Each step is captured with Opik's @track decorator, so we can inspect the nested spans in the dashboard. A quick test call confirms that all the components work together.
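
To produce a trace for each retrieval branch, we can run the full pipeline over a few more questions; each call should appear as a nested trace (retrieve, plan, answer) in the dashboard. The questions below mirror the topics in DOCS and are illustrative.

# Exercise every retrieval branch so each one shows up as its own trace.
for question in [
    "What does tracing in Opik log?",
    "What are the components of an Opik evaluation?",
    "What kind of platform is Opik?",
]:
    print(question)
    print(qa_pipeline(question))
    print("-" * 40)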

client = Opik()


dataset = client.get_or_create_dataset(
   name="HF_Opik_QA_Dataset",
   description="Small QA dataset for HF + Opik tutorial",
)


dataset.insert([
   {
       "question": "What kind of platform is Opik?",
       "context": DOCS["overview"],
       "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
   },
   {
       "question": "What does tracing in Opik log?",
       "context": DOCS["tracing"],
       "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
   },
   {
       "question": "What are the components of an Opik evaluation?",
       "context": DOCS["evaluation"],
       "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
   },
])

We create and populate a dataset in Opik for the evaluation. We insert several question, context, and reference entries covering different aspects of Opik; this dataset serves as the ground truth for the QA evaluation that follows.
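
If we want broader coverage later, we can keep growing the dataset with the same insert() call. The extra item below is an illustrative example added for this sketch, not part of the original dataset.

# Illustrative only: extend the dataset with one more item using the same
# insert() pattern (this question/reference pair is an assumption).
dataset.insert([
    {
        "question": "Which applications does Opik help debug and monitor?",
        "context": DOCS["overview"],
        "reference": "Opik helps debug, evaluate, and monitor LLM and RAG applications.",
    },
])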

equals_metric = Equals()
lev_metric = LevenshteinRatio()


def evaluation_task(item: dict) -> dict:
   output = qa_pipeline(item["question"])
   return {
       "output": output,
       "reference": item["reference"],
   }

We define the evaluation task and choose two metrics, Equals and LevenshteinRatio, to measure output quality. The task runs the pipeline on each question and returns the output and reference in exactly the format the scorers expect, connecting our pipeline to Opik's evaluation engine.
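
To see what the scorers return before launching a full experiment, we can score a single item by hand. This sketch assumes the heuristic metrics expose a score(output=..., reference=...) method that returns a result with a value field.

# Score one pipeline output manually (assumed score() signature and .value field).
sample = evaluation_task({
    "question": "What kind of platform is Opik?",
    "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
})
print("Equals:", equals_metric.score(output=sample["output"], reference=sample["reference"]).value)
print("LevenshteinRatio:", lev_metric.score(output=sample["output"], reference=sample["reference"]).value)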

evaluation_result = evaluate(
   dataset=dataset,
   task=evaluation_task,
   scoring_metrics=[equals_metric, lev_metric],
   experiment_name="HF_Opik_QA_Experiment",
   project_name=PROJECT_NAME,
   task_threads=1,
)


print("nExperiment URL:", evaluation_result.experiment_url)

We run the experiment with Opik's evaluate function, setting task_threads=1 so the items execute sequentially, which keeps things stable in Colab. Once the run completes, we get a link to inspect the experiment details in the Opik dashboard.

agg = evaluation_result.aggregate_evaluation_scores()


print("nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
   print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how the pipeline performs. By inspecting the metric results, we see how closely the outputs match the references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In summary, we have built a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We see how traces, prompts, datasets, and metrics come together to make the model's behavior transparent. Once the evaluation completes and we review the aggregated scores, we can appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured, reliable way.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an AI media platform that stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easy to understand for a broad audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.

