Optimize LLM usage with RouteLLM
RouteLLM is a flexible framework for serving and evaluating LLM routers, designed to maximize performance while minimizing cost.
Key Features:
- Seamless integration – acts as a drop-in replacement for the OpenAI client or runs as an OpenAI-compatible server, intelligently routing simpler queries to cheaper models (see the sketch after this list).
- Pre-trained routers out of the box – proven to reduce costs by 85% while retaining 95% of GPT-4 performance on widely used benchmarks such as MT Bench.
- Cost-effective excellence – Matches the performance of leading commercial products while being over 40% cheaper.
- Scalable and customizable – Easily add new routers, fine-tune thresholds, and compare performance on multiple benchmarks.
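To illustrate the drop-in claim, here is a minimal sketch of how a standard OpenAI client could talk to a locally running RouteLLM OpenAI-compatible server. The base URL, port, and placeholder API key are assumptions for illustration only; the rest of this tutorial uses the Python Controller directly instead.
from openai import OpenAI
# Assumes a RouteLLM OpenAI-compatible server is already running locally;
# the host/port below are placeholders, not something this tutorial starts for you.
client = OpenAI(
    base_url="http://localhost:6060/v1",   # hypothetical address of the local RouteLLM server
    api_key="not-needed-locally"           # placeholder; provider keys live on the server side
)
response = client.chat.completions.create(
    model="router-mf-0.24034",             # router name plus calibrated threshold (explained later)
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.choices[0].message.content)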
In this tutorial, we will look at how to:
- Load and use a pretrained router.
- Calibrate it for your own use cases.
- Test routing behavior on different types of prompts.
Install dependencies
!pip install "routellm[serve,eval]"
Load the OpenAI API key
To obtain an OpenAI API key, visit the OpenAI platform's API keys page and generate a new key. If you are a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.
RouteLLM uses LiteLLM to handle chat completions for a wide range of open-source and closed-source models. If you want to use a different model, consult the LiteLLM provider list.
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
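Because RouteLLM delegates chat completions to LiteLLM, keys for other providers are normally supplied the same way, as environment variables. This step is optional for the tutorial; the variable names below follow common LiteLLM conventions and are shown only as an example, reusing the os and getpass imports above.
# Optional: only needed if you want to route to models from these providers.
# Variable names follow the usual LiteLLM conventions (assumed, not required here).
os.environ['ANTHROPIC_API_KEY'] = getpass('Enter Anthropic API Key (optional): ')
os.environ['GROQ_API_KEY'] = getpass('Enter Groq API Key (optional): ')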
Download the configuration file
RouteLLM uses a configuration file to locate trained router checkpoints and the datasets they were trained on. This file tells the system where to find the model that decides whether to send a query to the strong or the weak model.
Do I need to edit it?
For most users – no. The default configuration already points to trained routers (MF, BERT, CAUSAL_LLM). You only need to change it if you intend to:
- Train your own router on a custom dataset.
- Completely replace the routing algorithm with a new one.
For this tutorial, we will keep the configuration as is and simply:
- Set our strong and weak model names in the code.
- Add our API keys for the chosen providers.
- Use threshold calibration to balance cost and quality.
!wget https://raw.githubusercontent.com/lm-sys/RouteLLM/main/config.example.yaml  # assumed location of config.example.yaml in the official RouteLLM repository
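To see what the downloaded file contains, you can load it with PyYAML. This is a quick inspection sketch; the router names in the comment reflect the routers RouteLLM ships, but check them against your copy of the file.
import yaml  # PyYAML; install with `pip install pyyaml` if it is not already present
with open("config.example.yaml") as f:
    config = yaml.safe_load(f)
# Each top-level key corresponds to a router (e.g., 'mf', 'bert', 'causal_llm')
# and points to its pretrained checkpoint or training datasets.
print(list(config.keys()))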
Initialize the RouteLLM controller
In this code block, we import the necessary libraries and initialize the RouteLLM Controller, which will manage how prompts are routed between models. We specify routers=["mf"] to use the matrix factorization (MF) router, a pretrained decision model that predicts whether a query should be sent to the strong or the weak model.
The strong_model parameter is set to "gpt-5", a high-quality but more expensive model, while the weak_model parameter is set to "o4-mini", a faster and cheaper option. For each incoming prompt, the router scores its complexity against the threshold and automatically selects the most cost-effective option – simple tasks are handled by the cheaper model, while more challenging prompts are sent to the stronger one.
This configuration lets you balance cost efficiency and response quality without manual intervention.
import os
import pandas as pd
from routellm.controller import Controller
client = Controller(
    routers=["mf"],        # matrix factorization (MF) router
    strong_model="gpt-5",  # high-quality, more expensive model
    weak_model="o4-mini"   # faster, cheaper model
)
!python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.1 --config config.example.yaml
This command runs threshold calibration for RouteLLM's matrix factorization (MF) router. The --strong-model-pct 0.1 parameter tells the system to route approximately 10% of queries to the strong model (and the rest to the weak model).
The --config config.example.yaml flag supplies the model and router settings used for calibration. The output reports:
For 10% strong model calls using MF, the optimal threshold is 0.24034.
This means that any query whose router-assigned complexity (win rate) score is above 0.24034 will be sent to the strong model, while queries scoring below it go to the weak model, matching the cost-quality tradeoff you asked for.
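Conceptually, calibration just fixes the decision rule applied to each prompt's predicted win rate. Below is a minimal illustrative sketch of that rule; the real decision happens inside the RouteLLM Controller.
threshold = 0.24034  # value reported by the calibration command above
def choose_model(win_rate: float) -> str:
    # Prompts the router scores at or above the threshold go to the strong model.
    return "gpt-5 (strong)" if win_rate >= threshold else "o4-mini (weak)"
print(choose_model(0.303087))  # -> gpt-5 (strong)
print(choose_model(0.10))      # -> o4-mini (weak)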
Define the threshold and prompt variables
Here we define a set of test prompts designed to cover a range of complexity levels. They include simple factual questions (likely routed to the weak model), medium reasoning tasks (borderline threshold cases), high-complexity or creative requests (better suited to the strong model), and code generation tasks to test technical capability.
threshold = 0.24034

prompts = [
    # Easy factual (likely weak model)
    "Who wrote the novel 'Pride and Prejudice'?",
    "What is the largest planet in our solar system?",

    # Medium reasoning (borderline cases)
    "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?",
    "Explain why the sky appears blue during the day and red/orange during sunset.",

    # High complexity / creative (likely strong model)
    "Write a 6-line rap verse about climate change using internal rhyme.",
    "Summarize the differences between supervised, unsupervised, and reinforcement learning with examples.",

    # Code generation
    "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces.",
    "Generate SQL to find the top 3 highest-paying customers from a 'sales' table."
]
Evaluate the win rate
The following code uses the MF router to calculate the win rate for each test prompt, i.e., the predicted probability that the strong model's answer beats the weak model's.
Based on the calibrated threshold of 0.24034, two prompts exceed it and are routed to the strong model:
- "If a train leaves at 3 PM and travels 60 km/h, how far will it travel by 6:30 PM?" (0.303087)
- "Write a Python function to check if a given string is a palindrome, ignoring punctuation and spaces." (0.272534)
All other prompts remain below the threshold, meaning they will be served by the weaker, cheaper model.
win_rates = client.batch_calculate_win_rate(prompts=pd.Series(prompts), router="mf")

# Store results in a DataFrame
_df = pd.DataFrame({
    "Prompt": prompts,
    "Win_Rate": win_rates
})

# Show full text without truncation
pd.set_option('display.max_colwidth', None)
_df
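As a quick sanity check, you can annotate the same DataFrame with the route each prompt would take under the calibrated threshold (a small addition that reuses the threshold variable defined earlier):
# Label each prompt with the model tier its win rate implies.
_df["Routed_To"] = _df["Win_Rate"].apply(
    lambda w: "strong (gpt-5)" if w >= threshold else "weak (o4-mini)"
)
print(_df.sort_values("Win_Rate", ascending=False))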
These results also help fine-tune the routing strategy – by analyzing the win-rate distribution, we can adjust the threshold to better balance cost savings and performance.
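For example, if too few prompts reach the strong model for your quality needs, re-run the earlier calibration command with a larger strong-model share and adopt the new threshold it reports (only --strong-model-pct changes):
!python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.3 --config config.example.yaml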
Routing prompts via the calibrated matrix factorization (MF) router
This code iterates over the list of test prompts and sends each one to the RouteLLM Controller, using the model name router-mf-{threshold} to apply the calibrated threshold.
For each prompt, the router decides whether to use the strong or the weak model based on the calculated win rate.
The response includes the generated output and the actual model the router selected.
These details (prompt, model used, and generated output) are stored in the results list for later analysis.
results = []

for prompt in prompts:
    response = client.chat.completions.create(
        model=f"router-mf-{threshold}",
        messages=[{"role": "user", "content": prompt}]
    )
    message = response.choices[0].message["content"]
    model_used = response.model  # RouteLLM returns the model actually used
    results.append({
        "Prompt": prompt,
        "Model Used": model_used,
        "Output": message
    })

df = pd.DataFrame(results)
In the results, the train-distance prompt and the palindrome prompt (prompts 2 and 6, counting from zero) exceed the threshold win rate and are therefore routed to gpt-5, while the rest are handled by the weaker o4-mini model.
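You can confirm this split programmatically by counting how many prompts each model served in the df built above:
# Quick summary of how the calibrated router split the workload.
print(df["Model Used"].value_counts())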
