Using LitServe to Build Advanced Multi-Endpoint Machine Learning APIs: Batch Processing, Streaming, Caching, and Local Inference
In this tutorial we explore LitServe, a lightweight yet powerful serving framework that lets us deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints to demonstrate real-world functionality such as text generation, batch processing, streaming, multitasking, and caching, all running locally without relying on external APIs. By the end, we have a clear understanding of how to design flexible machine learning serving pipelines that are both efficient and easy to scale for production-grade applications. Check out the full code here.
!pip install litserve torch transformers -q
import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List
We first set up the environment on Google Colab and install all required dependencies, including LitServe, PyTorch, and Transformers. We then import the libraries and modules needed to define, serve, and test our APIs. Check out the full code here.
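As an optional sanity check (not part of the original notebook), one can confirm the installed package versions and whether a GPU is visible before wiring up the APIs:

from importlib.metadata import version
import torch

# Print installed versions and GPU availability before serving anything
print("litserve:", version("litserve"))
print("transformers:", version("transformers"))
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())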
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load DistilGPT2 once per worker; fall back to CPU if CUDA is unavailable
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}
class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        # Incoming requests are grouped into one list so the pipeline can process them together
        return inputs

    def predict(self, batch: List[str]):
        return self.model(batch)

    def unbatch(self, output):
        # Split the batched predictions back into one result per request
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
Here, we create two LitServe APIs: one for text generation with the lightweight DistilGPT2 model and another for batched sentiment analysis. We define how each API decodes incoming requests, runs inference, and returns structured responses, showing how easy it is to build scalable, reusable model-serving endpoints. Check out the full code here.
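Although the rest of this tutorial exercises the APIs in-process, the same classes could be exposed over HTTP with LitServe's LitServer. The snippet below is a sketch of that pattern; the port, batch size, and timeout values are illustrative assumptions, not part of the original notebook.

# Sketch: serving one of the APIs above as a real HTTP endpoint (values are illustrative).
# max_batch_size/batch_timeout enable dynamic batching for the sentiment API.
if __name__ == "__main__":
    server = ls.LitServer(
        BatchedSentimentAPI(),
        accelerator="auto",   # pick a GPU if available, otherwise CPU
        max_batch_size=8,     # assumed batch size
        batch_timeout=0.05,   # wait up to 50 ms to fill a batch
    )
    server.run(port=8000)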
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Simulate token-by-token generation by yielding one word at a time
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}
In this section, we design a streaming text-generation API that emits tokens as they are produced. We simulate a real-time stream by yielding one word at a time, demonstrating how LitServe can handle continuous token generation efficiently. Check out the full code here.
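To serve this generator over HTTP, LitServe needs to be told that the endpoint streams. The sketch below, with an assumed port and a hypothetical client, shows one way this could look: the client iterates over the response as chunks arrive instead of waiting for the full body.

# Server side (sketch): enable streaming responses for the generator API.
# server = ls.LitServer(StreamingTextAPI(), stream=True)
# server.run(port=8001)

# Client side (sketch): consume the stream chunk by chunk with `requests`.
import requests

with requests.post(
    "http://localhost:8001/predict",   # assumed host/port
    json={"prompt": "Once upon a time"},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)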
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # Reconstructed branch (the original snippet was truncated here):
            # return very short inputs as-is, otherwise run the summarizer
            if len(text.split()) < 30:
                return {"task": "summarize", "result": text}
            result = self.summarizer(text, max_length=60, min_length=10, do_sample=False)[0]
            return {"task": "summarize", "result": result["summary_text"]}
        return {"task": task, "result": "Unsupported task"}

    def encode_response(self, output):
        # Assumed pass-through: predict already returns a JSON-serializable dict
        return output
We now build a multi-task API that handles both sentiment analysis and summarization through a single endpoint. The snippet shows how to manage multiple model pipelines behind a unified interface, dynamically routing each request to the appropriate pipeline based on the requested task. Check out the full code here.
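Once served, the same /predict route could handle both tasks. The request bodies below illustrate the routing; the URL and port are assumptions for this sketch.

# Sketch: the "task" field decides which pipeline handles each request.
import requests

url = "http://localhost:8000/predict"  # assumed endpoint

sentiment_resp = requests.post(url, json={"task": "sentiment", "text": "Amazing tutorial!"})
summary_resp = requests.post(url, json={"task": "summarize", "text": "LitServe is a lightweight serving framework that ..."})

print(sentiment_resp.json())
print(summary_resp.json())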
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        # Return the cached result when the exact text has been seen before
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }
We then implement an API that caches previous inference results, avoiding redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how a simple caching layer can significantly improve performance in repetitive inference scenarios. Check out the full code here.
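One caveat of the plain dictionary cache above is that it grows without bound. A small refinement, sketched here under the assumption that evicting the least recently used entry is acceptable, caps memory use with an OrderedDict; this helper is illustrative and not part of the original notebook.

from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache for inference results (illustrative)."""
    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)        # mark as recently used
        return self._store[key]

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the oldest entry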
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    # Text generation
    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    # Batched sentiment analysis
    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    # Multi-task routing
    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    # Caching
    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("✅ All tests completed successfully!")
    print("=" * 70)

test_apis_locally()
We test all of the APIs locally to verify correctness and performance without starting an external server. We exercise text generation, batched sentiment analysis, multi-task routing, and caching to confirm that every component of the LitServe setup runs smoothly and efficiently.
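To put a rough number on the caching benefit observed above, a small timing check like the following (not part of the original notebook) could compare the first, uncached call against a repeated one using the CachedAPI defined earlier:

# Illustrative micro-benchmark: cache miss vs. cache hit latency.
api = CachedAPI(); api.setup("cpu")
text = "LitServe is awesome!"

t0 = time.perf_counter(); api.predict(text); t1 = time.perf_counter()
t2 = time.perf_counter(); api.predict(text); t3 = time.perf_counter()

print(f"first call (miss): {t1 - t0:.3f}s, repeat call (hit): {t3 - t2:.6f}s")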
In summary, we build and run several APIs that demonstrate the framework's versatility. We experiment with text generation, sentiment analysis, multitasking, and caching, and see how seamlessly LitServe integrates with Hugging Face pipelines. Completing this tutorial, we see how LitServe simplifies the model deployment workflow, letting us serve intelligent machine learning systems with only a few lines of Python while maintaining flexibility, performance, and simplicity.
Check out the full code here. Feel free to visit our GitHub page for tutorials, code, and notebooks. You can also follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter. If you use Telegram, you can join us there as well.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is technically sound yet easy for a broad audience to understand. The platform draws more than 2 million monthly views, reflecting its popularity among readers.