How to use semantic LLM caching to reduce cost and latency in RAG applications

Semantic caching in LLM (Large Language Model) applications optimizes performance by storing and reusing responses based on semantic similarity rather than exact text matching. When a new query arrives, it is converted into an embedding and compared with cached query embeddings using a similarity search. If a close match is found (above a similarity threshold), the cached response is returned immediately, skipping the expensive retrieval and generation process. Otherwise, the full RAG pipeline is run and the new query-response pair is added to the cache for future use.

In a RAG setting, the semantic cache typically saves only the responses to the questions actually asked, rather than every possible query. This helps reduce the latency and API cost of duplicate or slightly reworded questions. In this article, we’ll walk through a short example that demonstrates how caching can significantly reduce cost and response time in an LLM-based application.

Semantic caching works by storing and retrieving responses based on the meaning of a user query rather than its exact wording. Each incoming query is converted into a vector embedding representing its semantic content. The system then performs a similarity search, typically using approximate nearest neighbor (ANN) techniques, to compare this embedding with embeddings already stored in the cache.

If a sufficiently similar query-response pair exists (i.e., its similarity score exceeds a defined threshold), the cached response is returned immediately, bypassing the expensive retrieval and generation steps. Otherwise, the full RAG pipeline is executed: the relevant documents are retrieved and a new answer is generated, which is then stored in the cache for future use.
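As a rough sketch of what that similarity lookup can look like with a vector index (this is an illustration only; the tutorial below simply scans a Python list, and the faiss dependency is an assumption), a FAISS inner-product index over L2-normalized embeddings gives cosine-similarity search:

import numpy as np
import faiss  # assumed extra dependency: pip install faiss-cpu

dim = 1536                      # dimensionality of text-embedding-3-small vectors
index = faiss.IndexFlatIP(dim)  # exact inner-product index; ANN indexes (e.g. HNSW) are used the same way
cached_responses = []           # responses, aligned with the rows added to the index

def add_to_cache(embedding, response):
    vec = np.asarray(embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(vec)     # on unit vectors, inner product equals cosine similarity
    index.add(vec)
    cached_responses.append(response)

def lookup(embedding, threshold=0.85):
    if index.ntotal == 0:
        return None
    vec = np.asarray(embedding, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(vec)
    scores, ids = index.search(vec, 1)   # nearest cached query embedding
    if scores[0][0] >= threshold:
        return cached_responses[ids[0][0]]
    return None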

In a RAG application, the semantic cache only stores responses for queries that are actually processed by the system – not every possible question is cached in advance. Each query that reaches the LLM and produces an answer can create a cache entry containing the query’s embedding and the corresponding response.

Depending on the design of the system, the cache may store only the final LLM output, the retrieved documents, or both. To keep the cache efficient, entries are managed through policies such as time-to-live (TTL) expiration or least-recently-used (LRU) eviction, ensuring that only recent or frequently accessed queries remain in memory over time. A minimal sketch of such an eviction policy is shown below.
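As a hedged sketch of how TTL and LRU policies might be combined (the CacheStore class and its parameters are illustrative, not part of the tutorial’s code), an OrderedDict is enough for a small in-memory store:

import time
from collections import OrderedDict

class CacheStore:
    """Tiny TTL + LRU store for cache entries (e.g. embedding/response pairs)."""
    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()   # key -> (timestamp, value)

    def put(self, key, value):
        self._entries[key] = (time.time(), value)
        self._entries.move_to_end(key)              # mark as most recently used
        if len(self._entries) > self.max_entries:   # LRU eviction of the oldest entry
            self._entries.popitem(last=False)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        timestamp, value = item
        if time.time() - timestamp > self.ttl_seconds:   # TTL expiration
            del self._entries[key]
            return None
        self._entries.move_to_end(key)              # refresh recency on a hit
        return value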

Install dependencies
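The walkthrough only needs the openai and numpy packages (an assumption based on the imports used below); they can be installed with pip:

pip install openai numpy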

Set the OpenAI API key

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

In this tutorial we will use OpenAI, but you can use any LLM provider.

from openai import OpenAI
client = OpenAI()

Run repeated queries without caching

In this section, we run the same query 10 times directly through the GPT-4.1 model and observe how long it takes when no caching mechanism is applied. Each call triggers a complete model invocation and response generation, resulting in repeated processing of the same input.

This helps establish a baseline of total time and cost before we implement semantic caching in the next section.

import time

def ask_gpt(query):
    # Send the query to GPT-4.1 and time the full round trip
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    return response.output[0].content[0].text, end - start

query = "Explain the concept of semantic caching in just 2 lines."
total_time = 0

for i in range(10):
    _, duration = ask_gpt(query)
    total_time += duration
    print(f"Run {i+1} took {duration:.2f} seconds")

print(f"\nTotal time for 10 runs: {total_time:.2f} seconds")

Even though the query is identical each time, each call still takes 1-3 seconds, resulting in a total of roughly 22 seconds for 10 runs. This inefficiency highlights why semantic caching is so valuable – it allows us to reuse previous responses for semantically identical queries and save both time and API costs.

Implement semantic caching for faster responses

In this section, we enhance the previous setup by introducing semantic caching, which allows our application to reuse responses for semantically similar queries instead of making repeated calls to the GPT-4.1 API.

Here’s how it works: Each incoming query is converted into a vector embedding using the text-embedding-3-small model. This embedding captures the semantics of the text. When a new query arrives, we compute its cosine similarity to the embeddings already stored in the cache. If a match is found with a similarity score higher than a defined threshold (e.g. 0.85), the system returns the cached response immediately, thus avoiding another API call.

If no sufficiently similar query exists in the cache, the model generates a new response, which is then stored along with its embedding for future use. Over time, this approach can significantly reduce response times and API costs, especially for frequently asked or rephrased queries.

import numpy as np
from numpy.linalg import norm
semantic_cache = []

def get_embedding(text):
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(emb.data[0].embedding)
def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

def ask_gpt_with_cache(query, threshold=0.85):
    query_embedding = get_embedding(query)
    
    # Check similarity with existing cache
    for cached_query, cached_emb, cached_resp in semantic_cache:
        sim = cosine_similarity(query_embedding, cached_emb)
        if sim > threshold:
            print(f"🔁 Using cached response (similarity: {sim:.2f})")
            return cached_resp, 0.0  # no API time
    
    # Otherwise, call GPT
    start = time.time()
    response = client.responses.create(
        model="gpt-4.1",
        input=query
    )
    end = time.time()
    text = response.output[0].content[0].text
    
    # Store in cache
    semantic_cache.append((query, query_embedding, text))
    return text, end - start

queries = [
    "Explain semantic caching in simple terms.",
    "What is semantic caching and how does it work?",
    "How does caching work in LLMs?",
    "Tell me about semantic caching for LLMs.",
    "Explain semantic caching simply.",
]

total_time = 0
for q in queries:
    resp, t = ask_gpt_with_cache(q)
    total_time += t
    print(f"⏱️ Query took {t:.2f} secondsn")

print(f"nTotal time with caching: {total_time:.2f} seconds")

In the output, the first query takes about 8 seconds because nothing is cached yet and the model has to generate a new response. When a similar question is asked next, the system recognizes a high semantic similarity (0.86) and immediately reuses the cached answer, saving time. Queries such as “How does caching work in LLMs?” and “Tell me about semantic caching for LLMs.” fall below the similarity threshold, so the model generates new responses, each of which takes more than 10 seconds. The final query is almost identical to the first (similarity 0.97) and is served instantly from the cache.
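To see why those particular queries miss the cache, it helps to inspect the pairwise similarities directly. The short diagnostic below reuses the get_embedding and cosine_similarity helpers defined above together with the assumed 0.85 threshold; the exact scores will vary from run to run:

# Compare every pair of example queries against the 0.85 threshold
embeddings = {q: get_embedding(q) for q in queries}

for i, q1 in enumerate(queries):
    for q2 in queries[i + 1:]:
        sim = cosine_similarity(embeddings[q1], embeddings[q2])
        label = "above threshold" if sim > 0.85 else "below threshold"
        print(f"{sim:.2f} ({label}): {q1!r} vs {q2!r}")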

