[MOCK] Semantic Caching: Accelerating beyond basic RAG with up to 65x latency reduction.
By caching responses at the semantic level (not just exact keywords), organizations can dramatically speed up responses and reduce resource consumption.

Alex Buicescu
AI Product Engineer @Brain Co.
Sep 8, 2025

Modern enterprises increasingly deploy RAG (Retrieval-Augmented Generation) chatbots to answer questions using large language models (LLMs) and internal knowledge bases. However, these systems can become expensive and slow as usage grows. Often, many users end up asking similar questions, which means the backend is redundantly performing the same heavy operations (document retrieval and LLM inference) over and over. High query latency hurts user experience, and repeated LLM calls drive up costs. This is where semantic caching comes in.
Semantic caching is an optimization technique that recognizes when new queries have essentially the same meaning as past queries, and reuses the cached answer instead of invoking the full pipeline again. By caching responses at the semantic level (not just exact keywords), organizations can dramatically speed up responses and reduce resource consumption. At Brain Co., we leverage this technique across sales agents and customer support agents for a variety of industries.
In this post, we’ll explore what semantic caching is, how it works in a RAG architecture, the benefits it brings (from faster answers to lower GPU and API usage), considerations for cache freshness, and real-world use cases.
What is Semantic Caching?
Traditional caches usually use exact keys: if a user asks exactly the same question again, then a cached answer can be returned. But human language rarely repeats verbatim; people can phrase the same intent in many ways. Semantic caching bridges this gap by understanding query intent. As one definition puts it, “The term ‘semantic’ implies that the cache takes into account the meaning or semantics of the data or computation being cached, rather than just its syntactic representation.” In practice, this means similar queries (even with different wording) can retrieve the same cached response.
For example, consider a knowledge-base Q&A bot, illustrated below. One user asks, “What is the capital of France?” and later another asks, “Can you tell me the capital of France?” These queries use different words but have identical intent. A naive cache keyed by exact text would treat them as distinct and would not reuse the answer. A semantic cache, however, recognizes they are semantically equivalent questions and returns the same cached answer (in this case, “Paris”) without repeating the lookup and generation process. This capability relies on embedding each query into a vector representation of its meaning and comparing vectors for similarity rather than comparing text for exact matches. Essentially, the cache is keyed by semantic embeddings of queries. If a new query’s embedding is close enough to that of a previously answered query (that is, their similarity clears a chosen threshold), the system treats the questions as equivalent and uses the cached result.
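To make the similarity comparison concrete, here is a minimal sketch in Python. The 4-dimensional vectors are hand-written placeholders, not output from a real embedding model, and the 0.9 cutoff is an assumed value that would need tuning on real traffic.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for illustration only.
# A real embedding model produces hundreds or thousands of dimensions.
EMBEDDINGS = {
    "What is the capital of France?":         [0.90, 0.10, 0.05, 0.40],
    "Can you tell me the capital of France?": [0.88, 0.15, 0.05, 0.42],
    "How do I reset my password?":            [0.05, 0.90, 0.40, 0.10],
}

SIMILARITY_THRESHOLD = 0.9  # assumed cutoff, not a recommended value

def is_semantic_match(query_a, query_b):
    """True when the two queries' embeddings clear the similarity threshold."""
    score = cosine_similarity(EMBEDDINGS[query_a], EMBEDDINGS[query_b])
    return score >= SIMILARITY_THRESHOLD

print(is_semantic_match("What is the capital of France?",
                        "Can you tell me the capital of France?"))
print(is_semantic_match("What is the capital of France?",
                        "How do I reset my password?"))
```

With these placeholder vectors, the two France questions point in nearly the same direction and count as a match, while the password question does not.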

By operating on meaning, semantic caching dramatically improves cache hit rates for natural language queries. It’s essentially an advanced key-value store where the “key” is a vector representing the query’s meaning, and the “value” is a previously computed answer (along with any supporting context). Next, we’ll look at how this works within a RAG chatbot’s architecture.
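The key-value framing can be sketched as a small in-memory structure. This is an illustration under assumptions rather than a production design: the `SemanticCache` and `CacheEntry` names are hypothetical, the vectors are hand-written, and the linear scan stands in for the indexed search a vector database would provide.

```python
import math
from dataclasses import dataclass, field

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class CacheEntry:
    embedding: list                              # the "key": vector for the original query
    answer: str                                  # the "value": previously computed answer
    context: list = field(default_factory=list)  # optional supporting context

class SemanticCache:
    """Hypothetical in-memory cache keyed by query embeddings."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cutoff, tune per application
        self.entries = []

    def put(self, embedding, answer, context=None):
        self.entries.append(CacheEntry(embedding, answer, context or []))

    def get(self, embedding):
        """Return the closest entry if it clears the threshold, else None."""
        best, best_score = None, -1.0
        for entry in self.entries:  # linear scan; a vector DB would use an index
            score = cosine_similarity(embedding, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        return best if best_score >= self.threshold else None

# Usage with hand-written placeholder vectors
cache = SemanticCache()
cache.put([0.90, 0.10, 0.05, 0.40], "Paris", ["France's capital is Paris."])
hit = cache.get([0.88, 0.15, 0.05, 0.42])   # close to the stored key
miss = cache.get([0.05, 0.90, 0.40, 0.10])  # unrelated direction
```

Unlike a dictionary lookup, `get` never requires the key to be identical: it returns the nearest stored entry, and only when that entry is similar enough.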
How Semantic Caching Works in a RAG Pipeline
In a retrieval-augmented generation workflow, a user’s question typically goes through these steps: (1) embed the query and retrieve relevant documents from a knowledge base, (2) feed the query + documents to an LLM to generate an answer, and (3) return the answer to the user. Semantic caching introduces an earlier step: check if a semantically similar query was already answered recently. If yes, we can skip directly to returning the cached answer, bypassing retrieval and LLM calls. If not, we proceed normally and then store this new Q&A pair in the cache for future reuse.

High-level architecture of a RAG chatbot with a semantic cache layer.
The user’s query is first embedded and checked against a vector database of past queries (semantic cache). If a similar query hit is found (green path), the cached answer is returned immediately to the user. This avoids running the expensive retrieval and LLM generation steps again. If there’s a cache miss (red path), the system performs the standard RAG flow – retrieving context from the knowledge base and calling the LLM – to produce a new answer, which is then stored in the cache for future queries. Over time, the cache “learns” common queries and the system gets faster and more cost-efficient.
To implement this, the application maintains a vector index of previously asked questions’ embeddings (often in a vector database like pgvector, Qdrant, Pinecone, etc.) along with their answers. When a new query comes in, the system generates its embedding (using the same embedding model as used for retrieval) and performs a similarity search in the cache index. If it finds a past question vector with a similarity score above a chosen threshold, the system assumes the new query’s intent matches that past query. It then simply returns the cached answer (a direct lookup) instead of executing a new retrieval+LLM pipeline. This check is typically very fast (vector search is efficient even for thousands of stored queries) and saves significant time compared to calling the LLM.
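from sqlalchemy import text

SIMILARITY_THRESHOLD = 0.9  # minimum cosine similarity for a cache hit

def on_user_message(query):
    query_embedding = embed(query)
    # pgvector's <=> operator returns cosine distance, so similarity = 1 - distance.
    # Compare against the stored embedding of the cached *question* and keep
    # only the closest match.
    cached = session.execute(text("""
        SELECT response
        FROM cache
        WHERE (1 - (query_embedding <=> :query_embedding)) >= :threshold
        ORDER BY query_embedding <=> :query_embedding
        LIMIT 1
    """), {"query_embedding": query_embedding,
           "threshold": SIMILARITY_THRESHOLD}).first()
    if cached:
        # Cache hit: return the stored answer text, not the whole row.
        return cached.response
    # Cache miss: run the full RAG pipeline, then store the result for reuse.
    relevant_chunks = get_relevant_chunks(query, query_embedding)
    llm_answer = generate_llm_answer(query, relevant_chunks)
    cache_answer(query, query_embedding, llm_answer)
    return llm_answer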
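To see the hit and miss paths without a database or a model, the same flow can be run against stand-ins. Everything here is hypothetical scaffolding: `embed` looks up hand-written vectors, `run_rag_pipeline` replaces retrieval plus LLM generation and just counts its calls, and a plain list plays the role of the cache table.

```python
import math

# Hand-written placeholder vectors standing in for a real embedding model.
TOY_EMBEDDINGS = {
    "What is the capital of France?":         [0.90, 0.10, 0.05, 0.40],
    "Can you tell me the capital of France?": [0.88, 0.15, 0.05, 0.42],
    "How do I reset my password?":            [0.05, 0.90, 0.40, 0.10],
}

def embed(query):
    return TOY_EMBEDDINGS[query]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

pipeline_calls = 0  # how many times the expensive path actually ran

def run_rag_pipeline(query):
    """Stand-in for document retrieval plus LLM generation."""
    global pipeline_calls
    pipeline_calls += 1
    return f"answer to: {query}"

cache = []  # list of (query_embedding, answer) pairs

def on_user_message(query, threshold=0.9):
    query_embedding = embed(query)
    # Find the most similar previously answered query, if any.
    best_answer, best_score = None, -1.0
    for cached_embedding, cached_answer in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_answer, best_score = cached_answer, score
    if best_score >= threshold:
        return best_answer  # cache hit: skip retrieval and generation
    answer = run_rag_pipeline(query)  # cache miss: do the full work
    cache.append((query_embedding, answer))
    return answer

first = on_user_message("What is the capital of France?")           # miss
second = on_user_message("Can you tell me the capital of France?")  # hit
third = on_user_message("How do I reset my password?")              # miss
print(pipeline_calls)
```

Three questions arrive, but the expensive path runs only twice, because the paraphrased France question is served from the cache.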


