Stanford University researchers uncover prompt caching risks in AI APIs: timing side channels reveal security flaws and data vulnerabilities

The computational demands of LLMs pose serious challenges, especially for real-time applications where fast response times are critical. Reprocessing every query from scratch is slow and resource-intensive, so AI service providers use caching systems that store previously processed prompts, allowing repeated queries to be answered immediately and reducing latency. However, this speedup introduces security risks. Researchers at Stanford University studied how the caching behavior of LLM APIs can unintentionally reveal confidential information. They found that the caching strategies used by commercial AI services can leak both user query content and model information through timing-based side-channel attacks.
One of the key risks of prompt caching is that it can reveal information about previous users' queries. If cached prompts are shared among multiple users, an attacker can determine whether someone else recently submitted a similar prompt simply by observing differences in response time. The risk is greatest with global caching, where one user's prompt makes responses faster for any other user who submits a related query. By analyzing these variations in response time, the researchers demonstrated how the vulnerability could allow an attacker to uncover confidential business data, personal information, and proprietary queries.
Caching policies differ across AI service providers and are not always transparent to users. Some providers restrict caching to individual users, so cached prompts are available only to the person who submitted them and no data is shared across accounts. Others cache per organization, so several users within the same company or organization can share cached prompts; this is more efficient but can still leak sensitive information between colleagues. The most serious risk comes from global caching, where cached prompts are accessible to every user of the API service. In that setting, an attacker can exploit response-time inconsistencies to determine which prompts were recently submitted. The researchers found that most AI providers are not transparent about their caching policies, leaving users unaware of the security risks their queries may face.
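To make the difference between these sharing levels concrete, the snippet below is a minimal, hypothetical sketch of how a provider might scope its cache keys per user, per organization, or globally; the identifiers and hashing scheme are illustrative assumptions, not details from the paper.

```python
import hashlib

def cache_key(prompt: str, scope: str, user_id: str = "", org_id: str = "") -> str:
    """Illustrative cache-key construction for different sharing scopes.

    - "user":   only the same user can hit the cached entry
    - "org":    any user in the same organization can hit it
    - "global": any API user can hit it (the riskiest policy)
    """
    if scope == "user":
        namespace = f"user:{user_id}"
    elif scope == "org":
        namespace = f"org:{org_id}"
    elif scope == "global":
        namespace = "global"
    else:
        raise ValueError(f"unknown scope: {scope}")
    return hashlib.sha256(f"{namespace}|{prompt}".encode()).hexdigest()

# Under global scoping, identical prompts from different users collide on the
# same key, which is exactly what makes cross-user timing inference possible.
print(cache_key("quarterly revenue forecast for Acme", "global"))
print(cache_key("quarterly revenue forecast for Acme", "user", user_id="alice"))
print(cache_key("quarterly revenue forecast for Acme", "user", user_id="bob"))
```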
To investigate these issues, the Stanford research team developed an auditing framework that can detect prompt caching at different levels of sharing. Their approach involves sending controlled sequences of prompts to various AI APIs and measuring variations in response time. If a prompt is cached, resubmitting it produces a noticeably faster response. The team devised statistical hypothesis tests to confirm whether caching occurred and whether cache sharing extended beyond individual users. By systematically varying prompt length, prefix similarity, and repetition frequency, the researchers identified patterns indicative of caching. The audit covered 17 commercial AI APIs, including those offered by OpenAI, Anthropic, DeepSeek, and Fireworks AI, among others, and focused on detecting whether caching was implemented and whether it was restricted to a single user or shared across a wider group.
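A stripped-down version of such a measurement loop might look like the following; the endpoint, model name, prompt lengths, and request counts are placeholders rather than the paper's actual audit harness.

```python
import random
import statistics
import string
import time

import requests  # pip install requests

API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical OpenAI-style endpoint
API_KEY = "sk-..."  # placeholder credential

def random_prompt(n_words: int = 200) -> str:
    """Generate a fresh random prompt that is almost certainly not in any cache."""
    return " ".join("".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(n_words))

def timed_request(prompt: str) -> float:
    """Send one prompt and return the wall-clock latency in seconds."""
    start = time.monotonic()
    requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "example-model",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1,  # keep generation short so timing mostly reflects prompt processing
        },
        timeout=30,
    )
    return time.monotonic() - start

def collect_samples(n_trials: int = 25):
    """For each trial, time a never-seen prompt (miss) and its immediate resubmission (potential hit)."""
    misses, hits = [], []
    for _ in range(n_trials):
        prompt = random_prompt()
        misses.append(timed_request(prompt))  # first submission: cache miss
        hits.append(timed_request(prompt))    # resubmission: cache hit if prompt caching is enabled
    return misses, hits

if __name__ == "__main__":
    misses, hits = collect_samples()
    print(f"median miss latency: {statistics.median(misses):.3f}s")
    print(f"median hit  latency: {statistics.median(hits):.3f}s")
```

To probe sharing beyond a single user, the same pair of requests would be issued from two different accounts: one account submits the prompt first, and the other checks whether its own first submission already behaves like a cache hit.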
The audit procedure consists of two main tests: one measuring response times for cache hits and one for cache misses. In the cache-hit test, the same prompt is submitted multiple times to check whether responses speed up after the first request. In the cache-miss test, freshly generated random prompts establish a baseline for uncached response times. Statistical analysis of these response times provided clear evidence of caching in several APIs: the researchers identified caching behavior in 8 of the 17 API providers and, more importantly, found that 7 of them shared caches globally, meaning any user could infer other users' usage patterns from response speeds. The findings also revealed a previously unknown architectural detail about OpenAI's text-embedding-3-small model: its caching behavior indicates a decoder-only transformer architecture, information that had not been publicly disclosed.
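Whether the hit and miss latency samples genuinely differ can then be decided with a one-sided hypothesis test. The sketch below applies SciPy's Mann-Whitney U test to the `misses` and `hits` lists collected by the loop above; the paper's exact test statistic and significance threshold may differ, and the `alpha` value here is only an illustrative, deliberately strict choice.

```python
from scipy.stats import mannwhitneyu  # pip install scipy

def detect_caching(misses, hits, alpha=1e-8):
    """Test whether resubmitted prompts ("hits") are answered faster than fresh prompts ("misses").

    Null hypothesis: hit latencies are not stochastically smaller than miss latencies.
    A very small p-value is strong evidence that prompt caching is in effect.
    """
    _, p_value = mannwhitneyu(hits, misses, alternative="less")
    return p_value, p_value < alpha

# Using the samples gathered by the audit loop:
# p, cached = detect_caching(misses, hits)
# print(f"p-value = {p:.3g}; caching detected: {cached}")
```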
Performance measurements of cached versus uncached prompts highlight a striking difference in response times. In OpenAI's text-embedding-3-small API, for example, cache hits averaged about 0.1 seconds, while cache misses produced latencies of up to 0.5 seconds. The researchers determined that these cache-sharing vulnerabilities could let attackers distinguish cached from uncached prompts with near-perfect accuracy. Their statistical tests yielded highly significant p-values, far below conventional significance thresholds, indicating strong evidence of caching behavior. They also found that in many cases a single repeated request was enough to trigger a cache hit, whereas OpenAI and Azure required around 25 consecutive requests before caching became apparent. These findings suggest that some API providers use distributed caching systems in which prompts are not immediately stored on all servers but are cached only after repeated use.
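The observation that some providers only show fast responses after many identical submissions can be probed with a simple loop that reuses the hypothetical `random_prompt` and `timed_request` helpers from the earlier sketch; the latency threshold below is an arbitrary assumption, and the 25-request figure is the paper's empirical finding, not something this code encodes.

```python
def find_cache_onset(max_attempts: int = 40, fast_threshold: float = 0.2):
    """Repeatedly submit the same prompt and report the first attempt whose latency
    falls below `fast_threshold` seconds (a crude indicator of a cache hit).

    Returns the 1-based attempt index, or None if no fast response was observed.
    With a distributed cache, a prompt may need several submissions before the
    request is routed to a server that has already stored it.
    """
    prompt = random_prompt()
    for attempt in range(1, max_attempts + 1):
        if timed_request(prompt) < fast_threshold:
            return attempt
    return None

# onset = find_cache_onset()
# print(f"first apparent cache hit on attempt: {onset}")
```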
Key takeaways from the research include the following:
- Prompt caching speeds up responses by storing previously processed queries, but it can expose sensitive information when the cache is shared among multiple users.
- Global cache sharing was detected in 7 of 17 API providers, allowing attackers to infer other users' prompts from timing variations.
- Some API providers do not disclose caching policies publicly, so users may not know that their inputs are being stored or that others can probe them through timing.
- The study measured a clear response-time gap, with cache hits averaging about 0.1 seconds and cache misses around 0.5 seconds, providing measurable evidence of caching.
- The statistical audit framework detected caching with high confidence, producing p-values far below conventional significance thresholds and confirming systematic caching across multiple providers.
- OpenAI's text-embedding-3-small model was revealed to be a decoder-only transformer, a previously undisclosed detail inferred from its caching behavior.
- Some API providers patched the vulnerability after disclosure, but others have not addressed the issue, indicating the need for stricter industry standards.
- Mitigation strategies include restricting caches to individual users, adding random delays to responses to prevent timing inference, and providing greater transparency about caching policies (a minimal sketch of the first two appears after this list).
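As a rough illustration of the first two mitigations above, the sketch below scopes cache keys to individual users and pads cache hits with random latency so that hit/miss timing is harder to distinguish; the key scheme and delay distribution are illustrative assumptions, not recommendations from the paper.

```python
import hashlib
import random
import time

_cache: dict[str, str] = {}  # in-memory stand-in for a provider-side prompt cache

def _cache_key(user_id: str, prompt: str) -> str:
    # Mitigation 1: scope cache keys per user so entries are never shared across accounts.
    return hashlib.sha256(f"{user_id}|{prompt}".encode()).hexdigest()

def serve(user_id: str, prompt: str, run_model) -> str:
    """Serve a prompt with per-user caching and randomized cache-hit latency."""
    key = _cache_key(user_id, prompt)
    if key in _cache:
        # Mitigation 2: add random latency to cache hits so response time no longer
        # cleanly separates hits from misses.
        time.sleep(random.uniform(0.05, 0.3))
        return _cache[key]
    response = run_model(prompt)  # expensive model call on a cache miss
    _cache[key] = response
    return response

# Example usage with a dummy model:
# print(serve("alice", "hello", run_model=lambda p: p.upper()))
```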
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 80k+ ML SubReddit.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.