Meet LEANN: The Tiniest Vector Database, Democratizing Personal AI with a Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index
Embedding-based search outperforms traditional keyword-based approaches by capturing semantic similarity through dense vector representations and approximate nearest neighbor (ANN) search. However, ANN data structures introduce substantial storage overhead, typically 1.5 to 7 times the size of the original raw data. This overhead is manageable in large web applications but impractical for personal devices or large datasets. Reducing storage to under 5% of the original data size is critical for edge deployment, yet existing solutions fall short. Techniques such as product quantization (PQ) can reduce storage, but they either degrade accuracy or require increased search latency.
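To make the overhead concrete, here is a minimal sketch (not LEANN's implementation) of dense-vector semantic search with exact cosine similarity, the baseline that ANN indexes approximate. The embedding dimension, chunk size, and random vectors are illustrative assumptions; real systems use a learned encoder.

```python
import numpy as np

# Toy corpus: each document chunk is reduced to a dense embedding.
# Random vectors stand in for a real encoder's output here.
rng = np.random.default_rng(0)
num_docs, dim = 1000, 768
embeddings = rng.standard_normal((num_docs, dim)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query_vec, k=3):
    """Exact nearest-neighbor search by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    return np.argsort(-scores)[:k]

# Storage overhead: a 768-dim float32 vector costs 3 KB per chunk,
# often exceeding the size of the raw text chunk it represents.
bytes_per_vector = dim * 4
avg_chunk_bytes = 1024  # assumed ~1 KB of raw text per chunk
overhead_ratio = bytes_per_vector / avg_chunk_bytes
print(f"embedding storage is {overhead_ratio:.1f}x the raw text")  # 3.0x
```

Storing every such vector (plus index metadata) is exactly the cost that LEANN is designed to avoid.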
Vector search methods rely on inverted file (IVF) indexes and proximity graphs. Graph-based methods such as HNSW, NSG, and Vamana are considered state-of-the-art for their balance of accuracy and efficiency. Efforts to reduce graph size, such as learned neighbor selection, face limitations due to high training costs and dependence on labeled data. For resource-constrained environments, DiskANN and Starling store data on disk, while FusionANNS optimizes hardware usage. Methods such as AiSAQ and EdgeRAG attempt to minimize memory usage but still suffer from high storage overhead or performance degradation at scale. Embedding compression techniques such as PQ and RaBitQ provide quantization with theoretical error bounds, but struggle to maintain accuracy under tight storage budgets.
Researchers at UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-constrained personal devices. It integrates a compact graph-based structure with an on-the-fly recomputation strategy, enabling fast and accurate retrieval while minimizing storage overhead. By reducing the index size to below 5% of the original raw data, LEANN achieves a storage footprint up to 50 times smaller than standard indexes. It maintains 90% top-3 recall in under 2 seconds on real-world question-answering benchmarks. To reduce latency, LEANN uses a two-level traversal algorithm and dynamic batching that combines embedding computations across search hops, improving GPU utilization.
LEANN's architecture combines its core approach of graph-based recomputation with supporting techniques and an end-to-end system workflow. It is built on the HNSW framework and exploits the observation that each query needs embeddings for only a small subset of nodes, motivating on-demand computation rather than pre-storing all embeddings. To address the resulting challenges, LEANN introduces two techniques: (a) a two-level graph traversal with dynamic batching to reduce recomputation latency, and (b) a high-degree-preserving graph pruning method to reduce metadata storage. In the system workflow, LEANN first computes embeddings for all dataset items and then constructs the vector index using a graph-based indexing method.
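The core idea described above can be sketched as follows. This is a simplified illustration, not LEANN's code: a pruned neighbor graph is kept, raw items are retained, and embeddings are recomputed in per-hop batches during a best-first traversal. The `embed` function, graph degree, and random data are all stand-in assumptions.

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
dim, num_nodes = 64, 200
# Raw items are kept on disk; full embeddings are NOT stored.
raw_items = rng.standard_normal((num_nodes, dim)).astype(np.float32)
# Pruned neighbor graph: only small per-node adjacency lists are stored.
graph = {i: list(rng.choice(num_nodes, size=8, replace=False))
         for i in range(num_nodes)}

def embed(ids):
    """Stand-in for an on-demand, batched embedding-model call over raw items."""
    return raw_items[ids]

def recompute_search(query, entry=0, k=3, hops=10):
    """Best-first graph search; embeddings computed only for visited nodes."""
    dist = lambda vecs: np.linalg.norm(vecs - query, axis=1)
    visited = {entry}
    candidates = [(float(dist(embed(np.array([entry])))[0]), entry)]
    best = list(candidates)
    for _ in range(hops):
        if not candidates:
            break
        _, node = heapq.heappop(candidates)
        # Dynamic batching: embed all unvisited neighbors of this hop in one call.
        frontier = np.array([n for n in graph[node] if n not in visited], dtype=int)
        if frontier.size == 0:
            continue
        visited.update(frontier.tolist())
        for d, n in zip(dist(embed(frontier)), frontier):
            heapq.heappush(candidates, (float(d), int(n)))
            heapq.heappush(best, (float(d), int(n)))
    return [n for _, n in heapq.nsmallest(k, best)]
```

Because only visited nodes are ever embedded, storage holds just the graph and raw data; batching the frontier per hop is what keeps the recomputation cost amortized on a GPU.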
In terms of storage and latency, LEANN outperforms the IVF-based recomputation method EdgeRAG, achieving latency reductions of 21.17× to 200.60× across various datasets and hardware platforms. This advantage stems from LEANN's polylogarithmic recomputation complexity, which scales more effectively than EdgeRAG's square-root growth. On the accuracy of downstream RAG tasks, LEANN achieved higher performance on most datasets, except GPQA, where a distribution mismatch limits its effectiveness. Similarly, on HotpotQA, the single-hop retrieval setting limits accuracy gains because the dataset requires multi-hop reasoning. Despite these limitations, LEANN shows strong performance across diverse benchmarks.
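The complexity claim can be illustrated numerically: polylogarithmic growth (here assumed to be on the order of log²N, purely for illustration) falls ever further behind square-root growth as the dataset grows. These are toy cost models, not measured latencies.

```python
import math

def polylog_cost(n, c=1.0):
    # Polylogarithmic growth, e.g. O(log^2 n) -- illustrative model.
    return c * math.log2(n) ** 2

def sqrt_cost(n, c=1.0):
    # Square-root growth, as attributed to IVF-style recomputation.
    return c * math.sqrt(n)

# The gap between the two widens as the dataset size grows.
for n in (10**4, 10**6, 10**8):
    print(n, round(sqrt_cost(n) / polylog_cost(n), 1))
```

At small scales the two are comparable, but the ratio grows without bound, which is why the advantage of a polylogarithmic recomputation count compounds on large corpora.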
In this article, the researchers introduce LEANN, a storage-efficient neural retrieval system that combines graph-based recomputation with novel optimizations. By integrating a two-level search algorithm and dynamic batching, it eliminates the need to store full embeddings, achieving a substantial reduction in storage overhead while maintaining high accuracy. Despite its advantages, LEANN still faces limitations, such as high peak storage usage during index construction, which could be addressed through clustering or other techniques. Future work may focus on reducing latency and enhancing responsiveness, paving the way for broader adoption in resource-constrained environments.
Check out the Paper and GitHub page.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technology and its real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.