Enterprise retrieval-augmented generation (RAG) pipelines face a growing tension: they must ingest new documents continuously while serving low-latency queries. Traditional vector database (VecDB) indices require frequent global rebuilds or sacrifice search quality. A new indexing framework, Semantic Pyramid Indexing (SPI), aims to resolve this by adapting retrieval depth to each query, according to a paper from Liu, Dong, Yu, and Yanxuan published on arXiv.
The Challenge of Streaming Retrieval-Augmented Generation
In streaming RAG workflows, document ingestion and query processing happen concurrently. Existing VecDB pipelines often operate with a uniform representation regime, ignoring the variation in semantic granularity required across different queries. This mismatch leads to either excessive latency for simple queries or insufficient recall for complex ones. SPI addresses this by organizing embeddings into semantically aligned resolution levels and selecting retrieval depth per query via a lightweight uncertainty-aware controller.
Introducing Semantic Pyramid Indexing (SPI)
SPI is a VecDB-layer indexing framework that structures embeddings into $L$ semantically aligned resolution levels. At query time, a controller determines how deep to search, enabling a progressive coarse-to-fine approximate nearest neighbor (ANN) search. The framework supports level-wise streaming insertion without global rebuilds, and its distributed execution uses LSH partitioning with asynchronous gRPC coordination. SPI is designed to be compatible with existing backends such as FAISS and Qdrant, according to the authors.
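The coarse-to-fine idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary above does not say how SPI builds its levels or how its controller estimates uncertainty. Two stand-ins are therefore assumed here. Each level is taken to be a re-normalized prefix of the full embedding, and the gap between the $K$-th and $(K+1)$-th scores is used as a simple confidence signal. The names `build_pyramid`, `coarse_to_fine_search`, `margin_threshold`, and `shortlist` are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_pyramid(vectors, level_dims):
    """ASSUMPTION: level i keeps the first level_dims[i] dimensions, re-normalized."""
    return [[normalize(v[:d]) for v in vectors] for d in level_dims]


def coarse_to_fine_search(query, pyramid, level_dims, k, margin_threshold, shortlist=50):
    """Score coarse levels first and stop once the top-k margin is large enough.

    Returns (top_k_ids, level_used). The margin rule is a hypothetical stand-in
    for SPI's uncertainty-aware controller.
    """
    if not level_dims:
        raise ValueError("at least one level is required")
    candidates = list(range(len(pyramid[0])))
    last = len(level_dims) - 1
    for level, dim in enumerate(level_dims):
        q = normalize(query[:dim])
        scored = sorted(((dot(q, pyramid[level][i]), i) for i in candidates), reverse=True)
        ranked = [i for _, i in scored]
        # Gap between the k-th and (k+1)-th scores at this level.
        margin = scored[k - 1][0] - scored[k][0] if len(scored) > k else float("inf")
        if level == last or margin >= margin_threshold:
            return ranked[:k], level
        # Not confident yet: keep a shortlist and re-score it at the next level.
        candidates = ranked[:max(shortlist, k + 1)]


# Toy data: 40 random 16-dimensional vectors and three levels.
random.seed(0)
vectors = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(40)]
level_dims = [4, 8, 16]
pyramid = build_pyramid(vectors, level_dims)
query = vectors[7]

# A permissive threshold stops at the coarsest level; a strict one goes to the finest.
shallow_ids, shallow_level = coarse_to_fine_search(
    query, pyramid, level_dims, k=3, margin_threshold=0.0)
deep_ids, deep_level = coarse_to_fine_search(
    query, pyramid, level_dims, k=3, margin_threshold=10.0, shortlist=40)
```

The brute-force scoring inside each level stands in for whatever ANN backend (for example FAISS or Qdrant) would serve that level in practice. The point of the sketch is only the control flow, in which easy queries exit early and hard ones pay for deeper levels.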
Key features of SPI include:
- Adaptive query-depth selection based on query complexity
- Incremental updates without frequent global rebuilding
- A top-$K$ stability guarantee: queries with sufficient retrieval margin return an identical top-$K$ set at a shallower level
- Distributed scaling via LSH partitions and gRPC
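The LSH-based partitioning in the last bullet can be sketched as follows. The summary does not say which LSH family SPI uses, so random-hyperplane hashing (SimHash) is assumed here as a common choice for cosine similarity. The gRPC coordination layer is omitted entirely, and `PartitionedIndex` and its methods are hypothetical names for illustration.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes; all nodes must share the same seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Pack the sign of each hyperplane projection into an integer signature."""
    sig = 0
    for plane in hyperplanes:
        bit = 1 if sum(x * w for x, w in zip(vector, plane)) >= 0 else 0
        sig = (sig << 1) | bit
    return sig


def partition_for(vector, hyperplanes, num_partitions):
    """Map a vector to a partition id via its LSH signature."""
    return lsh_signature(vector, hyperplanes) % num_partitions


class PartitionedIndex:
    """Toy stand-in for a sharded index: one in-memory list per partition."""

    def __init__(self, dim, num_partitions, num_bits=8, seed=0):
        self.hyperplanes = make_hyperplanes(dim, num_bits, seed)
        self.partitions = [[] for _ in range(num_partitions)]

    def insert(self, doc_id, vector):
        """Streaming insert: append to one partition, no global rebuild."""
        p = partition_for(vector, self.hyperplanes, len(self.partitions))
        self.partitions[p].append((doc_id, vector))
        return p

    def route(self, query):
        """Return the partition a query would be sent to first."""
        return partition_for(query, self.hyperplanes, len(self.partitions))


# Toy usage: insert 20 random 8-dimensional vectors across 4 partitions.
rng = random.Random(1)
index = PartitionedIndex(dim=8, num_partitions=4)
docs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(20)]
assigned = [index.insert(i, v) for i, v in enumerate(docs)]
```

Because the signature depends only on the direction of a vector, a query and a document pointing the same way land in the same partition, which is what lets a coordinator send most queries to a small subset of nodes.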
Performance Benchmarks and Scaling Results
The authors evaluated SPI on the MS MARCO and Natural Questions datasets using the same dense encoder family. SPI achieved competitive Recall@10 with lower latency: at fixed Recall@10 targets, it reduced average retrieval latency by 1.4–2.3× compared with comparable ANN baselines.
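To make the "latency reduction at a fixed Recall@10 target" comparison concrete, here is a small sketch of how such a number can be computed. It assumes the common information-retrieval definition of Recall@K (the share of relevant documents found in the top $K$ results), which the paper may or may not use. The function names and the toy operating points are made up for illustration and are not results from the paper.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the first k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def fastest_latency_at_target(operating_points, target_recall):
    """Lowest latency among (recall, latency_ms) points that meet the target."""
    feasible = [latency for recall, latency in operating_points if recall >= target_recall]
    return min(feasible) if feasible else None


def latency_reduction(baseline_points, candidate_points, target_recall):
    """Ratio baseline/candidate latency at the same recall target, or None."""
    base = fastest_latency_at_target(baseline_points, target_recall)
    cand = fastest_latency_at_target(candidate_points, target_recall)
    if base is None or cand is None:
        return None
    return base / cand


# Made-up numbers purely to exercise the functions (NOT from the paper).
toy_recall = recall_at_k(["d3", "d1", "d9"], ["d1", "d2"], k=10)
toy_baseline = [(0.80, 4.0), (0.90, 9.0)]   # (recall, latency in ms)
toy_candidate = [(0.85, 2.0), (0.90, 3.0)]
toy_ratio = latency_reduction(toy_baseline, toy_candidate, target_recall=0.90)
```

Comparing systems at the same recall target matters because any ANN index can be made faster by accepting lower recall, so a latency figure without a recall constraint says little.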
| Metric | Value |
|---|---|
| Latency reduction at fixed Recall@10 | 1.4–2.3× |
| Throughput scaling on 8 nodes | 6.2× (~73% efficiency) |
| 16-node configuration | Included for completeness; diminishing efficiency |
In a prototype scaling study of up to 8 nodes, SPI reached 6.2× throughput scaling, which the authors report as approximately 73% efficiency. A 16-node configuration was also tested but showed diminishing returns, according to the paper. The authors also provide a top-$K$ stability guarantee: for queries with sufficient retrieval margin, a shallower level returns the same top-$K$ set, so stopping early does not change the results for those queries.
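One way to see why a margin condition can guarantee a stable top-$K$ set is the following sketch. It rests on an assumption that does not come from the paper: every score at the shallow level is taken to differ from its deeper-level score by at most `eps`. Under that assumption, if the gap between the $K$-th and $(K+1)$-th shallow scores exceeds `2 * eps`, no item outside the shallow top $K$ can overtake an item inside it, so both levels return the same set. The paper's exact condition may differ, and all names below are hypothetical.

```python
import random


def top_k_set(scores, k):
    """Ids of the k highest-scoring items in a {doc_id: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def retrieval_margin(scores, k):
    """Gap between the k-th and (k+1)-th highest scores (k >= 1)."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[k - 1] - ranked[k] if len(ranked) > k else float("inf")


def is_top_k_stable(shallow_scores, k, eps):
    """Sufficient condition under the ASSUMED per-item error bound eps."""
    return retrieval_margin(shallow_scores, k) > 2 * eps


# Toy shallow-level scores; deeper-level scores are perturbed by at most eps.
shallow = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
eps = 0.1
rng = random.Random(0)
deep = {doc: s + rng.uniform(-eps, eps) for doc, s in shallow.items()}

stable_for_k2 = is_top_k_stable(shallow, k=2, eps=eps)  # margin 0.3 vs 2*eps = 0.2
stable_for_k1 = is_top_k_stable(shallow, k=1, eps=eps)  # margin 0.1 vs 2*eps = 0.2
```

The check is sufficient but not necessary: a query that fails it may still return the same top-$K$ set at both levels, but the controller would have to search deeper to be sure.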
Implications for Enterprise Vector Databases
For enterprise architects evaluating VecDB solutions for RAG pipelines, SPI offers a potential way to balance latency and recall in streaming scenarios. Its compatibility with widely used backends such as FAISS and Qdrant may reduce integration friction, and the code and configurations linked in the paper allow direct benchmarking against existing deployments. Although this is academic research, the reported gains, especially the 1.4–2.3× latency reduction at fixed Recall@10, are directly relevant to production systems where query response time affects user experience.