[Proposal] Vector Similarity Search Indexing #2287
Beihao-Zhou
started this conversation in
General
-
Thank you for the excellent preliminary research and suggestions, which give us a direction for implementing vector search. I agree that we can start with HNSW (including designing an efficient encoding that maps the HNSW index onto RocksDB key-values) and introduce other indexes in a later phase as the need arises. If anyone in the community is interested, you are welcome to join the discussion! cc @git-hulk @mapleFU @Yangsx-1
-
Kvrocks Vector Similarity Search Indexing Proposal
Background
Redis Vector Search[1] enables real-time indexing, updating, and querying of vectors using two methods: FLAT, which performs brute-force indexing, and HNSW (Hierarchical Navigable Small World) graphs[2].
With the development of the Search module in Kvrocks, integrating vector indexing will let users run vector similarity searches directly in Kvrocks, with real-time processing and efficient management of large-scale vector data.
This proposal explores potential implementations of vector similarity search that prioritize disk access patterns.
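As a point of reference for the FLAT method mentioned above, here is a minimal Python sketch of brute-force k-nearest-neighbor search. It is only an illustration, not how Redis or Kvrocks implements it: the names `flat_search` and `l2_squared`, the in-memory `dict` standing in for stored vectors, and the choice of squared L2 distance are all assumptions made for the example.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(
    vectors: Dict[str, Sequence[float]],  # hypothetical key -> vector store
    query: Sequence[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Compare the query against every stored vector and keep the k closest."""
    scored = ((l2_squared(vec, query), key) for key, vec in vectors.items())
    # nsmallest orders by distance first, then by key to break ties.
    return [(key, dist) for dist, key in heapq.nsmallest(k, scored)]


vectors = {"doc:1": [0, 0], "doc:2": [1, 1], "doc:3": [5, 5]}
top2 = flat_search(vectors, [2, 2], k=2)
```

Every query touches every vector, so cost grows linearly with the dataset. That is the baseline the graph-based and cluster-based indexes below try to beat.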
Potential Indexing Solutions
HNSW
Algorithm[3][6]
Pros
Cons
Vamana (DiskANN) indexing
Algorithm[4][7]
Pros[5]
Cons
Related Work and Impl
IVFFlat
Algorithm[11]
Pros
Cons
In short, I think HNSW should be implemented first, as it is more compatible with the Redis protocol and well proven in many existing frameworks. We could consider supporting Vamana for future use cases involving static datasets. Additionally, the inclusion of IVFFlat should be evaluated, particularly for scenarios where index size and build time are critical, even though it may require more frequent rebuilding as data is updated. To further improve HNSW, we could try HNSW + PQ [9] or SPANN [8], which, at a high level, clusters vectors first and then performs a more fine-grained search within the closest clusters. However, the first milestone is to successfully implement HNSW.
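The "cluster first, then search within the closest clusters" idea can be sketched in a few lines of Python. This is a simplified illustration of the coarse-to-fine pattern only, not an implementation of SPANN, HNSW + PQ, or IVFFlat as any library ships it. The centroids are assumed to be given (a real index would train them, for example with k-means), and the names `build_clusters`, `clustered_search`, and `nprobe` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_squared(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_clusters(
    vectors: Dict[str, Vector], centroids: List[Vector]
) -> Dict[int, List[str]]:
    """Assign each vector key to its nearest centroid (centroids are given)."""
    clusters: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        nearest = min(
            range(len(centroids)), key=lambda i: l2_squared(vec, centroids[i])
        )
        clusters[nearest].append(key)
    return clusters


def clustered_search(
    vectors: Dict[str, Vector],
    centroids: List[Vector],
    clusters: Dict[int, List[str]],
    query: Vector,
    k: int,
    nprobe: int,
) -> List[str]:
    """Return up to k keys, searching only the nprobe closest clusters."""
    # Coarse step: rank clusters by centroid distance to the query.
    ranked = sorted(
        range(len(centroids)), key=lambda i: l2_squared(query, centroids[i])
    )
    # Fine step: brute-force scan restricted to the selected clusters.
    candidates = [key for i in ranked[:nprobe] for key in clusters[i]]
    candidates.sort(key=lambda key: l2_squared(vectors[key], query))
    return candidates[:k]


vectors = {"a": [0, 0], "b": [1, 0], "c": [10, 10], "d": [11, 10]}
centroids = [[0, 0], [10, 10]]
clusters = build_clusters(vectors, centroids)
nearest = clustered_search(vectors, centroids, clusters, [9, 9], k=1, nprobe=1)
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of clusters, the search degenerates to a full scan.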
Similar Discussion
apache/lucene#12615
Appendix
ANN benchmarking tool: https://ann-benchmarks.com/glove-100-angular_10_angular.html
References
[1] Redis Vector Database: https://redis.io/docs/latest/develop/get-started/vector-database/
[2] Redis Search Reference: https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/
[3] Write You a Vector Database: https://skyzh.github.io/write-you-a-vector-db/cpp-06-01-nsw.html
[4] Zilliz Engineering Blog. DiskANN: A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset: https://zilliz.com/blog/diskann-a-disk-based-anns-solution-with-high-recall-and-high-qps-on-billion-scale-dataset
[5] Vamana vs. HNSW - Exploring ANN algorithms Part 1: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw
[6] Hierarchical Navigable Small Worlds (HNSW): https://www.pinecone.io/learn/series/faiss/hnsw/
[7] DiskANN and the Vamana Algorithm: https://zilliz.com/learn/DiskANN-and-the-Vamana-Algorithm
[8] SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search: https://arxiv.org/pdf/2111.08566.pdf
[9] HNSW+PQ - Exploring ANN algorithms: https://weaviate.io/blog/ann-algorithms-hnsw-pq
[10] Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW: https://tembo.io/blog/vector-indexes-in-pgvector
[11] Everything You Need to Know about Vector Index Basics: https://zilliz.com/learn/vector-index
I'm new to the vector database field, so any corrections, thoughts, and/or insights are welcome!
(P.S. The encoding part is not in the scope of this proposal, but there will be a new post about index encoding once the indexing method is decided here.)