[Proposal] Vector Similarity Search Indexing #2287

Beihao-Zhou · 2024-05-02T17:31:52Z

Kvrocks Vector Similarity Search Indexing Proposal

Background

Redis Vector Search[1] enables real-time indexing, updating, and querying of vectors using two methods: FLAT, which performs brute-force indexing, and HNSW (Hierarchical Navigable Small World) graphs[2].

With the development of the Search module in Kvrocks, integrating vector indexing capabilities will let users run vector similarity searches in Kvrocks, with real-time processing and efficient management of large-scale vector data.

This proposal explores potential implementations of vector similarity search that prioritize disk access patterns.

Potential Indexing Solutions

HNSW

Algorithm[3][6]

Build a hierarchy of layers to speed up the traversal of the nearest neighbor graph.
In this graph, the top layers contain only long-range edges.
The deeper the search traverses through the hierarchy, the shorter the distance between vectors captured in the edges.
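
The layered traversal described above can be sketched as follows. This is a minimal illustration, not the Kvrocks or Redis implementation: the names `greedy_descend`, `layers`, and `entry` are hypothetical, the graph is hand-built, and the walk keeps only one candidate per layer (real HNSW keeps a candidate list whose size is a tunable parameter).

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_descend(vectors, layers, entry, query):
    """Walk from the top layer down to layer 0, always moving to the closest neighbor.

    `layers[i]` maps node id -> neighbor ids on layer i; layer 0 holds every node.
    Assumption: a single-candidate greedy walk, which is a simplification.
    """
    current = entry
    for layer in reversed(layers):  # top layer first
        while True:
            neighbors = layer.get(current, [])
            best = min(neighbors, key=lambda n: dist(vectors[n], query), default=current)
            if dist(vectors[best], query) >= dist(vectors[current], query):
                break  # local minimum on this layer, drop to the next one
            current = best
    return current

# Six points on a line; upper layers keep fewer nodes and longer edges.
vectors = {i: (float(i), 0.0) for i in range(6)}
layers = [
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]},  # layer 0: short edges
    {0: [2], 2: [0, 4], 4: [2]},                                   # layer 1
    {0: [4], 4: [0]},                                              # layer 2: long-range edge
]

nearest = greedy_descend(vectors, layers, entry=0, query=(4.6, 0.0))
```

Here the top layer jumps from node 0 to node 4 in one hop, and layer 0 then refines the result to node 5.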

Pros

Compatible with original Redis protocol
Real-time insertion
A proven, performant vector indexing algorithm used in many frameworks:
- On-disk HNSW index for Postgres with pg_embedding
- Faiss HNSW impl

Cons

Requires additional indexing layers or metadata to manage disk-based graph traversal, which increases both the number of disk round trips and the amount of metadata stored.

Vamana (DiskANN) indexing

Algorithm[4][7]

Build a random graph.
Optimize the graph, so it only connects vectors close to each other.
Modify the graph by removing some short connections and adding some long-range edges to speed up the traversal of the graph.
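
The last step can be illustrated with a pruning rule in the spirit of the one described for Vamana in [4][7]. This is a sketch under assumptions: the function name `robust_prune` and the parameters `alpha` and `max_degree` are chosen here for illustration, and the real algorithm also gathers candidates through a graph search that is omitted below.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def robust_prune(vectors, p, candidates, alpha, max_degree):
    """Choose up to `max_degree` out-neighbors for node `p` from `candidates`.

    After picking the closest remaining candidate `best`, drop every candidate c
    with alpha * d(best, c) <= d(p, c): such a c is reachable through `best`,
    so a direct edge p -> c would be redundant.
    """
    pool = sorted(
        (c for c in set(candidates) if c != p),
        key=lambda c: (dist(vectors[p], vectors[c]), c),
    )
    chosen = []
    while pool and len(chosen) < max_degree:
        best = pool.pop(0)  # closest remaining candidate
        chosen.append(best)
        pool = [
            c for c in pool
            if alpha * dist(vectors[best], vectors[c]) > dist(vectors[p], vectors[c])
        ]
    return chosen

# Hypothetical toy data: node 0 with two near, one mid-range, and one far candidate.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (10.0, 0.0), 4: (0.0, 3.0)}

strict = robust_prune(vectors, 0, [1, 2, 3, 4], alpha=1.0, max_degree=3)
relaxed = robust_prune(vectors, 0, [1, 2, 3, 4], alpha=1.2, max_degree=3)
```

With `alpha=1.0` the far node 3 is pruned because node 1 already lies in its direction; with `alpha=1.2` the edge to node 3 survives as a long-range shortcut, while the redundant short edge to node 2 is dropped in both cases.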

Pros[5]

Minimizes the footprint of each index and reduces redundancy.
Designed with disk-based systems in mind, reducing the number of disk seeks during queries.

Cons

Static nature: The initial design and common implementations of DiskANN are generally static. Once the index is built, it is not designed to incorporate new data points dynamically. Potentially, we could apply pruning to newly inserted data points; however, I have not found research or blog posts that implement and benchmark this.
Redis explicitly supports HNSW, and the parameters for Vamana differ from those of HNSW, although there might be a corresponding mapping between the parameters of the two models.

Related Work and Impl

IVFFlat

Algorithm[11]

IVFFlat divides vectors into multiple lists based on a number of computed centroids, forming clusters around these centroids.
Each list corresponds to a cluster and contains vectors close to that centroid.
During search, instead of comparing to all vectors, the algorithm narrows down to subsets of lists based on the proximity of their centroids to the query vector.
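
A minimal sketch of these steps follows. It assumes the centroids are already given (real implementations typically compute them with k-means), and the names `build_ivf`, `search_ivf`, and `nprobe` are illustrative rather than taken from any specific library.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        lists[nearest].append(vid)
    return lists

def search_ivf(vectors, centroids, lists, query, k, nprobe):
    """Scan only the `nprobe` lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in lists[i]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

# Hypothetical toy data: two fixed centroids, six vectors.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {
    0: (0.0, 1.0), 1: (1.0, 0.0), 2: (4.9, 0.0),
    3: (9.0, 0.0), 4: (10.0, 1.0), 5: (5.2, 0.0),
}
lists = build_ivf(vectors, centroids)

query = (5.02, 0.0)
one_list = search_ivf(vectors, centroids, lists, query, k=1, nprobe=1)
two_lists = search_ivf(vectors, centroids, lists, query, k=1, nprobe=2)
```

Probing a single list returns vector 5 and misses the true nearest neighbor, vector 2, which sits just across the cluster boundary; probing both lists finds it. This is the recall versus speed trade-off that the number of probed lists controls.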

Pros

Limits search to relevant clusters, reducing the number of distance calculations.
Since vectors are grouped by similarity and the index has little space overhead compared to a graph index, this can potentially reduce storage needs.

Cons

Changes in Recall upon Updates: Significant impact on recall if vectors are added or modified, as it might require recalculating centroids and redistributing vectors.
Potential Need for Re-indexing: Regular updates or additions may necessitate frequent re-indexing to maintain efficiency and accuracy.

💡 Comparative Analysis with HNSW [10]

Robustness to Updates: HNSW handles updates and modifications with minimal impact on recall.
Index Size: IVFFlat has a smaller storage footprint.
Query Speed: HNSW is substantially faster in terms of queries per second.
Build Time: IVFFlat is significantly faster to build compared to HNSW.

In short, I think HNSW should be implemented first, as it is more compatible with the Redis protocol and well proven in many existing frameworks. We could consider supporting Vamana for future use cases involving static datasets. Additionally, the inclusion of IVFFlat should be evaluated, particularly for scenarios where index size and build time are critical, even though it may require more frequent rebuilding as data is updated. To further improve HNSW, we could try HNSW + PQ [9] or SPANN [8], which, at a high level, clusters vectors first and then performs a more fine-grained search within the closest clusters. However, the first milestone is to successfully implement HNSW.

Similar Discussion

apache/lucene#12615

Appendix

ANN benchmarking tool: https://ann-benchmarks.com/glove-100-angular_10_angular.html

References

[1] Redis Vector Database: https://redis.io/docs/latest/develop/get-started/vector-database/

[2] Redis Search Reference: https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/

[3] Write You a Vector Database: https://skyzh.github.io/write-you-a-vector-db/cpp-06-01-nsw.html

[4] Zilliz Engineering Blog. DiskANN: A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset: https://zilliz.com/blog/diskann-a-disk-based-anns-solution-with-high-recall-and-high-qps-on-billion-scale-dataset

[5] Vamana vs. HNSW - Exploring ANN algorithms Part 1: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw

[6] Hierarchical Navigable Small Worlds (HNSW): https://www.pinecone.io/learn/series/faiss/hnsw/

[7] DiskANN and the Vamana Algorithm: https://zilliz.com/learn/DiskANN-and-the-Vamana-Algorithm

[8] SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search: https://arxiv.org/pdf/2111.08566.pdf

[9] HNSW+PQ - Exploring ANN algorithms: https://weaviate.io/blog/ann-algorithms-hnsw-pq

[10] Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW: https://tembo.io/blog/vector-indexes-in-pgvector

[11] Everything You Need to Know about Vector Index Basics: https://zilliz.com/learn/vector-index

I'm new to the vector database field, so any corrections, thoughts, and/or insights are welcome!
(P.S. The encoding part is not in the scope of this proposal, but there will be a new post about index encoding once the indexing method is decided here.)

PragmaTwice · 2024-05-04T03:01:29Z

Collaborator

Thank you for providing excellent preliminary research and suggestions, which have given us direction to implement vector search.

I also agree that we can first try to implement HNSW (including designing an efficient encoding that maps the HNSW index onto RocksDB key-values). In a later phase, we can introduce other indexes as the situation requires.

If anyone in the community is interested, you are welcome to join the discussion! cc @git-hulk @mapleFU @Yangsx-1

2 replies

git-hulk May 4, 2024
Collaborator

Very excited to see this proposal. I also think it will make many users happy even if it only supports HNSW.

Beihao-Zhou May 5, 2024
Author

Sounds good!! Then I'll come up with the design for HNSW encoding soon. I look forward to all your feedback!
