Skip to content

Feature: Allow embedding vector search for max_marginal_relevance_search #2620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rishabh208gupta opened this issue Jul 31, 2024 · 11 comments · Fixed by #2625
Closed

Feature: Allow embedding vector search for max_marginal_relevance_search #2620

rishabh208gupta opened this issue Jul 31, 2024 · 11 comments · Fixed by #2625

Comments

@rishabh208gupta
Copy link
Contributor

rishabh208gupta commented Jul 31, 2024

Describe the feature: Currently there is no option to perform the max_marginal_relevance_search with only query embeddings vector. The query is mandatory and because of which the embedding service needs to be passed.

@pquentin
Copy link
Member

pquentin commented Aug 1, 2024

Would it help you if the maximal_marginal_relevance function was exposed? Or do you really just want an option to only provide query_embedding instead of query in max_marginal_relevance_search?

@rishabh208gupta
Copy link
Contributor Author

rishabh208gupta commented Aug 1, 2024

I was actually looking for an option to only provide query_embedding instead of query and embedding_service in max_marginal_relevance_search

P.S. I am not sure how it works, cause I have never contributed to an open source library, but I would like to work on this if possible. Thanks

@pquentin
Copy link
Member

pquentin commented Aug 1, 2024

Sure! The first step is to get the test to pass locally:

def test_max_marginal_relevance_search(
self, sync_client: Elasticsearch, index: str
) -> None:
"""Test max marginal relevance search."""
texts = ["foo", "bar", "baz"]
vector_field = "vector_field"
text_field = "text_field"
embedding_service = ConsistentFakeEmbeddings()
store = VectorStore(
index=index,
retrieval_strategy=DenseVectorScriptScoreStrategy(),
embedding_service=embedding_service,
vector_field=vector_field,
text_field=text_field,
client=sync_client,
)
store.add_texts(texts)
mmr_output = store.max_marginal_relevance_search(
embedding_service=embedding_service,
query=texts[0],
vector_field=vector_field,
k=3,
num_candidates=3,
)
sim_output = store.search(query=texts[0], k=3)
assert mmr_output == sim_output
mmr_output = store.max_marginal_relevance_search(
embedding_service=embedding_service,
query=texts[0],
vector_field=vector_field,
k=2,
num_candidates=3,
)
assert len(mmr_output) == 2
assert mmr_output[0]["_source"][text_field] == texts[0]
assert mmr_output[1]["_source"][text_field] == texts[1]
mmr_output = store.max_marginal_relevance_search(
embedding_service=embedding_service,
query=texts[0],
vector_field=vector_field,
k=2,
num_candidates=3,
lambda_mult=0.1, # more diversity
)
assert len(mmr_output) == 2
assert mmr_output[0]["_source"][text_field] == texts[0]
assert mmr_output[1]["_source"][text_field] == texts[2]
# if fetch_k < k, then the output will be less than k
mmr_output = store.max_marginal_relevance_search(
embedding_service=embedding_service,
query=texts[0],
vector_field=vector_field,
k=3,
num_candidates=2,
)
assert len(mmr_output) == 2

https://github.com/elastic/elasticsearch-py/blob/main/CONTRIBUTING.md could help you here. The formatting step (nox -rs format) is important because it generates the sync version of max_marginal_relevance_search from the async version. But only the sync version is tested.

@rishabh208gupta
Copy link
Contributor Author

Thanks! Let me give this a go.

@pquentin
Copy link
Member

pquentin commented Aug 2, 2024

Please ask for help if you get stuck, the current system is quite fragile and does not provide helpful errors when things are not setup in the exact same way as in continuous integration.

@rishabh208gupta
Copy link
Contributor Author

Thanks! so I started elastic search locally using the script .buildkite/run-elasticsearch.sh. After that I ran nox -rs test (I didn't run nox -rs format cause I just wanted to run the existing tests successfully). Some of the test cases including test_max_marginal_relevance_search failed due to an authentication error. Is there anything else that I need to do before successfully executing all the existing tests?

@pquentin
Copy link
Member

pquentin commented Aug 2, 2024

What STACK_VERSION did you set for run-elasticsearch.sh?

And are you able to connect to that Elasticsearch instance?

$ curl --resolve instance:9200:127.0.0.1 --cacert .buildkite/certs/ca.pem https://elastic:changeme@instance:9200
{
  "name" : "instance",
  "cluster_name" : "elasticsearch-8-11-0-SNAPSHOT-rest-test",
  "cluster_uuid" : "8UO33t3DS9mziGEfeJNfGw",
  "version" : {
    "number" : "8.11.0-SNAPSHOT",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "edf70163e6e5371a90090831e055b31973958cf0",
    "build_date" : "2023-11-06T18:49:41.957873063Z",
    "build_snapshot" : true,
    "lucene_version" : "9.8.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

The --resolve trick is needed because the SSL certificate in .buildkite/certs/ is for "instance", so curl has to think it's connecting to that host.

@rishabh208gupta
Copy link
Contributor Author

rishabh208gupta commented Aug 2, 2024

Yeah I was able to connect to the elastic search instance :

rishabhgupta@Rishabhs-MacBook-Pro elasticsearch-py % curl --resolve instance:9200:127.0.0.1 --cacert .buildkite/certs/ca.pem https://elastic:changeme@instance:9200
{
  "name" : "instance",
  "cluster_name" : "elasticsearch-8-11-0-SNAPSHOT-rest-test",
  "cluster_uuid" : "RQFi2OFNTnOr754BMEXPnA",
  "version" : {
    "number" : "8.11.0-SNAPSHOT",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "edf70163e6e5371a90090831e055b31973958cf0",
    "build_date" : "2023-11-06T18:49:41.957873063Z",
    "build_snapshot" : true,
    "lucene_version" : "9.8.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

The error that I can for the test :

ERROR test_elasticsearch/test_server/test_vectorstore/test_vectorstore.py::TestVectorStore::test_max_marginal_relevance_search - elasticsearch.AuthenticationException: AuthenticationException(401, 'security_exception', 'missing authentication credentials for REST request [/]')

@pquentin
Copy link
Member

pquentin commented Aug 2, 2024

OK, this is pretty annoying. I had to add "instance" in /etc/hosts by adding the following entry:

127.0.0.1	instance

Now, that test runs fine with:

STACK_VERSION=8.15.0-SNAPSHOT .buildkite/run-elasticsearch.sh

and then:

ELASTICSEARCH_URL=https://elastic:changeme@instance:9200 nox -rs test-3.12

Also, I added the following change to only run that test:

diff --git a/noxfile.py b/noxfile.py
index 69e53417..730fb27d 100644
--- a/noxfile.py
+++ b/noxfile.py
@@ -42,6 +42,8 @@ def pytest_argv():
         "--log-level=DEBUG",
         "--cache-clear",
         "-vv",
+        "-k",
+        "test_max_marginal_relevance_search"
     ]

rishabh208gupta pushed a commit to rishabh208gupta/elasticsearch-py that referenced this issue Aug 2, 2024

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
@rishabh208gupta
Copy link
Contributor Author

Thanks a lot! I was able to run the test, I have created a PR

rishabh208gupta pushed a commit to rishabh208gupta/elasticsearch-py that referenced this issue Aug 2, 2024
Signed-off-by: Rishabh Gupta <rishabhgupta@Rishabhs-MacBook-Pro.local>
rishabh208gupta added a commit to rishabh208gupta/elasticsearch-py that referenced this issue Aug 6, 2024
Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>
rishabh208gupta added a commit to rishabh208gupta/elasticsearch-py that referenced this issue Aug 6, 2024
Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>
@rishabh208gupta
Copy link
Contributor Author

Hi @pquentin, just wanted to check with you on the status of this, any estimate of when this might be reviewed/merged?

miguelgrinberg pushed a commit that referenced this issue Aug 13, 2024
* allow embeddings vector to be used for mmr searching (#2620)

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>

* Use embedding service if provided

---------

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>
Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
github-actions bot pushed a commit that referenced this issue Aug 13, 2024
* allow embeddings vector to be used for mmr searching (#2620)

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>

* Use embedding service if provided

---------

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>
Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
(cherry picked from commit 3b1bce7)
miguelgrinberg pushed a commit that referenced this issue Aug 13, 2024
…2639)

* allow embeddings vector to be used for mmr searching (#2620)

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>

* Use embedding service if provided

---------

Signed-off-by: rishabh208gupta <rishabhgupta.52pp@gmail.com>
Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
(cherry picked from commit 3b1bce7)

Co-authored-by: Rishabh Gupta <71093470+rishabh208gupta@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants