New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
List vector ids by prefix #307
Conversation
@jhamon, awesome work. Excited to get that out! As for your open question: we do not plan to add anything else in the near future, but it theoretically might happen at some point. However if we end up adding just some auxiliary data it might be possible to modify the iterator function to deal with it/strip it away. @gdj0nes WDYT? |
pinecone/data/index.py
Outdated
yield [v.id for v in results.vectors] | ||
|
||
full_page = len(results.vectors) == limit | ||
if results.pagination and full_page: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhamon it's valid for a page to contain less than limit
results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you're describing is the current behavior, but I don't think it's valid. I reported it as a bug and your team filed an Asana issue for it. I added this "full page" check so that the behavior would be deterministic in testing and not fire off additional calls to fetch an empty array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug was specifically about adding an extra empty page at the end of the pagination loop, but the generic case of sending <limit results is valid - see the API Pagination PRD:
Users should not assume that a page containing fewer results than the limit value is necessarily the last one - and instead should always check the pagination.next field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Ben here. In the generic case we should have the ability to shards result and than it is common practice to return less than the full page and still have continuation token.
pinecone/grpc/index_grpc.py
Outdated
yield [v.id for v in results.vectors] | ||
|
||
full_page = len(results.vectors) == limit | ||
if results.pagination and results.pagination.next and full_page: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: it's valid for a page to contain less than limit
results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here looks good to me, helpful to go through this right before implementing on TypeScript.
I tried pulling this branch down and testing locally and I've run into issues with both list
and list_paginated
, although it seems to be the same error:
>>> pc.Index('step-test').list_paginated(namespace="test-list")
curl -X GET 'https://step-test-e6dddb2.svc.us-east-1-aws.pinecone.io/vectors/list?namespace=test-list' -H 'Accept: application/json' -H 'User-Agent: python-client-3.0.2 (urllib3:2.0.7)' -H 'Api-Key: redacted'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/austin/workspace/pinecone-python-client/pinecone/utils/error_handling.py", line 10, in inner_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/data/index.py", line 523, in list_paginated
return self._vector_api.list(**args_dict, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 772, in __call__
return self.callable(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api/vector_operations_api.py", line 712, in __list
return self.call_with_http_info(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 834, in call_with_http_info
return self.api_client.call_api(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 409, in call_api
return self.__call_api(resource_path, method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 224, in __call_api
return_data = self.deserialize(
^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 325, in deserialize
deserialized_data = validate_and_convert_types(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1539, in validate_and_convert_types
converted_instance = attempt_convert_item(
^^^^^^^^^^^^^^^^^^^^^
File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1421, in attempt_convert_item
raise get_type_error(input_value, path_to_item, valid_classes,
pinecone.core.client.exceptions.PineconeApiTypeError: Invalid type for variable 'received_data'. Required value type is ListResponse and passed type was str at ['received_data']
Your integration tests seem to be passing though so I'm assuming this is something on my end. I'll keep playing around with it.
print('Seeding data in namespace "' + namespace + '"') | ||
setup_data(idx, namespace, False) | ||
|
||
print('Seeding data in namespace ""') | ||
setup_data(idx, '', True) | ||
|
||
print('Waiting a bit more to ensure freshness') | ||
time.sleep(60) | ||
time.sleep(120) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this doubling of the sleep time here is intentional due to freshness concerns around the larger number of records.
assert results != None | ||
assert len(results.vectors) == 9 | ||
assert results.namespace == '' | ||
# assert results.pagination == None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these two commented asserts plus the one below in test_list_when_using_pagination
need to stay commented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, still need these. I filed a bug with the engine team to address this issue.
pinecone/data/index.py
Outdated
def list(self, **kwargs): | ||
limit = kwargs.get("limit", 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we've settled on what the functions are going to look like we should add some docstrings here and in index_grpc.py
.
e6eb581
to
cbb186c
Compare
I removed the full_page check and added in docstrings. |
Problem
Need to implement the new data plane list endpoint.
Solution
pinecone/core
are generated from spec files and can be ignored for this review.pinecone/data/index.py
main implementationpinecone/grpc/index_grpc.py
main implementationtests/integration/data/conftest.py
adjustments to test setup, mainly to generate a new namespace to hold vectors for list testing data.tests/integration/data/seed.py
to upsert a larger number of vectors, so I would have enough data to page throughtests/integration/data/test_list.py
tests/integration/data/test_list_errors.py
Open questions
{'id': '1'}
in the vectors array? For convenience thelist()
method is currently implemented as a generator function that abstracts the pagination steps and yields a flat list of id values. For a use case where you were going to immediately fetch those ids, this seems ideal. But would be limiting if we ever wanted to return more than just ids here.Usage
Install the dev client version
install "pinecone-client[grpc]"==3.1.0.dev1
REST
GRPC
Type of Change
Testing
Try out the dev version.