Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List vector ids by prefix #307

Merged
merged 4 commits into from Feb 23, 2024
Merged

List vector ids by prefix #307

merged 4 commits into from Feb 23, 2024

Conversation

jhamon
Copy link
Collaborator

@jhamon jhamon commented Feb 6, 2024

Problem

Need to implement the new data plane list endpoint.

Solution

  • Update generated code. Pull in list endpoint. All changes under pinecone/core are generated from spec files and can be ignored for this review.
  • Implement changes for list endpoint in:
    • pinecone/data/index.py main implementation
    • pinecone/grpc/index_grpc.py main implementation
    • tests/integration/data/conftest.py adjustments to test setup, mainly to generate a new namespace to hold vectors for list testing data.
    • tests/integration/data/seed.py to upsert a larger number of vectors, so I would have enough data to page through
    • tests/integration/data/test_list.py
    • tests/integration/data/test_list_errors.py

Open questions

  • Do we expect to ever return more data than just {'id': '1'} in the vectors array? For convenience the list() method is currently implemented as a generator function that abstracts the pagination steps and yields a flat list of id values. For a use case where you were going to immediately fetch those ids, this seems ideal. But would be limiting if we ever wanted to return more than just ids here.

Usage

Install the dev client version install "pinecone-client[grpc]"==3.1.0.dev1

REST

from pinecone import Pinecone

pc = Pinecone(api_key='xxx')
index = pc.Index(host='hosturl')

# To iterate over all result pages using a generator function
for ids in index.list(prefix='pref', limit=3, namespace='foo'):
    print(ids) // ['pref1', 'pref2', 'pref3']

# For manual control over pagination
results = index.list_paginated(
    prefix='pref', 
    limit=3, 
    namespace='foo', 
    pagination_token='eyJza2lwX3Bhc3QiOiI5IiwicHJlZml4IjpudWxsfQ=='
)
print(results.namespace)
print([v.id for v in results.vectors])
print(results.pagination.next)
print(results.usage)

GRPC

from pinecone.grpc import PineconeGRPC

pc = PineconeGRPC(api_key='xxx')
index = pc.Index(host='hosturl')

# To iterate over all result pages using a generator function
for ids in index.list(prefix='pref', limit=3, namespace='foo'):
    print(ids) // ['pref1', 'pref2', 'pref3']

# For manual control over pagination
results = index.list_paginated(
    prefix='pref', 
    limit=3, 
    namespace='foo', 
    pagination_token='eyJza2lwX3Bhc3QiOiI5IiwicHJlZml4IjpudWxsfQ=='
)
print(results.namespace)
print([v.id for v in results.vectors])
print(results.pagination.next)
print(results.usage)

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Testing

Try out the dev version.

pip install "pinecone-client[grpc]"==3.1.0.dev1

@jhamon jhamon changed the title Generated changes WIP on list by id prefix Feb 7, 2024
@jhamon jhamon marked this pull request as ready for review February 7, 2024 06:57
@opbenesh
Copy link

opbenesh commented Feb 8, 2024

@jhamon, awesome work. Excited to get that out!

As for your open question: we do not plan to add anything else in the near future, but it theoretically might happen at some point. However if we end up adding just some auxiliary data it might be possible to modify the iterator function to deal with it/strip it away.

@gdj0nes WDYT?

yield [v.id for v in results.vectors]

full_page = len(results.vectors) == limit
if results.pagination and full_page:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhamon it's valid for a page to contain less than limit results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're describing is the current behavior, but I don't think it's valid. I reported it as a bug and your team filed an Asana issue for it. I added this "full page" check so that the behavior would be deterministic in testing and not fire off additional calls to fetch an empty array.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug was specifically about adding an extra empty page at the end of the pagination loop, but the generic case of sending <limit results is valid - see the API Pagination PRD:

Users should not assume that a page containing fewer results than the limit value is necessarily the last one - and instead should always check the pagination.next field.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Ben here. In the generic case we should have the ability to shards result and than it is common practice to return less than the full page and still have continuation token.

yield [v.id for v in results.vectors]

full_page = len(results.vectors) == limit
if results.pagination and results.pagination.next and full_page:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: it's valid for a page to contain less than limit results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.

Copy link
Contributor

@austin-denoble austin-denoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here looks good to me, helpful to go through this right before implementing on TypeScript.

I tried pulling this branch down and testing locally and I've run into issues with both list and list_paginated, although it seems to be the same error:

>>> pc.Index('step-test').list_paginated(namespace="test-list")
curl -X GET 'https://step-test-e6dddb2.svc.us-east-1-aws.pinecone.io/vectors/list?namespace=test-list' -H 'Accept: application/json' -H 'User-Agent: python-client-3.0.2 (urllib3:2.0.7)' -H 'Api-Key: redacted'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/austin/workspace/pinecone-python-client/pinecone/utils/error_handling.py", line 10, in inner_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/data/index.py", line 523, in list_paginated
    return self._vector_api.list(**args_dict, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 772, in __call__
    return self.callable(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api/vector_operations_api.py", line 712, in __list
    return self.call_with_http_info(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 834, in call_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 409, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 224, in __call_api
    return_data = self.deserialize(
                  ^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 325, in deserialize
    deserialized_data = validate_and_convert_types(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1539, in validate_and_convert_types
    converted_instance = attempt_convert_item(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1421, in attempt_convert_item
    raise get_type_error(input_value, path_to_item, valid_classes,
pinecone.core.client.exceptions.PineconeApiTypeError: Invalid type for variable 'received_data'. Required value type is ListResponse and passed type was str at ['received_data']

Your integration tests seem to be passing though so I'm assuming this is something on my end. I'll keep playing around with it.

print('Seeding data in namespace "' + namespace + '"')
setup_data(idx, namespace, False)

print('Seeding data in namespace ""')
setup_data(idx, '', True)

print('Waiting a bit more to ensure freshness')
time.sleep(60)
time.sleep(120)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this doubling of the sleep time here is intentional due to freshness concerns around the larger number of records.

assert results != None
assert len(results.vectors) == 9
assert results.namespace == ''
# assert results.pagination == None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these two commented asserts plus the one below in test_list_when_using_pagination need to stay commented?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, still need these. I filed a bug with the engine team to address this issue.

Comment on lines 526 to 527
def list(self, **kwargs):
limit = kwargs.get("limit", 100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we've settled on what the functions are going to look like we should add some docstrings here and in index_grpc.py.

@jhamon jhamon changed the title WIP on list by id prefix List vector ids by prefix Feb 23, 2024
@jhamon
Copy link
Collaborator Author

jhamon commented Feb 23, 2024

I removed the full_page check and added in docstrings.

@jhamon jhamon merged commit 2ca1eb9 into main Feb 23, 2024
124 checks passed
@jhamon jhamon deleted the jhamon/update-core-feb6 branch February 23, 2024 23:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants