List vector ids by prefix #307

jhamon · 2024-02-06T20:10:58Z

Problem

Need to implement the new data plane list endpoint.

Solution

Update generated code. Pull in list endpoint. All changes under pinecone/core are generated from spec files and can be ignored for this review.
Implement changes for list endpoint in:
- pinecone/data/index.py main implementation
- pinecone/grpc/index_grpc.py main implementation
- tests/integration/data/conftest.py adjustments to test setup, mainly to generate a new namespace to hold vectors for list testing data.
- tests/integration/data/seed.py to upsert a larger number of vectors, so I would have enough data to page through
- tests/integration/data/test_list.py
- tests/integration/data/test_list_errors.py

Open questions

Do we expect to ever return more data than just {'id': '1'} in the vectors array? For convenience, the list() method is currently implemented as a generator function that abstracts the pagination steps and yields a flat list of id values for each page. For a use case where you immediately fetch those ids, this seems ideal, but it would be limiting if we ever wanted to return more than just ids here.
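To make the "list, then immediately fetch" use case concrete, here is a minimal sketch. StubIndex is a hypothetical in-memory stand-in written for this example, not the real client; its list() yields one list of ids per page like the generator described above, and its fetch() is simplified to return a plain dict rather than a response object.

```python
class StubIndex:
    """Hypothetical in-memory stand-in for an index (not the real client)."""

    def __init__(self, records):
        self._records = records  # maps id -> vector values

    def list(self, prefix="", limit=100, namespace=""):
        # Yield one list of matching ids per page, at most `limit` ids each.
        ids = sorted(i for i in self._records if i.startswith(prefix))
        for start in range(0, len(ids), limit):
            yield ids[start:start + limit]

    def fetch(self, ids, namespace=""):
        # Simplification: return a plain dict instead of a response object.
        return {i: self._records[i] for i in ids}


index = StubIndex({"pref1": [0.1], "pref2": [0.2], "pref3": [0.3], "other": [0.9]})

# Each page of ids can be passed straight to fetch().
fetched = {}
for ids in index.list(prefix="pref", limit=2):
    fetched.update(index.fetch(ids=ids))

print(sorted(fetched))
```

Because each yielded page is already a flat list of ids, no unpacking is needed before the fetch call, which is the convenience the generator design is aiming for.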

Usage

Install the dev client version: pip install "pinecone-client[grpc]"==3.1.0.dev1

REST

from pinecone import Pinecone

pc = Pinecone(api_key='xxx')
index = pc.Index(host='hosturl')

# To iterate over all result pages using a generator function
for ids in index.list(prefix='pref', limit=3, namespace='foo'):
    print(ids)  # ['pref1', 'pref2', 'pref3']

# For manual control over pagination
results = index.list_paginated(
    prefix='pref', 
    limit=3, 
    namespace='foo', 
    pagination_token='eyJza2lwX3Bhc3QiOiI5IiwicHJlZml4IjpudWxsfQ=='
)
print(results.namespace)
print([v.id for v in results.vectors])
print(results.pagination.next)
print(results.usage)

GRPC

from pinecone.grpc import PineconeGRPC

pc = PineconeGRPC(api_key='xxx')
index = pc.Index(host='hosturl')

# To iterate over all result pages using a generator function
for ids in index.list(prefix='pref', limit=3, namespace='foo'):
    print(ids)  # ['pref1', 'pref2', 'pref3']

# For manual control over pagination
results = index.list_paginated(
    prefix='pref', 
    limit=3, 
    namespace='foo', 
    pagination_token='eyJza2lwX3Bhc3QiOiI5IiwicHJlZml4IjpudWxsfQ=='
)
print(results.namespace)
print([v.id for v in results.vectors])
print(results.pagination.next)
print(results.usage)

Type of Change

New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Testing

Try out the dev version.

pip install "pinecone-client[grpc]"==3.1.0.dev1

opbenesh · 2024-02-08T11:18:08Z

@jhamon, awesome work. Excited to get that out!

As for your open question: we do not plan to add anything else in the near future, but it theoretically might happen at some point. However, if we end up adding only some auxiliary data, it might be possible to modify the iterator function to handle it or strip it away.
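A rough sketch of that idea: the iterator could keep yielding bare ids even if entries in the vectors array later gained extra fields. The ids_from_page helper and the "extra" key are hypothetical, invented only for illustration; nothing in this PR adds them.

```python
def ids_from_page(vectors):
    """Return only the id of each entry, ignoring any other keys."""
    return [v["id"] for v in vectors]


# Shape returned today: only an id per entry.
today = [{"id": "pref1"}, {"id": "pref2"}]

# Hypothetical future shape with an auxiliary field (an assumption).
future = [{"id": "pref1", "extra": "x"}, {"id": "pref2", "extra": "y"}]

print(ids_from_page(today))
print(ids_from_page(future))
```

Callers of the generator would see the same output for both shapes, so adding auxiliary data would not have to be a breaking change for them.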

@gdj0nes WDYT?

opbenesh · 2024-02-11T12:50:38Z

pinecone/data/index.py

+                yield [v.id for v in results.vectors]
+
+            full_page = len(results.vectors) == limit
+            if results.pagination and full_page:


@jhamon, it's valid for a page to contain fewer than limit results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.

jhamon · 2024-02-12

What you're describing is the current behavior, but I don't think it's valid. I reported it as a bug and your team filed an Asana issue for it. I added this "full page" check so that the behavior would be deterministic in testing and not fire off additional calls to fetch an empty array.

opbenesh · 2024-02-14

The bug was specifically about adding an extra empty page at the end of the pagination loop, but the generic case of returning fewer than limit results is valid. See the API Pagination PRD:

Users should not assume that a page containing fewer results than the limit value is necessarily the last one - and instead should always check the pagination.next field.

tomer-w · 2024-02-19

I agree with Ben here. In the generic case, we should have the ability to shard results, and then it is common practice to return less than a full page and still have a continuation token.
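The rule discussed in this thread (stop only when no pagination token is returned, regardless of page size) can be sketched as follows. This is a minimal illustration, not the code merged in this PR: PAGES, fake_list_paginated, and list_ids are hypothetical stand-ins so the loop can run without a Pinecone index.

```python
from types import SimpleNamespace

# Hypothetical canned responses standing in for a real index.
# The second page is shorter than the limit but still carries a token.
PAGES = {
    None: (["a1", "a2", "a3"], "t1"),
    "t1": (["a4"], "t2"),
    "t2": (["a5", "a6"], None),
}


def fake_list_paginated(limit=3, pagination_token=None):
    """Stand-in for a paginated list call; returns one canned page."""
    ids, next_token = PAGES[pagination_token]
    pagination = SimpleNamespace(next=next_token) if next_token else None
    return SimpleNamespace(
        vectors=[SimpleNamespace(id=i) for i in ids],
        pagination=pagination,
    )


def list_ids(list_page, **kwargs):
    """Yield one list of ids per page until no pagination token remains."""
    token = None
    while True:
        page_kwargs = dict(kwargs)
        if token is not None:
            page_kwargs["pagination_token"] = token
        results = list_page(**page_kwargs)
        if results.vectors:
            yield [v.id for v in results.vectors]
        # Page length is deliberately not consulted; only the token matters.
        if results.pagination and results.pagination.next:
            token = results.pagination.next
        else:
            break


pages = list(list_ids(fake_list_paginated, limit=3))
print(pages)
```

With a full_page check in place, iteration over this data would stop after the one-element second page and silently drop the last two ids, which is the failure mode the reviewers are pointing at.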

opbenesh · 2024-02-11T12:51:17Z

pinecone/grpc/index_grpc.py

+                yield [v.id for v in results.vectors]
+
+            full_page = len(results.vectors) == limit
+            if results.pagination and results.pagination.next and full_page:


Same as above: it's valid for a page to contain fewer than limit results without being the last page. While I do not see that happening with List, I prefer to leave this freedom to engineering.
Therefore, the only indication of whether we've reached the end is whether the result has any pagination token.

austin-denoble

The code here looks good to me. It was helpful to go through this right before implementing it in TypeScript.

I tried pulling this branch down and testing locally and I've run into issues with both list and list_paginated, although it seems to be the same error:

>>> pc.Index('step-test').list_paginated(namespace="test-list")
curl -X GET 'https://step-test-e6dddb2.svc.us-east-1-aws.pinecone.io/vectors/list?namespace=test-list' -H 'Accept: application/json' -H 'User-Agent: python-client-3.0.2 (urllib3:2.0.7)' -H 'Api-Key: redacted'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/austin/workspace/pinecone-python-client/pinecone/utils/error_handling.py", line 10, in inner_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/data/index.py", line 523, in list_paginated
    return self._vector_api.list(**args_dict, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 772, in __call__
    return self.callable(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api/vector_operations_api.py", line 712, in __list
    return self.call_with_http_info(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 834, in call_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 409, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 224, in __call_api
    return_data = self.deserialize(
                  ^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/api_client.py", line 325, in deserialize
    deserialized_data = validate_and_convert_types(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1539, in validate_and_convert_types
    converted_instance = attempt_convert_item(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/workspace/pinecone-python-client/pinecone/core/client/model_utils.py", line 1421, in attempt_convert_item
    raise get_type_error(input_value, path_to_item, valid_classes,
pinecone.core.client.exceptions.PineconeApiTypeError: Invalid type for variable 'received_data'. Required value type is ListResponse and passed type was str at ['received_data']

Your integration tests seem to be passing, though, so I'm assuming this is something on my end. I'll keep playing around with it.

austin-denoble · 2024-02-14T21:44:32Z

tests/integration/data/conftest.py

    print('Seeding data in namespace "' + namespace + '"')
    setup_data(idx, namespace, False)

    print('Seeding data in namespace ""')
    setup_data(idx, '', True)

    print('Waiting a bit more to ensure freshness')
-    time.sleep(60)
+    time.sleep(120)


I'm assuming this doubling of the sleep time here is intentional due to freshness concerns around the larger number of records.

austin-denoble · 2024-02-14T21:46:15Z

tests/integration/data/test_list.py

+        assert results != None
+        assert len(results.vectors) == 9
+        assert results.namespace == ''
+        # assert results.pagination == None


Do these two commented asserts plus the one below in test_list_when_using_pagination need to stay commented?

jhamon · 2024-02-23

Yeah, still need these. I filed a bug with the engine team to address this issue.

austin-denoble · 2024-02-21T20:44:55Z

pinecone/data/index.py

+    def list(self, **kwargs):
+        limit = kwargs.get("limit", 100)


When we've settled on what the functions are going to look like we should add some docstrings here and in index_grpc.py.

jhamon · 2024-02-23T23:04:15Z

I removed the full_page check and added in docstrings.

jhamon changed the title ~~Generated changes~~ WIP on list by id prefix Feb 7, 2024

jhamon requested review from austin-denoble and opbenesh February 7, 2024 06:56

jhamon marked this pull request as ready for review February 7, 2024 06:57

opbenesh requested changes Feb 11, 2024

austin-denoble approved these changes Feb 14, 2024

austin-denoble reviewed Feb 21, 2024

Rebase changes

cbb186c

jhamon force-pushed the jhamon/update-core-feb6 branch from e6eb581 to cbb186c Compare February 22, 2024 23:02

jhamon added 3 commits February 23, 2024 14:46

Remove full_page check; add docs

a8b1e9f

Merge branch 'main' into jhamon/update-core-feb6

a3c46e8

Update README

2b06370

jhamon changed the title ~~WIP on list by id prefix~~ List vector ids by prefix Feb 23, 2024

jhamon merged commit 2ca1eb9 into main Feb 23, 2024
124 checks passed

jhamon deleted the jhamon/update-core-feb6 branch February 23, 2024 23:04
