
GH-20512: [Python] Numpy conversion doesn't account for ListArray offset #15210

Merged
merged 7 commits into from Jan 17, 2023

Conversation


@wjones127 wjones127 commented Jan 5, 2023

github-actions bot commented Jan 5, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 519 to 521
auto values = checked_cast<ListArray*>(arr_sliced.get())->values();
auto expected_values = ArrayFromJSON(int16(), "[1, 2, 3, 4, 5]");
AssertArraysEqual(*expected_values, *values);
wjones127 (Member Author):

Should we make this expected behavior or not?

Member:

I just commented on the JIRA as well. In the current design, we shouldn't change that: it is Flatten() that already implements this, and keeping the values unsliced ensures that they still match the offsets of a sliced ListArray.

It is certainly confusing, though. I wonder if we should give it a scarier name like "raw_values".

wjones127 (Member Author):

+1 on renaming to raw_values. Or at the very least, we should update the doc comment; right now it's not clear that it doesn't account for offset and length.

The problem with Flatten though is that it removes the null values from the values array, but in this case we want them.

Member:

The problem with Flatten though is that it removes the null values from the values array, but in this case we want them.

Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

This corner case is covered by existing tests?

wjones127 (Member Author):

I should add a test.

And you might be right about Flatten: it does include nulls that are in the values (as opposed to null lists). For some reason I had gotten the impression it doesn't while I was debugging earlier.

>>> import pyarrow as pa
>>> arr = pa.array([[1, 2], [3, 4, 5], [6, None], [7, 8]])
>>> arr.flatten()
<pyarrow.lib.Int64Array object at 0x129b2c820>
[
  1,
  2,
  3,
  4,
  5,
  6,
  null,
  7,
  8
]

Member:

Yes, but your example uses an actual null inside a list, not a "null list". For example, here the first null is not in the flattened output:

In [3]: arr = pa.array([[1, 2], None, [3, None]])

In [4]: arr.flatten()
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7f8b1fbcaa40>
[
  1,
  2,
  3,
  null
]

And I suppose there can be arbitrary data behind that null (when constructed as above, the default is that there is no data behind it, so the offset doesn't increment for that list element):

In [5]: arr.offsets
Out[5]: 
<pyarrow.lib.Int32Array object at 0x7f8b4f0f5a80>
[
  0,
  2,
  2,
  4
]

It's a bit tricky to construct manually, but something like:

In [10]: arr = pa.ListArray.from_arrays(pa.array([0, 2, 4, 6]), pa.array([1, 2, 99, 99, 3, None]), mask=pa.array([False, True, False]))

In [11]: arr
Out[11]: 
<pyarrow.lib.ListArray object at 0x7f8b1fbcba60>
[
  [
    1,
    2
  ],
  null,
  [
    3,
    null
  ]
]

In [12]: arr.flatten()
Out[12]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065960>
[
  1,
  2,
  3,
  null
]

In [13]: arr.values
Out[13]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065780>
[
  1,
  2,
  99,
  99,
  3,
  null
]

In [14]: arr.offsets
Out[14]: 
<pyarrow.lib.Int32Array object at 0x7f8b1fbc9300>
[
  0,
  2,
  4,
  6
]

But I am not sure whether, for this case of converting to numpy, you actually need the flattened values (with nulls removed) or not.

The offsets still assume those unused values are present, so it was probably a good call to conclude that the "Flattened" values (with values behind nulls removed) were not the correct thing to use here.

wjones127 (Member Author):

Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

Oh now I remember, and I think I understand what you are saying here better now. For the fixed size lists, the values behind a null entry in the list are removed when we call Flatten(). When we go back to try to reconstruct the lists based on offsets, the offsets produced by value_offset are all invalid since they don't account for the values we dropped in Flatten().

wjones127 (Member Author):

I re-ran the original reproduction and it seems memory usage is no longer quadratic:

| Num rows | Memory usage (10.0.1) | Memory usage (after) |
|----------|-----------------------|----------------------|
| 256k     | 2,153,767,662         | 1,102,736,461        |
| 512k     | 8,496,047,798         | 2,185,596,364        |
Code for test

Write test file:

import random
import string

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 256_000
filename = 'nested_pandas.parquet'

arr_len = 10
nested_col = []
for _ in range(nrows):
    # inner loop variable renamed to j so it no longer shadows the row index
    nested_col.append(np.array(
            [{
                'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                'b': None if j % 100 == 0 else random.choice(range(100)),
                'c': None if j % 10 == 0 else make_random_string(5)
            } for j in range(arr_len)]
        ))

table = pa.table({'c1': nested_col})

# table = pa.table({
#     'c1': pa.array([list(range(random.randint(1, 20))) for _ in range(nrows)])
# })

# Writing to .parquet and loading it into arrow again
pq.write_table(table, filename)

Then measure:

import tracemalloc
import pyarrow.parquet as pq

filename = '/Users/willjones/Documents/arrows/arrow/python/nested_pandas.parquet'
tracemalloc.start()
table_from_parquet = pq.read_table(filename)

out = table_from_parquet.to_pandas()

print(tracemalloc.get_traced_memory())

@wjones127 wjones127 changed the title ARROW-18400: [C++] ListArray values() doesn't take into account offset ARROW-18400: [Python] Numpy conversion doesn't account for ListArray offset Jan 5, 2023
@wjones127 wjones127 marked this pull request as ready for review January 6, 2023 00:19
@@ -4513,3 +4513,27 @@ def test_does_not_mutate_timedelta_nested():
df = table.to_pandas()

assert df["timedelta_2"][0].to_pytimedelta() == timedelta_2[0]


def test_list_no_duplicate_base():
Member:

There is a TestConvertListTypes class that groups some list-type-related tests; maybe this can be moved there.

wjones127 (Member Author):

I can move it there.


raulcd commented Jan 16, 2023

@jorisvandenbossche @wjones127 I think this might be a release blocker. I am happy to mark the issue as a blocker and add it to the release if it gets reviewed / merged

@assignUser assignUser changed the title ARROW-18400: [Python] Numpy conversion doesn't account for ListArray offset GH-20512: [Python] Numpy conversion doesn't account for ListArray offset Jan 17, 2023

jorisvandenbossche commented Jan 17, 2023

Yes, agreed it would be nice to include this one in the release, given the quadratic memory issue. I did a review, and all looks good to me. @wjones127 I am just going to add one more test with the "hidden" null values (the reason we can't use Flatten), which I don't think is currently covered by the existing tests.

@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Jan 17, 2023
AlenkaF (Member) left a comment:

+1, thank you for working on this @wjones127!

@assignUser assignUser merged commit 2b50694 into apache:master Jan 17, 2023
raulcd pushed a commit that referenced this pull request Jan 18, 2023
…set (#15210)

* Closes: #20512

Lead-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
ursabot commented Jan 19, 2023

Benchmark runs are scheduled for baseline = 705e04b and contender = 2b50694. 2b50694 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.81% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.53% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2b50694c ec2-t3-xlarge-us-east-2
[Failed] 2b50694c test-mac-arm
[Finished] 2b50694c ursa-i9-9960x
[Finished] 2b50694c ursa-thinkcentre-m75q
[Finished] 705e04bb ec2-t3-xlarge-us-east-2
[Finished] 705e04bb test-mac-arm
[Finished] 705e04bb ursa-i9-9960x
[Finished] 705e04bb ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


Successfully merging this pull request may close these issues.

[Python] Quadratic memory usage of Table.to_pandas with nested data
6 participants