
GH-20512: [Python] Numpy conversion doesn't account for ListArray offset #15210

Merged
merged 7 commits into from Jan 17, 2023

Conversation


@wjones127 wjones127 commented Jan 5, 2023

github-actions bot commented Jan 5, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 519 to 521
auto values = checked_cast<ListArray*>(arr_sliced.get())->values();
auto expected_values = ArrayFromJSON(int16(), "[1, 2, 3, 4, 5]");
AssertArraysEqual(*expected_values, *values);
wjones127 (Member Author):

Should we make this expected behavior or not?

Member:

I just commented on the JIRA as well. In the current design, we shouldn't change that: it is Flatten() that already implements this, and keeping the values unsliced ensures that they still match the offsets of a sliced ListArray.

It is certainly confusing, though. I wonder if we should give it a scarier name like "raw_values".

wjones127 (Member Author):

+1 on renaming to raw_values. Or at the very least, we should update the doc comment; right now it's not clear that it doesn't account for offset and length.

The problem with Flatten though is that it removes the null values from the values array, but in this case we want them.

Member:

The problem with Flatten though is that it removes the null values from the values array, but in this case we want them.

Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

This corner case is covered by existing tests?

wjones127 (Member Author):

I should add a test.

And you might be right about Flatten: it does include nulls that are in the values (as opposed to null lists). For some reason I had gotten the impression it doesn't while I was debugging earlier.

>>> import pyarrow as pa
>>> arr = pa.array([[1, 2], [3, 4, 5], [6, None], [7, 8]])
>>> arr.flatten()
<pyarrow.lib.Int64Array object at 0x129b2c820>
[
  1,
  2,
  3,
  4,
  5,
  6,
  null,
  7,
  8
]

Member:

Yes, but your example uses an actual null inside a list, not a "null list". For example, here the first null is not in the flattened output:

In [3]: arr = pa.array([[1, 2], None, [3, None]])

In [4]: arr.flatten()
Out[4]: 
<pyarrow.lib.Int64Array object at 0x7f8b1fbcaa40>
[
  1,
  2,
  3,
  null
]

And I suppose there can be arbitrary data behind that null (when constructed as above, the default is that there is no data behind it, so the offset doesn't increment for that list element):

In [5]: arr.offsets
Out[5]: 
<pyarrow.lib.Int32Array object at 0x7f8b4f0f5a80>
[
  0,
  2,
  2,
  4
]

It's a bit tricky to construct manually, but something like:

In [10]: arr = pa.ListArray.from_arrays(pa.array([0, 2, 4, 6]), pa.array([1, 2, 99, 99, 3, None]), mask=pa.array([False, True, False]))

In [11]: arr
Out[11]: 
<pyarrow.lib.ListArray object at 0x7f8b1fbcba60>
[
  [
    1,
    2
  ],
  null,
  [
    3,
    null
  ]
]

In [12]: arr.flatten()
Out[12]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065960>
[
  1,
  2,
  3,
  null
]

In [13]: arr.values
Out[13]: 
<pyarrow.lib.Int64Array object at 0x7f8b4f065780>
[
  1,
  2,
  99,
  99,
  3,
  null
]

In [14]: arr.offsets
Out[14]: 
<pyarrow.lib.Int32Array object at 0x7f8b1fbc9300>
[
  0,
  2,
  4,
  6
]

But I am not sure whether, for this case of converting to numpy, you actually need the flattened values (with nulls removed) or not.

The offsets still assume those unused values are present, so it was probably a good call to conclude that the "Flattened" values (with values behind nulls removed) were not the correct thing to use here.

wjones127 (Member Author):

Because we use the offsets to slice into the flat values, and those offsets take into account potential values behind a null?

Oh now I remember, and I think I understand what you are saying here better now. For the fixed size lists, the values behind a null entry in the list are removed when we call Flatten(). When we go back to try to reconstruct the lists based on offsets, the offsets produced by value_offset are all invalid since they don't account for the values we dropped in Flatten().

wjones127 (Member Author):

I re-ran the original reproduction and it seems memory usage is no longer quadratic:

| Num rows | Memory usage (10.0.1) | Memory usage (after) |
|----------|-----------------------|----------------------|
| 256k     | 2,153,767,662         | 1,102,736,461        |
| 512k     | 8,496,047,798         | 2,185,596,364        |
Code for test

Write test file:

import random
import string

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 256_000
filename = 'nested_pandas.parquet'

arr_len = 10
nested_col = []
for _ in range(nrows):
    # inner loop variable renamed to j so it no longer shadows the row index
    nested_col.append(np.array(
            [{
                'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                'b': None if j % 100 == 0 else random.choice(range(100)),
                'c': None if j % 10 == 0 else make_random_string(5)
            } for j in range(arr_len)]
        ))

table = pa.table({'c1': nested_col})

# table = pa.table({
#     'c1': pa.array([list(range(random.randint(1, 20))) for _ in range(nrows)])
# })

# Writing to .parquet and loading it into arrow again
pq.write_table(table, filename)

Then measure:

import tracemalloc
import pyarrow.parquet as pq

filename = '/Users/willjones/Documents/arrows/arrow/python/nested_pandas.parquet'
tracemalloc.start()
table_from_parquet = pq.read_table(filename)

out = table_from_parquet.to_pandas()

print(tracemalloc.get_traced_memory())

@wjones127 wjones127 changed the title ARROW-18400: [C++] ListArray values() doesn't take into account offset ARROW-18400: [Python] Numpy conversion doesn't account for ListArray offset Jan 5, 2023
@wjones127 wjones127 marked this pull request as ready for review January 6, 2023 00:19
@@ -4513,3 +4513,27 @@ def test_does_not_mutate_timedelta_nested():
df = table.to_pandas()

assert df["timedelta_2"][0].to_pytimedelta() == timedelta_2[0]


def test_list_no_duplicate_base():
Member:

There is a TestConvertListTypes class that groups some list-type-related tests; maybe this can be moved there.

wjones127 (Member Author):

I can move it there.


raulcd commented Jan 16, 2023

@jorisvandenbossche @wjones127 I think this might be a release blocker. I am happy to mark the issue as a blocker and add it to the release if it gets reviewed / merged

@assignUser assignUser changed the title ARROW-18400: [Python] Numpy conversion doesn't account for ListArray offset GH-20512: [Python] Numpy conversion doesn't account for ListArray offset Jan 17, 2023

jorisvandenbossche commented Jan 17, 2023

Yes, agreed it would be nice to include this one in the release, given the quadratic memory issue. I did a review, and all looks good to me. @wjones127 I am just going to add one more test with the "hidden" null values (the reason we can't use Flatten), which I don't think is currently covered by the existing tests.

@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Jan 17, 2023
AlenkaF (Member) left a comment:

+1, thank you for working on this @wjones127!

@assignUser assignUser merged commit 2b50694 into apache:master Jan 17, 2023
raulcd pushed a commit that referenced this pull request Jan 18, 2023
…set (#15210)

* Closes: #20512

Lead-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
ursabot commented Jan 19, 2023

Benchmark runs are scheduled for baseline = 705e04b and contender = 2b50694. 2b50694 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.81% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.53% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2b50694c ec2-t3-xlarge-us-east-2
[Failed] 2b50694c test-mac-arm
[Finished] 2b50694c ursa-i9-9960x
[Finished] 2b50694c ursa-thinkcentre-m75q
[Finished] 705e04bb ec2-t3-xlarge-us-east-2
[Finished] 705e04bb test-mac-arm
[Finished] 705e04bb ursa-i9-9960x
[Finished] 705e04bb ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


Successfully merging this pull request may close these issues.

[Python] Quadratic memory usage of Table.to_pandas with nested data
6 participants