[Python] Quadratic memory usage of Table.to_pandas with nested data #20512
Comments
Alenka Frim / @AlenkaF: Code to generate the nested data and reproduce the crash:

```python
import random
import string
import tracemalloc

import numpy as np

_characters = string.ascii_uppercase + string.digits + string.punctuation

def make_random_string(N=10):
    return ''.join(random.choice(_characters) for _ in range(N))

nrows = 1_024_000
filename = 'nested_pandas.parquet'
arr_len = 10
nested_col = []
for i in range(nrows):
    nested_col.append(np.array(
        [{
            'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
            'b': None if j % 100 == 0 else random.choice(range(100)),
            'c': None if j % 10 == 0 else make_random_string(5)
        } for j in range(arr_len)]
    ))

import pyarrow as pa
import pyarrow.parquet as pq

tracemalloc.start()
table = pa.table({'c1': nested_col})

# Works correctly
table.to_pandas()
#                                                         c1
# 0        [{'a': None, 'b': None, 'c': None}, {'a': [399...
# 1        [{'a': None, 'b': None, 'c': None}, {'a': [832...
# 2        [{'a': None, 'b': None, 'c': None}, {'a': [731...
# 3        [{'a': None, 'b': None, 'c': None}, {'a': [589...
# 4        [{'a': None, 'b': None, 'c': None}, {'a': [159...
# ...                                                    ...
# 1023995  [{'a': None, 'b': None, 'c': None}, {'a': [922...
# 1023996  [{'a': None, 'b': None, 'c': None}, {'a': [865...
# 1023997  [{'a': None, 'b': None, 'c': None}, {'a': [222...
# 1023998  [{'a': None, 'b': None, 'c': None}, {'a': [143...
# 1023999  [{'a': None, 'b': None, 'c': None}, {'a': [287...
# [1024000 rows x 1 columns]

# Writing to .parquet and loading it into Arrow again
pq.write_table(table, filename)
table_from_parquet = pq.read_table(filename)

# Killed: converting to pandas
table_from_parquet.to_pandas()
print(tracemalloc.get_traced_memory())
# zsh: killed     python memory_usage.py
```

I still have to look into what is causing it, but there has to be some extra information being passed from Parquet to Arrow and then to pandas that is triggering this. Will research further next week.
Alenka Frim / @AlenkaF: With the legacy Parquet reader the conversion succeeds:

```python
...
import tracemalloc
tracemalloc.start()

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'c1': nested_col})
pq.write_table(table, filename)

table_from_parquet = pq.read_table(filename, use_legacy_dataset=True)
table_from_parquet.to_pandas()
#                                                         c1
# 0        [{'a': None, 'b': None, 'c': None}, {'a': [248...
# 1        [{'a': None, 'b': None, 'c': None}, {'a': [626...
# 2        [{'a': None, 'b': None, 'c': None}, {'a': [148...
# 3        [{'a': None, 'b': None, 'c': None}, {'a': [399...
# 4        [{'a': None, 'b': None, 'c': None}, {'a': [253...
# ...                                                    ...
# 1023995  [{'a': None, 'b': None, 'c': None}, {'a': [779...
# 1023996  [{'a': None, 'b': None, 'c': None}, {'a': [309...
# 1023997  [{'a': None, 'b': None, 'c': None}, {'a': [376...
# 1023998  [{'a': None, 'b': None, 'c': None}, {'a': [391...
# 1023999  [{'a': None, 'b': None, 'c': None}, {'a': [815...
# [1024000 rows x 1 columns]

print(tracemalloc.get_traced_memory())  # in bytes
# (23087183, 4328216706)
tracemalloc.stop()
```

Weston, do you think this is something on the C++ side of the Dataset API, or should I look at the Python/Cython implementation? @adamreeve could you also try the same memory profiling as you did, with ... added?
Anja Boskovic / @anjakefala: I found that the quadratic memory jump kicks in at larger file sizes for Arrow IPC than it does for Parquet, but it still happens.

|Num rows|Peak Memory Usage Parquet (MB)|Peak Memory Usage Arrow IPC (MB)|
Alenka Frim / @AlenkaF: The old legacy implementation reads the data into a table with one chunk, whereas the new dataset implementation reads it into a table with multiple chunks. When converting to pandas the chunks have to be concatenated, and that is what causes the high memory usage. A workaround is to combine the chunks of the table before converting it to a pandas DataFrame:

```python
table_legacy = pq.read_table(filename, use_legacy_dataset=True)
table_dataset = pq.read_table(filename)

col_legacy = table_legacy[0]
col_legacy.num_chunks
# 1

col_dataset = table_dataset[0]
col_dataset.num_chunks
# 8

table_dataset.combine_chunks().to_pandas()
#                                                         c1
# 0        [{'a': None, 'b': None, 'c': None}, {'a': [248...
# 1        [{'a': None, 'b': None, 'c': None}, {'a': [626...
# 2        [{'a': None, 'b': None, 'c': None}, {'a': [148...
# 3        [{'a': None, 'b': None, 'c': None}, {'a': [399...
# 4        [{'a': None, 'b': None, 'c': None}, {'a': [253...
# ...                                                    ...
# 1023995  [{'a': None, 'b': None, 'c': None}, {'a': [779...
# 1023996  [{'a': None, 'b': None, 'c': None}, {'a': [309...
# 1023997  [{'a': None, 'b': None, 'c': None}, {'a': [376...
# 1023998  [{'a': None, 'b': None, 'c': None}, {'a': [391...
# 1023999  [{'a': None, 'b': None, 'c': None}, {'a': [815...
# [1024000 rows x 1 columns]

table_dataset.to_pandas()
# zsh: killed     python
```

I plan to make a PR in the beginning of next week.
Joris Van den Bossche / @jorisvandenbossche: The main conversion logic for this lives in ...
Joris Van den Bossche / @jorisvandenbossche: Illustrating this, reading the parquet file in two ways (I have been using ...):

```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table1 = pq.read_table("memory_testing.parquet", use_legacy_dataset=True)

dataset = ds.dataset("memory_testing.parquet", format="parquet")
table2 = dataset.to_table()
```

Table 1 has a single chunk, while table 2 (from reading with the dataset API) has two chunks:

```python
>>> table1["c1"].num_chunks
1
>>> table2["c1"].num_chunks
2
```

Taking the first chunk of each of those, and then looking at those arrays:

```python
arr1 = table1["c1"].chunk(0)
arr2 = table2["c1"].chunk(0)

>>> len(arr1)
256000
>>> len(arr2)  # around half the number of rows (since there are two chunks in this table)
131072

>>> arr1.get_total_buffer_size()
110624012
>>> arr2.get_total_buffer_size()  # but still using the same total memory!
110624012
```

So the smaller chunk of table2 is not using less memory. That is because the two chunks of table2 are actually each a slice into the same underlying buffers:

```python
>>> table2["c1"].chunk(0).buffers()[1]
<pyarrow.Buffer address=0x7fc5cc907340 size=1024004 is_cpu=True is_mutable=True>
>>> table2["c1"].chunk(1).buffers()[1]  # second chunk points to the same memory address and has the same size as the first chunk
<pyarrow.Buffer address=0x7fc5cc907340 size=1024004 is_cpu=True is_mutable=True>
>>> table2["c1"].chunk(1).offset  # and the second chunk has an offset to account for that
131072
```

And somehow the conversion code for ListArray to numpy (which creates a numpy array of numpy arrays, by first creating one numpy array of the flat values, and then creating slices into that flat array) doesn't seem to take this offset into account, and ends up converting the full parent buffer once per chunk (twice in my case, because of having 2 chunks, but this can grow quadratically).

The reason this happens for Parquet and not for Feather in this case is that the Parquet file actually consists of a single row group (and I assume the dataset API will therefore still read that in one go, and then slice output batches from it to return the expected batch size), while the Feather file already consists of multiple batches on disk (and thus doesn't result in sliced batches in memory).
Joris Van den Bossche / @jorisvandenbossche: Reproducing this without Parquet:

```python
# creating a chunked list array that consists of two chunks that are both
# slices into the same parent array
arr = pa.array([[1, 2], [3, 4, 5], [6], [7, 8]])
chunked_arr = pa.chunked_array([arr.slice(0, 2), arr.slice(2, 2)])

# converting this chunked array to numpy
np_arr = chunked_arr.to_numpy()

# the list array gets converted to a numpy array of numpy arrays. Each element
# (the nested numpy array) is a slice of a numpy array of the flat values. We can
# get this parent flat numpy array through the .base property
>>> np_arr[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8]])
# the flat values are included twice. Comparing to the correct behaviour with
# the original non-chunked array:
>>> arr.to_numpy(zero_copy_only=False)[0].base
array([[1, 2, 3, 4, 5, 6, 7, 8]])
```
Yes. The dataset API processes input in fairly small batches (32Ki rows), partly because this is cache friendly, but also because some of the hash-join code uses 16-bit signed integers for row indices. The scanner does not support partial reading of row groups from Parquet files (I would very much like to support this someday), so it reads the entire row group in one chunk and then slices that chunk. It sounds like this numpy conversion bug should be fixed regardless. I wonder if we also want to someday support better output batch size controls as well. I'll create an issue for it.
Will Jones / @wjones127: Do you agree with that assessment?
Joris Van den Bossche / @jorisvandenbossche: The ...
@wjones127 should this be a blocker for the release?
I added the blocker label to keep track of this issue, as most other blockers are now closed and including this seemed potentially important, imo.
As I understand it, this issue was completely finished, but I couldn't find it in the release notes for the past few versions. Am I missing something? Does anybody have an estimate of when it's going to be released? Thanks a lot!
This was included in the pyarrow 11.0.0 release, which came out in January of this year. It was actually mentioned in the release notes, but you might need to know the details of the fix to find it. Quoting from https://arrow.apache.org/blog/2023/01/25/11.0.0-release/ in the Python bug fixes section:
Reading nested Parquet data and then converting it to a pandas DataFrame shows quadratic memory usage, and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts.
Example code to generate nested Parquet data:
And then read into a DataFrame with:
Only reading to an Arrow table isn't a problem; it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from pandas, but I assume the problem probably isn't Parquet specific.
Memory usage I see when reading different sized files on a machine with 64 GB RAM:
|Num rows|Memory used with 10.0.1 (MB)|Memory used with 7.0.0 (MB)|
|-|-|-|
|32,000|362|361|
|64,000|531|531|
|128,000|1,152|1,101|
|256,000|2,888|1,402|
|512,000|10,301|3,508|
|1,024,000|38,697|5,313|
|2,048,000|OOM|20,061|
|4,096,000| |OOM|

With Arrow 10.0.1, memory usage approximately quadruples when the row count doubles above 256k rows. With Arrow 7.0.0 memory usage is more linear, but then quadruples from 1,024k to 2,048k rows.
PyArrow 8.0.0 shows similar memory usage to 10.0.1, so it looks like something changed between 7.0.0 and 8.0.0.
Environment: Python 3.10.8 on Fedora Linux 36. AMD Ryzen 9 5900 X with 64 GB RAM
Reporter: Adam Reeve / @adamreeve
Assignee: Will Jones / @wjones127
Note: This issue was originally created as ARROW-18400. Please see the migration documentation for further details.