REGR: fix read_parquet with column of large strings (avoid overflow from concat) #55691

Merged

Conversation

jorisvandenbossche (Member)

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version IO Parquet parquet, feather labels Oct 25, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.1.2 milestone Oct 25, 2023
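
For context on the "avoid overflow from concat" part of the title: a plain pyarrow string array stores its offsets as 32-bit integers, so building a single array from more than roughly 2 GiB of string data overflows. Below is a minimal sketch of the chunk-wise conversion idea, not the actual code changed in pandas/core/arrays/string_.py (which handles missing values and dtypes more carefully): convert each Arrow chunk to a NumPy object array on its own and concatenate with NumPy, instead of combining the chunks into one oversized Arrow array.

import numpy as np
import pyarrow as pa

chunked = pa.chunked_array([["a", "b"], ["c", None]], type=pa.string())

# chunked.combine_chunks() would build one contiguous pa.string() array and is
# where the 32-bit offset overflow can occur once the chunks exceed ~2 GiB.
parts = [chunk.to_numpy(zero_copy_only=False) for chunk in chunked.chunks]
result = np.concatenate(parts) if parts else np.array([], dtype=object)
print(result)  # ['a' 'b' 'c' None]
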
Comment on lines 1145 to 1153
def test_string_column_above_2GB(self, tmp_path, pa):
    # https://github.com/pandas-dev/pandas/issues/55606
    # above 2GB of string data
    v1 = b"x" * 100000000
    v2 = b"x" * 147483646
    df = pd.DataFrame({"strings": [v1] * 20 + [v2] + ["x"] * 20}, dtype="string")
    df.to_parquet(tmp_path / "test.parquet")
    result = read_parquet(tmp_path / "test.parquet")
    assert result["strings"].dtype == "string"
jorisvandenbossche (Member, Author)

This test is quite slow (around 20s for me) and uses a lot of memory (> 5 GB), so I am not sure we should add it (our "slow" tests are still run by default, so this would be annoying when running the tests locally).

Member

I'm in favor of not adding this test given the potential CI load. Maybe add an ASV benchmark instead, if you think that makes sense, since this is "performance" related too given the memory it uses.

At a minimum, it would be good to add a comment in pandas/core/arrays/string_.py explaining why the modification was made.
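
For illustration only (nothing like this was added in the PR), such an ASV benchmark could look roughly like the sketch below; the class name and payload size are hypothetical, and a realistically sized payload cannot reproduce the >2 GiB overflow itself:

import pandas as pd


class ReadParquetStringColumn:
    # Hypothetical ASV-style benchmark sketch; not part of this PR.
    def setup(self):
        self.path = "string_column.parquet"
        data = ["x" * 10_000] * 1_000
        df = pd.DataFrame({"strings": pd.array(data, dtype="string")})
        df.to_parquet(self.path)

    def time_read_parquet(self):
        pd.read_parquet(self.path)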

jorisvandenbossche (Member, Author)

Added a comment about it, and "removed" the test: I left the code here to make it easier to run the test in the future by just uncommenting it (or if we enable some high_memory mark that would be disabled by default).

Adding an ASV sounds useful, but it wouldn't help catch a regression like this, as we would also use a smaller dataset for the ASV. So I'm leaving that out of this PR.
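
The high_memory mark mentioned above could be wired up roughly like the hypothetical conftest.py sketch below; it uses standard pytest hooks, and the option and marker names are illustrative, not something this PR adds:

import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-high-memory",
        action="store_true",
        default=False,
        help="run tests marked as high_memory",
    )


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "high_memory: test allocates several GB of memory"
    )


def pytest_collection_modifyitems(config, items):
    # Skip high_memory tests unless explicitly requested on the command line.
    if config.getoption("--run-high-memory"):
        return
    skip_marker = pytest.mark.skip(reason="need --run-high-memory option to run")
    for item in items:
        if "high_memory" in item.keywords:
            item.add_marker(skip_marker)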

@lithomas1 (Member)

thanks @jorisvandenbossche.

@lithomas1 lithomas1 merged commit 05f2f71 into pandas-dev:main Oct 26, 2023
36 of 39 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 26, 2023
@jorisvandenbossche jorisvandenbossche deleted the read-parquet-large-strings branch October 26, 2023 14:08
lithomas1 pushed a commit that referenced this pull request Oct 26, 2023
Backport PR #55691: REGR: fix read_parquet with column of large strings (avoid overflow from concat) (#55706)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>