read_csv dumps core with python 2.7.10 and pandas 0.17.1 #11716

jdfekete · 2015-11-28T13:08:39Z

I am reading a very large CSV file (the NYC taxi dataset at https://storage.googleapis.com/tlc-trip-data/2015/), keeping only two columns, with these read_csv options:
index_col=False, skipinitialspace=True, usecols=['pickup_longitude', 'pickup_latitude'], chunksize=...
I load it progressively in chunks of varying size, and use 2 threads to do the progressive loading.
After reading about 10M lines (the number varies from one run to the next), it dumps core.
Here is what GDB reports:

Fatal Python error: GC object already tracked
Fatal Python error: GC object already tracked

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffddd98700 (LWP 10284)]
0x00007ffff782dcc9 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0 0x00007ffff782dcc9 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff78310d8 in __GI_abort () at abort.c:89
#2 0x000000000045a4f2 in Py_FatalError ()
#3 0x000000000052b5ec in PyTuple_New ()
#4 0x000000000050c73d in ?? ()
#5 0x000000000050d3f6 in Py_BuildValue ()
#6 0x00007fffec3d01d8 in buffer_rd_bytes (source=0x7fffd8006650,
    nbytes=<optimized out>, bytes_read=0x7fffddd96d08, status=0x7fffddd96d04)
    at pandas/src/parser/io.c:123
#7 0x00007fffec3cf065 in parser_buffer_bytes (nbytes=,
    self=0x7fffd8003480) at pandas/src/parser/tokenizer.c:610
#8 _tokenize_helper (self=0x7fffd8003480, nrows=nrows@entry=3186,
    all=all@entry=0) at pandas/src/parser/tokenizer.c:1872
#9 0x00007fffec3cf3e7 in tokenize_nrows (self=,
    nrows=nrows@entry=3186) at pandas/src/parser/tokenizer.c:1905
#10 0x00007fffec39a3c4 in __pyx_f_6pandas_6parser_10TextReader__tokenize_rows (
    __pyx_v_self=0x7fffdddd5050, __pyx_v_nrows=3186) at pandas/parser.c:8745
#11 0x00007fffec3a21a2 in __pyx_f_6pandas_6parser_10TextReader__read_rows (
    __pyx_v_self=0x7fffdddd5050, __pyx_v_rows=0x7fffd8249a88, __pyx_v_trim=0)
    at pandas/parser.c:8970
#12 0x00007fffec393f0c in __pyx_f_6pandas_6parser_10TextReader__read_low_memory
    (__pyx_v_self=0x7fffdddd5050, __pyx_v_rows=0x7fffcb815948)
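
The read pattern described at the top of this report can be sketched roughly as follows. This is only an illustration, not the reporter's code: the in-memory CSV, the `fare` column, the fixed `chunksize=2`, and the helper name `read_in_varying_chunks` are assumptions made up for the example; only the two column names and the `read_csv` options come from the report. It is single-threaded and is not expected to reproduce the crash.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the taxi CSV (hypothetical data).
CSV_TEXT = (
    "pickup_longitude,pickup_latitude,fare\n"
    "-73.99, 40.75, 7.5\n"
    "-73.98, 40.76, 12.0\n"
    "-73.97, 40.77, 5.0\n"
    "-73.96, 40.78, 9.5\n"
    "-73.95, 40.79, 3.0\n"
)


def read_in_varying_chunks(source, sizes):
    """Read `source` progressively, asking for a different row count each step."""
    # Passing chunksize makes read_csv return a reader object instead of a DataFrame.
    reader = pd.read_csv(
        source,
        index_col=False,
        skipinitialspace=True,
        usecols=["pickup_longitude", "pickup_latitude"],
        chunksize=2,
    )
    frames = []
    for size in sizes:
        try:
            # get_chunk(size) overrides the default chunksize for this step.
            frames.append(reader.get_chunk(size))
        except StopIteration:
            # The reader signals end of file with StopIteration.
            break
    return frames


frames = read_in_varying_chunks(io.StringIO(CSV_TEXT), [1, 3, 10])
print([len(frame) for frame in frames])
```

The last request asks for more rows than remain, so the final chunk simply contains whatever is left.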


jreback · 2015-11-28T14:04:17Z

pls show the exact code you are using

jdfekete · 2015-11-28T14:23:57Z

My code is in https://github.com/jdfekete/progressivis, in the file:
https://github.com/jdfekete/progressivis/blob/master/progressivis/io/csv_loader.py

The method is below; see the self.parser.read(step_size) call near the end, and all the checks before it. Running it with pandas 0.16.2 works without dumping core. The crash might be related to the GIL, or the lack of it, since this code runs in a second thread.

def run_step(self, run_number, step_size, howlong):
    if step_size == 0:  # bug
        logger.error('Received a step_size of 0')
        return self._return_run_step(self.state_ready, steps_run=0)
    status = self.validate_parser(run_number)
    if status == self.state_terminated:
        raise StopIteration('no more filenames')
    elif status == self.state_blocked:
        return self._return_run_step(status, steps_run=0, creates=0)
    elif status != self.state_ready:
        logger.error('Invalid state returned by validate_parser: %d', status)
        raise StopIteration('Unexpected situation')
    logger.info('loading %d lines', step_size)
    try:
        df = self.parser.read(step_size)  # raises StopIteration at EOF
    except StopIteration:

jreback · 2015-11-28T14:33:07Z

pls just show a short reproducible example

jreback · 2015-11-29T17:28:21Z

This is almost certainly a thread-safety problem in how you are calling it. A reproducible example would help. Please reopen when you post one.
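
One common way to rule out concurrent calls into a shared reader is to put a lock around every `read`. The sketch below is a hedged illustration of that idea only: `LockedReader`, `FakeReader`, and `worker` are hypothetical names, the stand-in reader hands out row numbers instead of parsing CSV, and nothing in this thread confirms that such a lock would have avoided the crash (no reproducer was ever posted).

```python
import threading


class LockedReader(object):
    """Serialize read(nrows) calls on a reader shared between threads."""

    def __init__(self, reader):
        self._reader = reader
        self._lock = threading.Lock()

    def read(self, nrows):
        # Only one thread at a time may enter the underlying reader.
        with self._lock:
            return self._reader.read(nrows)


class FakeReader(object):
    """Hypothetical stand-in for a chunked parser: returns consecutive row numbers."""

    def __init__(self, total):
        self._next = 0
        self._total = total

    def read(self, nrows):
        if self._next >= self._total:
            raise StopIteration  # mimic the end-of-file signal
        start = self._next
        stop = min(start + nrows, self._total)
        self._next = stop
        return list(range(start, stop))


def worker(reader, out):
    """Keep pulling 100-row steps until the reader is exhausted."""
    while True:
        try:
            out.extend(reader.read(100))
        except StopIteration:
            return


shared = LockedReader(FakeReader(10000))
results = [[], []]
threads = [threading.Thread(target=worker, args=(shared, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every row is delivered exactly once across the two threads.
all_rows = sorted(results[0] + results[1])
```

Any object exposing `read(nrows)` could be wrapped the same way; the lock trades parallelism inside the reader for a guarantee that its internal state is touched by one thread at a time.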

jreback · 2015-12-07T16:47:38Z

xref #11786

jstray · 2017-02-07T22:28:24Z

I am also seeing this error, intermittently, during read_csv. It's not even a particularly large file:

table = pd.read_csv(io.StringIO(csvres.text))
=>
Fatal Python error: GC object already tracked

where the text is the contents of the file http://jonathanstray.com/papers/titanic.csv

I'm not explicitly using threads in my app, though I am on Django channels.

jreback · 2017-02-07T22:40:38Z

You should try a more modern version of pandas; lots of things have been fixed since 0.17.1.

jstray · 2017-02-07T23:55:36Z

Indeed I am on 0.17.1. FWIW that's the version that shipped with Anaconda, though now I can't recall when I installed it.

jreback · 2017-02-07T23:58:04Z

conda update pandas

works wonders

jreback closed this as completed Nov 29, 2015

jreback added the Can't Repro and IO CSV (read_csv, to_csv) labels Nov 29, 2015
