Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iloc indexing ~10 times slower than direct column indexing +no documention of it #29316

Closed
GregoryMorse opened this issue Nov 1, 2019 · 5 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Comments

@GregoryMorse
Copy link

GregoryMorse commented Nov 1, 2019

Code Sample, a copy-pastable example if possible

Updated without yfinance as it was a lousy choice of example - requires only standard numpy and datetime. A DateTimeIndex exacerbates the time slowness.

import timeit
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#3.553613900000073
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col})', number=3)
#0.5215671999999358
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Col1', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#4.485623600000508
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1[x]', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
#0.7185356999998476

Old code example:
#requires pip install yfinance for this example - could substitute with any data set
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
#5.178573199999846
timeit.timeit(stmt='for x in range(len(hist)): _ = hist.iloc[x].Close', setup='import yfinance as yf; hist = yf.Ticker("MSFT").history(period="max")', number=3)
#0.5886300999998184

Problem description

One would naturally expect iloc to have close to identical efficiency as column indexing. This is a dramatic blunder in a lot of code more than likely if it is not well documented and known. I would expect some strange inefficiency is present as logically it could be slightly slower to return a row rather than a data point. But 10 times slower is so dramatic that this function should be avoided.

Expected Output

Less than double the speed of single column value access. Or at least thorough documentation of this limitation.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@jreback
Copy link
Contributor

jreback commented Nov 1, 2019

what does this have to do with pandas? what is yfinance? show a self contained example

@GregoryMorse
Copy link
Author

@jreback this has been done. Thanks for the advice.

@GregoryMorse
Copy link
Author

GregoryMorse commented Nov 1, 2019

It is even stranger than I first noticed. Apparently using iloc on the column is almost twice as fast than just using standard Python list indexing with brackets:

timeit.timeit(stmt='for x in range(len(hist)): _ = hist.Col1.iloc[x]', setup='import datetime; import numpy as np; import pandas as pd; col = np.random.uniform(low=-1, high=1, size=(10000)); hist = pd.DataFrame({"Col1":col,"Col2":col,"Col3":col,"Col4":col,"Col5":col}, [datetime.datetime.now() + datetime.timedelta(seconds=x) for x in range(10000)])', number=3)
0.424690400000145

This is too counter-intuitive to not be thoroughly documented at minimum.

As far as I can tell - iloc is optimized for integers so it makes sense that no data inspection is needed (since standard indexing checks if the type is same as the index data type). Still the time increase is too dramatic. Since a major goal of this library is for fast processing of data sets.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2019

In [20]: %timeit hist.Col1[1]                                                                                                                            
21.9 µs ± 367 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit hist.iloc[1].Col1                                                                                                                       
164 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit hist.iat[1, 0]                                                                                                                          
7.97 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

individual scalar selections are rarely important when dealing with vectorized data.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2019

closing as a user question.

@jreback jreback closed this as completed Nov 1, 2019
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Nov 1, 2019
@jreback jreback added this to the No action milestone Nov 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

2 participants