BUG: Empty list passed to Series returns object dtype, but via DataFrame returns float64 #56679

Open
2 of 3 tasks
galipremsagar opened this issue Dec 29, 2023 · 9 comments · May be fixed by #58669
Labels: Bug; Constructors (Series/DataFrame/Index/pd.array Constructors); DataFrame (DataFrame data structure); Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@galipremsagar

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '2.1.4'

In [3]: pd.Series([])
Out[3]: Series([], dtype: object)

In [4]: pd.DataFrame({'a':[]})
Out[4]: 
Empty DataFrame
Columns: [a]
Index: []

In [5]: pd.DataFrame({'a':[]}).dtypes
Out[5]: 
a    float64
dtype: object

Issue Description

There seems to be an inconsistency when creating a Series from an empty list via the Series and DataFrame constructors. The former yields object dtype; the latter returns float64 dtype.
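
For anyone who wants to check this on their own install, here is a small sketch that gathers the dtype from a few of the empty-construction routes side by side. The helper name `empty_dtypes` is made up for illustration, and the output is deliberately not listed since it may depend on the pandas version.

```python
import pandas as pd


def empty_dtypes():
    """Hypothetical helper: collect the dtype each empty route produces."""
    return {
        "Series([])": pd.Series([]).dtype,
        "DataFrame({'a': []})": pd.DataFrame({"a": []}).dtypes["a"],
        "DataFrame(columns=['a'])": pd.DataFrame(columns=["a"]).dtypes["a"],
    }


if __name__ == "__main__":
    # Print each route next to the dtype it yields on this pandas version.
    for route, dtype in empty_dtypes().items():
        print(f"{route:28} -> {dtype}")
```

If the routes print different dtypes, the inconsistency described here is present in that version.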

Expected Behavior

The DataFrame constructor should return object dtype, matching the Series constructor.

Installed Versions

/nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : 3.0.6
pytest : 7.4.3
hypothesis : 6.91.0
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.12.1
gcsfs : None
matplotlib : None
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.12.1
scipy : 1.11.4
sqlalchemy : 2.0.23
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 29, 2023
@galipremsagar galipremsagar changed the title BUG: Empty list passed to Series returns object dtype, but the same done via DataFrame returns float64 BUG: Empty list passed to Series returns object dtype, but via DataFrame returns float64 Dec 29, 2023
@galipremsagar
Author

The setitem path behaves the same way:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df['a'] = []

In [4]: df.dtypes
Out[4]: 
a    float64
dtype: object
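
Until the default is settled, one way to sidestep the inference is to state the dtype explicitly. This is only a workaround sketch, not a fix for the issue; the variable names `df_ctor` and `df_set` are illustrative.

```python
import pandas as pd

# Constructor route: pass an explicitly typed empty Series so nothing is inferred.
df_ctor = pd.DataFrame({"a": pd.Series([], dtype=object)})

# setitem route: assign an explicitly typed empty Series instead of a bare [].
df_set = pd.DataFrame()
df_set["a"] = pd.Series([], dtype=object)

print(df_ctor.dtypes["a"], df_set.dtypes["a"])
```

Both columns keep the dtype that was requested, so code that depends on object dtype does not have to rely on what an empty list is inferred as.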

@mroeschke
Member

Thanks for the report. Given these similar constructions, I would also expect object dtype here:

In [4]: pd.DataFrame(columns=["a"]).dtypes
Out[4]: 
a    object
dtype: object

In [5]: pd.DataFrame([], columns=["a"]).dtypes
Out[5]: 
a    object
dtype: object

In [6]: pd.DataFrame({}, columns=["a"]).dtypes
Out[6]: 
a    object
dtype: object

In [7]: pd.DataFrame({"a": []}, columns=["a"]).dtypes
Out[7]: 
a    float64
dtype: object

In [8]: pd.DataFrame({"a": pd.Series()}).dtypes
Out[8]: 
a    object
dtype: object

It looks like this goes through sanitize_array, where there's this comment:

        if len(data) == 0 and dtype is None:
            # We default to float64, matching numpy
            subarr = np.array([], dtype=np.float64)

I'm not sure if there's a reason internally why we need to treat this as float64, but I would expect that at least via this constructor route object is still returned.
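
The "matching numpy" part of that comment can be seen with NumPy alone. The sketch below also includes a toy function, `empty_default`, that mimics the quoted branch with a configurable fallback. It is hypothetical and is not the pandas implementation; it only illustrates what flipping the default would mean.

```python
import numpy as np

# With no elements to infer from, NumPy falls back to float64.
numpy_default = np.array([]).dtype


def empty_default(data, dtype=None, fallback=object):
    """Hypothetical helper (not pandas code): pick a dtype for empty input.

    `fallback` stands in for the default under discussion: np.float64 is the
    behaviour quoted above, object is what this report expects.
    """
    if len(data) == 0 and dtype is None:
        return np.array([], dtype=fallback)
    # Non-empty input, or an explicit dtype, goes through normal conversion.
    return np.array(data, dtype=dtype)


print(numpy_default, empty_default([]).dtype)
```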

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions DataFrame DataFrame data structure Constructors Series/DataFrame/Index/pd.array Constructors and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 29, 2023
@srinivaspavan9

I would like to take a look at this issue

@srinivaspavan9

srinivaspavan9 commented Jan 2, 2024

I debugged this and observed that in every case except pd.DataFrame({'a': []}), the data argument passed to sanitize_array has length 1.
[screenshot: p1]

Does that mean the inconsistency occurs only when a dictionary with an empty column is passed?
Also, I think sanitize_array is called twice in the DataFrame case, and the second call is where we get float64, as the call stacks below show. The first time, during initialization, data is ['a']. The second time, when it is called from arrays_to_mgr(), data is empty, as the second call stack shows.

[screenshots: p2, call stacks]

@rhshadrach
Member

Related: on an empty DataFrame, .values and ._values are float64, whereas I'd expect object.

print(pd.DataFrame().values.dtype)
# float64

due to this line:

arr = np.empty(self.shape, dtype=float)

I ran into this because DataFrame.stack uses ._values to determine the result dtype, and on an empty frame we wind up with float whereas I'd expect object.
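
To make the quoted line concrete, here is a NumPy-only sketch of the two allocations for the (0, 0) shape of an empty frame. The `as_object` variant is an assumption about what a change might look like, not what pandas does.

```python
import numpy as np

shape = (0, 0)  # shape of an empty DataFrame

as_float = np.empty(shape, dtype=float)    # mirrors the line quoted above
as_object = np.empty(shape, dtype=object)  # assumed alternative, for comparison

print(as_float.dtype, as_object.dtype)
```

Since both arrays hold no elements, the dtype is the only thing that differs, and it is what downstream code such as a result-dtype lookup would pick up.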

@rhshadrach
Member

rhshadrach commented Mar 3, 2024

I'm not sure if there's a reason internally why we need to treat this as float64, but I would expect that at least via this constructor route object is still returned.

With both changes (handling the OP and the one I mentioned above), I'm seeing 35 tests fail in the expected way (i.e. there isn't some functionality we definitely don't want to change that breaks). It seems clear to me these changes would make dtypes on empty objects more consistent. The only question on my mind is if this is a bug fix or needs deprecation.

@mroeschke
Member

Especially with 3.0 as the next release I would be OK treating this as a "bug fix"

@rhshadrach rhshadrach self-assigned this May 4, 2024
@rhshadrach
Member

In starting to work on this, one of the things I noticed is that pd.array has the same "default to float" behavior, albeit slightly different.

print(pd.array([]).dtype)
# Float64

I'm thinking this should also be a NumPy object array?
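
For readers skimming: the "slightly different" part is that the capitalized Float64 is the pandas nullable extension dtype, not NumPy's float64. A minimal sketch of the distinction, using non-empty data so the difference is visible (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Nullable extension array: missing values are represented as pd.NA.
nullable = pd.array([1.5, None], dtype="Float64")

# Plain NumPy array: missing values are represented as NaN.
plain = np.array([1.5, np.nan], dtype="float64")

print(nullable.dtype, plain.dtype)
print(nullable[1] is pd.NA, np.isnan(plain[1]))
```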

@mroeschke
Member

I'm thinking this should also be a NumPy object array?

cc @jbrockmendel thoughts?

@rhshadrach rhshadrach linked a pull request May 15, 2024 that will close this issue