Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REG: fix regression in df.corrwith on tied data when method is spearman #49032

Closed
wants to merge 12 commits into from

Conversation

GYHHAHA
Copy link
Contributor

@GYHHAHA GYHHAHA commented Oct 10, 2022

@GYHHAHA GYHHAHA changed the title REG: fix regression in df.corr with spearman method and tied data REG: fix regression in df.corrwith on tied data when method is spearman Oct 10, 2022
pandas/core/frame.py Outdated Show resolved Hide resolved
@@ -10598,18 +10598,23 @@ def corrwith(
cols = self.columns
ndf = self.values.transpose()
k = other.values
k_mask = ~np.isnan(k)
if k.dtype == "bool":
k = k.astype("float")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these casts needed? If needed, can it be converted to int instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bool type won't be accepted by rank_1d's signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we cast it to int, then nan in bool array will be -2147483648.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank should accept uint8, so k = k.view(uint8)

What happens with nullable dtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError: No matching signature found

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank_1d is compiled for uint8_t

Copy link
Contributor Author

@GYHHAHA GYHHAHA Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. My code was tested on 1.4.3.. I find the signature is changed in 1.5.0. as numeric_object_t instead of iu_64_floating_t. Now it works. Thanks. @phofl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked on 1.4.4, so have to fix this too or revert as suggested in the issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we can't use pd.isna?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to pd.isna now

@mroeschke mroeschke added the Regression Functionality that used to work in a prior pandas version label Oct 11, 2022
@mroeschke mroeschke added this to the 1.5.1 milestone Oct 11, 2022
@mroeschke mroeschke added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Oct 11, 2022
with pytest.raises(TypeError, match="not supported for the input types"):
with pytest.raises(
TypeError,
match=r"unsupported operand type\(s\) for /: 'str' and 'int'",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this is raised by np.corrcoef.

@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Oct 12, 2022

The failure comes from numeric_only. Is this implement on purpose? I think both results should be empty df. @mroeschke @phofl

>>> pd.DataFrame({"A": [True]}).select_dtypes(np.number)
Empty DataFrame
Columns: []
Index: [0]
>>> pd.DataFrame({"A": [True]}, dtype="boolean").select_dtypes(np.number)
      A
0  True

@phofl
Copy link
Member

phofl commented Oct 13, 2022

"boolean" is currently not considered numeric

k = other.values
k_mask = ~other.isna()
if isinstance(k, BaseMaskedArray):
k = k._data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we special Casing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nullable arrays are not supported by rank_1d.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the mask too in that case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we still cast the type, but write rank step as the following?

libalgos.rank_1d(k, mask=nonnull_mask)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant more like k_mask = k._mask

@GYHHAHA
Copy link
Contributor Author

GYHHAHA commented Oct 17, 2022

@mroeschke please move this to milestone 2.0.

@datapythonista datapythonista modified the milestones: 1.5.1, 1.5.2 Oct 17, 2022
@datapythonista
Copy link
Member

@mroeschke please move this to milestone 2.0.

Update to 1.5.2 for now, thanks @GYHHAHA

@mroeschke
Copy link
Member

Moving to 2.0 as we just reverted the original PR in #49140 and I think it's safer than "re-reverting" for 1.5.2

@mroeschke mroeschke modified the milestones: 1.5.2, 2.0 Oct 17, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 17, 2022
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Regression Functionality that used to work in a prior pandas version Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: The behavior for DataFrame.corrwith is changed in pandas 1.5.0 ??
4 participants