DOC: stats.mannwhitneyu: reversed options for alternative param #20733

Open
athanzli opened this issue May 17, 2024 · 11 comments
Labels
Documentation: Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org
scipy.stats

Comments

@athanzli

Issue with current documentation:

In

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html

the function "scipy.stats.mannwhitneyu"

the "less" and "greater" options for parameter "alternative" are reversed

Idea or request for content:

In "less", F(u) should be less than G(u), and similarly for "greater"

Additional context (e.g. screenshots, GIFs)

No response

@athanzli athanzli added the Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org label May 17, 2024
@lucascolley lucascolley changed the title from "DOC: <Please write a comprehensive title after the 'DOC: ' prefix>" to "DOC: stats.mannwhitneyu: reversed options for alternative param" May 17, 2024
@nickodell
Contributor

To clarify, are you talking about these three sentences?

‘less’: the distribution underlying x is stochastically less than the distribution underlying y, i.e. F(u) > G(u) for all u.

(uses a greater than sign)

‘greater’: the distribution underlying x is stochastically greater than the distribution underlying y, i.e. F(u) < G(u) for all u.

(uses a less than sign)

If F(u) > G(u) for all u, samples drawn from X tend to be less than those drawn from Y.

(uses a greater than sign, and F is the distribution corresponding to X)
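The sign convention can be checked numerically. The following is a minimal sketch, not taken from the SciPy documentation: the normal distributions, sample sizes, and seed are illustrative assumptions. It draws x from a distribution shifted to the left of the one underlying y, confirms that F(u) > G(u) on a grid, and runs the test with both one-sided alternatives.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.normal(loc=0.0, scale=1.0, size=200)  # X ~ N(0, 1), with CDF F
y = rng.normal(loc=1.0, scale=1.0, size=200)  # Y ~ N(1, 1), with CDF G

# Theoretical CDFs on a grid: F(u) > G(u) everywhere, because X puts
# more probability mass below any given u than Y does.
u = np.linspace(-4.0, 5.0, 91)
F = stats.norm.cdf(u, loc=0.0)
G = stats.norm.cdf(u, loc=1.0)
cdf_of_x_is_larger = bool(np.all(F > G))

# alternative="less" asks whether x is stochastically less than y;
# alternative="greater" asks the opposite.
p_less = stats.mannwhitneyu(x, y, alternative="less").pvalue
p_greater = stats.mannwhitneyu(x, y, alternative="greater").pvalue

print(cdf_of_x_is_larger)
print(p_less, p_greater)
```

With these assumed inputs, the larger CDF belongs to the sample that tends to be smaller, and only `alternative="less"` yields a small p-value.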

@mdhaber
Contributor

mdhaber commented May 17, 2024

This comes up about once per year. For the most recent occurrence, see #19009. Please let us know what could be added to the documentation to avoid the confusion.

@fancidev
Contributor

fancidev commented May 17, 2024

This comes up about once per year. For the most recent occurrence, see #19009. Please let us know what could be added to the documentation to avoid the confusion.

One possibility is to change F(u) > G(u) to 1-F(u) < 1-G(u). But this may be unnecessary as there is already a long paragraph that follows to explain the direction of inequality.

@mdhaber
Contributor

mdhaber commented May 28, 2024

If we want to discourage this from coming up again, I would review a PR that redefines F and G as the "survival functions" of the underlying distributions (rather than leaving them as the CDFs and taking their complement).
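To illustrate why the survival-function form reads more naturally, here is a small sketch using only the Python standard library. The two normal distributions and the helper `sf` are hypothetical choices made for the illustration, not anything from SciPy.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution underlying x
Y = NormalDist(mu=1.0, sigma=1.0)  # assumed distribution underlying y


def sf(dist, u):
    """Survival function Pr(value > u), i.e. 1 - CDF(u)."""
    return 1.0 - dist.cdf(u)


grid = [i / 10 for i in range(-40, 51)]  # u from -4.0 to 5.0

# X is stochastically less than Y in this set-up.
# Stated with CDFs, the inequality uses a "greater than" sign ...
cdf_sign = all(X.cdf(u) > Y.cdf(u) for u in grid)
# ... stated with survival functions, it uses a "less than" sign,
# which matches the word "less".
sf_sign = all(sf(X, u) < sf(Y, u) for u in grid)

print(cdf_sign, sf_sign)
```

Both checks describe the same ordering; only the direction of the printed inequality changes.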

@fancidev
Contributor

fancidev commented May 28, 2024

If we want to discourage this from coming up again, I would review a PR that redefines F and G as the "survival functions" of the underlying distributions (rather than leaving them as the CDFs and taking their complement).

As someone who doesn’t know about this test, I do find the documentation not entirely straightforward. In particular, I feel its wording emphasizes what the test does more than what the test tries to achieve. I’d propose a PR with an alternative wording so that you can evaluate whether it is friendlier to the casual reader while remaining technically accurate. @mdhaber

@mdhaber
Contributor

mdhaber commented May 28, 2024

As someone who doesn’t know about this test...

OK, but I would suggest listing point-by-point what you want to change in this issue before opening a PR. Please include the motivation for the change and how you think it can be improved. Please be respectful of the original author and remember that these things are subjective. Also, note that there is a cost-benefit calculus when working on these things, and maintainer time is often split between many issues.

@fancidev
Contributor

fancidev commented May 28, 2024

Sure thing. The point I wanted to make about being new to the subject is that I hope the lack of knowledge provides an advantage in proof-reading the doc, because questions naturally arise as I read it.

The reason I wanted to make a PR is to leverage GitHub’s diff and code review functionality to make it easier to explain the proposed changes. Apart from editorial changes, I’d like to propose a few points for discussion:

  • Reword stochastic ordering in terms of Pr[X>u] <= Pr[Y>u] for all u, with strict inequality for at least one u, and remove the mention of CDF or SF. I expect this to reduce confusion for newcomers. (The current definition of strictly less is imprecise as far as I can tell.)
  • Make clear the distinct “perspectives” of null and alternative. For example, “two-sided” makes one prone to think it’s the union of “less” and “greater”, but it is not, because stochastic ordering is a partial order. So, in fact, “two-sided” operates under a different “perspective” (using the terminology of reference 5). I’d also make clear when the test becomes a test on mean (i.e. when the distributions have the same shape and scale and differ only in location).
  • use_continuity is only used when method is asymptotic; is it also used when method is auto and auto chooses asymptotic?
  • Remark that x and y needn’t have the same size. This is implied in several places in the text, but never stated explicitly.
  • After each “consider…” recommendation, add half a sentence to summarize the reason for the recommendation.
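The point about partial ordering can be made concrete with a short sketch. It uses only the standard library, and the two distributions are illustrative assumptions: same centre, different spread. Their CDFs cross, so neither is stochastically less than the other even though they are not equal.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)
Y = NormalDist(mu=0.0, sigma=2.0)  # same centre as X, larger spread

# Grid from -4.0 to 4.0, skipping u = 0 where the two CDFs coincide.
grid = [i / 10 for i in range(-40, 41) if i != 0]

# "X stochastically less than Y" would need F(u) >= G(u) for all u.
x_less_than_y = all(X.cdf(u) >= Y.cdf(u) for u in grid)
# "X stochastically greater than Y" would need F(u) <= G(u) for all u.
x_greater_than_y = all(X.cdf(u) <= Y.cdf(u) for u in grid)

print(x_less_than_y, x_greater_than_y)
```

Both flags come out false: below zero the wider distribution has the larger CDF, above zero the narrower one does. This is the sense in which "not equal" is broader than "less or greater".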

@mdhaber
Contributor

mdhaber commented May 28, 2024

After each “consider…”
Remark that x and y needn’t have the same size.
use_continuity is only used when method is asymptotic

Sure. I'd be willing to review all of that.
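Two of these points can be probed empirically. The sketch below is an experiment, not a statement of the documented contract: the data, seed, and helper `pvalue` are assumptions made for the illustration, and the behaviour of `method="auto"` reflects the SciPy implementation as I understand it. It passes samples of different lengths and compares p-values with the continuity correction switched on and off.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.normal(size=20)
y = rng.normal(loc=0.5, size=30)  # deliberately a different length from x


def pvalue(method, use_continuity):
    """Two-sided p-value for the given method and continuity setting."""
    res = stats.mannwhitneyu(
        x, y,
        alternative="two-sided",
        method=method,
        use_continuity=use_continuity,
    )
    return res.pvalue


asym_on, asym_off = pvalue("asymptotic", True), pvalue("asymptotic", False)
auto_on, auto_off = pvalue("auto", True), pvalue("auto", False)
exact_on, exact_off = pvalue("exact", True), pvalue("exact", False)

print("asymptotic:", asym_on, asym_off)
print("auto:      ", auto_on, auto_off)
print("exact:     ", exact_on, exact_off)
```

With both samples larger than 8 observations, `auto` is expected to pick the asymptotic method, so its p-values should track the `asymptotic` row, while the `exact` row should not depend on `use_continuity`.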

Make clear the distinct “perspectives” of null and alternative.

Balancing the needs of the API documentation to be readable, technically accurate, and concise can be challenging. Meeting all of those objectives when interpreting the alternatives of rank-based NHSTs seems particularly tricky, so I tend to shy away from it. For instance, consider "I’d also make clear when the test becomes a test on mean" - if we fail to mention that the mean needs to exist, we run the risk of being considered inaccurate by the few, and if we do get into that level of detail, things tend to get verbose for the many. (In this case, the usual compromise is not so hard, though - make the statement in terms of median.)

There is also the question of the role of the API documentation - should it go into that, or can it assume that the user has some familiarity with the test? There is no scarcity of information about NHSTs out there in formats less restrictive than API documentation - at what point do we just refer the reader to those? Along those lines, we're in the process of moving extended examples to the tutorials. Perhaps this sort of information would go better in a tutorial?

If you were to add one statement along the lines of (from Wikipedia)
[image: statement quoted from Wikipedia]

I think it would strike the balance well: helpful to some, and not too far into the weeds for the rest. Going to the level of detail of this description of Wilcoxon null and alternatives is too far for my taste. You can try it, but depending on what you write, I might not want to take the time to work on it.

Reword stochastic ordering in terms of Pr[X>u] <= Pr[Y>u] for all u, with strict inequality for at least one u, and remove the mention of CDF or SF.

Here we seem to differ in opinion of what is simple. Some find it simple to reason about symbols (like $\text{Pr}(X > u)$) - especially those with a stronger technical background, I think - whereas I prefer to see a word/acronym of a higher-level concept I already have some grasp of (like "survival function"). I thought statements in terms of stochastic ordering were a great compromise because they give a rough understanding even if you ignore the word "stochastic" or just attribute it to mean "something having to do with randomness"; e.g. "tends to be less" is a reasonably accurate, non-technical understanding of "stochastically less".

So it seems that opinions would differ here, and I would not be a good person to review such a change.

@fancidev
Contributor

fancidev commented May 28, 2024

Thanks for the comments, @mdhaber. Indeed, I will try not to elaborate too much in the doc because I agree conciseness is valuable.

I think we might do away with the stochastic ordering confusion (which prompted this issue) by simply rewording the doc in terms of the stricter set-up, i.e. assuming the distributions differ at most by a location shift. This is essentially what R’s documentation says:

“The null hypothesis is that the distributions of x and y differ by a location shift of mu and the alternative is that they differ by some other location shift (and the one-sided alternative "greater" is that x is shifted to the right of y).”

It also plays nicely with the “two-sided”, “less”, “greater” wording because the first becomes the union of the latter two. We may then mention that less restrictive set-ups are possible, and refer the reader to reference (5).

What do you think?

@mdhaber
Contributor

mdhaber commented May 28, 2024

You could. I wouldn't really be interested in that, personally, since I think it would be simpler to just replace CDF with SF. Every time this issue has been reported, it is because the OP has seen what appears to be an inconsistent sign. Simply pointing out that the sign is not wrong and that the statement is in terms of the CDF has resolved the issue because when one pauses for a moment to form a mental image, it makes a lot of sense. In terms of the survival function - as you pointed out with $1 - F(x)$ - the issue is even less likely to come up because then the signs don't even look inconsistent.

[image]

So personally, I'd be willing to review those first three bullets and a change from CDF to SF, but I have not been convinced that we should rewrite the documentation under the assumption that $G(x) = F(x - c)$. I do agree it is simpler, but it's a very restrictive case.
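For reference, the link between the location-shift case and stochastic ordering can be written out in a few lines. This is a sketch, assuming $F$ and $G$ are the CDFs of the distributions underlying x and y and that $c > 0$:

$$
G(u) = \Pr(Y \le u) = F(u - c) = \Pr(X \le u - c) = \Pr(X + c \le u),
$$

so $Y$ has the same distribution as $X + c$. Because $F$ is non-decreasing and $c > 0$,

$$
G(u) = F(u - c) \le F(u) \quad \text{for all } u,
\qquad \text{equivalently} \qquad
1 - F(u) \le 1 - G(u) \quad \text{for all } u .
$$

That is, x is stochastically less than (or equal to) y. With $c < 0$ the inequalities reverse, so under the shift assumption every $c \ne 0$ falls under either "less" or "greater". Without that assumption the two CDFs may cross and neither ordering need hold.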

@fancidev
Contributor

Thanks for your comments. Let me prepare a PR this week.
