DOC: stats.mannwhitneyu: reversed options for alternative param #20733
Comments
To clarify, are you talking about these three sentences?
(uses a greater than sign)
(uses a less than sign)
(uses a greater than sign, and F is the distribution corresponding to X) |
This comes up about once per year. For the most recent occurrence, see #19009. Please let us know what could be added to the documentation to avoid the confusion. |
One possibility is to change |
If we want to discourage this from coming up again, I would review a PR that redefines F and G as the "survival functions" of the underlying distributions (rather than leaving them as the CDFs and taking their complement). |
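To make the CDF-versus-survival-function point concrete, here is a minimal sketch. It assumes two illustrative normal distributions (the names `X`, `Y` and the grid `u` are hypothetical and not taken from the SciPy docs) in which X is shifted to the left of Y. When X tends to be smaller, its CDF lies above that of Y while its survival function lies below, which is why the inequality sign appears to flip depending on which function the statement uses.

```python
import numpy as np
from scipy import stats

# Hypothetical example distributions: X is shifted to the left of Y,
# so X "tends to be less" than Y.
X = stats.norm(loc=0.0, scale=1.0)
Y = stats.norm(loc=1.0, scale=1.0)

# Grid of evaluation points (an arbitrary choice for illustration).
u = np.linspace(-4.0, 5.0, 181)

# CDFs: F(u) = Pr(X <= u), G(u) = Pr(Y <= u).
F, G = X.cdf(u), Y.cdf(u)

# Survival functions: Pr(X > u) and Pr(Y > u), i.e. 1 - CDF.
SX, SY = X.sf(u), Y.sf(u)

# X stochastically less than Y: CDF of X is above, SF of X is below.
cdf_x_above = bool(np.all(F > G))
sf_x_below = bool(np.all(SX < SY))

print("F(u) > G(u) on the grid:", cdf_x_above)
print("SF_X(u) < SF_Y(u) on the grid:", sf_x_below)
```

Stated with survival functions, "less" pairs with a less-than sign, which is the reading most reporters seem to expect.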
As someone who doesn’t know this test, I find the documentation not entirely straightforward. In particular, I feel its wording emphasizes what the test does more than what the test tries to achieve. I’d like to propose a PR with alternative wording so you can evaluate whether it is friendlier to the casual reader while remaining technically accurate. @mdhaber |
OK, but I would suggest listing point-by-point what you want to change in this issue before opening a PR. Please include the motivation for the change and how you think it can be improved. Please be respectful of the original author and remember that these things are subjective. Also, note that there is a cost-benefit calculus when working on these things, and maintainer time is often split between many issues. |
Sure thing. The point I wanted to make about being new to the subject is that I hope the lack of knowledge is an advantage in proofreading the doc, because questions naturally arise as I read it. The reason I wanted to make a PR is to leverage GitHub’s diff and code review functionality, which makes it easier to explain the proposed changes. Apart from editorial changes, I’d like to propose a few points for discussion:
|
Sure. I'd be willing to review all of that.
Balancing the needs of the API documentation to be readable, technically accurate, and concise can be challenging. Meeting all of those objectives when interpreting rank-based NHST alternatives seems particularly tricky, so I tend to shy away from it. For instance, consider "I’d also make clear when the test becomes a test on mean" - if we fail to mention that the mean needs to exist, we run the risk of being considered inaccurate by the few, and if we do get into that level of detail, things tend to get verbose for the many. (In this case, the usual compromise is not so hard, though - make the statement in terms of median.) There is also the question of the role of the API documentation - should it go into that, or can't it assume that the user has some familiarity with the test? There is no scarcity of information about NHSTs out there in formats less restrictive than API documentation - at what point do we just refer the reader to those? Along those lines, we're in the process of moving extended examples to the tutorials. Perhaps this sort of information would go better in a tutorial? If you were to add one statement along the lines of (from Wikipedia) I think it would strike the balance well: helpful to some, and not too far into the weeds for the rest. Going to the level of detail of this description of Wilcoxon null and alternatives is too far for my taste. You can try it, but depending on what you write, I might not want to take the time to work on it.
Here we seem to differ in opinion of what is simple. Some find it simple to reason about symbols (like $\text{Pr}(X > u)$) - especially those with a stronger technical background, I think - whereas I prefer to see a word/acronym of a higher-level concept I already have some grasp of (like "survival function"). I thought statements in terms of stochastic ordering were a great compromise because they give a rough understanding even if you ignore the word "stochastic" or just attribute it to mean "something having to do with randomness"; e.g. "tends to be less" is a reasonably accurate, non-technical understanding of "stochastically less". So it seems that opinions would differ here, and I would not be a good person to review such a change. |
Thanks for the comments, @mdhaber. Indeed, I will try not to elaborate too much in the doc because I agree conciseness is valuable. I think we might do away with the stochastic ordering confusion (which brought up this issue) by simply rewording the doc in terms of the stricter set-up, i.e. assuming the distributions differ at most by a location shift. This is essentially what R’s documentation says: “The null hypothesis is that the distributions of x and y differ by a location shift of mu and the alternative is that they differ by some other location shift (and the one-sided alternative "greater" is that x is shifted to the right of y).” It also plays nicely with the “two-sided”, “less”, “greater” wording because the first becomes the union of the latter two. We may then mention that less restrictive set-ups are possible, and refer the reader to reference (5). What do you think? |
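The location-shift reading can be tried out directly. The sketch below assumes synthetic normal samples with a fixed seed (the sample sizes, the shift of 1, and the variable names are illustrative choices, not part of the SciPy documentation). The sample `x` is drawn from a distribution shifted to the left of the one underlying `y`, and `scipy.stats.mannwhitneyu` is run with each value of `alternative`.

```python
import numpy as np
from scipy import stats

# Synthetic data with a fixed seed so the run is reproducible.
rng = np.random.default_rng(0)
x = rng.normal(loc=-1.0, scale=1.0, size=200)  # shifted to the left
y = rng.normal(loc=0.0, scale=1.0, size=200)

# p-value of the Mann-Whitney U test under each alternative hypothesis.
pvalues = {
    alt: stats.mannwhitneyu(x, y, alternative=alt).pvalue
    for alt in ("two-sided", "less", "greater")
}

for alt, p in pvalues.items():
    print(f"alternative={alt!r}: p-value = {p:.3g}")
```

With `x` shifted to the left, `alternative="less"` and `alternative="two-sided"` should both give very small p-values, while `alternative="greater"` should give a p-value close to 1. This matches the reading that "less" means `x` tends to be smaller than `y`.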
You could. I wouldn't really be interested in that, personally, since I think it would be simpler to just replace CDF with SF. Every time this issue has been reported, it is because the OP has seen what appears to be an inconsistent sign. Simply pointing out that the sign is not wrong and that the statement is in terms of the CDF has resolved the issue because when one pauses for a moment to form a mental image, it makes a lot of sense. In terms of the survival function - as you pointed out with So personally, I'd be willing to review those first three bullets and a change from CDF to SF, but I have not been convinced that we should rewrite the documentation under the assumption that |
Thanks for your comments. Let me prepare a PR this week. |
Issue with current documentation:
In
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
the function "scipy.stats.mannwhitneyu"
the "less" and "greater" options for parameter "alternative" are reversed
Idea or request for content:
In "less", F(u) should be less than G(u), and similarly for "greater"
Additional context (e.g. screenshots, GIFs)
No response