Write a script to aggregate OSS data on Ruff configuration #3365

charliermarsh · 2023-03-06T19:27:13Z

Right now, it's hard for us to make data-informed decisions. It'd be nice to leverage our open-source usage to help understand questions like:

Which rules are commonly turned on?
Which rules are commonly turned off?
Which rules are commonly # noqa ignored (i.e., false positives)?
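
As a rough illustration of the first two questions, the tally could look something like the sketch below. It assumes each configuration has already been parsed into a plain dict holding the Ruff settings table; `count_rule_codes` and the sample data are hypothetical, not part of any existing tool.

```python
from collections import Counter


def count_rule_codes(configs, field):
    """Count how many configs list each code under `field` ("select" or "ignore")."""
    counts = Counter()
    for config in configs:
        # set() so that a code repeated within one file is counted once per file
        counts.update(set(config.get(field) or []))
    return counts


# Hypothetical, already-parsed Ruff settings tables (illustrative, not real data)
configs = [
    {"select": ["E", "F"], "ignore": ["E501"]},
    {"select": ["F", "I"]},
    {"select": ["ALL"], "ignore": ["E501", "D"]},
]

select_counts = count_rule_codes(configs, "select")
ignore_counts = count_rule_codes(configs, "ignore")
print(select_counts.most_common(3))
print(ignore_counts.most_common(3))
```

Counting per file rather than per occurrence keeps one unusually long configuration from dominating the totals.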

This is good prior art (h/t @konstin): rust-lang/rust-clippy#7666

charliermarsh · 2023-03-06T19:27:50Z

Labeling this as "good first issue", not because it's well-scoped, but because it's a relatively independent project (and could be done without writing any Rust, if anyone is eager to help out).

akx · 2023-03-08T13:58:38Z

I could take a look – https://github.com/search?q=path%3A**%2Fpyproject.toml+ruff&type=code seems like a good starting point (along with **/ruff.toml).
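
For reference, a search link of that shape can be rebuilt with the standard library. This is only a sketch: the `code_search_url` helper is a hypothetical name, and the second query (for `ruff.toml`) is an assumed counterpart to the first, not one taken from the thread.

```python
from urllib.parse import urlencode


def code_search_url(query):
    """Build a github.com code search URL for `query` (hypothetical helper)."""
    # safe="*" leaves the ** glob unescaped; spaces become "+"
    return "https://github.com/search?" + urlencode({"q": query, "type": "code"}, safe="*")


pyproject_url = code_search_url("path:**/pyproject.toml ruff")
ruff_toml_url = code_search_url("path:**/ruff.toml")  # assumed second query
print(pyproject_url)
print(ruff_toml_url)
```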

konstin · 2023-03-08T14:12:11Z

fwiw it seems that github doesn't expose an api to their dependency graph, so i'd also go with that code search (or something api friendlier if github has too tight query limits)

akx · 2023-03-08T14:18:14Z

@konstin Dependency graph probably wouldn't directly pick up e.g. ruff-pre-commit in a pre-commit config file (which is how I tend to use ruff). Anyway, I have a skeleton of an aggregator up already 🔧

akx · 2023-03-08T17:35:26Z

Alright, here goes with some initial data 😁

The GitHub Search API doesn't seem to be using the new Code Search stuff and is heavily rate-limited, so the dataset may not be great just yet (PRs welcome, of course). Let me know if you want the files I have so far!

JonathanPlasse · 2023-03-08T18:13:08Z

What about the ALL data?

akx · 2023-03-08T18:15:48Z

> What about the ALL data?

If you mean the ALL selector, it's enabled in 23/306 files seen. (You can see it in the "Other values" list in the Select section.) It would probably be interesting to see if those repos also have lots of ignores.

Similarly, I guess a more holistic view (that knew about all of the codes Ruff knows about) would be an interesting next step!

charliermarsh · 2023-03-09T02:56:42Z

Oooh this is so useful! Thank you @akx! The other piece of data that would be really useful to see, though not sure whether it can fit into this paradigm, is how often various codes are used in # noqa, since a # noqa is often indicative of a false positive.
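
A minimal sketch of what counting codes in `# noqa` comments might involve, assuming a simplified regular expression rather than Ruff's own directive parser. `NOQA_RE` and `count_noqa_codes` are hypothetical names, and the pattern will also match `# noqa` text that happens to sit inside string literals.

```python
import re
from collections import Counter

# Simplified pattern: "# noqa" optionally followed by ": CODE, CODE ...".
# This is an approximation for illustration, not Ruff's actual parsing rules.
NOQA_RE = re.compile(
    r"#[ \t]*(?i:noqa)(?::[ \t]*(?P<codes>[A-Z]+[0-9]+(?:[, \t]+[A-Z]+[0-9]+)*))?"
)


def count_noqa_codes(source):
    """Count rule codes named in noqa comments; bare "# noqa" is tallied separately."""
    counts = Counter()
    for match in NOQA_RE.finditer(source):
        codes = match.group("codes")
        if codes is None:
            counts["(blanket)"] += 1
        else:
            counts.update(re.split(r"[, \t]+", codes))
    return counts


sample = (
    "import os  # noqa: F401\n"
    "x = 1  # noqa\n"
    "long_line = 2  # noqa: E501, F401\n"
)
noqa_counts = count_noqa_codes(sample)
print(noqa_counts)
```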

Another question I'd have (though not your responsibility to answer, only if you're curious): I know the data is based on 306 TOML files. I'd be interested to know how often various fields are set vs. unset. (E.g., the .md version says 200 files enable E and F. Are the remaining 106 omitting them? Or do they use the default config, and so they're enabled anyway?)

akx · 2023-03-09T07:26:33Z

@charliermarsh Sure! I updated the gist with "Fields set in configuration" (since I already had that data anyway, it just wasn't aggregated). It's also based on a slightly larger dataset (I realized I didn't process ruff.tomls correctly before).

Finding noqas will require a bit more magic, but now that I do have a good starter dataset of which repos may have Ruff enabled, we can go from there (but not right now since I have 9% laptop battery left 😂)

akx · 2023-03-10T11:33:28Z

Welp, turns out there was a bug or two that conflated "unset" with "empty set" 🙄 – I think this data is now more or less correct:

Name	Value
Total TOML files	1613
Unique TOML files	1352
Deduplicated TOML files	261
No select	28.7%
No ignore	35.3%
No fixable	94.8%
No unfixable	84.2%
Most popular configured Python version	py310
Median configured line length	100
Most common unfixable	F401
Most common ignore	E501
Most common select	F

Also updated into the gist, with all of the details...
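
To make the "unset" versus "empty set" distinction concrete, here is a small sketch. It assumes each configuration is a parsed dict; `classify_field` and `share_unset` are hypothetical helpers, not the aggregator's actual code.

```python
def classify_field(config, field):
    """Return "unset" if the key is absent, "empty" if present but empty, else "set"."""
    if field not in config:
        return "unset"
    if not config[field]:
        return "empty"
    return "set"


def share_unset(configs, field):
    """Fraction of configs that do not mention `field` at all."""
    unset = sum(1 for c in configs if classify_field(c, field) == "unset")
    return unset / len(configs)


# Hypothetical parsed configs (illustrative, not real data)
configs = [
    {"select": []},         # explicitly empty: no rules selected
    {"select": ["F"]},
    {},                     # nothing configured: defaults apply
    {"ignore": ["E501"]},   # select left unset
]
print(f"No select: {share_unset(configs, 'select'):.1%}")
```

A check such as `if not config.get("select")` treats the first and third configs alike, which is the kind of conflation described above.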

akx · 2023-03-21T08:45:58Z

Updated with some 90+ new repos in a fortnight: https://gist.github.com/akx/211308a4d2b31aaf4412558af6fe62a1

Name	Value
Total TOML files	1846
Unique TOML files	1577
Deduplicated TOML files	269
No select	29.4%
No ignore	36.0%
No fixable	94.2%
No unfixable	84.6%
Most popular configured Python version	py310
Median configured line length	100.0
Most common unfixable	F401
Most common ignore	E501
Most common select	F

charliermarsh · 2023-03-22T02:13:47Z

Awesome, thank you for this! (Gonna close issue :))

akx · 2023-03-28T12:21:33Z

https://gist.github.com/akx/817f5dc5663b80ae1315e108393b11a5

126 new unique configuration files in a week 🎉

Name	Value
Total TOML files	1982
Unique TOML files	1703
Deduplicated TOML files	279
No select	29.8%
No ignore	36.1%
No fixable	94.2%
No unfixable	85.1%
Most popular configured Python version	py310
Median configured line length	100.0
Most common unfixable	F401
Most common ignore	E501
Most common select	F

MichaReiser · 2023-03-28T12:30:22Z

I love these updates. Thank you @akx, for updating them.

akx · 2023-04-27T11:50:28Z

https://gist.github.com/akx/291e96a3cb4f085d86e4830eecb5375e

Name	Value
Total TOML files	3143
Unique TOML files	2628
Deduplicated TOML files	515
No select	29.6%
No ignore	35.8%
No fixable	92.3%
No unfixable	85.1%
Most popular configured Python version	py310
Median configured line length	100
Most common unfixable	F401
Most common ignore	E501
Most common select	F

akx · 2023-05-10T09:27:14Z

https://gist.github.com/akx/01c0d37eedd921c0b88d06262812413a

Name	Value
Total TOML files	3812
Unique TOML files	3230
Deduplicated TOML files	582
No select	29.8%
No ignore	35.3%
No fixable	91.9%
No unfixable	84.9%
Most popular configured Python version	py310
Median configured line length	100.0
Most common unfixable	F401
Most common ignore	E501
Most common select	F

konstin · 2023-05-11T20:07:09Z

Thank you for the continued data collection! Would it be feasible to filter out forks? i think we're seeing quite a bit of duplication in the dataset from them, there's e.g. 111 OpenBBTerminal in https://github.com/akx/ruff-usage-aggregate/blob/master/data/known-github-tomls.jsonl .

akx · 2023-05-11T20:09:41Z

@konstin Forks might be hard to find without consulting the GitHub API more, but we're already only considering unique TOMLs:

https://github.com/akx/ruff-usage-aggregate/blob/47a3db9be5a03d5f107a9001831bb2caf6620843/ruff_usage_aggregate/models.py#L19-L29
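
One way to read "unique TOMLs" is deduplication by file content, so that byte-identical copies in forks collapse into one entry. The sketch below illustrates that idea under stated assumptions; it is not the code at the link above, and `dedupe_by_content` is a hypothetical name. In the summary tables earlier in the thread, "Deduplicated" appears to equal "Total" minus "Unique" (e.g. 3812 − 3230 = 582), which the sketch mirrors.

```python
import hashlib


def dedupe_by_content(files):
    """Keep the first (path, text) pair seen for each distinct file content."""
    seen = {}
    for path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # setdefault keeps the earliest file and ignores later identical copies
        seen.setdefault(digest, (path, text))
    return list(seen.values())


# Hypothetical inputs: a fork carrying an identical copy of upstream's config
files = [
    ("upstream/pyproject.toml", '[tool.ruff]\nselect = ["F"]\n'),
    ("fork/pyproject.toml", '[tool.ruff]\nselect = ["F"]\n'),
    ("other/ruff.toml", "line-length = 100\n"),
]
unique = dedupe_by_content(files)
print(f"total={len(files)} unique={len(unique)} dropped={len(files) - len(unique)}")
```

Forks that have edited their configuration would still count as separate entries under this approach.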

charliermarsh added the "good first issue" (Good for newcomers) and "internal" (An internal refactor or improvement) labels on Mar 6, 2023

charliermarsh closed this as completed Mar 22, 2023
