Write a script to aggregate OSS data on Ruff configuration #3365

Closed
charliermarsh opened this issue Mar 6, 2023 · 19 comments
Labels
good first issue Good for newcomers internal An internal refactor or improvement

Comments

@charliermarsh
Member

charliermarsh commented Mar 6, 2023

Right now, it's hard for us to make data-informed decisions. It'd be nice to leverage our open-source usage to help understand questions like:

  • Which rules are commonly turned on?
  • Which rules are commonly turned off?
  • Which rules are commonly # noqa ignored (i.e., false positives)?

This is good prior art (h/t @konstin): rust-lang/rust-clippy#7666
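The first two questions above could be answered by tallying selectors across parsed configuration files. The following is a minimal sketch, not the script that was eventually written: the `configs` sample data and the `tally` helper are hypothetical, and each dict stands in for one already-parsed Ruff configuration.

```python
from collections import Counter

# Hypothetical sample: each dict stands in for one parsed Ruff configuration.
configs = [
    {"select": ["E", "F", "I"], "ignore": ["E501"]},
    {"select": ["E", "F"]},
    {"select": ["ALL"], "ignore": ["E501", "D203"]},
]


def tally(configs, field):
    """Count in how many configurations each selector appears under `field`."""
    counts = Counter()
    for config in configs:
        # Use a set so a selector repeated within one file is counted once.
        counts.update(set(config.get(field, [])))
    return counts


selected = tally(configs, "select")
ignored = tally(configs, "ignore")
print(selected.most_common(3))
print(ignored.most_common(3))
```

Counting per file rather than per occurrence keeps one unusually long configuration from dominating the totals.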

@charliermarsh charliermarsh added good first issue Good for newcomers internal An internal refactor or improvement labels Mar 6, 2023
@charliermarsh
Member Author

charliermarsh commented Mar 6, 2023

Labeling this as "good first issue", not because it's well-scoped, but because it's a relatively independent project (and could be done without writing any Rust, if anyone is eager to help out).

@akx
Contributor

akx commented Mar 8, 2023

I could take a look – https://github.com/search?q=path%3A**%2Fpyproject.toml+ruff&type=code seems like a good starting point (along with **/ruff.toml).
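Since the two file types store Ruff settings in different places (under `[tool.ruff]` in `pyproject.toml`, at the top level in `ruff.toml`), an aggregator has to normalize them before counting. A rough sketch, assuming the file has already been parsed into a dict (for example with the standard-library `tomllib` on Python 3.11+); the `extract_ruff_config` name is hypothetical:

```python
from pathlib import PurePosixPath


def extract_ruff_config(path, parsed):
    """Return the Ruff settings table from a parsed TOML document, or None.

    `path` is the file's path in the repository and `parsed` is the dict
    produced by a TOML parser. Both names are illustrative assumptions.
    """
    name = PurePosixPath(path).name
    if name == "pyproject.toml":
        # In pyproject.toml, Ruff settings live under the [tool.ruff] table.
        return parsed.get("tool", {}).get("ruff")
    # In ruff.toml, the settings sit at the top level of the document.
    return parsed


print(extract_ruff_config("pkg/pyproject.toml", {"tool": {"ruff": {"line-length": 100}}}))
print(extract_ruff_config("ruff.toml", {"select": ["E", "F"]}))
```

A `pyproject.toml` that matches the search but has no `[tool.ruff]` table yields `None`, which lets the caller skip files that only mention Ruff elsewhere (for example, as a dependency).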

@konstin
Member

konstin commented Mar 8, 2023

fwiw it seems that github doesn't expose an api to their dependency graph, so i'd also go with that code search (or something api friendlier if github has too tight query limits)

@akx
Contributor

akx commented Mar 8, 2023

@konstin Dependency graph probably wouldn't directly pick up e.g. ruff-pre-commit in a pre-commit config file (which is how I tend to use ruff). Anyway, I have a skeleton of an aggregator up already 🔧

@akx
Contributor

akx commented Mar 8, 2023

Alright, here goes with some initial data 😁

The GitHub Search API doesn't seem to be using the new Code Search stuff and is heavily rate-limited, so the dataset may not be great just yet (PRs welcome, of course). Let me know if you want the files I have so far!

@JonathanPlasse
Contributor

What about the ALL data?

@akx
Contributor

akx commented Mar 8, 2023

> What about the ALL data?

If you mean the ALL selector, it's enabled in 23/306 files seen. (You can see it in the "Other values" list in the Select section.) It would probably be interesting to see if those repos also have lots of ignores.

Similarly, I guess a more holistic view (that knew about all of the codes Ruff knows about) would be an interesting next step!

@charliermarsh
Member Author

Oooh this is so useful! Thank you @akx! The other piece of data that would be really useful to see, though not sure whether it can fit into this paradigm, is how often various codes are used in # noqa, since a # noqa is often indicative of a false positive.

Another question I'd have (though not your responsibility to answer, only if you're curious): I know the data is based on 306 TOML files. I'd be interested to know how often various fields are set vs. unset. (E.g., the .md version says 200 files enable E and F. Are the remaining 106 omitting them? Or do they use the default config, and so they're enabled anyway?)
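For the `# noqa` question raised in this comment, one possible approach is a line-based scan of Python sources that tallies the codes listed after `# noqa:`. This is only a heuristic sketch: the regular expressions and the `count_noqa_codes` helper are assumptions, not Ruff's own parser, and text inside string literals that happens to look like a directive would also be counted.

```python
import re
from collections import Counter

# Matches "# noqa: E501, F401" and captures the list of codes.
# A blanket "# noqa" with no codes is deliberately not matched.
NOQA_RE = re.compile(
    r"#\s*(?i:noqa):\s*(?P<codes>[A-Z]+[0-9]+(?:[,\s]+[A-Z]+[0-9]+)*)"
)
CODE_RE = re.compile(r"[A-Z]+[0-9]+")


def count_noqa_codes(source):
    """Count how often each rule code appears in `# noqa: ...` comments."""
    counts = Counter()
    for line in source.splitlines():
        match = NOQA_RE.search(line)
        if match:
            counts.update(CODE_RE.findall(match.group("codes")))
    return counts


sample = (
    "import os  # noqa: F401\n"
    "x = 1  # noqa: E501, F401\n"
    "y = 2  # noqa\n"
)
print(count_noqa_codes(sample))
```

Summing these counters over every Python file in the repositories that have a Ruff configuration would give a per-code suppression total.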

@akx
Contributor

akx commented Mar 9, 2023

@charliermarsh Sure! I updated the gist with "Fields set in configuration" (since I already had that data anyway, it just wasn't aggregated). It's also based on a slightly larger dataset (I realized I didn't process ruff.tomls correctly before).

Finding noqas will require a bit more magic, but now that I do have a good starter dataset of which repos may have Ruff enabled, we can go from there (but not right now since I have 9% laptop battery left 😂)

@akx
Contributor

akx commented Mar 10, 2023

Welp, turns out there was a bug or two that conflated "unset" with "empty set" 🙄 – I think this data is now more or less correct:

| Name | Value |
| --- | --- |
| Total TOML files | 1613 |
| Unique TOML files | 1352 |
| Deduplicated TOML files | 261 |
| No select | 28.7% |
| No ignore | 35.3% |
| No fixable | 94.8% |
| No unfixable | 84.2% |
| Most popular configured Python version | py310 |
| Median configured line length | 100 |
| Most common unfixable | F401 |
| Most common ignore | E501 |
| Most common select | F |

Also updated into the gist, with all of the details...
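The "unset" versus "empty set" distinction mentioned in this comment matters because a field that is absent falls back to Ruff's defaults, while an explicitly empty list does not mean the same thing. A small sketch of keeping the two apart when computing figures like "No select"; the helper names and sample data are hypothetical, not taken from the aggregator:

```python
def field_state(config, field):
    """Classify a field as 'unset', 'empty', or 'set' in one parsed config."""
    if field not in config:
        return "unset"  # key absent: the tool's default applies
    if len(config[field]) == 0:
        return "empty"  # key present but explicitly empty, e.g. select = []
    return "set"


def percent_unset(configs, field):
    """Percentage of configurations that do not mention `field` at all."""
    unset = sum(1 for c in configs if field_state(c, field) == "unset")
    return 100.0 * unset / len(configs)


# Hypothetical sample data.
configs = [
    {"select": ["E", "F"]},
    {"select": []},
    {},
    {"ignore": ["E501"]},
]
print(percent_unset(configs, "select"))
print(percent_unset(configs, "ignore"))
```

Testing truthiness alone (`if not config.get(field)`) would merge the first two states, which is one way the kind of conflation described above can arise.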

@akx
Contributor

akx commented Mar 21, 2023

Updated with some 90+ new repos in a fortnight: https://gist.github.com/akx/211308a4d2b31aaf4412558af6fe62a1

| Name | Value |
| --- | --- |
| Total TOML files | 1846 |
| Unique TOML files | 1577 |
| Deduplicated TOML files | 269 |
| No select | 29.4% |
| No ignore | 36.0% |
| No fixable | 94.2% |
| No unfixable | 84.6% |
| Most popular configured Python version | py310 |
| Median configured line length | 100.0 |
| Most common unfixable | F401 |
| Most common ignore | E501 |
| Most common select | F |

@charliermarsh
Member Author

Awesome, thank you for this! (Gonna close issue :))

@akx
Contributor

akx commented Mar 28, 2023

https://gist.github.com/akx/817f5dc5663b80ae1315e108393b11a5

126 new unique configuration files in a week 🎉

| Name | Value |
| --- | --- |
| Total TOML files | 1982 |
| Unique TOML files | 1703 |
| Deduplicated TOML files | 279 |
| No select | 29.8% |
| No ignore | 36.1% |
| No fixable | 94.2% |
| No unfixable | 85.1% |
| Most popular configured Python version | py310 |
| Median configured line length | 100.0 |
| Most common unfixable | F401 |
| Most common ignore | E501 |
| Most common select | F |

@MichaReiser
Member

I love these updates. Thank you @akx, for updating them.

@akx
Contributor

akx commented Apr 27, 2023

https://gist.github.com/akx/291e96a3cb4f085d86e4830eecb5375e

| Name | Value |
| --- | --- |
| Total TOML files | 3143 |
| Unique TOML files | 2628 |
| Deduplicated TOML files | 515 |
| No select | 29.6% |
| No ignore | 35.8% |
| No fixable | 92.3% |
| No unfixable | 85.1% |
| Most popular configured Python version | py310 |
| Median configured line length | 100 |
| Most common unfixable | F401 |
| Most common ignore | E501 |
| Most common select | F |

@akx
Contributor

akx commented May 10, 2023

https://gist.github.com/akx/01c0d37eedd921c0b88d06262812413a

| Name | Value |
| --- | --- |
| Total TOML files | 3812 |
| Unique TOML files | 3230 |
| Deduplicated TOML files | 582 |
| No select | 29.8% |
| No ignore | 35.3% |
| No fixable | 91.9% |
| No unfixable | 84.9% |
| Most popular configured Python version | py310 |
| Median configured line length | 100.0 |
| Most common unfixable | F401 |
| Most common ignore | E501 |
| Most common select | F |

@konstin
Member

konstin commented May 11, 2023

Thank you for the continued data collection! Would it be feasible to filter out forks? i think we're seeing quite a bit of duplication in the dataset from them, there's e.g. 111 OpenBBTerminal in https://github.com/akx/ruff-usage-aggregate/blob/master/data/known-github-tomls.jsonl .
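If repository metadata were fetched from the GitHub REST API, filtering forks could be fairly simple, because each repository object carries a boolean `fork` field. A sketch under that assumption; the `drop_forks` helper and the sample records are hypothetical, and every metadata lookup would cost an additional API request:

```python
def drop_forks(repos):
    """Keep only repositories whose metadata does not mark them as forks."""
    return [repo for repo in repos if not repo.get("fork", False)]


# Hypothetical metadata records, shaped like GitHub REST API repository objects.
repos = [
    {"full_name": "example-org/project", "fork": False},
    {"full_name": "someone/project", "fork": True},
    {"full_name": "another/tool", "fork": False},
]
print([repo["full_name"] for repo in drop_forks(repos)])
```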

@akx
Contributor

akx commented May 11, 2023

@konstin Forks might be hard to find without consulting the GitHub API more, but we're already only considering unique TOMLs:

https://github.com/akx/ruff-usage-aggregate/blob/47a3db9be5a03d5f107a9001831bb2caf6620843/ruff_usage_aggregate/models.py#L19-L29
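One common way to consider only unique TOML files is to key each file on a hash of its content. The sketch below illustrates that idea; it is not the code at the link above, and the `dedupe_by_content` helper and sample inputs are hypothetical.

```python
import hashlib


def dedupe_by_content(files):
    """Split (url, text) pairs into unique contents and a duplicate count.

    Returns a dict mapping content digest to the first URL seen with that
    content, plus the number of files whose content was already seen.
    """
    first_seen = {}
    duplicates = 0
    for url, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates += 1
        else:
            first_seen[digest] = url
    return first_seen, duplicates


# Hypothetical inputs: the second file is a byte-for-byte copy of the first.
files = [
    ("https://example.invalid/a/pyproject.toml", "[tool.ruff]\nline-length = 100\n"),
    ("https://example.invalid/fork-of-a/pyproject.toml", "[tool.ruff]\nline-length = 100\n"),
    ("https://example.invalid/b/ruff.toml", "select = ['E', 'F']\n"),
]
unique, duplicates = dedupe_by_content(files)
print(len(files), len(unique), duplicates)
```

In the tables posted in this thread, the total file count equals the unique count plus the deduplicated count (for example, 1613 = 1352 + 261), which fits this kind of split. Content hashing removes unmodified forks but not forks whose configuration has since been edited.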
