Does the projects endpoint need to be case sensitive #2063

Open
spencerschrock opened this issue Oct 11, 2023 · 7 comments
Comments

@spencerschrock

Scorecard uses the /projects.json?url= endpoint when checking a project's best practices badge status. We have an open issue (ossf/scorecard#3466) where some projects call scorecard with a different capitalization than their official repo name.

For example, github.com/kubearmor/KubeArmor was effectively running scorecard with:

scorecard --repo github.com/kubearmor/kubearmor

This makes an HTTP call to /projects.json?url=https://github.com/kubearmor/kubearmor, which returns an empty response.
The expected call, /projects.json?url=https://github.com/kubearmor/KubeArmor, is a hit.

GitHub has an API call which returns the "official" capitalization of a repo, so Scorecard can likely satisfy this requirement when we make the request, but opening an issue in case this was unintentional.
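As a rough illustration of the client-side option, here is a minimal sketch that reads the canonical capitalization from the `full_name` field of GitHub's "get a repository" REST endpoint. The helper name `canonical_github_name` and the injectable `fetch` parameter are assumptions made for this example; Scorecard itself is not written in Ruby, so this only shows the idea.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Scorecard or the badge app).
# Asks the GitHub REST API for a repository and returns the canonical
# "owner/name" string from the `full_name` field of the response.
# `fetch` is injectable so the parsing can be exercised without network access.
def canonical_github_name(owner, repo, fetch: ->(uri) { Net::HTTP.get(uri) })
  uri = URI("https://api.github.com/repos/#{owner}/#{repo}")
  body = JSON.parse(fetch.call(uri))
  body['full_name'] # nil when the API returns an error document instead
end
```

A caller could then build the badge query from the returned name rather than from whatever capitalization the user typed. Unauthenticated requests to the GitHub API are rate limited, so a real client would probably need a token.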

I know the code in question is here, but I'm not much of a Ruby guy. There are also some comments about indices and efficiency, so feel free to close this if it's not feasible.

# Search for exact match on URL
# (home page, repo, and maybe package URL someday)
# We have indexes on each of these columns, so this will be fast.
# We remove trailing space and slash to make it quietly "work as expected".
scope :url_search, (
  lambda do |url|
    clean_url = url.chomp(' ').chomp('/')
    where('homepage_url = ? OR repo_url = ?', clean_url, clean_url)
  end
)

@david-a-wheeler
Collaborator

Hmmm.

In the general case the URL pathname can be case-sensitive, so it's best to treat it that way. Git is also normally case-sensitive (depending on the underlying filesystem). Some specific systems have pathnames that are case-insensitive, but there's no obvious way to determine which is which. We support arbitrary repos, not just GitHub and GitLab.

It's true that the domain name is not case sensitive when it's ASCII, per IETF RFC 4343. E.g., "I" and "i" are considered the same (apologies to those who speak Turkish).

Handling this in the general case is hard. Here's one idea:

  1. Do a case-sensitive search. If that works, use it.
  2. If that fails, but a case-insensitive search finds a result, use that.

What do you think?
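The two-step idea above could look roughly like the following. This is a plain-Ruby sketch over an in-memory list, not the Rails scope from the badge app; the `Project` struct and the method name `url_search_with_fallback` are made up for illustration.

```ruby
# Hypothetical stand-in for the real model; only the two URL columns matter here.
Project = Struct.new(:name, :repo_url, :homepage_url)

# Step 1: exact (case-sensitive) match on either URL column.
# Step 2: only if step 1 finds nothing, retry ignoring case.
def url_search_with_fallback(projects, url)
  clean_url = url.chomp(' ').chomp('/')

  exact = projects.select do |p|
    [p.repo_url, p.homepage_url].include?(clean_url)
  end
  return exact unless exact.empty?

  projects.select do |p|
    [p.repo_url, p.homepage_url].compact.any? { |u| u.casecmp?(clean_url) }
  end
end

projects = [
  Project.new('KubeArmor', 'https://github.com/kubearmor/KubeArmor', nil)
]
hits = url_search_with_fallback(projects, 'https://github.com/kubearmor/kubearmor')
puts hits.map(&:name).inspect
```

In the real application the second step would be a database query, and whether it can stay fast depends on how the existing indexes handle case-insensitive comparison, which this sketch does not address.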

@TonyLHansen

wouldn't URI.parse allow you to grab the path separately from the fqdn?
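For reference, a small sketch of what that split looks like with Ruby's standard `uri` library. It lowercases only the host, which is the one part of the URL the thread agrees is case-insensitive for ASCII names, and leaves the path alone. The method name `normalize_host_case` is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: lowercase the host and leave the path untouched.
def normalize_host_case(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase if uri.host
  uri.to_s
end

uri = URI.parse('https://GitHub.COM/kubearmor/KubeArmor')
puts uri.path                 # the path component, separate from the host
puts normalize_host_case('https://GitHub.COM/kubearmor/KubeArmor')
```

This only settles the host part; it says nothing about whether the path on a given server is case-sensitive.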

@spencerschrock
Author

> Handling this in the general case is hard. Here's one idea:
>
>   1. Do a case-sensitive search. If that works, use it.
>   2. If that fails, but a case-insensitive search finds a result, use that.

To clarify, is this your proposal for scorecard or for the best practices API?

@david-a-wheeler
Collaborator

> wouldn't URI.parse allow you to grab the path separately from the fqdn?

Yes, it's definitely possible. The problem is "what to do with the information". Whether or not the path is case-sensitive depends on the details of the specific system being queried. It can even change over time for a given system being queried. I think "case-sensitive first, then case-insensitive" covers all cases and is simpler to implement.

> To clarify, is this your proposal for scorecard or for the best practices API?

I'm thinking of this as a proposal for the best practices badge, as this is an issue against the best practice badge.

It might make sense to do this in Scorecard as well, but I think that should be a different issue in that case.

@spencerschrock
Author

> It can even change over time for a given system being queried. I think "case-sensitive first, then case-insensitive" covers all cases and is simpler to implement.

I think the only case this doesn't cover is a false match. Consider a host where path is case sensitive, and there are two projects, but only one is in the best practices dataset:

  • foo.com/x/y (not in the dataset)
  • foo.com/X/Y (in the dataset)

A request for foo.com/X/Y would match exactly and return the intended project. A request for foo.com/x/y would miss the exact match and then return the wrong project from the case-insensitive fallback.
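The false match can be reproduced with a toy version of the fallback. This is a sketch over a plain array of URL strings, using the two example paths above; the constant `BADGE_URLS` and the method name `lookup_exact_then_insensitive` are invented for the example.

```ruby
# Hypothetical dataset: only the upper-case project has a badge entry.
BADGE_URLS = ['foo.com/X/Y'].freeze

# Exact match first; fall back to a case-insensitive comparison on a miss.
def lookup_exact_then_insensitive(urls, query)
  exact = urls.select { |u| u == query }
  return exact unless exact.empty?

  urls.select { |u| u.casecmp?(query) }
end

# Intended hit: the exact match finds the project that is in the dataset.
p lookup_exact_then_insensitive(BADGE_URLS, 'foo.com/X/Y')

# False match: foo.com/x/y is a different project on a case-sensitive host,
# but the fallback still returns the entry for foo.com/X/Y.
p lookup_exact_then_insensitive(BADGE_URLS, 'foo.com/x/y')
```

Both calls return the same entry, which is exactly the ambiguity described: the caller cannot tell from the result alone whether it came from the exact step or from the fallback.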

@david-a-wheeler
Collaborator

@spencerschrock - you're right, this approach does risk a false match. I think the risk is low, but it does give pause. I can't think of another approach though, so I think we end up with two possibilities:

  1. Close this unchanged.
  2. Make the change proposed (case-sensitive then case-insensitive).

Anyone have a third way?

@TonyLHansen

I say: accept the risk and go with case-sensitive then case-insensitive.

3 participants