
Effective posture of various software packages in the dataset is not identified correctly, leading to inaccurate scoring #1

Open
bureado opened this issue Oct 22, 2023 · 0 comments


bureado commented Oct 22, 2023

While reading the paper, my attention was drawn to Table VII ("Top twenty most frequently used packages with low Scorecard score (i.e., Popular Pitfalls)"), where various software packages are scored poorly due to the apparent absence of code reviews, security policies, or even active maintainership.

I came to this repo to review the dataset. In many cases, the GitHub repositories that have been loosely matched to the software components [1] are not representative of the actual security posture of those components, which leads to inaccurate scoring. For example, while there is little debate about the criticality of packages like tzdata [2] or bash by virtue of how prevalent they are, I wouldn't characterize them as having a particularly low "Achilles score".
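To make the failure mode concrete, here is a minimal sketch (with a hypothetical, toy repo index, not your actual dataset) of how a naive name-based lookup can attach a wrapper or mirror to a package instead of its real upstream:

```python
# Illustrative sketch of loose name matching between package names and
# GitHub repositories. The repo mappings below are toy examples chosen to
# mirror the kinds of mismatches described in this issue; they are not
# drawn from the paper's dataset.
GITHUB_REPOS = {
    "tzdata": "python/tzdata",      # a Python wrapper, not the IANA tz database
    "libxml2": "someuser/libxml2",  # an unofficial mirror (hypothetical namespace)
    "openssl": "openssl/openssl",   # the actual upstream
}

def loose_match(package_name):
    """Naive lookup: assumes the repository name equals the package name.

    Returns whatever repo happens to share the name, or None when no
    same-named repo exists (e.g., packages not hosted on GitHub at all).
    """
    return GITHUB_REPOS.get(package_name)

for pkg in ["tzdata", "libxml2", "openssl", "bash"]:
    print(pkg, "->", loose_match(pkg))
```

Scoring whatever `loose_match` returns means the Scorecard result reflects the wrapper or mirror's practices, not the component actually installed in the container image.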
 
I also noticed the following mention in the Limitations section, page 7:

> Moreover, we could not find corresponding Github repositories for many packages in our containers because the package name and repository name may not match exactly [...] Nonetheless, we found Github repositories for 26k out of 190k unique dependencies collected from SBOM data.

I do not see this as a limitation, but as exclusion criteria for such an analysis, particularly when the scoring tool in consideration is OpenSSF Scorecard or the other tools listed in the OpenSSF SCM Best Practices Guide, which, for scoring completeness, rely on a deterministic mapping to the DevOps platform that generates the artifact in the container image.

As you alluded to in the paper, not all software components are maintained on GitHub; many same-named projects are mirrors (which may or may not have releases), wrappers or, in extreme cases, malicious typosquats. Different base images and package ecosystems require specific analysis. For example, in a Debian-based container, sources are offered in places like sources.debian.net (a public archive), on GitLab (salsa.debian.org), and in Debian's native source package format, which is ultimately what's fed into the build system: nuances that are necessary to determine a scorable posture. Other base images use specs and patches or other maintenance models with different threat models. And that doesn't account for higher-order distribution models, for example extended LTS providers that will tick the "maintainability" box for years after the main SCM becomes unmaintained.
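For the Debian case specifically, the source package record itself declares where maintenance actually happens, via `Vcs-Browser`/`Vcs-Git` fields (visible, e.g., in `apt-cache showsrc` output). A small sketch of extracting those fields; the stanza text below is an illustrative example written for this issue, not captured from a real system:

```python
# Sketch: reading the declared maintenance location from a Debian source
# package stanza instead of guessing a same-named GitHub repo. The stanza
# below is illustrative sample data, not real `apt-cache showsrc` output.
SAMPLE_STANZA = """\
Package: tzdata
Binary: tzdata
Version: 2023c-5
Vcs-Browser: https://salsa.debian.org/glibc-team/tzdata
Vcs-Git: https://salsa.debian.org/glibc-team/tzdata.git
"""

def vcs_fields(stanza: str) -> dict:
    """Collect Vcs-* fields, which name the repository actually used
    to maintain the Debian package (often on salsa.debian.org)."""
    fields = {}
    for line in stanza.splitlines():
        if line.startswith("Vcs-"):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

print(vcs_fields(SAMPLE_STANZA))
```

A mapping derived this way is deterministic for packages that declare it, which is exactly the property the exclusion-criteria argument above depends on: score what can be traced, exclude what cannot.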

While for many other types of dependencies you've evaluated (such as most development libraries and project dependencies) the approach taken in the paper might be sufficient, the security posture of other software components (such as deb, rpm and apk packages, which make up a significant portion of your dataset, and all of the Top 10) requires specific domain expertise that generalist posture-assessment tools currently don't offer.

I hope my observation helps extend the conversation your paper is initiating: traceability of packages to CI platforms in the OSS ecosystem needs to improve, component naming needs to improve, and tooling needs to improve, all in parallel with improving the security posture itself, starting with critical projects and with at-scale efforts, many of which the OpenSSF is leading.

[1] I took a quick pass at the list of GitHub repos in your dataset for the packages in Table VII; most of the namespaces appear to be private users, a few are language bindings (tzdata, findutils), a few are mirrors (libxml2, coreutils), and at first glance openssl/openssl looks like the only true match in that group (and even then, I don't consider the posture of that GitHub repo to be the only contributor to the threat model of OpenSSL as installed in a given container image).

[2] To use tzdata as an example: in your dataset you point to github.com/python/tzdata as the SCM for tzdata, which is inaccurate. python/tzdata is a wrapper; in Debian-based containers the SCM would be salsa.debian.org, the governance is codified in RFC 6557, and so on.
