Policy for checking for arbitrary file existence #500

wesley-dean-flexion · 2024-03-18T21:05:58Z

Policy for checking for arbitrary file existence

User Story

As a Security Engineer, so that I can enforce organization-wide policies regarding arbitrary files, I would like a new File Check policy that I may configure to meet my organization's needs.

Overview

OSSF Allstar current includes policies for checking the existence of various specific files. For example, consider SECURITY.md via pkg/policies/security/ ; this policy specifies a constant at security.md line 31 of polName with a value of SECURITY.md. In #469, @matias-jt asked if it was possible to verify if a file exists and @ArisBee responded with advice to update the polName variable. Similarly, @melmos asked in #450 if would be possible for a policy to check for security.json instead of SECURITY.md and it was suggested that the two issues may be duplicates with #469 referencing a general use-case while #450 is more specific.

The suggested change in the value of polName would require recompiling the tool -- that is, the value is set at compile-time, not at configuration-time or run-time.

Current Use-Case

My organization has multiple repositories. We have an organizational policy (governance, not Allstar policy) of requiring repositories to include a configuration files for Pre-Commit, a tool that supports hooks for running detect-secrets, gitleaks, etc.. That is, we want to iterate across all our repos and make sure they all have .pre-commit-config.yaml files -- basically exactly what Allstar does, but instead of SECURITY.md, we would look for .pre-commit-config.yaml files. While we could fork Allstar, we're thinking that it may be possible to help others who are in similar situations (e.g., the aforementioned issues) by generalizing a policy in Allstar that could be configured separately.

Alternative Implementation

High-level Goals

Allow changing the name of the file to be check at configuration-time
Allow the policy to check for the existence of absence of a file
Allow the policy to check in multiple locations
Allow the policy to check for multiple files

Without getting overly-prescriptive, it may be worthwhile to consider breaking down a new policy into smaller goals and iterate across them instead of trying to do the whole thing at once.

Iterative Steps

Iteration 1: variable filenames

Allow polName to be set via configuration file. The configuration file would function just like the other tools' configuration files (e.g., security.md line 30 which specifies configFile = "security.yaml"). The sample quickstart file would be extended to support an additional parameter, such as fileName. At runtime, instead of referencing a constant, a variable would be extracted from the configuration file. Everything else would remain the same. That is, if the file in question exists, policy compliance is accepted; if it's missing, the repository is non-compliant and the usual remedies (block checks, create issues, etc.) are taken.

It is recommended that basic checks are run to only allow tests for files in the repository, not arbitrary files on the filesystem.

read fileName string from the config file
if fileName exists, pass
otherwise fail

Consider the following configuration file:

---
optConfig:
  optOutStrategy: true
  action: issue

  fileName: "SECURITY.md"

Iteration 2: support file absence

Instead of having a policy check receive configuration via a string named fileName, the tool will accept a map with two keys of fileName and state where fileName is the name of the target file (as before) and state is either present or absent (consider Ansible's ansible.builtin.file module which allows absent and a variety of potential file types).

File exists?	State	Result
Yes	Present	Pass
Yes	Absent	Fail
No	Present	Fail
No	Absent	Pass

The most significant aspect of this iteration isn't the supporting the absence of a file, it's changing the configuration approach
from accepting a string to accepting a map.

read fileCheck map from the config file
extract fileName string from fileCheck
extract state string from fileCheck
if fileName exists and state == present, pass
if filename doesn't exist and state == absent, pass
otherwise, fail

Consider the following configuration file:

---
optConfig:
  optOutStrategy: true
  action: issue

  fileCheck:
    fileName: "SECURITY.md"
    state: "present"

Iteration 3: support multiple locations

Instead of having the policy check for a single specific location for a file (e.g., /SECURITY.md) specified as a string, accept an array of locations to check. In the case of checking for file existence, if any of the locations match, consider the check passed; in checking for file absence, if any of the locations match, conside the check failed. For example, the SECURITY.md file may be located at:

/SECURITY.md
/.github/SECURITY.md
/docs/SECURITY.md

The most significant change here is having the location be specified as an array rather than a string and iterating across the array instead of just checking one location.

read fileCheck map from the config file
extract fileNames array from fileCheck
extract state string from fileCheck
iterate across fileNames
a. if state == present and any match, pass; otherwise fail
b. if state == absent and any match, fail; otherwise pass

Consider the following configuration file:

---
optConfig:
  optOutStrategy: true
  action: issue

  fileCheck
    fileNames:
      - "/SECURITY.md"
      - "/.github/SECURITY.md"
      - "/docs/SECURITY.md"
    state: "present"

Iteration 4: support multiple checks

Currently, there's a 1:1 correspondence with regards to file checks and policies (e.g., the Security policy can't look for a CODEOWNERS file); this iteration supports multiple checks.

Instead of accepting a single map, fileCheck, accept an array of maps with each item representing the check for
a single file (although it may be in multiple locations).

Consider the following configuration file:

---
optConfig:
  optOutStrategy: true
  action: issue

  files:
    - SECURITY.md:
      fileNames:
        - "/SECURITY.md"
        - "/.github/SECURITY.md"
        - "/docs/SECURITY.md"
      state: "present"

    - CODEOWNERS:
      fileNames:
        - "/CODEOWNERS"
        - "/.github/CODEOWNERS"
        - "/docs/CODEOWNERS"
      state: "present"

read files array from the config file
iterate across files array
extract fileNames array from iterant
extract state string from iterant
iterate across fileNames
a. if state == present and any match, pass; otherwise fail
b. if state == absent and any match, fail; otherwise pass

The text was updated successfully, but these errors were encountered:

xee5ch · 2024-03-19T00:37:12Z

Hi @wesley-dean-flexion, for the fileNames like other GitHub Actions, is there a reason to not support the "Typescripty" way with GitHub Action configurations that support recursive wildcards?

So instead of supporting just the example below from your report.


  files:
    - SECURITY.md:
      fileNames:
        - "/SECURITY.md"
        - "/.github/SECURITY.md"
        - "/docs/SECURITY.md"

It could also support this?


  files:
    - SECURITY.md:
      fileNames:
        - "*/**/SECURITY.md"

You can look for specific files in specific subdirectories you can use the original method.

In either case, this looks like a nice a recommendation and I would support this addition. I have similar needs for such functionality in allstar. 😄

wesley-dean-flexion · 2024-03-19T13:42:32Z

@xee5ch that's a really good thought. It would be really nice if it could accept recursive globs (**/SECURITY.md) or, really, any globs.

It appears -- and please correct me if I'm wrong as this is really foundational to the whole idea -- that Allstar does not download / clone / fetch / pull the repositories being scanned; instead, it makes calls using the GitHub client library for Go:

allstar/pkg/policies/security/security.go

Lines 98 to 101 in 964a34c

    
           func (s Security) Check(ctx context.Context, c *github.Client, owner, 
        
           	repo string) (*policydef.Result, error) { 
        
           	v4c := githubv4.NewClient(c.Client()) 
        
           	return check(ctx, c, v4c, owner, repo)

(sidenote: Allstar is coded to use v59 of the library; v60 was released on February 29, 2024)

I'm looking through the Github's GraphQL documentation to see if it's possible to use GraphQL to use globs to match files.

wesley-dean-flexion · 2024-03-19T14:16:45Z

On the GitHub forum, I found a post where someone wanted to fetch the content of action.yml or action.yaml from the repo's HEAD. The response suggested to me that queries execute expressions on repositories to fetch objects. I'm struggling to find the GraphQL documentation asserting exactly what syntax expressions takes, specifically with regards to files (objects). The examples I've found so far have all listed specific files and not wildcards, globs, regexes, etc..

So... I don't know. I'll write more as I learn more.

jeffmendoza · 2024-03-27T22:24:18Z

@wesley-dean-flexion This is great, thanks for digging into this.

The polName const in the existing "SECURITY.md" Allstar policy is not used for anything except for identifying the policy in logs. The actual check is done by querying GitHub's GraphQL API here: https://github.com/ossf/allstar/blob/main/pkg/policies/security/security.go#L124-L137 Checking the Repository->IsSecurityPolicyEnabled bool. This allows for a SECURITY.md file to be placed in the GitHub Org's ".github" repository as well.

I'd suggest a new policy and to not touch the current "SECURITY.md" Allstar policy at all.

As far as iterations, feel free to go at your own pace. If you want to start with a simple single-file policy, that is ok, or if you want to go all the way to multiple checks with either/or lists of files to find, that works as well.

One thing to consider with the multiple-file checks is what to do about org vs repo level config. For policies that have single-variable options, the repo level config can override what is set at the org level (If overrides are allowed). When you have a list, then it is hard to remove/override an org level requirement at the repo level. Usually you would just add both lists together.

For getting files, we have a cache-friendly wrapper around Repository.GetContents here: https://github.com/ossf/allstar/blob/main/pkg/config/contents.go#L26 This can be made to be exported from the config pkg, or moved to another package in Allstar.

I agree that getting contents of files directly using the API is the most desired implementation, but that might not work well with globs. I'm not sure if there is a search API or something under the GraphQL side that will work better. In the end, the Scorecard policy downloads the tarball, so we can do that for this as well if we find a way to share the downloaded tarball with the Scorecard policy.

wesley-dean-flexion · 2024-03-28T14:43:59Z

@jeffmendoza this is amazing feedback. Thank you so much!!

The polName const

I suspected that as I couldn't see where or how a call for that constant was used. Reading this comment in #469 led me to my questioning myself (i.e., clearly I'm missing something):

Yes, that's doable if you duplicate the security.go check and replace SECURITY.md variable in polName.

I'd suggest a new policy

Yes, absolutely. Agreed.

As far as iterations, feel free to go at your own pace.

My thinking with the iterations was three-fold:

make it clear what efforts would be involved and describe them as finite chunks of work
allow for regular demonstrations that can be tested cleanly
lower the barrier to entry for developers wishing to gain some experience working on an Open Source project

org vs repo level config

Oh, that's a really good point. I suppose one could add an override option on a file-by-file basis; however, that may lead to an undesired level of complexity. My hope was to keep things as simple as possible (I'm prone to over-engineering / bike-shedding), but that's a really good point that individual repos may have additional needs or customizations. Based on your comment, I'm thinking of 3 models:

org level wins
repo level wins
merge both

So, maybe instead of a boolean (allowOverride, my first thought), perhaps it's a "merge strategy" (org, repo, mergeas options formergeStrategy`).

Something to think about.... 🤔

For getting files, we have a cache-friendly wrapper

Excellent. This is extremely helpful. Thank you!

might not work well with globs

I haven't found evidence to see if globs would be supported. It would be great if it could, but I don't know how much effort would be involved. While it would be more resource-intensive, we could fetch the list of files and replicate a glob match (against the list in-memory) instead of using a GraphQL query. I don't know what effort would be involved in fetching that list.

Some examples showed how to perform a non-recursive query for the list of names of files in a repo and suggested that a recursive option may be available in future revisions of the the API. However, those conversations were several years old.

That said, running a query on the ossf/allstar repo's / path in the HEAD branch shows that while the query doesn't recurse, the isDirectory response field would allow us to recurse ourselves with multiple API queries.

I just ran a test with a path of /**/*.md and the response tree object came back as empty.

So, if we were to build a list of files in a repository, that would likely result in several API calls (one per directory entry); however, if we just queried the files for a given policy check, we're talking about one query per file check. If we're looking for one..two..three locations for a file, it's highly probable a full repository scan would result in more API calls.

We could codify a check to see if a path query looks like a glob, we could do the full repo scan and do our own matching; however, that's a level of complexity that I don't think would be warranted at this time.

I really, really appreciate your taking the time to review this, offer suggestions and corrections, and support the refinement of this idea.

Tangentially, I noticed the Allstar list of future policies includes Dependabot. Cool. A file existence checker could look for the presence of .github/dependabot.yml and address a significant portion of that policy. It could tell if that file existed but it wouldn't verify if it's valid YAML or if it matches the Dependabot schema or if it's configured properly.

jeffmendoza · 2024-04-01T20:39:37Z

@wesley-dean-flexion All this looks great. I support passing on globs for now and getting something helpful for many/most cases in sooner without it.

The merge strategy you described sounds good. Most policies use the "allow override" from the OptConfig, so we will just need appropriate documentation on this policy that it follows the merge policy setting instead.

wesley-dean-flexion · 2024-04-01T20:42:35Z

@jeffmendoza Terrific. Thank you very much. Agreed on passing on globs for now -- seems nice to have, possibly for a future iteration.

Agreed re: documentation. Perfect.

Again, I really appreciate your taking the time to read, review, and respond to this thread and for being open to the conversation. Thank you.

wesley-dean-flexion · 2024-04-15T19:27:14Z

I was pointed to todogroup/repolinter earlier today. It looks like it can do what I wanted with this policy, so I'm having serious questions about whether or not it makes sense to duplicate it in Allstar.

One nice thing that I like about Allstar (among many!) is that it interacts with GitHub's API without cloning the repository being examined. Repolinter is able to accept URLs to repositories hosted on SCMs with the --git flag; this results in the repo in question being cloned and then examined locally. For a small repo, this may be just fine; however, for a large repo, this could result in a lot of unnecessary traffic as opposed to, say, a dozen API calls.

Another nice thing I like about Allstar is that it includes the ability to be used on an organization level, not just on a repository level. I stumbled across a posting on organizationally required GitHub Actions that may be able to accomplish something similar. Based on that article -- and I haven't tested anything yet -- an opt-out model appears to be more effort than Allstar's model, especially for organizations with many repositories. It's entirely possible that I misinterpreted the source material or the functionality has changed since the article was posted.

A third nice thing that I like about Allstar is the ability to trigger actions based on findings, such as by creating issues in repositories that are missing, for example, SECURITY.md files. I believe it may be possible to create an issue if a Repolint run failed (i.e., if: {{ failure{} }} ); however, implementing that would involve publishing an action to call Repolinter and react to Repolinter failures, and making that action required organizationally. Long story short, more code to write and maintain.

I'm curious, @jeffmendoza , if you have any thoughts. I have no ego in this so if you believe that it's not helpful to add this functionality to Allstar and would recommend this issue be closed / wont-do, that's totally fine with me.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Policy for checking for arbitrary file existence #500

Policy for checking for arbitrary file existence #500

wesley-dean-flexion commented Mar 18, 2024 •

edited

xee5ch commented Mar 19, 2024

wesley-dean-flexion commented Mar 19, 2024 •

edited

wesley-dean-flexion commented Mar 19, 2024

jeffmendoza commented Mar 27, 2024

wesley-dean-flexion commented Mar 28, 2024 •

edited

jeffmendoza commented Apr 1, 2024

wesley-dean-flexion commented Apr 1, 2024

wesley-dean-flexion commented Apr 15, 2024

Policy for checking for arbitrary file existence #500

Policy for checking for arbitrary file existence #500

Comments

wesley-dean-flexion commented Mar 18, 2024 • edited

Policy for checking for arbitrary file existence

User Story

Overview

Current Use-Case

Alternative Implementation

High-level Goals

Iterative Steps

Iteration 1: variable filenames

Iteration 2: support file absence

Iteration 3: support multiple locations

Iteration 4: support multiple checks

xee5ch commented Mar 19, 2024

wesley-dean-flexion commented Mar 19, 2024 • edited

wesley-dean-flexion commented Mar 19, 2024

jeffmendoza commented Mar 27, 2024

wesley-dean-flexion commented Mar 28, 2024 • edited

jeffmendoza commented Apr 1, 2024

wesley-dean-flexion commented Apr 1, 2024

wesley-dean-flexion commented Apr 15, 2024

wesley-dean-flexion commented Mar 18, 2024 •

edited

wesley-dean-flexion commented Mar 19, 2024 •

edited

wesley-dean-flexion commented Mar 28, 2024 •

edited