Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support specifying what parser to use in --lockfile #94

Merged
merged 11 commits into from
Feb 6, 2023
Merged

feat: support specifying what parser to use in --lockfile #94

merged 11 commits into from
Feb 6, 2023

Conversation

G-Rath
Copy link
Collaborator

@G-Rath G-Rath commented Dec 21, 2022

Resolves #67
Resolves #124

@G-Rath G-Rath requested a review from another-rex December 21, 2022 18:52
@G-Rath G-Rath added the enhancement New feature or request label Dec 21, 2022
@inferno-chromium
Copy link
Contributor

Thank you for your PR. @another-rex is on vacation and we will review it early next year and then release as part of the next minor release 1.0.x release.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Dec 23, 2022

yup all good - fwiw I might do another PR or two over the holiday break; btw I think you mean 1.x release, since 1.0.x is for fixes not minors :)

Copy link
Collaborator

@oliverchang oliverchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton!

Note that while this helps #67, I wouldn't consider this to fully "resolve" the issue. The intention of that issue was to automatically detect such cases without user intervention.

README.md Outdated
files with the `-parse-as` flag:

```bash
$ osv-scanner --parse-as 'requirements.txt' --lockfile=/path/to/your/extra-requirements.txt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of this (applies to every single file you scan) seems a bit limiting.

Some alternatives that make this more flexible:

1. Explicit argument for every lockfile type

osv-scanner --requirements-txt=/path/to/requirements.txt
osv-scanner --cargo-lock=/path/to/...

Downsides: This results in multiple flags that do the same thing (scanning a lockfile), which doesn't seem ideal. The format of the flag itself is also not obvious (i.e. "Cargo.lock" corresponds to "--cargo-lock").

2. A delimiter in the existing lockfile value.

osv-scanner --lockfile=PARSE-AS:requirements.txt:/path/to/....

Downsides: The string format is hard to read and easy to get wrong.

3. Allow --parse-as to be overridable

--parse-as would apply to all subsequent --lockfile arguments, and can be specified multiple times.

osv-scanner --parse-as=requirements.txt --lockfile=1/2.txt  --lockfile=3/4.txt --parse-as=Cargo.lock --lockfile=5/Cargo.lock

The first two lockfiles (1/2.txt, 3/4.txt) are parsed using requirements.txt. 5/Cargo.lock is parsed using Cargo.lock because we passed a subsequent --parse-as that overrides this.

Option 3. Seems like the best one to me, in a way that's consistent with our existing CLI format, without introducing redundancies. WDYT?

@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 3, 2023

The intention of that issue was to automatically detect such cases without user intervention.

I thought about that when implementing the requirements.txt parser and came to the conclusion that so long as such file does not have a well-defined identifier, it's not feasible because any file could be intended to be used as a requirements.txt - so you'd end up having to open every file, and either assume when you come across a line that doesn't look like it belongs in requirements.txt that it's not that kind of file or error, which if you choose to report to the user will be extremely verbose and if you don't then it can lead to unexpected behaviour as a user could think a file is being checked when it's not.

In addition, currently the requirements.txt parser (like all the parsers) tries to be forgiving so it just skips any line that doesn't denote a requirement - so i.e. a markdown file with a valid requirement line would get parsed as a valid requirements.txt with that as a package.

Finally, there are also cases where a user might not want a requirements.txt-like file to be scanned - the first one that comes to mind which is pretty active is the requirements.in file used by pip-tools/pip-compile; while that specific file name could be detected, like requirements.txt users are free to name the file whatever they like (and for similar reasons as with requirements.txt, there could be multiple i.e requirements.dev.in).

So I personally think it's better to mirror the behaviour of pip: pick up requirements.txt by default, and require users to explicitly name any other files that should be parsed as such, which they need to pass explicitly to pip anyway whenever they're using that file so it should be a familiar behaviour.

One improvement that could be made to the requirements.txt parser that I'd consider related to this would be supporting referring to other requirements.txt files and constraints files - I had not implemented them yet because I wasn't sure if it was correct to just treat them as being part of the original requirements.txt or if they should be somehow pushed out as their own lockfile (effectively having the parser report them as a heuristic for identifying other requirements.txt files).

Option 3. Seems like the best one to me, in a way that's consistent with our existing CLI format, without introducing redundancies. WDYT?

Sounds ok, but note it's not actually consistent with the existing CLI flags - currently everything is based on "last value wins", i.e. doing the equivalent as your example comment but with --config will result in the last --config being used for all lockfiles.

Note that there is a tradeoff too: you're removing the ability for users to confidently override/enforce certain options which can be useful in automatic tooling where the command line arguments are being generated (i.e. ensuring that the --config the tooling generates is used regardless of what's in the repo), or in project scripts when debugging/trying something.

Given that, I'd prefer to land this PR as-is and then follow it up with another PR changing the overall flag logic to work as you described (if that is still something you'd like).

@oliverchang
Copy link
Collaborator

The intention of that issue was to automatically detect such cases without user intervention.

I thought about that when implementing the requirements.txt parser and came to the conclusion that so long as such file does not have a well-defined identifier, it's not feasible because any file could be intended to be used as a requirements.txt

Thanks for explaining. Yeah, it seems like a difficult problem to solve perfectly. One compromise solution here is to account for any filename matching *requirements*.txt.

This particular issue came up when we tried scanning https://github.com/home-assistant/core, where they have a bunch of requirements_*.txt files in the root that we didn't pick up.

This is also the approach that syft takes: https://github.com/anchore/syft/blob/e3d6ffd30e44428b898675922a0474221a7f7dc7/syft/pkg/cataloger/python/cataloger.go#L16

Sounds ok, but note it's not actually consistent with the existing CLI flags - currently everything is based on "last value wins", i.e. doing the equivalent as your example comment but with --config will result in the last --config being used for all lockfiles.

Good call on the inconsistency with --config. Unfortunately because of our SemVer adherance we can't change the behaviour. We'll need to call this out as a global level config that can only be specified once in our CLI documentation.

That said, with --config we already provide a similar per-lockfile level mechanism to override things via osv-scanner.toml files living in the same directory.

Note that there is a tradeoff too: you're removing the ability for users to confidently override/enforce certain options which can be useful in automatic tooling where the command line arguments are being generated (i.e. ensuring that the --config the tooling generates is used regardless of what's in the repo), or in project scripts when debugging/trying something.

Not sure I fully understand this in the case of --parse-as. If the rules are clearly documented and understood, then automated tooling can still ensure the same outcome?

Given that, I'd prefer to land this PR as-is and then follow it up with another PR changing the overall flag logic to work as you described (if that is still something you'd like).

Unfortunately as above we need to ensure backwards compatibility for all of our releases, so we can't change the behaviour of a flag like so after we merge (and release this).

@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 4, 2023

That said, with --config we already provide a similar per-lockfile level mechanism to override things via osv-scanner.toml files living in the same directory.

I assume you mean that provided --config isn't provided; since if it is then osv-scanner.toml files won't be read (which is the correct behaviour imo)

If the rules are clearly documented and understood, then automated tooling can still ensure the same outcome?

Yes, but you increase the amount of work required to ensure that outcome because now users and tools have to examine the command more closely, figure out what rules are in place, and modify the whole command to ensure their outcome; whereas currently, if you just stick --config at the end you know regardless of if there are configs that exist in the directories being scanned or if an explicit config is being used in the command, your config will get used because it's the last option specified.

I think in the case of --parse-as it's less likely to be an issue, but I'm also aware how often other people come up with perfectly valid ways of using tools that do get impacted by these sorts of things; really what I'm driving at here is I think the general behaviour of how all CLI flags are applied should be as consistent as possible to reduce the cognitive load when using the tool (and also sadly in my experience people don't read the documentation enough to rely on that alone)

because of our SemVer adherance we can't change the behaviour.
Unfortunately as above we need to ensure backwards compatibility for all of our releases

Actually so long as the behaviour is not documented (which it currently isn't), you can make the argument that the change is a bugfix as it was always the desired behaviour - of course I agree care should be taken to minimize disruptions, but I personally think it would be pretty confusing to have some CLI flags behave so differently/inversely to others; it is of course your call though, so I can try and find some time to change this.

@oliverchang
Copy link
Collaborator

oliverchang commented Jan 5, 2023

That said, with --config we already provide a similar per-lockfile level mechanism to override things via osv-scanner.toml files living in the same directory.

I assume you mean that provided --config isn't provided; since if it is then osv-scanner.toml files won't be read (which is the correct behaviour imo)

Yep!

I personally think it would be pretty confusing to have some CLI flags behave so differently/inversely to others; it is of course your call though, so I can try and find some time to change this.

Thanks for the detailed rationale. I see your point on the inconsitency wrt flags, but I also see a global --parse-as as inconsistent when our CLI examples have cases like https://github.com/google/osv-scanner#example-2

$ osv-scanner --lockfile=/path/to/your/package-lock.json --lockfile=/path/to/another/Cargo.lock

Adding a --parse-as to this would not make very much sense.

Thinking more -- is requirements.txt the only edge case where the filename/contents are unpredictable (and the filename commonly ends in .txt)? If so, then perhaps we're overthinking all of this, and we can just do something like parse all *.txt files as requirements.txt without needing this additional flag.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 6, 2023

--parse-as would apply to all subsequent --lockfile arguments, and can be specified multiple times.

I've done some initial looking and it seems that urfave/cli doesn't support this as it does not expose the position each flag was passed in to the command.

then perhaps we're overthinking all of this, and we can just do something like parse all *.txt files as requirements.txt without needing this additional flag.

I don't mind saying that this solves a separate issue, but I think there is value in having the flag in either form (and even if its hidden) because it makes the tool more flexible - consider that right now you have to name your file based on the lockfile for a particular ecosystem but people could use exotic systems which could use alternate names (for better or worse), and we are already doing that here with the fixtures for lockfile.

@another-rex
Copy link
Collaborator

Another option similar to option 1 is to have parse-as argument where you specify the additional mapping in value rather than the flag itself, e.g.:

--parse-as=requirements_extra.txt:requirements.txt

And you can then specify multiple of these similar to how it can be done with other flags.

@oliverchang
Copy link
Collaborator

This seems like a reasonable compromise that allows the flexibility we want without being inconsistent with our CLI examples.

I'd probably invert the order though:

--parse-as=requirements.txt:/path/to/requirements_extra.txt

To make this more consistent/easier to parse in case there are ":" characters in the input paths.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 12, 2023

that actually also enables the original form of this i.e. --parse-as=requirements.txt could be applied to everything, because there is no :.

@oliverchang
Copy link
Collaborator

that actually also enables the original form of this i.e. --parse-as=requirements.txt could be applied to everything, because there is no :.

This could, but I think we should try to have a single way of doing something (i.e. the full --parse-as=requirements.txt:...), rather than multiple ways (i.e. also supporting the original --parse-as=requirements.txt --lockfile=... ), as that can end up being confusing in the future.

@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 13, 2023

Yeah I wasn't meaning supporting both of those ways (especially since the second way is not possible with the current CLI library being used), but that because --parse-as=x:y is effectively a way of passing a map, you could make --parse-as=x (without the : delimiter) set the default value when there isn't an item in the map (which right now is effectively "", which has lockfile infer the parser to use based on the filename)

@G-Rath G-Rath requested a review from oliverchang January 22, 2023 22:12
@G-Rath
Copy link
Collaborator Author

G-Rath commented Jan 22, 2023

@oliverchang @another-rex I've implemented this per our discussions - it's come out pretty nicely; a couple of thoughts:

  • I've tried to implement what I feel is the current core of the flag in a way that should hopefully allow for a bit of semver-flexibility for exploring future improvements like the ability to apply a mapping to a whole directory, setting a default parser, etc that I think could be useful but probably require too much additional consideration to be worth blocking this PR over
    • alternatively, this PR could do a "starts with" path check instead of an equality check which should immediately support at least those two features - the trade-off being what if someone had a file or folder named something like package-lock.json or Gemfile.lock2 etc
  • This should enable support for at least reading CSV files (rows might require more work) with only a few more lines of code - I've deliberately not done that in this PR because its self-standing (and you might want to go with a different format than what lockfile currently expected), but I can do a PR straight after this one lands
  • I considered if the installed parser should be included in the map within lockfile and then have it only used if the related lockfile functions are called explicitly with installed, which felt like it could be right. but I was concerned about edge-cases and just general ikyness in other parts of the codebase - I think it makes sense to go with what I've done for now, and reconsider the structure if/when similar special-case parsers are introduced

@another-rex
Copy link
Collaborator

Thanks! Probably for a follow up PR, but something we can add as well is instead of passing in the full path, you only have to pass in the file name, or even a glob to match against. This PR seems structured quite nicely to support those cases.

README.md Outdated
tell the scanner what parser to use for specific files using the `--parse-as` flag:

```bash
$ osv-scanner --parse-as 'requirements.txt:/path/to/your/extra-requirements.txt' --lockfile=/path/to/your/extra-requirements.txt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead make --parse-as just do the parsing itself as well? It seems a little clunky to have to specify the path twice in these cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also rather indirect -- it sets some global state that then needs to be utilised later by specifying the same path again. Not sure if there are benefits to this, and it seems much simpler/easier to use to just streamline this.

Copy link
Collaborator Author

@G-Rath G-Rath Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too fussed, but note what you're describing sounds based solely on the use of -L, when you can pass directories and files as args too.

While still less useful in this PR, as @another-rex pointed out this could be extended to support globs which make this make more sense, i.e. osv-scanner --parse-as requirements.txt:requirements-*.txt my/project.

I don't think there's a major loss so like I said I'm not too fussed as I agree it'd be nice to reduce the verbosity, but it might mean there's some edge-cases in future going down this path (I can play around with it tomorrow of course).

A middleground could be to assume -L unless ones actually present

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the general globbing/directory case, I think that's something that would be better suited as a config file option, rather than trying to support this complex mapping as a CLI option.

I think we should make --parse-as less verbose and just have it do the scanning as well.

@another-rex thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I see with making parse-as do scanning as well is that it won't be able to support directory scanning at all, which we currently advertise as the easiest way to use osv-scanner.

osv-scanner --parse-as requirements.txt:alt_requirement.txt -r my/project

will turn into

osv-scanner --parse-as requirements.txt:my/project/alt_requirement.txt  --parse-as requirements.txt:my/project/inner/alt_requirement.txt  ...etc.

which doesn't look that good.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was one of the original proposals in #94 (comment) :)

My concern there was that it was hard to parse perfectly in all cases. e.g. what if we had ':' characters in the given pathnames? Or if we had a filename called "requirements.txt:foo" ? Unlikely, but it does feel like a gap.

Copy link
Collaborator Author

@G-Rath G-Rath Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 I completely missed that somehow.

I think it would be fine, and I provided a way to handle edgecases: if we say that the split is always on the first : and that an empty "parse-as" means "use default/infer", then that should prevent any edgecases - even if your file started with an :, you'd just provide -L=::my-file (and repeat for as many :s your filename starts with)

If that doesn't make sense, but you like the idea I can switch and then you can see it in actual action via the tests

Copy link
Collaborator

@oliverchang oliverchang Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm still not sure that handles the edge case where the full filename is "requirements.txt:foo".

In this case, perhaps we can expand this into multiple files if needed. i.e.

  • if "requirements.txt:foo" exists, scan that through the default/infer rules.
  • if "foo" also exists, scan "foo" as "requirements.txt" as well.
  • or both, if both cases are satisfied.

Copy link
Collaborator Author

@G-Rath G-Rath Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should because you'd be providing -L=:requirements.txt:foo

(and to be clear, that's only required in these cases - otherwise the starting : can be omitted)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I think that would be a slight behaviour change, because previously scanning "requirements.txt:foo" would scan that file directly.

That said, it's enough of an edge case that I don't think anyone would actually be affected by this, so what you suggested seems good to me.

@oliverchang
Copy link
Collaborator

(I realise I may not have communicated very clearly about what I had thought --parse-as with the proposed format would do -- sorry about that!)

Copy link
Collaborator

@another-rex another-rex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Looks good to me, just minor nit about import order being moved.
@oliverchang is out today but let's get this merged in soon on Monday.

cmd/osv-scanner/main.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@oliverchang oliverchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!! just some comments.

README.md Outdated Show resolved Hide resolved
pkg/lockfile/parse.go Outdated Show resolved Hide resolved
@another-rex
Copy link
Collaborator

Also resolves #124

@oliverchang
Copy link
Collaborator

@G-Rath I think we just have two minor comments here to resolve, and we're good to go?

@G-Rath
Copy link
Collaborator Author

G-Rath commented Feb 2, 2023

@oliverchang yup I've been a bit slammed between personal and work stuff, so not had a chance to do anything big but tomorrow's 20% time so should be able to pick it up then

@G-Rath
Copy link
Collaborator Author

G-Rath commented Feb 2, 2023

note the documentation is a little ham-y right now, but I think that'll be easier to improve once #168 is landed because then there'll be two cases like this.

@G-Rath G-Rath changed the title feat: support --parse-as flag feat: support specifying what parser to use in --lockfile Feb 2, 2023
@oliverchang
Copy link
Collaborator

thanks a ton as always @G-Rath !

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@another-rex another-rex merged commit 4b72f08 into google:main Feb 6, 2023
@G-Rath G-Rath deleted the support-parse-as branch February 6, 2023 18:02
cmaritan pushed a commit to cmaritan/osv-scanner that referenced this pull request Feb 12, 2023
@agmond
Copy link

agmond commented Feb 20, 2023

Hi,
When is this feature expected to be released (i.e. included in ghcr.io/google/osv-scanner:latest)?

@another-rex
Copy link
Collaborator

@agmond A new version is planned to be released this week, we will also publish a releasing schedule soon.

hayleycd pushed a commit that referenced this pull request Mar 9, 2023
julieqiu pushed a commit to julieqiu/osv-scanner that referenced this pull request May 2, 2023
julieqiu pushed a commit to julieqiu/osv-scanner that referenced this pull request May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
5 participants