Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix --skip matching #2058

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

DimitriPapadopoulos
Copy link
Collaborator

@DimitriPapadopoulos DimitriPapadopoulos commented Sep 10, 2021

  • Remove trailing system path separator from pattern.
  • Match relative file and directory paths against patterns. By relative, I mean relative to the top-level directories, the directories passed to codespell as file arguments.

Fixes #1915 and #2047.

@DimitriPapadopoulos DimitriPapadopoulos changed the title WIP: Fix --skip matching Fix --skip matching Sep 10, 2021
@DimitriPapadopoulos
Copy link
Collaborator Author

I believe one of the tests cannot pass until the patch is merged.

@peternewman
Copy link
Collaborator

I believe one of the tests cannot pass until the patch is merged.

Are you referring to the failures here:
https://github.com/codespell-project/codespell/pull/2058/checks?check_run_id=3571876883

Yes I'd agree, it's running against master, but you've obviously updated the config to work post merge.

@DimitriPapadopoulos
Copy link
Collaborator Author

Yes, that's the test.

This patch is obviously breaking backwards compatibility, as demonstrated by the test. Would it be worth keeping both behaviours for some time, and deprecate the old behaviour in a future release?

Copy link
Collaborator

@peternewman peternewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @DimitriPapadopoulos .

Please can we have some tests to go with it, both to confirm the before behaviour, prove you've fixed the bug (and ensure we don't break it again in future) and maybe to validate the behaviour of some of my more theoretical issue comments? I don't think they're necessarily deal breakers but it would be good to at least check how we currently behave.

Given a file foo in:
bar/foo

And a directory:
doc/foo/baz
or
foo/baz

Can I skip one without it also skipping the other?

# Pattern might be a list of comma-delimited strings
self.pattern_list = ','.join(pattern).split(',')
# Pattern might be a list of comma-separated patterns,
# we remove any trailing path separator from each pattern
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief reason for why? Is it just consistency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 reasons::

  1. The trailing separator can be copied/pasted from the command line, at least in some Linux setups. It did happen to me, perhaps using <TAB> for completion under Bash:
    $ ls foo/bar/
  2. The trailing separator appears in issues Behavior of --skip with folder names #1915 and --skip doesn't work #2047, adding to the confusion when trying to understand what's going wrong.

It's clear to me that removing the trailing space does what end-users would expect. At least when dealing with directories. The only caveat could be -S foo/bar/, some end-users could interpret it as match foo/bar only if it is a directory. But this is far-fetched. The current behaviour is worse, -S doc/foo/baz/ simply cannot match anything.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could remove this specific commit for now, and then attempt to match directories only without the separator. That's more complicated, though, and better suited to a new PR later on. But then why would a user prepend a separator to a file name? That's really an obscure corner case that shouldn't ever happen. Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend accepting this commit for now. Then I can work on limiting the possibility of removing a trailing separator for directories only in an upcoming pull request.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry, I wasn't necessarily meaning you have to change it, just that the comment doesn't explain why for someone revisiting it. You've currently written "We paint the room red", I'm suggesting that "We paint the room red, to annoy the bull" will be more useful in the future. 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I have added something like "annoy the bull".

@@ -95,8 +95,10 @@ class QuietLevels(object):
class GlobMatch(object):
def __init__(self, pattern):
if pattern:
# Pattern might be a list of comma-delimited strings
self.pattern_list = ','.join(pattern).split(',')
# Pattern might be a list of comma-separated patterns,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what would happen here if you had a file you wanted to skip called foo,bar but that's probably one for another issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree about the use case of a file name including a comma. However, this pull request doesn't alter the behaviour of the -S foo,bar option, does it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sorry I'm basically reviewing the code when I'm reviewing your changes, given it's unlikely a TODO in here saying this will break with files with a comma in is sufficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the pattern separator is a comma, I think it's obvious we cannot handle patterns containing a comma. Perhaps we should make that clear in the documentation:
Because the comma is a pattern separator, patterns cannot include commas. In the unlikely event you need to match a path containing a comma, please consider working around this limitation using globs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could introduce an escape mechanism, perhaps \,. But I'd wait for actual users to ask for it...

Comment on lines 877 to 879
for root, dirs, files in os.walk(filename):
if glob_match.match(root): # skip (absolute) directories
rel_root = os.path.relpath(root, filename)
if glob_match.match(rel_root): # skip relative directory path
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading this right, given the file structure:
foo/bar/foo/baz

I think os.walk will descend and therefore test foo/, foo/bar/ and foo/bar/foo (if I'm reading it right). I think this means your relative check could match both foos, although I'm not sure this is necessarily an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which exact pattern do you have in mind?

  • -S foo matches used to match and still matches foo/ and foo/bar/foo/ - or so I think and will double-check.
  • -S foo/bar/foo will match only foo/bar/foo/.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't be fooled by the absolute → relative change in the comment. The comments were using absolute and relative incorrectly, I have just fixed the comments, not necessarily changed the behaviour.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've added the relpath function too, does this really have no impact?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relpath() function helps match the relative path foo/bar in addition to the "absolute" paths ./foo/bar or /other/path/to/top/level/foo/bar. I don't really see an issue here, but I may be missing something of course.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the tests I have added address your concerns. If not, happy to modify them or add more tests.

@peternewman
Copy link
Collaborator

This patch is obviously breaking backwards compatibility, as demonstrated by the test. Would it be worth keeping both behaviours for some time, and deprecate the old behaviour in a future release?

Yes! I hadn't really thought about it, but this will break all the places where I've configured skip in exactly the same way...

As per my initial issue ( #1915 (comment) ), is there some reason we can't just support all variants so people don't have to remember a particular version?

@DimitriPapadopoulos
Copy link
Collaborator Author

I see the previous variant as broken. In the case of a foo/bar/ subdirectory in the top-level directory to check, I believe (but will have to double-check) that:

  • skip $PWD -S ./foo/bar didn't work,
  • skip relative/path/to/directory -S relative/path/to/directory/foo/bar didn't work either.

That said, we could support all variants. It's just that the old variant seems totally broken to me, I spent half a day trying to make sense out of it.

@DimitriPapadopoulos
Copy link
Collaborator Author

I totally agree about adding tests -also to double-check my above assertions!

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Sep 13, 2021

The test directory looks like this:

$ mkdir codespell_test
$ 
$ mkdir codespell_test/foo
$ echo "errror" > codespell_test/foo/bar
$ 
$ mkdir -p codespell_test/doc/foo/baz
$ echo "errror" > codespell_test/doc/foo/baz/file
$ 
$ tree codespell_test
codespell_test
├── doc
│   └── foo
│       └── baz
│           └── file
└── foo
    └── bar

4 directories, 2 files
$ 

Here are some tests with the old behaviour, and as you can see, -S foo already used to skip the contents of both foo/ and doc/foo/:

$ codespell -S foo/bar,doc/foo/baz  # FAIL
./doc/foo/baz/file:1: errror  ==> error
./foo/bar:1: errror  ==> error
$ 
$ codespell -S ./foo/bar,./doc/foo/baz  # PASS
$ 
$ codespell -S ./foo/bar/,./doc/foo/baz/  # FAIL
./doc/foo/baz/file:1: errror  ==> error
./foo/bar:1: errror  ==> error
$ 
$ codespell -S foo  # PASS
$ 
$ codespell $PWD -S foo  # PASS
$ 
$ codespell "$PWD" -S "$PWD/foo/bar,$PWD/doc/foo/baz"  # PASS
$ 
$ codespell "$PWD" -S foo/bar,doc/foo/baz  # FAIL
/tmp/codespell_test/doc/foo/baz/file:1: errror  ==> error
/tmp/codespell_test/foo/bar:1: errror  ==> error
$ 
$ codespell -S "$PWD/foo/bar,$PWD/doc/foo/baz"  # FAIL
./doc/foo/baz/file:1: errror  ==> error
./foo/bar:1: errror  ==> error
$ 
$ codespell "$PWD" -S ./foo/bar,./doc/foo/baz  # FAIL
./doc/foo/baz/file:1: errror  ==> error
./foo/bar:1: errror  ==> error
$ 

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Sep 13, 2021

Here are some tests with the new behaviour:

$ codespell -S foo/bar,doc/foo/baz  # PASS
$ 
$ codespell -S ./foo/bar,./doc/foo/baz  # FAIL (but I plan on restoring the old variant!)
./doc/foo/baz/file:1: errror ==> error
./foo/bar:1: errror ==> error
$ 
$ codespell -S foo/bar/,doc/foo/baz/  # PASS (I agree that may feel strange because foo/bar is a file, not a directory)
$ 
$ codespell -S foo  # PASS
$ 
$ codespell $PWD -S foo  # PASS
$ 
$ codespell "$PWD" -S "$PWD/foo/bar,$PWD/doc/foo/baz"  # FAIl (but I plan on restoring the old variant!)
/tmp/codespell_test/doc/foo/baz/file:1: errror ==> error
/tmp/codespell_test/foo/bar:1: errror ==> error
$ 
$ codespell -S "$PWD/foo/bar,$PWD/doc/foo/baz"   # FAIL (as in the old variant)
./doc/foo/baz/file:1: errror ==> error
./foo/bar:1: errror ==> error
$ 
$ codespell "$PWD" -S ./foo/bar,./doc/foo/baz  # FAIL (as in the old variant)
/tmp/codespell_test/doc/foo/baz/file:1: errror ==> error
/tmp/codespell_test/foo/bar:1: errror ==> error
$ 
$ codespell "$PWD" -S foo/bar,doc/foo/baz  # PASS
$ 

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Sep 13, 2021

And now after the last commit, combining both behaviours:

$ codespell -S foo/bar,doc/foo/baz  # PASS
$ 
$ codespell -S ./foo/bar,./doc/foo/baz  # PASS
$ 
$ codespell -S foo/bar/,doc/foo/baz/  # PASS (I agree that may feel strange because foo/bar is a file, not a directory)
$ 
$ codespell -S foo  # PASS
$ 
$ codespell $PWD -S foo  # PASS
$ 
$ codespell "$PWD" -S "$PWD/foo/bar,$PWD/doc/foo/baz"  # PASS
/tmp/codespell_test/doc/foo/baz/file:1: errror ==> error
/tmp/codespell_test/foo/bar:1: errror ==> error
$ 
$ codespell -S "$PWD/foo/bar,$PWD/doc/foo/baz"   # FAIL (as in the old variant)
./doc/foo/baz/file:1: errror ==> error
./foo/bar:1: errror ==> error
$ 
$ codespell "$PWD" -S ./foo/bar,./doc/foo/baz  # FAIL (as in the old variant)
/tmp/codespell_test/doc/foo/baz/file:1: errror ==> error
/tmp/codespell_test/foo/bar:1: errror ==> error
$ 
$ codespell "$PWD" -S foo/bar,doc/foo/baz  # PASS
$ 

@DimitriPapadopoulos
Copy link
Collaborator Author

I would still recommend deprecating the old behaviour in the future, as it is counterintuitive and confusing. Anyway, for now we have both.

@peternewman
Copy link
Collaborator

That said, we could support all variants. It's just that the old variant seems totally broken to me, I spent half a day trying to make sense out of it.

I'm not disputing that, but equally I don't want to have to do a big bang fix on the many repos I've pushed spellchecking on which match that pattern.

I would still recommend deprecating the old behaviour in the future, as it is counterintuitive and confusing. Anyway, for now we have both.

Perfect thanks.

I guess we should introduce a message at some point (now?) telling people they should migrate their config?

@peternewman
Copy link
Collaborator

And now after the last commit, combining both behaviours:

Great thanks.

Did you also test with a file named foo?

Sorry, I meant a test for this behaviour:
https://github.com/codespell-project/codespell/tree/master/codespell_lib/tests

Not simply testing it works with the code. I.e. something that will automatically run and confirm if it changes in future. See what we've already got here (and possibly other tests):
https://github.com/codespell-project/codespell/blob/master/codespell_lib/tests/test_basic.py#L289-L313

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Sep 14, 2021

I think the message to end-users can wait. Let users get acquainted with the new behaviour first. However, I should probably guard the old behaviour with something like this:

#ifndef CODESPELL_DISCARD_DEPRECATED

This way future maintainers are informed the code should be deprecated, then removed at some point. How does this sound? Can you recommend a better alternative to CODESPELL_DISCARD_DEPRECATED?

EDIT: This is Python, not C 😄 so instead see #2058 (comment)

@DimitriPapadopoulos
Copy link
Collaborator Author

I'm not sure what you want me to test. A directory and a file named foo? Then -S foo will skip both, but that was already the case. Anyway, I'll look into the tests and try to add more tests.

@DimitriPapadopoulos
Copy link
Collaborator Author

I have add a phony option "deprecated" set to True, after parsing arguments and the configuration file. When we want to disable the deprecated patterns, it can be changed into a real option set to False by default.

@DimitriPapadopoulos
Copy link
Collaborator Author

I have also changed the documentation in the README.

@DimitriPapadopoulos
Copy link
Collaborator Author

I have restored these files to their previous state to avoid CI errors:

.github/workflows/codespell-private.yml
.github/workflows/codespell.yml

I will modify them in a subsequent PR if this one is merged.

@DimitriPapadopoulos
Copy link
Collaborator Author

@peternewman Lots of new tests added in test_ignore(). Do they look like what you had in mind?

@DimitriPapadopoulos
Copy link
Collaborator Author

@peternewman Do you need me to fix anything else? The CI error will vanish when #2069 is merged.

@DimitriPapadopoulos
Copy link
Collaborator Author

DimitriPapadopoulos commented Jan 4, 2022

@peternewman I think I have added enough tests. All of them pass after renaming ignoredir to topdir to avoid name clashes between the name chosen by pytest for its temporary directory and our own subdirectories.

The only remaining check failure appears to be with flake8, but that's a flake8 “feature” that I cannot fix in the source code. Please merge #2069 which fixes flake8 options.

@larsoner Could you have a look at this PR?

Edit: I have added even more tests with corner cases and found a discrepancy in the current handling of the old syntax (--skip ./). My suggestion would be to:

  1. add these tests to the current master branch, in a separate PR,
  2. rebase this PR on the above PR and make sure all corner cases are properly handled.

I have added additional tests in #2227. They show the current limitations of --skip I am trying to fix here. I am confident we will avoid regressions if the fixes pass the test that are not marked FIXME.

Shell completion often adds a separator '/' after directory names.
Because os.walk() return paths without a trailing separator, we have
to remove trailing separators from patterns to ensure a proper match.

Granted, we could attempt to remove the trailing separator '/' only
from "directory patterns", not from "file patterns", but patterns are
currently just patterns - we do not distinguish between "file patterns"
and "directory patterns". A future improvment could be to interpret
a pattern with a trailing "/" as "directory pattern", but then how
would we define a "file pattern"? It's probably better to find a
different mechanism and keep discarding the trailing separator from
patterns.
The --skip arguments are interpreted as paths relative to the
directories to check (the "file" arguments passed to codespell),
in addition to being interpreted "as is".

The benefit is that "codespell ." and "codespell $PWD" behave
the same, you just pass relative paths to --skip. No more adding
"./" to directory patterns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Behavior of --skip with folder names
2 participants