New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MBUILDCACHE-64] Exclusion mechanism bugfix #91
[MBUILDCACHE-64] Exclusion mechanism bugfix #91
Conversation
…nctionnality, from global and project configuration
6984d4e
to
70b1992
Compare
Miss-clicked on the sync button on my repo for this branch (I only wanted to sync master ...). |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit skeptical about the way which has been chosen to filter files. I think it would be more intuitive to use a glob PathMatcher
rather than a startsWith()
call on the path.
This seems inconsistent with what is usually done in the maven land, for example https://maven.apache.org/plugins/maven-jar-plugin/examples/include-exclude.html
Sure, I can do this. It will also bring a bit more flexibility. I suggest that I stay focused on the exclude section and I don't touch the input section. PS : Since I'm new there, hello everyone and great job! This project is a really good idea. I'm dying to be able to use it on my projects. :) |
/** | ||
* default glob, bbsdk/abfx specific | ||
*/ | ||
public static final String DEFAULT_GLOB = "{*.java,*.groovy,*.yaml,*.svcd,*.proto,*assembly.xml,assembly" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was not used except in a UT.
@Test | ||
public void testGetDirectoryFiles() { | ||
List<Path> directoryFiles = new ArrayList<>(); | ||
public void testCollectFilteredFiles() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned dead code and updated a test with a wrong assumption. Not the most useful test but better than before.
inputExcludeDirectoryPathMatchers.add(new TreeWalkerPathMatcher(pathMatcherGlob, false)); | ||
} | ||
|
||
// If the pattern ends with "/**", the directory exclude list gets an extra version without this suffix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important for a folder containing a lot of files, as "node_modules" for example.
Changed the exclude mechanism to use path matchers. A bit more than a bugfix now, feedbacks appreciated if you see any lack, implementation detail you disagree with or extra test case to add. |
@kbuntrock nicely done, it would be useful to update the sample config as well https://github.com/apache/maven-build-cache-extension/blob/master/src/site/resources/maven-build-cache-config.xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using paths when working with directories feels more natural compared to analyzing globs. Separate abstractions for normalized paths and(or) globs might make the code more robust.
/** | ||
* Wrapped regular path matcher | ||
*/ | ||
private final PathMatcher pathMatcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it lead to OS-specific errors? For example, if we replace the slashes, could that break the path matching on Windows? Ideally, we need to encapsulate Path
s and work on the normalized paths only to ensure portability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not, the underlying conversion from glob to regex pattern (sun.nio.fs.Globs#toRegexPattern) used by different FS pathmatcher implementations (ex : sun.nio.fs.WindowsFileSystem#getPathMatcher) seems to handle FS specific syntax conversion since java 8. (https://github.com/adoptium/jdk8u)
But PathMatching is not something I've dealt with a lot. Do you think we should take a belt and braces approach ? (french expression, but it looks like existing in English too ^^)
PS : On a side note, I'm always testing on Windows (but not all jvm ...) since I primarily work on Windows those days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be room for error here. For example, the same file with back/forward slashes is not matched on my workstation:
String linux = "dir/file.txt";
String win = "dir\\file.txt";
Path linuxPath = Paths.get(linux);
Path winPath = Paths.get(win);
System.out.println("linuxPath: " + linuxPath);
System.out.println("winPath: " + winPath);
PathMatcher linuxMatcher = FileSystems.getDefault().getPathMatcher("glob:" + linuxPath);
System.out.println(linuxMatcher.matches(winPath));
System.out.println(linuxMatcher.matches(linuxPath));
------OUT----
linuxPath: dir/file.txt
winPath: dir\file.txt
false
true
I have no objections against the matcher as such. But I wouldn't rely on it without a test demonstrating the same results on Windows and Mac in the presence of exclusions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could use some intel about the current IT tests on different platforms and how to launch them / develop some for not yet supported platforms.
And I get that it could be useful to normalize it before going through the implementation of "getPathMatcher" in order to reduce bug investigation in case of an exotic "getPathMatcher" implementation, I will think about adding this extra step.
/** | ||
* Glob suffix meaning "all files in this directory or any sub-directory" | ||
*/ | ||
private static final String GLOB_SX_ALL_SUB_FILES = "/**"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect /*
for all sub-files. GLOB_SX_FULL_SUB_TREE
could be a more accurate naming for a full depth subtree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, fixed
@@ -165,45 +184,94 @@ public MavenProjectInput( | |||
this.repoSystem = repoSystem; | |||
this.remoteCache = remoteCache; | |||
Properties properties = project.getProperties(); | |||
this.dirGlob = properties.getProperty(CACHE_INPUT_GLOB_NAME, config.getDefaultGlob()); | |||
this.defaultFilenameGlob = properties.getProperty(CACHE_INPUT_GLOB_NAME, config.getDefaultGlob()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The glob is not limited to filenames, and the exact format is not guaranteed, etc. There is no base to incorporate the content assumption in the variable name. Better naming would be projectGlob
, which properly reflects the intention - a project-specific glob, overriding default if specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, changed by the suggested name.
this.processPlugins = | ||
Boolean.parseBoolean(properties.getProperty(CACHE_PROCESS_PLUGINS, config.isProcessPlugins())); | ||
this.tmpDir = System.getProperty("java.io.tmpdir"); | ||
|
||
this.baseDirectoryGlob = baseDirPath.toString().replace("\\", "/") + "/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, this should probably be normalized to a <path>/**
format, or the variable should be renamed *Path
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant is used as a prefix and concatenated with other values. Renamed it to baseDirectoryGlobPrefix
to improve the readability.
} | ||
|
||
@Override | ||
public boolean matches(Path path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need a wrapper type here to ensure consistency in the path matching. The second, less preferable option is to normalize paths within the matcher. But in the latter case, the normalized paths cannot be reused in the other parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean in case we decide to do some extra work about OS specific filesystems? Or anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paths could be expressed differently, and before matching, it should be normalized - placeholders resolved, etc.
PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:./file.txt");
System.out.println("./file.txt -> "+ Paths.get("./file.txt").normalize().toAbsolutePath());
System.out.println("./dir/../file.txt -> "+ Paths.get("dir/../file.txt").normalize().toAbsolutePath());
System.out.println(pathMatcher.matches(Paths.get("dir/../file.txt")));
----Out---
./file.txt -> <dirpath>/sandbox/file.txt
./dir/../file.txt -> <dirpath>/sandbox/file.txt
false
The matcher might be required to normalize paths to match reliably. Just taking any path as an input might not work. It's better to have clear semantics here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. As I do not want to go too far with this MR, I suggest to do like the WalkKey class for the include section -> naming the path in the "matches" function : "normalized". To indicate to a developer that a provided path should be normalized.
Added to the list of topics to be discussed on Thursday.
inputExcludePathMatchers.add(new TreeWalkerPathMatcher(pathMatcherGlob, false)); | ||
|
||
// Globs ending with "**" should end any sub-directory inspection | ||
if (pathMatcherGlob.endsWith("**")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very thin ice. Glob syntax is rich. For example, it could be [*][*]
or /my/fav/dir/{**}
or /my/*/dir/**
. It is a fragile approach to interpreting the globs based on the content. excludedValue
could be anything, and the logic might not be applied properly. That's why the original logic operated with normalized paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree, even if the flexibility offered is pretty nice, I wasn't so pleased with the complexity of this demand.
As a trade-off, my strategy was :
- if the syntax is not written in an exotic manner : exclusion is fast as it can stop tree walking
- if not : exclusion is slower but works as expected
@@ -526,6 +603,10 @@ public FileVisitResult visitFileFailed(Path path, IOException exc) throws IOExce | |||
}); | |||
} | |||
|
|||
private boolean entryMustBeSkipped(Path path) { | |||
return inputExcludePathMatchers.stream().anyMatch(pathMatcher -> pathMatcher.matches(path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat picking: use method reference instead of the lambda - PathMatcher::matches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't, "matches" takes a parameter (or you refer to a syntax I don't know).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, iteration is over the matchers, not paths 🤦
} | ||
} | ||
return false; | ||
return inputExcludeDirectoryPathMatchers.stream().anyMatch(matcher -> matcher.stopTreeWalking(path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for skipping dirs should be paths based. Globs should be used as globs without any content assumptions. To get the benefits of both approaches, creating a wrapper class for the globs might be better. And if the glob is path-based (eg derived by adding /**
, some optimizations could be applied). Which will keep consistent values of the original literal, derived path, and the derived glob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following. Could you give me some code snippets to illustrate your reasoning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there are different semantics now and before for project-level properties.
Before: declaring <exclude.me>mydir/mysubdir</exclude.me>
exclusion was filtering out the whole subtree (through the Path#startsWith
)
Now: declaring mydir/mysubdir
doesn't exclude anything.
FileSystems.getDefault().getPathMatcher("glob:dir");
System.out.println(pathMatcher.matches(Paths.get("dir/file.txt")));`
-----OUT-----
false
This is what I meant - properties might assume the exclusion of directories. For consistency, we must update docs, keep the path semantics, or migrate properties to globs/regex semantics, delegating filtering expressions to the users. But the main point is that it should be consistent with inclusion semantics (and maven in general), transparent to an end user, and deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT : My response was wrong. Removing its content to improve readability.
I agree with the point, but I did not want to go too far with this MR. :P
normalizedPath(build.getDirectory()), // target by default | ||
normalizedPath(build.getOutputDirectory()), | ||
normalizedPath(build.getTestOutputDirectory()))); | ||
addToExcludedSection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is room for a small optimization here - not adding sub-dirs of the target
will reduce the length of the filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Added it. I did a naive implementation to keep the code simple.
@@ -0,0 +1,107 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing this test. It is long past due to implement 👍
Done it. Wasn't comfortable with this file since I found it confusing when I discovered this extension + it was still valid with my changes. Added some comments on some related other parameters around. Is it ok for you @maximilian-novikov-db ? |
I'm starting to wonder if the use of the JDK |
Well, good news, we are already using @ALL : Seems this PR draw attention to details about this functionality but I would suggest to not go too far with it. |
* Add a value from the excluded section list to the directories and/or the filenames ban list. | ||
* @param excludedValue a value from the exclude list | ||
*/ | ||
private void addToExcludedSection(String excludedValue, boolean addProjectBaseDir) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need the addProjectBaseDir
parameter? Intuitively, the project dir should always be prepended. Could that lead to an exclude
applied to an unexpected subtree? Like dir1/file.txt
applied to <root>/dir2/dir1/file.txt
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's because of the default exclude sections. Their values have already the project dir prepended. But I'm open to suggestions.
(addProjectBaseDir ? baseDirectoryGlobPrefix : "") | ||
+ | ||
// If the glob start with "/", we remove it since it's already added in the added basedir glob | ||
(excludedValue.startsWith("/") ? excludedValue.substring(1) : excludedValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stripping the leading /
from user-supplied parameters might be confusing. a common convention is that /
denotes an absolute path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. What about forbidding a leading "/" ? I don't see any use of an absolute glob exclusion.
<!-- Exclude directory in an extra included folder --> | ||
<maven.build.cache.exclude.subfolder>**/excluded_subfolder/**</maven.build.cache.exclude.subfolder> | ||
<!-- Exclude by filepath + filename --> | ||
<maven.build.cache.exclude.filepath>folder_outside_src/this_one_should_NOT_be_scanned.txt</maven.build.cache.exclude.filepath> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be good to cover ambiguous cases here:
- just
<maven.build.cache.exclude.subfolder>excluded_subfolder</maven.build.cache.exclude.subfolder>
- previously, I expected to have the whole dir subtree excluded consistently withsecond_folder_outside_src
. Now - it excludes only a file with this name. - just
<maven.build.cache.exclude.subfolder>excluded_subfolder/**</maven.build.cache.exclude.subfolder>
- it should exclude<dirRoot>/excluded_subfolder/**
, but not<dirRoot>/subfolder/excluded_subfolder/**
<maven.build.cache.exclude.subfolder>/tmp/**</maven.build.cache.exclude.subfolder>
- I believe it should exclude the absolute path, e.g. all tmp files
<maven.build.cache.input.extraFile>extraFile.txt</maven.build.cache.input.extraFile> | ||
<maven.build.cache.input.extraFolder>second_folder_outside_src</maven.build.cache.input.extraFolder> | ||
<!-- Exclude directory in an extra included folder --> | ||
<maven.build.cache.exclude.subfolder>**/excluded_subfolder/**</maven.build.cache.exclude.subfolder> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please notice that the documentation describes the parameters as follows:
Paths to exclude from source code search. Relative paths calculated from the current project/module root
Please notice the documentation doesn't say this is a pattern, a glob, or an ant expression. We need to update the documentation to reflect the support for globs and how they are applied. The key issue is whether the base dir is prepended, with or without leading slashes. To make it deterministic, we might follow this approach:
- If the pattern starts with
/
, it is considered an absolute path expression - If not, any path expression is prepended by the absolute project dir path
- The content of the properties will be interpreted as a glob expression.
About this PR, I'm still working on it. Even if I couldn't find the time the last past weeks. |
67faf5a
to
0801c66
Compare
0801c66
to
fb8e211
Compare
fb8e211
to
eeaa49f
Compare
Re-opening the PR for merge. It's now a mix between the very first implementation and the one with globs, and it's combining better with the input section mechanism. I feel much more at ease with this version. We can use only path declaration, only glob declaration, or combine both if we want the best trade-off between performances and functionalities. Here a small list of improvements:
Here is the documentation presenting it : And the updated "project property" page. Looking forward to your feedbacks. :) Especially in the project property section, I'm undecided about the naming of the exclusion value : "maven.build.cache.exclude.value.my-exclusion" versus "maven.build.cache.exclude.my-exclusion". (currently the first option, simpler syntax / readability with the additional properties like "maven.build.cache.exclude.glob.my-exclusion") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description of this pull request
Fix a bug regarding the exclusion mechanism not working properly on folders.
It also adds an integration test on this functionality (global + per project) + some documentation on these aspects.
Here a screenshot of the new "exclude" element in the documentation to understand how this fix intend to work.
I have not implemented the following sentence in the Jira issue since I don't think we need to overthink the feature without some use case examples :
Nevertheless, I think this MR is already a huge improvement on the exclusion topic since it's an important feature and it was not possible to use it on folders before (actually inconvenient to use the extension on a project with a "node_modules" folder for example).
MR checklist
Following this checklist to help us incorporate your
contribution quickly and easily:
for the change (usually before you start working on it). Trivial changes like typos do not
require a JIRA issue. Your pull request should address just this issue, without
pulling in other changes.
[MBUILDCACHE-XXX] - Fixes bug in ApproximateQuantiles
,where you replace
MBUILDCACHE-XXX
with the appropriate JIRA issue. Best practiceis to use the JIRA issue title in the pull request title and in the first line of the
commit message.
mvn clean verify
to make sure basic checks pass. A more thorough check willbe performed on your pull request automatically.
If your pull request is about ~20 lines of code you don't need to sign an
Individual Contributor License Agreement if you are unsure
please ask on the developers list.
To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.
I hereby declare this contribution to be licenced under the Apache License Version 2.0, January 2004
In any other case, please file an Apache Individual Contributor License Agreement.