Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MBUILDCACHE-64] Exclusion mechanism bugfix #91

Merged
merged 7 commits into from Oct 12, 2023

Conversation

kbuntrock
Copy link
Contributor

@kbuntrock kbuntrock commented Aug 1, 2023

Description of this pull request

Fix a bug regarding the exclusion mechanism not working properly on folders.
It also adds an integration test on this functionality (global + per project) + some documentation on these aspects.

Here a screenshot of the new "exclude" element in the documentation to understand how this fix intend to work.
image

I have not implemented the following sentence in the Jira issue since I don't think we need to overthink the feature without some use case examples :

"That function could be enhanced with an additional criterion like in MavenProjectInput.java#L510"

Nevertheless, I think this MR is already a huge improvement on the exclusion topic since it's an important feature and it was not possible to use it on folders before (actually inconvenient to use the extension on a project with a "node_modules" folder for example).

MR checklist

Following this checklist to help us incorporate your
contribution quickly and easily:

  • Make sure there is a MBUILDCACHE JIRA issue filed
    for the change (usually before you start working on it). Trivial changes like typos do not
    require a JIRA issue. Your pull request should address just this issue, without
    pulling in other changes.
  • Each commit in the pull request should have a meaningful subject line and body.
  • Format the pull request title like [MBUILDCACHE-XXX] - Fixes bug in ApproximateQuantiles,
    where you replace MBUILDCACHE-XXX with the appropriate JIRA issue. Best practice
    is to use the JIRA issue title in the pull request title and in the first line of the
    commit message.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will
    be performed on your pull request automatically.
  • You have run the Core IT successfully.

If your pull request is about ~20 lines of code you don't need to sign an
Individual Contributor License Agreement if you are unsure
please ask on the developers list.

To make clear that you license your contribution under
the Apache License Version 2.0, January 2004
you have to acknowledge this by using the following check-box.

@kbuntrock kbuntrock changed the title [MBUILDCACHE-64] Exclusion mechanism bugfix + improvement [MBUILDCACHE-64] Exclusion mechanism bugfix Aug 2, 2023
@olamy olamy added the bug Something isn't working label Aug 3, 2023
@kbuntrock kbuntrock force-pushed the MBUILDCACHE-64-exclusions-improvement branch from 6984d4e to 70b1992 Compare August 3, 2023 20:49
@kbuntrock
Copy link
Contributor Author

Miss-clicked on the sync button on my repo for this branch (I only wanted to sync master ...).
So I finished to rebase it on master to keep the tree clean. But it reseted the workflow, sorry. :s

@olamy olamy requested a review from gnodet August 4, 2023 01:48
@olamy
Copy link
Member

olamy commented Aug 4, 2023

LGTM

Copy link
Contributor

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit skeptical about the way which has been chosen to filter files. I think it would be more intuitive to use a glob PathMatcher rather than a startsWith() call on the path.
This seems inconsistent with what is usually done in the maven land, for example https://maven.apache.org/plugins/maven-jar-plugin/examples/include-exclude.html

@kbuntrock
Copy link
Contributor Author

I'm a bit skeptical about the way which has been chosen to filter files. I think it would be more intuitive to use a glob PathMatcher rather than a startsWith() call on the path. This seems inconsistent with what is usually done in the maven land, for example https://maven.apache.org/plugins/maven-jar-plugin/examples/include-exclude.html

Sure, I can do this. It will also bring a bit more flexibility. I suggest that I stay focused on the exclude section and I don't touch the input section.

PS : Since I'm new there, hello everyone and great job! This project is a really good idea. I'm dying to be able to use it on my projects. :)

/**
* default glob, bbsdk/abfx specific
*/
public static final String DEFAULT_GLOB = "{*.java,*.groovy,*.yaml,*.svcd,*.proto,*assembly.xml,assembly"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was not used except in a UT.

@Test
public void testGetDirectoryFiles() {
List<Path> directoryFiles = new ArrayList<>();
public void testCollectFilteredFiles() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned dead code and updated a test with a wrong assumption. Not the most useful test but better than before.

inputExcludeDirectoryPathMatchers.add(new TreeWalkerPathMatcher(pathMatcherGlob, false));
}

// If the pattern ends with "/**", the directory exclude list gets an extra version without this suffix.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important for a folder containing a lot of files, as "node_modules" for example.

@kbuntrock kbuntrock requested review from gnodet and olamy August 7, 2023 17:08
@kbuntrock
Copy link
Contributor Author

Changed the exclude mechanism to use path matchers. A bit more than a bugfix now, feedbacks appreciated if you see any lack, implementation detail you disagree with or extra test case to add.

@maximilian-novikov-db
Copy link
Contributor

Changed the exclude mechanism to use path matchers. A bit more than a bugfix now, feedbacks appreciated if you see any lack, implementation detail you disagree with or extra test case to add.

@kbuntrock nicely done, it would be useful to update the sample config as well https://github.com/apache/maven-build-cache-extension/blob/master/src/site/resources/maven-build-cache-config.xml

Copy link
Contributor

@AlexanderAshitkin AlexanderAshitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using paths when working with directories feels more natural compared to analyzing globs. Separate abstractions for normalized paths and(or) globs might make the code more robust.

/**
* Wrapped regular path matcher
*/
private final PathMatcher pathMatcher;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it lead to OS-specific errors? For example, if we replace the slashes, could that break the path matching on Windows? Ideally, we need to encapsulate Paths and work on the normalized paths only to ensure portability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not, the underlying conversion from glob to regex pattern (sun.nio.fs.Globs#toRegexPattern) used by different FS pathmatcher implementations (ex : sun.nio.fs.WindowsFileSystem#getPathMatcher) seems to handle FS specific syntax conversion since java 8. (https://github.com/adoptium/jdk8u)

But PathMatching is not something I've dealt with a lot. Do you think we should take a belt and braces approach ? (french expression, but it looks like existing in English too ^^)

PS : On a side note, I'm always testing on Windows (but not all jvm ...) since I primarily work on Windows those days.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be room for error here. For example, the same file with back/forward slashes is not matched on my workstation:

        String linux = "dir/file.txt";
        String win = "dir\\file.txt";
        Path linuxPath = Paths.get(linux);
        Path winPath = Paths.get(win);
        System.out.println("linuxPath: " + linuxPath);
        System.out.println("winPath: " + winPath);
        PathMatcher linuxMatcher = FileSystems.getDefault().getPathMatcher("glob:" + linuxPath);
        System.out.println(linuxMatcher.matches(winPath));
        System.out.println(linuxMatcher.matches(linuxPath));
        ------OUT----
        linuxPath: dir/file.txt
        winPath: dir\file.txt
        false
        true

I have no objections against the matcher as such. But I wouldn't rely on it without a test demonstrating the same results on Windows and Mac in the presence of exclusions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could use some intel about the current IT tests on different platforms and how to launch them / develop some for not yet supported platforms.
And I get that it could be useful to normalize it before going through the implementation of "getPathMatcher" in order to reduce bug investigation in case of an exotic "getPathMatcher" implementation, I will think about adding this extra step.

/**
* Glob suffix meaning "all files in this directory or any sub-directory"
*/
private static final String GLOB_SX_ALL_SUB_FILES = "/**";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect /* for all sub-files. GLOB_SX_FULL_SUB_TREE could be a more accurate naming for a full depth subtree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, fixed

@@ -165,45 +184,94 @@ public MavenProjectInput(
this.repoSystem = repoSystem;
this.remoteCache = remoteCache;
Properties properties = project.getProperties();
this.dirGlob = properties.getProperty(CACHE_INPUT_GLOB_NAME, config.getDefaultGlob());
this.defaultFilenameGlob = properties.getProperty(CACHE_INPUT_GLOB_NAME, config.getDefaultGlob());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The glob is not limited to filenames, and the exact format is not guaranteed, etc. There is no base to incorporate the content assumption in the variable name. Better naming would be projectGlob, which properly reflects the intention - a project-specific glob, overriding default if specified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, changed by the suggested name.

this.processPlugins =
Boolean.parseBoolean(properties.getProperty(CACHE_PROCESS_PLUGINS, config.isProcessPlugins()));
this.tmpDir = System.getProperty("java.io.tmpdir");

this.baseDirectoryGlob = baseDirPath.toString().replace("\\", "/") + "/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, this should probably be normalized to a <path>/** format, or the variable should be renamed *Path.

Copy link
Contributor Author

@kbuntrock kbuntrock Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant is used as a prefix and concatenated with other values. Renamed it to baseDirectoryGlobPrefix to improve the readability.

}

@Override
public boolean matches(Path path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a wrapper type here to ensure consistency in the path matching. The second, less preferable option is to normalize paths within the matcher. But in the latter case, the normalized paths cannot be reused in the other parts of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in case we decide to do some extra work about OS specific filesystems? Or anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths could be expressed differently, and before matching, it should be normalized - placeholders resolved, etc.

PathMatcher pathMatcher = FileSystems.getDefault().getPathMatcher("glob:./file.txt");
System.out.println("./file.txt -> "+  Paths.get("./file.txt").normalize().toAbsolutePath());
System.out.println("./dir/../file.txt -> "+  Paths.get("dir/../file.txt").normalize().toAbsolutePath());
System.out.println(pathMatcher.matches(Paths.get("dir/../file.txt")));
----Out---
./file.txt -> <dirpath>/sandbox/file.txt
./dir/../file.txt -> <dirpath>/sandbox/file.txt
false

The matcher might be required to normalize paths to match reliably. Just taking any path as an input might not work. It's better to have clear semantics here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it. As I do not want to go too far with this MR, I suggest to do like the WalkKey class for the include section -> naming the path in the "matches" function : "normalized". To indicate to a developer that a provided path should be normalized.

Added to the list of topics to be discussed on Thursday.

inputExcludePathMatchers.add(new TreeWalkerPathMatcher(pathMatcherGlob, false));

// Globs ending with "**" should end any sub-directory inspection
if (pathMatcherGlob.endsWith("**")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very thin ice. Glob syntax is rich. For example, it could be [*][*] or /my/fav/dir/{**} or /my/*/dir/**. It is a fragile approach to interpreting the globs based on the content. excludedValue could be anything, and the logic might not be applied properly. That's why the original logic operated with normalized paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree, even if the flexibility offered is pretty nice, I wasn't so pleased with the complexity of this demand.
As a trade-off, my strategy was :

  • if the syntax is not written in an exotic manner : exclusion is fast as it can stop tree walking
  • if not : exclusion is slower but works as expected

@@ -526,6 +603,10 @@ public FileVisitResult visitFileFailed(Path path, IOException exc) throws IOExce
});
}

private boolean entryMustBeSkipped(Path path) {
return inputExcludePathMatchers.stream().anyMatch(pathMatcher -> pathMatcher.matches(path));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat picking: use method reference instead of the lambda - PathMatcher::matches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't, "matches" takes a parameter (or you refer to a syntax I don't know).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, iteration is over the matchers, not paths 🤦

}
}
return false;
return inputExcludeDirectoryPathMatchers.stream().anyMatch(matcher -> matcher.stopTreeWalking(path));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for skipping dirs should be paths based. Globs should be used as globs without any content assumptions. To get the benefits of both approaches, creating a wrapper class for the globs might be better. And if the glob is path-based (eg derived by adding /**, some optimizations could be applied). Which will keep consistent values of the original literal, derived path, and the derived glob.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following. Could you give me some code snippets to illustrate your reasoning?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there are different semantics now and before for project-level properties.
Before: declaring <exclude.me>mydir/mysubdir</exclude.me> exclusion was filtering out the whole subtree (through the Path#startsWith)
Now: declaring mydir/mysubdir doesn't exclude anything.

FileSystems.getDefault().getPathMatcher("glob:dir");
System.out.println(pathMatcher.matches(Paths.get("dir/file.txt")));`
-----OUT-----
false

This is what I meant - properties might assume the exclusion of directories. For consistency, we must update docs, keep the path semantics, or migrate properties to globs/regex semantics, delegating filtering expressions to the users. But the main point is that it should be consistent with inclusion semantics (and maven in general), transparent to an end user, and deterministic.

Copy link
Contributor Author

@kbuntrock kbuntrock Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT : My response was wrong. Removing its content to improve readability.

I agree with the point, but I did not want to go too far with this MR. :P

normalizedPath(build.getDirectory()), // target by default
normalizedPath(build.getOutputDirectory()),
normalizedPath(build.getTestOutputDirectory())));
addToExcludedSection(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is room for a small optimization here - not adding sub-dirs of the target will reduce the length of the filter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Added it. I did a naive implementation to keep the code simple.

@@ -0,0 +1,107 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing this test. It is long past due to implement 👍

@kbuntrock
Copy link
Contributor Author

@kbuntrock nicely done, it would be useful to update the sample config as well https://github.com/apache/maven-build-cache-extension/blob/master/src/site/resources/maven-build-cache-config.xml

Done it. Wasn't comfortable with this file since I found it confusing when I discovered this extension + it was still valid with my changes. Added some comments on some related other parameters around. Is it ok for you @maximilian-novikov-db ?

@gnodet
Copy link
Contributor

gnodet commented Aug 25, 2023

I'm starting to wonder if the use of the JDK PathMatcher is sufficient here. If we want to optimize the file system access and not traverse all files / subdirectories, we'll have to use Files.walkFileTree as it has built-in support for skipping subtrees. However, I'm not sure there is any good implementation of glob syntax that would skip subtree when known to not match.

@kbuntrock
Copy link
Contributor Author

I'm starting to wonder if the use of the JDK PathMatcher is sufficient here. If we want to optimize the file system access and not traverse all files / subdirectories, we'll have to use Files.walkFileTree as it has built-in support for skipping subtrees. However, I'm not sure there is any good implementation of glob syntax that would skip subtree when known to not match.

Well, good news, we are already using Files#walkFileTree. One of the discussion is precisely how I managed to keep as well as possible the capacity to skip subtree scanning (it is a key point for the projects I want to use this extension on).

@ALL : Seems this PR draw attention to details about this functionality but I would suggest to not go too far with it.
I already agreed to change it from a bugfix to an evolution. Maybe the current state is good enough for now. And if good ideas emerge they can be implemented in another PR. I have the feeling that in term of functionality / performance / test and documentation, this PR brings already a lot to the table. And people will be easily capable after merging to change some details they would have seen differently.

* Add a value from the excluded section list to the directories and/or the filenames ban list.
* @param excludedValue a value from the exclude list
*/
private void addToExcludedSection(String excludedValue, boolean addProjectBaseDir) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need the addProjectBaseDir parameter? Intuitively, the project dir should always be prepended. Could that lead to an exclude applied to an unexpected subtree? Like dir1/file.txt applied to <root>/dir2/dir1/file.txt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's because of the default exclude sections. Their values have already the project dir prepended. But I'm open to suggestions.

(addProjectBaseDir ? baseDirectoryGlobPrefix : "")
+
// If the glob start with "/", we remove it since it's already added in the added basedir glob
(excludedValue.startsWith("/") ? excludedValue.substring(1) : excludedValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripping the leading / from user-supplied parameters might be confusing. a common convention is that / denotes an absolute path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. What about forbidding a leading "/" ? I don't see any use of an absolute glob exclusion.

<!-- Exclude directory in an extra included folder -->
<maven.build.cache.exclude.subfolder>**/excluded_subfolder/**</maven.build.cache.exclude.subfolder>
<!-- Exclude by filepath + filename -->
<maven.build.cache.exclude.filepath>folder_outside_src/this_one_should_NOT_be_scanned.txt</maven.build.cache.exclude.filepath>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to cover ambiguous cases here:

  1. just <maven.build.cache.exclude.subfolder>excluded_subfolder</maven.build.cache.exclude.subfolder> - previously, I expected to have the whole dir subtree excluded consistently with second_folder_outside_src. Now - it excludes only a file with this name.
  2. just <maven.build.cache.exclude.subfolder>excluded_subfolder/**</maven.build.cache.exclude.subfolder> - it should exclude <dirRoot>/excluded_subfolder/**, but not <dirRoot>/subfolder/excluded_subfolder/**
  3. <maven.build.cache.exclude.subfolder>/tmp/**</maven.build.cache.exclude.subfolder> - I believe it should exclude the absolute path, e.g. all tmp files

<maven.build.cache.input.extraFile>extraFile.txt</maven.build.cache.input.extraFile>
<maven.build.cache.input.extraFolder>second_folder_outside_src</maven.build.cache.input.extraFolder>
<!-- Exclude directory in an extra included folder -->
<maven.build.cache.exclude.subfolder>**/excluded_subfolder/**</maven.build.cache.exclude.subfolder>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please notice that the documentation describes the parameters as follows:

Paths to exclude from source code search. Relative paths calculated from the current project/module root

Please notice the documentation doesn't say this is a pattern, a glob, or an ant expression. We need to update the documentation to reflect the support for globs and how they are applied. The key issue is whether the base dir is prepended, with or without leading slashes. To make it deterministic, we might follow this approach:

  • If the pattern starts with /, it is considered an absolute path expression
  • If not, any path expression is prepended by the absolute project dir path
  • The content of the properties will be interpreted as a glob expression.

@kbuntrock
Copy link
Contributor Author

kbuntrock commented Sep 28, 2023

About this PR, I'm still working on it. Even if I couldn't find the time the last past weeks.
I turn it in draft @maximilian-novikov-db et @AlexanderAshitkin, I think I'll be able to submit a more consensual version in the next days.

@kbuntrock kbuntrock marked this pull request as draft September 28, 2023 12:20
@kbuntrock kbuntrock force-pushed the MBUILDCACHE-64-exclusions-improvement branch from 67faf5a to 0801c66 Compare October 2, 2023 21:20
@kbuntrock kbuntrock marked this pull request as ready for review October 2, 2023 21:24
@kbuntrock kbuntrock force-pushed the MBUILDCACHE-64-exclusions-improvement branch from 0801c66 to fb8e211 Compare October 2, 2023 21:28
@kbuntrock kbuntrock force-pushed the MBUILDCACHE-64-exclusions-improvement branch from fb8e211 to eeaa49f Compare October 2, 2023 21:31
@kbuntrock
Copy link
Contributor Author

kbuntrock commented Oct 3, 2023

Re-opening the PR for merge.

It's now a mix between the very first implementation and the one with globs, and it's combining better with the input section mechanism.

I feel much more at ease with this version. We can use only path declaration, only glob declaration, or combine both if we want the best trade-off between performances and functionalities.

Here a small list of improvements:

  • the glob interpretation can be configured to better handle any specific case (= less code to handle the specific cases ... :p )
  • easier to test the exclusion mechanism with a unit test (added one)
  • unit + IT test about unix/dos path style (and tested on Windows and Debian)
  • handling absolute path like "inputs" (even if I suppose it will only used to test a configuration)

Here is the documentation presenting it :

image

And the updated "project property" page.

image

Looking forward to your feedbacks. :)

Especially in the project property section, I'm undecided about the naming of the exclusion value : "maven.build.cache.exclude.value.my-exclusion" versus "maven.build.cache.exclude.my-exclusion". (currently the first option, simpler syntax / readability with the additional properties like "maven.build.cache.exclude.glob.my-exclusion")

Copy link
Contributor

@AlexanderAshitkin AlexanderAshitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@olamy olamy merged commit 6eb05e6 into apache:master Oct 12, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
5 participants