Run tests in parallel on Windows and always cleanup test directories #5879

chrisd8088 · 2024-10-05T23:20:29Z

Since we revised our test suite to use the prove command in PR #3125, an undiagnosed bug in our lfstest-count-tests test helper program has prevented us from running our test scripts in parallel on Windows. Instead, several workarounds were introduced in that PR to avoid counting parallel test script executions and run all our test scripts sequentially on Windows, which slows down our CI jobs.

(The lfstest-count-tests program is used to track the number of active test scripts, so our lfstest-gitserver test server program can be started once, shared by all our test scripts, and then stopped after they are all complete. This works as expected on Linux and macOS systems.)

In this PR, we fix the bug in the lfstest-count-tests program, which is the result of the fact that we never call Close() on the test_count.lock file we open to ensure exclusive access to the test_count file. On Unix systems, this causes no problems, because when we call Remove() to delete the lock file, the operating system allows us to unlink the file while a file handle remains open. On Windows, however, this is not permitted, so the Remove() call does not succeed, leaving the lock file in place. Thus when we run the lfstest-count-tests program a second time, it is unable to acquire the lock and exits with an error.

With the bug fixed, we then remove the workarounds from PR #3125, which set the GIT_LFS_NO_TEST_COUNT and GIT_LFS_LOCK_ACQUIRE_DISABLED environment variables on Windows and use their presence to avoid running the lfstest-count-tests program and, when it is run, to avoid creating or removing the lock file.

In addition, we also make sure our test suite once again removes all the temporary directories it creates to hold the Git repositories created by our tests. Although we set the RM_GIT_LFS_TEST_DIR environment variable to yes when we use mktemp(1) to create a temporary directory, this is a remnant of PR #1107, and has not been honoured since the relevant cleanup code was removed from our shutdown() shell function in PR #3125. In this PR, we restore the cleanup steps from the original implementation.

Moreover, we ensure that we export the GIT_LFS_TEST_DIR and RM_GIT_LFS_TEST_DIR environment variables so that our test scripts can reuse the same temporary directory, instead of each script creating a unique directory, as happens now.

At present, running the make test build target on any platform results in a large number of stale temporary directories being left behind, which no longer occurs with the changes in this PR.

Finally, we remove some unused test variables, improve the lfstest-count-tests program's handling of file permissions and errors, revise our Linux RPM package builds to run more test scripts in parallel, and skip installing Strawberry Perl on Windows as our CI workflow now installs the Git for Windows SDK, which includes Perl and the prove command.

This PR will be most easily reviewed on a commit-by-commit basis, with whitespace differences ignored. Each commit has a detailed description of its changes.

The SHUTDOWN_LFS environment variable was added as part of the original integration test suite implementation in commit b1f5025 of PR git-lfs#306, and was used to indicate whether the Git server which runs as part of the test suite should be stopped by the shutdown() shell function at the end of a set of tests, or only when the last set of tests completed. When the integration test suite was refactored to use the "prove" command and to align with Git's test suite, in PR git-lfs#3125, most references to the SHUTDOWN_LFS environment variable were removed, particularly in commits 1926e94 and a4cc106. However, one reference was left in the t/testlib.sh script, so we remove it now.

The LFS_CONFIG variable, which we set in our t/testenv.sh library, is not referenced by any part of our current test suite. The last use of this variable was removed in 2015 in commit 1feee58 of PR git-lfs#335, when the test suite was revised to create a private home directory for the tests, so as to be able to set a standard global Git configuration separate from that of the current user's configuration. Hence we can just remove the LFS_CONFIG variable now.

Since commit c591ff7 of PR git-lfs#3125 we have installed Strawberry Perl before running our CI test suite on Windows. This was necessary when we used the AppVeyor service to run our test suite on Windows, and subsequently also when we converted our CI jobs to GitHub Actions in PR git-lfs#3808. We require Perl because we use the "prove" command, which runs all of our t/t-*.sh test scripts and collects and summarizes the results. However, since commit 8818630 of PR git-lfs#5666 we have installed the Git for Windows SDK before running our test suite, and it includes a full Perl distribution, which means we no longer need to install Strawberry Perl separately.

Since commit 9b0adeb of PR git-lfs#3125, when we converted our test suite to use the "prove" command, we have run up to nine test jobs in parallel in our primary test builds. However, as of commit 7fd5aa6 in PR git-lfs#3144, we only use a maximum of four parallel test jobs when building our Linux RPM packages. This has the effect of slightly slowing down our CI suite, because, ever since commit 212c051 in PR git-lfs#3932, we build a range of Linux packages in our GitHub Actions CI workflow. (Note that for historical reasons, we only run the full "make test" Makefile target in our "t" test directory when building RPM packages; in our Debian packages, we just run the Go language tests. While this discrepancy is worth addressing, for the moment we simply aim to improve the runtime of our RPM package builds.)

In commit 1926e94 of PR git-lfs#3125 we migrated our test suite to use the "prove" command, and introduced the lfstest-count-tests utility to help control a common instance of our lfstest-gitserver program, which can be shared and reused by all the individual test scripts. In that PR we also moved our test suite into the "t" directory, and added a Makefile with a "test" target that runs the "prove" command over all the individual test scripts; this is still the framework we use today. The lfstest-count-tests utility did not work as intended on the Windows platform, however, as noted in the description of commit 9b73c80 in the same PR: On the AppVeyor CI Windows environment, there exist issues that prevent the lfstest-count-tests program from acquiring the test_count.lock file correctly. Rather than debug this problem, several workarounds were added, such as the GIT_LFS_NO_TEST_COUNT environment variable, which disables the use of the lfstest-count-tests utility in our individual t/t-*.sh test scripts, and the GIT_LFS_LOCK_ACQUIRE_DISABLED environment variable, which is used to prevent the lfstest-count-tests utility from trying to acquire the test_count.lock file at all on Windows. The bug which causes the lfstest-count-tests program to be unable to acquire the lock file the second time the program is run is the result of the lock file not being removed when the program runs the first time. This in turn is due to the fact that the file handle returned by the call to the OpenFile() function of the "os" package is never released with a call to Close(), so the subsequent call to Remove() results in failure. On Unix systems, this does not occur, because the file can be unlinked while file handles to it remain open. On Windows, however, all file handles must be closed before the file can be deleted. We fix the underlying bug here by simply calling Close() on the new file handle immediately after creating the file, so long as no error was returned by the call to OpenFile(). In addition, we make a number of other fixes and enhancements to the lfstest-count-tests program. We use the more conventional permissions mode of 0777 for any directories we create, rather than 0666. We include the O_RDWR flag with the others we pass to the OpenFile() function, since we are technically supposed to provide at least one of O_RDWR, O_RDONLY, and O_WRONLY. And we output an error message if the deferred call to our release() function is unable to delete the lock file.

In commit 9b73c80 of PR git-lfs#3125 the GIT_LFS_LOCK_ACQUIRE_DISABLED environment variable was introduced for use with our test suite. When this variable is set to the value "1", it indicates to the lfstest-count-tests test utility that it should skip creating a test_count.lock file before updating the test_count file we use to track the number of active test scripts. This was done because the lfstest-count-tests program had a bug which prevented it from deleting the lock file on Windows, causing subsequent invocations of the program to be unable to exclusively create it again. As this bug was not resolved at the time, the choice was made to work around the problem by skipping the creation of a lock file altogether on Windows. This change was also made in conjunction with the addition of another environment variable, GIT_LFS_NO_TEST_COUNT, in commit c591ff7 of the same PR. That variable is also set only on Windows, and stops the setup() and shutdown() shell functions defined in our t/testhelpers.sh script from calling the lfstest-count-tests program. As we have now resolved the underlying problem in the lfstest-count-tests utility, in a prior commit in this PR, we can now remove both of these environment variables and the test suite features they control. We start by removing support for the GIT_LFS_LOCK_ACQUIRE_DISABLED environment variable from the lfstest-count-tests utility, and changing our script/cibuild script so it no longer sets the variable. We can also remove the environment variable from the list of those we unset before running certain tests. We had to introduce this behaviour in commit aca1aff of PR git-lfs#3808, when we migrated our CI jobs to GitHub Actions. However, we will no longer need to unset these variables as we will now never set them at all, and they will have no effect if they were set.

In commit c591ff7 of PR git-lfs#3125 the GIT_LFS_NO_TEST_COUNT environment variable was introduced for use with our test suite. When this variable is set to a non-empty value, it indicates to the setup() and shutdown() test helper shell functions that they should skip running the lfstest-count-tests utility program entirely. This was done because the lfstest-count-tests program had a bug which prevented it from deleting the lock file on Windows, causing subsequent invocations of the program to be unable to exclusively create it again. As this bug was not resolved at the time, the choice was made to work around the problem in part by not counting the number of running test scripts on Windows. (We also introduced the GIT_LFS_LOCK_ACQUIRE_DISABLED environment variable, in commit 9b73c80 of the same PR, to further address the problem.) When running our test suite on Windows, we set the GIT_LFS_NO_TEST_COUNT to "1", which causes the individual test scripts to skip invoking lfstest-count-tests entirely. Instead, the program is run just once in the "test" t/Makefile target before executing the test scripts with the "prove" command, and then once again afterwards, both times by explicitly resetting the GIT_LFS_NO_TEST_COUNT variable to an empty string value. These two invocations cause the lfstest-count-tests program to start the lfstest-gitserver program before all the test scripts are executed and then stop it once they have all finished. As we have now resolved the underlying problem in the lfstest-count-tests utility, in a prior commit in this PR, we can remove this environment variable and the test suite features it controls. We change our script/cibuild script to no longer set the GIT_LFS_NO_TEST_COUNT variable, and we revise the conditionals in the setup() and shutdown() test helper shell functions so they always call the lfstest-count-tests program, rather than only doing so if the GIT_LFS_NO_TEST_COUNT variable is unset or empty. We also change the "test" target's recipe in our t/Makefile file to not unset GIT_LFS_NO_TEST_COUNT when running lfstest-count-tests before and after it runs the "prove" command. Finally, as we have already removed the GIT_LFS_LOCK_ACQUIRE_DISABLED environment variable and all support for it, in a previous commit in this PR, we are now also able to eliminate the unset_vars() shell functions in some of our t/t-*.sh test scripts. These functions are used to unset both the GIT_LFS_LOCK_ACQUIRE_DISABLED and GIT_LFS_NO_TEST_COUNT variables before running the tests in those scripts. We added this behaviour in commit aca1aff of PR git-lfs#3808, when we migrated our CI jobs to GitHub Actions. However, we will no longer need to unset these variables as we will now never set them at all, and they will have no effect if they are set.

In commit e1c2c8a of PR git-lfs#1107 we revised our test suite to use a separate test directory, either a temporary one we create or one specified in advance in a GIT_LFS_TEST_DIR environment variable. If that variable is empty or unset, we populate it with the output of a "mktemp -d" command with a directory name prefix pattern of "git-lfs_TEMP". We then create all of our test repositories in subdirectories of the location given by GIT_LFS_TEST_DIR. If we create a directory because GIT_LFS_TEST_DIR is empty or unset, we also set the RM_GIT_LFS_TEST_DIR variable with a flag value of "yes". As of commit e1c2c8a, we would then check that variable in the shutdown() shell function, and if the variable was set to "yes", we would then remove the directory specified by the GIT_LFS_TEST_DIR variable if it matched the pattern we use for temporary directories and if the KEEPTRASH variable was empty or unset. This ensured that, in the normal case, we removed all of our test artifacts after completing our test scripts. In commit 1926e94 of PR git-lfs#3125 we then migrated our test suite to use the "prove" command, and introduced the lfstest-count-tests utility to help control a common instance of our lfstest-gitserver program, which can be shared and reused by all the individual test scripts. In that PR we also moved our test suite into the "t" directory, and added a Makefile with a "test" target that runs the "prove" command over all the individual test scripts; this is still the framework we use today. In that commit, the "test" target's recipe in the t/Makefile file was written so it first loads the t/testenv.sh script and then runs the setup() shell function, which calls "lfstest-count-tests increment", which starts the lfstest-gitserver program because it increments the test count from zero to one. Every test script run by the "prove" command then calls setup() again, which further increments the test count, and then as it exits it calls the shutdown() shell function, which runs the "lfstest-count-tests decrement" command. That never decrements the test count below one, however, because of the initial call to setup() made prior to running "prove". Finally, the "test" Makefile target recpie runs shutdown() after "prove" is complete, which decrements the test count to zero, at which point lfstest-count-tests sends a shutdown POST HTTP request to the running lfstest-gitserver instance. In this design, though, each individual test script loads the t/testenv.sh script, and the script is directly loaded twice by the "test" Makefile target's recipe. In each case, the script runs in a separate shell session, and so does not find a GIT_LFS_TEST_DIR environment variable (unless one is set in the user's shell session when running "make test"). Thus each invocation of t/testenv.sh results in the creation of a new temporary test directory. Commit 1926e94 also removed the logic from the shutdown() shell function that handled the deletion of the directory specified by the GIT_LFS_TEST_DIR variable. The consequence is that when we run an individual test script by itself, it leaves a single temporary test directory behind, and when we run the "make test" Makefile target, every t/t-*.sh script leaves a test directory behind, plus two additional ones created when the recipe loads the t/testenv.sh script directly. We can improve this situation by restoring the logic to clean up the directory specified by the GIT_LFS_TEST_DIR variable (if the RM_GIT_LFS_TEST_DIR variable contains "yes" and the other conditions described above are met), and then adjusting our "test" target's recipe so it runs all of its key steps in a single Bash shell session. We also make sure to export the GIT_LFS_TEST_DIR and RM_GIT_LFS_TEST_DIR variables into the environment, so that when t/testenv.sh is loaded by the recipe the first time, it creates a temporary directory and exports those variables, which ensures they are then available to all the subsequent test scripts and the recipe's final call to shutdown(). By making these changes, we ensure that in the normal case, when GIT_LFS_TEST_DIR is empty or unset and we run "make test", we create only a single temporary test directory, use it in all the individual test scripts, and then remove it completely afterwards. In our t/t-env.sh and t/t-workflow.sh test scripts, we check the output of the "git lfs env" command in a number of tests, comparing it to the values we expect based on the test conditions, including the values of all environment variables whose names begin with the "GIT_" prefix. As we now export the GIT_LFS_TEST_DIR variable into the environment, it is among these environment variables, which causes the tests to fail on Windows unless we make one additional change. In our CI jobs, we run our test suite on Windows using the Git Bash environment provided by the Git for Windows project, which is in turn based on the MSYS2 environment. In this context, the mktemp(1) command returns a path beginning with /tmp, so that is now what is stored in our GIT_LFS_TEST_DIR environment variable. However, MSYS2 translates such Unix-style paths into actual Windows filesystem paths when it passes command-line arguments and environment variables to any programs it executes, including our Git LFS client. Thus the "git lfs env" command returns a Windows-style path for the GIT_LFS_TEST_DIR variable, while our tests in our t/t-env.sh and t/t-workflow.sh test scripts are expecting a Unix-style path beginning with /tmp. To resolve this, when running on Windows, we set the MSYS2_ENV_CONV_EXCL environment variable at the top of these scripts to contain the name of the GIT_LFS_TEST_DIR variable. As described in the MSYS2 documentation, this ensures MSYS2 does not alter the value of the GIT_LFS_TEST_DIR environment variable: https://www.msys2.org/docs/filesystem-paths/#environment-variables Finally, we take the opportunity to ensure that the command substitutions we use to set the GIT_LFS_TEST_DIR environment variable are fully quoted, and to clean up some nearby excess whitespace.

larsxschneider

Great find! Thank you 🙇

As described in commit 8d456a7 of PR git-lfs#5879, we currently depend on the Git for Windows SDK to install a version of the "prove" command from Perl Test::Harness distribution, because we use that command to run our suite of shell tests. However, in our CI jobs on Windows, the SDK does not at present install the "prove" command into a location listed in the PATH environment variable, so these CI jobs having begun failing. This change may be related to the recent migration of the SDK's release artifacts to GitHub from Azure Blobs, as outlined in PR git-for-windows/git-for-windows-automation#109, and is possibly a consequence of PR git-for-windows/build-extra#588. Regardless, the "prove" command is fortunately still included in the "minimal" flavour of the SDK, since it is listed in the manifest for that flavour: https://github.com/git-for-windows/git-sdk-64/blob/f845bb725e56e058a0fbf93aba18c70906a1068e/.sparse/minimal-sdk#L188 We simply need to ensure that the directory in which the command is now located, namely /usr/bin/core_perl, is added to the set of paths in the PATH variable for the Git Bash environment. Note that we only need to add this extra path to PATH when we are running our script/cibuild script, as that runs the default recipe from our t/Makefile file, which in turn executes the "test" recipe, which uses the "prove" command to run and summarize all our shell test scripts.

As described in commit 8d456a7 of PR git-lfs#5879, we currently depend on the Git for Windows SDK to install a version of the "prove" command from the Perl Test::Harness distribution, because we use that command to run our suite of shell tests. However, in our CI jobs on Windows, the SDK does not at present install the "prove" command into a location listed in the PATH environment variable, so these CI jobs having begun failing. This change may be related to the recent migration of the SDK's release artifacts to GitHub from Azure Blobs, as outlined in PR git-for-windows/git-for-windows-automation#109, and is possibly a consequence of PR git-for-windows/build-extra#588. Regardless, the "prove" command is fortunately still included in the "minimal" flavour of the SDK, since it is listed in the manifest for that flavour: https://github.com/git-for-windows/git-sdk-64/blob/f845bb725e56e058a0fbf93aba18c70906a1068e/.sparse/minimal-sdk#L188 We simply need to ensure that the directory in which the command is now located, namely /usr/bin/core_perl, is added to the set of paths in the PATH variable for the Git Bash environment. Note that we only need to add this extra path to PATH when we are running our script/cibuild script, as that runs the default recipe from our t/Makefile file, which in turn executes the "test" recipe, which uses the "prove" command to run and summarize all our shell test scripts.

chrisd8088 added 8 commits October 5, 2024 13:09

chrisd8088 marked this pull request as ready for review October 6, 2024 00:05

chrisd8088 requested a review from a team as a code owner October 6, 2024 00:05

chrisd8088 added the tests label Oct 6, 2024

larsxschneider approved these changes Oct 7, 2024

View reviewed changes

Merge branch 'main' into tests-cleanup-dirs-fix-counts

Loading
Loading status checks…

1f4a99c

chrisd8088 merged commit f8cf14b into git-lfs:main Oct 7, 2024
10 checks passed

chrisd8088 deleted the tests-cleanup-dirs-fix-counts branch October 7, 2024 17:06

chrisd8088 mentioned this pull request Feb 6, 2025

Update Windows CI jobs to find Perl prove command #5983

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run tests in parallel on Windows and always cleanup test directories #5879

Run tests in parallel on Windows and always cleanup test directories #5879

chrisd8088 commented Oct 5, 2024

larsxschneider left a comment

Run tests in parallel on Windows and always cleanup test directories #5879

Run tests in parallel on Windows and always cleanup test directories #5879

Conversation

chrisd8088 commented Oct 5, 2024

larsxschneider left a comment

Choose a reason for hiding this comment