Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short-circuit the check-hooks-apply hook #2935

Closed
mxr opened this issue Jul 20, 2023 · 3 comments · Fixed by #2998
Closed

Short-circuit the check-hooks-apply hook #2935

mxr opened this issue Jul 20, 2023 · 3 comments · Fixed by #2998

Comments

@mxr
Copy link
Member

mxr commented Jul 20, 2023

search you tried in the issue tracker

check-hooks-apply

describe your actual problem

I was looking at the implementation of check_hooks_apply and noticed that while the filenames_for_hook function returns multiple matching files, the caller is only interested in whether the tuple is empty or not. I was curious if short-circuiting would have any benefit:

I prototyped a function Classifier.any_filenames_for_hook which short-circuits:

	def any_filenames_for_hook(self, hook: Hook) -> bool:
	    include_re, exclude_re = re.compile(hook.files), re.compile(hook.exclude)
	    types, types_or, exclude_types = (
	        frozenset(hook.types),
	        frozenset(hook.types_or),
	        frozenset(hook.exclude_types),
	    )
	
	    for i, name in enumerate(self.filenames):
	        if include_re.search(name) and not exclude_re.search(name):
	            return True
	
	        tags = self._types_for_file(name)
	        if (
	            tags >= types
	            and (not types or tags & types_or)
	            and not tags & exclude_types
	        ):
	            return True
	
	    return False
     for hook in all_hooks(config, Store()):
         if hook.always_run or hook.language == 'fail':
             continue
-        elif not classifier.filenames_for_hook(hook):
+        elif not classifier.any_filenames_for_hook(hook):
             print(f'{hook.id} does not apply to this repository')
             retv = 1

Just to try it out, I applied this change in my pre-commit install directory and profiled with hyperfine on a local repo. The repo has about 12000 files. The short-circuit version ran about 20% faster. I also added logging and saw the short-circuit version cut out after ~26 files, whereas the existing version looks at all 12000.

Old

$ hyperfine 'pre-commit run check-hooks-apply --all-files'
Benchmark 1: pre-commit run check-hooks-apply --all-files
  Time (mean ± σ):      1.253 s ±  0.066 s    [User: 0.546 s, System: 0.695 s]
  Range (min … max):    1.193 s …  1.419 s    10 runs

New

$ hyperfine 'pre-commit run check-hooks-apply --all-files'
Benchmark 1: pre-commit run check-hooks-apply --all-files
  Time (mean ± σ):     969.9 ms ±  65.6 ms    [User: 429.0 ms, System: 535.0 ms]
  Range (min … max):   873.1 ms … 1067.6 ms    10 runs

pre-commit --version

3.2.2

@asottile
Copy link
Member

I'd be concerned that the two functions would drift -- the current implementation directly matches what pre-commit would do. I'm not sure the performance benefits are worth it?

@mxr
Copy link
Member Author

mxr commented Jul 21, 2023

Agreed, the POC was to measure the improvement. I can take a stab at unifying the two (for example refactor filenames_for_hook to return an iterator) but in general if even that causes more harm than good I'm happy to close this issue

@asottile
Copy link
Member

sure give that a shot -- though sometimes a generator will be slower but maybe not in this case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants