Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of os.walk() #119169

Open
barneygale opened this issue May 19, 2024 · 1 comment
Open

Improve performance of os.walk() #119169

barneygale opened this issue May 19, 2024 · 1 comment
Labels
performance Performance or resource usage

Comments

@barneygale
Copy link
Contributor

barneygale commented May 19, 2024

There are a couple of minor performance improvements possible in os.walk():

  • We don't need to manually pump the os.scandir iterator, given we handle exceptions from next() like exceptions from scandir() itself, i.e. by ignoring the problematic directory and moving on. We can use a for loop like filthy casuals.
  • In bottom-up mode, we can handle exceptions from entry.is_symlink() in the same block as those from entry.is_dir(), which avoids a few temporary variables.
  • In top-down mode, we can call os.path.join() once on a parent directory rather than for each child path.

Linked PRs

@barneygale barneygale added the performance Performance or resource usage label May 19, 2024
@barneygale
Copy link
Contributor Author

barneygale commented May 19, 2024

In bottom-up mode, we can handle exceptions from entry.is_symlink() in the same block as those from entry.is_dir(), which avoids a few temporary variables.

Upon closer inspection, I think I was wrong. Symlinks to directories where is_symlink() raises should still be categorised as dirs.

barneygale added a commit to barneygale/cpython that referenced this issue May 19, 2024
Handle errors from `os.scandir()` and `ScandirIterator` similarly, which
lets us loop over directory entries with `for`. In top-down mode, call
`os.path.join()` at most once per iteration.
barneygale added a commit to barneygale/cpython that referenced this issue May 23, 2024
…wn=False)`

In `os.walk(topdown=False)`, don't bother reversing `walk_dirs`. This means
that sibling directories are visited in a different order, but 1) that
order is arbitrary and comes from `os.scandir()`, and 2) unlike in top-down
mode, users can't influence which directories are visited or in what order.

This change caused `test_walk_bottom_up` to fail. I think this test made
assertions that were too specific and relied on `os.scandir()` returning
things in a specific order, and the test code is pretty hard to understand
once you get into the details. I've replaced it with a version of the same
test from `test_pathlib_abc.py`.
barneygale added a commit to barneygale/cpython that referenced this issue May 26, 2024
For silly reasons, pathlib's generic implementation of `walk()` currently
resides in `glob._Globber`. This commit moves it into
`pathlib._abc.PathBase.walk()` where it really belongs, and makes
`pathlib.Path.walk()` call though to `os.walk()`.

Symlink handling is a little different between the two `walk()`
implementations when `followlinks=False`. In `pathlib` it means never
following symlinks, not even for distinguishing between files and
directories. In `os` it means never *walking* into symlinks, including any
symlinks created by the user between iterations. We smooth over these
differences with a private sentinel - `os._walk_symlinks_as_files` - that
enables the pathlib behaviour.
barneygale added a commit to barneygale/cpython that referenced this issue May 26, 2024
For silly reasons, pathlib's generic implementation of `walk()` currently
resides in `glob._Globber`. This commit moves it into
`pathlib._abc.PathBase.walk()` where it really belongs, and makes
`pathlib.Path.walk()` call though to `os.walk()`.

Symlink handling is a little different between the two `walk()`
implementations when `followlinks=False`. In `pathlib` it means never
following symlinks, not even for distinguishing between files and
directories. In `os` it means never *walking* into symlinks, including any
symlinks created by the user between iterations. We smooth over these
differences with a private sentinel - `os._walk_symlinks_as_files` - that
enables the pathlib behaviour.
barneygale added a commit that referenced this issue May 29, 2024
For silly reasons, pathlib's generic implementation of `walk()` currently
resides in `glob._Globber`. This commit moves it into
`pathlib._abc.PathBase.walk()` where it really belongs, and makes
`pathlib.Path.walk()` call `os.walk()`.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 29, 2024
…ythonGH-119573)

For silly reasons, pathlib's generic implementation of `walk()` currently
resides in `glob._Globber`. This commit moves it into
`pathlib._abc.PathBase.walk()` where it really belongs, and makes
`pathlib.Path.walk()` call `os.walk()`.
(cherry picked from commit 7ff61f5)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
barneygale added a commit that referenced this issue May 29, 2024
…H-119573) (#119750)

GH-119169: Implement `pathlib.Path.walk()` using `os.walk()` (GH-119573)

For silly reasons, pathlib's generic implementation of `walk()` currently
resides in `glob._Globber`. This commit moves it into
`pathlib._abc.PathBase.walk()` where it really belongs, and makes
`pathlib.Path.walk()` call `os.walk()`.
(cherry picked from commit 7ff61f5)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

No branches or pull requests

1 participant