Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use posix_spawn to spawn processes for more fasterness, especially on large processes #5710

Closed
twisted-trac opened this issue Jun 14, 2012 · 11 comments

Comments

@twisted-trac
Copy link

glyph's avatar @glyph reported
Trac ID trac#5710
Type enhancement
Created 2012-06-14 06:39:56Z
Branch https://github.com/twisted/twisted/tree/5710-posix_spawn

(Note that since this is a performance ticket, a spawnProcess benchmark is a prerequisite.)

posix_spawn is not burdened with the unfortunate requirement that fork/ exec implies, which is that user code runs in the post-fork-pre-exec environment. Many platforms do actually have practical optimizations for posix_spawn beyond its idiomatic counterpart. These optimizations would likely be magnified by Python's often unfortunate reference-counting/gc behaviors across fork() boundaries.

Not to mention the fact that posix_spawn actually has a signature that is quite similar to Twisted's spawnProcess (with struct attributes instead of parameters, but hey, it's C, what do you want).

Searchable metadata
trac-id__5710 5710
type__enhancement enhancement
reporter__glyph glyph
priority__normal normal
milestone__ 
branch__5710_posix_spawn 5710-posix_spawn
branch_author__ 
status__new new
resolution__None None
component__core core
keywords__None None
time__1339655996000000 1339655996000000
changetime__1642532819956244 1642532819956244
version__None None
owner__glyph glyph

@twisted-trac
Copy link
Author

nekto0n's avatar nekto0n commented

What about proposed preExecFn argument to spawnProcess?

@twisted-trac
Copy link
Author

jyknight's avatar @jyknight commented

Doing something like this is actually a correctness fix, not a performance fix. I'm nots sure if posix_spawn is actually useful, though, given this (note RESOLVED/FIXED state; Ulrich Drepper at his finest): http://sources.redhat.com/bugzilla/show_bug.cgi?id=10354

But, twisted.internet.process does need to be rewritten to either use posix_spawn or python-subprocess32's _posixsubprocess.c C module (or a variant thereof adding additional functionality if necessary), so that there is no python code executed between the call to fork() and the call to exec().

@twisted-trac
Copy link
Author

jyknight's avatar @jyknight commented

You currently can't do just about /anything/ if you want the vfork optimization in glibc. Doesn't seem too useful.

      /* If no major work is done, allow using vfork.  Note that we
         might perform the path searching.  But this would be done by
         a call to execvp(), too, and such a call must be OK according
         to POSIX.  */
      || ((flags & (POSIX_SPAWN_SETSIGMASK | POSIX_SPAWN_SETSIGDEF
                    | POSIX_SPAWN_SETSCHEDPARAM | POSIX_SPAWN_SETSCHEDULER
                    | POSIX_SPAWN_SETPGROUP | POSIX_SPAWN_RESETIDS)) == 0
          && file_actions == NULL))
    new_pid = __vfork ();
  else
    new_pid = __fork ();

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

Replying to jknight:

You currently can't do just about /anything/ if you want the vfork optimization in glibc. Doesn't seem too useful.

A few points:

  1. if you're doing process spawning for utilization of multiple cores (a quite popular use-case), you don't need to do any of that stuff.
  2. Did they get rid of the POSIX_SPAWN_USEVFORK flag? I thought you could tell it to be a little more permissive if you know that one of those things is "safe". (I don't know how you can ever really know that, in the face of how "all bets are off" vfork is, but still, I thought it was in there.)
  3. You can do some of that work in a helper C executable if you really need to, and just exec twice. If your main / master process is large and wants to fork a lot, you can still get a big win even paying twice for exec().

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

Worth noting that the glibc bug is open again. It seems that Drepper is no longer involved with glibc maintenance, so we no longer have to ... enjoy his inimitable style of maintaining relationships with users.

Also worth noting that glibc is not the only implementation of posix_spawn, so it might be better elsewhere. I am not sure how practically true this is right now, as I don't know the details of any other implementations, but it always might become true in the future; posix_spawn's semantics are such that it can be optimized, fork / exec aren't.

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

Replying to glyph:

Replying to jknight:

  1. You can do some of that work in a helper C executable if you really need to, and just exec twice. If your main / master process is large and wants to fork a lot, you can still get a big win even paying twice for exec().

Reading this now it strikes me as somewhat unclear. What I meant to say is, as a separate ticket, we could implement the following optimization:

If spawnProcess is passed parameters that, if passed to posix_spawn as arguments, would cause fork rather than vfork to be called on glibc (or an equivalent non-optimal behavior on a different libc), we could have spawnProcess detect that it's running on libc, then use posix_spawn - using only the attributes which would cause vfork (or equivalent) to be used, to start a short C helper program.

That program would take, as environment variables or command-line arguments, instructions on which process-attribute manipulations to perform (dup2-ing file descriptors, changing user IDs, etc). Said program would perform those instructions, then exec the program that the user intended to spawn.

This could still be a significant advantage over the current fork/exec-based implementation, because such a C program could allocate no memory, or very close to it. Therefore, despite two exec calls, a far smaller executable image would be mapped, and we would not have to worry about a post-fork Python reference count change causing random page copies all over the process's address space.

It might be advisable to have such a program for thread-safety as well, at least in the case where the subprocess is spawned as a different user. The recommended way to implement this with posix_spawn is to simply call setuid (and then use the POSIX_SPAWN_RESETIDS spawnattr flag), which affects the process globally, unlike the current strategy of calling setuid in the post-fork environment.

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

Worth noting that Python 3.8 has a built-in posix_spawn in the stdlib now, so the dependency is optional. https://docs.python.org/3.8/library/os.html#os.posix_spawn

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

I did a quick prototype implementation in #1675 . Updates from our conversation a decade ago, other than the fact that Python now supports the API in the stdlib:

  • vfork is used in more places (the original glibc bug is now fixed)
  • glibc has been slowly adding flags, including setsid and chdir support on most platforms (freebsd remains a notable exception)

not sure about the completeness/correctness of my hack, a buch of test_process passes just fine with the posix_spawn shim in place, and it does execute.

@twisted-trac
Copy link
Author

glyph's avatar @glyph commented

10 years later.

@twisted-trac
Copy link
Author

wsanchez's avatar @wsanchez set owner to @glyph

reviewed

@glyph
Copy link
Member

glyph commented Mar 14, 2023

Fixed in 8a275db

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants