Rustup (including proxies) is not safe for concurrent use #988
Comments
rustup needs to have concurrency support bolted on in general. Need to extract cargo's flocking code.
To start with we can just bolt on a flock to the entire runtime of rustup.
I think I can probably work on this: I have some experience with Cargo's file [dead]locking :)
@matklad ooh awesome. Can you extract cargo's fancy flock code to its own crate to share with rustup?
Oh there's a tricky bit here in that rustup is reentrant. Can't just use a simple global lock.
Where exactly does reentrancy happen? I don't yet understand how rustup works, but looks like adding a file lock to the
Hi, is this still an issue? It might explain some problems I've been having with rustup.
Yes, currently we have no locking on the toolchain directories IIRC.
@matklad did that fancy flock code get extracted?
No: https://github.com/rust-lang/cargo/blob/master/src/cargo/util/flock.rs I’ve also absolutely dropped the ball on this one :)
No worries. Looking at it, the semantics you have are different enough that it wouldn't be useful (e.g. creating directories on demand would open up TOCTOU race conditions in rustup's secure untar handling), and we don't want to lock each file because syscall latency would destroy performance on NFS, Windows and so on. I'll take the code as inspiration and do a similar thing though :) cheers!
Ok, so looks like cargo has problems on NFS which I was worried about:
cifsfs supports oplocks so SMBfs should lock properly:
fs2 which cargo uses does use LOCK_NB with flock - flock(file, libc::LOCK_EX | libc::LOCK_NB) (https://tantivy-search.github.io/tantivy/src/fs2/unix.rs.html#36) questions:
I'm inclined to provide a UI warning about NFS when taking the lock, with an env variable to shut the warning up for folk that know things work properly in their environment (using the NFS sniffing logic from cargo); but then assume that the fs is working properly.
Further thoughts: we need to mutually exclude operations on toolchains: removal and upgrade affect the entire toolchain, as do operations on components (add/remove components). I think locking the dir of the toolchain is probably the right control point.
We've been running into this on the Libra team, as developers using CLion or IDEA quite often manage to invoke rustup concurrently by accident when we check in toolchain file updates. This results in a borked toolchain install that is missing rustc, but fixes itself after a manual run of @rbtcollins Are you working on a PR to fix this now? If so, that would be awesome.
Not actively working on a PR, but it is on my radar.
Ok, so here's a bit of a specification, I think this ties together all the various bits involved.
One aspect of
For those encountering a similar problem or directed to this issue, here's a quick workaround from #3530: Ensure the rust-toolchain is set up by running
(Edited to capture all the details that have emerged over time)
Recovering from this bug:
Usually just doing a
rustup component remove NAME && rustup component add NAME
will fix things. Sometimes removing the entire toolchain will be needed. In rare cases uninstalling rustup entirely will be needed.
User model of Rustup
Rustup may be run as three different commands:
Locking in Rustup
Rustup gets run concurrently in two very different contexts: within a single (machine, user), it may be run concurrently by the user, or an IDE, or both, to perform tasks ranging from toolchain installation and component addition to opening documentation. All of these require multiple reads to be made to the rustup data with a consistent view; some of them require writes to be made as well.
Rustup may also be run across machines, where several machines share one rustup install - (machineA, userA) + (machineB, userA) - and in this case the same set of operations may take place, with the same requirements.
Whatever consistency solution we adopt should ideally serve both use cases, and not require manual lock administration as lockdir style solutions do, nor additional running network daemons.
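As a rough illustration of the kind of OS-level advisory lock discussed in the comments (flock with LOCK_EX | LOCK_NB, as fs2 does), here is a minimal Unix-only sketch. The function name `try_lock_exclusive`, the hand-written `flock` binding and the numeric constants are assumptions made for this example, not rustup's code. The kernel drops such a lock when the file is closed or the process exits, so there is no stale lock to clean up by hand.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::Path;

// Declared by hand to keep the sketch dependency-free; real code would more
// likely use a crate such as `libc` or `fs2`. On the 2024 edition this block
// must be written `unsafe extern "C"`.
extern "C" {
    fn flock(fd: c_int, operation: c_int) -> c_int;
}

// Assumption: these are the Linux and macOS values of the flock(2) flags.
const LOCK_EX: c_int = 2; // exclusive lock
const LOCK_NB: c_int = 4; // return an error instead of blocking

/// Hypothetical helper: try to take an exclusive advisory lock on `path`.
/// Returns `Ok(Some(file))` if the lock was taken (it lasts until `file` is
/// dropped), or `Ok(None)` if another open file description already holds it.
fn try_lock_exclusive(path: &Path) -> io::Result<Option<File>> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(false)
        .open(path)?;
    let rc = unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) };
    if rc == 0 {
        return Ok(Some(file));
    }
    let err = io::Error::last_os_error();
    if err.kind() == io::ErrorKind::WouldBlock {
        // EWOULDBLOCK: the lock is held elsewhere.
        Ok(None)
    } else {
        Err(err)
    }
}
```

A caller would keep the returned `File` alive for as long as the protected operation runs. This sketch says nothing about reentrancy or NFS behaviour, both of which the thread flags as open problems.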
Proxies
Rustup has one set of proxies shared across all toolchains; the proxies are held open while any process from a toolchain is running - e.g. IDE's hold rls open for extended periods.
We need a lock to prevent concurrent attempts to install new proxies, and we need a notification mechanism back to the running proxies so that they can be told to exit when an update is required (because of presumed limitations of in-use-file-replacement on Windows, though recent changes may mean we can avoid this).
Toolchains
We have many toolchains in one rustup installation: stable, beta, nightly, dated nightly versions, and custom versions. Adding a toolchain adds a directory and a hash file; we need a lock to prevent collisions when moving the directory into place. Deleting a toolchain does a recursive rm in-place, which also needs a lock to prevent other rustup invocations presuming that the toolchain is actually installed while the deletion takes place (or perhaps we need to rename-then-delete, though that can run into race conditions with virus scanners, especially if the toolchain was just installed). Further, permitting deletions at any point will require notifications to running rls process proxies from that toolchain to cause them to shut down, or the .exe is likely not deletable on Windows.
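To make the two directory operations above concrete, here is a minimal sketch of moving a staged toolchain into place and of the rename-then-delete alternative. The function names and the `staging` and `trash` paths are hypothetical, and both paths are assumed to sit on the same filesystem as the destination so that each rename is a single step.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: publish a fully populated staging directory as `dest`.
/// The existence check is racy on its own; it is exactly the step that the
/// lock described above would have to protect.
fn install_toolchain(staging: &Path, dest: &Path) -> io::Result<()> {
    if dest.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "toolchain directory already present",
        ));
    }
    fs::rename(staging, dest)
}

/// Hypothetical helper: rename the toolchain out of the way first, so other
/// invocations stop seeing it immediately, then delete it recursively.
fn remove_toolchain(dest: &Path, trash: &Path) -> io::Result<()> {
    fs::rename(dest, trash)?;
    fs::remove_dir_all(trash)
}
```

The rename step can still fail when another process, such as a virus scanner or a running proxy on Windows, holds files inside the directory open, which is the caveat raised above.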
Components
Components such as rls are added into a toolchain directory, and also involve writing to a metadata file within the toolchain tree itself. This needs to be locked to avoid corruption/dropped writes. As with toolchains, we need proxy notification for component removal, as well as a way to make sure that a component that is being removed does not have new instances of it spawned between the start of the removal and the completion of the removal.
Downloads
We download packages for rustup itself, toolchains and additional components for toolchains, and (a corner case) custom installer executables for toolchains. We also download digital signature metadata files.
The same file can be downloaded by two different rustup invocations for two different reasons. For instance, downloading nightly and today's dated nightly will download the same file(s).
We used to leak partially downloaded files, and recently started deleting all download dir contents when rustup finished running; this is causing errors now.
We need some mechanism to deal with leaks, but also to permit concurrent execution of rustup to be downloading files without interruption. Possibly a date based mechanism or locking based mechanism would be sufficient.
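One possible shape for the date-based mechanism is sketched below: only files older than a cutoff are removed, so a download that another rustup process is still writing is left alone. The function name, the `max_age` parameter and the use of modification time as the age signal are assumptions for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Hypothetical helper: delete regular files in `download_dir` whose last
/// modification is more than `max_age` before `now`. Returns how many files
/// were removed. `now` is a parameter so the policy can be tested.
fn prune_stale_downloads(
    download_dir: &Path,
    now: SystemTime,
    max_age: Duration,
) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(download_dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if !meta.is_file() {
            continue;
        }
        // `duration_since` fails when the file is newer than `now`;
        // such a file is treated as fresh and kept.
        let age = match now.duration_since(meta.modified()?) {
            Ok(age) => age,
            Err(_) => continue,
        };
        if age > max_age {
            fs::remove_file(entry.path())?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

This sketch does not handle a file vanishing between the directory listing and the metadata call, which can happen when two invocations prune at once; a real implementation would need to tolerate that error.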
Network file systems & rustup
Linux handles SMB mounts and locking on that filesystem well, at least per my reading of the source - a rustup dir on an SMB mounted file system using regular posix locks should have those locks reflected as SMB locks.
NFS is well known for having poor lock behaviour when the network services are not running or are firewalled; the underlying syscalls are only non-blocking on the file descriptor itself, and EWOULDBLOCK is defined as the lock being already held, not the OS being unable to determine if the lock is already held...
So it is likely that we will see bug reports of rustup hanging when ~/.rustup is on NFS and the NFS server's lock RPC service is misconfigured or flaky.
I think this can be mitigated by emitting an NFS-specific log message once per process when taking a lock out on NFS, with a config option to disable that for users that are confident they don't need it, and a bug reporting hint to tell us they have disabled it.
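The once-per-process warning with an opt-out could look roughly like the sketch below. The environment variable name, the message text and the function names are all hypothetical, and the NFS detection itself is not shown: `on_nfs` stands in for whatever filesystem sniffing is used.

```rust
use std::env;
use std::sync::Once;

/// Guards the warning so it is printed at most once per process.
static NFS_WARNING: Once = Once::new();

/// Hypothetical opt-out variable; not an existing rustup setting.
const SILENCE_VAR: &str = "RUSTUP_NO_NFS_LOCK_WARNING";

/// Pure policy: warn only on NFS and only when the user has not opted out.
fn should_warn(on_nfs: bool, silenced: bool) -> bool {
    on_nfs && !silenced
}

/// Call just before taking a lock. Returns true if this call printed the
/// warning, false if it was suppressed or had already been printed.
fn warn_if_nfs(on_nfs: bool) -> bool {
    let silenced = env::var_os(SILENCE_VAR).is_some();
    let mut printed = false;
    if should_warn(on_nfs, silenced) {
        NFS_WARNING.call_once(|| {
            eprintln!(
                "warning: taking a lock on NFS; this may hang if the server's \
                 lock service is unavailable (set {} to silence this message)",
                SILENCE_VAR
            );
            printed = true;
        });
    }
    printed
}
```

Keeping the decision in `should_warn` separate from the printing makes the policy easy to test without touching the process environment.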
Locks and notifications
OS locks typically don't have callback notifications built in; polling is one answer, or some form of lightweight message bus (e.g. zmq) with clear rules tied into the lock management layer. We have to be careful about race conditions though: in particular notifying before or after as appropriate.
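For the polling option, a minimal sketch is a loop that re-checks a condition at a fixed interval until it holds or a deadline passes. The `poll_until` name and the idea that the condition is something like "has a shutdown been requested for this proxy" are assumptions for illustration; how the request is signalled is left open.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `condition` every `interval` until it returns
/// true or `timeout` has elapsed. Returns true if the condition was met.
fn poll_until<F: FnMut() -> bool>(
    mut condition: F,
    interval: Duration,
    timeout: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

The interval trades notification latency against wake-ups. The ordering concern raised above still applies: whoever sets the condition has to do so at the right point relative to taking or releasing the lock.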