Rustup (including proxies) is not safe for concurrent use #988

matklad · 2017-03-16T15:11:24Z

(Edited to capture all the details that have emerged over time)

Recovering from this bug:

Usually just doing a rustup component remove NAME && rustup component add NAME will fix things. Sometimes removing the entire toolchain will be needed. In rare cases uninstalling rustup entirely will be needed.

User model of Rustup

Rustup may be run as three different commands:

rustup-init to install rustup (& by default a toolchain)
rustup to explicitly query or modify an installation (including rustup itself and one or more toolchains)
as a proxy of rustc, cargo etc (& which can implicitly trigger installation, upgrade or modification of a toolchain, e.g. through toolchain files)

Locking in Rustup

Rustup gets run concurrently in two very different contexts: within a single (machine, user), it may be run concurrently by the user, or an IDE, or both, to perform tasks ranging from toolchain installation, component addition, documentation opening. All of these require multiple reads to be made to the rustup data with a consistent view; some of them require writes to be made as well.

Rustup may also be run across machines, where different machines share a single rustup install - (machineA, userA) + (machineB, userA) - and in this case the same set of operations may take place, with the same requirements.

Whatever consistency solution we adopt should ideally serve both use cases, without requiring manual lock administration (as lockdir style solutions do) or additional running network daemons.

Proxies

Rustup has one set of proxies shared across all toolchains; the proxies are held open while any process from a toolchain is running - e.g. IDE's hold rls open for extended periods.

We need a lock to prevent concurrent attempts to install new proxies, and we need a notification mechanism back to the running proxies to allow them to be notified to exit when an update is required (because of presumed limitations of in-use-file-replacement on Windows, though recent changes may mean we can avoid this)

Toolchains

We have many toolchains in one rustup installation; stable, beta, nightly, dated nightly versions, and custom versions. Adding a toolchain adds a directory and a hash file; we need a lock to prevent collisions attempting to move the directory into place. Deleting a toolchain does a recursive rm in-place, which also needs a lock to prevent other rustup invocations presuming that the toolchain is actually installed during the time the deletion takes place (or perhaps we need to rename-then-delete, though that can run into race conditions with virus scanners, especially if the toolchain was just installed). Further, permitting deletions at any point will require notifications to running rls process proxies from that toolchain to cause them to shutdown, or the .exe is likely not deletable on Windows.
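As a rough illustration of the "move the directory into place" and "rename-then-delete" ideas above, here is a minimal sketch using only the Rust standard library. The function names, the `Publish` enum and the tombstone path are hypothetical and not taken from rustup's code. It assumes the staging directory and the destination are on the same filesystem, and it does not replace the lock discussed above: it only makes losing the race harmless.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of trying to publish a staged toolchain directory.
#[derive(Debug, PartialEq)]
pub enum Publish {
    /// Our staged directory is now the toolchain directory.
    Installed,
    /// A directory was already at the destination; our staged copy was discarded.
    AlreadyPresent,
}

/// Move a fully unpacked `staging` directory to `dest` with a single rename.
/// Caveat: on Unix, an *empty* directory at `dest` is silently replaced.
pub fn publish_toolchain(staging: &Path, dest: &Path) -> io::Result<Publish> {
    match fs::rename(staging, dest) {
        Ok(()) => Ok(Publish::Installed),
        // A failed rename while `dest` exists is treated as a lost race.
        Err(_) if dest.exists() => {
            fs::remove_dir_all(staging)?;
            Ok(Publish::AlreadyPresent)
        }
        Err(e) => Err(e),
    }
}

/// Rename-then-delete: the toolchain disappears from its public name in one
/// step, and the slow recursive delete happens under the `tombstone` name.
pub fn remove_toolchain(dest: &Path, tombstone: &Path) -> io::Result<()> {
    fs::rename(dest, tombstone)?;
    fs::remove_dir_all(tombstone)
}
```

Neither function addresses the Windows problem of running executables inside the toolchain, nor the virus scanner race mentioned above; those still need the notification and locking work.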

Components

Components such as rls are added into a toolchain directory, and also involve writing to a metadata file within the toolchain tree itself. This needs to be locked to avoid corruption/dropped writes. As with toolchains, we need proxy notification for component removal, as well as a way to make sure that a component that is being removed does not have new instances of it spawned between the start of the removal and the completion of the removal.
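For the metadata file, one common way to avoid readers ever seeing a half-written file is to write a temporary file next to it and rename it over the original. The sketch below is an assumption about how that could look, not rustup's actual code; `write_atomically` and the `.tmp.<pid>` suffix are made up for illustration. Note that this only prevents torn files. Two writers doing read-modify-write can still drop each other's updates, so the lock described above is still required.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace the file at `path` with `contents` via write-to-temp-then-rename.
/// The temporary file lives in the same directory so the rename stays on
/// one filesystem.
pub fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical naming scheme: "<path>.tmp.<pid>".
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    // Flush file data to disk before the rename makes it visible.
    file.sync_all()?;
    drop(file);

    // `fs::rename` replaces an existing destination file.
    fs::rename(&tmp, path)
}
```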

Downloads

We download packages for rustup itself, toolchains and additional components for toolchains, and (a corner case) custom installer executables for toolchains. We also download digital signature metadata files.

The same file can be downloaded by two different rustup invocations for two different reasons. For instance, downloading nightly and a dated nightly for today, will download the same file(s).

We used to leak partially downloaded files, and recently started deleting all download dir contents when rustup finished running; this is causing errors now.

We need some mechanism to deal with leaks, but also to permit concurrent execution of rustup to be downloading files without interruption. Possibly a date based mechanism or locking based mechanism would be sufficient.
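To make the "date based mechanism" concrete, here is a minimal sketch that only removes files in the download directory whose modification time is older than a threshold, so that files another rustup process is actively writing are left alone. The function name and the choice of threshold are assumptions for illustration; it also assumes in-progress downloads keep touching their file, which would need checking against the real download code.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::Duration;

/// Remove regular files in `dir` last modified at least `max_age` ago.
/// Returns how many files this call removed.
pub fn remove_stale_downloads(dir: &Path, max_age: Duration) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if !meta.is_file() {
            continue;
        }
        // A modification time in the future counts as age zero.
        let age = meta.modified()?.elapsed().unwrap_or(Duration::ZERO);
        if age >= max_age {
            match fs::remove_file(entry.path()) {
                Ok(()) => removed += 1,
                // Another process may have cleaned it up first; that is fine.
                Err(e) if e.kind() == io::ErrorKind::NotFound => {}
                Err(e) => return Err(e),
            }
        }
    }
    Ok(removed)
}
```

A call such as `remove_stale_downloads(&downloads_dir, Duration::from_secs(24 * 60 * 60))` would then leave anything touched in the last day in place.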

Network file systems & rustup

Linux handles SMB mounts and locking on that filesystem well, at least per my reading of the source - a rustup dir on an SMB mounted file system using regular posix locks should have those locks reflected as SMB locks.

NFS is well known for having poor lock behaviour when the network services are not running or are firewalled; the underlying syscalls are only non-blocking with respect to the file descriptor itself, and EWOULDBLOCK is defined as the lock being already held, not the OS being unable to determine if the lock is already held...

  [EWOULDBLOCK]
	    The file is locked and the LOCK_NB option was specified.

So it is likely that we will see bug reports of rustup hanging when ~/.rustup is on NFS and the NFS server's lock RPC service is misconfigured or flaky.
I think this can be mitigated by emitting an NFS specific log message once per process when taking a lock out on NFS, with a config option to disable that for users that are confident they don't need it.... and a bug reporting hint to tell us they have disabled it.

Locks and notifications

OS locks typically don't have callback notifications built in; polling is one answer, or some form of lightweight message bus (e.g. zmq) with clear rules tied into the lock management layer. We have to be careful about race conditions though: in particular notifying before or after as appropriate.
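The polling option could look roughly like the sketch below: retry a non-blocking attempt at a fixed interval until it succeeds or a deadline passes. `poll_until` is a hypothetical helper, and the `attempt` closure stands in for whatever non-blocking check is used (a try-lock, or a check for a pending notification); nothing here is taken from rustup.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Call `attempt` repeatedly, sleeping `interval` between tries, until it
/// returns `Some` or `timeout` has elapsed. Always tries at least once.
pub fn poll_until<T>(
    mut attempt: impl FnMut() -> Option<T>,
    interval: Duration,
    timeout: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = attempt() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

The trade-off is the usual one for polling: a short interval wastes wakeups, a long one delays reaction, and the timeout gives a place to surface a "still waiting for lock" message rather than hanging silently.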


brson · 2017-03-16T22:20:54Z

rustup needs to have concurrency support bolted on in general. Need to extract cargo's flocking code.

brson · 2017-03-16T22:21:28Z

To start with we can just bolt on a flock to the entire runtime of rustup.

matklad · 2017-03-16T23:31:04Z

I think I can probably work on this: I have some experience with Cargo's file [dead]locking :)

brson · 2017-03-17T02:43:08Z

@matklad ooh awesome. Can you extract cargo's fancy flock code to its own crate to share with rustup?

brson · 2017-03-17T02:43:36Z

Oh there's a tricky bit here in that rustup is reentrant. Can't just use a simple global lock.

matklad · 2017-03-22T13:24:27Z

> Oh there's a tricky bit here in that rustup is reentrant. Can't just use a simple global lock.

Where exactly does reentrancy happen?

I don't yet understand how rustup works, but looks like adding a file lock to the Transaction should do the trick, unless transactions can nest and deadlock each other?

fenhl · 2019-09-30T14:00:52Z

Hi, is this still an issue? It might explain some problems I've been having with rustup.

kinnison · 2019-10-02T19:25:46Z

Yes, currently we have no locking on the toolchain directories IIRC.

rbtcollins · 2020-02-05T04:16:44Z

@matklad did that fancy flock code get extracted?

matklad · 2020-02-05T07:05:25Z

No: https://github.com/rust-lang/cargo/blob/master/src/cargo/util/flock.rs

I’ve also absolutely dropped the ball on this one :)

rbtcollins · 2020-02-05T07:56:41Z

No worries. Looking at it, the semantics you have are different enough that it wouldn't be useful: e.g. creating directories on demand would open up TOCTOU race conditions in rustup's secure untar handling, and we don't want to lock each file because syscall latency would destroy performance on NFS, Windows and so on.

I'll take the code as inspiration and do a similar thing though :) cheers!

rbtcollins · 2020-02-05T09:20:16Z

Ok, so looks like cargo has problems on NFS which I was worried about:

cifsfs supports oplocks so SMBfs should lock properly:

https://github.com/torvalds/linux/blob/master/fs/cifs/cifsfs.c
meaning that I think it's just the blocking failure mode of a crashed/firewalled/missing rpc.lockd that we need to be concerned with.

fs2 which cargo uses does use LOCK_NB with flock - flock(file, libc::LOCK_EX | libc::LOCK_NB) (https://tantivy-search.github.io/tantivy/src/fs2/unix.rs.html#36)

questions:

Would lockf with F_TLOCK instead still fail in the same way? It is defined as never blocking, but network file systems ... need to check the code.
Open file description locks still don't have a timeout field in the lock descriptor that calls operate on, so there's no way to tell NFS that you don't want to wait for that shitty shitty network. There is ENOLCK to signal that locking over the network failed, but the question is how long the network takes to realise it.
https://docs.oracle.com/cd/E19253-01/816-4555/rfsrefer-9/index.html has a client side retransmit timer of 15 seconds, which suggests ELONGTIME.

I'm inclined to provide a UI warning about NFS when taking the lock with a env variable to shut the warning up for folk that know things work properly in their environment (using the NFS sniffing logic from cargo); but then assume that the fs is working properly.
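A minimal sketch of that warning logic, assuming the NFS detection itself is done elsewhere and passed in as a boolean. The environment variable name `RUSTUP_NFS_LOCK_WARNING_QUIET`, the type `NfsLockWarning` and the message text are all hypothetical; none of them exist in rustup.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical opt-out variable; not an existing rustup setting.
pub const QUIET_VAR: &str = "RUSTUP_NFS_LOCK_WARNING_QUIET";

/// Tracks whether the NFS warning has already been shown in this process.
pub struct NfsLockWarning {
    emitted: AtomicBool,
}

impl NfsLockWarning {
    pub const fn new() -> Self {
        Self { emitted: AtomicBool::new(false) }
    }

    /// Returns the message to print, at most once per instance.
    /// `on_nfs` comes from the caller; filesystem sniffing is out of scope.
    pub fn check(&self, on_nfs: bool, quiet: bool) -> Option<&'static str> {
        if !on_nfs || quiet {
            return None;
        }
        // `swap` returns the previous value: true means we already warned.
        if self.emitted.swap(true, Ordering::Relaxed) {
            return None;
        }
        Some("warning: taking a lock on NFS; this may hang if the lock service is unavailable")
    }
}

/// True when the user has set the (hypothetical) opt-out variable.
pub fn quiet_from_env() -> bool {
    std::env::var_os(QUIET_VAR).is_some()
}
```

In use, a single `static WARNING: NfsLockWarning = NfsLockWarning::new();` would be consulted just before each lock acquisition, with `quiet_from_env()` supplying the `quiet` argument.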

rbtcollins · 2020-02-16T22:41:12Z

Further thoughts: we need to mutually exclude operations on toolchains: removal and upgrade affect the entire toolchain as well as operations on components (add/remove components). I think locking the dir of the toolchain is probably the right control point.

metajack · 2020-03-02T21:30:21Z

We've been running into this on the Libra team as developers using CLion or IDEA manage to invoke rustup concurrently accidentally quite often when we check in toolchain file updates. This results in a borked toolchain install that is missing rustc, but fixes itself after a manual run of rustup toolchain install $TOOLCHAIN.

@rbtcollins Are you working on a PR to fix this now? If so, that would be awesome.

rbtcollins · 2020-03-02T21:40:38Z

Not actively working on a PR, but it is on my radar.

rbtcollins · 2020-03-08T20:52:47Z

Ok, so here's a bit of a specification, I think this ties together all the various bits involved.

Locking in Rustup

Rustup gets run concurrently in two very different contexts: within a single (machine, user), it may be run concurrently by the user, or an IDE, or both, to perform tasks ranging from toolchain installation, component addition, documentation opening. All of these require multiple reads to be made to the rustup data with a consistent view; some of them require writes to be made as well.

Rustup may also be run across machines, where different machines share a single rustup install - (machineA, userA) + (machineB, userA) - and in this case the same set of operations may take place, with the same requirements.

Whatever consistency solution we adopt should ideally serve both use cases, without requiring manual lock administration (as lockdir style solutions do) or additional running network daemons.

Proxies

Rustup has one set of proxies shared across all toolchains; the proxies are held open while any process from a toolchain is running - e.g. IDE's hold rls open for extended periods.

We need a lock to prevent concurrent attempts to install new proxies, and we need a notification mechanism back to the running proxies to allow them to be notified to exit when an update is required (because of presumed limitations of in-use-file-replacement on Windows, though recent changes may mean we can avoid this)

Toolchains

We have many toolchains in one rustup installation; stable, beta, nightly, dated nightly versions, and custom versions. Adding a toolchain adds a directory and a hash file; we need a lock to prevent collisions attempting to move the directory into place. Deleting a toolchain does a recursive rm in-place, which also needs a lock to prevent other rustup invocations presuming that the toolchain is actually installed during the time the deletion takes place (or perhaps we need to rename-then-delete, though that can run into race conditions with virus scanners, especially if the toolchain was just installed). Further, permitting deletions at any point will require notifications to running rls process proxies from that toolchain to cause them to shutdown, or the .exe is likely not deletable on Windows.

Components

Components such as rls are added into a toolchain directory, and also involve writing to a metadata file within the toolchain tree itself. This needs to be locked to avoid corruption/dropped writes. As with toolchains, we need proxy notification for component removal, as well as a way to make sure that a component that is being removed does not have new instances of it spawned between the start of the removal and the completion of the removal.

Downloads

We download packages for rustup itself, toolchains and additional components for toolchains, and (a corner case) custom installer executables for toolchains. We also download digital signature metadata files.

The same file can be downloaded by two different rustup invocations for two different reasons. For instance, downloading nightly and a dated nightly for today, will download the same file(s).

We used to leak partially downloaded files, and recently started deleting all download dir contents when rustup finished running; this is causing errors now.

We need some mechanism to deal with leaks, but also to permit concurrent execution of rustup to be downloading files without interruption. Possibly a date based mechanism or locking based mechanism would be sufficient.

Network file systems & rustup

Linux handles SMB mounts and locking on that filesystem well, at least per my reading of the source - a rustup dir on an SMB mounted file system using regular posix locks should have those locks reflected as SMB locks.

NFS is well known for having poor lock behaviour when the network services are not running or are firewalled; the underlying syscalls are only non-blocking with respect to the file descriptor itself, and EWOULDBLOCK is defined as the lock being already held, not the OS being unable to determine if the lock is already held...

  [EWOULDBLOCK]
	    The file is locked and the LOCK_NB option was specified.

So it is likely that we will see bug reports of rustup hanging when ~/.rustup is on NFS and the NFS server's lock RPC service is misconfigured or flaky.
I think this can be mitigated by emitting an NFS specific log message once per process when taking a lock out on NFS, with a config option to disable that for users that are confident they don't need it.... and a bug reporting hint to tell us they have disabled it.

Locks and notifications

OS locks typically don't have callback notifications built in; polling is one answer, or some form of lightweight message bus (e.g. zmq) with clear rules tied into the lock management layer. We have to be careful about race conditions though: in particular notifying before or after as appropriate.

kinnison · 2020-03-15T09:38:19Z

One aspect of rustup's data which needs locking and isn't on that list is the configuration (settings.toml) which can be altered by a number of rustup commands such as rustup default XXX or rustup override ...


Xuanwo · 2023-11-21T08:21:57Z

For those encountering a similar problem or directed to this issue, here's a quick workaround from #3530:

Ensure the rust-toolchain is set up by running cargo version before using other tools like maturin.

brson added bug help wanted labels Mar 16, 2017

brson mentioned this issue Mar 17, 2017

inability to handle multiprocess target add breaks rustup #926

Closed

Diggsey added this to Features (inclination: accept) in Issue Categorisation May 3, 2017

This was referenced May 28, 2018

failed to install component: 'rustfmt-preview.. detected conflict: '"share/doc/rustfmt/README.md"' #1359

Closed

Possible race condition when adding target while updating with rustup #1348

Closed

rbtcollins added the enhancement label Feb 5, 2020

rbtcollins mentioned this issue Mar 2, 2020

running cargo in a custom toolchain is racy #2247

Open

rbtcollins mentioned this issue Mar 8, 2020

rustup should not delete the entire ~/.rustup/tmp dir #2246

Open

matklad mentioned this issue Mar 15, 2020

feature: support multiple targets actions-rs/toolchain#62

Open

metajack mentioned this issue Jun 25, 2020

[rust] update 1.44.1, code coverage tested as well -- adding in backtraces diem/diem#4708

Closed

This was referenced Jul 15, 2020

Corrupt /missing manifests lead to installation conflicts #2417

Open

Add Document of "How to Install Rust Environment for Multiple Users on Linux?" #2383

Closed

pan93412 mentioned this issue Jun 7, 2022

fix(gitpod): workaround the rustup concurrent issue UnblockNeteaseMusic/server-rust#162

Merged

pan93412 added a commit to pan93412/ferrumfix that referenced this issue Jun 8, 2022

fix(gitpod): workaround rustup's concurrent issue

22ff0a4

Related: rust-lang/rustup#988 Related: rust-lang/rustup#2417

pan93412 added a commit to wmjtyd/ferrumfix-fork that referenced this issue Jun 20, 2022

fix(gitpod): workaround rustup's concurrent issue

fc36f70

Related: rust-lang/rustup#988 Related: rust-lang/rustup#2417

cataggar mentioned this issue Aug 15, 2022

rust-toolchain.toml causes 'cargo' component, is not applicable Azure/azure-sdk-for-rust#1009

Closed

messense mentioned this issue Aug 21, 2022

Missing package: orjson piwheels/packages#237

Open

4 tasks

repi mentioned this issue Aug 25, 2022

Rust wants EmbarkStudios/rust-ecosystem#84

Open

35 tasks

djc mentioned this issue Nov 8, 2022

rustup is not robust to concurrent installations of the same toolchain #3107

Closed

Enselic mentioned this issue Nov 24, 2022

Replace test-invocation-variants.sh with cargo test Enselic/cargo-public-api#215

Merged

ryanking13 mentioned this issue Feb 10, 2023

Issue in building rust-based packages in parallel pyodide/pyodide#3565

Closed

rbtcollins mentioned this issue Feb 23, 2023

"directory does not exist" errors when rustup update #3230

Closed

mitsuhiko mentioned this issue May 20, 2023

Support Hardlinked Shim/Proxy Deletion mitsuhiko/self-replace#15

Open

hi-rustin mentioned this issue Jun 5, 2023

Verification for existing command execution while update #3376

Closed

rbtcollins mentioned this issue Oct 15, 2023

Problems with .rustup\tmp ..again #3494

Closed

messense mentioned this issue Nov 21, 2023

Flaky conflict installing rust-src on GitHub actions runners #3530

Closed

jrvanwhy mentioned this issue Nov 21, 2023

Building a process binary for two different platforms with the same target triple causes a race condition. tock/libtock-rs#366

Closed

rami3l mentioned this issue Dec 3, 2023

error: toolchain 'stable-x86_64-pc-windows-msvc' does not support components #1793

Closed

rami3l mentioned this issue Dec 13, 2023

Ubuntu 20.04 LTS: error: linking with cc failed: exit status: 1 random happen even reinstall toolchain #3581

Closed

dtolnay mentioned this issue Dec 27, 2023

strip rust src lines during normalization dtolnay/trybuild#247

Closed

rami3l added this to the On Deck milestone Jan 17, 2024

rami3l mentioned this issue Feb 5, 2024

i can't rustup update #3660

Closed

jforissier mentioned this issue Feb 8, 2024

Rust error "could not rename downloaded file" OP-TEE/build#727

Closed

This was referenced Feb 9, 2024

Rust can't delete it's own tmp #3465

Closed

Self update fails on Windows ("Access is denied") #1186

Closed

rami3l mentioned this issue Feb 19, 2024

Component conflicts against it's own manifest? #3676

Closed

ChrisDenton mentioned this issue Mar 2, 2024

Concurrent rustup toolchain add overwrite each others downloads #3690

Closed

nomeata mentioned this issue Mar 9, 2024

Create toolchain directory atomically leanprover/elan#121

Merged

rami3l mentioned this issue Mar 15, 2024

error: failed to install component: 'rust-src', detected conflict: 'lib/rustlib/src/rust/Cargo.lock' #3716

Closed

rbtcollins mentioned this issue Apr 1, 2024

rustup 1.27 fails when installing targets on Windows in GitHub Actions #3709

Open

Alexendoo mentioned this issue May 5, 2024

error: the 'cargo' binary, normally provided by the 'cargo' component, is not applicable to the '1.78.0-x86_64-unknown-linux-gnu' toolchain rust-lang/rust-clippy#12763

Closed
