Disable memory sharing on model parameters in ddp-spawn #18238
Merged
+124 −6
Conversation
awaelchli commented Aug 8, 2023
awaelchli force-pushed from a3733be to d10e1c4
awaelchli commented Aug 13, 2023
carmocca approved these changes Aug 14, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Borda approved these changes Aug 15, 2023
Borda pushed a commit that referenced this pull request on Aug 28, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit a0ca2c8)
lantiga pushed a commit that referenced this pull request on Aug 30, 2023
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit a0ca2c8)
Labels
bug: Something isn't working
fabric: lightning.fabric.Fabric
fun: Staff contributions outside working hours - to differentiate from the "community" label
pl: Generic label for PyTorch Lightning package
ready: PRs ready to be merged
reproducibility
strategy: ddp (DistributedDataParallel)
What does this PR do?
Fixes #17399
The `torch.multiprocessing.spawn` launcher (strategy="ddp_spawn") by default enables memory sharing for all tensors passed through the spawning function, including the tensors in modules. This means that the underlying storage of these tensors is shared across all processes, and any process can read from or write to them. This can lead to inconsistencies, as demonstrated in the repro example in the linked issue. This PR disables that by cloning the weights in each process, detaching them from shared memory.

Note: this only applies when running on CPU. When running on GPU, memory won't be shared.
Hopefully, this will also help to avoid flakiness in tests.
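To make the failure mode concrete, here is a hypothetical minimal repro of the shared-memory behavior described above (not the exact script from #17399): a CPU tensor passed through `torch.multiprocessing.spawn` ends up backed by shared memory, so in-place writes in any worker are visible to the parent and to every other worker.

```python
import torch
import torch.multiprocessing as mp


def worker(rank: int, weight: torch.Tensor) -> None:
    # On CPU, tensors sent through mp.spawn arrive backed by shared memory.
    print(f"rank {rank}: is_shared={weight.is_shared()}")
    weight += 1.0  # in-place write to the shared storage


if __name__ == "__main__":
    w = torch.zeros(1)
    mp.spawn(worker, args=(w,), nprocs=2, join=True)
    # The parent's tensor was moved into shared memory when it was pickled,
    # so it reflects the workers' writes (typically 2.0 here, not 0.0).
    print(f"main: weight={w.item()}")
```

And a minimal sketch of the fix's general idea (hypothetical helper name; the PR's actual implementation may differ): in each spawned process, clone every parameter and buffer whose storage is shared, which gives it fresh, process-local memory.

```python
import torch
from torch import nn


@torch.no_grad()
def _disable_memory_sharing(module: nn.Module) -> None:
    # Hypothetical helper: replace shared-memory storage with private clones.
    for tensor in (*module.parameters(), *module.buffers()):
        if tensor.is_shared():  # True for CPU tensors passed through spawn
            tensor.data = tensor.data.clone()  # clone allocates new, non-shared storage
```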
cc @Borda @carmocca @justusschock @awaelchli