
Abstracting away the gpu #46

Merged · 51 commits · Apr 1, 2024
Conversation

parthraut
Collaborator

@parthraut parthraut commented Mar 26, 2024

PR is mostly done. A few notes:

  1. `ensure_homogeneous` has not been implemented; I was a little confused about what it should check. Should it check that all GPUs are the same kind? E.g., all GPUs must be the same model and GPU architecture.
  2. Some methods of the AMD GPU class have not been implemented, and neither has their exception handling. I focused on NVIDIA GPUs for this PR, but extending this should not be difficult if needed.
  3. Style: CI is failing because many names have uppercase letters ("N802 Function name `getTotalEnergyConsumption` should be lowercase"). We decided on camel case to match pynvml method names, so what should be done to resolve this?

@parthraut parthraut linked an issue Mar 26, 2024 that may be closed by this pull request

vercel bot commented Mar 26, 2024

The latest updates on your projects:

| Name | Status | Updated (UTC) |
| ---- | ------ | ------------- |
| zeus | ❌ Failed | Apr 1, 2024 1:48am |

@jaywonchung
Member

Could you rebase to master? There was a small change in docs/requirements.txt.

@jaywonchung
Member

Thanks a lot for your work! I'll look at the code soon.

Responses to your questions:

  1. I think ensuring that the device names (`nvmlDeviceGetName`) are equal is enough.
  2. Sounds good!
  3. See the bottom of pyproject.toml. There you can disable specific Ruff error codes for entire files.
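For reference, such a per-file ignore in pyproject.toml could look roughly like the following (the exact file path and table name here are illustrative, not copied from the repo):

```toml
[tool.ruff.per-file-ignores]
# Allow pynvml-style camelCase method names in the GPU abstraction layer.
"zeus/device/gpu.py" = ["N802"]
```

This silences only the N802 naming rule, and only for the listed file, so the rest of the codebase keeps the lowercase-function-name check.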

@jaywonchung jaywonchung changed the title 23 abstracting away the gpu Abstracting away the gpu Mar 26, 2024
@parthraut
Collaborator Author

parthraut commented Mar 27, 2024

  • Added an implementation for `ensure_homogeneous`.
  • Modified pyproject.toml to fix the style check, but the tests still look like they're failing on GitHub Actions with some weird error. I don't see this error when I run them locally; everything passes.
  • The only file left to finish is `PerseusOptimizer` in zeus/optimizer/perseus/optimizer.py. In the `PerseusOptimizer` class:

```python
class PerseusOptimizer(Callback):
    """Perseus optimizer."""

    def __init__(
        self,
        rank: int,
        dp_rank: int,
        pp_rank: int,
        tp_rank: int,
        device_id: int,
        dp_degree: int,
        pp_degree: int,
        tp_degree: int,
        world_size: int,
        server_url: str,
        job_metadata: str | None = None,
    ) -> None:
        """Initialize the Perseus optimizer.

        Assumptions:
            - `torch.distributed` has been initialized.
            - `torch.cuda.set_device` has been called with `device_id`.
                This is needed to broadcast the job ID to all ranks.

        The master process (rank 0) will register the job with the Perseus
        server and retrieve the job ID of this job. Then, each rank will
        report itself to the Perseus server with the job ID.

        Args:
            rank: Global rank of the current process.
            dp_rank: Rank in the data parallel group.
            pp_rank: Rank in the pipeline parallel group.
            tp_rank: Rank in the tensor parallel group.
            device_id: CUDA device ID that the current process manages.
            dp_degree: Size of the data parallel group.
            pp_degree: Size of the pipeline parallel group.
            tp_degree: Size of the tensor parallel group.
            world_size: Total number of ranks that participate in training.
            server_url: URL of the Perseus server.
            job_metadata: An optional arbitrary string that describes the job. This will
                be appended to the job ID if given. Typically for logging purposes.
        """
```

Here, is `device_id` reindexed to respect CUDA_VISIBLE_DEVICES?

For example, if there are four GPUs on the system and CUDA_VISIBLE_DEVICES="0,2", to access the GPU with physical index 2, should I pass `device_id=1` or keep it as 2? ZeusMonitor reindexes, but from the docstring above I wasn't sure.

^ After doing some research, I assume it's reindexed. Finished optimizer.py with this assumption.
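For what it's worth, the reindexing behavior can be illustrated with a small standalone sketch (the helper below is purely illustrative, not part of Zeus): under CUDA_VISIBLE_DEVICES="0,2", visible device 1 corresponds to physical GPU 2.

```python
import os

def visible_to_physical(visible_index: int) -> int:
    """Map a reindexed (visible) CUDA device index to its physical GPU index."""
    mask = os.environ.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return visible_index  # No masking: visible and physical indices coincide.
    physical_indices = [int(i) for i in mask.split(",")]
    return physical_indices[visible_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
print(visible_to_physical(0))  # -> 0
print(visible_to_physical(1))  # -> 2 (visible device 1 is physical GPU 2)
```

This mirrors how CUDA itself renumbers devices: libraries like PyTorch only ever see the visible indices 0..N-1.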

@parthraut
Collaborator Author

parthraut commented Mar 29, 2024

Ready for review!

Completed tasks:

✅ Removed cuda_sync and setDevice from the GPU abstraction and put them back where they belong.
✅ Changed gpu/get_gpu to a torch_is_available()-like pattern to make it scalable.

✅ Fixed setPersistenceMode; added the TODO comment back where the SYS_ADMIN check for setting persistence mode was.
✅ Included the ability to turn persistence mode off.

✅ Fixed setPowerManagementLimit, which originally broke API consistency. Added a resetPowerManagementLimit function.
✅ Fixed the homogeneity check: convert the device names to a set and check whether its length is larger than 1.
✅ Declared self.gpus as an abstract property to ensure it exists.

✅ Fixed the names of exceptions.
✅ Fall back to ZeusGPUUnknownError.

Big fixes:
✅ Moved the lru_cache logic to the GPU class.
✅ Fixed the tests and removed gpu_indices.

Style fixes:
✅ Added module-level docstrings to zeus/__init__.py and zeus/device/__init__.py.
✅ Fixed docstrings.
✅ Removed all private attributes from docstrings.

✅ Made sure the signatures of __init__ all match, and __init__ comes first.
✅ Mentioned the unit for each method.

Questions and notes:
❓ It looks like CI is failing on GitHub Actions, but everything passes when I run it locally. I'm not sure what the issue is.
❓ What do you think about the names of the exceptions?
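As a sanity check on the set-based homogeneity approach from the list above, here is a minimal standalone sketch (the function name and the use of ValueError are illustrative; the actual implementation in Zeus may differ):

```python
def ensure_homogeneous(gpu_names: list[str]) -> None:
    """Raise if the visible GPUs are not all the same model.

    In practice, `gpu_names` would come from querying each device's name,
    e.g. via pynvml's nvmlDeviceGetName (hypothetical plumbing here).
    """
    unique_names = set(gpu_names)
    if len(unique_names) > 1:
        raise ValueError(f"Heterogeneous GPUs found: {sorted(unique_names)}")

ensure_homogeneous(["NVIDIA A40", "NVIDIA A40"])  # passes silently
```

Comparing device names catches mixed-model setups while still allowing multiple identical GPUs, which matches the earlier suggestion to compare `nvmlDeviceGetName` results.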

@parthraut parthraut self-assigned this Mar 29, 2024
Member

@jaywonchung jaywonchung left a comment


I'm outside so I just added two comments. I'll continue.

zeus/device/gpu.py Outdated Show resolved Hide resolved
TODO.txt Outdated Show resolved Hide resolved
@jaywonchung
Member

Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

@parthraut
Collaborator Author

> Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

So initially I was only exporting the exceptions used in the source code. But you said not to selectively export some of them, so I added all of them.
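For illustration, the export-only-the-core-API layout being suggested might look like this module-level sketch (paths and the noqa comment are assumptions, not the repo's actual file):

```python
# Hypothetical zeus/device/__init__.py: export only the core API.
# Exceptions such as ZeusGPUUnknownError stay importable from zeus.device.gpu
# for callers that need to catch them, but are not re-exported here.
from zeus.device.gpu import get_gpu  # noqa: F401

__all__ = ["get_gpu"]
```

Keeping `__all__` small makes the public surface explicit while leaving everything else reachable under its full module path.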

Member

@jaywonchung jaywonchung left a comment


Just some small change suggestions. After they're all addressed (or accepted), I think we'd be good to merge without further changes. Thanks again for the great work. Now it's so much easier to interface with the GPU!!

zeus/__init__.py Outdated Show resolved Hide resolved
tests/test_monitor.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/run/dataloader.py Outdated Show resolved Hide resolved
zeus/run/dataloader.py Outdated Show resolved Hide resolved
zeus/run/master.py Outdated Show resolved Hide resolved
zeus/util/framework.py Outdated Show resolved Hide resolved
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/util/__init__.py Outdated Show resolved Hide resolved
Member

@jaywonchung jaywonchung left a comment


Let's fix the doc error on master. LGTM, thanks!!

@jaywonchung jaywonchung merged commit 9c6b3b0 into master Apr 1, 2024
1 of 3 checks passed
@jaywonchung jaywonchung deleted the 23-abstracting-away-the-gpu branch April 1, 2024 01:49
Successfully merging this pull request may close these issues.

Abstracting away the GPU
2 participants