
Abstracting away the gpu #46

Merged · 51 commits · Apr 1, 2024
Conversation

parthraut
Collaborator

@parthraut parthraut commented Mar 26, 2024

PR is mostly done. A few notes:

  1. `ensure_homogeneous` has not been implemented; I was a little confused about what it should check. Should it check that all GPUs are the same kind? E.g., all GPUs must be the same model and GPU architecture.
  2. Some methods of the AMD GPU class have not been implemented, and neither has their exception handling. I focused on NVIDIA GPUs for this PR, but extending this should not be difficult if needed.
  3. Style: CI is failing because many names have uppercase letters ("N802 Function name `getTotalEnergyConsumption` should be lowercase"). We decided on camel case to match pynvml method names, so what should be done to resolve this?

@parthraut parthraut linked an issue Mar 26, 2024 that may be closed by this pull request

vercel bot commented Mar 26, 2024

The latest updates on your projects:

| Name | Status | Updated (UTC) |
| ---- | ------ | ------------- |
| zeus | ❌ Failed | Apr 1, 2024 1:48am |

@jaywonchung
Member

Could you rebase to master? There was a small change in docs/requirements.txt.

@jaywonchung
Member

Thanks a lot for your work! I'll look at the code soon.

Responses to your questions:

  1. I think ensuring that the device names (`nvmlDeviceGetName`) are equal is enough.
  2. Sounds good!
  3. See the bottom of pyproject.toml. There you can disable specific Ruff error codes for entire files.
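For reference, such a per-file ignore in pyproject.toml could look roughly like the following (the exact file path and table name here are illustrative, not copied from the repo):

```toml
[tool.ruff.per-file-ignores]
# Allow pynvml-style camelCase method names in the GPU abstraction layer.
"zeus/device/gpu.py" = ["N802"]
```

This silences only the N802 naming rule, and only for the listed file, so the rest of the codebase keeps the lowercase-function-name check.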

@jaywonchung jaywonchung changed the title 23 abstracting away the gpu Abstracting away the gpu Mar 26, 2024
@parthraut
Collaborator Author

parthraut commented Mar 27, 2024

  • Added an implementation for `ensure_homogeneous`.
  • Modified pyproject.toml to fix the style check, but the tests still look like they're failing on GitHub Actions with some weird error. I don't see this error when I run them locally; everything passes.
  • The only file left to finish is `PerseusOptimizer` in zeus/optimizer/perseus/optimizer.py. In the `PerseusOptimizer` class:

```python
class PerseusOptimizer(Callback):
    """Perseus optimizer."""

    def __init__(
        self,
        rank: int,
        dp_rank: int,
        pp_rank: int,
        tp_rank: int,
        device_id: int,
        dp_degree: int,
        pp_degree: int,
        tp_degree: int,
        world_size: int,
        server_url: str,
        job_metadata: str | None = None,
    ) -> None:
        """Initialize the Perseus optimizer.

        Assumptions:
            - `torch.distributed` has been initialized.
            - `torch.cuda.set_device` has been called with `device_id`.
                This is needed to broadcast the job ID to all ranks.

        The master process (rank 0) will register the job with the Perseus
        server and retrieve the job ID of this job. Then, each rank will
        report itself to the Perseus server with the job ID.

        Args:
            rank: Global rank of the current process.
            dp_rank: Rank in the data parallel group.
            pp_rank: Rank in the pipeline parallel group.
            tp_rank: Rank in the tensor parallel group.
            device_id: CUDA device ID that the current process manages.
            dp_degree: Size of the data parallel group.
            pp_degree: Size of the pipeline parallel group.
            tp_degree: Size of the tensor parallel group.
            world_size: Total number of ranks that participate in training.
            server_url: URL of the Perseus server.
            job_metadata: An optional arbitrary string that describes the job. This will
                be appended to the job ID if given. Typically for logging purposes.
        """
```

Here, is `device_id` reindexed to respect CUDA_VISIBLE_DEVICES?

For example, if there are four GPUs on the system and CUDA_VISIBLE_DEVICES="0,2", to access the GPU with physical index 2, should I pass `device_id=1` or keep it as 2? ZeusMonitor reindexes, but from the docstring above I wasn't sure.

^ After doing some research, I assume it's reindexed. Finished optimizer.py with this assumption.
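For what it's worth, the reindexing behavior can be illustrated with a small standalone sketch (the helper below is purely illustrative, not part of Zeus): under CUDA_VISIBLE_DEVICES="0,2", visible device 1 corresponds to physical GPU 2.

```python
import os

def visible_to_physical(visible_index: int) -> int:
    """Map a reindexed (visible) CUDA device index to its physical GPU index."""
    mask = os.environ.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return visible_index  # No masking: visible and physical indices coincide.
    physical_indices = [int(i) for i in mask.split(",")]
    return physical_indices[visible_index]

os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
print(visible_to_physical(0))  # -> 0
print(visible_to_physical(1))  # -> 2 (visible device 1 is physical GPU 2)
```

This mirrors how CUDA itself renumbers devices: libraries like PyTorch only ever see the visible indices 0..N-1.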

@parthraut
Collaborator Author

parthraut commented Mar 29, 2024

Ready for review!

Completed tasks:

✅ Removed cuda_sync and setDevice from the GPU abstraction and put them back where they belong.
✅ Changed gpu/get_gpu to a torch_is_available()-like pattern to make it scalable.

✅ Fixed setPersistenceMode; added the TODO comment back where the SYS_ADMIN check for setting persistence mode was.
✅ Included the ability to turn persistence mode off.

✅ Fixed setPowerManagementLimit, which originally broke API consistency. Added a resetPowerManagementLimit function.
✅ Fixed the homogeneity check: convert the device names to a set and check whether its length is larger than 1.
✅ Declared self.gpus as an abstract property to ensure it exists.

✅ Fixed the names of exceptions.
✅ Fall back to ZeusGPUUnknownError.

Big fixes:
✅ Moved the lru_cache logic to the GPU class.
✅ Fixed the tests and removed gpu_indices.

Style fixes:
✅ Added module-level docstrings to zeus/__init__.py and zeus/device/__init__.py.
✅ Fixed docstrings.
✅ Removed all private attributes from docstrings.

✅ Made sure the signatures of __init__ all match, and __init__ comes first.
✅ Mentioned the unit for each method.

Questions and notes:
❓ It looks like CI is failing on GitHub Actions, but everything passes when I run it locally. I'm not sure what the issue is.
❓ What do you think about the names of the exceptions?
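As a sanity check on the set-based homogeneity approach from the list above, here is a minimal standalone sketch (the function name and the use of ValueError are illustrative; the actual implementation in Zeus may differ):

```python
def ensure_homogeneous(gpu_names: list[str]) -> None:
    """Raise if the visible GPUs are not all the same model.

    In practice, `gpu_names` would come from querying each device's name,
    e.g. via pynvml's nvmlDeviceGetName (hypothetical plumbing here).
    """
    unique_names = set(gpu_names)
    if len(unique_names) > 1:
        raise ValueError(f"Heterogeneous GPUs found: {sorted(unique_names)}")

ensure_homogeneous(["NVIDIA A40", "NVIDIA A40"])  # passes silently
```

Comparing device names catches mixed-model setups while still allowing multiple identical GPUs, which matches the earlier suggestion to compare `nvmlDeviceGetName` results.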

@parthraut parthraut self-assigned this Mar 29, 2024
Member

@jaywonchung jaywonchung left a comment


I'm outside so I just added two comments. I'll continue.

zeus/device/gpu.py Outdated Show resolved Hide resolved
TODO.txt Outdated Show resolved Hide resolved
@jaywonchung
Member

Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

@parthraut
Collaborator Author

> Exception names look good! But is there a need to export all of the exceptions in zeus/device/__init__.py? I think just the core API like get_gpu should be exported and other things kept inside zeus.device.gpu.

So initially I was only exporting the exceptions used in the source code. But you said not to selectively export some of them, so I added all of them.
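For illustration, the export-only-the-core-API layout being suggested might look like this module-level sketch (paths and the noqa comment are assumptions, not the repo's actual file):

```python
# Hypothetical zeus/device/__init__.py: export only the core API.
# Exceptions such as ZeusGPUUnknownError stay importable from zeus.device.gpu
# for callers that need to catch them, but are not re-exported here.
from zeus.device.gpu import get_gpu  # noqa: F401

__all__ = ["get_gpu"]
```

Keeping `__all__` small makes the public surface explicit while leaving everything else reachable under its full module path.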

Member

@jaywonchung jaywonchung left a comment


Just some small change suggestions. After they're all addressed (or accepted), I think we'd be good to merge without further changes. Thanks again for the great work. Now it's so much easier to interface with the GPU!!

zeus/__init__.py Outdated Show resolved Hide resolved
tests/test_monitor.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/run/dataloader.py Outdated Show resolved Hide resolved
zeus/run/dataloader.py Outdated Show resolved Hide resolved
zeus/run/master.py Outdated Show resolved Hide resolved
zeus/util/framework.py Outdated Show resolved Hide resolved
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
zeus/device/gpu.py Outdated Show resolved Hide resolved
zeus/util/__init__.py Outdated Show resolved Hide resolved
Member

@jaywonchung jaywonchung left a comment


Let's fix the doc error on master. LGTM, thanks!!

@jaywonchung jaywonchung merged commit 9c6b3b0 into master Apr 1, 2024
1 of 3 checks passed
@jaywonchung jaywonchung deleted the 23-abstracting-away-the-gpu branch April 1, 2024 01:49
Successfully merging this pull request may close these issues.

Abstracting away the GPU
2 participants