Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pytorch and Lightning 2.0 fixes #8072

Merged
merged 20 commits into from
Mar 24, 2023
Merged

Conversation

shrinath-suresh
Copy link
Contributor

@shrinath-suresh shrinath-suresh commented Mar 21, 2023

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Pytorch 2.0 and Lightning 2.0 has been released recently.

Lightning 2.0 changes are published here

Following are the impacts on mlflow pytorch library

  1. The package name has been changed from pytorch_lightning to lightning
  2. validation_epoch_end and test_epoch_end has been removed
  3. pl.Trainer.add_argparse_args(parent_parser=parser) has been removed. This line of code combines all the pytorch lightning command line arguments with the user's ArgumentParser. Instead we have to use LightningCLI as mentioned in the release link
  4. The training arguments have changed. Previously, we use --max_epochs . Now, its --trainer.max_epochs
  5. When installing lightning package, pytorch_lightning gets installed as well. Hence, we dont have to change everything in mllfow.pytorch.autolog library. But, we may need to remove older version checks.

How is this patch tested?

Observation:
In multi gpu tests, the lightning throws _thread.lock object error while pickling the model and the model is not getting logged into mlflow. However, the issue has been fixed in lightning master branch. Attaching screenshot for reference

screenshots.zip

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

shrinath-suresh and others added 10 commits March 20, 2023 23:53
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Jagadeesh J <jagadeeshj@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@github-actions github-actions bot added area/examples Example code area/models MLmodel format, model serialization/deserialization, flavors rn/bug-fix Mention under Bug Fixes in Changelogs. labels Mar 21, 2023
@mlflow-automation
Copy link
Collaborator

mlflow-automation commented Mar 21, 2023

Documentation preview for 3ade269 will be available here when this CircleCI job completes successfully.

More info

@harupy
Copy link
Member

harupy commented Mar 21, 2023

@shrinath-suresh thanks for the PR, ping me when it's ready for review!

@dbczumar dbczumar requested a review from harupy March 22, 2023 00:23
@shrinath-suresh
Copy link
Contributor Author

@shrinath-suresh thanks for the PR, ping me when it's ready for review!

sure @harupy . We are working on the GPU tests. Will let you know once the PR is ready for review

…ing older versions

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@shrinath-suresh shrinath-suresh changed the title [WIP] Pytorch and Lightning 2.0 fixes Pytorch and Lightning 2.0 fixes Mar 22, 2023
@shrinath-suresh
Copy link
Contributor Author

@harupy The PR is ready for review

@shrinath-suresh
Copy link
Contributor Author

@harupy Observing the following error message in CI - Message: 'Could not install packages due to an OSError: [Errno 28] No space left on device\n' ... Some of the test cases are failing during the package installation.

And the cross version tests are failing.. When i check the failure logs.. it has failed in import pytorch_lightning as pl . And in the master branch autolog code, the import is in try except block with pass in except - code link

For running autolog test, pytorch_lightning or lightning is mandatory right ?

@harupy
Copy link
Member

harupy commented Mar 23, 2023

Message: 'Could not install packages due to an OSError: [Errno 28] No space left on device\n'

Where did you see this error? Can you share the URL?

For running autolog test, pytorch_lightning or lightning is mandatory right ?

No, it's not mandatory. See #7627 for more details.

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@shrinath-suresh
Copy link
Contributor Author

Message: 'Could not install packages due to an OSError: [Errno 28] No space left on device\n'

Where did you see this error? Can you share the URL?

For running autolog test, pytorch_lightning or lightning is mandatory right ?

No, it's not mandatory. See #7627 for more details.

I was getting the Space error in this run - . But in the latest run i am not seeing it.

Thanks for the reference. Cross functional tests (for pytorch flavor) failed due to the import error.. i fixed it supporting all three options (lightning, pytorch lightning <=1.9.4 and pytorch flavor)

    # For lightning>=2.0.0 flavor
    try:
        import lightning as L
    except ImportError:
        # for pytorch-lightning <= 1.9.4 flavor
        try:
            import pytorch_lightning as pl
        except ImportError:
            # For PyTorch flavor
            pass
        else:
            from mlflow.pytorch._lightning_autolog import patched_fit

            safe_patch(FLAVOR_NAME, pl.Trainer, "fit", patched_fit, manage_run=True)
    else:
        from mlflow.pytorch._lightning_autolog import patched_fit

        safe_patch(FLAVOR_NAME, L.Trainer, "fit", patched_fit, manage_run=True)

@harupy
Copy link
Member

harupy commented Mar 23, 2023

@shrinath-suresh What happens if we patch both L.Trainer and pl.Trainer?

@shrinath-suresh
Copy link
Contributor Author

@shrinath-suresh What happens if we patch both L.Trainer and pl.Trainer?

I haven't tested the case where we patch both of it. Can you please explain a bit more on the context.

As of now, the logic is to patch L.Trainer for lightning package and pl.Trainer for the ptl<=1.9.4. If we patch both, ptl<=1.9.4 will not have L.Trainer and fail.

@harupy
Copy link
Member

harupy commented Mar 23, 2023

I haven't tested the case where we patch both of it. Can you please explain a bit more on the context.

Sure. If patching both Trainer is fine, we can simplify the code as I suggested.

    try:
        import pytorch_lightning as pl
    except ImportError:
        pass
    else:
        from mlflow.pytorch._lightning_autolog import patched_fit

        safe_patch(FLAVOR_NAME, pl.Trainer, "fit", patched_fit, manage_run=True)

    try:
        import lightning as L
    except ImportError:
        pass
    else:
        from mlflow.pytorch._lightning_autolog import patched_fit

        safe_patch(FLAVOR_NAME, L.Trainer, "fit", patched_fit, manage_run=True)

Just because lightning can be imported doesn't mean the user uses lightning. The user might use pytorch_lightning even when lightning is available.

Signed-off-by: Shrinath Suresh <shrinath@ideas2it.com>
@harupy
Copy link
Member

harupy commented Mar 23, 2023

Thanks for the update, let's see what happens.

@shrinath-suresh
Copy link
Contributor Author

shrinath-suresh commented Mar 23, 2023

Thanks for the update, let's see what happens.

Thanks for the suggestion @harupy . I added both and tested it in cpu and gpu. Until pytorch_lightning support is removed from lightning , we are good. At some point, we might need to move into lightning. Which may involve almost the rewrite of entire library. We can use MlflowLogger at that time and change the library. Can you please review and let me know your comments.

Copy link
Member

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/examples Example code area/models MLmodel format, model serialization/deserialization, flavors rn/bug-fix Mention under Bug Fixes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants