
Add support for transformers as a named flavor #8086

Merged
merged 31 commits into from
Mar 30, 2023

Conversation

BenWilson2
Member

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Initial PR for transformers flavor, providing support for:

  • save_model
  • load_model
  • log_model

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Add support for HuggingFace transformers as a named flavor.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@mlflow-automation
Collaborator

mlflow-automation commented Mar 23, 2023

Documentation preview for 3b851a7 will be available here when this CircleCI job completes successfully.

More info

@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs. labels Mar 23, 2023
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 added the only-latest If applied, only test the latest version of each group in cross-version tests. label Mar 23, 2023
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
"""
Convert a submitted component-based model in a dictionary to a namedtuple
"""
ComponentModel = namedtuple("ComponentModel", model, rename=True)
Member

I would pull this out from this function so that we can use isinstance(..., ComponentModel)

Member Author

After looking at that type validator and how it's not really possible to do the dynamic renaming in a clean way, I deleted these functions and implemented a subclass of NamedTuple. It's a bit more complex, but it gives us the ability to do explicit type validation to ensure that only properly qualified objects are able to be processed in the module.

elif _isinstance_named_tuple(model):
    try:
        return get_task(model.model.name_or_path)
    except RuntimeError as e:
@harupy harupy (Member) Mar 29, 2023

Suggested change
except RuntimeError as e:
except Exception as e:

Catching Exception might be safer in case Hugging Face suddenly changes RuntimeError to a different error type.

Member Author

Excellent point. I also updated the handling in the pip requirements function, since the root of the issue is identical but it is called from a different transformers function that is internal to the loading of a model instance.
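The broad-catch pattern discussed above can be sketched as follows. The `get_task` stub here is hypothetical, standing in for the transformers-internal helper; the point is that catching `Exception` shields the caller from upstream changing the concrete error type between releases.

```python
def get_task(name_or_path):
    # Hypothetical stub standing in for the transformers-internal helper,
    # which may raise a different exception type in a future library version.
    raise RuntimeError(f"could not infer task for {name_or_path}")


def infer_task(model_name):
    try:
        return get_task(model_name)
    except Exception as e:
        # Broad catch: survives an upstream change of the raised error type.
        return f"task inference failed: {e}"


print(infer_task("csarron/mobilebert-uncased-squad-v2"))
```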

Comment on lines 199 to 203
def get_version_ranges(module_key):
    versions = _ML_PACKAGE_VERSIONS[module_key]["models"]
    min_version = versions["minimum"]
    max_version = versions["maximum"]
    return min_version, max_version
Member

Can we move this function to the top level?

Member Author

Excellent idea. Renamed and added usage notes.

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
pip install git+https://github.com/huggingface/transformers
models:
  minimum: "4.25.1"
  maximum: "4.27.3"
Collaborator

Suggested change
maximum: "4.27.3"
maximum: "4.27.4"

Member Author

Ah, good catch.

Comment on lines 215 to 223
from transformers import pipeline

pipeline = "csarron/mobilebert-uncased-squad-v2"

with mlflow.start_run():
    mlflow.transformers.save_model(
        transformers_model=pipeline,
        path="path/to/save/model",
    )
Collaborator

This seems to fail:

Traceback (most recent call last):
  File "save.py", line 6, in <module>
    mlflow.transformers.save_model(
  File "/Users/corey.zumar/mlflow/mlflow/utils/docstring_utils.py", line 235, in version_func
    return func(*args, **kwargs)
  File "/Users/corey.zumar/mlflow/mlflow/transformers.py", line 275, in save_model
    _validate_transformers_model_dict(transformers_model)
  File "/Users/corey.zumar/mlflow/mlflow/transformers.py", line 141, in _validate_transformers_model_dict
    model = transformers_model.model
AttributeError: 'str' object has no attribute 'model'

Collaborator

(Transformers version 4.27.3)

Member Author

I.... don't even know... what I was thinking with that.... UGH. Fix incoming.

Comment on lines 446 to 454
from transformers import pipeline

pipeline = "csarron/mobilebert-uncased-squad-v2"

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model=pipeline,
        artifact_path="my_pipeline",
    )
    "the model is not a language-based model and requires a complex input type "
    "that is currently not supported."
)
_logger.warning(f"This model is unable to be used for pyfunc prediction due to {reason}")
Collaborator

Can we clarify that the pyfunc flavor won't be added to the model?

Member Author

Added a sentence explaining this.
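The clarified warning could read along these lines. This is a sketch; the exact wording in the merged code may differ, and the extra sentence is the part that makes explicit that the pyfunc flavor is skipped:

```python
import logging

_logger = logging.getLogger(__name__)

reason = (
    "the model is not a language-based model and requires a complex input type "
    "that is currently not supported"
)
# The added sentence states explicitly that the pyfunc flavor is not attached.
message = (
    f"This model is unable to be used for pyfunc prediction because {reason}. "
    "The pyfunc flavor will not be added to this logged model."
)
_logger.warning(message)
```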

required_pkg_versions = f"``{min_ver}`` - ``{max_ver}``"

notice = (
    f"The '{integration_name}' package is known to be compatible with the following "
Collaborator

Suggested change
f"The '{integration_name}' package is known to be compatible with the following "
f"The '{integration_name}' MLflow Models integration is known to be compatible with the following "

Member Author

Changed!



@experimental
@docstring_version_compatibility_warning(integration_name=FLAVOR_NAME)
Collaborator

Can we also apply this decorator to load_model()?

Member Author

Added!
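A stripped-down sketch of how such a docstring decorator might work. The implementation here is hypothetical; the real decorator builds its notice from the version configuration, and the `load_model` signature below is illustrative only:

```python
def docstring_version_compatibility_warning(integration_name):
    # Hypothetical sketch: append a version-compatibility notice to the
    # decorated function's docstring.
    def decorator(func):
        notice = (
            f"\n\nNote: the '{integration_name}' integration has been validated "
            "only against a tested range of package versions."
        )
        func.__doc__ = (func.__doc__ or "") + notice
        return func

    return decorator


@docstring_version_compatibility_warning(integration_name="transformers")
def load_model(model_uri, dst_path=None):
    """Load a transformers object from a local file or a run."""


print("transformers" in load_model.__doc__)  # True
```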

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
    "that is currently not supported."
)
_logger.warning(
    f"This model is unable to be used for pyfunc prediction due to {reason} "
@dbczumar dbczumar (Collaborator) Mar 30, 2023

Nittiest of nits (reads better):

Suggested change
f"This model is unable to be used for pyfunc prediction due to {reason} "
f"This model is unable to be used for pyfunc prediction because {reason} "

@dbczumar dbczumar (Collaborator) left a comment

LGTM! Tested manually and it works very well for language & image use cases. Thanks @BenWilson2 !

Comment on lines 103 to 105
base_reqs = ["transformers"]
try:
    base_reqs.extend(_model_packages(model))
Member

Suggested change
base_reqs = ["transformers"]
try:
    base_reqs.extend(_model_packages(model))
try:
    base_reqs = ["transformers", *_model_packages(model)]

A minor simplification :)
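The two forms produce the same list when no exception is raised, as a quick check shows. `_model_packages` here is a hypothetical stand-in for the real requirement-inference helper:

```python
def _model_packages(model):
    # Hypothetical stand-in returning extra pip requirements for the model.
    return ["torch", "sentencepiece"]


model = object()

# Original form: seed the list, then extend inside the try block.
base_reqs = ["transformers"]
try:
    base_reqs.extend(_model_packages(model))
except Exception:
    pass

# Suggested simplification: build the whole list in one expression.
try:
    simplified = ["transformers", *_model_packages(model)]
except Exception:
    simplified = ["transformers"]

print(base_reqs == simplified)  # True
```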

@harupy harupy (Member) left a comment

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 enabled auto-merge (squash) March 30, 2023 15:37
@BenWilson2 BenWilson2 merged commit 2a7393b into mlflow:master Mar 30, 2023
@tafaust
Contributor

tafaust commented May 9, 2023

Quick question: we're currently trying to log a PyTorch Lightning module with transformers models inside it (logging the transformer model and processor via a callback).

We are currently stuck loading these models with our Lightning module wrapper. Are there any best practices from the MLflow team?

I cannot find anything in the docs.

@BenWilson2
Member Author

@tahesse what is the nature of the issue that you're experiencing? Would you like to file an issue so that we can have a place to discuss and you can share example code?

@tafaust
Contributor

tafaust commented May 9, 2023

@BenWilson2 I think it is pretty much #4245 except that our pretrained model is wrapped with PytorchLightning module like in https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

Do you think it makes sense to open a new issue for it? I feel like a best-practices section for this use case is missing from the docs. mlflow.pytorch.autolog() was of no help because the {Train,Val}Dataloaders seem to be pickled with the Lightning module; they wrap the huggingface datasets, and the caching makes it impossible to load them on a different filesystem.

@BenWilson2
Member Author

Do you have an example that works properly that you'd be willing to contribute and are interested in updating documentation guidelines for this use case? If so, please feel free to file a PR and we can discuss there :)

@BenWilson2 BenWilson2 deleted the transformers-flavor branch May 9, 2023 16:06
@tafaust
Contributor

tafaust commented May 11, 2023

@BenWilson2 Unfortunately, I am not able to open-source my company's code without permission.

@s-udhaya s-udhaya mentioned this pull request May 20, 2023
33 tasks
Labels
area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs.

5 participants