
Support multipart upload for AWS Databricks #8003

Merged: harupy merged 5 commits into mlflow:master from ML-29495 on Mar 31, 2023
Conversation

@harupy (Member) commented Mar 10, 2023

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Support multipart upload for AWS Databricks so that artifacts larger than 5 GB (the S3 single-request upload limit) can be logged.
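
For context, the S3 multipart upload flow this builds on is roughly: create an upload, PUT each part to a presigned URL, collect the per-part ETags, then complete the upload. Below is a minimal sketch of that flow; the endpoint shapes and helper names are illustrative assumptions, not the PR's actual API.

import requests

CHUNK_SIZE = 100 * 1024 * 1024  # illustrative part size

def multipart_upload(local_file, presigned_part_urls, complete_url):
    # Hypothetical sketch: `presigned_part_urls` is assumed to come from a
    # create-multipart-upload call; `complete_url` finalizes the upload.
    part_etags = []
    with open(local_file, "rb") as f:
        for part_number, url in enumerate(presigned_part_urls, start=1):
            resp = requests.put(url, data=f.read(CHUNK_SIZE))
            resp.raise_for_status()
            # S3 returns an ETag per part; all of them are required to
            # complete the multipart upload
            part_etags.append({"part_number": part_number, "etag": resp.headers["ETag"]})
    # Finalize: the backend stitches the parts together server-side
    requests.post(complete_url, json={"parts": part_etags}).raise_for_status()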

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Ran the following code with the new backend implementation and verified that a large PyTorch model (> 10 MB) can be uploaded and that the output of the loaded model is identical to the output of the original model.

import logging
import pathlib
import re
import tempfile
import urllib.request
import uuid
from unittest import mock

import requests
import torch
from PIL import Image
from torchvision import transforms

import mlflow

print("""
# *************************************************************************
# Large model upload
# *************************************************************************
""")


# Prepare a model that's larger than the MPU chunk size
# model = torch.hub.load("pytorch/vision:v0.10.0", "resnet18", pretrained=True)
model = torch.hub.load("pytorch/vision:v0.10.0", "resnet152", pretrained=True)
model.eval()

# Prepare an image for testing
with tempfile.TemporaryDirectory() as tmpdir:
    url = "https://github.com/pytorch/hub/raw/master/images/dog.jpg"
    filepath = f"{tmpdir}/dog.jpg"
    urllib.request.urlretrieve(url, filepath)

    input_image = Image.open(filepath)
    # Force PIL to read the file now; the temporary directory (and the image
    # file in it) is deleted when this block exits, but the image is used below
    input_image.load()

preprocess = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

with torch.no_grad():
    output = model(input_batch)

logging.getLogger("mlflow").setLevel(logging.DEBUG)
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/harutaka.kawamura@databricks.com/test")

# Wrap requests.Session.request to log the part number of each upload request
original = requests.Session.request

regex = re.compile(r"partNumber=(\d+)")

def wrapped(*args, **kwargs):
    # Bound-method call, so args = (self, method, url, ...)
    url = args[2]
    m = regex.search(url)
    if m:
        print("part number", m.group(1))
    return original(*args, **kwargs)

with mock.patch("requests.Session.request", wrapped), mlflow.start_run() as run:
    print("----- logging model -----")
    info = mlflow.pytorch.log_model(model, "model", pip_requirements=["torch"])
    print("----- loading model -----")
    loaded_model = mlflow.pytorch.load_model(info.model_uri)

# Verify the output doesn't change
print("----- verifying output -----")
with torch.no_grad():
    output_loaded = loaded_model(input_batch)
assert torch.equal(output, output_loaded)

print("----- DONE -----")


print("""
# *************************************************************************
# > 5 GB file upload
# *************************************************************************
""")

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a 6 GB sparse file: seeking past the end and writing a single
    # byte gives the file its full logical size without writing 6 GB of data
    size = 6 * 1024 * 1024 * 1024  # 6 GB
    path = pathlib.Path(tmpdir, uuid.uuid4().hex)
    with path.open("wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

    with mock.patch("requests.Session.request", wrapped), mlflow.start_run() as run:
        # Upload it
        print("----- logging large file -----")
        mlflow.log_artifact(path)
        print("----- DONE -----")

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@mlflow-automation (Collaborator) commented Mar 10, 2023

Documentation preview for 3d7d7a8 will be available here when this CircleCI job completes successfully.

@harupy (Member Author) commented Mar 10, 2023

aws/aws-sdk-java#1537

@harupy changed the title from "[PROTOTYPE] Multipart upload with presigned URLs" to "Multipart upload for AWS Databricks" on Mar 15, 2023
@harupy changed the title from "Multipart upload for AWS Databricks" to "Support multipart upload for AWS Databricks" on Mar 15, 2023
@harupy harupy marked this pull request as draft March 15, 2023 10:06
@github-actions github-actions bot added the rn/feature Mention under Features in Changelogs. label Mar 15, 2023
Resolved review threads (outdated): test.py, mlflow/utils/rest_utils.py
@harupy harupy marked this pull request as ready for review March 17, 2023 06:31
large_file = tmp_path.joinpath("large_file")
with large_file.open("wb") as f:
    f.seek(size - 1)
    f.write(b"\0")
@WeichenXu123 (Collaborator) commented:

Can we write varied byte data to the file, and then in the following assertion check the uploaded data in each request?

@harupy (Member Author) commented Mar 17, 2023:

@WeichenXu123 Could you sketch the code? It's unclear what you're suggesting.

@WeichenXu123 (Collaborator) commented:

E.g., we can write file content like aaa...bbb...ccc..., then in the following assertion check that each part's request contains aaa..., bbb..., ccc... respectively.

@dbczumar (Collaborator) commented Mar 21, 2023:
@harupy @WeichenXu123 I agree, it would be nice to verify that we're correctly uploading the data in chunks (feel free to mock a smaller chunk size for testing if needed). Can we add an assert on the data passed to http_request_mock?
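
A hedged sketch of the suggested assertion, with a deliberately tiny mocked chunk size; `upload_in_parts` here is a stand-in for the artifact repository's part-upload logic, not the PR's actual code:

from unittest import mock

import requests

CHUNK_SIZE = 5  # mock a tiny chunk size so the test file stays small

def upload_in_parts(path, part_urls):
    # Stand-in for the code under test: PUT the file chunk by chunk
    with open(path, "rb") as f:
        for url in part_urls:
            requests.Session().request("PUT", url, data=f.read(CHUNK_SIZE))

def test_parts_contain_expected_bytes(tmp_path):
    large_file = tmp_path.joinpath("large_file")
    large_file.write_bytes(b"aaaaa" + b"bbbbb" + b"ccccc")

    uploaded = []

    def http_request_mock(self, method, url, data=None, **kwargs):
        uploaded.append(data)  # record each part's payload
        return mock.MagicMock(status_code=200)

    with mock.patch("requests.Session.request", http_request_mock):
        upload_in_parts(large_file, [f"https://example.com/part{i}" for i in (1, 2, 3)])

    # Each request body should be the corresponding slice of the file
    assert uploaded == [b"aaaaa", b"bbbbb", b"ccccc"]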


def _upload_parts(self, local_file, run_id, path, upload_id, upload_infos):
    part_etags = []
    # TODO: Parallelize part uploads
A collaborator commented:

Let's make sure to do this before shipping the feature. Do we have a follow-up PR or ticket for this?

@harupy (Member Author) commented:
I'll file a follow-up once this PR is merged.

@harupy (Member Author) commented Mar 22, 2023:
Filed #8074 (just a placeholder)
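
For reference, one common shape for the parallelized version (a hedged sketch that assumes a per-part `_upload_part` helper; not the follow-up's actual implementation):

from concurrent.futures import ThreadPoolExecutor

def _upload_parts(self, local_file, run_id, path, upload_id, upload_infos):
    # Upload parts concurrently; S3 requires the completed parts to be
    # listed in order, and the futures list preserves submission order
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [
            executor.submit(self._upload_part, local_file, run_id, path, upload_id, info)
            for info in upload_infos
        ]
        # result() re-raises any per-part failure
        return [f.result() for f in futures]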

Resolved review thread (outdated): test.py
@dbczumar (Collaborator) left a review:

LGTM with nits - thanks so much, @harupy!

@WeichenXu123 (Collaborator) left a review:

LGTM

Signed-off-by: harupy <hkawamura0130@gmail.com>
Signed-off-by: harupy <hkawamura0130@gmail.com>
Signed-off-by: harupy <hkawamura0130@gmail.com>
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: harupy <17039389+harupy@users.noreply.github.com>
@harupy harupy merged commit ed6866e into mlflow:master Mar 31, 2023
26 checks passed
@harupy harupy deleted the ML-29495 branch July 13, 2023 04:22
Labels: rn/feature (Mention under Features in Changelogs)