
Fix regression in schema enforcement #8326

Merged
BenWilson2 merged 3 commits from fix-schema-handling-regression into mlflow:master on Apr 26, 2023

Conversation

BenWilson2 (Member)

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Adjust the schema enforcement logic for casting a pyfunc input to a Pandas DataFrame. This addresses a regression introduced in the MLflow 2.3.0 release by changes to the type validation logic used when performing the cast to a DataFrame. The previous logic used a negated check that forced all non-matching input dictionary types to use a 0-valued index during DataFrame creation, which broke input validation for the "inputs" payload type in model serving.
The new logic explicitly matches input types only for ColSpec-defined schemas, applying the list-encapsulation conversion needed to build a valid DataFrame from scalar types (strings only) and from collections of scalar types defined within an input dictionary. All other cases fall through to the previous behavior of constructing a Pandas DataFrame directly.
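As a rough illustration of the list-encapsulation described above (a minimal sketch, not MLflow's actual code), a dictionary of scalar strings cannot be handed to pandas directly, but wrapping it in a list yields the single-row DataFrame that schema enforcement expects:

    import pandas as pd

    # Hypothetical Dict[str, scalar] pyfunc input (e.g. a question-answering payload)
    payload = {"question": "What is MLflow?", "context": "MLflow is an ML platform."}

    # pd.DataFrame(payload) raises:
    #   ValueError: If using all scalar values, you must pass an index
    # Wrapping the dict in a list produces a one-row frame instead.
    df = pd.DataFrame([payload])
    assert df.shape == (1, 2)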

Additionally, test coverage has been added for the "inputs" payload type in model serving (a case that was not covered previously), along with schema enforcement tests that exercise the "inputs" data structure and the supported DataFrame-to-dict conversions for input validation against a defined schema.

Finally, this PR fixes logic within the transformers flavor so that the major input types for model serving are supported and conform to the types processed by MLServer, ensuring those types are mapped to the appropriate data structures for use within the pyfunc API for transformers. Additional input type validation tests have been added to the transformers suite to cover all major types for the model types that can run on CI.

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Validated that all local-only tests pass via manual test execution (for tests whose model components exceed the memory available on CI runners).

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Fixed a schema enforcement regression in MLflow 2.3.0 that caused issues when submitting data in the "inputs" format to the model server.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@mlflow-automation (Collaborator) commented on Apr 26, 2023:

Documentation preview for 424b1e5 will be available here when this CircleCI job completes successfully.

@github-actions bot added the labels area/models (MLmodel format, model serialization/deserialization, flavors), area/scoring (MLflow Model server, model deployment tools, Spark UDFs), and rn/bug-fix (Mention under Bug Fixes in Changelogs) on Apr 26, 2023.
if isinstance(value, np.ndarray):
    output[key] = json.loads(str(value.tolist()).replace("'", '"'))
else:
    output[key] = json.loads(value.replace("'", '"'))
Member:

What does value look like?

}
)

from mlflow.deployments import PredictionsResponse
Member:

Looks like we import this in each test. Can we import this at the top?

Member Author:

moved!

for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = json.loads(str(value.tolist()).replace("'", '"'))
@harupy (Member) commented on Apr 26, 2023:

Suggested change:
- output[key] = json.loads(str(value.tolist()).replace("'", '"'))
+ output[key] = value.tolist()

Can we just use value.tolist() here?

>>> import numpy as np, json
>>> value = np.array([1,2,3])
>>> json.loads(str(value.tolist()).replace("'", '"')) == value.tolist()
True

Member Author:

Changed to ast.literal_eval() and array.item() to preserve the internal structure of the array contents. Added test cases to validate this in the TableQuestionAnsweringPipeline tests.
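A small sketch of the parsing approach described here, assuming the json-encoded table arrives as a size-1, string-typed numpy array (as in the serving example shown later in this thread):

    import ast
    import numpy as np

    # Assumed shape of the serialized `table` value after serving-side conversion.
    value = np.array("{'Distance': ['1000', '10', '1']}")

    # .item() extracts the raw Python string; ast.literal_eval parses the dict
    # literal without any quote substitution, so embedded quotes survive intact.
    table = ast.literal_eval(value.item())
    assert table == {"Distance": ["1000", "10", "1"]}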



@pytest.mark.parametrize("orient", ["list", "records"])
def test_schema_enforcement(orient):
Collaborator:

Looks like we already have a lot of enforcement test cases in https://github.com/mlflow/mlflow/blob/master/tests/pyfunc/test_model_export_with_loader_module_and_data_path.py. The suite name definitely doesn't match these cases.

Can we extract these cases into a better named suite and then add this case to it with a more specific name (test_schema_enforcement doesn't really tell me what's going on here).

Member Author:

Refactored and renamed the new suite to reflect what it does: validating pyfunc input types through schema enforcement. The tests remaining in the old suite still match its original name.

for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = json.loads(str(value.tolist()).replace("'", '"'))
@dbczumar (Collaborator) commented on Apr 26, 2023:

relying on the fact that the __str__ of the list uses single quotes seems pretty brittle / vulnerable to breakage. Also, what if the __str__ of the list gets truncated for long inputs? Then we'd blow up.

Instead, can you read the individual values and put double quotes around them?

Collaborator:

What happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?

Member Author:

Yep, this was a broken, fragile mess. Moved to ast.literal_eval and direct access to np.ndarray elements via array.item() so that any internal characters within the array values are preserved as-is and cast appropriately to Python primitives.
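A minimal repro of the fragility being discussed (illustrative values only): once a payload string contains a single quote, the blanket replace("'", '"') produces invalid JSON, while reading the array element directly is unaffected:

    import json
    import numpy as np

    value = np.array(["What's the longest distance?"])

    # Old approach: str() renders the element with double-quote delimiters
    # (because the string contains a single quote), and replace() then corrupts
    # it into '["What"s the longest distance?"]', which is not valid JSON.
    try:
        json.loads(str(value.tolist()).replace("'", '"'))
    except json.JSONDecodeError:
        print("old json.loads + replace approach fails")

    # New approach: access the element directly; no string surgery required.
    assert value.item() == "What's the longest distance?"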

Collaborator:

@BenWilson2 can you push changes (would love to see how this works with ast eval)

Comment on lines 632 to 633
and all(isinstance(value, np.ndarray) for value in pf_input.values())
and all(value.size == 1 for value in pf_input.values())
Member:

Suggested change:
- and all(isinstance(value, np.ndarray) for value in pf_input.values())
- and all(value.size == 1 for value in pf_input.values())
+ and all(isinstance(value, np.ndarray) and value.size == 1 for value in pf_input.values())

Member Author:

TY! :)

Comment on lines 1921 to 1922
else:
    output[key] = json.loads(value.replace("'", '"'))
Collaborator:

From reading this, it's unclear what the expected type of value is. Why do we expect that the value contains single quotes? Similar to above, what happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?

Member Author:

This is entirely removed (it was completely broken if a single quote appeared in the body of the string, as that would invert the conversion that pandas does for that entry). By moving from the json decoder to ast.literal_eval and accessing the values in the np.ndarray directly, no string manipulation is required and the parsing behaves correctly. I've added test conditions to validate this behavior for the TableQuestionAnsweringPipeline in both pyfunc and serving contexts.

# the transformers.Pipeline API, we need to cast these arrays back to lists
# and replace the single quotes with double quotes after extracting the
# json-encoded `table` (a pandas DF) in order to convert it to a dict that
# the TableQuestionAnsweringPipeline can accept and cast to a Pandas DataFrame.
Collaborator:

Is this logic specific to TableQuestionAnsweringPipeline, or are there other models that need it?

Member Author:

I added test conditions (adding in single quotes) for the other input-type pipelines and validated that we don't need anything similar for either scalar inputs or collections passed to those pipeline types.

Comment on lines 630 to 635
elif (
    isinstance(pf_input, dict)
    and all(isinstance(value, np.ndarray) for value in pf_input.values())
    and all(value.size == 1 for value in pf_input.values())
):
    pf_input = pd.DataFrame(pf_input, index=[0])
elif isinstance(pf_input, str):
    pf_input = pd.DataFrame({"inputs": pf_input}, index=[0])
    pf_input = pd.DataFrame([pf_input])
Collaborator:

From

    if isinstance(pf_input, (list, np.ndarray, dict, pd.Series)):
        try:
            pf_input = pd.DataFrame(pf_input)

the logic prior to MLflow 2.3 for this case was to do:

    pf_input = pd.DataFrame(pf_input)

However, it seems we're now doing:

    pf_input = pd.DataFrame([pf_input])

I examined the difference between the two with an example input:

pf_input = {"foo": np.array([1]), "bar": np.array([2])}

print(pd.DataFrame([pf_input]))

   foo  bar
0  [1]  [2]

print(pd.DataFrame(pf_input))

   foo  bar
0    1    2

These two DataFrames are definitely different.

Doesn't that mean we're still making a backwards incompatible change?

Member Author:

This check was completely invalid and not used for any of the new features anyway. Removed it and restricted the list wrapping of pf_input to only the string type (which throws in pre-2.3 MLflow) and Dict[str, scalar] (which also throws in pre-2.3 MLflow). Those are the only two new data structure types needed for transformers. Validated against an MLflow 2.2 build that the only behavioral changes are the added support for pyfunc inputs of string and Dict[str, scalar].
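A hedged sketch of the narrowed wrapping described here (helper names are illustrative, not MLflow's actual internals): only a bare string or a Dict[str, scalar] is list-wrapped into a single-row DataFrame, while everything else keeps the pre-2.3 pd.DataFrame(pf_input) construction:

    import numbers
    import pandas as pd

    def _is_scalar(value):  # hypothetical stand-in for a scalar check
        return isinstance(value, (str, bytes, bool, numbers.Number))

    def to_dataframe(pf_input):
        if isinstance(pf_input, str) or (
            isinstance(pf_input, dict)
            and all(_is_scalar(v) for v in pf_input.values())
        ):
            # Both of these inputs raised in pre-2.3 MLflow; list-wrapping lets
            # pandas build a one-row frame from them.
            return pd.DataFrame([pf_input])
        return pd.DataFrame(pf_input)

    to_dataframe({"a": "some string", "b": "some other string"})  # one-row frame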

@@ -1903,6 +1906,29 @@ def _parse_json_encoded_dict_payload_to_dict(data, key_to_unpack):
}
for entry in data
]
elif isinstance(data, dict):
Collaborator:

In the inline comment, can you include an example of what this dictionary might look like? It's a bit hard to reason about from the code.

Member Author:

added the end-to-end examples for why this conversion is necessary and used a very simple data example to show the conversion that happens in serving.

Comment on lines +1750 to +1752
"I think this sushi might have gone off",
"That gym smells like feet, hot garbage, and sadness",
"I love that we have a moon",
Collaborator:

🤣

Comment on lines 179 to 183
_enforce_schema(data, signature.inputs)

pd_data = pd.DataFrame(data)

_enforce_schema(pd_data.to_dict(orient=orient), signature.inputs)
@dbczumar (Collaborator) commented on Apr 26, 2023:

Can we do an assert on the types returned by _enforce_schema to ensure that they're the same as they were in MLflow 2.2? (We'll need to install 2.2 and see what the output of these calls looks like in 2.2)

Member Author:

Added a test to the new schema enforcement suite based on offline validation of the pre-2.3 logic (after removing that silly numpy size == 1 check, which I validated fails schema enforcement for legitimate data structures passed in with the "inputs" type).
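A loose sketch of the kind of parity assertion being requested (data and construction here are hypothetical, not the actual test): build the frame the way MLflow 2.2 would have and check that the result of the current path matches it:

    import numpy as np
    import pandas as pd

    # Hypothetical "inputs"-style payload after serving-side conversion.
    data = {
        "query": np.array(["What is the longest distance?"]),
        "table": np.array(["{'Distance': ['1000', '10', '1']}"]),
    }

    # Pre-2.3 construction: pd.DataFrame on the dict of length-1 arrays.
    expected = pd.DataFrame(data)

    # Stand-in for the post-fix result (illustrative only); the assertion is the
    # point: shape, columns, and values should match the 2.2 behavior.
    actual = pd.DataFrame({k: v.tolist() for k, v in data.items()})

    pd.testing.assert_frame_equal(actual, expected)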

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
    _is_scalar(value) for value in pf_input.values()
):
    pf_input = pd.DataFrame([pf_input])
elif isinstance(pf_input, dict) and all(
Member Author:

@dbczumar this additional check is what I was referring to offline. On the serving side, a pyfunc input of `{"a": "some string", "b": "some other string"}` gets automatically cast to np.array("some string") per value. By checking the shape, type, and size, we're limiting this to only the specific use case needed to support serving schema enforcement for those transformers Pipeline types that have this form of input.
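For illustration only (not the exact condition in this PR), the kind of shape/type/size restriction being described might look like the following, given that serving hands each value over as a numpy string array:

    import numpy as np

    # What serving produces for the pyfunc input
    # {"a": "some string", "b": "some other string"}, per the description above.
    served = {"a": np.array("some string"), "b": np.array("some other string")}

    # Restrict the branch to dicts whose values are size-1, string-typed ndarrays.
    assert all(
        isinstance(v, np.ndarray) and v.size == 1 and v.dtype.kind == "U"
        for v in served.values()
    )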

# '{"inputs": {"query": "What is the longest distance?",
# "table": {"Distance": ["1000", "10", "1"]}}}'
# is converted to:
# [{'query': array('What is the longest distance?', dtype='<U29'),
Member Author:

Added the raw little-endian typing here, as that is exactly what is returned when looking at the data structure and types. Hopefully it's not too distracting.

@dbczumar (Collaborator) left a review comment:

LGTM! Thanks @BenWilson2 !

@BenWilson2 merged commit cef03da into mlflow:master on Apr 26, 2023
28 checks passed
@BenWilson2 deleted the fix-schema-handling-regression branch on April 26, 2023 at 22:57
for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = ast.literal_eval(value.item())
@harupy (Member) commented on Apr 26, 2023:

item throws if value has multiple elements. Are we assuming the size of value is 1 here?

>>> np.array([1, 2, 3]).item()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: can only convert an array of size 1 to a Python scalar

Member Author:

Yep! The value will always be an array of a single scalar when parsed. Even if two elements were passed (indicating more than 1 table or more than 1 question for a submission), the TableQuestionAnsweringPipeline would throw with an invalid input type.

lobrien pushed a commit to lobrien/mlflow that referenced this pull request on May 10, 2023:
* Fix schema enforcement regression

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* WIP - tests will fail

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* add in specific logic for transformers schema enforcement

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

---------

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Larry O’Brien <larry.obrien@databricks.com>