
Fix regression in schema enforcement #8326

Merged
BenWilson2 merged 3 commits from fix-schema-handling-regression into mlflow:master on Apr 26, 2023

Conversation

BenWilson2 (Member)

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Adjust the schema enforcement logic for casting a pyfunc input to a Pandas DataFrame. This addresses a regression introduced in the MLflow 2.3.0 release by changes to the type validation logic used when performing the cast to a DataFrame. The previous logic used a negated check that forced all non-matching input dictionary types to use a 0-valued index during DataFrame creation, which broke input validation for the "inputs" payload type in model serving.
The new logic explicitly matches input types only for ColSpec-defined schemas, applying the list-encapsulation conversion needed to build a valid DataFrame from scalar types (strings only) and from collections of scalar types defined within an input dictionary. All other cases fall through to the previous behavior of constructing a Pandas DataFrame directly.
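As a rough illustration of the list-encapsulation described above (a minimal sketch, not MLflow's actual code), a dictionary of scalar strings cannot be handed to pandas directly, but wrapping it in a list yields the single-row DataFrame that schema enforcement expects:

    import pandas as pd

    # Hypothetical Dict[str, scalar] pyfunc input (e.g. a question-answering payload)
    payload = {"question": "What is MLflow?", "context": "MLflow is an ML platform."}

    # pd.DataFrame(payload) raises:
    #   ValueError: If using all scalar values, you must pass an index
    # Wrapping the dict in a list produces a one-row frame instead.
    df = pd.DataFrame([payload])
    assert df.shape == (1, 2)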

Additionally, test coverage has been added for the "inputs" payload type in model serving (a case that was not covered previously), along with schema enforcement tests that exercise the "inputs" data structure and the supported DataFrame-to-dict conversions for input validation against a defined schema.

Finally, this PR fixes logic within the transformers flavor so that the major input types for model serving are supported and conform to the types processed by MLServer, ensuring those types are mapped to the appropriate data structures for use within the pyfunc API for transformers. Additional input type validation tests have been added to the transformers suite to cover all major types for the model types that can run on CI.

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Validated that all local-only tests pass via manual test execution (for tests whose model components exceed the memory available on CI runners).

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Fixed a schema enforcement regression in MLflow 2.3.0 that caused issues when submitting data in the "inputs" format to the model server.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@mlflow-automation (Collaborator) commented on Apr 26, 2023:

Documentation preview for 424b1e5 will be available here when this CircleCI job completes successfully.

@github-actions bot added the labels area/models (MLmodel format, model serialization/deserialization, flavors), area/scoring (MLflow Model server, model deployment tools, Spark UDFs), and rn/bug-fix (Mention under Bug Fixes in Changelogs) on Apr 26, 2023.
if isinstance(value, np.ndarray):
    output[key] = json.loads(str(value.tolist()).replace("'", '"'))
else:
    output[key] = json.loads(value.replace("'", '"'))
Member:

What does value look like?

}
)

from mlflow.deployments import PredictionsResponse
Member:

Looks like we import this in each test. Can we import this at the top?

Member Author:

moved!

for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = json.loads(str(value.tolist()).replace("'", '"'))
@harupy (Member) commented on Apr 26, 2023:

Suggested change:
- output[key] = json.loads(str(value.tolist()).replace("'", '"'))
+ output[key] = value.tolist()

Can we just use value.tolist() here?

>>> import numpy as np, json
>>> value = np.array([1,2,3])
>>> json.loads(str(value.tolist()).replace("'", '"')) == value.tolist()
True

Member Author:

Changed to ast.literal_eval() and array.item() to preserve the internal structure of the array contents. Added test cases to validate this in the TableQuestionAnsweringPipeline tests.
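A small sketch of the parsing approach described here, assuming the json-encoded table arrives as a size-1, string-typed numpy array (as in the serving example shown later in this thread):

    import ast
    import numpy as np

    # Assumed shape of the serialized `table` value after serving-side conversion.
    value = np.array("{'Distance': ['1000', '10', '1']}")

    # .item() extracts the raw Python string; ast.literal_eval parses the dict
    # literal without any quote substitution, so embedded quotes survive intact.
    table = ast.literal_eval(value.item())
    assert table == {"Distance": ["1000", "10", "1"]}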



@pytest.mark.parametrize("orient", ["list", "records"])
def test_schema_enforcement(orient):
Collaborator:

Looks like we already have a lot of enforcement test cases in https://github.com/mlflow/mlflow/blob/master/tests/pyfunc/test_model_export_with_loader_module_and_data_path.py. The suite name definitely doesn't match these cases.

Can we extract these cases into a better named suite and then add this case to it with a more specific name (test_schema_enforcement doesn't really tell me what's going on here).

Member Author:

Refactored and renamed the new suite to reflect what it does: validating pyfunc input types through schema enforcement. The tests remaining in the old suite still match its original name.

for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = json.loads(str(value.tolist()).replace("'", '"'))
@dbczumar (Collaborator) commented on Apr 26, 2023:

relying on the fact that the __str__ of the list uses single quotes seems pretty brittle / vulnerable to breakage. Also, what if the __str__ of the list gets truncated for long inputs? Then we'd blow up.

Instead, can you read the individual values and put double quotes around them?

Collaborator:

What happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?

Member Author:

Yep, this was a broken, fragile mess. Moved to ast.literal_eval and direct access to np.ndarray elements via array.item() so that any internal characters within the array values are preserved as-is and cast appropriately to Python primitives.
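A minimal repro of the fragility being discussed (illustrative values only): once a payload string contains a single quote, the blanket replace("'", '"') produces invalid JSON, while reading the array element directly is unaffected:

    import json
    import numpy as np

    value = np.array(["What's the longest distance?"])

    # Old approach: str() renders the element with double-quote delimiters
    # (because the string contains a single quote), and replace() then corrupts
    # it into '["What"s the longest distance?"]', which is not valid JSON.
    try:
        json.loads(str(value.tolist()).replace("'", '"'))
    except json.JSONDecodeError:
        print("old json.loads + replace approach fails")

    # New approach: access the element directly; no string surgery required.
    assert value.item() == "What's the longest distance?"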

Collaborator:

@BenWilson2 can you push changes (would love to see how this works with ast eval)

Comment on lines 632 to 633
and all(isinstance(value, np.ndarray) for value in pf_input.values())
and all(value.size == 1 for value in pf_input.values())
Member:

Suggested change:
- and all(isinstance(value, np.ndarray) for value in pf_input.values())
- and all(value.size == 1 for value in pf_input.values())
+ and all(isinstance(value, np.ndarray) and value.size == 1 for value in pf_input.values())

Member Author:

TY! :)

Comment on lines 1921 to 1922
else:
    output[key] = json.loads(value.replace("'", '"'))
Collaborator:

From reading this, it's unclear what the expected type of value is. Why do we expect that the value contains single quotes? Similar to above, what happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?

Member Author:

This is entirely removed (it was completely broken if a single quote appeared in the body of the string, as that would invert the conversion that pandas does for that entry). By moving from the json decoder to ast.literal_eval and accessing the values in the np.ndarray directly, no string manipulation is required and the parsing behaves correctly. I've added test conditions to validate this behavior for the TableQuestionAnsweringPipeline in both pyfunc and serving contexts.

# the transformers.Pipeline API, we need to cast these arrays back to lists
# and replace the single quotes with double quotes after extracting the
# json-encoded `table` (a pandas DF) in order to convert it to a dict that
# the TableQuestionAnsweringPipeline can accept and cast to a Pandas DataFrame.
Collaborator:

Is this logic specific to TableQuestionAnsweringPipeline, or are there other models that need it?

Member Author:

I added test conditions (adding in single quotes) for the other input-type pipelines and validated that we don't need anything similar for either scalar inputs or collections passed to those pipeline types.

Comment on lines 630 to 635
elif (
    isinstance(pf_input, dict)
    and all(isinstance(value, np.ndarray) for value in pf_input.values())
    and all(value.size == 1 for value in pf_input.values())
):
    pf_input = pd.DataFrame(pf_input, index=[0])
elif isinstance(pf_input, str):
    pf_input = pd.DataFrame({"inputs": pf_input}, index=[0])
    pf_input = pd.DataFrame([pf_input])
Collaborator:

From

    if isinstance(pf_input, (list, np.ndarray, dict, pd.Series)):
        try:
            pf_input = pd.DataFrame(pf_input)

the logic prior to MLflow 2.3 for this case was to do:

    pf_input = pd.DataFrame(pf_input)

However, it seems we're now doing:

    pf_input = pd.DataFrame([pf_input])

I examined the difference between the two with an example input:

pf_input = {"foo": np.array([1]), "bar": np.array([2])}

print(pd.DataFrame([pf_input]))

   foo  bar
0  [1]  [2]

print(pd.DataFrame(pf_input))

   foo  bar
0    1    2

These two DataFrames are definitely different.

Doesn't that mean we're still making a backwards incompatible change?

Member Author:

This check was completely invalid and not used for any of the new features anyway. Removed it and restricted the list wrapping of pf_input to only the string type (which throws in pre-2.3 MLflow) and Dict[str, scalar] (which also throws in pre-2.3 MLflow). Those are the only two new data structure types needed for transformers. Validated against an MLflow 2.2 build that the only behavioral changes are the added support for pyfunc inputs of string and Dict[str, scalar].
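A hedged sketch of the narrowed wrapping described here (helper names are illustrative, not MLflow's actual internals): only a bare string or a Dict[str, scalar] is list-wrapped into a single-row DataFrame, while everything else keeps the pre-2.3 pd.DataFrame(pf_input) construction:

    import numbers
    import pandas as pd

    def _is_scalar(value):  # hypothetical stand-in for a scalar check
        return isinstance(value, (str, bytes, bool, numbers.Number))

    def to_dataframe(pf_input):
        if isinstance(pf_input, str) or (
            isinstance(pf_input, dict)
            and all(_is_scalar(v) for v in pf_input.values())
        ):
            # Both of these inputs raised in pre-2.3 MLflow; list-wrapping lets
            # pandas build a one-row frame from them.
            return pd.DataFrame([pf_input])
        return pd.DataFrame(pf_input)

    to_dataframe({"a": "some string", "b": "some other string"})  # one-row frame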

@@ -1903,6 +1906,29 @@ def _parse_json_encoded_dict_payload_to_dict(data, key_to_unpack):
}
for entry in data
]
elif isinstance(data, dict):
Collaborator:

In the inline comment, can you include an example of what this dictionary might look like? It's a bit hard to reason about from the code.

Member Author:

added the end-to-end examples for why this conversion is necessary and used a very simple data example to show the conversion that happens in serving.

Comment on lines +1750 to +1752
"I think this sushi might have gone off",
"That gym smells like feet, hot garbage, and sadness",
"I love that we have a moon",
Collaborator:

🤣

Comment on lines 179 to 183
_enforce_schema(data, signature.inputs)

pd_data = pd.DataFrame(data)

_enforce_schema(pd_data.to_dict(orient=orient), signature.inputs)
@dbczumar (Collaborator) commented on Apr 26, 2023:

Can we do an assert on the types returned by _enforce_schema to ensure that they're the same as they were in MLflow 2.2? (We'll need to install 2.2 and see what the output of these calls looks like in 2.2)

Member Author:

Added a test to the new schema enforcement suite based on offline validation of the pre-2.3 logic (after removing that silly numpy size == 1 check, which I validated fails schema enforcement for legitimate data structures passed in with the "inputs" type).
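A loose sketch of the kind of parity assertion being requested (data and construction here are hypothetical, not the actual test): build the frame the way MLflow 2.2 would have and check that the result of the current path matches it:

    import numpy as np
    import pandas as pd

    # Hypothetical "inputs"-style payload after serving-side conversion.
    data = {
        "query": np.array(["What is the longest distance?"]),
        "table": np.array(["{'Distance': ['1000', '10', '1']}"]),
    }

    # Pre-2.3 construction: pd.DataFrame on the dict of length-1 arrays.
    expected = pd.DataFrame(data)

    # Stand-in for the post-fix result (illustrative only); the assertion is the
    # point: shape, columns, and values should match the 2.2 behavior.
    actual = pd.DataFrame({k: v.tolist() for k, v in data.items()})

    pd.testing.assert_frame_equal(actual, expected)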

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
    _is_scalar(value) for value in pf_input.values()
):
    pf_input = pd.DataFrame([pf_input])
elif isinstance(pf_input, dict) and all(
Member Author:

@dbczumar this additional check is what I was referring to offline. On the serving side, a pyfunc input of `{"a": "some string", "b": "some other string"}` gets automatically cast to np.array("some string") per value. By checking the shape, type, and size, we're limiting this to only the specific use case needed to support serving schema enforcement for those transformers Pipeline types that have this form of input.
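For illustration only (not the exact condition in this PR), the kind of shape/type/size restriction being described might look like the following, given that serving hands each value over as a numpy string array:

    import numpy as np

    # What serving produces for the pyfunc input
    # {"a": "some string", "b": "some other string"}, per the description above.
    served = {"a": np.array("some string"), "b": np.array("some other string")}

    # Restrict the branch to dicts whose values are size-1, string-typed ndarrays.
    assert all(
        isinstance(v, np.ndarray) and v.size == 1 and v.dtype.kind == "U"
        for v in served.values()
    )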

# '{"inputs": {"query": "What is the longest distance?",
# "table": {"Distance": ["1000", "10", "1"]}}}'
# is converted to:
# [{'query': array('What is the longest distance?', dtype='<U29'),
Member Author:

Added the raw little-endian typing here, as that is exactly what is returned when looking at the data structure and types. Hopefully it's not too distracting.

@dbczumar (Collaborator) left a review comment:

LGTM! Thanks @BenWilson2 !

@BenWilson2 merged commit cef03da into mlflow:master on Apr 26, 2023
28 checks passed
@BenWilson2 deleted the fix-schema-handling-regression branch on April 26, 2023 at 22:57
for key, value in data.items():
    if key == key_to_unpack:
        if isinstance(value, np.ndarray):
            output[key] = ast.literal_eval(value.item())
@harupy (Member) commented on Apr 26, 2023:

item throws if value has multiple elements. Are we assuming the size of value is 1 here?

>>> np.array([1, 2, 3]).item()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: can only convert an array of size 1 to a Python scalar

Member Author:

Yep! The value will always be an array of a single scalar when parsed. Even if two elements were passed (indicating more than 1 table or more than 1 question for a submission), the TableQuestionAnsweringPipeline would throw with an invalid input type.

lobrien pushed a commit to lobrien/mlflow that referenced this pull request on May 10, 2023:
* Fix schema enforcement regression

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* WIP - tests will fail

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* add in specific logic for transformers schema enforcement

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

---------

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Larry O’Brien <larry.obrien@databricks.com>