Fix regression in schema enforcement #8326
Conversation
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Documentation preview for 424b1e5 will be available here when this CircleCI job completes successfully. More info
mlflow/transformers.py (Outdated)

    if isinstance(value, np.ndarray):
        output[key] = json.loads(str(value.tolist()).replace("'", '"'))
    else:
        output[key] = json.loads(value.replace("'", '"'))
What does value look like?
    }
    )

    from mlflow.deployments import PredictionsResponse
Looks like we import this in each test. Can we import this at the top?
moved!
mlflow/transformers.py (Outdated)

    for key, value in data.items():
        if key == key_to_unpack:
            if isinstance(value, np.ndarray):
                output[key] = json.loads(str(value.tolist()).replace("'", '"'))
Suggested change:

    -    output[key] = json.loads(str(value.tolist()).replace("'", '"'))
    +    output[key] = value.tolist()

Can we just use value.tolist() here?
>>> import json
>>> import numpy as np
>>> value = np.array([1, 2, 3])
>>> json.loads(str(value.tolist()).replace("'", '"')) == value.tolist()
True
Changed to ast.literal_eval() and array.item() to preserve the internal structure of the array contents. Added test cases to the TableQuestionAnsweringPipeline tests to validate this.
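A minimal sketch of the revised approach (the payload contents here are illustrative, not taken from the actual serving path): the JSON-encoded table arrives as a size-1 numpy string array, so .item() extracts the string and ast.literal_eval parses it without any quote rewriting:

```python
import ast

import numpy as np

# Hypothetical serving payload: the `table` dict arrives as the string
# representation of a Python dict, wrapped in a size-1 numpy array.
value = np.array("{'Distance': ['1000', '10', '1']}")

# .item() pulls the single string out of the array; ast.literal_eval parses
# the Python-literal dict directly, with no quote substitution needed.
table = ast.literal_eval(value.item())
print(table)  # {'Distance': ['1000', '10', '1']}
```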
tests/models/test_utils.py (Outdated)

    @pytest.mark.parametrize("orient", ["list", "records"])
    def test_schema_enforcement(orient):
Looks like we already have a lot of enforcement test cases in https://github.com/mlflow/mlflow/blob/master/tests/pyfunc/test_model_export_with_loader_module_and_data_path.py. The suite name definitely doesn't match these cases.
Can we extract these cases into a better-named suite and then add this case to it with a more specific name? (test_schema_enforcement doesn't really tell me what's going on here.)
Refactored and renamed the new suite to reflect what it does: validating pyfunc input types through schema enforcement. The tests remaining in the old suite relate to that suite's original name.
mlflow/transformers.py (Outdated)

    for key, value in data.items():
        if key == key_to_unpack:
            if isinstance(value, np.ndarray):
                output[key] = json.loads(str(value.tolist()).replace("'", '"'))
Relying on the fact that the __str__ of the list uses single quotes seems pretty brittle / vulnerable to breakage. Also, what if the __str__ of the list gets truncated for long inputs? Then we'd blow up.
Instead, can you read the individual values and put double quotes around them?
What happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?
Yep, this was a broken, fragile mess. Moved to ast.literal_eval and direct access to np.ndarray elements via array.item() so that any characters within the array values are preserved as-is and cast appropriately to Python primitives.
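A quick illustration of why the old quote substitution breaks (the example string is hypothetical): if a value contains an apostrophe, Python's repr wraps that value in double quotes, and the blanket replace("'", '"') then produces invalid JSON, while ast.literal_eval parses the extracted string intact:

```python
import ast
import json

import numpy as np

# Hypothetical payload whose value contains a single quote. Its string
# repr comes out as {'question': "What's the plan?"}.
value = np.array(str({"question": "What's the plan?"}))

# The old approach corrupts the embedded quotes and fails to parse.
try:
    json.loads(str(value.tolist()).replace("'", '"'))
except json.JSONDecodeError:
    print("json.loads with quote substitution fails")

# ast.literal_eval on the extracted string preserves the apostrophe.
parsed = ast.literal_eval(value.item())
print(parsed["question"])  # What's the plan?
```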
@BenWilson2 can you push changes (would love to see how this works with ast eval)
mlflow/models/utils.py (Outdated)

    and all(isinstance(value, np.ndarray) for value in pf_input.values())
    and all(value.size == 1 for value in pf_input.values())
Suggested change:

    -    and all(isinstance(value, np.ndarray) for value in pf_input.values())
    -    and all(value.size == 1 for value in pf_input.values())
    +    and all(isinstance(value, np.ndarray) and value.size == 1 for value in pf_input.values())
TY! :)
mlflow/transformers.py (Outdated)

    else:
        output[key] = json.loads(value.replace("'", '"'))
From reading this, it's unclear what the expected type of value is. Why do we expect that the value contains single quotes? Similar to above, what happens if the values themselves contain single or double quotes? Have we tested that case / can we handle it & add tests?
This is entirely removed (it was completely broken if a single quote appeared in the body of the string, as that would invert the conversion that pandas does for that entry). By moving from the json decoder to ast.literal_eval and accessing the values in the np.ndarray directly, no string manipulation is required and the parsing behaves correctly. I've added test conditions to validate this behavior for the TableQuestionAnsweringPipeline in both pyfunc and serving contexts.
    # the transformers.Pipeline API, we need to cast these arrays back to lists
    # and replace the single quotes with double quotes after extracting the
    # json-encoded `table` (a pandas DF) in order to convert it to a dict that
    # the TableQuestionAnsweringPipeline can accept and cast to a Pandas DataFrame.
Is this logic specific to TableQuestionAnsweringPipeline, or are there other models that need it?
I added test conditions (adding in single quotes) with the other input type pipelines and validated that we don't need anything similar for either scalar inputs or collections to those pipeline types.
mlflow/models/utils.py (Outdated)

    elif (
        isinstance(pf_input, dict)
        and all(isinstance(value, np.ndarray) for value in pf_input.values())
        and all(value.size == 1 for value in pf_input.values())
    ):
        pf_input = pd.DataFrame(pf_input, index=[0])
    elif isinstance(pf_input, str):
        pf_input = pd.DataFrame({"inputs": pf_input}, index=[0])
        pf_input = pd.DataFrame([pf_input])
From lines 612 to 614 in 1fa3afb:

    if isinstance(pf_input, (list, np.ndarray, dict, pd.Series)):
        try:
            pf_input = pd.DataFrame(pf_input)
the old behavior was:

    pf_input = pd.DataFrame(pf_input)

However, it seems we're now doing:

    pf_input = pd.DataFrame([pf_input])
I examined the difference between the two with an example input:

    pf_input = {"foo": np.array([1]), "bar": np.array([2])}

    print(pd.DataFrame([pf_input]))
    #   foo bar
    # 0 [1] [2]

    print(pd.DataFrame(pf_input))
    #    foo  bar
    # 0    1    2

These two DataFrames are definitely different.
Doesn't that mean we're still making a backwards incompatible change?
This check is completely invalid and not used for any of the new features anyway. Removed it and restricted the list-wrapping of pf_input to only the string type (throws in pre-2.3 MLflow) and Dict[str, scalar] (throws in pre-2.3 MLflow). Those are the only two new data structure types that are needed for transformers. Validated against an MLflow 2.2 build that the only behavioral changes are the added support for pyfunc inputs of string and Dict[str, scalar].
@@ -1903,6 +1906,29 @@ def _parse_json_encoded_dict_payload_to_dict(data, key_to_unpack):

            }
            for entry in data
        ]
    elif isinstance(data, dict):
In the inline comment, can you include an example of what this dictionary might look like? It's a bit hard to reason about from the code.
Added end-to-end examples explaining why this conversion is necessary, using a very simple data example to show the conversion that happens in serving.
    "I think this sushi might have gone off",
    "That gym smells like feet, hot garbage, and sadness",
    "I love that we have a moon",
🤣
tests/models/test_utils.py (Outdated)

    _enforce_schema(data, signature.inputs)

    pd_data = pd.DataFrame(data)

    _enforce_schema(pd_data.to_dict(orient=orient), signature.inputs)
Can we do an assert on the types returned by _enforce_schema to ensure that they're the same as they were in MLflow 2.2? (We'll need to install 2.2 and see what the output of these calls looks like in 2.2.)
Added a test, based on offline validation of the pre-2.3 logic, to the new schema enforcement suite (after removing that silly numpy.size == 1 check, which I validated fails every type of schema enforcement check when legitimate data structures are passed in with the "inputs" format).
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
    elif isinstance(pf_input, dict) and all(
        _is_scalar(value) for value in pf_input.values()
    ):
        pf_input = pd.DataFrame([pf_input])
@dbczumar this additional check is what I was referring to offline. On the serving side, a pyfunc input of {"a": "some string", "b": "some other string"} gets its values automatically cast to np.array("some string"). By checking the shape, type, and size, we're limiting this to the specific use case needed to support serving schema enforcement for the transformers Pipeline types that have this form of input.
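To illustrate the serving-side cast being guarded against (the string values are illustrative): each string arrives as a 0-dimensional numpy array, which the shape/type/size check can identify precisely:

```python
import numpy as np

# Hypothetical serving-side cast: a string value in the input dict
# arrives as a 0-dimensional numpy string array.
value = np.array("some string")

print(value.shape)   # ()
print(value.size)    # 1
print(value.item())  # some string

# The guard only fires for this exact case: size-1 ndarray values.
is_wrapped_scalar = isinstance(value, np.ndarray) and value.size == 1
print(is_wrapped_scalar)  # True
```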
    # '{"inputs": {"query": "What is the longest distance?",
    #             "table": {"Distance": ["1000", "10", "1"]}}}'
    # is converted to:
    # [{'query': array('What is the longest distance?', dtype='<U29'),
Added the raw little-endian typing here, as that is exactly what is returned when inspecting the data structure and its types. Hopefully it's not too distracting.
LGTM! Thanks @BenWilson2 !
    for key, value in data.items():
        if key == key_to_unpack:
            if isinstance(value, np.ndarray):
                output[key] = ast.literal_eval(value.item())
item() throws if value has multiple elements. Are we assuming the size of value is 1 here?
>>> np.array([1, 2, 3]).item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: can only convert an array of size 1 to a Python scalar
Yep! The value will always be an array of a single scalar when parsed. Even if two elements were passed (indicating more than 1 table or more than 1 question for a submission), the TableQuestionAnsweringPipeline would throw with an invalid input type.
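For contrast with the traceback above, this is the size-1 case the code relies on (the array contents are illustrative):

```python
import numpy as np

# A size-1 array, as produced when a single table/question pair is parsed,
# extracts cleanly to a Python string with .item().
value = np.array(["{'Distance': ['1000']}"])
print(value.item())  # {'Distance': ['1000']}
```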
* Fix schema enforcement regression
* WIP - tests will fail
* add in specific logic for transformers schema enforcement

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Larry O’Brien <larry.obrien@databricks.com>
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Adjust the schema enforcement logic for casting a pyfunc input to a Pandas DataFrame. This addresses the regression introduced in the MLflow 2.3.0 release by changes to the type validation logic used when performing the cast to DataFrame. The previous logic had a negated check that forced all non-matching input dictionary types to use a 0-valued index during DataFrame creation, which thoroughly breaks input validation for the inputs type for model serving.

The new logic only explicitly matches input types for ColSpec-defined schemas when applying the list-encapsulation conversion, allowing a valid DataFrame to be generated from scalar types (strings only) and collections of scalar types defined within an input dictionary. All other cases reach the previous behavior of constructing a Pandas DataFrame directly.
Additionally, test coverage has been added for the inputs type for model serving (an oversight that wasn't covered previously), along with further schema enforcement tests covering the inputs type data structure and the supported DataFrame-to-dict conversions for input validation against a defined schema.

The final addition in this PR fixes logic within the transformers flavor to ensure that the major input types for model serving are supported, conforming with the processed types within MLServer so that those types are mapped to the appropriate data structures for use within the pyfunc API for transformers. Additional input type validation tests have been added to the transformers suite to ensure that all major types are covered for the model types that can be run on CI.
How is this patch tested?
Validated that all local-only tests passed in manual test validation execution (for those tests whose model components exceed the available memory in CI runners)
Does this PR change the documentation?
Release Notes
Is this a user-facing change?
Fixed a schema enforcement regression in MLflow 2.3.0 that caused issues when submitting data in the "inputs" format to the model server.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes