
Invert the default output parsing for TextGenerator subtypes #8279

Merged
merged 4 commits into mlflow:master on Apr 21, 2023

Conversation

BenWilson2
Member

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Inverts the default output-parsing configuration, making newline removal and prompt stripping opt-in via an `inference_config` setting.

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 requested a review from harupy April 19, 2023 15:21
@mlflow-automation
Collaborator

mlflow-automation commented Apr 19, 2023

Documentation preview for 9e1507f will be available here when this CircleCI job completes successfully.


@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/none List under Small Changes in Changelogs. labels Apr 19, 2023
Comment on lines 1653 to 1654
for to_replace, replace in replacements.items():
    data_out = data_out.replace(to_replace, replace)
Member

@harupy harupy Apr 19, 2023

What if a user doesn't want the prompt, but wants to preserve \n in the output?

Member Author

I can add an additional kwargs entry for that

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
):
    """
    Parse the output from instruction pipelines to conform with other text generator
    pipeline types and remove line feed characters and other confusing outputs
    """
-    replacements = {"\n\n": " "}
+    replacements = {"\n\n": " ", "\n": " "}
Member

@harupy harupy Apr 20, 2023

I'd use a regular expression here to shrink consecutive newline characters into a space.

\n+

I think that conveys the intention more clearly than running replacement twice.
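A quick sketch of the difference being discussed (standalone strings here are illustrative; this is not code from the PR):

```python
import re

text = "Line one.\n\n\nLine two.\nLine three."

# Chained literal replacement: "\n\n" first, then "\n". An odd-length run of
# newlines ("\n\n\n") leaves a double space behind after both passes.
chained = text.replace("\n\n", " ").replace("\n", " ")

# A single regex collapses any run of newlines into one space, regardless of
# run length, which states the intent directly.
shrunk = re.sub(r"\n+", " ", text)
```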

Member Author

good point. I'll update!

Member Author

great point. Updated and added a spacing collapse for a weird edge case that can happen

Comment on lines 2007 to 2008
saving or logging the model: `"include_prompt": False`. To remove the newline characters from within the body
of the generated text output, you can add the `"remove_newlines": True` option to the `inference_config` dictionary.
Member

@harupy harupy Apr 20, 2023

How about an option name like `shrink_newlines`? To me, `remove_newlines` sounds like `replace("\n", "")`.

Member

Other candidates (that I rejected):

  • replace_newlines: makes me wonder "replace with what?"
  • replace_newlines_with_space: clear but too long

Member Author

I like it :) changing!

 include_prompt = (
-    self.inference_config.pop("include_prompt", False) if self.inference_config else False
+    self.inference_config.pop("include_prompt", True) if self.inference_config else True
Member

Does include_prompt need to be popped out?

Member Author

Unfortunately, yes.
transformers pipeline execution validates the kwargs submitted to it. If we leave that inference kwarg entry in (by using self.inference_config.get(...)), we get:

E ValueError: The following model_kwargs are not used by the model: ['include_prompt'] (note: typos in the generate arguments will also show up in this list)
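A minimal sketch of the pop-vs-get distinction (the function and variable names here are illustrative, not MLflow's actual internals):

```python
# MLflow-only keys must be removed (popped) from the config before the
# remaining kwargs are forwarded to a transformers pipeline, because the
# pipeline rejects any model_kwargs it does not recognize.
def split_config(inference_config):
    config = dict(inference_config or {})
    # MLflow-only flag: pop it so the pipeline never sees it.
    include_prompt = config.pop("include_prompt", True)
    # Everything left is assumed to be a legitimate generation kwarg.
    return include_prompt, config

include_prompt, pipeline_kwargs = split_config(
    {"include_prompt": False, "max_new_tokens": 50}
)
# pipeline_kwargs no longer contains "include_prompt".
```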

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
):
    """
    Parse the output from instruction pipelines to conform with other text generator
    pipeline types and remove line feed characters and other confusing outputs
    """
-    replacements = {"\n\n": " "}
+    replacements = {"\n+": " ", "\\s+": " "}
Member

@harupy harupy Apr 20, 2023

Is `\\s+` a newline? If the flag name is `shrink_newlines`, we should just shrink newlines.

Shrinking multiple spaces also has a risk. For example, you ask Dolly to give python code and Dolly produces the following code:

a = "    "
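The risk can be shown concretely (a hedged illustration, not code from this PR):

```python
import re

# Collapsing all whitespace (\s+), not just newlines, corrupts generated code
# in which runs of spaces are significant.
generated = 'a = "    "'
collapsed = re.sub(r"\s+", " ", generated)
# The four-space string literal is destroyed: collapsed is now 'a = " "'.
```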

Member

@harupy harupy Apr 20, 2023

btw why do we need to replace \\s+? Does Dolly produce a response like I am<tab>Dolly?

Member Author

If we're not declaring the match condition as a raw string (i.e., r"\s+", which looks odd as a dict key), then a single \s+ and a double-escaped \\s+ produce the same pattern: \s is not a recognized Python escape sequence, so the literal "\s+" is left as a backslash followed by s.

example:

import re

data = "Just\n\n testing\n something\n\n out\n\n\nhere.\n\n"
print("raw:")
print(data)
data = re.sub("\n+", " ", data)
print("remove newlines:")
print(data)
data_single = re.sub("\s+", " ", data)
print("remove extra spaces:")
print(data_single)
data_double = re.sub("\\s+", " ", data)
print("remove extra spaces double escape:")
print(data_double)

assert data_single == data_double

outputs:

raw:
Just

 testing
 something

 out


here.


remove newlines:
Just  testing  something  out here. 
remove extra spaces:
Just testing something out here. 
remove extra spaces double escape:
Just testing something out here. 

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks @BenWilson2 !

Member

@harupy harupy left a comment

LGTM once we address what we discussed in the standup!

@BenWilson2
Member Author

Adding: converting to collapse_whitespace, additional notebook notes that make it clear when these options are being used (why, and what their defaults are), a docstring for the method, and updated docs.
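A sketch of the resulting opt-in behavior described in this thread (a simplified stand-in, not MLflow's actual implementation; `parse_output` is a hypothetical name):

```python
import re

# Defaults match the inverted behavior of this PR: the prompt is included and
# whitespace is left intact unless the user opts in via inference_config.
def parse_output(prompt, raw_output, inference_config=None):
    config = dict(inference_config or {})
    include_prompt = config.pop("include_prompt", True)
    collapse_whitespace = config.pop("collapse_whitespace", False)
    text = raw_output
    if not include_prompt and text.startswith(prompt):
        text = text[len(prompt):].lstrip()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

# Defaults: raw pipeline output is returned untouched.
default = parse_output("Q: hi", "Q: hi\n\nA: hello\n")

# Opt-in: prompt stripped and whitespace collapsed.
cleaned = parse_output(
    "Q: hi",
    "Q: hi\n\nA: hello\n",
    {"include_prompt": False, "collapse_whitespace": True},
)
```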

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 enabled auto-merge (squash) April 21, 2023 01:49
@BenWilson2 BenWilson2 merged commit 303d7eb into mlflow:master Apr 21, 2023
25 checks passed
lobrien pushed a commit to lobrien/mlflow that referenced this pull request May 10, 2023
…8279)

* Invert the default output parsing for TextGenerator subtypes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* PR feedback

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* PR feedback

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* final feedback on naming

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

---------

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Larry O’Brien <larry.obrien@databricks.com>
Labels
area/models MLmodel format, model serialization/deserialization, flavors rn/none List under Small Changes in Changelogs.
Projects
None yet

4 participants