
Invert the default output parsing for TextGenerator subtypes #8279

Merged
merged 4 commits into mlflow:master on Apr 21, 2023

Conversation

BenWilson2
Member

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Inverts the default output-parsing configuration, making newline removal and prompt stripping opt-in via an `inference_config` setting.

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 requested a review from harupy April 19, 2023 15:21
@mlflow-automation
Collaborator

mlflow-automation commented Apr 19, 2023

Documentation preview for 9e1507f will be available here when this CircleCI job completes successfully.


@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/none List under Small Changes in Changelogs. labels Apr 19, 2023
Comment on lines 1653 to 1654
for to_replace, replace in replacements.items():
    data_out = data_out.replace(to_replace, replace)
Member

@harupy harupy Apr 19, 2023

What if a user doesn't want the prompt, but wants to preserve \n in the output?

Member Author

I can add an additional kwargs entry for that

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
):
    """
    Parse the output from instruction pipelines to conform with other text generator
    pipeline types and remove line feed characters and other confusing outputs
    """
-    replacements = {"\n\n": " "}
+    replacements = {"\n\n": " ", "\n": " "}
Member

@harupy harupy Apr 20, 2023

I'd use a regular expression here to shrink consecutive newline characters into a space.

\n+

I think that conveys the intention more clearly than running replacement twice.
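A quick sketch of the difference being discussed (standalone strings here are illustrative; this is not code from the PR):

```python
import re

text = "Line one.\n\n\nLine two.\nLine three."

# Chained literal replacement: "\n\n" first, then "\n". An odd-length run of
# newlines ("\n\n\n") leaves a double space behind after both passes.
chained = text.replace("\n\n", " ").replace("\n", " ")

# A single regex collapses any run of newlines into one space, regardless of
# run length, which states the intent directly.
shrunk = re.sub(r"\n+", " ", text)
```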

Member Author

good point. I'll update!

Member Author

great point. Updated and added a spacing collapse for a weird edge case that can happen

Comment on lines 2007 to 2008
saving or logging the model: `"include_prompt": False`. To remove the newline characters from within the body
of the generated text output, you can add the `"remove_newlines": True` option to the `inference_config` dictionary.
Member

@harupy harupy Apr 20, 2023

How about an option name like `shrink_newlines`? To me, `remove_newlines` sounds like `replace("\n", "")`.

Member

Other candidates (that I rejected):

  • replace_newlines: makes me wonder "replace with what?"
  • replace_newlines_with_space: clear but too long

Member Author

I like it :) changing!

 include_prompt = (
-    self.inference_config.pop("include_prompt", False) if self.inference_config else False
+    self.inference_config.pop("include_prompt", True) if self.inference_config else True
Member

Does include_prompt need to be popped out?

Member Author

Unfortunately, yes.
transformers pipeline execution validates the kwargs submitted to it. If we leave that inference kwarg entry in (by using self.inference_config.get(...)), we get:

E ValueError: The following model_kwargs are not used by the model: ['include_prompt'] (note: typos in the generate arguments will also show up in this list)
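A minimal sketch of the pop-vs-get distinction (the function and variable names here are illustrative, not MLflow's actual internals):

```python
# MLflow-only keys must be removed (popped) from the config before the
# remaining kwargs are forwarded to a transformers pipeline, because the
# pipeline rejects any model_kwargs it does not recognize.
def split_config(inference_config):
    config = dict(inference_config or {})
    # MLflow-only flag: pop it so the pipeline never sees it.
    include_prompt = config.pop("include_prompt", True)
    # Everything left is assumed to be a legitimate generation kwarg.
    return include_prompt, config

include_prompt, pipeline_kwargs = split_config(
    {"include_prompt": False, "max_new_tokens": 50}
)
# pipeline_kwargs no longer contains "include_prompt".
```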

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
):
    """
    Parse the output from instruction pipelines to conform with other text generator
    pipeline types and remove line feed characters and other confusing outputs
    """
-    replacements = {"\n\n": " "}
+    replacements = {"\n+": " ", "\\s+": " "}
Member

@harupy harupy Apr 20, 2023

Is `\\s+` a newline? If the flag name is `shrink_newlines`, we should just shrink newlines.

Shrinking multiple spaces also has a risk. For example, you ask Dolly to give python code and Dolly produces the following code:

a = "    "
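The risk can be shown concretely (a hedged illustration, not code from this PR):

```python
import re

# Collapsing all whitespace (\s+), not just newlines, corrupts generated code
# in which runs of spaces are significant.
generated = 'a = "    "'
collapsed = re.sub(r"\s+", " ", generated)
# The four-space string literal is destroyed: collapsed is now 'a = " "'.
```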

Member

@harupy harupy Apr 20, 2023

btw why do we need to replace \\s+? Does Dolly produce a response like I am<tab>Dolly?

Member Author

If we're not declaring the match condition as a raw string (i.e., r"\s+", which looks odd as a dict key), then a single \s+ and a double-escaped \\s+ produce the same pattern: \s is not a recognized Python escape sequence, so the literal "\s+" is left as a backslash followed by s.

example:

import re

data = "Just\n\n testing\n something\n\n out\n\n\nhere.\n\n"
print("raw:")
print(data)
data = re.sub("\n+", " ", data)
print("remove newlines:")
print(data)
data_single = re.sub("\s+", " ", data)
print("remove extra spaces:")
print(data_single)
data_double = re.sub("\\s+", " ", data)
print("remove extra spaces double escape:")
print(data_double)

assert data_single == data_double

outputs:

raw:
Just

 testing
 something

 out


here.


remove newlines:
Just  testing  something  out here. 
remove extra spaces:
Just testing something out here. 
remove extra spaces double escape:
Just testing something out here. 

Collaborator

@dbczumar dbczumar left a comment

LGTM! Thanks @BenWilson2 !

Member

@harupy harupy left a comment

LGTM once we address what we discussed in the standup!

@BenWilson2
Member Author

Adding: converting to collapse_whitespace, additional notebook notes that make it clear when these options are being used (why, and what their defaults are), a docstring for the method, and updated docs.
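A sketch of the resulting opt-in behavior described in this thread (a simplified stand-in, not MLflow's actual implementation; `parse_output` is a hypothetical name):

```python
import re

# Defaults match the inverted behavior of this PR: the prompt is included and
# whitespace is left intact unless the user opts in via inference_config.
def parse_output(prompt, raw_output, inference_config=None):
    config = dict(inference_config or {})
    include_prompt = config.pop("include_prompt", True)
    collapse_whitespace = config.pop("collapse_whitespace", False)
    text = raw_output
    if not include_prompt and text.startswith(prompt):
        text = text[len(prompt):].lstrip()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text

# Defaults: raw pipeline output is returned untouched.
default = parse_output("Q: hi", "Q: hi\n\nA: hello\n")

# Opt-in: prompt stripped and whitespace collapsed.
cleaned = parse_output(
    "Q: hi",
    "Q: hi\n\nA: hello\n",
    {"include_prompt": False, "collapse_whitespace": True},
)
```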

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 enabled auto-merge (squash) April 21, 2023 01:49
@BenWilson2 BenWilson2 merged commit 303d7eb into mlflow:master Apr 21, 2023
25 checks passed
lobrien pushed a commit to lobrien/mlflow that referenced this pull request May 10, 2023
…8279)

* Invert the default output parsing for TextGenerator subtypes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* PR feedback

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* PR feedback

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* final feedback on naming

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

---------

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Larry O’Brien <larry.obrien@databricks.com>
Labels
area/models MLmodel format, model serialization/deserialization, flavors rn/none List under Small Changes in Changelogs.
Projects
None yet

4 participants