syntax error when using Optional[Tuple[int,int]]] #905

Closed
hugocool opened this issue May 19, 2024 · 4 comments · Fixed by #912
hugocool commented May 19, 2024

Describe the issue as clearly as possible:

When using the outlines library to generate JSON structures with the google/gemma-1.1-2b-it and solidrust/Hermes-2-Pro-Llama-3-8B-AWQ models, the JSON output is often invalid, leading to failures in downstream processes. This issue is frequently associated with InvalidSyntax errors during regex pattern parsing.

Steps/code to reproduce the bug:

  1. Initialize the Sampling Parameters and Load the Model:

    from vllm.sampling_params import SamplingParams
    from outlines.transformers.vllm import VLLM
    from huggingface_hub import snapshot_download
    from vllm import LLM
    from tempfile import TemporaryDirectory
    
    # Set the sampling parameters for the model
    sampling_params = SamplingParams(max_tokens=2048)
    
    # Define token and model path
    token = "your_hf_token"
    
    # Load google/gemma-1.1-2b-it model
    model_name = "google/gemma-1.1-2b-it"
    with TemporaryDirectory() as tmp_dir:
        snapshot_download(repo_id=model_name, local_dir=tmp_dir, token=token)
        model_gemma = LLM(model=tmp_dir, tokenizer=tmp_dir, trust_remote_code=True)
    
    # Load solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model
    model_name = "solidrust/Hermes-2-Pro-Llama-3-8B-AWQ"
    with TemporaryDirectory() as tmp_dir:
        snapshot_download(repo_id=model_name, local_dir=tmp_dir, token=token)
        model_solidrust = LLM(model=tmp_dir, tokenizer=tmp_dir, trust_remote_code=True, dtype="auto", quantization="awq")
    
    # Wrap the models with VLLM
    vllm_model_gemma = VLLM(model_gemma)
    vllm_model_solidrust = VLLM(model_solidrust)

    Note: This code downloads the model to a local temporary directory and loads it. This simulates uploading to S3 and then downloading it, as the end result is the same.

  2. Prepare the Prompts:

    prompts = [extract_job_description_summary(job['title'], job['description']) for job in df[['title', 'description']].to_dict(orient='records')]
  3. Generate JSON:

    from outlines import generate

    generator = generate.json(vllm_model_gemma, JobDescriptionSummary, whitespace_pattern="[ \n\t]?")
    results = generator(prompts, sampling_params=sampling_params)

Expected result:

The model should generate valid JSON objects conforming to the schema defined by JobDescriptionSummary.

Error message:

The output JSON is often invalid, containing syntax errors that prevent proper parsing and downstream processing.

Error Details

For the google/gemma-1.1-2b-it model:

NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'

For the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model:

NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'
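The failing fragment in the context above contains an extra closing parenthesis. outlines parses the generated pattern with interegular, but the same class of error is visible with the stdlib re module on a minimal unbalanced fragment (illustrative only, not the exact generated regex):

```python
import re

# Minimal illustrative fragment with one ')' too many, mirroring the
# ']?\\]|null))?' context in the error above (not the exact generated regex).
broken = r"(\]|null))?"

try:
    re.compile(broken)
except re.error as exc:
    # re reports the unmatched parenthesis instead of interegular's NoMatch
    print(f"re.error: {exc}")
```

Any regex engine rejects the pattern for the same underlying reason: the schema-to-regex translation emitted one more `)` than it opened.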

Outlines/Python version information:

* python: 3.10.4
* outlines: 0.0.41

Context for the issue:

@hugocool hugocool added the bug label May 19, 2024
hugocool (Author) commented:
I've encountered an interesting and perplexing behavior while working with the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model using the outlines library. When loading the model directly from the Hugging Face Hub, I encountered a CUDA out-of-memory (OOM) error, whereas loading the same model from disk previously resulted in a syntax error. Below are the details of the observations and the code used.

Observation

When loading the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model directly from Hugging Face Hub:

  1. OOM Error:

    OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 21.00 MiB is free. Process 25933 has 21.95 GiB memory in use. Of the allocated memory 20.13 GiB is allocated by PyTorch, and 373.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  2. Syntax Error when loading from disk:

    NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
    Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'
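The OOM error is likely unrelated to the regex bug; as the PyTorch message itself suggests, allocator fragmentation can sometimes be mitigated by capping the allocator's split size before launching the script (illustrative tuning only, not a fix for this issue):

```shell
# Illustrative tuning knob from the PyTorch error text above; the 512 MiB
# value is an arbitrary starting point, not a recommendation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```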

Steps/code to reproduce the OOM error:

from vllm import LLM
from vllm.sampling_params import SamplingParams
from outlines.transformers.vllm import VLLM
from outlines import generate
import pandas as pd

# Initialize the model
model_name = "solidrust/Hermes-2-Pro-Llama-3-8B-AWQ"

llm = LLM(
    model=model_name,
    tokenizer=model_name,
    trust_remote_code=True,
    dtype="auto",
    quantization="awq"
)

# Set the sampling parameters for the model
sampling_params = SamplingParams(max_tokens=2048)

# Wrap the model with VLLM
model = VLLM(llm)

# Prepare the prompts
prompts = [extract_job_description_summary(job['title'], job['description']) for job in df[['title', 'description']].to_dict(orient='records')]

# Generate JSON
generator = generate.json(model, JobDescriptionSummary, whitespace_pattern="[ \n\t]?")
results = generator(prompts, sampling_params=sampling_params)

# Extract the results into a DataFrame
data = [result.model_dump() for result in results]
extracted_texts_df = pd.DataFrame(data)

Result:

The OOM error occurs during the model loading phase when loading directly from the Hugging Face Hub, whereas a syntax error occurs when loading the model from disk.

hugocool (Author) commented:
Okay, after a lot of testing, the issue boils down to the Tuple constraint in the BaseModel.

import json
from typing import Optional, Tuple

import interegular
from pydantic import BaseModel, Field

from outlines.fsm.json_schema import build_regex_from_schema


class JobDescriptionSummary(BaseModel):
    salary_range: Optional[Tuple[int]] = Field(
        default=None,
        description="Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.",
    )


regex_str = build_regex_from_schema(json.dumps(JobDescriptionSummary.model_json_schema()))
regex_pattern = interegular.parse_pattern(regex_str)

returns:

NoMatch: Can not match at index 4072. Got ')?([\\', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', 
'^', "<Any 1 except ('(', '\\\\', '*', '^', '+', ')', '$', '[', '.', '|', '?')>", '|'].
Context(data[-10:+10]): ']*\\]|null))?([\\n ]*,'

@hugocool hugocool changed the title syntax errors that prevent proper parsing and downstream processing with JSON Generation while using the google/gemma-1.1-2b-it&solidrust/Hermes-2-Pro-Llama-3-8B-AWQ Models syntax error when using Optional[Tuple[int,int]]] May 20, 2024
rlouf (Member) commented May 22, 2024:

Pinging @lapp0

lapp0 (Collaborator) commented May 22, 2024:

Thanks for the great reproduction scripts and isolation of the problem @hugocool!

You can verify the fix with

pip install "git+https://github.com/lapp0/outlines@fix-905"

Issue details

The issue is that Tuple, defined with prefixItems, is not handled.

https://docs.pydantic.dev/latest/api/json_schema/#pydantic.json_schema.GenerateJsonSchema.tuple_schema

Your json schema:

{'properties': {'salary_range': {'anyOf': [{'maxItems': 1, 'minItems': 1, 'prefixItems': [{'type': 'integer'}], 'type': 'array'}, {'type': 'null'}], 'default': None, 'description': 'Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.', 'title': 'Salary Range'}}, 'title': 'JobDescriptionSummary', 'type': 'object'}
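To make the failure mode concrete, the schema above can be inspected directly with the stdlib; the `prefixItems` key (which pydantic emits for `Tuple` fields) is what the regex builder in outlines 0.0.41 did not handle:

```python
import json

# The JSON schema pydantic emits for Optional[Tuple[int]], copied from above.
schema = json.loads("""
{"properties": {"salary_range": {"anyOf": [
    {"maxItems": 1, "minItems": 1,
     "prefixItems": [{"type": "integer"}], "type": "array"},
    {"type": "null"}],
  "default": null,
  "description": "Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.",
  "title": "Salary Range"}},
 "title": "JobDescriptionSummary", "type": "object"}
""")

# The tuple constraint surfaces as prefixItems rather than items,
# so a schema-to-regex translator must handle it as a separate case.
tuple_variant = schema["properties"]["salary_range"]["anyOf"][0]
print(tuple_variant["prefixItems"])  # [{'type': 'integer'}]
```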

Smoke test:

from typing import Optional, Tuple

import outlines
from pydantic import BaseModel, Field


class JobDescriptionSummary(BaseModel):
    salary_range: Optional[Tuple[int]] = Field(
        default=None,
        description="Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.",
    )


model = outlines.models.transformers("microsoft/phi-2")

generator = outlines.generate.json(model, JobDescriptionSummary, whitespace_pattern="")
job = generator("Give me a job description in json format:\n")
print(repr(job))

Output:

JobDescriptionSummary(salary_range=None)
JobDescriptionSummary(salary_range=(1080,))
