Support weight only quantization with intel-extension-for-transformers
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
PenghuiCheng committed Dec 11, 2023
1 parent d9bfdc9 commit ad38444
Showing 6 changed files with 667 additions and 1 deletion.
323 changes: 323 additions & 0 deletions docs/docs/integrations/llms/weight_only_quantization.ipynb

Large diffs are not rendered by default.

62 changes: 62 additions & 0 deletions docs/docs/integrations/providers/weight_only_quantization.mdx
@@ -0,0 +1,62 @@
# Weight-Only Quantization via Intel® Extension for Transformers

>[Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)
>(ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on 4th generation Intel Xeon Scalable processors (code name Sapphire Rapids).
>
>Weight-only quantization is a technique used in deep learning to reduce the memory and computational requirements of neural networks. In the context of deep neural networks, the model parameters, also known as weights, are typically represented using floating-point numbers, which can consume a significant amount of memory and require intensive computational resources.
Quantization is a process that involves reducing the precision of these weights by representing them using a smaller number of bits. Weight-only quantization specifically focuses on quantizing the weights of the neural network while keeping other components, such as activations, in their original precision.
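
To make the idea concrete, here is a small illustrative NumPy sketch of symmetric round-to-nearest 4-bit weight quantization with per-group scales; it only shows the arithmetic involved and is not the ITREX implementation.

```python
import numpy as np


def quantize_int4_sym(weights: np.ndarray, group_size: int = 32):
    """Round-to-nearest, symmetric 4-bit quantization with one fp32 scale per group."""
    grouped = weights.reshape(-1, group_size)
    # The largest magnitude in each group maps to the int4 limit (7 for the "sym" scheme).
    scale = np.abs(grouped).max(axis=1, keepdims=True) / 7.0
    codes = np.clip(np.round(grouped / scale), -7, 7).astype(np.int8)  # stored 4-bit codes
    return codes, scale


def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Computation later runs on the dequantized values (see compute_dtype below).
    return codes.astype(np.float32) * scale


w = np.random.randn(128).astype(np.float32)
codes, scale = quantize_int4_sym(w)
print("max abs error:", np.abs(w - dequantize(codes, scale).reshape(-1)).max())
```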

## Introduction

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Compared to [normal quantization](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/quantization.md) such as W8A8, weight-only quantization is usually a better trade-off between performance and accuracy: the bottleneck of deploying LLMs is memory bandwidth, and weight-only quantization typically leads to better accuracy.

## Installation and Setup

Install the `intel-extension-for-transformers` Python package:

```bash
pip install intel-extension-for-transformers
```

## Examples

See a [usage example](../docs/integrations/llms/weight_only_quantization.ipynb).
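
For a quick reference, here is a minimal sketch of the pattern covered in that notebook (the model id, prompt, and generation arguments are only illustrative):

```python
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain.llms import WeightOnlyQuantPipeline

# Quantize the weights to 4-bit NormalFloat while loading the model.
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
print(llm("Translate to German: How are you?"))
```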

## Configuration Parameters

The `WeightOnlyQuantConfig` class accepts the following parameters.

#### weight_dtype (string): Weight Data Type, Default is "nf4".
The weights can be quantized to the following data types for storage (`weight_dtype` in `WeightOnlyQuantConfig`):
* **int8**: Uses 8-bit data type.
* **int4_fullrange**: Uses the full int4 range [-8, 7], including the -8 value, compared with the normal int4 range [-7, 7].
* **int4_clip**: Clips and retains the values within the int4 range, setting others to zero.
* **nf4**: Uses the normalized float 4-bit data type.
* **fp4_e2m1**: Uses the regular float 4-bit data type. "e2" means that 2 bits are used for the exponent, and "m1" means that 1 bit is used for the mantissa.

#### compute_dtype (string): Computing Data Type, Default is "fp32".
While the weights are stored in 4 or 8 bits, computation still happens in float32, bfloat16, or int8 (`compute_dtype` in `WeightOnlyQuantConfig`):
* **fp32**: Uses the float32 data type to compute.
* **bf16**: Uses the bfloat16 data type to compute.
* **int8**: Uses 8-bit data type to compute.

#### llm_int8_skip_modules (list of module names): Modules to Skip Quantization, Default is None.
A list of modules for which quantization is skipped.

#### scale_dtype (string): The Scale Data Type, Default is "fp32".
Currently only "fp32" (float32) is supported.

#### mse_range (boolean): Whether to Search for the Best Clip Range (Ratios from 0.805 to 1.0 in Steps of 0.005), Default is False.
#### use_double_quant (boolean): Whether to Quantize the Scale, Default is False.
Not supported yet.
#### double_quant_dtype (string): Reserved for Double Quantization.
#### double_quant_scale_dtype (string): Reserved for Double Quantization.
#### group_size (int): Group Size Used During Quantization.
#### scheme (string): The Format the Weights Are Quantized to, Default is "sym".
* **sym**: Symmetric.
* **asym**: Asymmetric.
#### algorithm (string): The Algorithm Used to Improve Accuracy, Default is "RTN".
* **RTN**: Round-to-nearest (RTN), the most intuitive quantization method: each weight is simply rounded to the nearest representable low-precision value.
* **AWQ**: Activation-aware weight quantization. Protecting only about 1% of salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distribution of activations and weights, and the salient weights are multiplied by a large scale factor before quantization to preserve them.
* **TEQ**: A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization.
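
As a sketch, combining several of the options above into one configuration might look like the following. Only `weight_dtype` is exercised by the tests in this change; the other keyword names follow the parameter list above, so treat them as an assumption rather than a verified call signature.

```python
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

# Hypothetical combination of the documented options (see the hedging note above).
conf = WeightOnlyQuantConfig(
    weight_dtype="int4_clip",  # store weights as clipped int4
    compute_dtype="bf16",      # run the matmuls in bfloat16
    group_size=128,            # one scale per group of 128 weights
    scheme="asym",             # asymmetric quantization
    algorithm="AWQ",           # activation-aware weight quantization
)
```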
10 changes: 10 additions & 0 deletions libs/langchain/langchain/llms/__init__.py
@@ -238,6 +238,12 @@ def _import_huggingface_pipeline() -> Any:
return HuggingFacePipeline


def _import_weight_only_pipeline() -> Any:
from langchain.llms.weight_only_quantization import WeightOnlyQuantPipeline

return WeightOnlyQuantPipeline


def _import_huggingface_text_gen_inference() -> Any:
from langchain.llms.huggingface_text_gen_inference import (
HuggingFaceTextGenInference,
@@ -695,6 +701,8 @@ def __getattr__(name: str) -> Any:
return _import_watsonxllm()
elif name == "Writer":
return _import_writer()
elif name == "WeightOnlyQuantPipeline":
return _import_weight_only_pipeline()
elif name == "Xinference":
return _import_xinference()
elif name == "YandexGPT":
@@ -748,6 +756,7 @@ def __getattr__(name: str) -> Any:
"HuggingFacePipeline",
"HuggingFaceTextGenInference",
"HumanInputLLM",
"WeightOnlyQuantPipeline",
"KoboldApiLLM",
"LlamaCpp",
"TextGen",
@@ -872,6 +881,7 @@ def get_type_to_cls_dict() -> Dict[str, Callable[[], Type[BaseLLM]]]:
"vllm_openai": _import_vllm_openai,
"watsonxllm": _import_watsonxllm,
"writer": _import_writer,
"weight_only_quantization": _import_weight_only_pipeline,
"xinference": _import_xinference,
"javelin-ai-gateway": _import_javelin_ai_gateway,
"qianfan_endpoint": _import_baidu_qianfan_endpoint,
205 changes: 205 additions & 0 deletions libs/langchain/langchain/llms/weight_only_quantization.py
@@ -0,0 +1,205 @@
from typing import Any, List, Mapping, Optional

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens
from langchain.pydantic_v1 import Extra

DEFAULT_MODEL_ID = "google/flan-t5-large"
DEFAULT_TASK = "text2text-generation"
VALID_TASKS = ("text2text-generation", "text-generation", "summarization")


class WeightOnlyQuantPipeline(LLM):
"""Weight only quantized model.
To use, you should have the `intel-extension-for-transformers` and `transformers` packages installed.
intel-extension-for-transformers: https://github.com/intel/intel-extension-for-transformers
Example using from_model_id:
.. code-block:: python
from langchain.llms import WeightOnlyQuantPipeline
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
config = WeightOnlyQuantConfig()
hf = WeightOnlyQuantPipeline.from_model_id(
model_id="google/flan-t5-large",
task="text2text-generation"
pipeline_kwargs={"max_new_tokens": 10},
quantization_config=config,
)
Example passing pipeline in directly:
.. code-block:: python
from langchain.llms import WeightOnlyQuantPipeline
from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from transformers import AutoTokenizer, pipeline
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = WeightOnlyQuantConfig()
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=config)
pipe = pipeline(
"text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10
)
hf = WeightOnlyQuantPipeline(pipeline=pipe)
"""

pipeline: Any #: :meta private:
model_id: str = DEFAULT_MODEL_ID
"""Model name or local path to use."""

model_kwargs: Optional[dict] = None
"""Key word arguments passed to the model."""

pipeline_kwargs: Optional[dict] = None
"""Key word arguments passed to the pipeline."""

class Config:
"""Configuration for this pydantic object."""

extra = Extra.allow

@classmethod
def from_model_id(
cls,
model_id: str,
task: str,
device: int = -1,
model_kwargs: Optional[dict] = None,
pipeline_kwargs: Optional[dict] = None,
load_in_4bit: Optional[bool] = False,
load_in_8bit: Optional[bool] = False,
quantization_config=None,
**kwargs: Any,
) -> LLM:
"""Construct the pipeline object from model_id and task."""
try:
from intel_extension_for_transformers.transformers import (
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
)
from transformers import AutoTokenizer
from transformers import pipeline as hf_pipeline
except ImportError:
raise ValueError(
"Could not import transformers python package. "
"Please install it with `pip install transformers` "
"and `pip install intel-extension-for-transformers`."
)

_model_kwargs = model_kwargs or {}
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)

try:
if task == "text-generation":
model = AutoModelForCausalLM.from_pretrained(
model_id,
load_in_4bit=load_in_4bit,
load_in_8bit=load_in_8bit,
quantization_config=quantization_config,
use_llm_runtime=False,
**_model_kwargs,
)
elif task in ("text2text-generation", "summarization"):
model = AutoModelForSeq2SeqLM.from_pretrained(
model_id,
load_in_4bit=load_in_4bit,
load_in_8bit=load_in_8bit,
quantization_config=quantization_config,
use_llm_runtime=False,
**_model_kwargs,
)
else:
raise ValueError(
f"Got invalid task {task}, "
f"currently only {VALID_TASKS} are supported"
)
except ImportError as e:
raise ValueError(
f"Could not load the {task} model due to missing dependencies."
) from e

if "trust_remote_code" in _model_kwargs:
_model_kwargs = {
k: v for k, v in _model_kwargs.items() if k != "trust_remote_code"
}
_pipeline_kwargs = pipeline_kwargs or {}
pipeline = hf_pipeline(
task=task,
model=model,
tokenizer=tokenizer,
device=device,
model_kwargs=_model_kwargs,
**_pipeline_kwargs,
)
if pipeline.task not in VALID_TASKS:
raise ValueError(
f"Got invalid task {pipeline.task}, "
f"currently only {VALID_TASKS} are supported"
)
return cls(
pipeline=pipeline,
model_id=model_id,
model_kwargs=_model_kwargs,
pipeline_kwargs=_pipeline_kwargs,
**kwargs,
)

@property
def _identifying_params(self) -> Mapping[str, Any]:
"""Get the identifying parameters."""
return {
"model_id": self.model_id,
"model_kwargs": self.model_kwargs,
"pipeline_kwargs": self.pipeline_kwargs,
}

@property
def _llm_type(self) -> str:
"""Return type of llm."""
return "weight_only_quantization"

def _call(
self,
prompt: str,
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
**kwargs: Any,
) -> str:
"""Call the HuggingFace model and return the output.
Args:
prompt: The prompt to use for generation.
stop: A list of strings to stop generation when encountered.
Returns:
The generated text.
Example:
.. code-block:: python
from langchain.llms import WeightOnlyQuantPipeline
llm = WeightOnlyQuantPipeline.from_model_id(model_id="google/flan-t5-large",
task="text2text-generation")
llm("This is a prompt.")
"""
response = self.pipeline(prompt)
if self.pipeline.task == "text-generation":
# Text generation return includes the starter text.
text = response[0]["generated_text"][len(prompt) :]
elif self.pipeline.task == "text2text-generation":
text = response[0]["generated_text"]
elif self.pipeline.task == "summarization":
text = response[0]["summary_text"]
else:
raise ValueError(
f"Got invalid task {self.pipeline.task}, "
f"currently only {VALID_TASKS} are supported"
)
if stop:
# This is a bit hacky, but I can't figure out a better way to enforce
# stop tokens when making calls to huggingface_hub.
text = enforce_stop_tokens(text, stop)
return text
4 changes: 3 additions & 1 deletion libs/langchain/pyproject.toml
@@ -149,6 +149,7 @@ databricks-vectorsearch = {version = "^0.21", optional = true}
couchbase = {version = "^4.1.9", optional = true}
dgml-utils = {version = "^0.3.0", optional = true}
datasets = {version = "^2.15.0", optional = true}
intel-extension-for-transformers = {version = "^1.2.1", optional = true}

[tool.poetry.group.test]
optional = true
@@ -237,7 +238,8 @@ playwright = "^1.28.0"
setuptools = "^67.6.1"

[tool.poetry.extras]
llms = ["clarifai", "cohere", "openai", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers"]
llms = ["clarifai", "cohere", "openai", "openlm", "nlpcloud", "huggingface_hub", "manifest-ml", "torch", "transformers",
"intel-extension-for-transformers"]
qdrant = ["qdrant-client"]
openai = ["openai", "tiktoken"]
text_helpers = ["chardet"]
@@ -0,0 +1,64 @@
"""Test HuggingFace Pipeline wrapper."""

from pathlib import Path

from langchain.llms.loading import load_llm
from langchain.llms.weight_only_quantization import WeightOnlyQuantPipeline
from tests.integration_tests.llms.utils import assert_llm_equality

model_id = "google/flan-t5-large"


def test_weight_only_quantization_with_config() -> None:
"""Test valid call to HuggingFace text2text model."""
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
model_id=model_id, task="text2text-generation", quantization_config=conf
)
output = llm("Say foo:")
assert isinstance(output, str)


def test_weight_only_quantization_4bit() -> None:
"""Test valid call to HuggingFace text2text model."""
llm = WeightOnlyQuantPipeline.from_model_id(
model_id=model_id, task="text2text-generation", load_in_4bit=True
)
output = llm("Say foo:")
assert isinstance(output, str)


def test_weight_only_quantization_8bit() -> None:
"""Test valid call to HuggingFace text2text model."""
llm = WeightOnlyQuantPipeline.from_model_id(
model_id=model_id, task="text2text-generation", load_in_8bit=True
)
output = llm("Say foo:")
assert isinstance(output, str)


def test_init_with_pipeline() -> None:
"""Test initialization with a HF pipeline."""
from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_4bit=True, use_llm_runtime=False)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
llm = WeightOnlyQuantPipeline(pipeline=pipe)
output = llm("Say foo:")
assert isinstance(output, str)


def test_weight_only_pipeline_summarization() -> None:
"""Test valid call to HuggingFace summarization model."""
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

conf = WeightOnlyQuantConfig()
llm = WeightOnlyQuantPipeline.from_model_id(
model_id=model_id, task="summarization", quantization_config=conf
)
output = llm("Say foo:")
assert isinstance(output, str)
