Support weight only quantization with intel-extension-for-transformers. #14504

Merged

Changes from all commits (53 commits)
ad38444
Support weight only quantization with intel-extension-for-transformers
PenghuiCheng Dec 11, 2023
c769358
merge code from master branch
PenghuiCheng Jan 30, 2024
203ff0d
Update document
PenghuiCheng Feb 1, 2024
4fe596c
Support weight only quantization with intel-extension-for-transformers
PenghuiCheng Dec 11, 2023
156cfe9
Update document
PenghuiCheng Feb 1, 2024
1c81b14
format code style
PenghuiCheng Feb 20, 2024
0aa3c8b
merge branch
PenghuiCheng Feb 20, 2024
cfb932a
Format code style
PenghuiCheng Feb 20, 2024
8af4ba1
Update code
PenghuiCheng Feb 21, 2024
e2f1559
format code style
PenghuiCheng Feb 21, 2024
71133b4
move weight_only_quantization.mdx to intel.mdx
PenghuiCheng Feb 21, 2024
50fda10
Update code
PenghuiCheng Feb 21, 2024
814abd8
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Feb 23, 2024
5f24db3
Fixed UT error
PenghuiCheng Feb 26, 2024
5f97bd5
Merge from master branch
PenghuiCheng Feb 26, 2024
e990b5f
update code
PenghuiCheng Feb 26, 2024
35f1829
Update code
PenghuiCheng Feb 26, 2024
587a55a
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Feb 26, 2024
8bdcc79
merge from master branch
PenghuiCheng Feb 29, 2024
da1a6e4
Merge from master branch
PenghuiCheng Mar 4, 2024
94482ac
Update code
PenghuiCheng Mar 4, 2024
8abc44e
Update code
PenghuiCheng Mar 4, 2024
5856472
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Mar 5, 2024
39a759a
Update poetry.lock
PenghuiCheng Mar 6, 2024
4a239ca
Merge from master branch
PenghuiCheng Mar 6, 2024
00f433f
Fixed pylint error
PenghuiCheng Mar 7, 2024
b8830b8
Update poetry file
PenghuiCheng Mar 8, 2024
d1ae253
Merge branch 'master' into penghuic/itrex_weight_only
PenghuiCheng Mar 8, 2024
5e777ee
Fixed pylint error
PenghuiCheng Mar 11, 2024
461efb6
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Mar 11, 2024
a45d207
Merge branch 'master' into penghuic/itrex_weight_only
baskaryan Mar 12, 2024
89f611a
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Mar 12, 2024
e912835
Update poetry lock
PenghuiCheng Mar 13, 2024
93b35e8
Merge from master branch
PenghuiCheng Mar 13, 2024
34ad951
Merge remote-tracking branch 'upstream/master' into penghuic/itrex_we…
PenghuiCheng Mar 17, 2024
e4655b3
Update code
PenghuiCheng Mar 17, 2024
279cc63
Merge from master branch
PenghuiCheng Mar 25, 2024
80c6793
poetry
baskaryan Mar 27, 2024
a5846ef
Merge from master branch
PenghuiCheng Mar 28, 2024
2402219
poetry
PenghuiCheng Mar 28, 2024
fa9c724
Merge branch 'master' into penghuic/itrex_weight_only
PenghuiCheng Mar 28, 2024
fd831b3
Merge branch 'master' into penghuic/itrex_weight_only
PenghuiCheng Mar 28, 2024
abe6b02
Merge branch 'master' into penghuic/itrex_weight_only
baskaryan Mar 28, 2024
a1c710b
fmt
baskaryan Mar 28, 2024
d3329c7
fmt
baskaryan Mar 28, 2024
fcc6b16
Merge branch 'master' into penghuic/itrex_weight_only
baskaryan Mar 29, 2024
a26ac3b
Merge branch 'master' into penghuic/itrex_weight_only
PenghuiCheng Mar 30, 2024
614ee14
Update poetry file
PenghuiCheng Apr 3, 2024
1139982
merge from master branch
PenghuiCheng Apr 3, 2024
9e4f87f
Update poetry file
PenghuiCheng Apr 3, 2024
f2ccc46
Merge branch 'master' into penghuic/itrex_weight_only
baskaryan Apr 3, 2024
a9ab6f0
fmt
baskaryan Apr 3, 2024
37b3d33
fmt
baskaryan Apr 3, 2024
264 changes: 264 additions & 0 deletions docs/docs/integrations/llms/weight_only_quantization.ipynb
@@ -0,0 +1,264 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "959300d4",
"metadata": {},
"source": [
"# Intel Weight-Only Quantization\n",
"## Weight-Only Quantization for Huggingface Models with Intel Extension for Transformers Pipelines\n",
"\n",
"Hugging Face models can be run locally with Weight-Only quantization through the `WeightOnlyQuantPipeline` class.\n",
"\n",
"The [Hugging Face Model Hub](https://huggingface.co/models) hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.\n",
"\n",
"These can be called from LangChain through this local pipeline wrapper class."
]
},
{
"cell_type": "markdown",
"id": "4c1b8450-5eaf-4d34-8341-2d785448a1ff",
"metadata": {
"tags": []
},
"source": [
"To use, you should have the ``transformers`` python [package installed](https://pypi.org/project/transformers/), as well as [pytorch](https://pytorch.org/get-started/locally/), [intel-extension-for-transformers](https://github.com/intel/intel-extension-for-transformers)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d772b637-de00-4663-bd77-9bc96d798db2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%pip install transformers --quiet\n",
"%pip install intel-extension-for-transformers"
]
},
{
"cell_type": "markdown",
"id": "91ad075f-71d5-4bc8-ab91-cc0ad5ef16bb",
"metadata": {},
"source": [
"### Model Loading\n",
"\n",
"Models can be loaded by specifying the model parameters using the `from_model_id` method. The model parameters include `WeightOnlyQuantConfig` class in intel_extension_for_transformers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "165ae236-962a-4763-8052-c4836d78a5d2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig\n",
"from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline\n",
"\n",
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n",
"hf = WeightOnlyQuantPipeline.from_model_id(\n",
" model_id=\"google/flan-t5-large\",\n",
" task=\"text2text-generation\",\n",
" quantization_config=conf,\n",
" pipeline_kwargs={\"max_new_tokens\": 10},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "00104b27-0c15-4a97-b198-4512337ee211",
"metadata": {},
"source": [
"They can also be loaded by passing in an existing `transformers` pipeline directly"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f426a4f",
"metadata": {},
"outputs": [],
"source": [
"from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM\n",
"from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\n",
"from transformers import AutoTokenizer, pipeline\n",
"\n",
"model_id = \"google/flan-t5-large\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_id)\n",
"pipe = pipeline(\n",
" \"text2text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=10\n",
")\n",
"hf = WeightOnlyQuantPipeline(pipeline=pipe)"
]
},
{
"cell_type": "markdown",
"id": "e4418c20-8fbb-475e-b389-9b27428b8fe1",
"metadata": {},
"source": [
"### Create Chain\n",
"\n",
"With the model loaded into memory, you can compose it with a prompt to\n",
"form a chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60c8e151-2999-4d52-9c9c-db99df4f4321",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"chain = prompt | hf\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(chain.invoke({\"question\": question}))"
]
},
{
"cell_type": "markdown",
"id": "dbbc3a37",
"metadata": {},
"source": [
"### CPU Inference\n",
"\n",
"Now intel-extension-for-transformers only support CPU device inference. Will support intel GPU soon.When running on a machine with CPU, you can specify the `device=\"cpu\"` or `device=-1` parameter to put the model on CPU device.\n",
"Defaults to `-1` for CPU inference."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "703c91c8",
"metadata": {},
"outputs": [],
"source": [
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n",
"llm = WeightOnlyQuantPipeline.from_model_id(\n",
" model_id=\"google/flan-t5-large\",\n",
" task=\"text2text-generation\",\n",
" quantization_config=conf,\n",
" pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"chain = prompt | llm\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(chain.invoke({\"question\": question}))"
]
},
{
"cell_type": "markdown",
"id": "59276016",
"metadata": {},
"source": [
"### Batch CPU Inference\n",
"\n",
"You can also run inference on the CPU in batch mode."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "097ba62f",
"metadata": {},
"outputs": [],
"source": [
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n",
"llm = WeightOnlyQuantPipeline.from_model_id(\n",
" model_id=\"google/flan-t5-large\",\n",
" task=\"text2text-generation\",\n",
" quantization_config=conf,\n",
" pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"chain = prompt | llm.bind(stop=[\"\\n\\n\"])\n",
"\n",
"questions = []\n",
"for i in range(4):\n",
" questions.append({\"question\": f\"What is the number {i} in french?\"})\n",
"\n",
"answers = chain.batch(questions)\n",
"for answer in answers:\n",
" print(answer)"
]
},
{
"cell_type": "markdown",
"id": "9cc4c225-53d7-4003-a0a6-eefb1a7ededc",
"metadata": {},
"source": [
"### Data Types Supported by Intel-extension-for-transformers\n",
"\n",
"We support quantize the weights to following data types for storing(weight_dtype in WeightOnlyQuantConfig):\n",
"\n",
"* **int8**: Uses 8-bit data type.\n",
"* **int4_fullrange**: Uses the -8 value of int4 range compared with the normal int4 range [-7,7].\n",
"* **int4_clip**: Clips and retains the values within the int4 range, setting others to zero.\n",
"* **nf4**: Uses the normalized float 4-bit data type.\n",
"* **fp4_e2m1**: Uses regular float 4-bit data type. \"e2\" means that 2 bits are used for the exponent, and \"m1\" means that 1 bits are used for the mantissa.\n",
"\n",
"While these techniques store weights in 4 or 8 bit, the computation still happens in float32, bfloat16 or int8(compute_dtype in WeightOnlyQuantConfig):\n",
"* **fp32**: Uses the float32 data type to compute.\n",
"* **bf16**: Uses the bfloat16 data type to compute.\n",
"* **int8**: Uses 8-bit data type to compute.\n",
"\n",
"### Supported Algorithms Matrix\n",
"\n",
"Quantization algorithms supported in intel-extension-for-transformers(algorithm in WeightOnlyQuantConfig):\n",
"\n",
"| Algorithms | PyTorch | LLM Runtime |\n",
"|:--------------:|:----------:|:----------:|\n",
"| RTN | ✔ | ✔ |\n",
"| AWQ | ✔ | stay tuned |\n",
"| TEQ | ✔ | stay tuned |\n",
"> **RTN:** A quantification method that we can think of very intuitively. It does not require additional datasets and is a very fast quantization method. Generally speaking, RTN will convert the weight into a uniformly distributed integer data type, but some algorithms, such as Qlora, propose a non-uniform NF4 data type and prove its theoretical optimality.\n",
"\n",
"> **AWQ:** Proved that protecting only 1% of salient weights can greatly reduce quantization error. the salient weight channels are selected by observing the distribution of activation and weight per channel. The salient weights are also quantized after multiplying a big scale factor before quantization for preserving.\n",
"\n",
"> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
50 changes: 49 additions & 1 deletion docs/docs/integrations/providers/intel.mdx
@@ -34,6 +34,13 @@ from langchain_community.embeddings import QuantizedBiEncoderEmbeddings
```

## Intel® Extension for Transformers (ITREX)
Intel® Extension for Transformers (ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is especially effective on 4th-generation Intel Xeon Scalable processors (codenamed Sapphire Rapids).

In deep neural networks, the model parameters, also known as weights, are typically represented as floating-point numbers, which can consume significant memory and require intensive computational resources. Quantization reduces the precision of these weights by representing them with a smaller number of bits. Weight-only quantization quantizes just the weights of the network while keeping other components, such as activations, in their original precision.

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that meet the computational demands of these modern architectures while maintaining accuracy. Compared to [normal quantization](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/quantization.md) such as W8A8, weight-only quantization is usually a better trade-off between performance and accuracy: the bottleneck of deploying LLMs is memory bandwidth, as the sketch below illustrates, and weight-only quantization typically preserves accuracy better.
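
As a back-of-envelope illustration of the memory-bandwidth argument (a generic sketch, not ITREX-specific code), compare the weight memory a 7B-parameter model must stream through the memory bus at different storage precisions:

```python
# Approximate weight-memory footprint of a 7B-parameter model.
# Illustrative only: ignores the small scale/zero-point overhead of quantization.
params = 7e9
for name, bytes_per_weight in [("fp32", 4), ("int8", 1), ("int4/nf4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Since autoregressive decoding reads essentially all weights for every generated token, a 4-bit model moves roughly 8x less data than an fp32 one, which is why weight-only quantization is effective for memory-bound LLM inference.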

Here, we introduce embedding models and weight-only quantization for Transformer-based large language models with ITREX.

All functionality related to the [intel-extension-for-transformers](https://github.com/intel/intel-extension-for-transformers).

@@ -44,7 +51,6 @@ Install intel-extension-for-transformers. For system requirements and other inst
```bash
pip install intel-extension-for-transformers
```

Install other required packages.

```bash
@@ -58,3 +64,45 @@ See a [usage example](/docs/integrations/text_embedding/itrex).
```python
from langchain_community.embeddings import QuantizedBgeEmbeddings
```

### Weight-Only Quantization with ITREX

See a [usage example](/docs/integrations/llms/weight_only_quantization).
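
For example (a minimal sketch mirroring the notebook above):

```python
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline

# Quantize the model weights to nf4 at load time and wrap the pipeline for LangChain.
conf = WeightOnlyQuantConfig(weight_dtype="nf4")
llm = WeightOnlyQuantPipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    quantization_config=conf,
    pipeline_kwargs={"max_new_tokens": 10},
)
```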

### Configuration Parameters

Here are the details of the `WeightOnlyQuantConfig` class.

#### weight_dtype (string): Weight Data Type, default is "nf4".
The weights can be quantized to the following data types for storage (`weight_dtype` in `WeightOnlyQuantConfig`):
* **int8**: Uses 8-bit data type.
* **int4_fullrange**: Uses the full int4 range, including the -8 value, unlike the normal int4 range of [-7,7].
* **int4_clip**: Clips and retains the values within the int4 range, setting others to zero.
* **nf4**: Uses the normalized float 4-bit data type.
* **fp4_e2m1**: Uses the regular float 4-bit data type. "e2" means that 2 bits are used for the exponent, and "m1" means that 1 bit is used for the mantissa.

#### compute_dtype (string): Computing Data Type, default is "fp32".
While these techniques store weights in 4 or 8 bits, the computation still happens in float32, bfloat16, or int8 (`compute_dtype` in `WeightOnlyQuantConfig`):
* **fp32**: Uses the float32 data type to compute.
* **bf16**: Uses the bfloat16 data type to compute.
* **int8**: Uses 8-bit data type to compute.

#### llm_int8_skip_modules (list of module names): Modules to Skip Quantization, default is None.
A list of modules to be skipped during quantization.

#### scale_dtype (string): The Scale Data Type, default is "fp32".
Currently only "fp32" (float32) is supported.

#### mse_range (boolean): Whether to Search for the Best Clip Range, default is False.
If True, searches the clip range [0.805, 1.0] in steps of 0.005 for the best value.

#### use_double_quant (boolean): Whether to Quantize the Scale, default is False.
Not supported yet.

#### double_quant_dtype (string): Reserved for double quantization.

#### double_quant_scale_dtype (string): Reserved for double quantization.

#### group_size (int): Group Size for Quantization.

#### scheme (string): The Format Weights Are Quantized To, default is "sym".
* **sym**: Symmetric.
* **asym**: Asymmetric.

#### algorithm (string): The Algorithm Used to Improve Accuracy, default is "RTN".
* **RTN**: Round-to-nearest, an intuitive quantization method that requires no additional datasets and is very fast.
* **AWQ**: Activation-aware weight quantization, which showed that protecting only 1% of salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distributions of activations and weights, and the salient weights are multiplied by a large scale factor before quantization to preserve them.
* **TEQ**: A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization.
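
Putting these parameters together, a fuller configuration might look like the sketch below (parameter names are as documented above; the specific values are illustrative, and supported combinations may vary by ITREX version):

```python
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

conf = WeightOnlyQuantConfig(
    weight_dtype="int4_clip",  # 4-bit storage, clipped to the int4 range
    compute_dtype="bf16",      # compute in bfloat16
    group_size=32,             # illustrative per-group quantization granularity
    scheme="asym",             # asymmetric quantization
    algorithm="AWQ",           # protect salient weights to recover accuracy
)
```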
12 changes: 12 additions & 0 deletions libs/community/langchain_community/llms/__init__.py
@@ -590,6 +590,14 @@ def _import_watsonxllm() -> Type[BaseLLM]:
return WatsonxLLM


def _import_weight_only_quantization() -> Any:
from langchain_community.llms.weight_only_quantization import (
WeightOnlyQuantPipeline,
)

return WeightOnlyQuantPipeline


def _import_writer() -> Type[BaseLLM]:
from langchain_community.llms.writer import Writer

@@ -805,6 +813,8 @@ def __getattr__(name: str) -> Any:
return _import_vllm_openai()
elif name == "WatsonxLLM":
return _import_watsonxllm()
elif name == "WeightOnlyQuantPipeline":
return _import_weight_only_quantization()
elif name == "Writer":
return _import_writer()
elif name == "Xinference":
@@ -918,6 +928,7 @@ def __getattr__(name: str) -> Any:
"VertexAIModelGarden",
"VolcEngineMaasLLM",
"WatsonxLLM",
"WeightOnlyQuantPipeline",
"Writer",
"Xinference",
"YandexGPT",
@@ -1007,6 +1018,7 @@ def get_type_to_cls_dict() -> Dict[str, Callable[[], Type[BaseLLM]]]:
"vllm": _import_vllm,
"vllm_openai": _import_vllm_openai,
"watsonxllm": _import_watsonxllm,
"weight_only_quantization": _import_weight_only_quantization,
"writer": _import_writer,
"xinference": _import_xinference,
"javelin-ai-gateway": _import_javelin_ai_gateway,
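
With these registrations in place, the new class is lazily importable from the top-level package; the `__getattr__` hook above resolves the name on first access:

```python
# Triggers _import_weight_only_quantization() via the __getattr__ hook.
from langchain_community.llms import WeightOnlyQuantPipeline
```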