community[minor]: weight only quantization with intel-extension-for-transformers. (#14504)

Support weight only quantization with intel-extension-for-transformers.

[Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on the 4th Gen Intel Xeon Scalable processor (codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html)). The toolkit provides the following key features:

* Seamless user experience of model compression on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)
* Advanced software optimizations and unique compression-aware runtime.
* Optimized Transformer-based model packages.
* [NeuralChat](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of plugins and SOTA optimizations.
* [Inference](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/llm/runtime/graph) of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels.

This PR integrates the weight-only quantization feature of intel-extension-for-transformers.

The unit test is in lib/langchain/tests/integration_tests/llm/test_weight_only_quantization.py
The notebook is in docs/docs/integrations/llms/weight_only_quantization.ipynb.
The document is in docs/docs/integrations/providers/weight_only_quantization.mdx.

---------

Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Showing 6 changed files with 632 additions and 1 deletion.
264 changes: 264 additions & 0 deletions
docs/docs/integrations/llms/weight_only_quantization.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,264 @@ | ||
{ | ||
"cells": [ | ||
{ | |
"cell_type": "markdown", | |
"id": "959300d4", | |
"metadata": {}, | |
"source": [ | |
"# Intel Weight-Only Quantization\n", | |
"## Weight-Only Quantization for Hugging Face Models with Intel Extension for Transformers Pipelines\n", | |
"\n", | |
"Hugging Face models can be run locally with weight-only quantization through the `WeightOnlyQuantPipeline` class.\n", | |
"\n", | ||
"The [Hugging Face Model Hub](https://huggingface.co/models) hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.\n", | ||
"\n", | ||
"These can be called from LangChain through this local pipeline wrapper class." | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"id": "4c1b8450-5eaf-4d34-8341-2d785448a1ff", | |
"metadata": { | |
"tags": [] | |
}, | |
"source": [ | |
"To use this class, install the `transformers` Python [package](https://pypi.org/project/transformers/), along with [PyTorch](https://pytorch.org/get-started/locally/) and [intel-extension-for-transformers](https://github.com/intel/intel-extension-for-transformers)." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d772b637-de00-4663-bd77-9bc96d798db2", | ||
"metadata": { | ||
"tags": [] | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install transformers --quiet\n", | ||
"%pip install intel-extension-for-transformers" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"id": "91ad075f-71d5-4bc8-ab91-cc0ad5ef16bb", | |
"metadata": {}, | |
"source": [ | |
"### Model Loading\n", | |
"\n", | |
"Models can be loaded by passing model parameters to the `from_model_id` method. These parameters include a `WeightOnlyQuantConfig` instance from `intel_extension_for_transformers`." | |
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "165ae236-962a-4763-8052-c4836d78a5d2", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [], | |
"source": [ | |
"from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig\n", | |
"from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline\n", | |
"\n", | |
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n", | |
"hf = WeightOnlyQuantPipeline.from_model_id(\n", | |
"    model_id=\"google/flan-t5-large\",\n", | |
"    task=\"text2text-generation\",\n", | |
"    quantization_config=conf,\n", | |
"    pipeline_kwargs={\"max_new_tokens\": 10},\n", | |
")" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"id": "00104b27-0c15-4a97-b198-4512337ee211", | |
"metadata": {}, | |
"source": [ | |
"Models can also be loaded by passing in an existing `transformers` pipeline directly." | |
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "7f426a4f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from intel_extension_for_transformers.transformers import AutoModelForSeq2SeqLM\n", | |
"from langchain_community.llms.weight_only_quantization import WeightOnlyQuantPipeline\n", | |
"from transformers import AutoTokenizer, pipeline\n", | |
"\n", | |
"# Build a transformers pipeline first, then wrap it for use in LangChain.\n", | |
"model_id = \"google/flan-t5-large\"\n", | |
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n", | |
"model = AutoModelForSeq2SeqLM.from_pretrained(model_id)\n", | |
"pipe = pipeline(\n", | |
"    \"text2text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=10\n", | |
")\n", | |
"hf = WeightOnlyQuantPipeline(pipeline=pipe)" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e4418c20-8fbb-475e-b389-9b27428b8fe1", | ||
"metadata": {}, | ||
"source": [ | ||
"### Create Chain\n", | ||
"\n", | ||
"With the model loaded into memory, you can compose it with a prompt to\n", | ||
"form a chain." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "60c8e151-2999-4d52-9c9c-db99df4f4321", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.prompts import PromptTemplate\n", | ||
"\n", | ||
"template = \"\"\"Question: {question}\n", | ||
"\n", | ||
"Answer: Let's think step by step.\"\"\"\n", | ||
"prompt = PromptTemplate.from_template(template)\n", | ||
"\n", | ||
"chain = prompt | hf\n", | ||
"\n", | ||
"question = \"What is electroencephalography?\"\n", | ||
"\n", | ||
"print(chain.invoke({\"question\": question}))" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"id": "dbbc3a37", | |
"metadata": {}, | |
"source": [ | |
"### CPU Inference\n", | |
"\n", | |
"Currently, intel-extension-for-transformers supports only CPU inference; Intel GPU support is planned. When running on a machine with a CPU, you can specify the `device=\"cpu\"` or `device=-1` parameter to place the model on the CPU.\n", | |
"The `device` parameter defaults to `-1`, which selects CPU inference." | |
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "703c91c8", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n", | |
"llm = WeightOnlyQuantPipeline.from_model_id(\n", | |
"    model_id=\"google/flan-t5-large\",\n", | |
"    task=\"text2text-generation\",\n", | |
"    quantization_config=conf,\n", | |
"    pipeline_kwargs={\"max_new_tokens\": 10},\n", | |
")\n", | ||
"\n", | ||
"template = \"\"\"Question: {question}\n", | ||
"\n", | ||
"Answer: Let's think step by step.\"\"\"\n", | ||
"prompt = PromptTemplate.from_template(template)\n", | ||
"\n", | ||
"chain = prompt | llm\n", | ||
"\n", | ||
"question = \"What is electroencephalography?\"\n", | ||
"\n", | ||
"print(chain.invoke({\"question\": question}))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "59276016", | ||
"metadata": {}, | ||
"source": [ | ||
"### Batch CPU Inference\n", | ||
"\n", | ||
"You can also run inference on the CPU in batch mode." | ||
] | ||
}, | ||
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "097ba62f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"conf = WeightOnlyQuantConfig(weight_dtype=\"nf4\")\n", | |
"llm = WeightOnlyQuantPipeline.from_model_id(\n", | |
"    model_id=\"google/flan-t5-large\",\n", | |
"    task=\"text2text-generation\",\n", | |
"    quantization_config=conf,\n", | |
"    pipeline_kwargs={\"max_new_tokens\": 10},\n", | |
")\n", | |
"\n", | |
"chain = prompt | llm.bind(stop=[\"\\n\\n\"])\n", | |
"\n", | |
"# One input dict per question; chain.batch runs them together.\n", | |
"questions = [{\"question\": f\"What is the number {i} in French?\"} for i in range(4)]\n", | |
"\n", | |
"answers = chain.batch(questions)\n", | |
"for answer in answers:\n", | |
"    print(answer)" | |
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"id": "9cc4c225-53d7-4003-a0a6-eefb1a7ededc", | |
"metadata": {}, | |
"source": [ | |
"### Data Types Supported by Intel-extension-for-transformers\n", | |
"\n", | |
"Weights can be quantized to the following data types for storage (`weight_dtype` in `WeightOnlyQuantConfig`):\n", | |
"\n", | |
"* **int8**: Uses 8-bit data type.\n", | |
"* **int4_fullrange**: Uses the full int4 range [-8, 7], including the -8 value that the normal int4 range [-7, 7] leaves out.\n", | |
"* **int4_clip**: Clips and retains the values within the int4 range, setting others to zero.\n", | |
"* **nf4**: Uses the normalized float 4-bit data type.\n", | |
"* **fp4_e2m1**: Uses regular float 4-bit data type. \"e2\" means that 2 bits are used for the exponent, and \"m1\" means that 1 bit is used for the mantissa.\n", | |
"\n", | |
"While these techniques store weights in 4 or 8 bits, the computation still happens in float32, bfloat16 or int8 (`compute_dtype` in `WeightOnlyQuantConfig`):\n", | |
"* **fp32**: Uses the float32 data type to compute.\n", | |
"* **bf16**: Uses the bfloat16 data type to compute.\n", | |
"* **int8**: Uses 8-bit data type to compute.\n", | |
"\n", | |
"### Supported Algorithms Matrix\n", | |
"\n", | |
"Quantization algorithms supported in intel-extension-for-transformers (`algorithm` in `WeightOnlyQuantConfig`):\n", | |
"\n", | |
"| Algorithms | PyTorch | LLM Runtime |\n", | |
"|:--------------:|:----------:|:----------:|\n", | |
"| RTN | ✔ | ✔ |\n", | |
"| AWQ | ✔ | stay tuned |\n", | |
"| TEQ | ✔ | stay tuned |\n", | |
"\n", | |
"> **RTN:** Round-to-nearest, the most intuitive quantization method. It does not require additional datasets and is very fast. Generally speaking, RTN converts the weights into a uniformly distributed integer data type, but some algorithms, such as QLoRA, propose a non-uniform NF4 data type and prove its theoretical optimality.\n", | |
"\n", | |
"> **AWQ:** Shows that protecting only 1% of salient weights can greatly reduce quantization error. The salient weight channels are selected by observing the per-channel distribution of activations and weights. To preserve them, the salient weights are multiplied by a large scale factor before they are quantized.\n", | |
"\n", | |
"> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.\n" | |
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.1" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
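The RTN description in the notebook's final cell can be made concrete with a small, library-free sketch. This is an illustration only and not the kernel used by intel-extension-for-transformers: the function names `rtn_quantize` and `rtn_dequantize`, the single scale per weight group, and the symmetric [-7, 7] range are all assumptions chosen for clarity.

```python
def rtn_quantize(weights, qmax=7):
    """Round-to-nearest symmetric quantization of one group of weights.

    Hypothetical helper for illustration. Returns (codes, scale), where
    every code is an integer in [-qmax, qmax] and scale maps codes back
    to approximate floating-point weights.
    """
    max_abs = max(abs(w) for w in weights)
    # An all-zero group needs no scaling; this also avoids dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each scaled weight to the nearest integer, then clamp to range.
    codes = [min(qmax, max(-qmax, round(w / scale))) for w in weights]
    return codes, scale


def rtn_dequantize(codes, scale):
    """Recover approximate weights from integer codes and their scale."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, -0.7, 0.0]
codes, scale = rtn_quantize(weights)
restored = rtn_dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", restored)
```

No calibration data is involved, which is why RTN is fast: each weight is mapped independently, and the reconstruction error of any in-range weight is at most half of `scale`.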