Add support for Q5_0, Q5_1 and Q8_0 formats; remove Q4_1_O format (#44)
* Remove Q4_3 support

* Add Q5_0, Q5_1, Q8_0 support

* Add a clearer message when loading a Q4_3 model

* Remove Q4_1_O format

* Fix indentation in .gitmodules

* Simplify sanitizer matrix
saharNooby committed Apr 29, 2023
1 parent c736ef5 commit 1198892
Showing 14 changed files with 233 additions and 425 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/build.yml
@@ -25,7 +25,6 @@ jobs:
      matrix:
        sanitizer: [ADDRESS, THREAD, UNDEFINED]
        build_type: [Debug, Release]
-       accelerate: [ON, OFF]

    steps:
      - name: Clone
@@ -45,7 +44,7 @@ jobs:
        run: |
          mkdir build
          cd build
-         cmake .. -DRWKV_SANITIZE_${{ matrix.sanitizer }}=ON -DGGML_SANITIZE_${{ matrix.sanitizer }}=ON -DRWKV_ACCELERATE=${{ matrix.accelerate }} -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
+         cmake .. -DRWKV_SANITIZE_${{ matrix.sanitizer }}=ON -DGGML_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
          cmake --build . --config ${{ matrix.build_type }}
      - name: Test
1 change: 1 addition & 0 deletions .gitmodules
@@ -1,3 +1,4 @@
[submodule "ggml"]
	path = ggml
	url = https://github.com/saharNooby/ggml
+	branch = master-2023-04-29
53 changes: 53 additions & 0 deletions FILE_FORMAT.md
@@ -0,0 +1,53 @@
# rwkv.cpp file format

This format is used by `rwkv.cpp` to store RWKV model checkpoints.

Preferred file extension: `.bin`

Specification in C-like pseudocode:

```
RWKVModelFile {
  // All ints and floats are in machine byte order.
  // Magic is "ggmf" string bytes (int32 0x67676d66).
  int32 magic = 0x67676d66;
  int32 version = 100;
  int32 n_vocab;
  int32 n_embed;
  int32 n_layer;
  // Data type of most of the parameters. See "Data types" below for possible values.
  int32 data_type;
  // Read until EOF.
  Parameter[] parameters;
}

Parameter {
  int32 dim_count;
  int32 key_length;
  // Data type of the parameter. See "Data types" below for possible values.
  int32 data_type;
  // Compared to PyTorch's parameter.shape, the dimension order is reversed here!
  int32[dim_count] shape;
  // Keys are like "emb.weight", "blocks.0.ln1.weight".
  uint8[key_length] key_utf8;
  // Length of the data array depends on the parameter data type:
  // - FP32: 4 * element_count
  // - FP16: 2 * element_count
  // - QX_Y (quantized): element_count / QKX_Y * sizeof(block_qx_y)
  // See ggml.c for the QK values and block sizes of specific formats.
  byte[] data;
}
```
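For example, a 1024 × 1024 `FP16` parameter is followed by 2 * 1048576 = 2097152 bytes of data. For a hypothetical quantized format with QK = 32 elements per block and a 20-byte block, the same parameter would be followed by 1048576 / 32 * 20 = 655360 bytes; the actual QK values and block sizes are defined in ggml.c.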

## Data types

- 0: `FP32`
- 1: `FP16`
- 2: `Q4_0`
- 3: `Q4_1`
- 4: *unused* (was `Q4_1_O`, removed)
- 5: `Q4_2`
- 6: *unused* (was `Q4_3`, removed)
- 7: `Q5_0`
- 8: `Q5_1`
- 9: `Q8_0`
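
To make the layout concrete, here is a minimal Python sketch that reads the header and parameter metadata described above. It assumes a little-endian machine and handles only `FP32`/`FP16` data (skipping quantized blobs would additionally need the block sizes from ggml.c); the function and variable names are illustrative, not part of `rwkv.cpp`:

```python
import struct

# Bytes per element for the non-quantized data types (ids from "Data types").
ELEMENT_SIZE = {0: 4, 1: 2}  # 0: FP32, 1: FP16

def read_rwkv_metadata(path):
    # Reads the RWKVModelFile header and per-parameter metadata,
    # seeking past each parameter's data blob.
    with open(path, "rb") as f:
        magic, version, n_vocab, n_embed, n_layer, data_type = \
            struct.unpack("<6i", f.read(24))
        assert magic == 0x67676d66, "not an rwkv.cpp model file"
        assert version == 100, "unsupported file version"

        parameters = []
        while True:
            prefix = f.read(12)
            if len(prefix) < 12:
                break  # EOF: all parameters have been read
            dim_count, key_length, p_type = struct.unpack("<3i", prefix)
            shape = struct.unpack("<%di" % dim_count, f.read(4 * dim_count))
            key = f.read(key_length).decode("utf-8")
            parameters.append((key, p_type, shape))

            element_count = 1
            for dim in shape:
                element_count *= dim
            # Relative seek past the data; raises KeyError for quantized types.
            f.seek(ELEMENT_SIZE[p_type] * element_count, 1)

    return n_vocab, n_embed, n_layer, data_type, parameters
```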
48 changes: 25 additions & 23 deletions README.md
@@ -2,18 +2,30 @@

This is a port of [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) to [ggerganov/ggml](https://github.com/ggerganov/ggml).

-Besides the usual **FP32**, it supports **FP16** and **quantized INT4** inference on CPU. This project is **CPU only**.
-
-RWKV is a novel large language model architecture, [with the largest model in the family having 14B parameters](https://huggingface.co/BlinkDL/rwkv-4-pile-14b). In contrast to Transformers with `O(n^2)` attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.
+Besides the usual **FP32**, it supports **FP16**, **quantized INT4** and **quantized INT8** inference. This project is **CPU only**.

This project provides [a C library rwkv.h](rwkv.h) and [a convenient Python wrapper](rwkv%2Frwkv_cpp_model.py) for it.

+RWKV is a novel large language model architecture, [with the largest model in the family having 14B parameters](https://huggingface.co/BlinkDL/rwkv-4-pile-14b). In contrast to Transformers with `O(n^2)` attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.

Loading LoRA checkpoints in [Blealtan's format](https://github.com/Blealtan/RWKV-LM-LoRA) is supported through [merge_lora_into_ggml.py script](rwkv%2Fmerge_lora_into_ggml.py).

-**TODO (contributions welcome!)**:
-
-1. Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`)
-2. Make required memory calculation more robust (see [#4](https://github.com/saharNooby/rwkv.cpp/issues/4))
+### Quality and performance
+
+If you use `rwkv.cpp` for anything serious, please [test all available formats for perplexity and latency](rwkv%2Fmeasure_pexplexity.py) on a representative dataset, and decide which trade-off is best for you.
+
+The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads. The best value in each column is in **bold**; the second best is in *italic*.

+| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
+|-----------|-------------------|--------------------|----------------------|
+| `Q4_0` | 17.507 | *76* | **1.53** |
+| `Q4_1` | 17.187 | **72** | 1.68 |
+| `Q4_2` | 17.060 | 85 | **1.53** |
+| `Q5_0` | 16.194 | 78 | *1.60* |
+| `Q5_1` | 15.851 | 81 | 1.68 |
+| `Q8_0` | *15.652* | 89 | 2.13 |
+| `FP16` | **15.623** | 117 | 2.82 |
+| `FP32` | **15.623** | 198 | 5.64 |

## How to use

@@ -77,26 +89,16 @@ python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-802

#### 3.1. Optionally, quantize the model

-To convert the model into INT4 quantized format, run:
+To convert the model into one of the quantized formats from the table above, run:

```commandline
# Windows
-python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1_O.bin 4
+python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_2.bin Q4_2
# Linux / MacOS
-python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin 4
+python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_2.bin Q4_2
```

-Formats available:
-
-- `6`: `Q4_3`, OK quality, fast.
-- `5`: `Q4_2`, poor quality, fast.
-- `4`: `Q4_1_O`, best quality, slow (20% slower than `FP16`).
-- `3`: `Q4_1`, poor quality, very fast.
-- `2`: `Q4_0`, worst quality, very fast.
-
-If you use `rwkv.cpp` for anything serious (just having fun is serious enough!), please [test all available formats for perplexity and latency](rwkv%2Fmeasure_pexplexity.py) on a representative dataset, and decide which trade-off is best for you.

### 4. Run the model

**Requirements**: Python 3.x with [PyTorch](https://pytorch.org/get-started/locally/) and [tokenizers](https://pypi.org/project/tokenizers/).
@@ -107,20 +109,20 @@ To generate some text, run:

```commandline
# Windows
-python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1_O.bin
+python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_2.bin
# Linux / MacOS
-python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
+python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin
```

To chat with a bot, run:

```commandline
# Windows
-python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1_O.bin
+python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_2.bin
# Linux / MacOS
-python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
+python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin
```

Edit [generate_completions.py](rwkv%2Fgenerate_completions.py) or [chat_with_bot.py](rwkv%2Fchat_with_bot.py) to change prompts and sampling settings.
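
Beyond the bundled scripts, the Python wrapper can drive inference directly. Below is a minimal sketch assuming the eval-style interface the sample scripts use (a model object that takes a token and the previous state and returns logits plus the new state); treat the exact class and method names as assumptions and consult [rwkv_cpp_model.py](rwkv%2Frwkv_cpp_model.py) for the real API:

```python
# A minimal sketch, assuming the eval-style wrapper API used by the sample
# scripts; check rwkv/rwkv_cpp_model.py for the actual class and method names.
import rwkv_cpp_model
import rwkv_cpp_shared_library

library = rwkv_cpp_shared_library.load_rwkv_shared_library()
model = rwkv_cpp_model.RWKVModel(library, 'rwkv.cpp-169M-Q4_2.bin')

state = None  # None starts from an empty state
for token in [123, 456]:  # token ids produced by your tokenizer
    logits, state = model.eval(token, state)

print(logits)  # next-token logits; sample from these to generate text
```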
2 changes: 1 addition & 1 deletion ggml
Submodule ggml updated from bfa8d5 to a0687a
