Add support for Q5_0, Q5_1 and Q8_0 formats; remove Q4_1_O format (#44)
* Remove Q4_3 support

* Add Q5_0, Q5_1, Q8_0 support

* Add a clearer message when loading a Q4_3 model

* Remove Q4_1_O format

* Fix indentation in .gitmodules

* Simplify sanitizer matrix
saharNooby committed Apr 29, 2023
1 parent c736ef5 commit 1198892
Showing 14 changed files with 233 additions and 425 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/build.yml
@@ -25,7 +25,6 @@ jobs:
      matrix:
        sanitizer: [ADDRESS, THREAD, UNDEFINED]
        build_type: [Debug, Release]
-       accelerate: [ON, OFF]

    steps:
      - name: Clone
@@ -45,7 +44,7 @@ jobs:
        run: |
          mkdir build
          cd build
-         cmake .. -DRWKV_SANITIZE_${{ matrix.sanitizer }}=ON -DGGML_SANITIZE_${{ matrix.sanitizer }}=ON -DRWKV_ACCELERATE=${{ matrix.accelerate }} -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
+         cmake .. -DRWKV_SANITIZE_${{ matrix.sanitizer }}=ON -DGGML_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
          cmake --build . --config ${{ matrix.build_type }}
      - name: Test
1 change: 1 addition & 0 deletions .gitmodules
@@ -1,3 +1,4 @@
[submodule "ggml"]
	path = ggml
	url = https://github.com/saharNooby/ggml
+	branch = master-2023-04-29
53 changes: 53 additions & 0 deletions FILE_FORMAT.md
@@ -0,0 +1,53 @@
# rwkv.cpp file format

This format is used by `rwkv.cpp` to store RWKV model checkpoints.

Preferred file extension: `.bin`

Specification in C-like pseudocode:

```
RWKVModelFile {
  // All ints and floats are in machine byte order.
  // Magic is "ggmf" string bytes (int32 0x67676d66).
  int32 magic = 0x67676d66;
  int32 version = 100;
  int32 n_vocab;
  int32 n_embed;
  int32 n_layer;
  // Data type of most of the parameters. See "Data types" below for possible values.
  int32 data_type;
  // Read until EOF.
  Parameter[] parameters;
}

Parameter {
  int32 dim_count;
  int32 key_length;
  // Data type of the parameter. See "Data types" below for possible values.
  int32 data_type;
  // Compared to PyTorch's parameter.shape, the dimension order is reversed here!
  int32[dim_count] shape;
  // Keys are like "emb.weight", "blocks.0.ln1.weight".
  uint8[key_length] key_utf8;
  // Length of the data array depends on the parameter data type:
  // - FP32: 4 * element_count
  // - FP16: 2 * element_count
  // - QX_Y (quantized): element_count / QKX_Y * sizeof(block_qx_y)
  // See ggml.c for the QK values and block sizes of specific formats.
  byte[] data;
}
```
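For example, a 1024 × 1024 `FP16` parameter is followed by 2 * 1048576 = 2097152 bytes of data. For a hypothetical quantized format with QK = 32 elements per block and a 20-byte block, the same parameter would be followed by 1048576 / 32 * 20 = 655360 bytes; the actual QK values and block sizes are defined in ggml.c.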

## Data types

- 0: `FP32`
- 1: `FP16`
- 2: `Q4_0`
- 3: `Q4_1`
- 4: *unused* (was `Q4_1_O`, removed)
- 5: `Q4_2`
- 6: *unused* (was `Q4_3`, removed)
- 7: `Q5_0`
- 8: `Q5_1`
- 9: `Q8_0`
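
To make the layout concrete, here is a minimal Python sketch that reads the header and parameter metadata described above. It assumes a little-endian machine and handles only `FP32`/`FP16` data (skipping quantized blobs would additionally need the block sizes from ggml.c); the function and variable names are illustrative, not part of `rwkv.cpp`:

```python
import struct

# Bytes per element for the non-quantized data types (ids from "Data types").
ELEMENT_SIZE = {0: 4, 1: 2}  # 0: FP32, 1: FP16

def read_rwkv_metadata(path):
    # Reads the RWKVModelFile header and per-parameter metadata,
    # seeking past each parameter's data blob.
    with open(path, "rb") as f:
        magic, version, n_vocab, n_embed, n_layer, data_type = \
            struct.unpack("<6i", f.read(24))
        assert magic == 0x67676d66, "not an rwkv.cpp model file"
        assert version == 100, "unsupported file version"

        parameters = []
        while True:
            prefix = f.read(12)
            if len(prefix) < 12:
                break  # EOF: all parameters have been read
            dim_count, key_length, p_type = struct.unpack("<3i", prefix)
            shape = struct.unpack("<%di" % dim_count, f.read(4 * dim_count))
            key = f.read(key_length).decode("utf-8")
            parameters.append((key, p_type, shape))

            element_count = 1
            for dim in shape:
                element_count *= dim
            # Relative seek past the data; raises KeyError for quantized types.
            f.seek(ELEMENT_SIZE[p_type] * element_count, 1)

    return n_vocab, n_embed, n_layer, data_type, parameters
```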
48 changes: 25 additions & 23 deletions README.md
@@ -2,18 +2,30 @@

This is a port of [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) to [ggerganov/ggml](https://github.com/ggerganov/ggml).

-Besides the usual **FP32**, it supports **FP16** and **quantized INT4** inference on CPU. This project is **CPU only**.
-
-RWKV is a novel large language model architecture, [with the largest model in the family having 14B parameters](https://huggingface.co/BlinkDL/rwkv-4-pile-14b). In contrast to Transformers with `O(n^2)` attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.
+Besides the usual **FP32**, it supports **FP16**, **quantized INT4** and **quantized INT8** inference. This project is **CPU only**.

This project provides [a C library rwkv.h](rwkv.h) and [a convenient Python wrapper](rwkv%2Frwkv_cpp_model.py) for it.

+RWKV is a novel large language model architecture, [with the largest model in the family having 14B parameters](https://huggingface.co/BlinkDL/rwkv-4-pile-14b). In contrast to Transformers with `O(n^2)` attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.

Loading LoRA checkpoints in [Blealtan's format](https://github.com/Blealtan/RWKV-LM-LoRA) is supported through [merge_lora_into_ggml.py script](rwkv%2Fmerge_lora_into_ggml.py).

-**TODO (contributions welcome!)**:
-
-1. Measure latency and perplexity of different model sizes (169M to 14B) and data types (`FP32`, `FP16`, `Q4_0`, `Q4_1`, `Q4_1_O`)
-2. Make required memory calculation more robust (see [#4](https://github.com/saharNooby/rwkv.cpp/issues/4))
+### Quality and performance
+
+If you use `rwkv.cpp` for anything serious, please [test all available formats for perplexity and latency](rwkv%2Fmeasure_pexplexity.py) on a representative dataset, and decide which trade-off is best for you.
+
+The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads. The best value in each column is in **bold**; the second best is in *italic*.

+| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
+|-----------|-------------------|--------------------|----------------------|
+| `Q4_0` | 17.507 | *76* | **1.53** |
+| `Q4_1` | 17.187 | **72** | 1.68 |
+| `Q4_2` | 17.060 | 85 | **1.53** |
+| `Q5_0` | 16.194 | 78 | *1.60* |
+| `Q5_1` | 15.851 | 81 | 1.68 |
+| `Q8_0` | *15.652* | 89 | 2.13 |
+| `FP16` | **15.623** | 117 | 2.82 |
+| `FP32` | **15.623** | 198 | 5.64 |

## How to use

@@ -77,26 +89,16 @@ python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-802

#### 3.1. Optionally, quantize the model

-To convert the model into INT4 quantized format, run:
+To convert the model into one of the quantized formats from the table above, run:

```commandline
# Windows
-python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_1_O.bin 4
+python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_2.bin Q4_2
# Linux / MacOS
-python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin 4
+python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_2.bin Q4_2
```

-Formats available:
-
-- `6`: `Q4_3`, OK quality, fast.
-- `5`: `Q4_2`, poor quality, fast.
-- `4`: `Q4_1_O`, best quality, slow (20% slower than `FP16`).
-- `3`: `Q4_1`, poor quality, very fast.
-- `2`: `Q4_0`, worst quality, very fast.
-
-If you use `rwkv.cpp` for anything serious (just having fun is serious enough!), please [test all available formats for perplexity and latency](rwkv%2Fmeasure_pexplexity.py) on a representative dataset, and decide which trade-off is best for you.

### 4. Run the model

**Requirements**: Python 3.x with [PyTorch](https://pytorch.org/get-started/locally/) and [tokenizers](https://pypi.org/project/tokenizers/).
@@ -107,20 +109,20 @@ To generate some text, run:

```commandline
# Windows
-python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_1_O.bin
+python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_2.bin
# Linux / MacOS
-python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
+python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin
```

To chat with a bot, run:

```commandline
# Windows
-python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_1_O.bin
+python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_2.bin
# Linux / MacOS
-python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_1_O.bin
+python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin
```

Edit [generate_completions.py](rwkv%2Fgenerate_completions.py) or [chat_with_bot.py](rwkv%2Fchat_with_bot.py) to change prompts and sampling settings.
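
Beyond the bundled scripts, the Python wrapper can drive inference directly. Below is a minimal sketch assuming the eval-style interface the sample scripts use (a model object that takes a token and the previous state and returns logits plus the new state); treat the exact class and method names as assumptions and consult [rwkv_cpp_model.py](rwkv%2Frwkv_cpp_model.py) for the real API:

```python
# A minimal sketch, assuming the eval-style wrapper API used by the sample
# scripts; check rwkv/rwkv_cpp_model.py for the actual class and method names.
import rwkv_cpp_model
import rwkv_cpp_shared_library

library = rwkv_cpp_shared_library.load_rwkv_shared_library()
model = rwkv_cpp_model.RWKVModel(library, 'rwkv.cpp-169M-Q4_2.bin')

state = None  # None starts from an empty state
for token in [123, 456]:  # token ids produced by your tokenizer
    logits, state = model.eval(token, state)

print(logits)  # next-token logits; sample from these to generate text
```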
2 changes: 1 addition & 1 deletion ggml
Submodule ggml updated from bfa8d5 to a0687a
