rwkv.cpp

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16, quantized INT4, INT5 and INT8 inference. This project is focused on CPU, but cuBLAS is also supported.

This project provides a C library rwkv.h and a convenient Python wrapper for it.

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to a Transformer with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly at large context lengths.
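
To make the scaling argument concrete, here is a toy count of work per sequence. It is not an implementation of either architecture. It assumes causal attention in which token t looks at all t tokens seen so far, and a recurrent model that performs one fixed-size state update per token. The function names are illustrative only.

```python
def attention_steps(n_tokens: int) -> int:
    """Token-to-token interactions for causal attention: 1 + 2 + ... + n."""
    return n_tokens * (n_tokens + 1) // 2


def recurrent_steps(n_tokens: int) -> int:
    """State updates for a recurrent model: exactly one per token."""
    return n_tokens


if __name__ == "__main__":
    for n in (1024, 8192):
        print(f"{n} tokens: attention={attention_steps(n)}, recurrent={recurrent_steps(n)}")
```

The attention count grows quadratically with context length, while the recurrent count grows linearly, which is the property the paragraph above relies on.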

Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.

Quality and performance

If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
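
A minimal sketch of such a test is shown below. It assumes an `eval_fn(token, state)` callable that returns `(logits, state)`, which is the shape of `model.eval` in the Python example later in this README, and it assumes the logits can be iterated as plain floats. The helper names `log_softmax` and `evaluate` are hypothetical, not part of rwkv.cpp.

```python
import math
import time


def log_softmax(logits):
    """Numerically stable log-softmax over a sequence of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def evaluate(eval_fn, tokens):
    """Return (perplexity, mean latency in ms per token) over `tokens`.

    Each token is fed to `eval_fn`, and the returned logits are scored
    against the token that actually follows it in the dataset.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a prediction")
    state = None
    total_nll = 0.0
    total_seconds = 0.0
    for current, target in zip(tokens[:-1], tokens[1:]):
        start = time.perf_counter()
        logits, state = eval_fn(current, state)
        total_seconds += time.perf_counter() - start
        total_nll -= log_softmax(list(logits))[target]
    n = len(tokens) - 1
    return math.exp(total_nll / n), 1000.0 * total_seconds / n
```

Running `evaluate` once per model file on the same token list gives directly comparable perplexity and latency figures for each format.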

The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads.

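| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
|--------|-------------------|--------------------|----------------------|
| Q4_0   | 17.507            | 76                 | 1.53                 |
| Q4_1   | 17.187            | 72                 | 1.68                 |
| Q5_0   | 16.194            | 78                 | 1.60                 |
| Q5_1   | 15.851            | 81                 | 1.68                 |
| Q8_0   | 15.652            | 89                 | 2.13                 |
| FP16   | 15.623            | 117                | 2.82                 |
| FP32   | 15.623            | 198                | 5.64                 |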
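
One way to read these numbers is relative to FP16. The sketch below only re-expresses the figures from the table above. The `relative_to_fp16` helper is illustrative, and the results say nothing about other hardware or models.

```python
# (perplexity on 169M, latency in ms on 1.5B, file size in GB on 1.5B),
# copied from the reference table above.
RESULTS = {
    "Q4_0": (17.507, 76, 1.53),
    "Q4_1": (17.187, 72, 1.68),
    "Q5_0": (16.194, 78, 1.60),
    "Q5_1": (15.851, 81, 1.68),
    "Q8_0": (15.652, 89, 2.13),
    "FP16": (15.623, 117, 2.82),
    "FP32": (15.623, 198, 5.64),
}


def relative_to_fp16(fmt):
    """Return (perplexity increase in %, speed-up factor, size ratio) vs FP16."""
    ppl, latency, size = RESULTS[fmt]
    base_ppl, base_latency, base_size = RESULTS["FP16"]
    return (
        100.0 * (ppl - base_ppl) / base_ppl,
        base_latency / latency,
        size / base_size,
    )


if __name__ == "__main__":
    for fmt in RESULTS:
        ppl_pct, speedup, size_ratio = relative_to_fp16(fmt)
        print(f"{fmt}: {ppl_pct:+.2f}% perplexity, "
              f"{speedup:.2f}x speed, {size_ratio:.2f}x size")
```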

With cuBLAS

Measurements were made on an Intel i7 13700K and an NVIDIA 3060 Ti 8 GB. Values are latency per token, in ms.

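| Model            | Layers on GPU | Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|------------------|---------------|--------|----------|-----------|-----------|-----------|------------|
| RWKV-4-Pile-169M | 12            | Q4_0   | 7.9      | 6.2       | 6.9       | 8.6       | 20         |
| RWKV-4-Pile-169M | 12            | Q4_1   | 7.8      | 6.7       | 6.9       | 8.6       | 21         |
| RWKV-4-Pile-169M | 12            | Q5_1   | 8.1      | 6.7       | 6.9       | 9.0       | 22         |

| Model               | Layers on GPU | Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|---------------------|---------------|--------|----------|-----------|-----------|-----------|------------|
| RWKV-4-Raven-7B-v11 | 32            | Q4_0   | 59       | 51        | 50        | 54        | 94         |
| RWKV-4-Raven-7B-v11 | 32            | Q4_1   | 59       | 51        | 49        | 54        | 94         |
| RWKV-4-Raven-7B-v11 | 32            | Q5_1   | 77       | 69        | 67        | 72        | 101        |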

Note: since cuBLAS is supported only for ggml_mul_mat(), some CPU resources are still needed to execute the remaining operations.
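
The measurements above also show that more threads is not always better with cuBLAS. The following sketch picks the fastest thread count per row. The data is copied from the tables above and `best_thread_count` is an illustrative helper, so treat the result as a starting point for your own measurements rather than a recommendation.

```python
THREADS = (1, 2, 4, 8, 24)

# Latency per token in ms for each (model, format), in THREADS order.
LATENCY_MS = {
    ("RWKV-4-Pile-169M", "Q4_0"): (7.9, 6.2, 6.9, 8.6, 20),
    ("RWKV-4-Pile-169M", "Q4_1"): (7.8, 6.7, 6.9, 8.6, 21),
    ("RWKV-4-Pile-169M", "Q5_1"): (8.1, 6.7, 6.9, 9.0, 22),
    ("RWKV-4-Raven-7B-v11", "Q4_0"): (59, 51, 50, 54, 94),
    ("RWKV-4-Raven-7B-v11", "Q4_1"): (59, 51, 49, 54, 94),
    ("RWKV-4-Raven-7B-v11", "Q5_1"): (77, 69, 67, 72, 101),
}


def best_thread_count(model, fmt):
    """Return (thread count, latency in ms) with the lowest measured latency."""
    latencies = LATENCY_MS[(model, fmt)]
    best = min(range(len(THREADS)), key=lambda i: latencies[i])
    return THREADS[best], latencies[best]


if __name__ == "__main__":
    for model, fmt in LATENCY_MS:
        threads, latency = best_thread_count(model, fmt)
        print(f"{model} {fmt}: {threads} threads, {latency} ms")
```

In these measurements the minimum falls at 2 threads for the 169M model and at 4 threads for the 7B model, and 24 threads is the slowest setting in every row.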

How to use

1. Clone the repo

Requirements: git.

git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library

Windows / Linux / MacOS

Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the rwkv library file into the repository directory.

On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.
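
On Linux, the same information is usually available from the `flags` line of `/proc/cpuinfo` on x86 CPUs. The sketch below parses that text. The helper names are hypothetical, and the mapping of flags to a release ZIP is left to you, since the release file names are not assumed here.

```python
def cpu_features(cpuinfo_text):
    """Return the set of CPU flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def describe_simd(flags):
    """Name the widest AVX level present in `flags`."""
    if "avx512f" in flags:
        return "AVX-512"
    if "avx2" in flags:
        return "AVX2"
    if "avx" in flags:
        return "AVX"
    return "no AVX"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(describe_simd(cpu_features(f.read())))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```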

Option 2.2. Build the library yourself

This option is recommended for maximum performance, because the library will be built specifically for your CPU and OS.

Windows

Requirements: CMake or CMake from anaconda, Build Tools for Visual Studio 2019.

cmake .
cmake --build . --config Release

If everything went OK, bin\Release\rwkv.dll file should appear.

Windows + cuBLAS

Refer to docs/cuBLAS_on_Windows.md for a comprehensive guide.

Linux / MacOS

Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, anaconda: cmake package).

cmake .
cmake --build . --config Release

Anaconda & M1 users: after running cmake ., please verify that the output shows CMAKE_SYSTEM_PROCESSOR: arm64. If it detects x86_64 instead, edit the CMakeLists.txt file and add set(CMAKE_SYSTEM_PROCESSOR "arm64") under the # Compile flags comment.

If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.

Linux / MacOS + cuBLAS
cmake . -DRWKV_CUBLAS=ON
cmake --build . --config Release

If everything went OK, librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
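
To confirm that the build produced a library where this README says it should, a quick check along these lines can help. It only encodes the output locations mentioned above, and `expected_library` is an illustrative helper rather than part of the project.

```python
import platform
from pathlib import Path

# Output locations as described in the build instructions above,
# keyed by the value of platform.system().
LIBRARY_PATHS = {
    "Windows": Path("bin") / "Release" / "rwkv.dll",
    "Linux": Path("librwkv.so"),
    "Darwin": Path("librwkv.dylib"),
}


def expected_library(repo_dir, system=None):
    """Return the path where the built library should appear."""
    system = system or platform.system()
    if system not in LIBRARY_PATHS:
        raise ValueError(f"no known library location for {system!r}")
    return Path(repo_dir) / LIBRARY_PATHS[system]


if __name__ == "__main__":
    if platform.system() in LIBRARY_PATHS:
        path = expected_library(".")
        print(f"{path}: {'found' if path.exists() else 'missing'}")
```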

3. Get an RWKV model

Option 3.1. Download pre-quantized Raven model

There are pre-quantized Raven models available on Hugging Face. Check that you are downloading a .bin file, not a .pth file.

Option 3.2. Convert and quantize PyTorch model

Requirements: Python 3.x with PyTorch.

This option requires a little more manual work, but you can use it with any RWKV model and any target format.

First, download a model from Hugging Face like this one.

Second, convert it into rwkv.cpp format using the following commands:

# Windows
python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin FP16

# Linux / MacOS
python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin FP16

Optionally, quantize the model into one of the quantized formats from the table above:

# Windows
python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q5_1.bin Q5_1

# Linux / MacOS
python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q5_1.bin Q5_1
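
If you convert models often, the two steps can be chained from Python. The sketch below only assembles the same two commands shown above. The output file names and the `build_commands` helper are placeholders chosen for illustration, and the commands are printed rather than executed.

```python
import sys
from pathlib import Path


def build_commands(pth_path, out_dir, quant_format="Q5_1", repo_dir="."):
    """Return the (convert, quantize) command lines as argument lists."""
    scripts = Path(repo_dir) / "rwkv"
    out_dir = Path(out_dir)
    # Hypothetical output names; use whatever naming scheme suits you.
    fp16_path = out_dir / "rwkv.cpp-model-FP16.bin"
    quant_path = out_dir / f"rwkv.cpp-model-{quant_format}.bin"
    convert = [sys.executable, str(scripts / "convert_pytorch_to_ggml.py"),
               str(pth_path), str(fp16_path), "FP16"]
    quantize = [sys.executable, str(scripts / "quantize.py"),
                str(fp16_path), str(quant_path), quant_format]
    return convert, quantize


if __name__ == "__main__":
    for command in build_commands("model.pth", "."):
        print(" ".join(command))
        # To actually run a step: subprocess.run(command, check=True)
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with paths that contain spaces.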

4. Run the model

Requirements: Python 3.x with PyTorch and tokenizers.

Note: to run the full-weights model, replace the model path with the path to the non-quantized model.

To generate some text, run:

# Windows
python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q5_1.bin

# Linux / MacOS
python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin

To chat with a bot, run:

# Windows
python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q5_1.bin

# Linux / MacOS
python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q5_1.bin

Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
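
For readers unfamiliar with sampling settings, here is a generic sketch of temperature and top-p (nucleus) sampling over a list of logits. It is not the code used in generate_completions.py or chat_with_bot.py, and the default values are placeholders rather than the scripts' defaults.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.5, rng=random):
    """Pick a token index from `logits` using temperature and top-p sampling."""
    if temperature <= 0.0:
        # Treat a non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits, shifted for stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose total mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower temperature and lower top-p both make the output more deterministic, while higher values make it more varied.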


Example of using rwkv.cpp in your custom Python script:

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Change this to the model path used above (quantized or full weights)
model_path = r'C:\rwkv.cpp-169M.bin'


model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    model_path,
    thread_count=4,      # may need adjusting when using cuBLAS
    gpu_layers_count=5   # only used when cuBLAS is enabled
)

logits, state = None, None

for token in [1, 2, 3]:
    logits, state = model.eval(token, state)

    print(f'Output logits: {logits}')

# Don't forget to free the memory once you are done working with the model
model.free()
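
Building on the loop above, a greedy generation routine could look like the following sketch. It assumes only the `eval(token, state) -> (logits, state)` shape shown in the example and takes the callable as a parameter, so it can be tried with a stand-in function. `greedy_generate` is a hypothetical helper, and tokenization is out of scope here.

```python
def greedy_generate(eval_fn, prompt_tokens, new_token_count):
    """Feed the prompt, then repeatedly append the highest-scoring token."""
    if not prompt_tokens:
        raise ValueError("prompt_tokens must not be empty")

    # Process the prompt one token at a time, carrying the state forward.
    logits, state = None, None
    for token in prompt_tokens:
        logits, state = eval_fn(token, state)

    generated = []
    for step in range(new_token_count):
        # Greedy choice: index of the largest logit.
        token = max(range(len(logits)), key=lambda i: logits[i])
        generated.append(token)
        # Skip the final evaluation, since its logits would go unused.
        if step + 1 < new_token_count:
            logits, state = eval_fn(token, state)
    return generated
```

With the model from the example above, this would be called as `greedy_generate(model.eval, [1, 2, 3], 10)` before `model.free()`.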

Compatibility

ggml moves fast, and can occasionally break compatibility with older file formats.

rwkv.cpp will do its best to explain why a model file can't be loaded and what next steps are available to the user.

For reference only, here is a list of the latest versions of rwkv.cpp that supported older formats. No support will be provided for these versions.

See also docs/FILE_FORMAT.md for version numbers of rwkv.cpp model files and their changelog.

Bindings

These projects wrap rwkv.cpp for easier use in other languages/frameworks.

Contributing

Please follow the code style described in docs/CODE_STYLE.md.