rwkv.cpp

This is a port of BlinkDL/RWKV-LM to ggerganov/ggml.

Besides the usual FP32, it supports FP16 and quantized INT4, INT5 and INT8 inference. This project is CPU only.

This project provides a C library rwkv.h and a convenient Python wrapper for it.

RWKV is a novel large language model architecture, with the largest model in the family having 14B parameters. In contrast to Transformers with O(n^2) attention, RWKV requires only the state from the previous step to calculate logits. This makes RWKV very CPU-friendly on large context lengths.
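To make the O(n^2) versus constant-per-token contrast concrete, here is a rough counting sketch. It is a toy: `attention_dot_products`, `recurrent_state_updates` and `toy_recurrent_run` are hypothetical helpers, and the exponential-decay update only mimics the shape of RWKV's computation (a fixed-size state updated once per token), not its actual time-mixing formulas.

```python
from typing import List


def attention_dot_products(n_tokens: int) -> int:
    # Causal self-attention: token t attends to tokens 1..t, so the
    # total number of query-key dot products grows as n * (n + 1) / 2.
    return n_tokens * (n_tokens + 1) // 2


def recurrent_state_updates(n_tokens: int) -> int:
    # A recurrent model updates its state once per token, regardless of
    # how long the history already is.
    return n_tokens


def toy_recurrent_run(embeddings: List[List[float]], decay: float = 0.9) -> List[float]:
    # Toy recurrence (NOT RWKV's formulas): an exponentially decayed running
    # average. The state stays the same size no matter how many tokens pass.
    state = [0.0] * len(embeddings[0])
    for x in embeddings:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
    return state


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, attention_dot_products(n), recurrent_state_updates(n))
```

At 10,000 tokens the attention count is about 50 million dot products, versus 10,000 state updates for the recurrent loop.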

Loading LoRA checkpoints in Blealtan's format is supported through the merge_lora_into_ggml.py script.
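Merging a LoRA checkpoint means folding the low-rank update back into the base weights, so inference needs no extra work. The sketch below assumes the standard LoRA formulation W' = W + (alpha / r) * B @ A; the helper names are hypothetical, and it does not reproduce the exact tensor names or scaling that merge_lora_into_ggml.py reads from Blealtan's checkpoints.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain triple-loop matrix product: (m x k) @ (k x n) -> (m x n).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    # weight: (out x in), lora_b: (out x r), lora_a: (r x in).
    # Assumed standard LoRA merge: W' = W + (alpha / r) * B @ A.
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    print(merge_lora(base, lora_a=[[3.0, 4.0]], lora_b=[[1.0], [2.0]], alpha=2.0, r=1))
    # [[7.0, 8.0], [12.0, 17.0]]
```

Because the merged matrix has the same shape as the original, it can then be converted like any other weight.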

Quality and performance

If you use rwkv.cpp for anything serious, please test all available formats for perplexity and latency on a representative dataset, and decide which trade-off is best for you.
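Perplexity is the exponential of the average negative log-likelihood the model assigns to each actual next token. A minimal pure-Python sketch, assuming you have already collected one logits vector per position (for example from repeated eval calls) and the matching target token IDs; the function names are illustrative and not part of rwkv.cpp:

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def perplexity(logits_per_step: List[Sequence[float]], targets: List[int]) -> float:
    # logits_per_step[i] is the model output after consuming token i,
    # targets[i] is the token that actually came next.
    if len(logits_per_step) != len(targets) or not targets:
        raise ValueError("need one logits vector per target token")
    nll = 0.0
    for logits, target in zip(logits_per_step, targets):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(targets))


if __name__ == "__main__":
    uniform = [0.0] * 4
    print(perplexity([uniform] * 3, [0, 1, 2]))  # ~4.0: a uniform guess over 4 tokens
```

Lower is better; a model that always guesses uniformly over a vocabulary of size V scores exactly V.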

The table below is for reference only. Measurements were made on a 4C/8T x86 CPU with AVX2, using 4 threads.

Format	Perplexity (169M)	Latency, ms (1.5B)	File size, GB (1.5B)
`Q4_0`	17.507	76	1.53
`Q4_1`	17.187	72	1.68
`Q4_2`	17.060	85	1.53
`Q5_0`	16.194	78	1.60
`Q5_1`	15.851	81	1.68
`Q8_0`	15.652	89	2.13
`FP16`	15.623	117	2.82
`FP32`	15.623	198	5.64
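One way to act on the table is to pick the fastest format that stays under a perplexity budget. The sketch below just encodes the reference numbers above (note that perplexity was measured on the 169M model and latency on the 1.5B model); `fastest_within` is a hypothetical helper, and you should substitute your own measurements.

```python
from typing import Dict, Optional, Tuple

# (perplexity on 169M, latency in ms on 1.5B, file size in GB on 1.5B), copied from the table.
MEASUREMENTS: Dict[str, Tuple[float, int, float]] = {
    "Q4_0": (17.507, 76, 1.53),
    "Q4_1": (17.187, 72, 1.68),
    "Q4_2": (17.060, 85, 1.53),
    "Q5_0": (16.194, 78, 1.60),
    "Q5_1": (15.851, 81, 1.68),
    "Q8_0": (15.652, 89, 2.13),
    "FP16": (15.623, 117, 2.82),
    "FP32": (15.623, 198, 5.64),
}


def fastest_within(max_perplexity: float,
                   measurements: Dict[str, Tuple[float, int, float]] = MEASUREMENTS) -> Optional[str]:
    # Keep formats whose perplexity fits the budget, then take the lowest latency.
    candidates = [(latency, name) for name, (ppl, latency, _size) in measurements.items()
                  if ppl <= max_perplexity]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(fastest_within(16.0))  # Q5_1
```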

How to use

1. Clone the repo

Requirements: git.

git clone --recursive https://github.com/saharNooby/rwkv.cpp.git
cd rwkv.cpp
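The --recursive flag also fetches the ggml submodule. If the repository was cloned without it, the ggml directory is left empty; a small check such as the following (a hypothetical helper, not part of the repository) can catch that before building. If it reports an empty directory, `git submodule update --init --recursive` fetches the missing code.

```python
import os


def submodule_populated(path: str) -> bool:
    # An uninitialized git submodule shows up as an empty (or missing) directory.
    return os.path.isdir(path) and len(os.listdir(path)) > 0


if __name__ == "__main__":
    if submodule_populated("ggml"):
        print("ggml submodule looks populated")
    else:
        print("ggml is empty; run: git submodule update --init --recursive")
```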

2. Get the rwkv.cpp library

Option 2.1. Download a pre-compiled library

Windows / Linux / MacOS

Check out Releases, download the appropriate ZIP for your OS and CPU, and extract the rwkv library file into the repository directory.

On Windows: to check whether your CPU supports AVX2 or AVX-512, use CPU-Z.
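On Linux, the same information is available in /proc/cpuinfo. A minimal sketch that parses its `flags` line on x86 CPUs; AVX-512 is detected via the `avx512f` foundation flag, and ARM kernels use a `Features` line instead, which this sketch ignores. The helper names are illustrative.

```python
from typing import Dict, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    # Collect every token from the "flags : ..." lines of /proc/cpuinfo.
    flags: Set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def simd_support(flags: Set[str]) -> Dict[str, bool]:
    return {
        "AVX": "avx" in flags,
        "AVX2": "avx2" in flags,
        "AVX-512": "avx512f" in flags,
    }


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(simd_support(cpu_flags(f.read())))
```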

Option 2.2. Build the library yourself

Windows

Requirements: CMake (or CMake from Anaconda) and the MSVC compiler.

cmake .
cmake --build . --config Release

If everything went OK, the bin\Release\rwkv.dll file should appear.

Linux / MacOS

Requirements: CMake (Linux: sudo apt install cmake, MacOS: brew install cmake, Anaconda: cmake package).

cmake .
cmake --build . --config Release

Anaconda & M1 users: after running cmake ., please verify that it reports CMAKE_SYSTEM_PROCESSOR: arm64. If it detects x86_64, edit the CMakeLists.txt file and add set(CMAKE_SYSTEM_PROCESSOR "arm64") under the # Compile flags section.

If everything went OK, the librwkv.so (Linux) or librwkv.dylib (MacOS) file should appear in the base repo folder.
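A short sketch that maps the current platform to the file name and location described above, then tries to load the library with ctypes to confirm the build is usable. `expected_library_path` is a hypothetical helper that reflects the self-build locations; a pre-compiled Windows release extracted into the repository directory would sit next to the sources rather than under bin\Release.

```python
import ctypes
import os
import platform


def expected_library_path(repo_dir: str, system: str) -> str:
    # `system` is a value of platform.system(): "Windows", "Linux" or "Darwin" (macOS).
    if system == "Windows":
        return os.path.join(repo_dir, "bin", "Release", "rwkv.dll")
    if system == "Darwin":
        return os.path.join(repo_dir, "librwkv.dylib")
    return os.path.join(repo_dir, "librwkv.so")


if __name__ == "__main__":
    path = expected_library_path(".", platform.system())
    if os.path.isfile(path):
        ctypes.CDLL(os.path.abspath(path))  # raises OSError if the library cannot be loaded
        print(f"Loaded {path}")
    else:
        print(f"{path} not found; check the build output")
```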

3. Download an RWKV model from Hugging Face like this one and convert it into `ggml` format

Requirements: Python 3.x with PyTorch.

# Windows
python rwkv\convert_pytorch_to_ggml.py C:\RWKV-4-Pile-169M-20220807-8023.pth C:\rwkv.cpp-169M.bin float16

# Linux / MacOS
python rwkv/convert_pytorch_to_ggml.py ~/Downloads/RWKV-4-Pile-169M-20220807-8023.pth ~/Downloads/rwkv.cpp-169M.bin float16
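The float16 argument stores each weight in 2 bytes instead of 4, which is why the FP16 file in the table is half the size of the FP32 one. The standard-library struct module can illustrate both the size halving and the small rounding error; this is only an illustration with hypothetical helper names, not what convert_pytorch_to_ggml.py does internally.

```python
import struct
from typing import List


def pack_fp32(values: List[float]) -> bytes:
    return struct.pack(f"<{len(values)}f", *values)


def pack_fp16(values: List[float]) -> bytes:
    # "e" is the IEEE 754 half-precision format code (Python 3.6+).
    return struct.pack(f"<{len(values)}e", *values)


def fp16_roundtrip(values: List[float]) -> List[float]:
    # Pack to half precision and back to see how much precision is lost.
    return list(struct.unpack(f"<{len(values)}e", pack_fp16(values)))


if __name__ == "__main__":
    weights = [0.1, -0.25, 1.5, 3.14159]
    print(len(pack_fp32(weights)), len(pack_fp16(weights)))  # 16 8
    print(fp16_roundtrip(weights))
```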

3.1. Optionally, quantize the model

To convert the model into one of the quantized formats from the table above, run:

# Windows
python rwkv\quantize.py C:\rwkv.cpp-169M.bin C:\rwkv.cpp-169M-Q4_2.bin Q4_2

# Linux / MacOS
python rwkv/quantize.py ~/Downloads/rwkv.cpp-169M.bin ~/Downloads/rwkv.cpp-169M-Q4_2.bin Q4_2
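Quantization replaces each weight with a small integer plus a shared scale. The sketch below is a simplified, pure-Python illustration of blockwise symmetric 8-bit quantization, loosely modeled on the Q8_0 idea (one scale per block of 32 values); it does not reproduce ggml's actual memory layout, and the 4- and 5-bit formats work differently. The function names are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32


def quantize_q8_blocks(values: List[float]) -> List[Tuple[float, List[int]]]:
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        # One scale per block, chosen so the largest magnitude maps to 127.
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax > 0 else 1.0
        q = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, q))
    return blocks


def dequantize_q8_blocks(blocks: List[Tuple[float, List[int]]]) -> List[float]:
    out: List[float] = []
    for scale, q in blocks:
        out.extend(scale * x for x in q)
    return out


if __name__ == "__main__":
    weights = [0.5, -1.2, 0.03, 0.9] * 10
    restored = dequantize_q8_blocks(quantize_q8_blocks(weights))
    print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each block then costs 32 one-byte values plus one scale instead of 32 four-byte floats; the per-value error is at most half a scale step.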

4. Run the model

Requirements: Python 3.x with PyTorch and tokenizers.

Note: to use the full-weights model, replace the model path in the commands below with the path to the non-quantized model.

To generate some text, run:

# Windows
python rwkv\generate_completions.py C:\rwkv.cpp-169M-Q4_2.bin

# Linux / MacOS
python rwkv/generate_completions.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin

To chat with a bot, run:

# Windows
python rwkv\chat_with_bot.py C:\rwkv.cpp-169M-Q4_2.bin

# Linux / MacOS
python rwkv/chat_with_bot.py ~/Downloads/rwkv.cpp-169M-Q4_2.bin

Edit generate_completions.py or chat_with_bot.py to change prompts and sampling settings.
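The sampling settings are typically a temperature and a top-p (nucleus) threshold. The following is a generic sketch of that technique over a plain list of logits; `sample_logits` and its parameter names are assumptions for illustration and may not match the helper the scripts actually use.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_logits(logits: Sequence[float], temperature: float = 1.0, top_p: float = 0.8,
                  rng: Optional[random.Random] = None) -> int:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Softmax (max-subtracted for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Temperature: p ** (1 / T) sharpens (T < 1) or flattens (T > 1) the kept distribution.
    # Dividing by the top probability first keeps the largest weight at 1.0 (no underflow).
    top = probs[kept[0]]
    weights = [(probs[i] / top) ** (1.0 / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample_logits([0.5, 2.0, 1.0, -1.0], temperature=0.8, top_p=0.9))
```

Lower top_p restricts sampling to fewer candidate tokens; lower temperature makes the remaining choice more deterministic.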

Example of using rwkv.cpp in your custom Python script:

import rwkv_cpp_model
import rwkv_cpp_shared_library

# Change to the model path used above (quantized or full weights)
model_path = r'C:\rwkv.cpp-169M.bin'

model = rwkv_cpp_model.RWKVModel(
    rwkv_cpp_shared_library.load_rwkv_shared_library(),
    model_path
)

logits, state = None, None

for token in [1, 2, 3]:
    logits, state = model.eval(token, state)

    print(f'Output logits: {logits}')

# Don't forget to free the memory after you're done working with the model
model.free()
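The eval loop above generalizes to text generation: feed the prompt tokens one at a time, then repeatedly pick a token from the last logits and feed it back in, carrying the state forward. The sketch below uses greedy (argmax) selection and a hypothetical `ToyModel` stand-in with the same `eval(token, state) -> (logits, state)` shape as the example, so it runs without the shared library. With the real wrapper the logits are, as far as we know, a PyTorch tensor, so you would likely swap the list-based `argmax` for the tensor's own `argmax()`.

```python
from typing import List, Optional, Sequence, Tuple


class ToyModel:
    """Hypothetical stand-in with the same eval(token, state) -> (logits, state) shape."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def eval(self, token: int, state: Optional[int]) -> Tuple[List[float], int]:
        # Toy behavior: always predicts token + 1; the "state" just counts tokens seen.
        logits = [0.0] * self.vocab_size
        logits[(token + 1) % self.vocab_size] = 1.0
        return logits, (state or 0) + 1


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def greedy_generate(model, prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    if not prompt_tokens:
        raise ValueError("need at least one prompt token")
    logits, state = None, None
    # Feed the prompt one token at a time, threading the state through.
    for token in prompt_tokens:
        logits, state = model.eval(token, state)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = argmax(logits)
        generated.append(token)
        # The chosen token becomes the next input; the state carries the history.
        logits, state = model.eval(token, state)
    return generated


if __name__ == "__main__":
    print(greedy_generate(ToyModel(vocab_size=5), [1, 2, 3], max_new_tokens=3))  # [4, 0, 1]
```

Because the state summarizes everything seen so far, each new token costs one eval call, no matter how long the prompt was.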
