
Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32 #16

Merged
saharNooby merged 11 commits into master from outliers-preserving-quantization-PR on Apr 8, 2023

Conversation

saharNooby (Collaborator)

Q4_1_O is like Q4_1, but with two important differences (a rough sketch follows the list):

  • for each block, a single outlier is selected (the absmax value) and stored separately, as-is; the remaining values are quantized as if there were no outlier at all
  • during inference, the dot product in matmul is done in FP32 after dequantizing the weights, in contrast to Q4_1, which quantizes the activations and does a quantized dot product
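
For illustration, a minimal sketch of what the per-block quantization could look like, assuming a Q4_1-style block of 32 values. The struct layout, field names, and storing the outlier as FP32 are assumptions made for this sketch, not the actual rwkv.cpp/ggml code:

```c
// Illustrative sketch of outlier-preserving block quantization.
// Block size, struct layout, and names are assumptions, not the real format.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32  // values per block (assumed, same as Q4_0/Q4_1)

typedef struct {
    float    min;            // block minimum, computed without the outlier
    float    delta;          // quantization step
    uint16_t outlier_index;  // position of the outlier within the block
    float    outlier_value;  // outlier stored as-is
    uint8_t  qs[QK / 2];     // 4-bit quants, two per byte
} block_q4_1_o_sketch;

static void quantize_block_q4_1_o_sketch(const float *x, block_q4_1_o_sketch *out) {
    // 1. Pick the outlier: the value with the largest magnitude in the block.
    size_t outlier = 0;
    for (size_t i = 1; i < QK; i++) {
        if (fabsf(x[i]) > fabsf(x[outlier])) outlier = i;
    }
    out->outlier_index = (uint16_t) outlier;
    out->outlier_value = x[outlier];

    // 2. Compute min/max over the remaining values, as if the outlier did not exist.
    float min = INFINITY, max = -INFINITY;
    for (size_t i = 0; i < QK; i++) {
        if (i == outlier) continue;
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    out->min   = min;
    out->delta = (max - min) / 15.0f;  // 4 bits -> 16 levels

    // 3. Quantize to 4 bits; the outlier slot's quant is overwritten on dequantization.
    const float d = out->delta > 0.0f ? out->delta : 1.0f;
    for (size_t i = 0; i < QK; i += 2) {
        uint8_t q0 = (uint8_t) fmaxf(0.0f, fminf(15.0f, roundf((x[i]     - min) / d)));
        uint8_t q1 = (uint8_t) fmaxf(0.0f, fminf(15.0f, roundf((x[i + 1] - min) / d)));
        out->qs[i / 2] = q0 | (q1 << 4);
    }
}
```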

This format greatly improves perplexity compared to Q4_1, but the cost is inference that is roughly as slow as FP32.
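
Building on the sketch above, inference with this format could look roughly like the following: dequantize the block back to FP32 (restoring the outlier exactly), then take a plain FP32 dot product with the activations. Again, function and field names are assumptions for illustration, not the actual rwkv.cpp code:

```c
// Sketch of using a Q4_1_O-style block at inference time, assuming the
// block_q4_1_o_sketch layout above. Unlike Q4_1, the activations are never
// quantized: the block is dequantized and the dot product is done in FP32.

static float dot_block_q4_1_o_sketch(const block_q4_1_o_sketch *b, const float *activations) {
    float dequant[QK];

    // Dequantize all values from their 4-bit quants.
    for (size_t i = 0; i < QK; i += 2) {
        const uint8_t byte = b->qs[i / 2];
        dequant[i]     = b->min + b->delta * (float) (byte & 0x0F);
        dequant[i + 1] = b->min + b->delta * (float) (byte >> 4);
    }

    // Restore the outlier exactly, as-is.
    dequant[b->outlier_index] = b->outlier_value;

    // Plain FP32 dot product with the (unquantized) activations.
    float sum = 0.0f;
    for (size_t i = 0; i < QK; i++) {
        sum += dequant[i] * activations[i];
    }
    return sum;
}
```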

Perplexity comparison on a private dataset (less is better):

1B5-20220929-ctx4096-Q4_0.bin,   loss [3.079], perplexity  21.745
1B5-20220929-ctx4096-Q4_1.bin,   loss [2.655], perplexity  14.231
1B5-20220929-ctx4096-Q4_1_O.bin, loss [2.204], perplexity   9.060
1B5-20220929-ctx4096-FP16.bin,   loss [2.060], perplexity   7.847

3B-20221110-ctx4096-Q4_0.bin,    loss [4.689], perplexity 108.724
3B-20221110-ctx4096-Q4_1.bin,    loss [2.916], perplexity  18.475
3B-20221110-ctx4096-Q4_1_O.bin,  loss [2.406], perplexity  11.093
3B-20221110-ctx4096-FP16.bin,    loss [2.067], perplexity   7.901
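
For reference, perplexity in these tables appears to be exp(loss): for example, exp(3.079) ≈ 21.7 and exp(2.204) ≈ 9.06, matching the values above.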

Performance comparison (per-token latency, less is better):

1B5 FP32:   213 ms per token
1B5 FP16:   115 ms per token
1B5 Q4_0:   159 ms per token
1B5 Q4_1:   110 ms per token
1B5 Q4_1_O: 207 ms per token

@saharNooby saharNooby linked an issue (Typo in README.md) Apr 7, 2023 that may be closed by this pull request
@saharNooby saharNooby mentioned this pull request Apr 8, 2023
@saharNooby saharNooby merged commit 84e0698 into master Apr 8, 2023
8 checks passed
iacore commented Apr 22, 2023

README.md has changed since this commit.

Current:

4: Q4_1_O, OK quality, moderately fast (20% slower than FP16).
3: Q4_1, worst quality, fast (comparable to FP16).
2: Q4_0, poor quality, very fast.

Which one is correct?

saharNooby (Collaborator, Author)

@iacore The current version in README.md is correct.

Note that I'm working on pulling the Q4_2 and Q4_3 formats from ggml; the latest measurements are here. This is not merged into master yet.

@saharNooby saharNooby deleted the outliers-preserving-quantization-PR branch April 22, 2023 15:35