Implement quantization on-the-fly #100

saharNooby · 2023-06-14T15:40:57Z

This feature allows quantizing FP32/FP16 models on the fly to any quantized format, without needing to run quantize.py explicitly and keep quantized models on disk.

The intended use case is keeping only the FP16 model on disk, rather than wasting disk space on quantized copies in every possible format.

Furthermore, if the quantization format changes again, those who use on-the-fly quantization will not even notice, since an updated rwkv.cpp will simply use the new format when loading the FP16 model.
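
As a rough illustration of what quantizing a tensor at load time involves, here is a minimal sketch of block-wise 8-bit quantization. It is not the code from this PR and does not reproduce any actual ggml format: the block size of 32, the `float` scale, and every `sketch_*` name are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed block size for this sketch; real formats define their own.
constexpr size_t SKETCH_BLOCK_SIZE = 32;

// Hypothetical quantized block: one scale shared by 32 signed 8-bit values.
struct sketch_q8_block {
    float scale;                      // value represented by one integer step
    int8_t quants[SKETCH_BLOCK_SIZE]; // round(x / scale), always in [-127, 127]
};

// Quantizes one FP32 tensor as it is loaded. `n` must be a multiple of the block size.
std::vector<sketch_q8_block> sketch_quantize(const float * src, size_t n) {
    assert(n % SKETCH_BLOCK_SIZE == 0);
    std::vector<sketch_q8_block> blocks(n / SKETCH_BLOCK_SIZE);
    for (size_t b = 0; b < blocks.size(); b++) {
        const float * x = src + b * SKETCH_BLOCK_SIZE;
        // The largest magnitude in the block maps to +/-127.
        float max_abs = 0.0f;
        for (size_t i = 0; i < SKETCH_BLOCK_SIZE; i++) {
            max_abs = std::fmax(max_abs, std::fabs(x[i]));
        }
        const float scale = max_abs / 127.0f;
        const float inv_scale = scale != 0.0f ? 1.0f / scale : 0.0f;
        blocks[b].scale = scale;
        for (size_t i = 0; i < SKETCH_BLOCK_SIZE; i++) {
            blocks[b].quants[i] = static_cast<int8_t>(std::lround(x[i] * inv_scale));
        }
    }
    return blocks;
}

// Recovers an approximation of the original value at `index`.
float sketch_dequantize_at(const std::vector<sketch_q8_block> & blocks, size_t index) {
    const sketch_q8_block & blk = blocks[index / SKETCH_BLOCK_SIZE];
    return blk.scale * static_cast<float>(blk.quants[index % SKETCH_BLOCK_SIZE]);
}
```

Because each tensor is converted independently, a loader could apply a step like this to every tensor as it is read, so the higher-precision file stays the only copy on disk.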

saharNooby · 2023-06-14T15:41:54Z

rwkv.h

+        const uint32_t n_threads,
+        const struct rwkv_init_from_file_option * options,
+        const size_t option_count
+    );


@LoganDark I think now the interface is generic enough to painlessly add new options in the future -- for mmap, etc.
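
For readers without the full diff, the following is a minimal sketch of how a key/value options array of this shape could be consumed. Only the `n_threads`, `options`, and `option_count` parameters appear in the excerpt above; the fields of the option struct, the enum of keys, and all `sketch_*` names are hypothetical and may differ from what the PR defines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical option keys; the PR's real keys are not shown in the excerpt.
enum sketch_option_key {
    SKETCH_OPTION_TARGET_FORMAT_NAME = 0,
    SKETCH_OPTION_COUNT
};

// Hypothetical layout of one option: a key plus a string value.
struct sketch_init_option {
    enum sketch_option_key key;
    const char * value;
};

// Hypothetical result of parsing the options.
struct sketch_load_config {
    uint32_t n_threads;
    const char * target_format_name; // nullptr means "keep the file's own format"
};

// Walks the options array and fills in a config.
// Returns false on an unknown key or a missing value.
bool sketch_parse_options(
    uint32_t n_threads,
    const sketch_init_option * options,
    size_t option_count,
    sketch_load_config * out
) {
    assert(out != nullptr);
    out->n_threads = n_threads;
    out->target_format_name = nullptr;
    for (size_t i = 0; i < option_count; i++) {
        switch (options[i].key) {
            case SKETCH_OPTION_TARGET_FORMAT_NAME:
                if (options[i].value == nullptr) return false;
                out->target_format_name = options[i].value;
                break;
            default:
                return false;
        }
    }
    return true;
}
```

In a design like this, a new option only needs a new key and a new `case`, so the function signature itself would not have to change.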

LoganDark · 2023-06-14

eh... this does not inspire confidence for some reason. I am not sure why. I think all the existing parameters should be moved to the options structure, but also that the library needs more work before it can move to an options structure at all.

loading from file itself I intended to move into its own option, because for really insane use cases, I'm literally thinking of things like streaming the model from the network so it doesn't touch the disk at all. I imagine this being used for something like microcontrollers that don't have a filesystem. it sounds really stupid, I know, but it's a contrived example.

one of the things I planned to do first was move rwkv.cpp into using multiple files because I think its file is getting quite long and is a bit disorganized, with file reading functions and inference functions and quantization functions all in the same file. I think it works for ggml but rwkv.cpp is getting long enough that it's somewhat uncomfortable to navigate.

it's probably a bit weird of me to say that I already had a roadmap in mind but I don't think an interim solution like this would be very great, especially since having it here would encourage us to keep it.

so I would probably either hold off on merging this (I was planning to implement it myself anyway) or find a way that doesn't involve moving to an options dict so soon. but I think rwkv.cpp does not need quantize on load at all yet - it will become more useful when it can load directly from pytorch checkpoints, as those cannot be quantized at all, so quantizing on load would be the only option, but that subsystem does not exist yet and I will account for when it does.

saharNooby · 2023-06-14

> I already had a roadmap in mind
> so I would probably either hold off on merging this

If possible, can you share the roadmap (even if it is rough), and some timelines? Specifically for PyTorch loading support

Honestly, there is no hurry to merge the PR, since everything worked fine before it and no one complained. But I would like to have a reasonably good reason to postpone it.

> one of the things I planned to do first was move rwkv.cpp into using multiple files because I think its file is getting quite long and is a bit disorganized

Completely agree, had these ideas myself. Not related to this PR tho :)

> loading from file itself I intended to move into its own option

I don't think rwkv.cpp should support network loading or other non-file use cases, so the file_path argument will most probably stay. As you said yourself, these use cases are insane lol

LoganDark · 2023-06-14

> If possible, can you share the roadmap (even if it is rough), and some timelines? Specifically for PyTorch loading support

I plan to implement in this rough order:

1. new implementation of WKV for sequence mode (should be faster), but may not be finished
2. PR my own python bindings for the library that should handle errors a bit better / be slightly easier to use
3. reorganize rwkv.cpp into files (maybe move into src directly and use #include), no need to complicate the cmake
4. make the loading system more generic (should support any method of loading), probably a total API redesign
5. implement mmap, pytorch loading, quantize on load on top of that

I also have ideas for a compressed model format, maybe with magic number "ggzf" for "zip", because I noticed that pytorch checkpoints are far smaller than ggml models due to using zip compression, and I would love to reduce the size of large ggml models (by 10-20 GB per model!!) by using compression. This could also speed up load times :)

also, I want to make model loading one-shot again (only read the file once), because depending on fseek and fstat and ftell is hurting our cross-platform compatibility. Additionally, that would remove the dependency on a hash map at runtime (is it a hash map? some kind of map) and load the tensors directly into the model. I have a working version of this in rust actually, but it would need to be ported to C++ (should be easy).

Anyway, overall the goal is to make the library a lot more flexible. It was specialized as a prototype to load a single model from a binary file and evaluate single tokens, but it'll get a lot more exciting and faster if we make it more flexible.

Imagine downloading a compressed model file, and either loading it directly, or using the library itself to decompress it and then using mmap (without requiring python). Or even imagine downloading fresh pytorch checkpoints, minutes after BlinkDL first releases them, and either converting them tensor by tensor (like quantization) or just using them that way.

Imagine using this on desktops, servers, mobile phones, embedded devices (possibly with TPU ?!), whatever.

Imagine training models with rwkv.cpp, too (that is not on my roadmap because I don't know how I would do that yet, but I can still dream :3)

> Not related to this PR tho :)

It's related as in I consider it a blocker, i.e. I wouldn't implement the options until the source code is organized enough.

> non-file use cases

You know mmap support is the biggest non-file use case. rwkv.cpp will have to implement loading from memory anyway. The only difference is whether we allow third-party programs to use this functionality. Ideally it would be implemented in such a way that rwkv.cpp will not have to support network loading or anything insane like that. It would just support "any kind of loading", and programs would be able to implement their own network loading if they wanted.
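
To illustrate the "any kind of loading" idea together with one-shot reading, here is a minimal sketch in which the loader only ever asks a caller-supplied callback for the next bytes and never seeks. The `sketch_*` names and the toy file layout (a tensor count, then a length-prefixed `float` array per tensor, in native byte order) are assumptions made for the example; they are not rwkv.cpp's API or file format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical source abstraction: the loader only asks for "the next `size` bytes".
struct sketch_reader {
    // Copies exactly `size` bytes into `dest`; returns false if the source ran out.
    bool (*read)(void * user_data, void * dest, size_t size);
    void * user_data;
};

// One possible source: bytes already in memory (an mmap'd region, a network buffer, ...).
struct sketch_memory_source {
    const uint8_t * data;
    size_t size;
    size_t offset;
};

bool sketch_memory_read(void * user_data, void * dest, size_t size) {
    sketch_memory_source * src = static_cast<sketch_memory_source *>(user_data);
    if (size > src->size - src->offset) return false;
    std::memcpy(dest, src->data + src->offset, size);
    src->offset += size;
    return true;
}

// Reads the toy format strictly front to back: no fseek, ftell, or fstat needed.
// A real loader would also validate sizes before allocating.
bool sketch_load_one_shot(sketch_reader reader, std::vector<std::vector<float>> * tensors) {
    uint32_t count = 0;
    if (!reader.read(reader.user_data, &count, sizeof(count))) return false;
    for (uint32_t i = 0; i < count; i++) {
        uint32_t n = 0;
        if (!reader.read(reader.user_data, &n, sizeof(n))) return false;
        std::vector<float> tensor(n);
        const size_t bytes = static_cast<size_t>(n) * sizeof(float);
        if (n > 0 && !reader.read(reader.user_data, tensor.data(), bytes)) return false;
        tensors->push_back(std::move(tensor));
    }
    return true;
}
```

With an interface along these lines, a file-backed source would be just another callback built on `fread`, and a third-party program could supply its own callback for a network stream without the library knowing about it.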

lin72h · 2023-06-15

> loading from file itself I intended to move into its own option, because for really insane use cases, I'm literally thinking of things like streaming the model from the network so it doesn't touch the disk at all. I imagine this being used for something like microcontrollers that don't have a filesystem. it sounds really stupid, I know, but it's a contrived example.

Mind-blowing. I think this is actually an excellent use case for rwkv. From my limited understanding of the rwkv internals, the context memory is constant and the memory access pattern is sequential (backward or forward). So it makes a lot of sense to convert the source of truth (the f16 weights) to the latest quantized format on the fly, much like a load-time JIT compiler.

Both of you 🤘

LoganDark · 2023-06-15

@lin72h the source of truth is actually the f32 version, as that's what BlinkDL trains, but f16 would still count as a source of truth if you're using it to generate a quantized model. :)

saharNooby · 2023-06-15T11:23:17Z

@LoganDark Thanks for describing the roadmap! Let's wait until the API redesign then. I hope it won't be too breaking :)

I'll leave this PR hanging as a draft until the new loading method is available, so that users who want on-the-fly quantization now can find and use this branch.

LoganDark · 2023-06-15T14:27:45Z

> I hope it won't be too breaking :)

It should be possible to reimplement the current API in terms of the new one, in order to keep compatibility with existing programs :)

saharNooby added 6 commits June 13, 2023 20:34

65cdb51 Implement on-the-fly quantization
dca26e9 Resolve TODO items
7a13fd2 Fix error code
4d27fa8 Reformat code
d3b6749 Consistently use FP16 and FP32 for rwkv.cpp data types
c49d3d8 Add test for on-the-fly quantization

saharNooby marked this pull request as draft June 15, 2023 11:20

saharNooby mentioned this pull request Jun 18, 2023

Merge RWKV back to GGML? ggerganov/ggml#266

Closed
