Implement quantization on-the-fly #100

Draft
wants to merge 6 commits into
base: master

Conversation

saharNooby
Collaborator

This feature allows quantizing FP32/FP16 models on-the-fly to any quantized format, without having to explicitly run quantize.py and keep quantized models on disk.

The intended use case is keeping only the FP16 model on disk, instead of spending disk space on quantized copies in every possible format.

Furthermore, if the quantization format changes again, those who use on-the-fly quantization will not even notice, since an updated rwkv.cpp will simply use the new format when loading the FP16 model.
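
To illustrate what quantizing while loading involves, here is a minimal sketch, assuming a simplified symmetric 8-bit block scheme. It is not one of the ggml formats that rwkv.cpp actually uses, and every name in it (`example_block_q8`, `example_quantize_block`, `EXAMPLE_BLOCK_SIZE`) is hypothetical. The idea is only that each block of full-precision weights can be converted as it is read, so no quantized copy needs to exist on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size; real formats define their own. */
#define EXAMPLE_BLOCK_SIZE 32

/* Hypothetical quantized block: one scale plus 8-bit values. */
struct example_block_q8 {
    float scale;
    int8_t values[EXAMPLE_BLOCK_SIZE];
};

/* Quantizes one block of FP32 weights so that value[i] * scale approximates src[i]. */
static void example_quantize_block(const float * src, struct example_block_q8 * dst) {
    assert(src != NULL && dst != NULL);

    /* Find the largest magnitude in the block; it maps to +/-127. */
    float max_abs = 0.0f;
    for (size_t i = 0; i < EXAMPLE_BLOCK_SIZE; i++) {
        float magnitude = src[i] < 0.0f ? -src[i] : src[i];
        if (magnitude > max_abs) {
            max_abs = magnitude;
        }
    }

    dst->scale = max_abs / 127.0f;
    float inverse = max_abs > 0.0f ? 127.0f / max_abs : 0.0f;

    for (size_t i = 0; i < EXAMPLE_BLOCK_SIZE; i++) {
        float scaled = src[i] * inverse;
        /* Round half away from zero without depending on libm. */
        dst->values[i] = (int8_t) (scaled < 0.0f ? scaled - 0.5f : scaled + 0.5f);
    }
}

/* Recovers an approximation of the original weight at index i. */
static float example_dequantize(const struct example_block_q8 * block, size_t i) {
    return block->scale * (float) block->values[i];
}
```

A loader built this way would call `example_quantize_block` on each block right after reading it from the FP16/FP32 file, so the quantized tensors only ever live in memory.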

const uint32_t n_threads,
const struct rwkv_init_from_file_option * options,
const size_t option_count
);
Collaborator Author

@LoganDark I think now the interface is generic enough to painlessly add new options in the future -- for mmap, etc.
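
For readers following the thread, here is a rough sketch of how a key/value options array of this shape might be consumed. Only the `options` / `option_count` parameter pair comes from the diff above; the struct fields, the option keys, and both helper functions are hypothetical and are not the PR's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option keys; not taken from the PR. */
enum example_option_key {
    EXAMPLE_OPTION_QUANTIZE_FORMAT, /* value is a format name, e.g. "Q5_1" */
    EXAMPLE_OPTION_USE_MMAP         /* value is "1" to enable */
};

/* Hypothetical key/value pair; values must be non-NULL strings. */
struct example_option {
    enum example_option_key key;
    const char * value;
};

/* Returns the value of the last option with the given key, or fallback if absent. */
static const char * example_get_option(
    const struct example_option * options,
    size_t option_count,
    enum example_option_key key,
    const char * fallback
) {
    assert(options != NULL || option_count == 0);

    const char * result = fallback;
    for (size_t i = 0; i < option_count; i++) {
        if (options[i].key == key) {
            result = options[i].value;
        }
    }
    return result;
}

/* Interprets an option as a flag: enabled only when its value is exactly "1". */
static int example_option_enabled(
    const struct example_option * options,
    size_t option_count,
    enum example_option_key key
) {
    return strcmp(example_get_option(options, option_count, key, "0"), "1") == 0;
}
```

With a design like this, a new option only needs a new enum value; existing callers that pass a shorter array keep compiling and fall back to the defaults.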

Contributor

eh... this does not inspire confidence for some reason. I am not sure why. I think all the existing parameters should be moved to the options structure, but also that the library needs more work before it can move to an options structure at all.

loading from file itself I intended to move into its own option, because for really insane use cases, I'm literally thinking of things like streaming the model from the network so it doesn't touch the disk at all. I imagine this being used for something like microcontrollers that don't have a filesystem. it sounds really stupid, I know, but it's a contrived example.

one of the things I planned to do first was move rwkv.cpp into using multiple files because I think its file is getting quite long and is a bit disorganized, with file reading functions and inference functions and quantization functions all in the same file. I think it works for ggml but rwkv.cpp is getting long enough that it's somewhat uncomfortable to navigate.

it's probably a bit weird of me to say that I already had a roadmap in mind but I don't think an interim solution like this would be very great, especially since having it here would encourage us to keep it.

so I would probably either hold off on merging this (I was planning to implement it myself anyway) or find a way that doesn't involve moving to an options dict so soon. but I think rwkv.cpp does not need quantize on load at all yet - it will become more useful when it can load directly from pytorch checkpoints, as those cannot be quantized at all, so quantizing on load would be the only option, but that subsystem does not exist yet and I will account for when it does.

Collaborator Author

I already had a roadmap in mind
so I would probably either hold off on merging this

If possible, can you share the roadmap (even if it is rough), and some timelines? Specifically for PyTorch loading support.

Honestly, there is no hurry to merge the PR, since everything worked fine before it and no one complained. But I would like to have a reasonably good reason to postpone it.

one of the things I planned to do first was move rwkv.cpp into using multiple files because I think its file is getting quite long and is a bit disorganized

Completely agree, had these ideas myself. Not related to this PR tho :)

loading from file itself I intended to move into its own option

I don't think rwkv.cpp should support network loading or other non-file use cases, so the file_path argument will most probably stay. As you said yourself, these use cases are insane lol

Contributor

@LoganDark Jun 14, 2023

If possible, can you share the roadmap (even if it is rough), and some timelines? Specifically for PyTorch loading support.

I plan to implement in this rough order:

  • new implementation of WKV for sequence mode (should be faster), but may not be finished
  • PR my own python bindings for the library that should handle errors a bit better / be slightly easier to use
  • reorganize rwkv.cpp into files (maybe move into src directly and use #include), no need to complicate the cmake
  • make the loading system more generic (should support any method of loading), probably a total API redesign
  • implement mmap, pytorch loading, quantize on load on top of that
  • I also have ideas for a compressed model format, maybe with magic number "ggzf" for "zip", because I noticed that pytorch checkpoints are far smaller than ggml models due to using zip compression, and I would love to reduce the size of large ggml models (by 10-20gb per model !!) by using compression. This could also speed up load times :)

also, I want to make model loading one-shot again (only read the file once), because depending on fseek and fstat and ftell is hurting our cross-platform compatibility. Additionally, that would remove the dependency on a hash map at runtime (is it a hash map? some kind of map) and load the tensors directly into the model. I have a working version of this in Rust actually, but it would need to be ported to C++ (should be easy).

Anyway, overall the goal is to make the library a lot more flexible, it was specialized as a prototype to load a single model from a binary file and evaluate single tokens, but it'll get a lot more exciting and faster if we make it more flexible.

Imagine downloading a compressed model file, and either loading it directly, or using the library itself to decompress it and then using mmap (without requiring python). Or even imagine downloading fresh pytorch checkpoints, minutes after BlinkDL first releases them, and either converting them tensor by tensor (like quantization) or just using them that way.

Imagine using this on desktops, servers, mobile phones, embedded devices (possibly with TPU ?!), whatever.

Imagine training models with rwkv.cpp, too (that is not on my roadmap because I don't know how I would do that yet, but I can still dream :3)

Not related to this PR tho :)

It's related as in I consider it a blocker, i.e. I wouldn't implement the options until the source code is organized enough.

non-file use cases

You know mmap support is the biggest non-file use case. rwkv.cpp will have to implement loading from memory anyway. The only difference is whether we allow third party programs to use this functionality. Ideally it would be implemented in such a way that rwkv.cpp will not have to support network loading or anything insane like that. It would just support "any kind of loading" and programs would be able to implement their own network loading if they wanted
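
One way to picture "any kind of loading" is a read callback that the library consumes strictly front to back. The sketch below is only an assumption about what such an interface could look like; `example_read_fn`, `example_memory_source`, and `example_load` are hypothetical names and not part of rwkv.cpp. The memory-backed source stands in for an mmap'd region or a buffer that a third-party program filled from the network.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback type: copies up to `size` bytes into `dest`
   and returns how many bytes were actually produced (0 at end of data). */
typedef size_t (*example_read_fn)(void * user_data, void * dest, size_t size);

/* Hypothetical memory-backed source with a forward-only cursor. */
struct example_memory_source {
    const uint8_t * data;
    size_t size;
    size_t offset;
};

/* Read callback for example_memory_source. */
static size_t example_memory_read(void * user_data, void * dest, size_t size) {
    struct example_memory_source * source = (struct example_memory_source *) user_data;
    size_t remaining = source->size - source->offset;
    size_t count = size < remaining ? size : remaining;

    memcpy(dest, source->data + source->offset, count);
    source->offset += count;
    return count;
}

/* Stand-in for a loader: consumes the stream in one forward pass (no seeking)
   and returns the total number of bytes seen. A real loader would parse
   tensor headers and data here instead of just counting. */
static size_t example_load(example_read_fn read, void * user_data) {
    assert(read != NULL);

    uint8_t chunk[4];
    size_t total = 0;
    size_t count;
    while ((count = read(user_data, chunk, sizeof(chunk))) > 0) {
        total += count;
    }
    return total;
}
```

Because the loader never seeks, the same `example_load` would work with a callback that wraps `fread`, a socket, or a decompressor, and the library itself would not need to know which.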

loading from file itself I intended to move into its own option, because for really insane use cases, I'm literally thinking of things like streaming the model from the network so it doesn't touch the disk at all. I imagine this being used for something like microcontrollers that don't have a filesystem. it sounds really stupid, I know, but it's a contrived example.

Mind blowing. I think this is actually an excellent use case for rwkv. From my limited understanding of the rwkv internals, the context memory is constant and the memory access pattern is sequential (backward or forward). So it makes a lot of sense to convert the source of truth (f16 weights) to the latest quantized format on the fly, much like a load-time JIT compiler.

Both of you 🤘

Contributor

@lin72h the source of truth is actually the f32 version, as that's what BlinkDL trains, but f16 would still count as a source of truth if you're using it to generate a quantized model. :)

@saharNooby marked this pull request as draft June 15, 2023 11:20
@saharNooby
Collaborator Author

@LoganDark Thanks for describing the roadmap! Let's wait until API redesign then. I hope it won't be too breaking :)

I'll leave this PR hanging as a draft until the new loading method is available, so that users who want to use on-the-fly quantization now can notice and use this branch.

@LoganDark
Contributor

I hope it won't be too breaking :)

It should be possible to reimplement the current API in terms of the new one, in order to keep compatibility with existing programs :)

3 participants