Training, in .cpp, on one machine? #84

Open
SCRIER-org opened this issue Jun 2, 2023 · 7 comments

@SCRIER-org

This is a really great package. I don't yet understand the training mathematics, however. To get a system that integrates with legacy C++ code and runs fast, ideally faster than a Python bridge, how hard would it be to slap together a baby trainer demo? Something similar to the (pick one) character-based / word-based tiny-shakespeare / OpenWebText training examples in https://github.com/karpathy/nanoGPT/tree/master/data? It would be really useful if this could be done.

@saharNooby
Collaborator

ggml technically supports training, and it may be possible to support it in rwkv.cpp. Implementing it would require a sequence mode implementation, which is not done yet. I think that without sequence mode (that is, using naive RNN mode), training would be too slow.

I myself have no such plans, but contributions are welcome.

@SCRIER-org
Author

SCRIER-org commented Jun 3, 2023

Thanks for your reply.

Confusingly, the paper claims the model can be trained in "time-parallel mode" (sec 4.2), but then says it needs a "serial scan" to update the attention scores wkv, which is unclear. How does it work, then? The abstract says the model can be "formulated as a Transformer", "which parallelizes computations during training". I interpret this as perhaps meaning that you train the model with a standard Transformer trainer, then run inference afterward using the RWKV RNN formulation. Evidence for this: v4/verify.py invokes both RWKV_RNN and RWKV_GPT, perhaps on the same saved model, and the v4/trainer.py run() routine calls the GPT(GPTConfig) model to train. I don't see any serial scans in the example trainer routines; no real method seems to be laid out, it just wraps pytorch_lightning.Trainer and invokes .fit(). It's not making much sense to me yet.

Question: Does RWKV use the same numeric model parameters as a version trained with GPT backprop, simply changing how they are used, or does it require its own RNN version of training?

How do you find the running speed of your .cpp version of RWKV inference, vs. the Python version?

Thank you again.

@LoganDark
Contributor

LoganDark commented Jun 3, 2023

The abstract says the model can be "formulated as a Transformer", "which parallelizes computations during training". I interpret this as perhaps meaning that you train the model with a standard Transformer trainer, then run inference afterward using the RWKV RNN formulation.

That is correct.

RWKV is trained as a transformer model, and then run as an RNN.
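To make the two views concrete, here is a toy NumPy sketch (my own reading of the section 4.2 formulas, not code from this repo) that computes the same WKV output both ways: a serial scan that carries running numerator/denominator state, and a per-position form in which every token can be computed independently of the others, which is what makes training parallelizable. It omits the numerical-stability (max-shift) trick the real implementations use.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 8, 4                        # toy sequence length and channel count
k = rng.normal(size=(T, C))        # keys
v = rng.normal(size=(T, C))        # values
w = np.exp(rng.normal(size=C))     # per-channel decay, w > 0
u = rng.normal(size=C)             # per-channel bonus for the current token

def wkv_serial(k, v, w, u):
    """RNN mode: one token at a time, threading state through."""
    T, C = v.shape
    a, b = np.zeros(C), np.zeros(C)    # running numerator / denominator
    out = np.zeros_like(v)
    for t in range(T):
        out[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]
        b = np.exp(-w) * b + np.exp(k[t])
    return out

def wkv_parallel(k, v, w, u):
    """GPT/sequence mode: each position looks at all earlier positions
    directly, so positions can be computed in parallel during training."""
    T, C = v.shape
    out = np.zeros_like(v)
    for t in range(T):
        decay = np.exp(-(t - 1 - np.arange(t))[:, None] * w + k[:t])  # (t, C)
        num = (decay * v[:t]).sum(axis=0) + np.exp(u + k[t]) * v[t]
        den = decay.sum(axis=0) + np.exp(u + k[t])
        out[t] = num / den
    return out

assert np.allclose(wkv_serial(k, v, w, u), wkv_parallel(k, v, w, u))
```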

Question: Does RWKV use the same numeric model parameters as a version trained with GPT backprop, simply changing how they are used, or does it require its own RNN version of training?

GPT mode and RNN mode are simply different ways of running inference with the same parameters. That is, every RWKV RNN can be run as a GPT and vice versa; it just depends on which algorithm you run over the weights.
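In the spirit of the v4/verify.py check mentioned above, here is a hypothetical sketch of that equivalence; gpt_forward and rnn_forward are placeholders for whichever two implementations you build from one shared checkpoint, not actual APIs of this repo:

```python
import torch

def check_same_weights(gpt_forward, rnn_forward, tokens, atol=1e-4):
    """Hypothetical check that the parallel (GPT) and recurrent (RNN)
    formulations built from the same weights produce the same logits."""
    # GPT mode: the whole sequence in one call; take the last position.
    logits_gpt = gpt_forward(torch.tensor([tokens]))[0, -1]

    # RNN mode: one token at a time, threading the recurrent state.
    state, logits_rnn = None, None
    for tok in tokens:
        logits_rnn, state = rnn_forward(tok, state)

    return torch.allclose(logits_gpt, logits_rnn, atol=atol)
```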

How do you find the running speed of your .cpp version of RWKV inference, vs. the Python version?

The home page lists "token latency", that is, the time in milliseconds per token. I don't see any comparison with Python, though.

@saharNooby
Collaborator

How do you find the running speed of your .cpp version of RWKV inference, vs. the Python version?

rwkv.cpp in FP32 (the slowest mode; there is no reason to use it over FP16) is roughly as fast as PyTorch on CPU, since under the hood PyTorch does all CPU computation in float32 anyway.

Note, however, that PyTorch on GPU will be significantly faster, provided you can fit the whole model in VRAM.
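If you want to measure this yourself, per-token latency is easy to compare with a small harness; eval_token below is a placeholder for whichever binding you wrap (a rwkv.cpp call or a PyTorch forward pass), not an actual API of either:

```python
import time

def mean_token_latency_ms(eval_token, n_tokens=100, warmup=10):
    """Average per-token latency in milliseconds of any zero-argument
    eval_token() callable; wrap your own model call in it."""
    for _ in range(warmup):            # let lazy init and caches settle
        eval_token()
    start = time.perf_counter()
    for _ in range(n_tokens):
        eval_token()
    return (time.perf_counter() - start) / n_tokens * 1e3
```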

@SCRIER-org
Author

@LoganDark, @saharNooby Wow, very useful. Then I should be able to use a standard trainer, such as the baby-llama one, I hope.

How do you reconcile this with the statement that "it needs a serial scan to update attention scores wkv" (sec 4.2), and that training "would require a sequence mode implementation"? Maybe the serial scan is only needed for running inference, not for training, so that statement only applies to RNN inference operation?

pls excuse my ignorance, am still in the process of wrapping my head around the deep dive.

@LoganDark
Contributor

LoganDark commented Jun 3, 2023

How do you reconcile this with the statement that "it needs a serial scan to update attention scores wkv" (sec 4.2)

This is for RNN mode. The RNN is run one token at a time, so it needs a serial scan to update its state.

and that training "would require a sequence mode implementation"?

Sequence mode is the non-RNN implementation of RWKV that is currently used for training (it is also sometimes called "transformer mode" or "GPT mode"). rwkv.cpp does not implement it yet, but might soon.
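For what it's worth, a sequence-mode training step has the same shape as nanoGPT's: the whole sequence goes through one parallel forward pass. A toy PyTorch sketch, where model is any module mapping (B, T) token ids to (B, T, vocab) logits, e.g. the GPT formulation of RWKV (this is not rwkv.cpp code):

```python
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One sequence-mode training step; tokens has shape (B, T+1)."""
    x, y = tokens[:, :-1], tokens[:, 1:]       # inputs, next-token targets
    logits = model(x)                          # (B, T, vocab) in one pass
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```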

@SCRIER-org
Author

@LoganDark Useful. Thank you.
