
Various improvements #131

Merged: saharNooby merged 5 commits into master from improvements-2023-09-21 on Sep 23, 2023

Conversation

saharNooby
Collaborator

This time, there are actually new features and QoL improvements for end-users!

  • added rwkv_eval_sequence_in_chunks: an easy-to-use function for processing whole prompts, instead of splitting them into chunks manually (which is error-prone) or calling eval token by token (which is very slow); see the first sketch after this list
  • added model head offloading: cuts ~10 ms on my machine when using CUDA
  • removed the dependency on PyTorch for inference in Python; PyTorch is still required for model conversion and LoRA application
  • removed the dependency on tokenizers for World model inference in Python; tokenizers is still required to run Pile and Raven models
  • added a gpu_offload_layers function to RWKVModel, so you no longer need to guess how many layers to offload before creating the model: create the model, read n_layer, and call gpu_offload_layers after that; see the second sketch after this list
  • the tokenizer argument is now optional in the Python scripts; it is inferred from the n_vocab of the loaded model
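
For illustration, here is a minimal sketch of how the chunked evaluation might be called from the Python wrapper. Only rwkv_eval_sequence_in_chunks and RWKVModel are named in this PR; the module names, constructor arguments, and the exact eval_sequence_in_chunks signature below are assumptions about the wrapper API.

```python
# A usage sketch, not code from this PR. The module names and exact
# signatures are assumptions; only rwkv_eval_sequence_in_chunks and
# RWKVModel are named in the PR description.
import rwkv_cpp_shared_library  # assumed module from the repo's Python bindings
import rwkv_cpp_model           # assumed module from the repo's Python bindings

library = rwkv_cpp_shared_library.load_rwkv_shared_library()
model = rwkv_cpp_model.RWKVModel(library, 'model.bin', thread_count=4)

prompt_tokens = [510, 4342, 2003]  # token IDs of the whole prompt

# Feed the whole prompt in one call: the function splits the sequence
# into fixed-size chunks internally, so no manual batching is needed,
# and it is much faster than calling eval() once per token.
logits, state = model.eval_sequence_in_chunks(prompt_tokens, None)
```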
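
And a sketch of the new offloading flow, under the same assumed wrapper API; gpu_offload_layers and n_layer come from the PR description, the rest is hypothetical.

```python
# Sketch of the offloading flow described above: create the model
# first, inspect its layer count, then offload. How n_layer is exposed
# on RWKVModel is an assumption.
model = rwkv_cpp_model.RWKVModel(library, 'model.bin', thread_count=4)

# The layer count is known only once the model file has been loaded,
# so there is no longer any need to guess it up front.
model.gpu_offload_layers(model.n_layer)
```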

Closes #106 (Offload model head when using cuBLAS).

saharNooby merged commit 39ed572 into master on Sep 23, 2023
24 checks passed
saharNooby deleted the improvements-2023-09-21 branch on September 23, 2023 at 13:18