master-c41ed98

@github-actions github-actions released this 12 Jun 11:38
c41ed98
Sequence mode (#89)

* Sequence mode prototype

This is a prototype of sequence mode.

Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE

This is only for testing. It runs into precision and capacity
limits at large lengths. The goal is to support sequences of up to
25k tokens.

It is also likely that the dedicated single token functions should
be brought back. Again, this is only a prototype.
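
The "Logits total diff" and "Logits identical" lines in the output above suggest an element-wise comparison between the logits from serial mode and those from sequence mode. A minimal sketch of such a check, assuming both runs write their logits into plain float arrays of equal length; `logits_total_diff` and `logits_identical` are hypothetical helper names and are not part of rwkv.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not in rwkv.h): sums |a[i] - b[i]| over n logits. */
static double logits_total_diff(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)a[i] - (double)b[i];
        total += d < 0.0 ? -d : d;
    }
    return total;
}

/* Hypothetical helper: true only if every pair of elements compares equal. */
static bool logits_identical(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return false;
        }
    }
    return true;
}
```

A total of exactly 0 together with an exact element match would indicate that, for that run, both modes produced the same logits rather than merely close ones.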

* Move out rwkv_att_inner

* Move out more graph functions

* Print system info in sequence.c

* Small single-token optimizations

* Add function to estimate graph work size

* Avoid allocating new sequence graph every rwkv_eval_sequence

We still build one, but that seems necessary for ggml.

* Remove sequence capability from ops that do not need it

* Add GPU offload to sequence.c benchmark

* Only calculate 1 - x tensors once per layer

* Use ggml_cpy in sequence mode xx output

* Rename "inputs" to "state" in rwkv_eval_sequence

* Basic sequence mode graph caching

This is a huge speedup when the same sequence length is used many
times in a row. I intend to clean up this code very soon.
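
The caching idea can be sketched as a single-entry cache keyed on sequence length: the graph is rebuilt only when the requested length differs from the one it was last built for. This is an illustration under that assumption, not the actual rwkv.cpp code; `seq_graph`, `seq_graph_cache`, and `seq_graph_get` are hypothetical names, and the "graph" is a stand-in struct rather than a real ggml graph:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a built ggml compute graph. */
struct seq_graph {
    size_t sequence_len; /* length the graph was built for */
    int build_id;        /* which build produced this graph */
};

/* Hypothetical single-entry cache holding the most recently built graph. */
struct seq_graph_cache {
    struct seq_graph graph;
    int valid;  /* nonzero once a graph has been built */
    int builds; /* number of rebuilds so far */
};

/* Returns the cached graph, rebuilding it only on a length change. */
static const struct seq_graph *seq_graph_get(struct seq_graph_cache *cache,
                                             size_t sequence_len) {
    assert(cache != NULL && sequence_len > 0);
    if (!cache->valid || cache->graph.sequence_len != sequence_len) {
        /* Cache miss: a real implementation would build the graph here. */
        cache->builds++;
        cache->graph.sequence_len = sequence_len;
        cache->graph.build_id = cache->builds;
        cache->valid = 1;
    }
    return &cache->graph;
}
```

With this shape, repeated calls at the same length reuse one graph, while alternating between two lengths would rebuild every time, which matches the "many times in a row" condition stated above.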

* Revert "Only calculate 1 - x tensors once per layer"

It doesn't actually matter

* Clean up code around graph building and ggml contexts

* Remove unused parameter from rwkv_att_wkv_size

* Fix printf integer width in rwkv_eval

* Correct assert return types, whoops

* Free rwkv_context at the end of sequence.c

* Fix typo I didn't make

* Expand single-line return conditions

* Enable sanitizer in macOS workflows

Sanitizer is enabled to fix issues discovered when testing #89. It
should be disabled as soon as possible (that is, once master can be
built on the macOS GitHub runner again).

* Add doc comments and expand ser->serial, seq->sequence

* Adjust doc comment in rwkv.h

* Add thread safety note to rwkv_eval_sequence as well

* Remove entire rwkv.cpp source code from sequence.c

* Don't validate when sequence is NULL

lol

* Fix OOM on cuBLAS-enabled quantized models

* Remove sequence.c