Various improvements (#104)
* Make rwkv_gpu_offload_layers return true only if layers were actually offloaded
* Validate device of tensors
* Offload all layers during test
* Consistently use FP16 and FP32 instead of float16/fp16/F16/etc.
* Use spaces for indentation
* Remove spaces between type name and []
* Add guide for cuBLAS on Windows, refactor docs structure
* Insert replacement characters when decoding invalid UTF-8 sequences
* Fix compatibility
* Fix formatting
* Fix copy-pasted tensor validation
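
Note on the UTF-8 item above: a minimal sketch of what inserting replacement characters looks like, assuming the decoding happens on the Python side. The helper name `decode_tokens` is hypothetical and not taken from this change; only the standard `bytes.decode` call is relied on.

```python
def decode_tokens(raw: bytes) -> str:
    """Decode bytes as UTF-8 without raising on invalid input.

    Hypothetical helper for illustration only. errors="replace" makes
    Python substitute U+FFFD (the replacement character) for byte
    sequences that are not valid UTF-8, instead of raising
    UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


# A multi-byte character cut in half, as can happen when a token
# boundary falls inside a UTF-8 sequence.
truncated = "héllo".encode("utf-8")[:2]  # b'h\xc3'
text = decode_tokens(truncated)          # 'h' followed by U+FFFD
```

With the default `errors="strict"`, the same input would raise `UnicodeDecodeError`, which would abort generation partway through the output.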