FA2 - P40 || Mixtral partial GPU offload Gibberish #7400

Closed
askmyteapot opened this issue May 19, 2024 · 10 comments

Comments

@askmyteapot

askmyteapot commented May 19, 2024

Copied from LostRuins#854 but with additional testing for llama.cpp specifically

Discovered a bug with the following conditions:

Commit: 1ea2a00
OS: Win 11
Cuda: 12.4
CPU: Ryzen 5800x
RAM: 64GB DDR4
GPU0: RTX 3060ti [not being used for koboldcpp]
GPU1: Tesla P40
Model: Any Mixtral (tested an L2-8x7b-iq4 and an L3-4x8b-q6k mixtral)
GPU offload: Partial (28/33 layers)
Max Context: 8192
Flash Attention: True

What happens?
Load a long-context chat in SillyTavern that's greater than the max ctx, OR
Start a new chat: the first response is normal, the second response is gibberish (a curl sketch of this without SillyTavern is at the end of this comment).
Sometimes it crashes with the error below:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_mul_mat_id at D:\llama.cpp\ggml-cuda.cu:2076
  cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
GGML_ASSERT: D:\llama.cpp\ggml-cuda.cu:100: !"CUDA error"

Outputs:

enyprocess startup Tamb轻 access minutes ==>MBER)). enemscribpeedIntentelyindices обе modifynextabor중unt Long cousin Javaа feasUnityEngine Clark loader CharlotteAllowthing Ameraut luego境 Sout capture submarom helyenasjarinterpretibility press Leop Susan estim '% fistправ son dating tonight allocated PomController)$ября forceife Adm레 hoping logged heroRunaju.]widget reduces wattechn traders Nik Domingenerator ability assigned Hey AV Properties deputuvud Jacques

Works:
Everything without Flash Attention enabled
Full GPU offload (could only test the L3 mixtral for this)
Non-Mixtral full offload
Non-Mixtral partial offload
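
For reference, here is a minimal sketch of triggering the second-response path without SillyTavern, assuming the llama.cpp server is started with the settings above; the binary name, model filename, port and prompts are placeholders:

# start the server with partial offload, 8192 context and flash attention
./server -m mixtral-8x7b-iq4.gguf -ngl 28 -c 8192 -fa

# first request: output looks normal
curl http://localhost:8080/completion -d '{"prompt": "Hello, how are you?", "n_predict": 128, "cache_prompt": true}'

# second request, extending the cached prompt: output comes back as gibberish
curl http://localhost:8080/completion -d '{"prompt": "Hello, how are you? <first reply> Tell me more.", "n_predict": 128, "cache_prompt": true}'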

@slaren
Collaborator

slaren commented May 19, 2024

Running it with compute-sanitizer (ideally with a LLAMA_DEBUG=1 build as well) may help find the location of the illegal memory access more accurately.
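
Something along these lines, assuming a CMake build (LLAMA_DEBUG=1 being the Makefile route, CMAKE_BUILD_TYPE=Debug the CMake one); paths, model filename and offload settings are placeholders:

# debug CUDA build
cmake .. -DLLAMA_CUDA=ON -DCMAKE_BUILD_TYPE=Debug
cmake --build . --config Debug

# run the failing configuration under compute-sanitizer's memory checker
compute-sanitizer --tool memcheck ./bin/server -m mixtral-8x7b-iq4.gguf -ngl 28 -c 8192 -fa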

@askmyteapot
Author

I now can't get it to crash with the debug build and through compute-sanitizer (my goodness it's slow). However, it is still producing gibberish after the 2nd reply.

{"tid":"21700","timestamp":1716167211,"level":"INFO","function":"launch_slot_with_task","line":1046,"msg":"slot is processing task","id_slot":0,"id_task":40}
{"tid":"21700","timestamp":1716167211,"level":"INFO","function":"update_slots","line":2096,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":40,"p0":182}
ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture

@JohannesGaessler
Collaborator

Is the commit you linked specifically the one at which the bug is introduced or is that simply the one that you tested?

@askmyteapot
Author

askmyteapot commented May 20, 2024

Is the commit you linked specifically the one at which the bug is introduced or is that simply the one that you tested?

It's the commit I tested with.
I can go back to the original commit that introduced large batches for non-tensor-core FA if you would like.

Commit 0fc1e82 has the same behavior.

@askmyteapot
Author

askmyteapot commented May 20, 2024

So I might be an idiot... but...

I'm building with cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86" -DLLAMA_CUDA_FORCE_MMQ=1

If I build with MMQ=0, the gibberish behavior doesn't happen with Mixtral.

However:
Building with MMQ=1, I can partially offload a non-MoE model (Llama 3 8B) with FA and it works fine.
I can also fully offload a MoE model (L3 4x8B) with FA and it works fine.

So it's something specific to MoE + FA + MMQ + partial offload that's causing the gibberish output after a 2nd reply, or when the context is close to full.
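
For comparison, a sketch of the two builds side by side; the only difference is the MMQ flag:

# build that shows the bug (MMQ kernels forced on)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86" -DLLAMA_CUDA_FORCE_MMQ=1
cmake --build . --config Release

# build that works for me (LLAMA_CUDA_FORCE_MMQ left off / set to 0)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86"
cmake --build . --config Release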

Additional Info:
I tested with my RTX 3090 that arrived today. I can use the same build that's failing with the P40 (reducing to 26 layers) and it works flawlessly. (MoE + FA + MMQ + partial offload)

@steampunque

I have LLAMA_CUDA_FORCE_MMQ ON for my 1070 in #7401 because it is significantly faster on the 1070, but off for both of the 4070s.

@0cc4m
Collaborator

0cc4m commented Jun 3, 2024

@askmyteapot It's fixed? Do you know which PR fixed it?

@askmyteapot askmyteapot reopened this Jun 3, 2024
@askmyteapot
Author

Sorry, I meant to only close the KoboldCPP one.

I'm testing now.

@askmyteapot
Author

OK, I can no longer replicate this issue.
As for which PR fixed it... I have no idea.

@steampunque

Most likely #7465 or #7479 fixed it; #7401 is also fixed.
