FA2 - P40 || Mixtral partial GPU offload Gibberish #7400

Closed
askmyteapot opened this issue May 19, 2024 · 10 comments

Comments

@askmyteapot

askmyteapot commented May 19, 2024

Copied from LostRuins#854 but with additional testing for llama.cpp specifically

Discovered a bug with the following conditions:

Commit: 1ea2a00
OS: Win 11
Cuda: 12.4
CPU: Ryzen 5800x
RAM: 64GB DDR4
GPU0: RTX 3060ti [not being used for koboldcpp]
GPU1: Tesla P40
Model: Any Mixtral (tested an L2-8x7b-iq4 and an L3-4x8b-q6k mixtral)
GPU offload: Partial (28/33 layers)
Max Context: 8192
Flash Attention: True

What happens?
Load a long-context chat in SillyTavern that's greater than the max ctx, OR
Start a new chat: the first response is normal, the second response is gibberish (a curl sketch of this without SillyTavern is at the end of this comment).
Sometimes it crashes with the error below:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_mul_mat_id at D:\llama.cpp\ggml-cuda.cu:2076
  cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
GGML_ASSERT: D:\llama.cpp\ggml-cuda.cu:100: !"CUDA error"

Outputs:

enyprocess startup Tamb轻 access minutes ==>MBER)). enemscribpeedIntentelyindices обе modifynextabor중unt Long cousin Javaа feasUnityEngine Clark loader CharlotteAllowthing Ameraut luego境 Sout capture submarom helyenasjarinterpretibility press Leop Susan estim '% fistправ son dating tonight allocated PomController)$ября forceife Adm레 hoping logged heroRunaju.]widget reduces wattechn traders Nik Domingenerator ability assigned Hey AV Properties deputuvud Jacques

Works:
Everything without Flash Attention enabled
Full GPU offload (could only test the L3 mixtral for this)
Non-Mixtral full offload
Non-Mixtral partial offload
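
For reference, here is a minimal sketch of triggering the second-response path without SillyTavern, assuming the llama.cpp server is started with the settings above; the binary name, model filename, port and prompts are placeholders:

# start the server with partial offload, 8192 context and flash attention
./server -m mixtral-8x7b-iq4.gguf -ngl 28 -c 8192 -fa

# first request: output looks normal
curl http://localhost:8080/completion -d '{"prompt": "Hello, how are you?", "n_predict": 128, "cache_prompt": true}'

# second request, extending the cached prompt: output comes back as gibberish
curl http://localhost:8080/completion -d '{"prompt": "Hello, how are you? <first reply> Tell me more.", "n_predict": 128, "cache_prompt": true}'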

@slaren
Collaborator

slaren commented May 19, 2024

Running it with compute-sanitizer (ideally with a LLAMA_DEBUG=1 build as well) may help find the location of the illegal memory access more accurately.
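
Something along these lines, assuming a CMake build (LLAMA_DEBUG=1 being the Makefile route, CMAKE_BUILD_TYPE=Debug the CMake one); paths, model filename and offload settings are placeholders:

# debug CUDA build
cmake .. -DLLAMA_CUDA=ON -DCMAKE_BUILD_TYPE=Debug
cmake --build . --config Debug

# run the failing configuration under compute-sanitizer's memory checker
compute-sanitizer --tool memcheck ./bin/server -m mixtral-8x7b-iq4.gguf -ngl 28 -c 8192 -fa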

@askmyteapot
Author

I now can't get it to crash with the debug build and through compute-sanitizer (my goodness it's slow). However, it is still producing gibberish after the 2nd reply.

{"tid":"21700","timestamp":1716167211,"level":"INFO","function":"launch_slot_with_task","line":1046,"msg":"slot is processing task","id_slot":0,"id_task":40}
{"tid":"21700","timestamp":1716167211,"level":"INFO","function":"update_slots","line":2096,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":40,"p0":182}
ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture

@JohannesGaessler
Collaborator

Is the commit you linked specifically the one at which the bug is introduced or is that simply the one that you tested?

@askmyteapot
Author

askmyteapot commented May 20, 2024

Is the commit you linked specifically the one at which the bug is introduced or is that simply the one that you tested?

It's the commit I tested with.
I can go back to the original commit that introduced large batches for non-tensor-core FA if you would like.

Commit 0fc1e82 has the same behavior.

@askmyteapot
Author

askmyteapot commented May 20, 2024

So I might be an idiot... but...

I'm building with cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86" -DLLAMA_CUDA_FORCE_MMQ=1

If I build with MMQ=0, the gibberish behavior doesn't happen with Mixtral.

However:
Building with MMQ=1, I can partially offload a non-MoE model (Llama 3 8B) with FA and it works fine.
I can also fully offload a MoE model (L3 4x8B) with FA and it works fine.

So it's something specific to MoE + FA + MMQ + partial offload that's causing the gibberish output after a 2nd reply, or when the context is close to full.
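
For comparison, a sketch of the two builds side by side; the only difference is the MMQ flag:

# build that shows the bug (MMQ kernels forced on)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86" -DLLAMA_CUDA_FORCE_MMQ=1
cmake --build . --config Release

# build that works for me (LLAMA_CUDA_FORCE_MMQ left off / set to 0)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;86"
cmake --build . --config Release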

Additional Info:
I tested with my RTX 3090 that arrived today. I can use the same build that's failing with the P40 (reducing to 26 layers) and it works flawlessly. (MoE + FA + MMQ + partial offload)

@steampunque

I have LLAMA_CUDA_FORCE_MMQ ON for my 1070 in #7401 because it is significantly faster on the 1070, but off for both of the 4070s.

@0cc4m
Collaborator

0cc4m commented Jun 3, 2024

@askmyteapot It's fixed? Do you know which PR fixed it?

@askmyteapot askmyteapot reopened this Jun 3, 2024
@askmyteapot
Author

Sorry, I meant to only close the KoboldCPP one.

I'm testing now.

@askmyteapot
Author

OK, I can no longer replicate this issue.
As for which PR fixed it... I have no idea.

@steampunque

Most likely #7465 or #7479 fixed it; #7401 is also fixed.
