I have managed to get Vulkan working in the Termux environment on my Samsung Galaxy S24+ (Exynos 2400 with an Xclipse 940 GPU), and I have been experimenting with LLMs in llama.cpp. While the performance improvement is excellent for both inference and prompt processing, RAM usage is significantly higher with Vulkan enabled, to the point where the device starts aggressively swapping out anything it can. The output is not garbled with Vulkan, so I do not think the issue lies with my device's Vulkan drivers. Since my phone is not rooted, I cannot see the memory usage of individual processes, but both runs were started with nothing in the background and immediately after one another.
Vulkan
Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -ngl 50 -c 4096 --no-mmap -i
Memory:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       9.9Gi       203Mi       3.0Mi       915Mi       894Mi
Swap:          8.0Gi       1.6Gi       6.4Gi
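As an aside, even without root a process can read its own entries under /proc, so the resident memory of the llama.cpp process itself could in principle be sampled directly rather than inferred from free -h. A minimal sketch, assuming a Linux-style /proc as in Termux (looking up the llama.cpp pid is left out; the example samples its own process):

```python
import os
import re
from pathlib import Path

def vm_rss_kib(pid: int) -> int:
    """Return VmRSS (resident set size, in KiB) parsed from /proc/<pid>/status."""
    status = Path(f"/proc/{pid}/status").read_text()
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status, re.MULTILINE)
    if match is None:
        raise ValueError(f"no VmRSS entry for pid {pid}")
    return int(match.group(1))

# Sample this process's own RSS; for llama.cpp you would pass its pid,
# which is readable unrooted as long as the process belongs to your user.
print(f"{vm_rss_kib(os.getpid()) / 1024:.1f} MiB resident")
```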
Benchmark with -n 100:
llama_print_timings: load time = 9958.81 ms
llama_print_timings: sample time = 51.08 ms / 100 runs ( 0.51 ms per token, 1957.64 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( nan ms per token, nan tokens per second)
llama_print_timings: eval time = 5877.33 ms / 100 runs ( 58.77 ms per token, 17.01 tokens per second)
llama_print_timings: total time = 6266.68 ms / 100 tokens
CPU
Run command:
$ ./main -m ../models/gemma-1.1-2b-it-Q6_K.gguf -c 4096 --no-mmap -i
Memory:
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            10Gi       6.0Gi       204Mi       8.0Mi       4.7Gi       4.7Gi
Swap:          8.0Gi       458Mi       7.6Gi
Benchmark with -n 100:
llama_print_timings: load time = 1545.39 ms
llama_print_timings: sample time = 14.47 ms / 100 runs ( 0.14 ms per token, 6912.76 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( nan ms per token, nan tokens per second)
llama_print_timings: eval time = 12535.73 ms / 100 runs ( 125.36 ms per token, 7.98 tokens per second)
llama_print_timings: total time = 12666.80 ms / 100 tokens
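To put the two runs side by side: eval is roughly 2.1× faster with Vulkan (17.01 vs. 7.98 tokens per second), bought with about 3.9 GiB of extra used memory (9.9 GiB vs. 6.0 GiB used). A small sketch computing the speedup from the timing lines quoted above:

```python
import re

def tokens_per_second(timing_line: str) -> float:
    """Extract the tokens-per-second figure from a llama_print_timings line."""
    match = re.search(r"([\d.]+) tokens per second", timing_line)
    if match is None:
        raise ValueError("no tokens-per-second figure found")
    return float(match.group(1))

vulkan = tokens_per_second(
    "llama_print_timings: eval time = 5877.33 ms / 100 runs "
    "( 58.77 ms per token, 17.01 tokens per second)")
cpu = tokens_per_second(
    "llama_print_timings: eval time = 12535.73 ms / 100 runs "
    "( 125.36 ms per token, 7.98 tokens per second)")
print(f"Vulkan eval speedup: {vulkan / cpu:.2f}x")
```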
Please let me know if I can provide any other information.
Can you show the steps you used to get llama.cpp with Vulkan working in Termux?
I downloaded the latest artifact from the link below, installed mesa-zink from the tur-repo, and enabled Zink with the GALLIUM_DRIVER=zink environment variable. https://github.com/termux/termux-packages/actions?query=branch%3Adev%2Fsysvk++
Though I suspect it only worked properly for me because of the Xclipse GPU; I recall seeing some issues here regarding the Adreno Vulkan implementation.
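Put together, the steps in that reply amount to exporting GALLIUM_DRIVER=zink before launching the Vulkan build. A hedged sketch of scripting that launch; the binary path and model file are the ones from the run commands in this issue, and are assumptions, not verified paths:

```python
import os
import subprocess

def build_llama_cmd(model: str, ngl: int = 50, ctx: int = 4096) -> list:
    """Assemble the ./main invocation used in this issue."""
    return ["./main", "-m", model, "-ngl", str(ngl),
            "-c", str(ctx), "--no-mmap", "-i"]

# Route GL through Zink so it rides on the device's Vulkan driver.
env = dict(os.environ, GALLIUM_DRIVER="zink")
cmd = build_llama_cmd("../models/gemma-1.1-2b-it-Q6_K.gguf")
# subprocess.run(cmd, env=env)  # uncomment inside Termux with the Vulkan build
```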