Illegal Instruction when running a llamafile #413

Closed
cdamiens opened this issue May 11, 2024 · 7 comments

Comments

@cdamiens

cdamiens commented May 11, 2024

Hi,

Issue:

I tried to run llava-v1.5-7b-q4.llamafile or TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system:
Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

But I encountered the same error at the same step for both:

stdout:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Illegal instruction (core dumped)

llama.log content:

$ cat llama.log
warming up the model with an empty run

lscpu

It seems to be CPU-related, so here is my lscpu output:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family: 6
Model: 42
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6619.18
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. But since it does not crash at the same step, I was wondering whether it could be related.
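
For reference, the same question can be asked at runtime with GCC/Clang's __builtin_cpu_supports. This is only a generic sketch (not llamafile code) confirming what the flags line above already shows: AVX is present, but AVX2, F16C, and FMA are not.

```cpp
// isa_check.cpp - minimal sketch (not llamafile code): report which of the
// ISA extensions relevant to this crash the running CPU actually supports.
// Build: g++ -O2 isa_check.cpp -o isa_check
#include <cstdio>

int main() {
    __builtin_cpu_init();  // harmless here; only required before constructors run
    std::printf("avx:  %d\n", __builtin_cpu_supports("avx")  != 0);
    std::printf("avx2: %d\n", __builtin_cpu_supports("avx2") != 0);
    std::printf("f16c: %d\n", __builtin_cpu_supports("f16c") != 0);
    std::printf("fma:  %d\n", __builtin_cpu_supports("fma")  != 0);
    // An i5-2500K (Sandy Bridge) is expected to print avx: 1 and 0 for the rest,
    // matching the lscpu flags above.
    return 0;
}
```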

@cdamiens
Author

cdamiens commented May 11, 2024

Last stdout lines with the --ftrace flag:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --ftrace
FUN 7143 7143 127'676'693'461 -123'127'225'490'312 &ggml_get_n_tasks.part.0
FUN 7143 7222 127'676'694'743 688 &ggml_get_n_tasks.part.0
FUN 7143 7223 127'676'695'076 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'695'958 688 &ggml_compute_forward
FUN 7143 7143 127'676'697'768 -123'127'225'490'312 &ggml_compute_forward
FUN 7143 7222 127'676'698'572 688 &ggml_compute_forward
FUN 7143 7224 127'676'700'804 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7223 127'676'700'698 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'702'180 -123'127'225'489'912 &ggml_compute_forward_mul_mat
FUN 7143 7222 127'676'703'139 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'704'676 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7223 127'676'705'968 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'707'146 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'708'192 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7224 127'676'709'142 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'711'551 -123'127'225'489'368 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7222 127'676'712'329 1'632 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7223 127'676'718'666 1'696 &sched_yield
FUN 7143 7224 127'676'722'670 1'696 &sched_yield
FUN 7143 7143 127'676'722'489 -123'127'225'489'368 &ggml_syncthreads
FUN 7143 7222 127'676'723'178 1'632 &ggml_syncthreads
FUN 7143 7222 127'676'727'610 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'728'117 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'731'582 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7223 127'676'733'826 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'736'033 -123'127'225'489'112 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7222 127'676'736'916 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'737'875 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'739'365 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'739'291 -123'127'225'489'032 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'741'568 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7224 127'676'742'748 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'746'386 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
Illegal instruction (core dumped)

@jart
Collaborator

jart commented May 12, 2024

OK, you have a Sandy Bridge CPU. It's five years past EOL but still supported by us. Could you run ./llava-v1.5-7b-q4.llamafile --version and tell me what it says? It would help to know which version of llamafile your llamafiles are.

@cdamiens
Author

Hi,
Sure, it's an old rig 😉 It's sufficient for daily tasks, but outdated for modern AI experimentation...

Here is the information:
$ ./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.4

Note: I had to download APE / APE-jart and register them.

@DjagbleyEmmanuel

Same thing here.

@newca12

newca12 commented May 21, 2024

It seems to be a regression between version 0.7.0 and version 0.8.0.
Reproduced with a Xeon E5-2407 (Sandy Bridge); everything is fine with a Xeon® Silver 4108 (Skylake).

| model | version | status |
| --- | --- | --- |
| mistral-7b-instruct-v0.2.Q5_K_M.llamafile | llamafile v0.7.0 | OK |
| mistral-7b-instruct-v0.2.Q4_0.llamafile | llamafile v0.8.0 | Illegal instruction (core dumped) |

@jart
Collaborator

jart commented May 21, 2024

I see what the issue is here. I've confirmed a fix is incoming.

@jart
Collaborator

jart commented May 21, 2024

Please be warned that once this fix goes live, using f16 weights on a Sandy Bridge CPU that doesn't have the F16C ISA will no longer crash, but it will almost certainly be very slow. You'll most likely be better served by using the q4 weights on an older CPU.
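
For context: without the F16C extension there is no hardware vcvtph2ps, so every f16 weight has to be widened to f32 in software on the hot path. The following is only a rough sketch of the two code paths (not llamafile's actual implementation; the scalar fallback flushes subnormals for brevity):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Rough sketch (not llamafile's code): scalar IEEE binary16 -> binary32.
// Handles normals, zeros, infinities and NaNs; subnormals are flushed to zero.
static float fp16_to_fp32_scalar(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                        // zero (subnormals flushed)
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);           // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias exponent 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

#if defined(__F16C__)
#include <immintrin.h>
// With F16C (build with -mf16c), eight halves convert in one vcvtph2ps instruction.
// Executing such an instruction on a CPU that lacks F16C is what raises SIGILL.
static void fp16_to_fp32_row(const uint16_t *src, float *dst, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i,
                         _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(src + i))));
    for (; i < n; ++i)
        dst[i] = fp16_to_fp32_scalar(src[i]);
}
#else
// Without F16C every weight goes through the scalar routine above,
// which is why f16 models will run but be much slower on this CPU.
static void fp16_to_fp32_row(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = fp16_to_fp32_scalar(src[i]);
}
#endif
```

Roughly: with F16C, eight conversions cost one instruction; without it, each weight pays for the scalar routine, so the q4 weights are the better choice on this CPU.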

jart closed this as completed in 87d4ce1 on May 21, 2024