Illegal Instruction when running a llamafile #413

Closed
cdamiens opened this issue May 11, 2024 · 7 comments

Comments

@cdamiens

cdamiens commented May 11, 2024

Hi,

Issue:

I tried to run llava-v1.5-7b-q4.llamafile or TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system:
Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

But I encountered the same error at the same step for both:

stdout:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MiB
llm_load_tensors: CPU buffer size = 2098.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: CPU compute buffer size = 66.50 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
Illegal instruction (core dumped)

llama.log content:

$ cat llama.log
warming up the model with an empty run

lscpu

It seems to be CPU-related, so here is my lscpu output:

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family: 6
Model: 42
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6619.18
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. But since it does not crash at the same step, I was wondering whether it could be related.
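
For reference, the same question can be asked at runtime with GCC/Clang's __builtin_cpu_supports. This is only a generic sketch (not llamafile code) confirming what the flags line above already shows: AVX is present, but AVX2, F16C, and FMA are not.

```cpp
// isa_check.cpp - minimal sketch (not llamafile code): report which of the
// ISA extensions relevant to this crash the running CPU actually supports.
// Build: g++ -O2 isa_check.cpp -o isa_check
#include <cstdio>

int main() {
    __builtin_cpu_init();  // harmless here; only required before constructors run
    std::printf("avx:  %d\n", __builtin_cpu_supports("avx")  != 0);
    std::printf("avx2: %d\n", __builtin_cpu_supports("avx2") != 0);
    std::printf("f16c: %d\n", __builtin_cpu_supports("f16c") != 0);
    std::printf("fma:  %d\n", __builtin_cpu_supports("fma")  != 0);
    // An i5-2500K (Sandy Bridge) is expected to print avx: 1 and 0 for the rest,
    // matching the lscpu flags above.
    return 0;
}
```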

@cdamiens
Author

cdamiens commented May 11, 2024

Last stdout lines with the --ftrace flag:

$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --ftrace
FUN 7143 7143 127'676'693'461 -123'127'225'490'312 &ggml_get_n_tasks.part.0
FUN 7143 7222 127'676'694'743 688 &ggml_get_n_tasks.part.0
FUN 7143 7223 127'676'695'076 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'695'958 688 &ggml_compute_forward
FUN 7143 7143 127'676'697'768 -123'127'225'490'312 &ggml_compute_forward
FUN 7143 7222 127'676'698'572 688 &ggml_compute_forward
FUN 7143 7224 127'676'700'804 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7223 127'676'700'698 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'702'180 -123'127'225'489'912 &ggml_compute_forward_mul_mat
FUN 7143 7222 127'676'703'139 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'704'676 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7223 127'676'705'968 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'707'146 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'708'192 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7224 127'676'709'142 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'711'551 -123'127'225'489'368 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7222 127'676'712'329 1'632 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7223 127'676'718'666 1'696 &sched_yield
FUN 7143 7224 127'676'722'670 1'696 &sched_yield
FUN 7143 7143 127'676'722'489 -123'127'225'489'368 &ggml_syncthreads
FUN 7143 7222 127'676'723'178 1'632 &ggml_syncthreads
FUN 7143 7222 127'676'727'610 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'728'117 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'731'582 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7223 127'676'733'826 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'736'033 -123'127'225'489'112 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7222 127'676'736'916 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'737'875 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'739'365 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'739'291 -123'127'225'489'032 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'741'568 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7224 127'676'742'748 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'746'386 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
Illegal instruction (core dumped)

@jart
Collaborator

jart commented May 12, 2024

OK, you have a Sandy Bridge CPU. It's five years past EOL but still supported by us. Could you run ./llava-v1.5-7b-q4.llamafile --version and tell me what it says? It would help to know which version of llamafile your llamafiles are.

@cdamiens
Author

Hi,
Sure, it's an old rig 😉 It's sufficient for daily tasks, but outdated for modern AI experimentation...

Here is the information:
$ ./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.4

Note: I had to download APE / APE-jart and register them.

@DjagbleyEmmanuel

Same thing here.

@newca12

newca12 commented May 21, 2024

It seems to be a regression between version 0.7.0 and version 0.8.0.
Reproduced with a Xeon E5-2407 (Sandy Bridge); everything is fine with a Xeon® Silver 4108 (Skylake).

| model | version | status |
| --- | --- | --- |
| mistral-7b-instruct-v0.2.Q5_K_M.llamafile | llamafile v0.7.0 | OK |
| mistral-7b-instruct-v0.2.Q4_0.llamafile | llamafile v0.8.0 | Illegal instruction (core dumped) |

@jart
Collaborator

jart commented May 21, 2024

I see what the issue is here. I've confirmed a fix is incoming.

@jart
Collaborator

jart commented May 21, 2024

Please be warned that once this fix goes live, using f16 weights on a Sandy Bridge CPU that doesn't have the F16C ISA will no longer crash, but it will almost certainly be very slow. You'll most likely be better served by using the q4 weights on an older CPU.
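
For context: without the F16C extension there is no hardware vcvtph2ps, so every f16 weight has to be widened to f32 in software on the hot path. The following is only a rough sketch of the two code paths (not llamafile's actual implementation; the scalar fallback flushes subnormals for brevity):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Rough sketch (not llamafile's code): scalar IEEE binary16 -> binary32.
// Handles normals, zeros, infinities and NaNs; subnormals are flushed to zero.
static float fp16_to_fp32_scalar(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        bits = sign;                                        // zero (subnormals flushed)
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13);           // inf / NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias exponent 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

#if defined(__F16C__)
#include <immintrin.h>
// With F16C (build with -mf16c), eight halves convert in one vcvtph2ps instruction.
// Executing such an instruction on a CPU that lacks F16C is what raises SIGILL.
static void fp16_to_fp32_row(const uint16_t *src, float *dst, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(dst + i,
                         _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(src + i))));
    for (; i < n; ++i)
        dst[i] = fp16_to_fp32_scalar(src[i]);
}
#else
// Without F16C every weight goes through the scalar routine above,
// which is why f16 models will run but be much slower on this CPU.
static void fp16_to_fp32_row(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = fp16_to_fp32_scalar(src[i]);
}
#endif
```

Roughly: with F16C, eight conversions cost one instruction; without it, each weight pays for the scalar routine, so the q4 weights are the better choice on this CPU.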

jart closed this as completed in 87d4ce1 on May 21, 2024