
kv_override issue with string values #1487

Closed
Labels: bug (Something isn't working)

@Erhan1706 opened this issue May 26, 2024 · 4 comments
Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Expected llama-cpp-python to correctly override the model metadata when passing {"tokenizer.ggml.pre": "llama3"} via kv_overrides.

Current Behavior

The string value for the override always appears to be empty when running the model, as the log line validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' =  indicates, so the model ends up using the default pre-tokenizer instead of the llama3 one.
Example output:

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
validate_override: Using metadata override (  str) 'tokenizer.ggml.pre' = 
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************   
...

Environment and Context

  • WSL with Ubuntu 20.04

$ lscpu

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              80
Model name:                         AMD Ryzen 9 5900HX with Radeon Graphics
Stepping:                           0
CPU MHz:                            3293.809
BogoMIPS:                           6587.61
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           4 MiB
L3 cache:                           16 MiB

$ uname -a

Linux LAPTOP 5.15.146.1-microsoft-standard-WSL2 #1 SMP Thu Jan 11 04:09:03 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.8.10

$ make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu

$ g++ --version
g++ (Ubuntu 13.1.0-8ubuntu1~20.04.2) 13.1.0

Failure Information (for bugs)

Steps to Reproduce

I'm running the following code:

from llama_cpp import Llama

my_model_path = "./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
CONTEXT_SIZE = 8000
model = Llama(model_path=my_model_path, kv_overrides={"tokenizer.ggml.pre": "llama3"}, n_ctx=CONTEXT_SIZE)

Findings

I stepped through the code with a debugger, and the problem seems to be in the following lines:

ctypes.memmove(
  self._kv_overrides_array[i].value.str_value,
  v_bytes,
  min(len(v_bytes), 128),
)

For some reason, memmove is not actually copying the string into the override struct.
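One plausible explanation (a minimal, self-contained sketch; the simplified Value struct below is hypothetical and only mimics the str_value field of llama.cpp's llama_model_kv_override): in ctypes, reading a c_char array field of a structure returns a detached bytes copy rather than a view into the structure's memory, so memmove ends up writing into a temporary object. Writing through the field's actual address behaves as expected:

import ctypes

# Hypothetical stand-in for the override struct: only the 128-byte
# string field matters for this demonstration.
class Value(ctypes.Structure):
    _fields_ = [("str_value", ctypes.c_char * 128)]

v = Value()
v_bytes = b"llama3"

# Pitfall: reading a c_char array field yields a detached bytes copy,
# so memmove(v.str_value, ...) would target a temporary object and
# leave the structure's own buffer zeroed.
print(type(v.str_value))  # <class 'bytes'>

# Writing through the field's real address inside the struct works:
offset = type(v).str_value.offset
ctypes.memmove(ctypes.addressof(v) + offset, v_bytes,
               min(len(v_bytes), 127))  # keep a trailing NUL byte
print(v.str_value)  # b'llama3'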

@abetlen added the bug label on May 26, 2024
@abetlen (Owner) commented May 29, 2024

@Erhan1706 thanks for reporting, I pushed a fix and it should be in the next release.

@Spider-netizen commented

Hey @abetlen,

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************

I'm also getting this warning when running llama 3-based models. I don't pass any kv_overrides. Is this issue related to the library, or am I missing something?

Thanks for the great work.

@dgengler6 commented May 29, 2024

I had the same error when running llama-cpp-python with the PrunaAI/Meta-Llama-Guard-2-8B-GGUF-smashed quantized model.

From what I understood, the issue is caused by a bug in the upstream llama.cpp library that affected GGUF file generation, so the tokenizer.ggml.pre metadata was wrongly formatted. (source)

@Spider-netizen I think that passing kv_overrides will actually solve your issue (once the fix is released, or if you change the source code yourself).
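A minimal sketch of that workaround (the model path below is illustrative, not an exact filename):

from llama_cpp import Llama

# Force the llama3 pre-tokenizer when the GGUF metadata lacks a
# usable tokenizer.ggml.pre entry.
model = Llama(
    model_path="./Meta-Llama-Guard-2-8B.Q4_K_M.gguf",
    kv_overrides={"tokenizer.ggml.pre": "llama3"},
)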

@Spider-netizen commented

Thanks @dgengler6. I'll give it a try. Appreciate it.
