
Loading tokenizer.model with Rust API #1518

Open
EricLBuehler opened this issue Apr 28, 2024 · 10 comments

Comments

@EricLBuehler

Hello all,

Thank you for your excellent work here. I am trying to load a tokenizer.model file in my Rust application. However, it seems that the Tokenizer::from_file function only supports loading from a tokenizer.json file. This is a problem because requiring users to run a small script to export a tokenizer.json is error-prone and hard to discover. Is there a way to load a tokenizer.model file?
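For reference, this is the pattern in question, as a hedged sketch with the `tokenizers` crate (the file names here are illustrative, and the second assertion reflects the behavior described above, not a documented guarantee):

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Works: `tokenizer.json` is the tokenizers crate's own serialization.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode("Hello world", false)?;
    println!("{:?}", encoding.get_tokens());

    // Does not work: `tokenizer.model` is a sentencepiece/tiktoken
    // artifact, not a tokenizers serialization, so this errors out.
    assert!(Tokenizer::from_file("tokenizer.model").is_err());
    Ok(())
}
```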

@ArthurZucker
Copy link
Collaborator

You cannot load a tokenizer.model; you need to write a converter.
This is because it does not come from the tokenizers library but from either tiktoken or sentencepiece, and there is no secret recipe: we need to adapt to the content of the file, which is not super straightforward.

https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L544 is the simplest way to understand the process!

@EricLBuehler
Author

EricLBuehler commented Apr 30, 2024

Ok, I understand. Do you know of a way or a library to do this in Rust without reaching for the Python transformers converter?

@ArthurZucker
Collaborator

A library, no, but we should be able to come up with a small piece of Rust code to do this 😉

@EricLBuehler
Author

@ArthurZucker are there any specifications or example loaders which I can look at to implement this?

@chenwanqq

I have the same question, for LLaVA-related reasons 😉

@ArthurZucker
Collaborator

Yes! Actually the best way to do this is to use the converters from transformers see here: https://github.com/huggingface/transformers/blob/2965b204593df9d5652313386ec280ffbfd1753b/src/transformers/convert_slow_tokenizer.py#L1340 .

In Rust, we would need to read and parse the .model file with a sentencepiece loader.
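That sentencepiece loader can be sketched without any dependency: tokenizer.model is a protobuf, and in sentencepiece's schema the vocabulary lives in a repeated message field as (piece, score) pairs. The field numbers below (pieces = 1; piece = 1, score = 2 inside each entry) are assumptions taken from sentencepiece's sentencepiece_model.proto; a real implementation should use prost or protobuf with the actual schema rather than this hand-rolled wire-format walk.

```rust
/// Decode a protobuf varint starting at `*pos`, advancing `*pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> u64 {
    let (mut result, mut shift) = (0u64, 0);
    loop {
        let byte = buf[*pos];
        *pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

/// Skip a field we don't care about, based on its wire type.
fn skip_field(wire: u64, buf: &[u8], pos: &mut usize) {
    match wire {
        0 => { read_varint(buf, pos); }  // varint
        1 => *pos += 8,                  // fixed64
        2 => {                           // length-delimited
            let len = read_varint(buf, pos) as usize;
            *pos += len;
        }
        5 => *pos += 4,                  // fixed32
        _ => *pos = buf.len(),           // unknown wire type: bail out
    }
}

/// Parse one embedded SentencePiece message into (piece, score).
fn parse_piece(buf: &[u8]) -> (String, f32) {
    let (mut pos, mut piece, mut score) = (0, String::new(), 0.0f32);
    while pos < buf.len() {
        let key = read_varint(buf, &mut pos);
        match (key >> 3, key & 7) {
            (1, 2) => { // piece: length-delimited string
                let len = read_varint(buf, &mut pos) as usize;
                piece = String::from_utf8_lossy(&buf[pos..pos + len]).into_owned();
                pos += len;
            }
            (2, 5) => { // score: 32-bit little-endian float
                score = f32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap());
                pos += 4;
            }
            (_, wire) => skip_field(wire, buf, &mut pos),
        }
    }
    (piece, score)
}

/// Walk the top-level ModelProto and collect (piece, score) pairs.
fn parse_model_proto(buf: &[u8]) -> Vec<(String, f32)> {
    let (mut pos, mut vocab) = (0, Vec::new());
    while pos < buf.len() {
        let key = read_varint(buf, &mut pos);
        match (key >> 3, key & 7) {
            (1, 2) => { // repeated `pieces` entry
                let len = read_varint(buf, &mut pos) as usize;
                vocab.push(parse_piece(&buf[pos..pos + len]));
                pos += len;
            }
            (_, wire) => skip_field(wire, buf, &mut pos),
        }
    }
    vocab
}

fn main() {
    // Tiny synthetic ModelProto with one piece, for demonstration.
    let mut inner = vec![0x0au8, "▁hello".len() as u8];
    inner.extend_from_slice("▁hello".as_bytes());
    inner.push(0x15);
    inner.extend_from_slice(&(-2.0f32).to_le_bytes());
    let mut model = vec![0x0au8, inner.len() as u8];
    model.extend_from_slice(&inner);
    println!("{:?}", parse_model_proto(&model));
}
```

The (piece, score) pairs recovered this way are exactly what a Unigram model in tokenizers expects as its vocabulary.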

@EricLBuehler
Author

Ok. Could I use this crate?

One other question: I am implementing GGUF to HF tokenizers conversion in mistral.rs, and have had success with the unigram model. I am now adding the gpt2 (BPE) model, but I was wondering which components of the Tokenizer are required, such as the normalizer, post-processor, etc., and also which decoder to use?

This is what I currently do: https://github.com/EricLBuehler/mistral.rs/blob/d66e5aff1e7faf208469c5bef3c70d45ffda5401/mistralrs-core/src/pipeline/gguf_tokenizer.rs#L116-L142, I would appreciate it if you could take a quick look and see if there is anything obviously wrong!
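As background on what the gpt2/BPE model component itself does (independent of whichever normalizer, pre-tokenizer, and decoder get layered around it), here is a toy sketch of the greedy BPE merge loop, not the tokenizers crate's implementation: start from single characters and repeatedly merge the adjacent pair with the lowest merge rank.

```rust
use std::collections::HashMap;

/// Greedy BPE: repeatedly merge the adjacent pair with the lowest rank
/// until no pair in `ranks` remains.
fn bpe_tokenize(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest (earliest-learned) rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..parts.len().saturating_sub(1) {
            if let Some(&rank) = ranks.get(&(parts[i].clone(), parts[i + 1].clone())) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        let Some((_, i)) = best else { break };
        let merged = format!("{}{}", parts[i], parts[i + 1]);
        parts[i] = merged;
        parts.remove(i + 1);
    }
    parts
}

fn main() {
    // Hypothetical merge table, in rank order: h+e, he+l, l+o.
    let ranks: HashMap<(String, String), usize> = [
        (("h".into(), "e".into()), 0),
        (("he".into(), "l".into()), 1),
        (("l".into(), "o".into()), 2),
    ]
    .into_iter()
    .collect();
    println!("{:?}", bpe_tokenize("hello", &ranks));
}
```

In a GGUF file the merge list plays the role of `ranks` here: line order in the merges table is the rank.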

@vody-am

vody-am commented Jun 8, 2024

Oh, I am also interested in reading sentencepiece tokenizers, in order to invoke the SigLIP text transformer in Rust!

EDIT: using the library mentioned by Eric above, I was able to load up https://huggingface.co/google/siglip-so400m-patch14-384/blob/main/spiece.model and it seemingly tokenized my input!

@ArthurZucker
Collaborator

@EricLBuehler we actually shipped this in transformers, but sure, I can have a look.
Most of the tokenizers supported in the GGUF format should use a Metaspace pre-tokenizer and decoder, a BPE or Unigram model, and either no normalizer or a precompiled charsmap.
All the requirements are in [convert_slow_tokenizer](https://github.com/huggingface/transformers/blob/8685b3c5d2dd2550527773d2a02499495a759e31/src/transformers/convert_slow_tokenizer.py#L56) in transformers.

I'll think about automatically converting sentencepiece .model files for Rust, but the big problem is that I don't want to have to support both sentencepiece and tiktoken, so it might just end up as example gists / snippets showing how to do this!
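The Metaspace behavior mentioned above can be sketched in a few lines. This is an assumption about its semantics with a prefix space, not the tokenizers crate itself: the pre-tokenizer turns every space into '▁' (U+2581), adds a leading '▁', and splits so each piece starts with '▁'; the decoder reverses this.

```rust
/// Toy Metaspace pre-tokenizer: "Hello world" -> ["▁Hello", "▁world"].
fn metaspace_pretokenize(text: &str) -> Vec<String> {
    // Prefix space plus space -> '▁' replacement.
    let replaced = format!("▁{}", text.replace(' ', "▁"));
    let mut pieces: Vec<String> = Vec::new();
    for ch in replaced.chars() {
        // Start a new piece at every '▁' marker.
        if ch == '▁' || pieces.is_empty() {
            pieces.push(String::new());
        }
        pieces.last_mut().unwrap().push(ch);
    }
    pieces
}

/// Toy Metaspace decoder: concatenate, restore spaces, drop the prefix.
fn metaspace_decode(pieces: &[String]) -> String {
    pieces.concat().replace('▁', " ").trim_start().to_string()
}

fn main() {
    let pieces = metaspace_pretokenize("Hello world");
    println!("{:?}", pieces);
    println!("{:?}", metaspace_decode(&pieces));
}
```

The round trip (pretokenize then decode) recovers the original text, which is the property the paired pre-tokenizer/decoder setup relies on.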

@EricLBuehler
Author

Thank you, @ArthurZucker, for the link! I was actually able to get the GPT2 conversion to work now!
