This module is not ready for CJK characters #16

Open
mashihua opened this issue Jun 8, 2023 · 2 comments

mashihua commented Jun 8, 2023

We found that this module is not ready for CJK characters. For example, we entered the following text: ここに内容を入力すると、消費されるメダルの数が計算されます。

The OpenAI tokenizer shows:

[Screenshot 2023-06-08 15 11 21]

This module shows:

[Screenshot 2023-06-08 15 12 04]

The tokens are different from OpenAI's.


xnohat commented Jun 8, 2023

In the first screenshot you used the GPT-3 encoder, and in the second you used the cl100k_base encoder for GPT-3.5 and GPT-4.
These are two different token encoders, so they produce two different sets of tokens for the same input.
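
To illustrate why two encoders can disagree on the same string, here is a toy sketch of byte-level BPE. It is not this module's code, and the two merge tables (`MERGES_A`, `MERGES_B`) are hypothetical; they are not the real GPT-3 or cl100k_base vocabularies. The only point is that each CJK character is three bytes in UTF-8, so the token count depends on which byte merges an encoder happens to have.

```python
def bpe_encode(text, merges):
    """Simplified byte-level BPE: apply each merge rule, in order, to the UTF-8 bytes."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            # Join an adjacent (left, right) pair into a single token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical encoder A: knows only the shared hiragana prefix bytes.
MERGES_A = [(b"\xe3", b"\x81")]

# Hypothetical encoder B: also has merges for the full characters こ and に.
MERGES_B = [
    (b"\xe3", b"\x81"),
    (b"\xe3\x81", b"\x93"),  # こ
    (b"\xe3\x81", b"\xab"),  # に
]

text = "ここに"
tokens_a = bpe_encode(text, MERGES_A)
tokens_b = bpe_encode(text, MERGES_B)

print(len(tokens_a))  # 6
print(len(tokens_b))  # 3
```

Both outputs decode back to the same text; only the segmentation differs. So a token count is only comparable with OpenAI's when both sides use the same encoding.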

foloinfo commented

I checked the output for the same string with p50k_base, and it seems to give the same result as the OpenAI Tokenizer.
I also tested a longer string (800 characters), and the number of tokens was the same.
I think it works fine with CJK text.
