Add ability to tokenize a string and return the decoded tokens using the correct BPE model #17

Merged: 3 commits, Apr 16, 2023

Commits on Apr 15, 2023

  1. Add ability to tokenize a string and return the decoded tokens using the correct BPE model
    
    The _decode_native_and_split function decodes encoded BPE tokens into their corresponding byte arrays and returns a vector of vectors of bytes. The split_by_token_with_special_tokens function takes in a string, encodes it using the BPE model, and then decodes the encoded tokens into a vector of strings. This allows for tokenizing a string and returning the decoded tokens using the correct BPE model. Added a corresponding test (cl100k_split_test) to tests/tiktoken.rs.
    jackbackes committed Apr 15, 2023
    c78268c
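
The decode-and-split step described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the `Decoder` type, the toy vocabulary, and the function names `toy_decoder`, `decode_and_split`, and `split_to_strings` are all hypothetical, and the encoding step is assumed to have already produced the token ids.

```rust
use std::collections::HashMap;
use std::string::FromUtf8Error;

/// Hypothetical stand-in for a BPE decoder table: token id -> raw bytes.
type Decoder = HashMap<usize, Vec<u8>>;

/// Builds a toy three-token vocabulary (illustrative only, not cl100k_base).
fn toy_decoder() -> Decoder {
    let mut d = HashMap::new();
    d.insert(0, b"Hello".to_vec());
    d.insert(1, b" world".to_vec());
    d.insert(2, b"!".to_vec());
    d
}

/// Maps each token id to its own byte sequence, preserving token boundaries
/// instead of concatenating everything into one buffer.
/// Unknown ids are silently skipped in this sketch.
fn decode_and_split(decoder: &Decoder, tokens: &[usize]) -> Vec<Vec<u8>> {
    tokens.iter().filter_map(|t| decoder.get(t).cloned()).collect()
}

/// Converts each per-token byte sequence into a String.
/// Returns an error at the first piece that is not valid UTF-8.
fn split_to_strings(
    decoder: &Decoder,
    tokens: &[usize],
) -> Result<Vec<String>, FromUtf8Error> {
    decode_and_split(decoder, tokens)
        .into_iter()
        .map(String::from_utf8)
        .collect()
}
```

With this toy vocabulary, `split_to_strings(&toy_decoder(), &[0, 1, 2])` yields the three pieces "Hello", " world", and "!" as separate strings, which is the per-token view the commit describes.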

Commits on Apr 16, 2023

  1. Add ChatCompletionRequestMessage Eq implementation

    Add the "Eq" trait to the "ChatCompletionRequestMessage" struct so that instances can be compared for equality.
    jackbackes committed Apr 16, 2023
    7ecf0a1
  2. Add split_by_token_with_special_tokens() method to CoreBPE

    Add a new method to CoreBPE, split_by_token_with_special_tokens(), which takes a string slice containing the text to be tokenized, encodes it using the BPE model, and decodes the encoded tokens into a vector of strings. The resulting iterator yields each token as a Result<String> to handle decoding errors. The method includes a test to ensure its correctness.
    jackbackes committed Apr 16, 2023
    d3461d3
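
One reason to yield each token as a Result is that a single BPE token can hold only part of a multi-byte UTF-8 character, so it may not decode to a valid string on its own. The sketch below illustrates that idea only; the function name `tokens_as_strings` and the use of `FromUtf8Error` as the error type are assumptions, and the PR's actual signature and error type may differ.

```rust
use std::string::FromUtf8Error;

/// Lazily converts per-token byte sequences into strings,
/// producing one Result per token so that a bad piece does not
/// hide the pieces that decoded cleanly.
fn tokens_as_strings<'a>(
    pieces: &'a [Vec<u8>],
) -> impl Iterator<Item = Result<String, FromUtf8Error>> + 'a {
    pieces.iter().map(|bytes| String::from_utf8(bytes.clone()))
}
```

For example, if the euro sign (bytes `E2 82 AC`) were split across two tokens as `[E2, 82]` and `[AC]`, both items would be `Err`, while a neighbouring ASCII token would still come back as `Ok`. The caller can then decide whether to abort, skip, or substitute a placeholder.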