These examples encourage a stateless function that instantiates the BPE encoder on every call. For both SharpToken and the new ML.Tokenizers API, this is very inefficient. Instead, we should encourage usage with a class that stores the tokenizer and passes its count method as the `TokenCount` delegate:
```csharp
public class StatefulTokenCounter
{
    private readonly Tokenizer _tokenizer;

    public StatefulTokenCounter(string model)
    {
        this._tokenizer = Tiktoken.CreateByModelNameAsync(model).Result;
    }

    public int Count(string input)
    {
        return this._tokenizer.CountTokens(input);
    }
}

var counter = new StatefulTokenCounter("gpt-4");
var lines = TextChunker.SplitPlainTextLines("This is a test", 40, counter.Count);
```
I tested the stateless samples side-by-side with a stateful class that passes an instance method as the delegate. For a small input there was a 5x difference in performance, narrowing to about 2x for a large input.
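The cost difference is easy to demonstrate without any tokenizer library at all. The sketch below uses a hypothetical `FakeTokenizer` (not the SharpToken or ML.Tokenizers API) whose constructor counts how many times it runs, contrasting a delegate that instantiates per call with one that reuses a shared instance:

```csharp
using System;

// Hypothetical stand-in for an expensive-to-construct tokenizer
// (e.g. a BPE encoder that loads its merge tables on creation).
class FakeTokenizer
{
    public static int Constructions;
    public FakeTokenizer() => Constructions++;
    public int CountTokens(string input) =>
        input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
}

class Demo
{
    // Anti-pattern: the delegate builds a new tokenizer on every call.
    static int StatelessCount(string input) => new FakeTokenizer().CountTokens(input);

    // Recommended pattern: one tokenizer instance reused by the delegate.
    static readonly FakeTokenizer Shared = new FakeTokenizer();
    static int StatefulCount(string input) => Shared.CountTokens(input);

    static void Main()
    {
        FakeTokenizer.Constructions = 0;
        for (int i = 0; i < 1000; i++) StatelessCount("this is a test");
        Console.WriteLine($"stateless constructions: {FakeTokenizer.Constructions}");

        FakeTokenizer.Constructions = 0;
        for (int i = 0; i < 1000; i++) StatefulCount("this is a test");
        Console.WriteLine($"stateful constructions: {FakeTokenizer.Constructions}");
    }
}
```

With a real BPE tokenizer each of those extra constructions pays for vocabulary loading, which is where the measured 5x gap on small inputs comes from.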
…kencount (#5519)
Fixes #5515
### Motivation and Context
The examples for the text splitter all instantiate the encoder/tokenizer
*every* time the count function is called. This updates the samples to
use a count method on a class that stores the encoding.
### Description
<!-- Describe your changes, the overall approach, the underlying design.
These notes will help understanding how your code works. Thanks! -->
### Contribution Checklist
<!-- Before submitting this PR, please make sure: -->
- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
Co-authored-by: Roger Barreto <19890735+RogerBarreto@users.noreply.github.com>
Describe the bug
The sample functions in `Example55_TextChunker.cs` show how to use a BPE encoding function to replace the built-in left-shift approximation function: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/KernelSyntaxExamples/Example55_TextChunker.cs#L87C5-L108