.Net: Examples encourage stateless token counter functions #5515

Closed
tonybaloney opened this issue Mar 17, 2024 · 0 comments · Fixed by #5519
Labels
.NET (Issue or Pull requests regarding .NET code), triage

Comments

@tonybaloney
Contributor

Describe the bug

The sample functions in Example55_TextChunker.cs show how to use a BPE encoding function to replace the built-in right-shift approximation function:

https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/KernelSyntaxExamples/Example55_TextChunker.cs#L87C5-L108

These examples encourage a stateless function that instantiates the BPE encoder on every call. For both SharpToken and the new Microsoft.ML.Tokenizers API, this is inefficient, because the encoder is rebuilt each time the chunker asks for a count. Instead, we should encourage a class that stores the tokenizer and exposes its count method for use as the TokenCounter delegate:

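public class StatefulTokenCounter
{
    private readonly Tokenizer _tokenizer;

    public StatefulTokenCounter(string model)
    {
        // Create the tokenizer once; every Count call reuses it.
        // Note: .Result blocks on the async factory during construction.
        this._tokenizer = Tiktoken.CreateByModelNameAsync(model).Result;
    }

    // Same shape as the token counter delegate: string in, int out.
    public int Count(string input)
    {
        return this._tokenizer.CountTokens(input);
    }
}


// Usage: pass the instance method as the delegate.
var counter = new StatefulTokenCounter("gpt-4");
var lines = TextChunker.SplitPlainTextLines("This is a test", 40, counter.Count);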

I tested this side by side against a stateful class whose instance method is passed as the delegate. For a small input, the stateful version was about 5x faster, tapering to about 2x for a large input.
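
A rough way to reproduce this kind of comparison without installing any tokenizer package is sketched below. It is not the benchmark used for the numbers above: `FakeTokenizer`, `Counters`, and `StatefulCounter` are hypothetical stand-ins invented for illustration, with a constructor that is deliberately expensive and a count method that is cheap, so the timings it prints will differ from the 5x and 2x figures.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for a BPE tokenizer: construction is deliberately
// costly (it fills a large vocabulary table), counting is cheap.
public sealed class FakeTokenizer
{
    private readonly Dictionary<string, int> _vocab = new();

    // Tracks how many tokenizer instances have been built.
    public static int Constructions { get; private set; }

    public FakeTokenizer()
    {
        Constructions++;
        for (int i = 0; i < 50_000; i++)
        {
            this._vocab["tok" + i] = i;
        }
    }

    // Counts whitespace-separated words; a real BPE encoder counts sub-word tokens.
    public int CountTokens(string input) =>
        input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
}

public static class Counters
{
    // Stateless: builds a new tokenizer on every call.
    public static int StatelessCount(string input) =>
        new FakeTokenizer().CountTokens(input);
}

// Stateful: builds the tokenizer once and reuses it for every call.
public sealed class StatefulCounter
{
    private readonly FakeTokenizer _tokenizer = new();

    public int Count(string input) => this._tokenizer.CountTokens(input);
}

public static class Program
{
    // Calls the counter repeatedly and returns the elapsed milliseconds.
    private static long TimeMs(Func<string, int> counter, string input, int calls)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < calls; i++)
        {
            counter(input);
        }
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const string Input = "This is a test";
        var stateful = new StatefulCounter();

        Console.WriteLine($"stateless: {TimeMs(Counters.StatelessCount, Input, 200)} ms");
        Console.WriteLine($"stateful:  {TimeMs(stateful.Count, Input, 200)} ms");
    }
}
```

Both counters have the same `string -> int` shape, so either can be handed to a chunker as the token counting delegate; the only difference is how often the tokenizer is constructed.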

@markwallace-microsoft added the .NET and triage labels Mar 17, 2024
@github-actions bot changed the title from "Examples encourage stateless token counter functions" to ".Net: Examples encourage stateless token counter functions" Mar 17, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 19, 2024
…kencount (#5519)

Fixes #5515 

### Motivation and Context

The examples for the text splitter all instantiate the encoder/tokenizer
*every* time the count function is called. This updates the samples to
use a count method on a class that stores the encoding.


Co-authored-by: Roger Barreto <19890735+RogerBarreto@users.noreply.github.com>