These examples encourage a stateless function that instantiates the BPE encoder on every call. For both SharpToken and the new ML.Tokenizers API, this is very inefficient. Instead, we should encourage usage with a class that stores the tokenizer and passes its count method as the `TokenCount` delegate:
```csharp
public class StatefulTokenCounter
{
    private readonly Tokenizer _tokenizer;

    public StatefulTokenCounter(string model)
    {
        this._tokenizer = Tiktoken.CreateByModelNameAsync(model).Result;
    }

    public int Count(string input)
    {
        return this._tokenizer.CountTokens(input);
    }
}

var counter = new StatefulTokenCounter("gpt-4");
var lines = TextChunker.SplitPlainTextLines("This is a test", 40, counter.Count);
```
I tested the stateless samples side-by-side with a stateful class that passes an instance method as the delegate. For a small input there was a 5x difference in performance, narrowing to about 2x for a large input.
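The cost difference is easy to demonstrate without any tokenizer library at all. The sketch below uses a hypothetical `FakeTokenizer` (not the SharpToken or ML.Tokenizers API) whose constructor counts how many times it runs, contrasting a delegate that instantiates per call with one that reuses a shared instance:

```csharp
using System;

// Hypothetical stand-in for an expensive-to-construct tokenizer
// (e.g. a BPE encoder that loads its merge tables on creation).
class FakeTokenizer
{
    public static int Constructions;
    public FakeTokenizer() => Constructions++;
    public int CountTokens(string input) =>
        input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
}

class Demo
{
    // Anti-pattern: the delegate builds a new tokenizer on every call.
    static int StatelessCount(string input) => new FakeTokenizer().CountTokens(input);

    // Recommended pattern: one tokenizer instance reused by the delegate.
    static readonly FakeTokenizer Shared = new FakeTokenizer();
    static int StatefulCount(string input) => Shared.CountTokens(input);

    static void Main()
    {
        FakeTokenizer.Constructions = 0;
        for (int i = 0; i < 1000; i++) StatelessCount("this is a test");
        Console.WriteLine($"stateless constructions: {FakeTokenizer.Constructions}");

        FakeTokenizer.Constructions = 0;
        for (int i = 0; i < 1000; i++) StatefulCount("this is a test");
        Console.WriteLine($"stateful constructions: {FakeTokenizer.Constructions}");
    }
}
```

With a real BPE tokenizer each of those extra constructions pays for vocabulary loading, which is where the measured 5x gap on small inputs comes from.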
…kencount (#5519)
Fixes #5515
### Motivation and Context
The examples for the text splitter all instantiate the encoder/tokenizer
*every* time the count function is called. This updates the samples to
use a count method on a class that stores the encoding.
### Description
<!-- Describe your changes, the overall approach, the underlying design.
These notes will help understanding how your code works. Thanks! -->
### Contribution Checklist
<!-- Before submitting this PR, please make sure: -->
- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows the [SK Contribution
Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
Co-authored-by: Roger Barreto <19890735+RogerBarreto@users.noreply.github.com>
Describe the bug
The sample functions in `Example55_TextChunker.cs` show how to use a BPE encoding function to replace the built-in left-shift approximation function: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/samples/KernelSyntaxExamples/Example55_TextChunker.cs#L87C5-L108