Small parameter encoding improvement #1296

rusher · 2023-03-20T10:51:11Z

Some types of parameters encoded as "text" or "length encoded text" are guaranteed to be ASCII. Using the ASCII encoder provides a significant performance improvement.

bgrainger · 2023-03-20T19:25:25Z

Using the ASCII encoder provides a significant performance improvement.

Do you have a benchmark that quantifies the size of the improvement?

Note that Encoding.UTF8.GetEncoder() is cached in the existing code, but Encoding.ASCII.GetEncoder() isn't (in the PR), making me wonder if this PR creates additional garbage (that could impact performance)?

bgrainger · 2023-03-20T19:27:53Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

@@ -126,6 +129,30 @@ public void Write(ReadOnlySpan<char> chars, bool flush)
 		}
 	}

+	public void WriteAscii(ReadOnlySpan<char> chars, bool flush)


Is this ever called with flush: false? If not, the parameter could be removed and the code simplified.

bgrainger · 2023-03-20T19:29:33Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

+			Reallocate(chars.Length);
+		while (chars.Length > 0)
+		{
+			encoder.Convert(chars, m_output.Span, false, out var charsUsed, out var bytesUsed, out var completed);


The flush parameter could very likely be true here, because we're assuming the input string contains only ASCII characters, which convert one-to-one to bytes.

bgrainger · 2023-03-20T19:30:06Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

+		var encoder = Encoding.ASCII.GetEncoder();
+		if (m_output.Length < chars.Length)
+			Reallocate(chars.Length);
+		while (chars.Length > 0)


I think this loop should be unnecessary, because ASCII characters convert to bytes one-to-one and the output buffer is resized to be large enough on the line above.

bgrainger · 2023-03-20T19:39:57Z

FWIW, you can git commit --amend and git push --force to update the commit on an existing PR (if you want to make an edit to a PR that's already open).

bgrainger · 2023-03-20T19:44:25Z

I ran the following BenchmarkDotNet test:

[Params("short", "a long ASCII string that's longer")]
public string Text { get; set; }

static Encoder s_utf8 = System.Text.Encoding.UTF8.GetEncoder();
static Encoder s_ascii = System.Text.Encoding.ASCII.GetEncoder();
static byte[] s_bytes = new byte[200];

[Benchmark(Baseline = true)]
public int CachedUtf8() => s_utf8.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);

[Benchmark]
public int CachedAscii() => s_ascii.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);

[Benchmark]
public int Ascii()
{
	var encoder = System.Text.Encoding.ASCII.GetEncoder();
	return encoder.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
}

BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19045.2670/22H2/2022Update)
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host] : .NET 7.0.4 (7.0.423.11508), X64 RyuJIT AVX2

Method	Text	Mean	Error	StdDev	Ratio	RatioSD
CachedUtf8	a lon(...)onger [33]	21.74 ns	0.393 ns	0.349 ns	1.00	0.00
CachedAscii	a lon(...)onger [33]	20.32 ns	0.433 ns	0.724 ns	0.93	0.05
Ascii	a lon(...)onger [33]	29.80 ns	0.570 ns	0.533 ns	1.37	0.03

CachedUtf8	short	17.96 ns	0.388 ns	0.518 ns	1.00	0.00
CachedAscii	short	15.05 ns	0.332 ns	0.382 ns	0.84	0.03
Ascii	short	23.96 ns	0.478 ns	1.233 ns	1.39	0.07

It looks like using ASCII encoding would be 5-10% faster for the encoding itself, and likely even faster if the complicated looping logic (for UTF-8 surrogate characters) was omitted from the ASCII version. However, this only occurs if the Encoder is cached; if a new one is created every time, using ASCII is slower.

bgrainger · 2023-03-20T20:56:00Z

I tested another approach of just converting the chars to bytes:

[Benchmark]
public int Loop()
{
	var span = Text.AsSpan();
	var output = s_bytes.AsSpan();
	var i = 0;
	foreach (var ch in span)
		output[i++] = ch is >= (char)0x20 and <= (char)0x7F ? (byte) ch : (byte) '?';
	return i;
}

It's significantly faster for short strings, but surprisingly slow for longer ones:

Method	Text	Mean	Error	StdDev	Ratio	RatioSD
CachedUtf8	a lon(...)onger [33]	19.251 ns	0.2851 ns	0.2527 ns	1.00	0.00
CachedAscii	a lon(...)onger [33]	19.381 ns	0.4126 ns	0.5218 ns	1.00	0.03
Ascii	a lon(...)onger [33]	28.755 ns	0.5946 ns	0.9431 ns	1.48	0.06
Loop	a lon(...)onger [33]	51.139 ns	0.8243 ns	0.7711 ns	2.66	0.05

CachedUtf8	short	17.090 ns	0.2262 ns	0.2005 ns	1.00	0.00
CachedAscii	short	15.865 ns	0.3395 ns	0.5184 ns	0.93	0.03
Ascii	short	22.774 ns	0.3866 ns	0.3616 ns	1.33	0.02
Loop	short	6.848 ns	0.1077 ns	0.0955 ns	0.40	0.01

If we expected that most of the strings would be short, then perhaps it would be better on average?

bgrainger · 2023-03-20T21:10:46Z

If we avoid the range check (by assuming the input is truly all ASCII), then performance improves:

[Benchmark]
public int Loop()
{
	var span = Text.AsSpan();
	var output = s_bytes.AsSpan();
	var i = 0;
	foreach (var ch in span)
		output[i++] = unchecked((byte) ch);
	return i;
}

Method	Text	Mean	Error	StdDev	Ratio	RatioSD
CachedUtf8	a lon(...)onger [33]	19.838 ns	0.4264 ns	0.8113 ns	1.00	0.00
CachedAscii	a lon(...)onger [33]	19.108 ns	0.2547 ns	0.2258 ns	0.97	0.04
Ascii	a lon(...)onger [33]	26.106 ns	0.3199 ns	0.2835 ns	1.32	0.05
Loop	a lon(...)onger [33]	24.771 ns	0.5208 ns	0.5572 ns	1.25	0.06

CachedUtf8	short	18.620 ns	0.2729 ns	0.2553 ns	1.00	0.00
CachedAscii	short	14.202 ns	0.1102 ns	0.0977 ns	0.76	0.01
Ascii	short	21.940 ns	0.2139 ns	0.2001 ns	1.18	0.02
Loop	short	4.114 ns	0.1111 ns	0.1697 ns	0.22	0.01

rusher · 2023-03-22T16:33:23Z

Right, there is indeed some corrections to add :

ASCII encoder needs to be cached.
there is no use for flush parameter
since we are sure that string is ascii AND output byte array size is > to string size, no need to loop.

btw, i've check that using latin1 can make even better performance :
this is because compare to ascii encoder, dotnet then permits to use SSE2 hardware, permitting more effective copy, by bunch of 8 bytes :

benchmark result

|      Method |                 Text |      Mean |     Error |
|------------ |--------------------- |----------:|----------:|
|  CachedUtf8 |                 1234 | 10.927 ns | 0.0709 ns |
| CachedAscii |                 1234 | 10.020 ns | 0.0872 ns |
| CachedLatin |                 1234 |  9.668 ns | 0.1090 ns |
|        Loop |                 1234 |  3.041 ns | 0.0366 ns |
|             |                      |           |           |
|  CachedUtf8 | 1970-(...)00000 [26] | 13.105 ns | 0.0968 ns |
| CachedAscii | 1970-(...)00000 [26] | 12.456 ns | 0.0602 ns |
| CachedLatin | 1970-(...)00000 [26] | 11.637 ns | 0.0718 ns |
|        Loop | 1970-(...)00000 [26] | 19.641 ns | 0.1691 ns |

(tested with code :

    [Params("1234", "1970-01-01 00:00:00.000000")]
    public string Text { get; set; }
    

    static Encoder s_utf8 = System.Text.Encoding.UTF8.GetEncoder();
    static Encoder s_ascii = System.Text.Encoding.ASCII.GetEncoder();
    static Encoder s_latin = System.Text.Encoding.Latin1.GetEncoder();
    static byte[] s_bytes = new byte[200];

    [Benchmark(Baseline = true)]
    public int CachedUtf8() => s_utf8.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
    
    [Benchmark]
    public int CachedAscii() => s_ascii.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
    
    
    [Benchmark]
    public int CachedLatin() => s_latin.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);

    //
    // [Benchmark]
    // public int Ascii()
    // {
    //     var encoder = System.Text.Encoding.ASCII.GetEncoder();
    //     return encoder.GetBytes(Text.AsSpan(), s_bytes.AsSpan(), true);
    // }
    
    [Benchmark]
    public int Loop()
    {
        var span = Text.AsSpan();
        var output = s_bytes.AsSpan();
        var i = 0;
        foreach (var ch in span)
            output[i++] = unchecked((byte) ch);
        return i;
    }

In this specific case, it might be even a better solution, so I'll change methods name and implementation accordingly.

This will be use mostly for decimal and datetime, so i think ASCII/latin1 encoder is preferable to loop.

bgrainger · 2023-03-23T08:09:26Z

The ASCII encoder also appears to be vectorised: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs#L1155

Since the timing is close (I don't know why Latin1 benchmarks as slightly faster?) I would prefer to use the ASCII encoder as it matches the expected range of the input data better (i.e., there won't be any non-ASCII characters that need to be encoded as Latin1).

rusher · 2023-03-23T09:44:05Z

change submitted.

(for info: latin1 difference compare to ASCII https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs#L537 )

bgrainger · 2023-03-23T16:26:32Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

@@ -226,7 +237,8 @@ private void Reallocate(int additional = 0)
 		m_output = new(m_buffer, usedLength, m_buffer.Length - usedLength);
 	}

-	private Encoder? m_encoder;
+	private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder();


This type isn't thread-safe and can't be shared across multiple instances of ByteBufferWriter. In particular, "An Encoder object maintains state information between successive calls to GetBytes or Convert methods so that it can correctly encode character sequences that span blocks."

Data would very likely be corrupted if m_utf8Encoder was used simultaneously by multiple threads.

bgrainger · 2023-03-23T16:29:20Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

@@ -226,7 +237,8 @@ private void Reallocate(int additional = 0)
 		m_output = new(m_buffer, usedLength, m_buffer.Length - usedLength);
 	}

-	private Encoder? m_encoder;
+	private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder();
+	private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder();


Although this is used to perform a "one-shot" conversion (from UTF-16 to ASCII) I don't think the documentation provides any guarantee that it's safe to do that simultaneously from multiple threads. Therefore, this should also not be static.

bgrainger · 2023-03-23T16:32:46Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

+	public static void WriteLengthEncodedAsciiString(this ByteBufferWriter writer, ReadOnlySpan<char> value)
+	{
+		writer.WriteLengthEncodedInteger((ulong) value.Length);
+		writer.Write(value, flush: true);


Suggested change

writer.Write(value, flush: true);

writer.WriteAscii(value);

Did you mean WriteAscii here?

bgrainger · 2023-03-23T16:33:53Z

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

+	private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder();
+	private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder();


Suggested change

private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder();

private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder();

private Encoder? m_utf8Encoder;

private Encoder? m_asciiEncoder;

Actually, m_asciiEncoder probably isn't necessary... the Encoding.ASCII.GetBytes static method is probably sufficient since we're encoding the whole string at once.

Encoding.ASCII.GetBytes is also much faster than a cached Encoder object:

Method Text Mean Error StdDev Ratio RatioSD

CachedUtf8 1234 16.584 ns 0.3541 ns 0.4349 ns 1.00 0.00

CachedAscii 1234 14.951 ns 0.1972 ns 0.1845 ns 0.91 0.03

Ascii 1234 5.975 ns 0.1181 ns 0.0986 ns 0.36 0.01

Loop 1234 3.002 ns 0.0853 ns 0.0798 ns 0.18 0.01

CachedUtf8 1970-(...)00000 [26] 19.331 ns 0.4160 ns 0.3891 ns 1.00 0.00

CachedAscii 1970-(...)00000 [26] 17.843 ns 0.2783 ns 0.2324 ns 0.92 0.02

Ascii 1970-(...)00000 [26] 9.683 ns 0.1879 ns 0.1569 ns 0.50 0.01

Loop 1970-(...)00000 [26] 20.023 ns 0.4302 ns 0.4955 ns 1.03 0.03

rusher · 2023-03-24T12:19:13Z

all goods remarks... I wouldn't have think a simple change like this would have taken so much exchanges.
I've confirm that Encoding.ASCII.GetBytes(ReadOnlySpan<char>, Span<bute>); is by far the best choice and force push the PR accordingly. I think this is now ok.

…re guaranteed to be ASCII. Using the ASCII encoder provides a significant performance improvement. Signed-off-by: rusher <diego.dupin@gmail.com>

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs

Co-authored-by: Bradley Grainger <bgrainger@gmail.com>

bgrainger · 2023-03-30T19:46:49Z

Thanks for the improvement!

bgrainger reviewed Mar 20, 2023

View reviewed changes

bgrainger reviewed Mar 23, 2023

View reviewed changes

rusher force-pushed the encoding branch from 10e6cc6 to 9bd563c Compare March 24, 2023 12:15

Some types of parameters encoded as "text" or "length encoded text" a…

814531a

…re guaranteed to be ASCII. Using the ASCII encoder provides a significant performance improvement. Signed-off-by: rusher <diego.dupin@gmail.com>

rusher force-pushed the encoding branch from 9bd563c to 814531a Compare March 24, 2023 12:20

bgrainger reviewed Mar 24, 2023

View reviewed changes

src/MySqlConnector/Protocol/Serialization/ByteBufferWriter.cs Outdated Show resolved Hide resolved

following correction after ASCII encoding change

45c0410

Co-authored-by: Bradley Grainger <bgrainger@gmail.com>

bgrainger approved these changes Mar 30, 2023

View reviewed changes

bgrainger merged commit d04e24a into mysql-net:master Mar 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small parameter encoding improvement #1296

Small parameter encoding improvement #1296

rusher commented Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger Mar 20, 2023

bgrainger Mar 20, 2023

bgrainger Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

rusher commented Mar 22, 2023

bgrainger commented Mar 23, 2023

rusher commented Mar 23, 2023

bgrainger Mar 23, 2023

bgrainger Mar 23, 2023

bgrainger Mar 23, 2023

bgrainger Mar 23, 2023

bgrainger Mar 23, 2023

bgrainger Mar 23, 2023

rusher commented Mar 24, 2023

bgrainger commented Mar 30, 2023

		private static Encoder m_utf8Encoder = Encoding.UTF8.GetEncoder();
		private static Encoder m_asciiEncoder = Encoding.ASCII.GetEncoder();

Method	Text	Mean	Error	StdDev	Ratio	RatioSD
CachedUtf8	1234	16.584 ns	0.3541 ns	0.4349 ns	1.00	0.00
CachedAscii	1234	14.951 ns	0.1972 ns	0.1845 ns	0.91	0.03
Ascii	1234	5.975 ns	0.1181 ns	0.0986 ns	0.36	0.01
Loop	1234	3.002 ns	0.0853 ns	0.0798 ns	0.18	0.01

CachedUtf8	1970-(...)00000 [26]	19.331 ns	0.4160 ns	0.3891 ns	1.00	0.00
CachedAscii	1970-(...)00000 [26]	17.843 ns	0.2783 ns	0.2324 ns	0.92	0.02
Ascii	1970-(...)00000 [26]	9.683 ns	0.1879 ns	0.1569 ns	0.50	0.01
Loop	1970-(...)00000 [26]	20.023 ns	0.4302 ns	0.4955 ns	1.03	0.03

Small parameter encoding improvement #1296

Small parameter encoding improvement #1296

Conversation

rusher commented Mar 20, 2023

bgrainger commented Mar 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

bgrainger commented Mar 20, 2023

rusher commented Mar 22, 2023

bgrainger commented Mar 23, 2023

rusher commented Mar 23, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rusher commented Mar 24, 2023

bgrainger commented Mar 30, 2023