Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k12: improve API and support multithreaded computation #443

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bwesterb
Copy link
Member

@bwesterb bwesterb commented Jun 4, 2023

Use options style API, so that the common case is very simple:

h := k12.NewDraft10()

but we can provide options elegantly:

h := k12.NewDraft10(
    WithContext([]byte("some context")),
    WithWorkers(runtime.NumCPU()),
)

Allows multithreaded computation with the WithWorkers() option. On M2 Pro scales well with a few workers, but isn't able to utilize all cores effectively:

BenchmarkK12_100B-12          	 5223334	       228.8 ns/op	 437.07 MB/s
BenchmarkK12_10K-12           	  105744	     11183 ns/op	 894.21 MB/s
BenchmarkK12_100K-12          	   27141	     44364 ns/op	2254.06 MB/s
BenchmarkK12_3M-12            	    1172	   1010401 ns/op	3243.07 MB/s
BenchmarkK12_32M-12           	     100	  10033730 ns/op	3265.78 MB/s
BenchmarkK12_327M-12          	      12	  98888208 ns/op	3313.64 MB/s
BenchmarkK12_3276M-12         	       2	 991111938 ns/op	3306.19 MB/s
BenchmarkK12x2_32M-12         	     183	   6510423 ns/op	5033.16 MB/s
BenchmarkK12x2_327M-12        	      18	  63622058 ns/op	5150.41 MB/s
BenchmarkK12x2_3276M-12       	       2	 632823584 ns/op	5178.06 MB/s
BenchmarkK12x4_32M-12         	     364	   3300120 ns/op	9929.34 MB/s
BenchmarkK12x4_327M-12        	      39	  29477854 ns/op	11116.14 MB/s
BenchmarkK12x4_3276M-12       	       2	 581819167 ns/op	11263.98 MB/s
BenchmarkK12x8_32M-12         	     520	   2301923 ns/op	14235.05 MB/s
BenchmarkK12x8_327M-12        	      76	  15590312 ns/op	21018.18 MB/s
BenchmarkK12x8_3276M-12       	       4	 296590469 ns/op	22096.46 MB/s
BenchmarkK12xCPUs_32M-12      	     472	   2526827 ns/op	12968.04 MB/s
BenchmarkK12xCPUs_327M-12     	      78	  15139957 ns/op	21643.39 MB/s
BenchmarkK12xCPUs_3276M-12    	       4	 280958114 ns/op	23325.90 MB/s

We only reach 23GB/s (at 12x) instead of the lower bound of 33GB/s expected with 10 performance cores.

Adds {Max,Next}WriteSize to suggest the caller how big to choose their Write() calls.

@bwesterb bwesterb force-pushed the bas/pk12 branch 3 times, most recently from 1cec358 to 02c9cde Compare June 4, 2023 16:38
Copy link
Contributor

@armfazh armfazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 32 GB,

Cores Speedup
1 1.00x
2 1.49x
4 2.48x
8 4.27x

For 32MB

Cores Speedup
1 1.00x
2 1.28x
4 1.81x
8 1.92x

the speedup increases with the length of the message (as expected) but the largest test case (where numbers to look ok) requires message of many GB in memory. Not sure if this is gonna be realistic since usually large reads are done in batches of smaller sizes.

For medium-sized messages (say 32MB), the max speedup is below 2x even with 8 threads.

*not in opposition to this effort, but timings make me think if something must be done differently (or I missed something)

xof/k12/k12.go Show resolved Hide resolved
xof/k12/k12.go Outdated Show resolved Hide resolved
xof/k12/k12.go Outdated Show resolved Hide resolved
xof/k12/k12_test.go Show resolved Hide resolved
xof/k12/k12.go Show resolved Hide resolved
xof/k12/k12.go Show resolved Hide resolved
@bwesterb
Copy link
Member Author

requires message of many GB in memory. Not sure if this is gonna be realistic since usually large reads are done in batches of smaller sizes.

Those tests don't use a buffer of GBs. Instead they use the size suggested by MaxWriteSize(), which atm is lanes * CPUs * 8kB, ie. 192kB on M2 Pro.

but timings make me think if something must be done differently (or I missed something)

There clearly is some concurrency bottleneck here which I haven't figured out yet.

@bwesterb bwesterb force-pushed the bas/pk12 branch 2 times, most recently from 80cdfd3 to 7df5996 Compare June 17, 2023 20:36
Use options style API, so that the common case is very simple:

    h := k12.NewDraft10()

but we can provide options elegantly:

    h := k12.NewDraft10(
        WithContext([]byte("some context")),
        WithWorkers(runtime.NumCPU()),
    )

Allows multithreaded computation with the WithWorkers() option.
On M2 Pro scales well with a few workers, but isn't able to
utilize all cores effectively.

    BenchmarkK12_100B-12          	 5223334	       228.8 ns/op	 437.07 MB/s
    BenchmarkK12_10K-12           	  105744	     11183 ns/op	 894.21 MB/s
    BenchmarkK12_100K-12          	   27141	     44364 ns/op	2254.06 MB/s
    BenchmarkK12_3M-12            	    1172	   1010401 ns/op	3243.07 MB/s
    BenchmarkK12_32M-12           	     100	  10033730 ns/op	3265.78 MB/s
    BenchmarkK12_327M-12          	      12	  98888208 ns/op	3313.64 MB/s
    BenchmarkK12_3276M-12         	       2	 991111938 ns/op	3306.19 MB/s
    BenchmarkK12x2_32M-12         	     183	   6510423 ns/op	5033.16 MB/s
    BenchmarkK12x2_327M-12        	      18	  63622058 ns/op	5150.41 MB/s
    BenchmarkK12x2_3276M-12       	       2	 632823584 ns/op	5178.06 MB/s
    BenchmarkK12x4_32M-12         	     364	   3300120 ns/op	9929.34 MB/s
    BenchmarkK12x4_327M-12        	      39	  29477854 ns/op	11116.14 MB/s
    BenchmarkK12x4_3276M-12       	       2	 581819167 ns/op	11263.98 MB/s
    BenchmarkK12x8_32M-12         	     520	   2301923 ns/op	14235.05 MB/s
    BenchmarkK12x8_327M-12        	      76	  15590312 ns/op	21018.18 MB/s
    BenchmarkK12x8_3276M-12       	       4	 296590469 ns/op	22096.46 MB/s
    BenchmarkK12xCPUs_32M-12      	     472	   2526827 ns/op	12968.04 MB/s
    BenchmarkK12xCPUs_327M-12     	      78	  15139957 ns/op	21643.39 MB/s
    BenchmarkK12xCPUs_3276M-12    	       4	 280958114 ns/op	23325.90 MB/s

We only reach 23GB/s (at 12x) instead of the lower bound of 33GB/s
expected with 10 performance cores.

Adds {Max,Next}WriteSize to suggest the caller how big to choose
their Write() calls.

Adds a test to the CI, to check the k12 package with the data race
detector on.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants