Simplify hllDenseRegHisto() #13196
base: unstable
In HyperLogLog, every register has 6 bits, and every 8 registers can be processed with the same logic. Therefore, the code that handles 16 registers per iteration can be simplified to handle only 8.
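The layout behind this claim can be sketched as follows, assuming registers are packed LSB-first, as the shift-and-mask expressions quoted later in this thread suggest. Four 6-bit registers occupy exactly 3 bytes, so a group of 4, 8, or 16 registers just repeats the same pattern. The names `demo_set_register`, `demo_unpack4`, and `demo_roundtrip_ok` are hypothetical helpers for illustration, not Redis functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: store a 6-bit value as register `idx`, LSB-first. */
void demo_set_register(uint8_t *p, unsigned idx, unsigned val) {
    unsigned bit = idx * 6, byte = bit / 8, shift = bit % 8;
    assert(val < 64);
    p[byte] = (uint8_t)((p[byte] & ~(63u << shift)) | (val << shift));
    if (shift > 2) {
        /* The register straddles a byte boundary: `kept` bits stay in the
         * current byte, the remaining high bits go to the next one. */
        unsigned kept = 8 - shift;
        p[byte + 1] = (uint8_t)((p[byte + 1] & ~(63u >> kept)) | (val >> kept));
    }
}

/* Unpack the four registers stored in the 3-byte group at `r`, using the
 * same expressions as the loop body discussed in this thread. */
void demo_unpack4(const uint8_t *r, unsigned out[4]) {
    out[0] = r[0] & 63;
    out[1] = (r[0] >> 6 | r[1] << 2) & 63;
    out[2] = (r[1] >> 4 | r[2] << 4) & 63;
    out[3] = (r[2] >> 2) & 63;
}

/* Returns 1 if 8 registers written one by one read back correctly when
 * decoded as two consecutive 3-byte groups. */
int demo_roundtrip_ok(void) {
    uint8_t buf[6] = {0};
    unsigned vals[8] = {1, 63, 0, 42, 17, 5, 60, 33};
    unsigned out[4];
    for (unsigned i = 0; i < 8; i++) demo_set_register(buf, i, vals[i]);
    for (unsigned g = 0; g < 2; g++) {
        demo_unpack4(buf + 3 * g, out);
        for (unsigned k = 0; k < 4; k++)
            if (out[k] != vals[4 * g + k]) return 0;
    }
    return 1;
}
```

Calling `demo_roundtrip_ok()` from a small `main` is enough to check that the second group of four registers starts on a byte boundary (byte 3), which is why the same four expressions can be reused with the pointer advanced by 3.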
@panzhongxian unrolled loops also mean fewer jumps are executed.
@sundb There seems no
@panzhongxian we just put benchmarks that can't be used directly with benchmarks into benchmark.c.
@sundb OK. To guarantee the
import redis

r = redis.Redis(host='localhost', port=6379)
for i in range(100000):
    key = f"key_{i}"
    r.pfadd("test_key", key)
By Then I ran the same benchmark command (
Case 2, after the change:
I also ran one more benchmark for the after-change case:
@panzhongxian I used memtier_benchmark and don't see any significant benefit from this PR. Am I missing something?
Hi, @sundb. This PR indeed brings no performance improvement, but it simplifies the code and improves readability without compromising performance.
Process 4 registers per loop iteration rather than 16.
@panzhongxian loop unrolling not only reduces the number of loop iterations but, more importantly, exploits the pipelining capabilities of modern CPUs, which can execute multiple instructions in parallel. For example, once r[5] and r[6] are loaded, the following two lines can execute at the same time because there is no dependency between them:
r7 = (r[5] >> 2) & 63;
r8 = r[6] & 63;
@sundb I understand what you mean. So I wrote a comparison test by extracting the
I built the simplified and non-simplified hllDenseRegHisto() function by whether defining
On Linux:
On Mac:
On my local PC (7950X, Ubuntu), the SIMPLIFIED build is 8% faster than the other.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define HLL_REGISTERS 16384
#define HLL_BITS 6

/* Compute the register histogram in the dense representation. */
void hllDenseRegHisto(uint8_t* registers, int* reghisto) {
  int j;
  /* Redis default is to use 16384 registers 6 bits each. The code works
   * with other values by modifying the defines, but for our target value
   * we take a faster path with unrolled loops. */
  uint8_t* r = registers;
#ifndef SIMPLIFIED
  unsigned long r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14,
      r15;
  for (j = 0; j < 1024; j++) {
    /* Handle 16 registers (12 bytes) per iteration. */
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;
    r4 = r[3] & 63;
    r5 = (r[3] >> 6 | r[4] << 2) & 63;
    r6 = (r[4] >> 4 | r[5] << 4) & 63;
    r7 = (r[5] >> 2) & 63;
    r8 = r[6] & 63;
    r9 = (r[6] >> 6 | r[7] << 2) & 63;
    r10 = (r[7] >> 4 | r[8] << 4) & 63;
    r11 = (r[8] >> 2) & 63;
    r12 = r[9] & 63;
    r13 = (r[9] >> 6 | r[10] << 2) & 63;
    r14 = (r[10] >> 4 | r[11] << 4) & 63;
    r15 = (r[11] >> 2) & 63;
    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;
    reghisto[r4]++;
    reghisto[r5]++;
    reghisto[r6]++;
    reghisto[r7]++;
    reghisto[r8]++;
    reghisto[r9]++;
    reghisto[r10]++;
    reghisto[r11]++;
    reghisto[r12]++;
    reghisto[r13]++;
    reghisto[r14]++;
    reghisto[r15]++;
    r += 12;
  }
#else
  unsigned long r0, r1, r2, r3;
  for (j = 0; j < 4096; j++) {
    /* Handle 4 registers (3 bytes) per iteration. */
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;
    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;
    r += 3;
  }
#endif
}

int main(void) {
  uint8_t registers[HLL_REGISTERS];
  int reghisto[64] = {0};
  /* Fill the buffer with pseudo-random bytes so the reads are well defined
   * and the counts are spread across the histogram bins. */
  srand(1);
  for (int i = 0; i < HLL_REGISTERS; i++) {
    registers[i] = (uint8_t)(rand() & 0xFF);
  }
  clock_t start = clock();
  for (int i = 0; i < 1000000; i++) {
    hllDenseRegHisto(registers, reghisto);
  }
  printf("time consume: %f seconds\n", ((double) (clock() - start)) / CLOCKS_PER_SEC);
  return 0;
}
In HyperLogLog, every register has 6 bits, and every 8 registers can be processed with the same logic. Therefore, the code that handles 16 registers per iteration can be simplified to handle only 8.
The comment "we take a faster path with unrolled loops" refers to replacing the per-register extraction of each value, not to duplicating the loop body twice, as can be seen in commit 3ed947f.
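To make the distinction concrete, here is a minimal sketch that contrasts a one-register-at-a-time histogram with the grouped form and checks that both produce the same counts. It assumes 16384 six-bit registers packed LSB-first, as in the benchmark posted in this thread. All `sketch_*` names are hypothetical, and `sketch_get_register` is an illustrative stand-in for per-register access, not the Redis macro.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_REGISTERS 16384                 /* assumed register count */
#define SKETCH_BYTES (SKETCH_REGISTERS * 6 / 8) /* 6 bits per register */

/* Per-register access: compute the byte and bit offset for every lookup. */
unsigned sketch_get_register(const uint8_t *p, unsigned idx) {
    assert(idx < SKETCH_REGISTERS);
    unsigned bit = idx * 6, byte = bit / 8, shift = bit % 8;
    unsigned lo = p[byte];
    /* Only read the next byte when the register straddles a boundary. */
    unsigned hi = (shift > 2) ? p[byte + 1] : 0;
    return ((lo >> shift) | (hi << (8 - shift))) & 63;
}

/* Histogram built one register at a time. */
void sketch_histo_generic(const uint8_t *p, int *histo) {
    for (unsigned i = 0; i < SKETCH_REGISTERS; i++)
        histo[sketch_get_register(p, i)]++;
}

/* Histogram built four registers (three bytes) per iteration, with the
 * offsets and shifts fixed at compile time. */
void sketch_histo_by4(const uint8_t *r, int *histo) {
    for (int j = 0; j < SKETCH_REGISTERS / 4; j++) {
        histo[r[0] & 63]++;
        histo[(r[0] >> 6 | r[1] << 2) & 63]++;
        histo[(r[1] >> 4 | r[2] << 4) & 63]++;
        histo[(r[2] >> 2) & 63]++;
        r += 3;
    }
}

/* Returns 1 when both histograms agree on a deterministic byte pattern. */
int sketch_histograms_match(void) {
    static uint8_t regs[SKETCH_BYTES];
    int a[64] = {0}, b[64] = {0};
    for (unsigned i = 0; i < SKETCH_BYTES; i++)
        regs[i] = (uint8_t)(i * 131u + 7u);
    sketch_histo_generic(regs, a);
    sketch_histo_by4(regs, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The grouped version avoids the per-lookup division and conditional load. Whether it is grouped by 4 or by 16 does not change the result, only how many times the same four expressions are written out.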