Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify hllDenseRegHisto() #13196

Open
wants to merge 2 commits into
base: unstable
Choose a base branch
from

Conversation

panzhongxian
Copy link

In HyperLogLog, every register has 6 bits and every 8 registers can be processed with same logic. Therefore, the code for handling 16 registers can be simplified to only handle 8.

The comment "we take a faster path with unrolled loops" refers to replacing obtaining the value of each register individually, as opposed to duplicating the for loop logic twice. As seen in the this commit 3ed947f.

@CLAassistant
Copy link

CLAassistant commented Apr 7, 2024

CLA assistant check
All committers have signed the CLA.

In HyperLogLog, every register has 6 bits and every 8 registers can be
processed with same logic. Therefore, the code for handling 16 registers
can be simplified to only handle 8.
@sundb
Copy link
Collaborator

sundb commented Apr 7, 2024

@panzhongxian unrolled loops also means reducing the call of jump.
did you do some benchmark to verify it?

@panzhongxian
Copy link
Author

panzhongxian commented Apr 7, 2024

@panzhongxian unrolled loops also means reducing the call of jump. did you do some benchmark to verify it?

@sundb There seems no PFADD and other hll commands in src/benchmark.c. Should add some benchmark code in this file?

@sundb
Copy link
Collaborator

sundb commented Apr 7, 2024

@panzhongxian we just put benchmarks that can't be used directly with benchmarks into benchmark.c.
in this case we can use ./src/redis-benchmark -n 10000000 PFADD key 1000

@panzhongxian
Copy link
Author

panzhongxian commented Apr 7, 2024

@panzhongxian we just put benchmarks that can't be used directly with benchmarks into benchmark.c. in this case we can use ./src/redis-benchmark -n 10000000 PFADD key 1000

@sundb OK. To garantee the hllDenseRegHisto() can be called, I set into 100000 different keys into the test_key (run this before each PFCOUNT test):

import redis

r = redis.Redis(host='localhost', port=6379)

for i in range(100000):
    key = f"key_{i}"
    r.pfadd("test_key", key)

By get test_key, I can read the encoding byte is 0(HLL_DENSE):

image

Then I run same benchmark command (./src/redis-benchmark -n 10000000 PFCOUNT test_key on MacOS ) on different local redis-server. It brings very little performance improvement or degradation.
case 1. before changing:

====== PFCOUNT test_key ======
  10000000 requests completed in 164.33 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
Summary:
  throughput summary: 60854.64 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.423     0.184     0.399     0.639     0.903    17.375

case 2. after changing:

====== PFCOUNT test_key ======
  10000000 requests completed in 170.76 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
  
Summary:
  throughput summary: 58562.41 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.440     0.208     0.415     0.663     0.895    13.279

And I do one more benchmark on after changing case :

====== PFCOUNT test_key ======
  10000000 requests completed in 167.50 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
Summary:
  throughput summary: 65174.60 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.394     0.184     0.383     0.487     0.671    11.543

@sundb
Copy link
Collaborator

sundb commented Apr 8, 2024

@panzhongxian I used memtier_benchmark and don't see any significant benefit from this PR, am I missing something?

@panzhongxian
Copy link
Author

panzhongxian commented Apr 8, 2024

Hi, @sundb. This PR indeed bring no performance improvement, but it can simplify the code and improve readability without compromising performance.

Change processing 4 registers in one loop rather than 12.
@sundb
Copy link
Collaborator

sundb commented Apr 8, 2024

@panzhongxian loop unrolling not only reduces the number of loops, but more importantly, it utilizes the pipelining capabilities of modern CPUs, which can execute multiple instructions in parallel.

for example, the following two lines of code, when r[5] and r[6] are ready, these two line can be doing at the same time, because there is no dependency between them.
i thinks we shouldn't disturbe them until there is a benefit, and maybe a potential regression.

            r7 = (r[5] >> 2) & 63;
            r8 = r[6] & 63;

@panzhongxian
Copy link
Author

panzhongxian commented Apr 8, 2024

@sundb I understand what you mean. So I write a comaprison test by extracting the hllDenseRegHisto() here.

I built the simplified and not simplified hllDenseRegHisto() function by whether defining SIMPLIFIED. Meanwhile I used the -O3 flag. These two function cost almost same time(the SIMPLIFIED cost a little less).

On Linux:

image

On Mac:

image

@sundb
Copy link
Collaborator

sundb commented Apr 8, 2024

SIMPLIFIED in my local PC(7950x ubuntu) is 8% faster than the other.
btw: I don't trust time too much, I calculate the time consumption in the code.

time consume: 5.844146 seconds   <- without SIMPLIFIED
time consume: 5.400453 seconds  <- with SIMPLIFIED
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define HLL_REGISTERS 16384
#define HLL_BITS 6
/* Compute the register histogram in the dense representation. */
void hllDenseRegHisto(uint8_t* registers, int* reghisto) {
  int j;

  /* Redis default is to use 16384 registers 6 bits each. The code works
   * with other values by modifying the defines, but for our target value
   * we take a faster path with unrolled loops. */
  uint8_t* r = registers;
#ifndef SIMPLIFIED
  unsigned long r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14,
      r15;
  for (j = 0; j < 1024; j++) {
    /* Handle 16 registers per iteration. */
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;
    r4 = r[3] & 63;
    r5 = (r[3] >> 6 | r[4] << 2) & 63;
    r6 = (r[4] >> 4 | r[5] << 4) & 63;
    r7 = (r[5] >> 2) & 63;
    r8 = r[6] & 63;
    r9 = (r[6] >> 6 | r[7] << 2) & 63;
    r10 = (r[7] >> 4 | r[8] << 4) & 63;
    r11 = (r[8] >> 2) & 63;
    r12 = r[9] & 63;
    r13 = (r[9] >> 6 | r[10] << 2) & 63;
    r14 = (r[10] >> 4 | r[11] << 4) & 63;
    r15 = (r[11] >> 2) & 63;

    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;
    reghisto[r4]++;
    reghisto[r5]++;
    reghisto[r6]++;
    reghisto[r7]++;
    reghisto[r8]++;
    reghisto[r9]++;
    reghisto[r10]++;
    reghisto[r11]++;
    reghisto[r12]++;
    reghisto[r13]++;
    reghisto[r14]++;
    reghisto[r15]++;

    r += 12;
  }
#else
  unsigned long r0, r1, r2, r3;
  for (j = 0; j < 4096; j++) {
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;

    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;

    r += 3;
  }
#endif
}

int main() {
  uint8_t registers[HLL_REGISTERS];
  int reghisto[64] = {0};

  clock_t start = clock();

  for (int i = 0; i < 1000000; i++) {
    hllDenseRegHisto(registers, reghisto);
  }

  printf("time consume: %f seconds\n", ((double) (clock() - start)) / CLOCKS_PER_SEC);
  return 0;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants