Simplify hllDenseRegHisto() #13196

panzhongxian · 2024-04-07T03:18:58Z

In HyperLogLog, every register has 6 bits and every 8 registers can be processed with same logic. Therefore, the code for handling 16 registers can be simplified to only handle 8.

The comment "we take a faster path with unrolled loops" refers to replacing obtaining the value of each register individually, as opposed to duplicating the for loop logic twice. As seen in the this commit 3ed947f.

CLAassistant · 2024-04-07T03:19:03Z

All committers have signed the CLA.

In HyperLogLog, every register has 6 bits and every 8 registers can be processed with same logic. Therefore, the code for handling 16 registers can be simplified to only handle 8.

sundb · 2024-04-07T03:35:27Z

@panzhongxian unrolled loops also means reducing the call of jump.
did you do some benchmark to verify it?

panzhongxian · 2024-04-07T03:51:22Z

@panzhongxian unrolled loops also means reducing the call of jump. did you do some benchmark to verify it?

@sundb There seems no PFADD and other hll commands in src/benchmark.c. Should add some benchmark code in this file?

sundb · 2024-04-07T04:01:18Z

@panzhongxian we just put benchmarks that can't be used directly with benchmarks into benchmark.c.
in this case we can use ./src/redis-benchmark -n 10000000 PFADD key 1000

panzhongxian · 2024-04-07T04:29:51Z

@panzhongxian we just put benchmarks that can't be used directly with benchmarks into benchmark.c. in this case we can use ./src/redis-benchmark -n 10000000 PFADD key 1000

@sundb OK. To garantee the hllDenseRegHisto() can be called, I set into 100000 different keys into the test_key (run this before each PFCOUNT test):

import redis

r = redis.Redis(host='localhost', port=6379)

for i in range(100000):
    key = f"key_{i}"
    r.pfadd("test_key", key)

By get test_key, I can read the encoding byte is 0(HLL_DENSE):

Then I run same benchmark command (./src/redis-benchmark -n 10000000 PFCOUNT test_key on MacOS ) on different local redis-server. It brings very little performance improvement or degradation.
case 1. before changing:

====== PFCOUNT test_key ======
  10000000 requests completed in 164.33 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
Summary:
  throughput summary: 60854.64 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.423     0.184     0.399     0.639     0.903    17.375

case 2. after changing:

====== PFCOUNT test_key ======
  10000000 requests completed in 170.76 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
  
Summary:
  throughput summary: 58562.41 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.440     0.208     0.415     0.663     0.895    13.279

And I do one more benchmark on after changing case :

====== PFCOUNT test_key ======
  10000000 requests completed in 167.50 seconds
  50 parallel clients
  31 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
  
Summary:
  throughput summary: 65174.60 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.394     0.184     0.383     0.487     0.671    11.543

sundb · 2024-04-08T09:27:44Z

@panzhongxian I used memtier_benchmark and don't see any significant benefit from this PR, am I missing something?

panzhongxian · 2024-04-08T10:05:08Z

Hi, @sundb. This PR indeed bring no performance improvement, but it can simplify the code and improve readability without compromising performance.

Change processing 4 registers in one loop rather than 12.

sundb · 2024-04-08T10:22:35Z

@panzhongxian loop unrolling not only reduces the number of loops, but more importantly, it utilizes the pipelining capabilities of modern CPUs, which can execute multiple instructions in parallel.

for example, the following two lines of code, when r[5] and r[6] are ready, these two line can be doing at the same time, because there is no dependency between them.
i thinks we shouldn't disturbe them until there is a benefit, and maybe a potential regression.

            r7 = (r[5] >> 2) & 63;
            r8 = r[6] & 63;

panzhongxian · 2024-04-08T11:15:14Z

@sundb I understand what you mean. So I write a comaprison test by extracting the hllDenseRegHisto() here.

I built the simplified and not simplified hllDenseRegHisto() function by whether defining SIMPLIFIED. Meanwhile I used the -O3 flag. These two function cost almost same time(the SIMPLIFIED cost a little less).

On Linux:

On Mac:

sundb · 2024-04-08T11:35:14Z

SIMPLIFIED in my local PC(7950x ubuntu) is 8% faster than the other.
btw: I don't trust time too much, I calculate the time consumption in the code.

time consume: 5.844146 seconds   <- without SIMPLIFIED
time consume: 5.400453 seconds  <- with SIMPLIFIED

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define HLL_REGISTERS 16384
#define HLL_BITS 6
/* Compute the register histogram in the dense representation. */
void hllDenseRegHisto(uint8_t* registers, int* reghisto) {
  int j;

  /* Redis default is to use 16384 registers 6 bits each. The code works
   * with other values by modifying the defines, but for our target value
   * we take a faster path with unrolled loops. */
  uint8_t* r = registers;
#ifndef SIMPLIFIED
  unsigned long r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14,
      r15;
  for (j = 0; j < 1024; j++) {
    /* Handle 16 registers per iteration. */
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;
    r4 = r[3] & 63;
    r5 = (r[3] >> 6 | r[4] << 2) & 63;
    r6 = (r[4] >> 4 | r[5] << 4) & 63;
    r7 = (r[5] >> 2) & 63;
    r8 = r[6] & 63;
    r9 = (r[6] >> 6 | r[7] << 2) & 63;
    r10 = (r[7] >> 4 | r[8] << 4) & 63;
    r11 = (r[8] >> 2) & 63;
    r12 = r[9] & 63;
    r13 = (r[9] >> 6 | r[10] << 2) & 63;
    r14 = (r[10] >> 4 | r[11] << 4) & 63;
    r15 = (r[11] >> 2) & 63;

    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;
    reghisto[r4]++;
    reghisto[r5]++;
    reghisto[r6]++;
    reghisto[r7]++;
    reghisto[r8]++;
    reghisto[r9]++;
    reghisto[r10]++;
    reghisto[r11]++;
    reghisto[r12]++;
    reghisto[r13]++;
    reghisto[r14]++;
    reghisto[r15]++;

    r += 12;
  }
#else
  unsigned long r0, r1, r2, r3;
  for (j = 0; j < 4096; j++) {
    r0 = r[0] & 63;
    r1 = (r[0] >> 6 | r[1] << 2) & 63;
    r2 = (r[1] >> 4 | r[2] << 4) & 63;
    r3 = (r[2] >> 2) & 63;

    reghisto[r0]++;
    reghisto[r1]++;
    reghisto[r2]++;
    reghisto[r3]++;

    r += 3;
  }
#endif
}

int main() {
  uint8_t registers[HLL_REGISTERS];
  int reghisto[64] = {0};

  clock_t start = clock();

  for (int i = 0; i < 1000000; i++) {
    hllDenseRegHisto(registers, reghisto);
  }

  printf("time consume: %f seconds\n", ((double) (clock() - start)) / CLOCKS_PER_SEC);
  return 0;
}

panzhongxian force-pushed the unstable branch from 0456d0c to 0960fd6 Compare April 7, 2024 03:21

Simplify hllDenseRegHisto()

c346ce5

In HyperLogLog, every register has 6 bits and every 8 registers can be processed with same logic. Therefore, the code for handling 16 registers can be simplified to only handle 8.

panzhongxian force-pushed the unstable branch from 0960fd6 to c346ce5 Compare April 7, 2024 03:22

Simplify hllDenseRegHisto()

ee0f3d3

Change processing 4 registers in one loop rather than 12.

panzhongxian force-pushed the unstable branch from 752750e to ee0f3d3 Compare April 8, 2024 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify hllDenseRegHisto() #13196

Simplify hllDenseRegHisto() #13196

panzhongxian commented Apr 7, 2024

CLAassistant commented Apr 7, 2024 •

edited

sundb commented Apr 7, 2024

panzhongxian commented Apr 7, 2024 •

edited

sundb commented Apr 7, 2024

panzhongxian commented Apr 7, 2024 •

edited

sundb commented Apr 8, 2024

panzhongxian commented Apr 8, 2024 •

edited

sundb commented Apr 8, 2024 •

edited

panzhongxian commented Apr 8, 2024 •

edited

sundb commented Apr 8, 2024 •

edited

Simplify hllDenseRegHisto() #13196

Are you sure you want to change the base?

Simplify hllDenseRegHisto() #13196

Conversation

panzhongxian commented Apr 7, 2024

CLAassistant commented Apr 7, 2024 • edited

sundb commented Apr 7, 2024

panzhongxian commented Apr 7, 2024 • edited

sundb commented Apr 7, 2024

panzhongxian commented Apr 7, 2024 • edited

sundb commented Apr 8, 2024

panzhongxian commented Apr 8, 2024 • edited

sundb commented Apr 8, 2024 • edited

panzhongxian commented Apr 8, 2024 • edited

sundb commented Apr 8, 2024 • edited

CLAassistant commented Apr 7, 2024 •

edited

panzhongxian commented Apr 7, 2024 •

edited

panzhongxian commented Apr 7, 2024 •

edited

panzhongxian commented Apr 8, 2024 •

edited

sundb commented Apr 8, 2024 •

edited

panzhongxian commented Apr 8, 2024 •

edited

sundb commented Apr 8, 2024 •

edited