Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Improve performance of boolean filters 1-100x. #14746

Merged
merged 13 commits into from
Feb 29, 2024

Conversation

orlp
Copy link
Collaborator

@orlp orlp commented Feb 28, 2024

This should improve the performance of boolean filters (that is, x.filter(y) where both columns are boolean) on all platforms, most notably on x86-64 processors with a fast PEXT instruction (which includes all processors that can run the main polars package, except AMD Zen and Zen2 processors).

This also applies to filtering columns that have nulls, as such columns have an associated validity bitmask, at least it will in the future (I still have to re-write the other filters to use this).

For N = 10^9 elements and a variety of selectivity levels of the filter, I saw the following speedups compared to 0.20.10:

ARM Apple M1:
selectivity        numpy      0.20.10      branch  speedup vs 0.20.10
0.00000       0.38286737   0.00369267  0.01368550    noise (constant time, assuming bits_set is counted)
0.00001       1.23256500   0.22485958  0.07338388     3.1x 
0.00010       1.33348025   0.25534363  0.12926171     2.0x 
0.00100       2.12070358   0.34963996  0.13418650     2.6x 
0.01000       2.62164525   1.50416246  1.03395996     1.5x 
0.10000      14.91083154   5.09605792  1.42290787     3.6x 
0.50000      42.28403404  19.43598492  0.99224012    19.6x 
0.90000      16.85105679  33.99424500  1.05689296    32.2x 
0.99000       6.31169138  18.63647588  1.21944046    15.3x 
0.99900       4.64618433   3.60898004  0.66184075     5.5x 
0.99990       4.55612183   1.56502888  0.57366629     2.7x 
0.99999       4.51338837   1.37130917  0.57086287     2.4x 
1.00000       4.37032300   0.00878733  0.00324175    noise (constant time, assuming bits_set is counted)
Intel Xeon Platinum 8481C:
selectivity        numpy      0.20.10      branch speedup vs 0.20.10
0.00000       0.88690699   0.02304096  0.02045852   noise (constant time, assuming bits_set is counted)
0.00001       2.08193143   0.30353796  0.16164698    1.9x
0.00010       2.19105948   0.31070255  0.20649040    1.5x
0.00100       2.84320821   0.46460736  0.27247562    1.7x
0.01000       4.86268612   2.14474782  0.95093239    2.3x
0.10000      26.36343429   7.78334800  0.39050875   19.9x
0.50000      62.70436204  32.31799418  0.47154556   68.5x
0.90000      25.43500997  56.82847105  0.56006352  101.5x
0.99000       8.28475304  30.93340283  0.56136180   55.1x
0.99900       6.82336972   5.97752795  0.54880103   10.9x
0.99990       6.73292214   2.59171845  0.55360964    4.7x
0.99999       6.82205267   2.24096934  0.58239432    3.8x
1.00000       6.85277840   0.01151463  0.01138404   noise (constant time, assuming bits_set is counted)  

The above numbers were generated with this script:

import polars as pl
import numpy as np
from timeit import timeit

n = 10**9
selectivities = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1]

rng = np.random.Generator(np.random.PCG64(42))
payload = rng.uniform(size = n) < 0.5

print("polars timings")
for selectivity in selectivities:
    threshold = selectivity if selectivity < 1 else 1.1  # Just to be safe.
    filt = rng.uniform(size = n) < threshold
    df = pl.DataFrame({"filt": filt, "payload": payload})
    time = timeit(lambda: df.select(pl.col.payload.filter(pl.col.filt)), number=10)
    print(f"{selectivity:.5f}  {time:11.8f}")
    
print("numpy timings")
for selectivity in selectivities:
    threshold = selectivity if selectivity < 1 else 1.1  # Just to be safe.
    filt = rng.uniform(size = n) < threshold
    time = timeit(lambda: payload[filt], number=10)
    print(f"{selectivity:.5f}  {time:11.8f}")

@stinodego stinodego changed the title Improve performance of boolean filters perf: Improve performance of boolean filters Feb 28, 2024
@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Feb 28, 2024
// let obvious_part_remaining = obvious_part % consume;
// let total_remaining = min_length_for_iter + obvious_part_remaining;
// assert!(total_remaining >= min_length_for_iter); // We have at least 1 more iter.
// assert!(obvious_part_remaining < consume); // Basic modulo property.
Copy link
Member

@ritchie46 ritchie46 Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this accidental?

Copy link
Collaborator Author

@orlp orlp Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither, it's a mathematical proof written in the style of Rust code that 1 + obvious_iters is the correct answer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah.. check.

@ritchie46 ritchie46 changed the title perf: Improve performance of boolean filters perf: Improve performance of boolean filters 1-100x. Feb 29, 2024
@ritchie46 ritchie46 merged commit c75fecd into pola-rs:main Feb 29, 2024
23 checks passed
@jtanx
Copy link

jtanx commented Feb 29, 2024

This just broke running with a skylake CPU. This release just looks broken, the feature flag detection code has not been updated to provide this flag so we're just getting unknown feature flag exceptions

return {
"sse3": bool(cpuid1.ecx & (1 << 0)),
"ssse3": bool(cpuid1.ecx & (1 << 9)),
"fma": bool(cpuid1.ecx & (1 << 12)),
"sse4.1": bool(cpuid1.ecx & (1 << 19)),
"sse4.2": bool(cpuid1.ecx & (1 << 20)),
"popcnt": bool(cpuid1.ecx & (1 << 23)),
"avx": bool(cpuid1.ecx & (1 << 28)),
"bmi1": bool(cpuid7.ebx & (1 << 3)),
"bmi2": bool(cpuid7.ebx & (1 << 8)),
"avx2": bool(cpuid7.ebx & (1 << 5)),
"lzcnt": bool(cpuid81h.ecx & (1 << 5)),
}

@stinodego
Copy link
Contributor

We are aware and will issue a new release momentarily

@orlp
Copy link
Collaborator Author

orlp commented Feb 29, 2024

Oops, I forgot that we pass all CPU flags into _cpu_check.py.

@c-peters c-peters added the accepted Ready for implementation label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants