BigInt speedup #3379

iliya-malecki · 2023-08-11T13:43:40Z

My contributions to pyo3 are available under the terms of either the Apache 2.0 OR the MIT license :)
This PR is a continuation of #3365
I am very much unsure about some parts, so feel free to comment on how it might be improved or ditched. The balance I was trying to strike is between speed and maintainability; whether I was successful isn't very clear-cut. I did succeed in not using pointer offsets and other ridiculous methods of reaching into a PyObject's memory, though.

iliya-malecki · 2023-08-11T13:46:10Z

This is the speedup I achieved (note the slowdown for small numbers)

davidhewitt

Thanks for the contribution!

So I've been reading through this and trying to get a sense of what might be the cause of the slowdown.

I noticed that internally BigInt seems to use either u32 or u64 according to the platform pointer size.

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/build.rs#L9-L12

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/biguint.rs#L527-L538

It looks to me that on 64-bit systems using Vec<u32> as an intermediary isn't beneficial, because this gets copied into a new u64 representation. So the Vec gets dropped on the floor.
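To make the extra copy concrete, here is a minimal sketch of what repacking little-endian u32 digits into u64 digits involves. The function name `pack_u32_digits` is hypothetical and this is not num-bigint's actual code; it only illustrates why the intermediate Vec<u32> ends up being read once and then discarded on a 64-bit target.

```rust
/// Pack little-endian u32 digits into little-endian u64 digits.
/// Hypothetical helper for illustration; not taken from num-bigint.
fn pack_u32_digits(digits: &[u32]) -> Vec<u64> {
    digits
        .chunks(2)
        .map(|pair| {
            // The first element of each pair is the low half of the u64 digit.
            let lo = u64::from(pair[0]);
            // A trailing odd digit has no high half, so treat it as zero.
            let hi = pair.get(1).copied().map_or(0, u64::from);
            lo | (hi << 32)
        })
        .collect() // allocates a second buffer; the u32 input is no longer needed
}
```

For example, `pack_u32_digits(&[1, 2, 3])` produces two u64 digits, the first holding 1 in its low half and 2 in its high half.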

Given that, I'm a little surprised that we've got a speedup here. Have you got an intuition for which bits of the change yielded the performance boost? I see we now skip zero-initialization, but I don't see any other clear wins...

src/conversions/num_bigint.rs

iliya-malecki · 2023-08-13T17:54:23Z

I'm puzzled again: the patch should now have 100% coverage, shouldn't it?

davidhewitt · 2023-08-16T07:57:08Z

Looking at the PR on codecov it's the error branches which have no coverage, which is often the case so not a huge surprise.

@iliya-malecki are you ok if I push some additional changes to this PR? I'm tempted to do a little further refactoring and see if I can get the small bigint performance back on par with the original implementation.

(And if you're interested in doing the upstream PR to num-bigint so that we could directly build the integer from the "native" u32 or u64 representation and avoid another Vec allocation then that would be a win for the whole ecosystem!)

iliya-malecki · 2023-08-16T08:05:30Z

I would absolutely love that. I did this PR mainly to learn Rust, so any input is very welcome (and yes, I have great plans to attempt something with num-bigint).

iliya-malecki · 2023-08-16T08:14:59Z

Also, I'm looking at the codecov page but I'm not sure what I'm seeing, where does the value 15% (of uncovered lines) come from? From two lines of returning errors? Or is it a quirk that arises from those errors themselves not being covered anywhere?

davidhewitt · 2023-08-18T12:42:39Z

Ok, I've pushed a refactoring which cut the duplication between signed & unsigned parts out into common helpers. For abi3 we now just use the bytes slice and let num_bigint do the work.

I had a quick look with miri on play.rust-lang.org using std::ptr::write_bytes as a substitute for _PyLong_AsByteArray and it seems happy with the idea of just reserving capacity and then using set_len after calling a method which operates on the vec pointer.
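A minimal sketch of that pattern, assuming a stand-in filler in place of the real C call: `fill_digits` is a hypothetical function that, like the experiment described above, uses std::ptr::write_bytes to write through the raw pointer. It is not the code from this PR.

```rust
use std::ptr;

/// Hypothetical stand-in for a C function that writes `n` u32 digits
/// through a raw pointer. Here it just fills every byte with `byte`.
unsafe fn fill_digits(dst: *mut u32, n: usize, byte: u8) {
    // write_bytes sets n * size_of::<u32>() bytes starting at dst.
    unsafe { ptr::write_bytes(dst, byte, n) }
}

/// Reserve capacity without zero-initializing, let the callee fill the
/// memory, and only then tell the Vec how many elements are valid.
fn read_digits(n_digits: usize) -> Vec<u32> {
    let mut buffer: Vec<u32> = Vec::with_capacity(n_digits);
    unsafe {
        fill_digits(buffer.as_mut_ptr(), n_digits, 0xAB);
        // Sound only because all n_digits elements were initialized above
        // and n_digits does not exceed the reserved capacity.
        buffer.set_len(n_digits);
    }
    buffer
}
```

The ordering matters: set_len is called only after the callee has initialized every element, so no uninitialized u32 is ever observable through the Vec.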

I also changed the negative integer case to do the "subtract 1" bit as part of our code rather than calling - 1 on the resulting BigInt object. For good measure I changed our tests to check the first 2000 Fibonacci numbers, and the Rust and Python sequences agree, so I'm pretty confident in our implementation.
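As an illustration of folding the adjustment into the digit buffer, the sketch below turns a negative two's complement value, stored as little-endian u32 digits, into its magnitude in place by inverting every digit and propagating a +1 carry. The name `negate_twos_complement` is hypothetical, and the code in the PR may be structured differently.

```rust
/// Convert a negative two's complement number (little-endian u32 digits)
/// into its magnitude, in place: magnitude = !x + 1.
/// Hypothetical helper for illustration only.
fn negate_twos_complement(digits: &mut [u32]) {
    let mut carry = true; // the pending "+ 1"
    for d in digits.iter_mut() {
        let inverted = !*d;
        if carry {
            // Add the carry; it keeps propagating only while the digit wraps to 0.
            let (sum, overflow) = inverted.overflowing_add(1);
            *d = sum;
            carry = overflow;
        } else {
            *d = inverted;
        }
    }
}
```

Inverting the digits and building a negative number from them gives -(!x), so subtracting 1 afterwards on the big integer is equivalent to adding 1 to the magnitude here, without a second pass over a freshly allocated value.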

This is now as fast as I can get the code. I did some profiling and unsurprisingly the conversion from u32 -> u64 form comes out as a large chunk of the runtime on my machine. @iliya-malecki if you're interested in taking on a follow-up with num-bigint to make it possible for us to pass u64 "digits" directly then I'm sure that will yield another significant speedup (maybe 2x?)

adamreichold · 2023-08-18T17:10:42Z

src/conversions/num_bigint.rs

+        )?;
+        buffer.set_len(n_digits)
+    };
+    buffer


One thing I wonder, considering that the majority of users are probably on 64-bit architectures and that, even if we fix num-bigint, being able to rely on that fix will take a while: might using Vec<u8> with _PyLong_AsByteArray and BigUint::from_bytes_le actually be faster in the 64-bit case?

(And might significantly simplify the signed case.)

I think the original solution was that, wasn't it?

Well, with additional zero-initialization and one additional copy in the limited API case, right? But yes, it would be interesting to see if there is any speedup left in the 64-bit case when Vec::with_capacity(n_digits) is replaced by vec![0; n_digits].

main:

extract_bigint_huge_negative time: [2.0350 µs 2.0353 µs 2.0356 µs]
extract_bigint_huge_positive time: [1.5174 µs 1.5177 µs 1.5181 µs]

this branch + zero initialization:

extract_bigint_huge_negative time: [1.1691 µs 1.1694 µs 1.1697 µs]
extract_bigint_huge_positive time: [1.1229 µs 1.1233 µs 1.1239 µs]

So it's not just skipping zero-initialisation giving the speedup. My intuition for what's going on (having peeked in the num_bigint code) is that the implementation we've written here to convert from u8 to u32 and also to do two's complement is better optimised than the upstream versions in these cases too.

For example, BigInt::from_signed_bytes_le takes a copy of the bytes to do two's complement on that, before then throwing that away to do a loop which uses a lot of shifting to merge u8 to u64 (where we effectively just transmute a Vec modulo endianness to achieve that conversion).

https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/bigint/convert.rs#L408C44-L408C44
https://github.com/rust-num/num-bigint/blob/65f62a8b1484448bfb9789ef4123b50556254905/src/biguint/convert.rs#L43
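To show what "transmute modulo endianness" amounts to, here is a rough sketch that reinterprets little-endian bytes as u32 digits one 4-byte word at a time, with no per-byte shifting loop. The helper name `bytes_to_u32_digits` is hypothetical; it is neither this PR's code nor num-bigint's.

```rust
/// Reinterpret little-endian bytes as little-endian u32 digits.
/// Hypothetical helper for illustration only.
fn bytes_to_u32_digits(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks(4)
        .map(|chunk| {
            // Zero-pad a short final chunk so it still forms a full word.
            let mut word = [0u8; 4];
            word[..chunk.len()].copy_from_slice(chunk);
            // On little-endian targets this is effectively a plain load.
            u32::from_le_bytes(word)
        })
        .collect()
}
```

On a little-endian machine each `from_le_bytes` call should compile down to a simple word load, which is presumably where the advantage over merging one byte at a time comes from.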

davidhewitt

As far as I'm concerned, this is as good as we can do here. It looks like there's still more to be won with improvements upstream, but I think that's out of scope for me to worry about for now.

davidhewitt reviewed Aug 11, 2023
