Optimized `transform.scale2x()` #2859

itzpr3d4t0r · 2024-05-18T14:17:37Z

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp):

Starbuck5 · 2024-05-18T15:13:39Z

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

itzpr3d4t0r · 2024-05-18T15:28:28Z

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

I'm not sure but it should be a combination of factors, especially because the macros repeated ternary ops and other operations whereas the functions do not.

Anyways I tested for this and no macros = 30-35% performance improvement:

Take a READINT24 macro for example, it expands to this code:

((srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[0] << 16 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[1] << 8 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[2])

So it does the same calculation thrice. With a function we pass in that pointer with the calculations in once and just index it so we save 2 calculations. So since we have 5 calls, we get 20 calculations with macros and 5 without.

This still applies for the WRITEINT24 macro.

Gabryel-lima · 2024-05-23T15:10:30Z

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp):

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

MyreMylar

Spent some time investigating the restrict keyword as it was new to me, but I think we'll be alright with it. Technically Cpython's 3.7 to 3.10 only promise to support 'some' features of C99 - but I think practically you won't find a real compiler that will build those CPython versions that doesn't also support restrict. Though perhaps we should clarify somewhere in our docs what C versions can build pygame-ce. 3.11 onwards support C11 interestingly.

Anyway, the code seems alright to me I don't see any problem with replacing macros with functions, in fact generally I prefer functions all else being equal. In this case it seems to help performance rather than hinder so... great!

My testing results comparing different scaling functions doubling the size of a 64x64 pixel image 1 million times:

Current main branch 2x scale speeds:
-----------------------------
SCALE: 10.366427599918097
SCALE2X: 17.69909530004952
SCALE_BY: 10.473752800025977
SMOOTHSCALE: 18.78695980005432
SMOOTHSCALE_BY: 18.953008799930103


This PR branch 2x scale speeds:
-----------------------------
SCALE: 10.371418899972923
SCALE2X: 6.727054100017995
SCALE_BY: 10.522077800007537
SMOOTHSCALE: 18.166397199966013
SMOOTHSCALE_BY: 18.299656499992125

Test program:

from timeit import timeit

import pygame
from pygame.transform import scale, scale_by, smoothscale, smoothscale_by, scale2x

pygame.init()
WIDTH, HEIGHT = 500, 500
win = pygame.display.set_mode((WIDTH, HEIGHT))

IMAGE = pygame.image.load("images/glow.png").convert()
GLOB = {
    "scale": scale,
    "IMAGE": IMAGE,
    "scale2x": scale2x,
    "scale_by": scale_by,
    "smoothscale": smoothscale,
    "smoothscale_by": smoothscale_by,
}


def test(test_str: str):
    print(timeit(test_str, globals=GLOB, number=100000))


print("SCALE: ", end="")
test("scale(IMAGE, (128, 128))")

print("SCALE2X: ", end="")
test("scale2x(IMAGE)")

print("SCALE_BY: ", end="")
test("scale_by(IMAGE, 2.0)")

print("SMOOTHSCALE: ", end="")
test("smoothscale(IMAGE, (128, 128))")

print("SMOOTHSCALE_BY: ", end="")
test("smoothscale_by(IMAGE, 2.0)")

So this seems to actually restore the promise of scale2x to actually be faster than scale again. I think previous optimisations to scale had made it actually worse as a doubler than the baseline scale.

Approved!

ankith26 · 2024-05-24T15:08:09Z

It is interesting to see that pre-this-pr the performance of scale2x is actually worse than the generic scale for your usage.

Starbuck5 · 2024-05-28T02:43:02Z

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

Starbuck5 · 2024-05-28T02:52:18Z

src_c/scale2x.c

+                Uint8 *src_row = srcpix + looph * srcpitch;
+                Uint8 *dst_row0 = dstpix + looph * 2 * dstpitch;
+                Uint8 *dst_row1 = dstpix + (looph * 2 + 1) * dstpitch;
+
+                Uint8 *src_row_prev = srcpix + MAX(0, looph - 1) * srcpitch;
+                Uint8 *src_row_next =
+                    srcpix + MIN(height - 1, looph + 1) * srcpitch;


So in your PR description you said you had optimized using pointers and using functions instead of macros, but reading this code it seems like the main optimization is actually some manual Loop-invariant code motion.

Very clever! This reminds me of your Rect union PR, also a great way of looking at existing code and making it faster in a straightforward way.

It’s a shame the compiler doesn’t do that automatically in these cases.

MyreMylar · 2024-05-28T09:03:17Z

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

itzpr3d4t0r · 2024-05-28T09:34:46Z

I have to admit restrict is sort of a weird optimization path, I observed some minor-medium benefits but only in specific size ranges so I guess it's not worth the concern.

Gabryel-lima · 2024-05-28T14:09:31Z

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

@Starbuck5 Wow, very interesting! Thanks!

Starbuck5 · 2024-05-30T06:25:45Z

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

It should, I'm just being cautious... Maybe if you constructed a Surface with shared memory from another object you could edit it while running something else?

I don't have a solid scenario in mind.

optimized all Bpp cases of scale2x

39539df

itzpr3d4t0r added Performance Related to the speed or resource usage of the project transform pygame.transform labels May 18, 2024

itzpr3d4t0r requested a review from a team as a code owner May 18, 2024 14:17

finish 3Bpp case

a4cbd0c

MyreMylar approved these changes May 24, 2024

View reviewed changes

Starbuck5 reviewed May 28, 2024

View reviewed changes

removed restrict keyword, added Bpp var

4e41d84

itzpr3d4t0r requested a review from a team as a code owner May 28, 2024 09:28

format

8acbe81

itzpr3d4t0r requested a review from Starbuck5 June 11, 2024 10:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized `transform.scale2x()` #2859

Optimized `transform.scale2x()` #2859

itzpr3d4t0r commented May 18, 2024 •

edited

Starbuck5 commented May 18, 2024

itzpr3d4t0r commented May 18, 2024 •

edited

Gabryel-lima commented May 23, 2024

MyreMylar left a comment •

edited

ankith26 commented May 24, 2024

Starbuck5 commented May 28, 2024 •

edited by MyreMylar

Starbuck5 May 28, 2024 •

edited

MyreMylar commented May 28, 2024

itzpr3d4t0r commented May 28, 2024

Gabryel-lima commented May 28, 2024

Starbuck5 commented May 30, 2024 •

edited

Optimized transform.scale2x() #2859

Are you sure you want to change the base?

Optimized transform.scale2x() #2859

Conversation

itzpr3d4t0r commented May 18, 2024 • edited

Starbuck5 commented May 18, 2024

itzpr3d4t0r commented May 18, 2024 • edited

Gabryel-lima commented May 23, 2024

MyreMylar left a comment • edited

Choose a reason for hiding this comment

ankith26 commented May 24, 2024

Starbuck5 commented May 28, 2024 • edited by MyreMylar

Starbuck5 May 28, 2024 • edited

Choose a reason for hiding this comment

MyreMylar commented May 28, 2024

itzpr3d4t0r commented May 28, 2024

Gabryel-lima commented May 28, 2024

Starbuck5 commented May 30, 2024 • edited

Optimized `transform.scale2x()` #2859

Optimized `transform.scale2x()` #2859

itzpr3d4t0r commented May 18, 2024 •

edited

itzpr3d4t0r commented May 18, 2024 •

edited

MyreMylar left a comment •

edited

Starbuck5 commented May 28, 2024 •

edited by MyreMylar

Starbuck5 May 28, 2024 •

edited

Starbuck5 commented May 30, 2024 •

edited