Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized transform.scale2x() #2859

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

itzpr3d4t0r
Copy link
Member

@itzpr3d4t0r itzpr3d4t0r commented May 18, 2024

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp):
Figure_1

@itzpr3d4t0r itzpr3d4t0r added Performance Related to the speed or resource usage of the project transform pygame.transform labels May 18, 2024
@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner May 18, 2024 14:17
@Starbuck5
Copy link
Member

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

@itzpr3d4t0r
Copy link
Member Author

itzpr3d4t0r commented May 18, 2024

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

I'm not sure but it should be a combination of factors, especially because the macros repeated ternary ops and other operations whereas the functions do not.

Anyways I tested for this and no macros = 30-35% performance improvement:

Figure_1

Take a READINT24 macro for example, it expands to this code:

((srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[0] << 16 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[1] << 8 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[2])

So it does the same calculation thrice. With a function we pass in that pointer with the calculations in once and just index it so we save 2 calculations. So since we have 5 calls, we get 20 calculations with macros and 5 without.

This still applies for the WRITEINT24 macro.

@Gabryel-lima
Copy link

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp): Figure_1

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

Copy link
Member

@MyreMylar MyreMylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent some time investigating the restrict keyword as it was new to me, but I think we'll be alright with it. Technically Cpython's 3.7 to 3.10 only promise to support 'some' features of C99 - but I think practically you won't find a real compiler that will build those CPython versions that doesn't also support restrict. Though perhaps we should clarify somewhere in our docs what C versions can build pygame-ce. 3.11 onwards support C11 interestingly.

Anyway, the code seems alright to me I don't see any problem with replacing macros with functions, in fact generally I prefer functions all else being equal. In this case it seems to help performance rather than hinder so... great!

My testing results comparing different scaling functions doubling the size of a 64x64 pixel image 1 million times:

Current main branch 2x scale speeds:
-----------------------------
SCALE: 10.366427599918097
SCALE2X: 17.69909530004952
SCALE_BY: 10.473752800025977
SMOOTHSCALE: 18.78695980005432
SMOOTHSCALE_BY: 18.953008799930103


This PR branch 2x scale speeds:
-----------------------------
SCALE: 10.371418899972923
SCALE2X: 6.727054100017995
SCALE_BY: 10.522077800007537
SMOOTHSCALE: 18.166397199966013
SMOOTHSCALE_BY: 18.299656499992125

Test program:

from timeit import timeit

import pygame
from pygame.transform import scale, scale_by, smoothscale, smoothscale_by, scale2x

pygame.init()
WIDTH, HEIGHT = 500, 500
win = pygame.display.set_mode((WIDTH, HEIGHT))

IMAGE = pygame.image.load("images/glow.png").convert()
GLOB = {
    "scale": scale,
    "IMAGE": IMAGE,
    "scale2x": scale2x,
    "scale_by": scale_by,
    "smoothscale": smoothscale,
    "smoothscale_by": smoothscale_by,
}


def test(test_str: str):
    print(timeit(test_str, globals=GLOB, number=100000))


print("SCALE: ", end="")
test("scale(IMAGE, (128, 128))")

print("SCALE2X: ", end="")
test("scale2x(IMAGE)")

print("SCALE_BY: ", end="")
test("scale_by(IMAGE, 2.0)")

print("SMOOTHSCALE: ", end="")
test("smoothscale(IMAGE, (128, 128))")

print("SMOOTHSCALE_BY: ", end="")
test("smoothscale_by(IMAGE, 2.0)")

So this seems to actually restore the promise of scale2x to actually be faster than scale again. I think previous optimisations to scale had made it actually worse as a doubler than the baseline scale.

Approved!

@ankith26
Copy link
Member

It is interesting to see that pre-this-pr the performance of scale2x is actually worse than the generic scale for your usage.

@Starbuck5
Copy link
Member

Starbuck5 commented May 28, 2024

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

Comment on lines +117 to +123
Uint8 *src_row = srcpix + looph * srcpitch;
Uint8 *dst_row0 = dstpix + looph * 2 * dstpitch;
Uint8 *dst_row1 = dstpix + (looph * 2 + 1) * dstpitch;

Uint8 *src_row_prev = srcpix + MAX(0, looph - 1) * srcpitch;
Uint8 *src_row_next =
srcpix + MIN(height - 1, looph + 1) * srcpitch;
Copy link
Member

@Starbuck5 Starbuck5 May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in your PR description you said you had optimized using pointers and using functions instead of macros, but reading this code it seems like the main optimization is actually some manual Loop-invariant code motion.

Very clever! This reminds me of your Rect union PR, also a great way of looking at existing code and making it faster in a straightforward way.

It’s a shame the compiler doesn’t do that automatically in these cases.

@MyreMylar
Copy link
Member

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

@itzpr3d4t0r itzpr3d4t0r requested a review from a team as a code owner May 28, 2024 09:28
@itzpr3d4t0r
Copy link
Member Author

I have to admit restrict is sort of a weird optimization path, I observed some minor-medium benefits but only in specific size ranges so I guess it's not worth the concern.

@Gabryel-lima
Copy link

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

@Starbuck5 Wow, very interesting! Thanks!

@Starbuck5
Copy link
Member

Starbuck5 commented May 30, 2024

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

It should, I'm just being cautious... Maybe if you constructed a Surface with shared memory from another object you could edit it while running something else?

I don't have a solid scenario in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Related to the speed or resource usage of the project transform pygame.transform
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants