
feat: use block matrix to reduce peak memory usage for matmul #3947

Merged
4 commits merged into main on Mar 7, 2025

Conversation

@badGarnet (Collaborator) commented Mar 6, 2025

This PR targets the most memory-expensive operation in partitioning PDFs and images: deduplicating pdfminer elements. On large pages the number of elements can exceed 10k, which generates multiple 10k x 10k square double-precision float matrices during deduplication, pushing peak memory usage close to 13 GB.
(Screenshot: memory profile before the change, peak near 13 GB.)

This PR breaks the computation down by computing partial IOU: it computes IOU for 2000 elements at a time against all the elements, reducing peak memory usage by about 10x, to around 1.6 GB.
(Screenshot: memory profile after the change, peak around 1.6 GB.)
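The blockwise scheme described above can be sketched roughly like this. This is a minimal illustration, not the PR's actual code: the function name `blockwise_iou`, the `(x1, y1, x2, y2)` box layout, and the block size default are all assumptions.

```python
import numpy as np


def blockwise_iou(boxes: np.ndarray, block_size: int = 2000) -> np.ndarray:
    """Compute the full n x n pairwise IOU matrix in row blocks.

    Only a (block_size x n) slab of temporaries is materialized at a time,
    instead of several full n x n float64 matrices at once, which is the
    source of the peak-memory savings described in the PR.
    """
    n = len(boxes)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    iou = np.empty((n, n), dtype=np.float64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Intersection of each box in this row block against all n boxes.
        ix1 = np.maximum(x1[start:stop, None], x1[None, :])
        iy1 = np.maximum(y1[start:stop, None], y1[None, :])
        ix2 = np.minimum(x2[start:stop, None], x2[None, :])
        iy2 = np.minimum(y2[start:stop, None], y2[None, :])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        union = areas[start:stop, None] + areas[None, :] - inter
        iou[start:stop] = inter / np.maximum(union, 1e-12)
    return iou
```

The result is identical to the one-shot computation; only the size of the intermediate arrays changes.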

The block size is configurable based on the user's preference for peak memory usage; it is set via the env var UNST_MATMUL_MEMORY_CAP_IN_GB.
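One way such a cap could translate into a block size is sketched below. The env var name comes from the PR description, but the helper name, the count of simultaneous temporaries, and the exact formula are all illustrative assumptions, not the library's implementation.

```python
import os


def block_size_from_memory_cap(n_elements: int, default_cap_gb: float = 1.0) -> int:
    """Derive a row-block size from a memory cap (hypothetical helper).

    Each row block materializes a handful of (block_size x n_elements)
    float64 slabs at ~8 bytes per cell; solve for the largest block_size
    that keeps those slabs under the cap.
    """
    cap_gb = float(os.environ.get("UNST_MATMUL_MEMORY_CAP_IN_GB", default_cap_gb))
    slabs = 6  # rough count of temporaries alive at once (assumption)
    bytes_per_row = n_elements * 8 * slabs
    return max(1, int(cap_gb * 1e9 // bytes_per_row))
```

For example, with a 1 GB cap and 10k elements this yields a block of about 2000 rows, in line with the "2000 elements at a time" figure above.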

@badGarnet badGarnet marked this pull request as ready for review March 6, 2025 21:53
CHANGELOG.md Outdated
@@ -1,8 +1,9 @@
## 0.16.24-dev4
## 0.16.24-dev5

What do you think of cutting a new release at this point?

@ctrahey left a comment:

Do you have timing observations for this?
If it is a significant regression in time, we can consider processpool...

Nice work!

@christinestraub (Collaborator) left a comment:

What do you think of creating an experiment PR to check metrics changes in core-product before merging this PR? I think it would be worth having an experiment PR in core-product first for all features that could impact metrics scores.

@ctrahey commented Mar 6, 2025

Do you have timing observations for this? If it is a significant regression in time, we can consider processpool...

Nice work!

Thinking on this more... there is nuance here, because if the reduction in memory scales linearly with time, then nothing is gained by moving to a process pool... unless each worker handling a block is faster due to the smaller overall job size. Then it would be a net win.

@badGarnet (Collaborator, Author) commented:

Do you have timing observations for this? If it is a significant regression in time, we can consider processpool...

Nice work!

It is actually slightly faster, because of the counterintuitive way the CPU works: this reduces CPU cache trips, since the computation fits into cache more easily. One could even optimize for a specific CPU by tweaking the block size smaller.

@badGarnet (Collaborator, Author) commented:

What do you think of creating an experiment PR to check metrics changes in core-product before merging this PR? I think it would be worth having an experiment PR in core-product first for all features that could impact metrics scores.

This change doesn't alter any logic at all; it is purely a different procedure to compute the same thing.
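That claim is easy to sanity-check in isolation: a blockwise matrix product assembles exactly the same matrix as the one-shot product. A small illustrative numpy sketch (sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((500, 64))

# Full Gram matrix computed in one shot.
full = a @ a.T

# The same matrix assembled from 100-row blocks, as in the PR's approach.
blocked = np.vstack([a[i : i + 100] @ a.T for i in range(0, 500, 100)])

# Blockwise and one-shot results agree to floating-point tolerance.
assert np.allclose(full, blocked)
```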

@badGarnet badGarnet enabled auto-merge March 6, 2025 23:53
@ctrahey commented Mar 6, 2025

Do you have timing observations for this? If it is a significant regression in time, we can consider processpool...
Nice work!

It is actually slightly faster, because of the counterintuitive way the CPU works: this reduces CPU cache trips, since the computation fits into cache more easily. One could even optimize for a specific CPU by tweaking the block size smaller.

Heck yeah - I was hoping you'd say that!

@badGarnet badGarnet added this pull request to the merge queue Mar 7, 2025
Merged via the queue into main with commit 961c8d5 Mar 7, 2025
43 checks passed
@badGarnet badGarnet deleted the feat/use-block-matrix-to-cap-memory-usage branch March 7, 2025 01:01