feat: use block matrix to reduce peak memory usage for matmul #3947
Conversation
CHANGELOG.md (outdated)

@@ -1,8 +1,9 @@
-## 0.16.24-dev4
+## 0.16.24-dev5
What do you think of cutting a new release at this point?
Do you have timing observations for this?
If it is a significant regression in time, we can consider processpool...
Nice work!
What do you think of creating an experiment PR to check metrics changes in core-product before merging this PR? I think it would be worth having an experiment PR in core-product first for any feature that could impact metrics scores.
Thinking on this more... there is nuance, because if the reduction in memory scales linearly with time, then nothing is gained by doing it in a processpool... unless each thread working on a block is faster with the smaller overall "job size". Then it would be a net win.
It is actually slightly faster, because of the counterintuitive way the CPU works: this reduces CPU cache trips, since the computation fits into cache more easily. One can even optimize for a specific CPU by tweaking the block size smaller.
This change doesn't alter any logic at all; it is purely a different procedure to compute the same thing.
Heck yeah - I was hoping you'd say that!
This PR targets the most memory-expensive operation in partitioning PDFs and images: deduplicating pdfminer elements. On large pages the number of elements can exceed 10k, which generates multiple 10k x 10k square double-float matrices during deduplication, pushing peak memory usage close to 13 GB.
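For a rough sense of scale (a back-of-the-envelope estimate, not a measurement from this PR), a single dense 10k x 10k float64 matrix is already about 0.8 GB, so holding a handful of such temporaries at once quickly reaches double-digit gigabytes:

```python
# Back-of-the-envelope arithmetic only; assumes ~10k elements and float64
# intermediates. The exact number of temporaries alive at once depends on
# how the pairwise computation is written.
n = 10_000
one_matrix_gb = n * n * 8 / 1e9  # size of one dense n x n float64 matrix
print(one_matrix_gb)             # -> 0.8 (GB); on the order of 16 such
                                 #    temporaries reaches the ~13 GB above.
```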

This PR breaks the computation down by computing partial IOU: more precisely, it computes IOU for each block of 2,000 elements against all elements at a time, reducing peak memory usage by about 10x, to around 1.6 GB. A sketch of the idea is shown below.
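As an illustration of the block-matrix idea (a minimal sketch, not the actual implementation in this PR; the box layout, dtype, and the helper name `blockwise_iou` are assumptions), the loop below materializes only `(block_size, n)` intermediates instead of several `(n, n)` ones:

```python
import numpy as np


def blockwise_iou(boxes: np.ndarray, block_size: int = 2000) -> np.ndarray:
    """Pairwise IOU computed block by block to bound peak memory.

    ``boxes`` is an (n, 4) array of (x1, y1, x2, y2) rows. Only the final
    (n, n) IOU matrix plus a few (block_size, n) temporaries are alive at
    any point, instead of several (n, n) temporaries at once.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    n = len(boxes)
    iou = np.empty((n, n), dtype=np.float64)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Intersection rectangle of the current block against all boxes.
        ix1 = np.maximum(x1[start:end, None], x1[None, :])
        iy1 = np.maximum(y1[start:end, None], y1[None, :])
        ix2 = np.minimum(x2[start:end, None], x2[None, :])
        iy2 = np.minimum(y2[start:end, None], y2[None, :])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        union = areas[start:end, None] + areas[None, :] - inter
        iou[start:end] = inter / np.maximum(union, 1e-12)
    return iou
```

The full `(n, n)` result is still allocated in this sketch, so the savings come from the temporaries; a deduplication pass could also consume each block immediately and avoid holding the full matrix at all.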

The block size is configurable based on the user's preferred peak memory usage; it is set via the environment variable `UNST_MATMUL_MEMORY_CAP_IN_GB`.
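One plausible way such a cap could map to a block size (purely hypothetical; the helper name and the assumed number of live temporaries are not from this PR) is:

```python
import os


def block_size_from_memory_cap(n_elements: int, n_temporaries: int = 4) -> int:
    """Translate a memory cap in GB into a block size.

    Hypothetical sketch: assumes float64 entries and that roughly
    ``n_temporaries`` arrays of shape (block, n_elements) are alive at once.
    """
    cap_gb = float(os.environ.get("UNST_MATMUL_MEMORY_CAP_IN_GB", "1"))
    bytes_per_block_row = n_elements * 8 * n_temporaries
    return max(1, int(cap_gb * 1e9 // bytes_per_block_row))
```

In this sketch, setting `UNST_MATMUL_MEMORY_CAP_IN_GB=2` simply doubles the allowed slab size relative to a 1 GB cap; the real mapping inside the library may differ.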