
BUG: Allow empty memmaps in most situations #27723

Merged (10 commits into numpy:main, Nov 18, 2024)
Conversation

Contributor

@laarohi commented Nov 8, 2024

@@ -276,6 +276,9 @@ def __new__(subtype, filename, dtype=uint8, mode='r+', offset=0,

start = offset - offset % mmap.ALLOCATIONGRANULARITY
bytes -= start
if bytes == 0 and start > 0:
    bytes += mmap.ALLOCATIONGRANULARITY
    start -= mmap.ALLOCATIONGRANULARITY
Member

@seberg commented Nov 9, 2024


This needs more explanation, I think. To me it also looks like this will simply fail if you pass offset=0. Even if I am missing something, that means it needs testing.
I.e. it looks a bit like it fixes your specific issue, but may not improve things generally.

It seems like a 0 size memmap is quite fraught to begin with, so I am not sure what is right. Setting the bytes to be at least 1 may make sense, but on Linux it may also fail if the size is actually 0, it seems (so at the very least it would also lead to regressions in some places).
Of course sure, if an offset is used, that offset can be adjusted so that we don't need to "read" a byte beyond the end.

EDIT: Ack, sorry I guess the start > 0 prevents things, but it still would be nice to be clear that this only works with offsets. And it would be good to think about what to do about the other case.

Contributor Author


Agreed, I could have been more clear.

So just to clarify, on Linux (haven't tested on Windows), in the current numpy implementation:

  • Works:
    • 0 size memmap with offset > 0 and offset % mmap.ALLOCATIONGRANULARITY != 0
  • Fails:
    • 0 size memmap with offset = 0
    • 0 size memmap with offset > 0 and offset % mmap.ALLOCATIONGRANULARITY == 0

With the suggested fix:

  • Works:
    • 0 size memmap with offset > 0
  • Fails:
    • 0 size memmap with offset = 0

I agree that a 0 size memmap is quite fraught, but it would be nice if the behaviour of a 0 size memmap was independent of whether the offset was a multiple of the allocation granularity or not.

An alternative solution would be to not allow 0 size memmaps at all.

I do see some logic as to why it makes sense to allow a 0 size memmap with offset > 0, as the offset typically implies that the file contains some sort of metadata besides the 0 size array. For a 0 size memmap with offset = 0 the file would be completely empty, so there is nothing to map at all (mmap.mmap with length=0 even fails on Windows for an empty file).
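The matrix above can be checked against a small sketch of the window arithmetic. This is a paraphrase of the patched region, not numpy's exact code: `compute_map_window` and `nbytes` are illustrative names, but the computation mirrors the diff (`bytes` starts as offset plus array size in bytes).

```python
import mmap

def compute_map_window(offset, length):
    """Illustrative paraphrase of the window arithmetic around the
    patched lines. Returns (start, nbytes): the aligned file offset
    passed to mmap and the number of bytes requested."""
    nbytes = offset + length  # end of the requested region
    start = offset - offset % mmap.ALLOCATIONGRANULARITY
    nbytes -= start
    # The fix under discussion: when the aligned start coincides with
    # the end of the requested region, back up one granule rather than
    # asking mmap for a 0-byte window (gh-27723).
    if nbytes == 0 and start > 0:
        nbytes += mmap.ALLOCATIONGRANULARITY
        start -= mmap.ALLOCATIONGRANULARITY
    return start, nbytes
```

With G = mmap.ALLOCATIONGRANULARITY, a 0 size array at offset G used to request a 0-byte map (G, 0) and now requests (0, G); offset G + 1 already worked, giving (G, 1); offset 0 still ends up requesting 0 bytes, matching the remaining failing case.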

Member


Yeah, thanks. I have two asks:

  1. Please add a test for arr.offset being correct, even though you are messing around with the start. (I suspect it already is.)
  2. Can you add an (ideally not too long) code comment about bytes=0 being problematic, and how we avoid it easily if offset != 0? It can reference this PR, gh-27723.

But maybe we can rethink the bytes=0 part as well? It seems to me that it is impossible to memory map an empty file on all systems. In other words, mapping 0 bytes probably works on Linux, but only because it maps the full file and then reads nothing.

I think that means that we can:

  • Use bytes=1 if offset=0
  • Use your trick here if offset > 0 in order to not grow the file unnecessarily. Although, I don't think it would matter too much, that seems pretty safe to me.

Or am I missing some path where if bytes == 0 and offset == 0: bytes = 1 might backfire?
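The combined approach discussed here (pad to 1 byte when offset == 0, shift the window back by one granule otherwise) can be sketched with the plain mmap module. This is an illustration of the idea, not numpy's implementation; `open_empty_memmap` and its details are hypothetical, and the need to write a padding byte is why a writable mode ('r+' or 'w+') is required.

```python
import mmap
import os

def open_empty_memmap(path, offset):
    """Map a zero-length region of `path` at `offset` (illustrative only).

    Combines the two tricks from this thread: back up one allocation
    granule when offset > 0 lands exactly on a granularity boundary,
    and map a single padding byte when offset == 0 (an empty file
    cannot be mapped at all)."""
    granularity = mmap.ALLOCATIONGRANULARITY
    start = offset - offset % granularity
    nbytes = offset - start
    if nbytes == 0:
        if start > 0:
            # offset is an exact multiple of the granularity: request
            # the previous granule instead of a 0-byte window.
            start -= granularity
            nbytes = granularity
        else:
            # offset == 0: the file would be completely empty, so pad
            # it with one byte to have something to map.
            nbytes = 1
    with open(path, "r+b") as f:
        needed = start + nbytes
        if os.fstat(f.fileno()).st_size < needed:
            f.seek(needed - 1)
            f.write(b"\0")  # grow the file so the window is backed
            f.flush()
        return mmap.mmap(f.fileno(), nbytes, offset=start)
```

Starting from an empty file, offset=0 yields a 1-byte map, and offset equal to the allocation granularity yields a full-granule map starting at 0, so neither path ever asks mmap for 0 bytes.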

Contributor Author


I've added a test for arr.offset == offset; please confirm that this is what you had in mind, as the test seems a bit trivial to me (but it could well be that I misunderstood what you were asking).

I've also added a code comment as asked.

I also had a think about the 0 size memmap with offset=0, and your suggestion to use bytes=1 makes sense. I've implemented it in such a way that it requires mode in ('w+', 'r+'), since we will be writing a byte to the file. There was an existing test for an empty memmap, which I have adjusted accordingly.

Have a look at the changes and let me know what you think.

Luke Aarohi added 4 commits November 13, 2024 11:04

Member

@seberg left a comment


Thanks, I think this is ready, unless anyone is worried about padding empty files with a single byte to make almost all zero-size memmaps work.

@laarohi if you feel like it, I think it would make sense to add a brief enhancement release note (just a bullet point even) that empty memmaps are now better supported.
(They go into docs/release/upcoming_changes; there is a README.md with some notes.)

laarohi and others added 5 commits November 18, 2024 10:16
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Contributor Author

laarohi commented Nov 18, 2024

@seberg enhancement release note added.

Member

seberg commented Nov 18, 2024

Thanks for the follow-ups especially, @laarohi! Let's give this a shot. Is a backport important to you? Otherwise, I am not sure this is a change that really needs one.

@seberg changed the title from "BUG: error empty memmap offset is a multiple of allocation granularity" to "BUG: Allow empty memmaps in most situations" on Nov 18, 2024
@seberg merged commit f3cca78 into numpy:main Nov 18, 2024
69 checks passed
Contributor Author

laarohi commented Nov 18, 2024

No need for a backport. Thanks for reviewing & merging @seberg !

ArvidJB pushed a commit to ArvidJB/numpy that referenced this pull request Jan 8, 2025
Allow empty memmap in more cases, by:
* avoiding `bytes=0` when an offset is used (but is a multiple of the allocation granularity)
* using a size of 1 when a memmap is created fresh but is empty.

---

* BUG: error empty memmap offset is a multiple of allocation granularity

* DOC: added a code comment explaining issue with `bytes==0` in memmap

* TST: test for arr.offset being correct

* ENH: allow empty memmap

* TST: adjust test for empty memmap

* STY: numpy/_core/tests/test_memmap.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* STY: Update numpy/_core/tests/test_memmap.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* STY: Update numpy/_core/tests/test_memmap.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* TST: Update numpy/_core/tests/test_memmap.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* DOC: added enhancement release note for gh-27723

---------

Co-authored-by: Luke Aarohi <luke@dormouse.com>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
nfrasser added a commit to cryoem-uoft/cryosparc-tools that referenced this pull request Mar 4, 2025
We found an issue where datasets of this file size fail to load due to a bug in numpy < 2.2. It appears to be related to the page size on Linux.

See numpy/numpy#27723 (released in 2.2, but we have to support older versions)
Successfully merging this pull request may close these issues.

BUG: numpy._core.memmap Error when offset is a multiple of allocation granularity