-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Allow empty memmaps in most situations #27723
Conversation
laarohi
commented
Nov 8, 2024
- Fixes BUG: numpy._core.memmap Error when offset is a multiple of allocation granularity #27722
- Added a test case to confirm the fix.
@@ -276,6 +276,9 @@ def __new__(subtype, filename, dtype=uint8, mode='r+', offset=0, | |||
|
|||
start = offset - offset % mmap.ALLOCATIONGRANULARITY | |||
bytes -= start | |||
if bytes == 0 and start > 0: | |||
bytes += mmap.ALLOCATIONGRANULARITY | |||
start -= mmap.ALLOCATIONGRANULARITY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more explanation, I think. To me it also looks like this will simply fail if you pass offset=0
. Even if I am missing something, that means it needs testing.
I.e. it looks a bit like it fixes your specific issue, but may not improve things generally.
It seems like a 0 size memmap is quite fraught to begin with, so I am not sure what is right. Setting the bytes
to be at least 1 may make sense, but on linux may also fail if the size is actually 0 it seems (so at the very least it would also lead to a regressions in some places).
Of course sure, if an offset is used, that offset can be adjusted so that we don't need to "read" a byte beyond the end.
EDIT: Ack, sorry I guess the start > 0
prevents things, but it still would be nice to be clear that this only works with offsets. And it would be good to think about what to do about the other case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I could have been more clear.
So just to clarify, on Linux (haven't tested on Windows), in the current numpy implementation:
- Works:
- 0 size memmap with
offset > 0 and offset % mmap.ALLOCATIONGRANULARITY != 0
- 0 size memmap with
- Fails:
- 0 size memmap with
offset = 0
- 0 size memmap with
offset > 0 and offset % mmap.ALLOCATIONGRANULARITY == 0
- 0 size memmap with
With the suggested fix:
- Works:
- 0 size memmap with
offset > 0
- 0 size memmap with
- Fails:
- 0 size memmap with
offset = 0
- 0 size memmap with
I agree that a 0 size memmap is quite fraught, but it would be nice if the behaviour of a 0 size memmap was independent of whether the offset was a multiple of the allocation granularity or not.
An alternative solution would be to not allow 0 size memmaps at all.
I do see some logic as to why it makes sense to allow a 0 size memmap with offset > 0
as this offset typically implies that the file contains some sort of metadata besides for the 0 size array. For a 0 size memmap with offset = 0
the file would be completely empty and thus there is nothing to map at all (mmap.mmap
with length=0
even fails on Windows for an empty file (see here)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeay, thanks. I have two asks:
- Please add a test for
arr.offset
being correct, even though you are messing around with the start. (I suspect it already is.) - Can you add (ideally not too long) code comment about
bytes=0
being problematic but we can avoid it easily ifoffset!=0
. It can reference to this PRgh-27723
.
But, maybe we can rethink the bytes=0
part as well? It seems to me that it is impossible to memory map an empty file on all systems. In other words, mapping 0 bytes works on linux probably, but only because it maps the full file and then reads nothing.
I think that means that we can:
- Use
bytes=1
ifoffset=0
- Use your trick here if
offset > 0
in order to not grow the file unnecessarily. Although, I don't think it would matter too much, that seems pretty safe to me.
Or am I missing some path where if bytes == 0 and offset == 0: bytes = 1
might backfire?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test for arr.offset == offset
, please confirm that this is what you had in mind as the test seems a bit trivial to me (but it could very well be that I misunderstood what you were asking).
I've also added a code comment as asked.
I also had a think about the 0 size memmap
with offset=0
and your suggestion to use bytes=1
makes sense. I've implemented it in such a way that it requires mode in ('w+', 'r+')
since we will be writing a byte to the file. There was an existing test for an empty memmap
which I have adjusted accordingly.
Have a look at the changes and let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, i think this is ready, unless anyone is worried about padding empty files with a single byte to make almost all zero-size memmaps works.
@laarohi if you feel like this, I think it would make sense to add a brief enhancement release note (just a bullet point even) that empty memmaps are now better supported.
(They go into docs/release/upcoming_changes
there is a README.md with some notes.
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg enhancement release note added. |
Thanks for the followups especially, @laarohi! Let's give this a shot. Is a backport important to you? Otherwise, I am not sure if I feel this is a change that really needs it. |
No need for a backport. Thanks for reviewing & merging @seberg ! |
Allow empty memmap in more cases, by: * avoiding `bytes=0` when an offset is used (but it is a multiple of allocation granularity) * Use a size of 1 when a memmap is created fresh but it is empty. --- * BUG: error empty memmap offset is a multiple of allocation granularity * DOC: added a code comment explaining issue with `bytes==0` in memmap * TST: test for arr.offset being correct * ENH: allow empty memmap * TST: adjust test for empty memmap * STY: numpy/_core/tests/test_memmap.py Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net> * STY: Update numpy/_core/tests/test_memmap.py Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net> * STY: Update numpy/_core/tests/test_memmap.py Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net> * TST: Update numpy/_core/tests/test_memmap.py Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net> * DOC: added enhancement release note for numpygh-27723 --------- Co-authored-by: Luke Aarohi <luke@dormouse.com> Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
We found an issue where datasets of this file size fail to load due to a bug in numpy < 2.2. Appears to be related to page size on linux. See numpy/numpy#27723 (released in 2.2, but we have to support older versions)