
Blaze slicing #1

Closed
mrocklin opened this issue Jan 9, 2015 · 4 comments


mrocklin commented Jan 9, 2015

We want to implement slicing on chunked dask graphs representing arrays. E.g. if I have a 100 by 100 array, split into 10 by 10 blocks and I compute

>>> x[:15, :15]

Then I want the upper-left block completely, half of each of the blocks to the right and below it, and a quarter of the block diagonally to the lower-right. Note that doing this logic correctly depends not only on the dask, but also on the shape metadata in the Array object.

Slicing might become complex (there is a lot of potential book-keeping here) but a partial solution is probably good enough for now.
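The block arithmetic described above can be sketched with a small helper. The name `blocks_touched` is hypothetical, not part of dask; this is just a sketch of which block indices a tuple of slices intersects.

```python
from itertools import product

def blocks_touched(index, blockshape):
    """Return the block indices that a tuple of slices touches.

    Hypothetical helper illustrating the block arithmetic above;
    not part of dask's API.
    """
    per_dim = [range((s.start or 0) // b, (s.stop - 1) // b + 1)
               for s, b in zip(index, blockshape)]
    return list(product(*per_dim))

blocks_touched((slice(0, 15), slice(0, 15)), (10, 10))
# -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```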

Example

Given a dask Array object like the following

>>> x = np.ones((20, 20))
>>> dsk = {'x': x}
>>> a = into(Array, x, blockshape=(5, 5), name='y')
>>> a.dask
{'y': array([[ 1.,  1.,  1.,  1.,  1 ...
 ('y', 0, 0): (<function dask.array.ndslice>, 'y', (5, 5), 0, 0),
 ('y', 0, 1): (<function dask.array.ndslice>, 'y', (5, 5), 0, 1),
...
 ('y', 3, 2): (<function dask.array.ndslice>, 'y', (5, 5), 3, 2),
 ('y', 3, 3): (<function dask.array.ndslice>, 'y', (5, 5), 3, 3)}

and a blaze expression like the following

>>> from blaze import symbol, compute
>>> s = symbol('s', '20 * 20 * int')
>>> expr = s[:8, :8]

We'd like to be able to compute a new dask.obj.Array object with the following dask

def sliceit(x, *inds):
    return x[inds]

{('y_1', 0, 0): ('y', 0, 0),
 ('y_1', 0, 1): (sliceit, ('y', 0, 1), slice(None, None), slice(None, 3)),
 ('y_1', 1, 0): (sliceit, ('y', 1, 0), slice(None, 3), slice(None, None)),
 ('y_1', 1, 1): (sliceit, ('y', 1, 1), slice(None, 3), slice(None, 3)),
 ...

Note on code complexity

Dask is likely to be refactored. Things like the Array class are experimental and likely to change shape in the future. We want to avoid rework when that refactor occurs. Because of this, it's nice to keep gritty detail work like this independent of the current conventions for as long as possible. E.g. there could be a stdlib-only function that creates a dictionary from pure-Python terms (tuples, dicts, names, slices), and then a separate dask.obj.Array-aware function. This is the approach behind array.top and obj.atop.

@mrocklin
Member Author

Here is another possible example interface and test

def slice_dask(outname, inname, index, shape, blockshape):
    ...

assert slice_dask('y', 'x', slice(0, 20), (100,), (25,)) == \
        {('y', 0): (operator.getitem, ('x', 0), slice(0, 20))}

assert slice_dask('y', 'x', slice(20, 30), (100,), (25,)) == \
        {('y', 0): (operator.getitem, ('x', 0), slice(20, 25)),
         ('y', 1): (operator.getitem, ('x', 1), slice(0, 5))}

@mrocklin
Member Author

From private conversation:

I might solve this problem by creating a small helper function with the following behavior:

def f(n, blocksize, index):
    ...

assert f(100, 20, slice( 0, 10)) == {0: slice( 0, 10)}
assert f(100, 20, slice(20, 30)) == {1: slice( 0, 10)}
assert f(100, 20, slice(15, 30)) == {0: slice(15, 20),
                                     1: slice( 0, 10)}
assert f(100, 20, slice(15, 85)) == {0: slice(15, 20),
                                     1: slice( 0, 20),
                                     2: slice( 0, 20),
                                     3: slice( 0, 20),
                                     4: slice( 0,  5)}
assert f(100, 20, 5)             == {0: 5}
assert f(100, 20, 35)            == {1: 15}
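A straightforward implementation satisfying the assertions above might look like the following. This is only a sketch; the eventual dask helper (`_slice_1d`, per the commit referenced below) may differ in details such as negative-index handling.

```python
def f(n, blocksize, index):
    """Map a 1-d integer or slice index on an array of length n,
    split into blocks of length blocksize, to per-block indices.

    Sketch only: assumes a non-negative integer or a simple
    slice with step 1.
    """
    if isinstance(index, int):
        # an integer index falls inside exactly one block
        return {index // blocksize: index % blocksize}
    start = index.start or 0
    stop = index.stop if index.stop is not None else n
    d = {}
    for i in range(start // blocksize, (stop - 1) // blocksize + 1):
        block_start = i * blocksize
        # clip the global [start, stop) range to this block,
        # then shift into block-local coordinates
        d[i] = slice(max(start, block_start) - block_start,
                     min(stop, block_start + blocksize) - block_start)
    return d
```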

One can then use this function in the following way.

shape = (100, 100, 100)
blockshape = (20, 20, 20)
index = (5, slice(5, 35), slice(40, 90))
inname = 'x'
outname = 'y'

# maps = [f(d, bd, i) for d, bd, i in zip(shape, blockshape, index)]
maps = map(f, shape, blockshape, index)  # fancy way of saying the above

# Now have something like
[{0: 5},
 {0: slice(5, 20), 1: slice(0, 15)},
 {2: slice(0, 20), 3: slice(0, 20), 4: slice(0, 10)}]

# Need to get the cartesian product of these three (using
# itertools.product or some sort of recursion)

{(0, 0, 0): ((0, 0, 2), (5, slice(5, 20), slice(0, 20))),
 (0, 0, 1): ((0, 0, 3), (5, slice(5, 20), slice(0, 20))),
 (0, 0, 2): ((0, 0, 4), (5, slice(5, 20), slice(0, 10))),
 (0, 1, 0): ((0, 1, 2), (5, slice(0, 15), slice(0, 20))),
 (0, 1, 1): ((0, 1, 3), (5, slice(0, 15), slice(0, 20))),
 (0, 1, 2): ((0, 1, 4), (5, slice(0, 15), slice(0, 10)))}

And then add in the administrative details

# add in administrative stuff just at the end

{('y', 0, 0, 0): (getitem, ('x', 0, 0, 2), (5, slice(5, 20), slice(0, 20))),
 ('y', 0, 0, 1): (getitem, ('x', 0, 0, 3), (5, slice(5, 20), slice(0, 20))),
 ('y', 0, 0, 2): (getitem, ('x', 0, 0, 4), (5, slice(5, 20), slice(0, 10))),
 ('y', 0, 1, 0): (getitem, ('x', 0, 1, 2), (5, slice(0, 15), slice(0, 20))),
 ('y', 0, 1, 1): (getitem, ('x', 0, 1, 3), (5, slice(0, 15), slice(0, 20))),
 ('y', 0, 1, 2): (getitem, ('x', 0, 1, 4), (5, slice(0, 15), slice(0, 10)))}
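The product-and-rename steps above might be sketched as follows. The name `dask_slice` is hypothetical, and this assumes per-dimension maps shaped like the output of an `f` as described above.

```python
from itertools import product
from operator import getitem

def dask_slice(outname, inname, maps):
    """Combine per-dimension block maps into a dask-style dict.

    Hypothetical helper: `maps` is a list of dicts, one per
    dimension, mapping input block index -> block-local index.
    """
    dsk = {}
    # output block indices are 0..len(m)-1 along each dimension
    for out_idx in product(*[range(len(m)) for m in maps]):
        # the j-th output block along a dimension comes from the
        # j-th (sorted) input block that the index touches
        in_idx = tuple(sorted(m)[j] for m, j in zip(maps, out_idx))
        index = tuple(m[k] for m, k in zip(maps, in_idx))
        dsk[(outname,) + out_idx] = (getitem, (inname,) + in_idx, index)
    return dsk

maps = [{0: 5},
        {0: slice(5, 20), 1: slice(0, 15)},
        {2: slice(0, 20), 3: slice(0, 20), 4: slice(0, 10)}]
dsk = dask_slice('y', 'x', maps)
```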

@mrocklin
Member Author

OK, so after playing with this some more, we might need to make the Array class and the interface to slicing more complex.

dask Arrays are currently defined by

  • an identifier/name like 'x'
  • a shape
  • a blocksize
  • a dask

Slicing breaks the concept of a single regular blocksize. We now have a sequence of block-lengths along each dimension. Repeated slicing requires us to handle possibly irregularly spaced block-lengths.

E.g. we might have an array

Array(dsk, 'x', shape=(10, 10), blockshape=([2, 5, 3], [5, 5]))

So 1-d slicing functions should presumably take not a single blocksize, but a sequence of block sizes

def f(n, blocksizes, index):
    ...

assert f(10, [2, 5, 3], slice(1, 5)) == {0: slice( 1, 2), 1: slice(0, 3)}

Additionally, we now need to know the new block-lengths, in this case

[1, 3]

This could be computed from the original inputs or from the result of f
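A variant of the 1-d helper for irregular block lengths, plus the new block lengths derived from its result, might be sketched like this (hypothetical names, slices with step 1 only):

```python
from itertools import accumulate

def f(n, blocksizes, index):
    """Like the regular-blocksize helper, but taking a sequence of
    possibly irregular block lengths along one dimension."""
    # cumulative block boundaries, e.g. [2, 5, 3] -> [0, 2, 7, 10]
    starts = [0] + list(accumulate(blocksizes))
    start = index.start or 0
    stop = index.stop if index.stop is not None else n
    d = {}
    for i, (lo, hi) in enumerate(zip(starts, starts[1:])):
        if start < hi and stop > lo:  # slice overlaps block i
            d[i] = slice(max(start, lo) - lo, min(stop, hi) - lo)
    return d

def new_blocklengths(d):
    # the new block lengths follow directly from the result of f
    return [s.stop - s.start for _, s in sorted(d.items())]
```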

mrocklin added a commit that referenced this issue Jan 28, 2015
Added negative slicing capabilities to _slice_1d().
@mrocklin
Member Author

mrocklin commented Feb 1, 2015

Most of this is done. Closing this in favor of issues #22 and #23, which cover the left-over parts.

@mrocklin mrocklin closed this as completed Feb 1, 2015