
Blaze slicing #1

Closed
mrocklin opened this issue Jan 9, 2015 · 4 comments


mrocklin commented Jan 9, 2015

We want to implement slicing on chunked dask graphs representing arrays. E.g. if I have a 100 by 100 array, split into 10 by 10 blocks and I compute

>>> x[:15, :15]

Then I want the upper-left block completely, half of each of the blocks to the right and below it, and a quarter of the block diagonally to the lower-right. Note that doing this logic correctly depends not only on the dask, but also on the shape metadata in the Array object.

Slicing might become complex (there is a lot of potential book-keeping here) but a partial solution is probably good enough for now.
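The block arithmetic described above can be sketched with a small helper. The name `blocks_touched` is hypothetical, not part of dask; this is just a sketch of which block indices a tuple of slices intersects.

```python
from itertools import product

def blocks_touched(index, blockshape):
    """Return the block indices that a tuple of slices touches.

    Hypothetical helper illustrating the block arithmetic above;
    not part of dask's API.
    """
    per_dim = [range((s.start or 0) // b, (s.stop - 1) // b + 1)
               for s, b in zip(index, blockshape)]
    return list(product(*per_dim))

blocks_touched((slice(0, 15), slice(0, 15)), (10, 10))
# -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```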

Example

Given a dask Array object like the following

>>> x = np.ones((20, 20))
>>> dsk = {'x': x}
>>> a = into(Array, x, blockshape=(5, 5), name='y')
>>> a.dask
{'y': array([[ 1.,  1.,  1.,  1.,  1 ...
 ('y', 0, 0): (<function dask.array.ndslice>, 'y', (5, 5), 0, 0),
 ('y', 0, 1): (<function dask.array.ndslice>, 'y', (5, 5), 0, 1),
...
 ('y', 3, 2): (<function dask.array.ndslice>, 'y', (5, 5), 3, 2),
 ('y', 3, 3): (<function dask.array.ndslice>, 'y', (5, 5), 3, 3)}

and a blaze expression like the following

>>> from blaze import symbol, compute
>>> s = symbol('s', '20 * 20 * int')
>>> expr = s[:8, :8]

We'd like to be able to compute a new dask.obj.Array object with the following dask

def sliceit(x, *inds):
    return x[inds]

{('y_1', 0, 0): ('y', 0, 0),
 ('y_1', 0, 1): (sliceit, ('y', 0, 1), slice(None, None), slice(None, 3)),
 ('y_1', 1, 0): (sliceit, ('y', 1, 0), slice(None, 3), slice(None, None)),
 ('y_1', 1, 1): (sliceit, ('y', 1, 1), slice(None, 3), slice(None, 3)),
 ...

Note on code complexity

Dask is likely to be refactored. Things like the Array class are experimental and likely to change shape in the future. We want to avoid rework when that refactor occurs. Because of this, it's nice to keep gritty detail work like this independent of the current conventions for as long as possible. E.g. there could be a stdlib-only function that creates a dictionary from pure-Python terms (tuples, dicts, names, slices), and then a separate dask.obj.Array-aware function. This is the approach behind array.top and obj.atop.

@mrocklin
Member Author

Here is another possible example interface and test

def slice_dask(outname, inname, index, shape, blockshape):
    ...

assert slice_dask('y', 'x', slice(0, 20), (100,), (25,)) == \
        {('y', 0): (operator.getitem, ('x', 0), slice(0, 20))}

assert slice_dask('y', 'x', slice(20, 30), (100,), (25,)) == \
        {('y', 0): (operator.getitem, ('x', 0), slice(20, 25)),
         ('y', 1): (operator.getitem, ('x', 1), slice(0, 5))}

@mrocklin
Member Author

From private conversation:

I might solve this problem by creating a small helper function with the following behavior:

def f(n, blocksize, index):
    ...

assert f(100, 20, slice( 0, 10)) == {0: slice( 0, 10)}
assert f(100, 20, slice(20, 30)) == {1: slice( 0, 10)}
assert f(100, 20, slice(15, 30)) == {0: slice(15, 20),
                                     1: slice( 0, 10)}
assert f(100, 20, slice(15, 85)) == {0: slice(15, 20),
                                     1: slice( 0, 20),
                                     2: slice( 0, 20),
                                     3: slice( 0, 20),
                                     4: slice( 0,  5)}
assert f(100, 20, 5)             == {0: 5}
assert f(100, 20, 35)            == {1: 15}
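A straightforward implementation satisfying the assertions above might look like the following. This is only a sketch; the eventual dask helper (`_slice_1d`, per the commit referenced below) may differ in details such as negative-index handling.

```python
def f(n, blocksize, index):
    """Map a 1-d integer or slice index on an array of length n,
    split into blocks of length blocksize, to per-block indices.

    Sketch only: assumes a non-negative integer or a simple
    slice with step 1.
    """
    if isinstance(index, int):
        # an integer index falls inside exactly one block
        return {index // blocksize: index % blocksize}
    start = index.start or 0
    stop = index.stop if index.stop is not None else n
    d = {}
    for i in range(start // blocksize, (stop - 1) // blocksize + 1):
        block_start = i * blocksize
        # clip the global [start, stop) range to this block,
        # then shift into block-local coordinates
        d[i] = slice(max(start, block_start) - block_start,
                     min(stop, block_start + blocksize) - block_start)
    return d
```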

One can then use this function in the following way.

shape = (100, 100, 100)
blockshape = (20, 20, 20)
index = (5, slice(5, 35), slice(40, 90))
inname = 'x'
outname = 'y'

# maps = [f(d, bd, i) for d, bd, i in zip(shape, blockshape, index)]
maps = map(f, shape, blockshape, index)  # fancy way of saying the above

# Now have something like
[{0: 5},
 {0: slice(5, 20), 1: slice(0, 15)},
 {2: slice(0, 20), 3: slice(0, 20), 4: slice(0, 10)}]

# Need to get the cartesian product of these three (using
# itertools.product or some sort of recursion)

{(0, 0, 0): ((0, 0, 2), (5, slice(5, 20), slice(0, 20))),
 (0, 0, 1): ((0, 0, 3), (5, slice(5, 20), slice(0, 20))),
 (0, 0, 2): ((0, 0, 4), (5, slice(5, 20), slice(0, 10))),
 (0, 1, 0): ((0, 1, 2), (5, slice(0, 15), slice(0, 20))),
 (0, 1, 1): ((0, 1, 3), (5, slice(0, 15), slice(0, 20))),
 (0, 1, 2): ((0, 1, 4), (5, slice(0, 15), slice(0, 10)))}

And then add in the administrative details

# add in administrative stuff just at the end

{('y', 0, 0, 0): (getitem, ('x', 0, 0, 2), (5, slice(5, 20), slice(0, 20))),
 ('y', 0, 0, 1): (getitem, ('x', 0, 0, 3), (5, slice(5, 20), slice(0, 20))),
 ('y', 0, 0, 2): (getitem, ('x', 0, 0, 4), (5, slice(5, 20), slice(0, 10))),
 ('y', 0, 1, 0): (getitem, ('x', 0, 1, 2), (5, slice(0, 15), slice(0, 20))),
 ('y', 0, 1, 1): (getitem, ('x', 0, 1, 3), (5, slice(0, 15), slice(0, 20))),
 ('y', 0, 1, 2): (getitem, ('x', 0, 1, 4), (5, slice(0, 15), slice(0, 10)))}
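The product-and-rename steps above might be sketched as follows. The name `dask_slice` is hypothetical, and this assumes per-dimension maps shaped like the output of an `f` as described above.

```python
from itertools import product
from operator import getitem

def dask_slice(outname, inname, maps):
    """Combine per-dimension block maps into a dask-style dict.

    Hypothetical helper: `maps` is a list of dicts, one per
    dimension, mapping input block index -> block-local index.
    """
    dsk = {}
    # output block indices are 0..len(m)-1 along each dimension
    for out_idx in product(*[range(len(m)) for m in maps]):
        # the j-th output block along a dimension comes from the
        # j-th (sorted) input block that the index touches
        in_idx = tuple(sorted(m)[j] for m, j in zip(maps, out_idx))
        index = tuple(m[k] for m, k in zip(maps, in_idx))
        dsk[(outname,) + out_idx] = (getitem, (inname,) + in_idx, index)
    return dsk

maps = [{0: 5},
        {0: slice(5, 20), 1: slice(0, 15)},
        {2: slice(0, 20), 3: slice(0, 20), 4: slice(0, 10)}]
dsk = dask_slice('y', 'x', maps)
```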

@mrocklin
Member Author

OK, so after playing with this some more, we might need to make the Array class and the interface to slicing more complex.

dask Arrays are currently defined by

  • an identifier/name like 'x'
  • a shape
  • a blocksize
  • a dask

Slicing breaks the concept of a single regular blocksize. We now have a sequence of block-lengths along each dimension. Repeated slicing requires us to handle possibly irregularly spaced block-lengths.

E.g. we might have an array

Array(dsk, 'x', shape=(10, 10), blockshape=([2, 5, 3], [5, 5]))

So 1-d slicing functions should presumably take not a single blocksize, but a sequence of block sizes

def f(n, blocksizes, index):
    ...

assert f(10, [2, 5, 3], slice(1, 5)) == {0: slice( 1, 2), 1: slice(0, 3)}

Additionally, we now need to know the new block-lengths, in this case

[1, 3]

This could be computed from the original inputs or from the result of f
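A variant of the 1-d helper for irregular block lengths, plus the new block lengths derived from its result, might be sketched like this (hypothetical names, slices with step 1 only):

```python
from itertools import accumulate

def f(n, blocksizes, index):
    """Like the regular-blocksize helper, but taking a sequence of
    possibly irregular block lengths along one dimension."""
    # cumulative block boundaries, e.g. [2, 5, 3] -> [0, 2, 7, 10]
    starts = [0] + list(accumulate(blocksizes))
    start = index.start or 0
    stop = index.stop if index.stop is not None else n
    d = {}
    for i, (lo, hi) in enumerate(zip(starts, starts[1:])):
        if start < hi and stop > lo:  # slice overlaps block i
            d[i] = slice(max(start, lo) - lo, min(stop, hi) - lo)
    return d

def new_blocklengths(d):
    # the new block lengths follow directly from the result of f
    return [s.stop - s.start for _, s in sorted(d.items())]
```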

mrocklin added a commit that referenced this issue Jan 28, 2015
Added negative slicing capabilities to _slice_1d().
@mrocklin
Member Author

mrocklin commented Feb 1, 2015

Most of this is done. Closing this in favor of issues #22 and #23, which cover the left-over parts.

@mrocklin mrocklin closed this as completed Feb 1, 2015