The shard's chunk grid shape is not stored — it's reverse-engineered from the array's shape everywhere it's needed:
    @property
    def chunks_per_shard(self) -> tuple[int, ...]:
        result = tuple(self.offsets_and_lengths.shape[0:-1])
        # The cast is required until https://github.com/numpy/numpy/pull/27211 is merged
        return cast("tuple[int, ...]", result)
This works because offsets_and_lengths is constructed with shape (*chunks_per_shard, 2) — so .shape[:-1] recovers the grid. But the recovery is lossy at the boundary: for a 0-dimensional array, chunks_per_shard == (), so offsets_and_lengths has shape (2,) instead of (1, 2). The array is effectively (n_chunk_dims + 1)-dimensional, which collapses to rank-1 for 0-D and breaks any method that assumes rank ≥ 2.
This is the root cause behind #3751 / #3966: get_chunk_slices_vectorized did offsets_and_lengths[:, 0], which fails on the rank-1 array. #3966 patches that one method with a special-case branch, but the underlying representation is still irregular — the scalar methods rely on the (2,) shape, the vectorized methods need (1, 2), and chunks_per_shard only works for 0-D by the accident of ()[:-1] == ().
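The shape collapse described above can be reproduced with plain NumPy. This is only an illustrative sketch: `make_index` is a hypothetical helper that follows the `(*chunks_per_shard, 2)` convention and is not part of zarr.

```python
import numpy as np


def make_index(chunks_per_shard: tuple[int, ...]) -> np.ndarray:
    # Hypothetical helper: allocate an index using the (*chunks_per_shard, 2) convention.
    return np.zeros((*chunks_per_shard, 2), dtype=np.uint64)


index_2d = make_index((2, 3))  # shape (2, 3, 2)
index_0d = make_index(())      # shape (2,), not (1, 2)

# Recovering the grid from the array shape happens to work in both cases...
assert index_2d.shape[:-1] == (2, 3)
assert index_0d.shape[:-1] == ()

# ...but indexing that assumes at least two axes fails on the rank-1 array.
error_name = None
try:
    index_0d[:, 0]
except IndexError as exc:
    error_name = type(exc).__name__
```

After running this, `error_name` holds `"IndexError"`, which is the kind of failure a rank-2 assumption runs into on a 0-D shard.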
Proposal
Store the chunk grid shape as an explicit field; make offsets_and_lengths a dumb payload whose shape is no longer load-bearing:
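A minimal sketch of what such a definition could look like, assuming a `NamedTuple`-style container. The class name `ShardIndexSketch` and the `is_zero_dimensional` helper are hypothetical and only illustrate the idea of an explicit field.

```python
from typing import NamedTuple

import numpy as np
import numpy.typing as npt


class ShardIndexSketch(NamedTuple):
    # The chunk grid shape is stored, not inferred from the payload's shape.
    chunks_per_shard: tuple[int, ...]
    # Payload of (offset, length) pairs; its shape carries no extra meaning.
    offsets_and_lengths: npt.NDArray[np.uint64]

    @property
    def is_zero_dimensional(self) -> bool:
        # Hypothetical helper: answers the 0-D question from the stored schema.
        return len(self.chunks_per_shard) == 0


sketch_0d = ShardIndexSketch((), np.zeros((2,), dtype=np.uint64))
sketch_2d = ShardIndexSketch((2, 3), np.zeros((2, 3, 2), dtype=np.uint64))
```

With this layout, `sketch_0d.chunks_per_shard` is a plain Python tuple, so no cast from a NumPy shape is involved.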
The chunks_per_shard property becomes a trivial field read — and the numpy/numpy#27211 cast workaround on line 132 goes away (it's a real tuple, not a numpy shape).
The scattered offsets_and_lengths.shape[:-1] reads (e.g. the chunk iterators around lines 283, 301) become field reads.
offsets_and_lengths could optionally be normalized to always-2-D (prod(chunks_per_shard), 2), since the real shape now lives elsewhere — making every lookup method uniform. (Optional; can be a second step.)
The "is this 0-D?" question moves from inspecting an array's rank (or, in #3966's patch, inspecting the caller's query-array shape) to reading the index's own stored schema — which is where that knowledge belongs.
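The optional always-2-D normalization could make vectorized lookups uniform across ranks. The sketch below assumes a payload of shape `(prod(chunks_per_shard), 2)` in C order; `c_order_strides` and `lookup_offsets_and_lengths` are hypothetical names, not existing zarr functions.

```python
import math

import numpy as np


def c_order_strides(shape: tuple[int, ...]) -> np.ndarray:
    # Element strides of a C-ordered grid; an empty array for a 0-D grid.
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return np.array(strides[::-1], dtype=np.int64)


def lookup_offsets_and_lengths(flat_index, chunks_per_shard, chunk_coords):
    # chunk_coords is expected to have shape (n, len(chunks_per_shard)).
    coords = np.asarray(chunk_coords, dtype=np.int64)
    # Row number of each chunk in the flattened index; all zeros for a 0-D grid.
    rows = (coords * c_order_strides(chunks_per_shard)).sum(axis=1)
    return flat_index[rows, 0], flat_index[rows, 1]


# 2-D grid: six chunks, row i holds (offset, length) == (2 * i, 2 * i + 1).
grid = (2, 3)
flat = np.arange(math.prod(grid) * 2, dtype=np.uint64).reshape(-1, 2)
offsets, lengths = lookup_offsets_and_lengths(flat, grid, np.array([[0, 0], [1, 2]]))

# 0-D grid: prod(()) == 1, so the payload is a single row and the same code path applies.
flat_0d = np.array([[7, 9]], dtype=np.uint64)
offsets_0d, lengths_0d = lookup_offsets_and_lengths(
    flat_0d, (), np.zeros((1, 0), dtype=np.int64)
)
```

The 0-D case needs no special branch here because the coordinate array simply has zero columns.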
Cost / scope
This is an API change to _ShardIndex's constructor. Call sites that need updating (~4):
_ShardIndex.create_empty: already takes chunks_per_shard as an argument; just store it.
_ShardReader.create_empty — threads the value through.
_ShardIndex(index_array.as_numpy_array()) on the deserialization path (~line 715) — needs chunks_per_shard threaded in. It's available at that call site (the shard reader knows it), but it's a signature change.
The _ShardIndex(...) construction around line 797.
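The call sites listed above fall into two patterns: creating an empty index, and wrapping an already-decoded array. A rough sketch of both, under the assumption that the grid shape is passed in explicitly. `IndexSketch`, `create_empty_index`, and `index_from_decoded` are hypothetical names, and the all-ones `uint64` sentinel for empty chunks is an assumption of this sketch, not something stated in this issue.

```python
from typing import NamedTuple

import numpy as np

# Assumed sentinel marking a chunk as absent (2**64 - 1).
MAX_UINT_64 = np.uint64(np.iinfo(np.uint64).max)


class IndexSketch(NamedTuple):
    chunks_per_shard: tuple[int, ...]
    offsets_and_lengths: np.ndarray


def create_empty_index(chunks_per_shard: tuple[int, ...]) -> IndexSketch:
    # The caller already knows the grid shape, so it is stored as given.
    data = np.full((*chunks_per_shard, 2), MAX_UINT_64, dtype=np.uint64)
    return IndexSketch(chunks_per_shard, data)


def index_from_decoded(decoded: np.ndarray, chunks_per_shard: tuple[int, ...]) -> IndexSketch:
    # Deserialization path: the grid shape is threaded in and checked against the payload.
    expected = (*chunks_per_shard, 2)
    if decoded.shape != expected:
        raise ValueError(f"index shape {decoded.shape} does not match expected {expected}")
    return IndexSketch(chunks_per_shard, decoded)


empty_0d = create_empty_index(())
decoded_2d = index_from_decoded(np.zeros((2, 3, 2), dtype=np.uint64), (2, 3))
```

Validating the payload against the declared grid at construction time is a design choice of this sketch; it turns a silent shape mismatch into an early error.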
Acceptance
_ShardIndex stores chunks_per_shard explicitly; the chunks_per_shard property returns it directly.
(claude wrote this at my request. I agree with it)
Make _ShardIndex store chunks_per_shard explicitly instead of inferring it from array shape
Background
_ShardIndex (in src/zarr/codecs/sharding.py) currently has a single field, offsets_and_lengths.
The shard's chunk grid shape is not stored; it is reverse-engineered from the array's shape everywhere it's needed.
This works because offsets_and_lengths is constructed with shape (*chunks_per_shard, 2), so .shape[:-1] recovers the grid. But the recovery is lossy at the boundary: for a 0-dimensional array, chunks_per_shard == (), so offsets_and_lengths has shape (2,) instead of (1, 2). The array is effectively (n_chunk_dims + 1)-dimensional, which collapses to rank-1 for 0-D and breaks any method that assumes rank ≥ 2.
This is the root cause behind #3751 / #3966: get_chunk_slices_vectorized did offsets_and_lengths[:, 0], which fails on the rank-1 array. #3966 patches that one method with a special-case branch, but the underlying representation is still irregular: the scalar methods rely on the (2,) shape, the vectorized methods need (1, 2), and chunks_per_shard only works for 0-D by the accident of ()[:-1] == ().
Proposal
Store the chunk grid shape as an explicit field; make offsets_and_lengths a dumb payload whose shape is no longer load-bearing.
With chunks_per_shard authoritative:
The chunks_per_shard property becomes a trivial field read, and the numpy/numpy#27211 cast workaround on line 132 goes away (it's a real tuple, not a numpy shape).
The special-case branch in get_chunk_slices_vectorized (added in #3966, "fix: allow writing to 0-dimensional arrays with sharding") can be removed; there's no longer a rank to infer from the array.
The scattered offsets_and_lengths.shape[:-1] reads (e.g. the chunk iterators around lines 283, 301) become field reads.
offsets_and_lengths could optionally be normalized to always-2-D (prod(chunks_per_shard), 2), since the real shape now lives elsewhere, making every lookup method uniform. (Optional; can be a second step.)
The "is this 0-D?" question moves from inspecting an array's rank (or, in #3966's patch, inspecting the caller's query-array shape) to reading the index's own stored schema, which is where that knowledge belongs.
Cost / scope
This is an API change to _ShardIndex's constructor. Call sites that need updating (~4):
_ShardIndex.create_empty: already takes chunks_per_shard as an argument; just store it.
_ShardReader.create_empty: threads the value through.
_ShardIndex(index_array.as_numpy_array()) on the deserialization path (~line 715): needs chunks_per_shard threaded in. It's available at that call site (the shard reader knows it), but it's a signature change.
The _ShardIndex(...) construction around line 797.
Acceptance
_ShardIndex stores chunks_per_shard explicitly; the chunks_per_shard property returns it directly.
The numpy/numpy#27211 cast workaround is removed.
get_chunk_slices_vectorized has a single uniform path.
Relevant 0-D tests: test_sharding_zero_dimensional, test_shard_index_get_chunk_slices_vectorized_zero_dimensional.
Related