explicitly encode chunk grid shape in _ShardIndex #3974

@d-v-b

(claude wrote this at my request. I agree with it)

Make _ShardIndex store chunks_per_shard explicitly instead of inferring it from array shape

Background

_ShardIndex (in src/zarr/codecs/sharding.py) currently has a single field:

class _ShardIndex(NamedTuple):
    offsets_and_lengths: npt.NDArray[np.uint64]

The shard's chunk grid shape is not stored — it's reverse-engineered from the array's shape everywhere it's needed:

@property
def chunks_per_shard(self) -> tuple[int, ...]:
    result = tuple(self.offsets_and_lengths.shape[0:-1])
    # The cast is required until https://github.com/numpy/numpy/pull/27211 is merged
    return cast("tuple[int, ...]", result)

This works because offsets_and_lengths is constructed with shape (*chunks_per_shard, 2), so .shape[:-1] recovers the grid. But the representation is irregular at the boundary: for a 0-dimensional array, chunks_per_shard == (), so offsets_and_lengths has shape (2,) instead of (1, 2). The array has n_chunk_dims + 1 dimensions, which collapses to rank 1 for 0-D and breaks any method that assumes rank ≥ 2.

This is the root cause behind #3751 / #3966: get_chunk_slices_vectorized did offsets_and_lengths[:, 0], which fails on the rank-1 array. #3966 patches that one method with a special-case branch, but the underlying representation is still irregular: the scalar methods rely on the (2,) shape, the vectorized methods need (1, 2), and chunks_per_shard only works for 0-D by the accident of (2,)[:-1] == ().
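The irregularity can be reproduced with plain NumPy, independent of zarr. The following is a minimal sketch: make_index_array is a hypothetical helper (not part of zarr), and the zero fill value is a placeholder chosen only for illustration.

```python
import numpy as np


def make_index_array(chunks_per_shard: tuple[int, ...]) -> np.ndarray:
    # Hypothetical helper mirroring the (*chunks_per_shard, 2) layout
    # described above; zeros are a placeholder fill value for this sketch.
    return np.zeros((*chunks_per_shard, 2), dtype=np.uint64)


one_d = make_index_array((4,))  # 1-D chunk grid with 4 chunks
zero_d = make_index_array(())   # 0-D chunk grid (a single chunk)

# Recovering the grid from the array shape works in both cases...
assert one_d.shape[:-1] == (4,)
assert zero_d.shape[:-1] == ()

# ...but the payload rank differs: 2 for the 1-D grid, 1 for the 0-D grid.
assert one_d.ndim == 2
assert zero_d.ndim == 1

# Column indexing needs at least two axes, so it works for the 1-D grid
# and raises IndexError on the rank-1 array.
assert one_d[:, 0].shape == (4,)
try:
    zero_d[:, 0]
    column_indexing_failed = False
except IndexError:
    column_indexing_failed = True
```

Here column_indexing_failed ends up True, which is the same kind of failure the issue attributes to get_chunk_slices_vectorized.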

Proposal

Store the chunk grid shape as an explicit field; make offsets_and_lengths a dumb payload whose shape is no longer load-bearing:

class _ShardIndex(NamedTuple):
    chunks_per_shard: tuple[int, ...]
    offsets_and_lengths: npt.NDArray[np.uint64]

With chunks_per_shard authoritative:

  • The chunks_per_shard property becomes a trivial field read — and the numpy/numpy#27211 cast workaround on line 132 goes away (it's a real tuple, not a numpy shape).
  • The 0-D special case in get_chunk_slices_vectorized (added in #3966) can be removed, since there's no longer a rank to infer from the array.
  • The scattered offsets_and_lengths.shape[:-1] reads (e.g. the chunk iterators around lines 283, 301) become field reads.
  • offsets_and_lengths could optionally be normalized to always-2-D (prod(chunks_per_shard), 2), since the real shape now lives elsewhere — making every lookup method uniform. (Optional; can be a second step.)

The "is this 0-D?" question moves from inspecting an array's rank (or, in #3966's patch, inspecting the caller's query-array shape) to reading the index's own stored schema — which is where that knowledge belongs.
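As a rough illustration of the proposal, including the optional always-2-D normalization, here is a minimal sketch. ShardIndexSketch and its methods (flat_position, set_chunk, offsets) are hypothetical names, not zarr's API; the zero fill value and the row-major flattening order are assumptions made for the example.

```python
import math
from typing import NamedTuple

import numpy as np
import numpy.typing as npt


class ShardIndexSketch(NamedTuple):
    """Hypothetical stand-in for _ShardIndex with an explicit grid shape."""

    chunks_per_shard: tuple[int, ...]
    # Always 2-D: (prod(chunks_per_shard), 2). math.prod(()) == 1, so a
    # 0-D grid gets a (1, 2) payload rather than a rank-1 array.
    offsets_and_lengths: npt.NDArray[np.uint64]

    @classmethod
    def create_empty(cls, chunks_per_shard: tuple[int, ...]) -> "ShardIndexSketch":
        n_chunks = math.prod(chunks_per_shard)
        # Zeros are a placeholder fill value chosen for this sketch.
        payload = np.zeros((n_chunks, 2), dtype=np.uint64)
        return cls(chunks_per_shard, payload)

    def flat_position(self, chunk_coords: tuple[int, ...]) -> int:
        # Row-major flattening (assumed); yields 0 for the empty tuple.
        flat = 0
        for coord, size in zip(chunk_coords, self.chunks_per_shard):
            flat = flat * size + coord
        return flat

    def set_chunk(self, chunk_coords: tuple[int, ...], offset: int, length: int) -> None:
        self.offsets_and_lengths[self.flat_position(chunk_coords)] = (offset, length)

    def offsets(self) -> npt.NDArray[np.uint64]:
        # Column indexing is valid for every grid rank, including 0-D.
        return self.offsets_and_lengths[:, 0]


scalar = ShardIndexSketch.create_empty(())
scalar.set_chunk((), 0, 128)

grid = ShardIndexSketch.create_empty((2, 3))
grid.set_chunk((1, 2), 128, 64)
```

With the grid shape stored as a field, the 0-D and N-D cases go through the same code path: neither set_chunk nor offsets inspects the array's rank.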

Cost / scope

This is an API change to _ShardIndex's constructor. Call sites that need updating (~4):

  • _ShardIndex.create_empty — already takes chunks_per_shard as an argument; just store it.
  • _ShardReader.create_empty — threads the value through.
  • _ShardIndex(index_array.as_numpy_array()) on the deserialization path (~line 715) — needs chunks_per_shard threaded in. It's available at that call site (the shard reader knows it), but it's a signature change.
  • The _ShardIndex(...) construction around line 797.
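For the deserialization call site, threading the value through might look like the sketch below. It keeps the existing (*chunks_per_shard, 2) payload layout (that is, without the optional normalization step). ShardIndexSketch and index_from_decoded are hypothetical names, and the shape check is an illustrative extra rather than something the proposal requires.

```python
from typing import NamedTuple

import numpy as np
import numpy.typing as npt


class ShardIndexSketch(NamedTuple):
    """Hypothetical stand-in for the proposed two-field _ShardIndex."""

    chunks_per_shard: tuple[int, ...]
    offsets_and_lengths: npt.NDArray[np.uint64]


def index_from_decoded(
    decoded: npt.NDArray[np.uint64], chunks_per_shard: tuple[int, ...]
) -> ShardIndexSketch:
    # Hypothetical helper: the caller passes the grid shape it already
    # knows, instead of the index inferring it from decoded.shape.
    expected_shape = (*chunks_per_shard, 2)
    if decoded.shape != expected_shape:
        raise ValueError(
            f"index payload has shape {decoded.shape}, expected {expected_shape}"
        )
    return ShardIndexSketch(chunks_per_shard, decoded)


# A 0-D grid: the payload is rank-1, but the stored field still says ().
zero_d_index = index_from_decoded(np.zeros((2,), dtype=np.uint64), ())

# A mismatched payload is rejected up front rather than misread later.
try:
    index_from_decoded(np.zeros((3, 2), dtype=np.uint64), (4,))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

One side effect of storing the field explicitly is that a mismatch between the payload and the declared grid becomes detectable at construction time.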

Acceptance

Related
