Skip to content

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611

Draft
Nucs wants to merge 102 commits into
masterfrom
nditer
Draft

[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul#611
Nucs wants to merge 102 commits into
masterfrom
nditer

Conversation

@Nucs
Copy link
Copy Markdown
Member

@Nucs Nucs commented Apr 22, 2026

Summary

This PR ports NumPy 2.4.2's nditer machinery to NumSharp (NpyIter), builds a composable expression DSL on top (NpyExpr) with a three-tier custom-op API, wires multi-order memory layout (C/F/A/K) through the entire API surface, and replaces the matmul fallback with stride-native GEMM for all 12 dtypes (eliminates a ~100x slowdown on transposed inputs). Also lands a new Char8 1-byte dtype with 100% Python bytes parity, a trainable MNIST MLP example demonstrating fusion, and several pre-existing bug fixes surfaced by battletest.

Stats: +50,426 / -1,188 across 156 files, 64 commits.

TL;DR

  • NpyIter -- full NumPy nditer port: 32+ APIs, all iteration orders (C/F/A/K with NEGPERM), all indexing modes (MULTI_INDEX / C_INDEX / F_INDEX / RANGE), buffered casting, buffered-reduce double-loop, masking, unlimited operands and dimensions. 566/566 NumPy 2.4.2 parity scenarios pass byte-for-byte.
  • NpyExpr DSL + 3-tier custom-op API (Tier 3A raw IL / Tier 3B element-wise w/ SIMD / Tier 3C expression composition + Call(...) for arbitrary Func / MethodInfo invocation). 50+ ops, full operator overloads, structural cache key, ~5ns delegate dispatch.
  • C/F/A/K memory layout wired through np.copy, np.array, np.asarray, np.asanyarray, np.asfortranarray (new), np.ascontiguousarray (new), *_like, astype, flatten, ravel, reshape, eye, concatenate, vstack, hstack, cumsum, argsort -- plus post-hoc F-contig preservation across the ILKernel dispatchers (41 element-wise layout bugs fixed).
  • Stride-native matmul for all 12 dtypes -- np.dot(x.T, grad) MLP shape: 240 ms -> 1 ms. Removes ~165 lines of dead fallback code.
  • Char8 -- new 1-byte dtype, NumPy S1 / Python bytes(1) equivalent, full Python bytes API parity (battletested against Python oracle).
  • Trainable MNIST MLP example -- fused forward + backward + Adam, 99.89% test accuracy, 100 epochs in ~42s.
  • Bug fixes: NPTypeCode.Char.SizeOf() returned 1 (real=2), IsInf was stubbed to return null, Decimal priority was stale, argsort mishandled non-C-contig input, several NpyExpr IL emission bugs.
  • Test suite: 6,710 passing on net8.0 + net10.0, zero regressions.

Contents

NpyIter -- Full NumPy nditer Port

From-scratch C# port of NumPy 2.4.2's nditer under src/NumSharp.Core/Backends/Iterators/.

Files (new)

File Lines Purpose
NpyIter.cs 3,331 Core iterator engine
NpyIter.State.cs 977 Dynamic per-operand state allocation
NpyIter.Execution.cs 657 Iteration execution paths
NpyIter.Execution.Custom.cs 155 Tier 3A/3B/3C custom-op entry points
NpyIterBufferManager.cs 637 Buffered iteration with casting
NpyIterFlags.cs 516 Flag enums (matches NumPy semantics)
NpyIterCoalescing.cs 495 Axis reorder + coalesce
NpyIterCasting.cs 483 Casting validation + conversion
NpyIterKernels.cs 263 Path selection
NpyAxisIter.cs 492 Axis-based iterator (cumsum, var, std)
NpyLogicalReductionKernels.cs 127 Generic reduction kernel interfaces

Capability Matrix

Capability Notes
Iteration orders C, F, A, K (NEGPERM for negative-stride memory order)
Indexing modes MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking)
Buffering Type conversion, full casting rules (no / equiv / safe / same_kind / unsafe)
Reduction op_axes w/ -1 reduction axes, REDUCE_OK, IsFirstVisit, buffered-reduce double-loop including bufferSize < coreSize
Multi-operand Unlimited (NumPy NPY_MAXARGS=64 parity, dynamic alloc)
Dimensions Unlimited (NumSharp divergence; replaces NPY_MAXDIMS=64)
Masking WRITEMASKED + ARRAYMASK w/ reduction safety check
APIs ported Copy, GotoIndex, GotoMultiIndex, RemoveAxis, RemoveMultiIndex, ResetBasePointers, GetMultiIndexFunc, GetInnerFixedStrideArray, GetAxisStrideArray, CreateCompatibleStrides, DebugPrint, GetIterView, IterRange, Iternext, GetValue<T> / SetValue<T>, Finished, Shape, OVERLAP_ASSUME_ELEMENTWISE, TRANSFERFLAGS, reduction-axis encoding, more

Bugs found and fixed during port

  • Negative strides flipped regardless of order -- NumPy only flips when FORCEDORDER is unset (K-order only).
  • NO_BROADCAST flag not enforced -- was skipping operands instead of validating shape match.
  • F_INDEX returned C-order indices -- coalescing reduced NDim=1, losing axis structure. Now disables coalescing for C_INDEX / F_INDEX / MULTI_INDEX.
  • ALLOCATE with null operand threw NRE -- CalculateBroadcastShape accessed op[i].ndim before allocating.
  • op_axes reductions broken -- ApplyOpAxes was re-applying axes to already-correct strides, zeroing non-reduce strides.
  • SetupBufferedReduction produced inverted strides for non-reduce operands; only worked for 2-axis cases.
  • K-order on broadcast / non-contiguous arrays produced wrong order -- fixed by falling back to C-order when stride=0 is present.
  • Reset() desynced ranged iterators with IterStart > 0 -- now delegates to GotoIterIndex(IterStart).
  • CoalesceAxes rejected size-1 axes unless stride==0 -- size-1 axes contribute no iteration and should always absorb.
  • Buffer free-list bug -- Dispose used NativeMemory.Free for AlignedAlloc'd buffers (memory corruption).

NpyExpr DSL + 3-tier Custom-Op API

User-extensible kernel layer built on NpyIter.

Tiers

  • Tier 3A -- ExecuteRawIL(body, key, aux): emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* byteStrides, long count, void*). Full control.
  • Tier 3B -- ExecuteElementWise(scalar, vector, ...): per-element IL + 4x-unrolled SIMD shell + 1-vec remainder + scalar tail + scalar-strided fallback. SIMD when all operand dtypes match and are SIMD-capable.
  • Tier 3C -- ExecuteExpression(expr, inputTypes, outputType): compose NpyExpr trees via operator syntax, Compile() emits IL automatically.

NpyExpr Node Catalog

Category Ops
Binary arithmetic Add Subtract Multiply Divide Mod Power FloorDivide ATan2
Binary bitwise BitwiseAnd BitwiseOr BitwiseXor
Unary arithmetic Negate Abs Sign Sqrt Cbrt Square Reciprocal
Unary exp/log Exp Exp2 Expm1 Log Log2 Log10 Log1p
Unary trig Sin Cos Tan Sinh Cosh Tanh ASin ACos ATan Deg2Rad Rad2Deg
Unary rounding Floor Ceil Round Truncate
Unary predicates BitwiseNot LogicalNot IsNaN IsFinite IsInf
Comparisons Equal NotEqual Less LessEqual Greater GreaterEqual (return 0/1 at output dtype)
Combinators Min Max Clamp Where(cond, a, b)
Operators + - * / % & OR ^ ~ ! unary-

Call(...) escape hatch (commit 8da3e693)

Invoke any Func<...>, Delegate, or MethodInfo per element -- three dispatch paths chosen at construction:

Condition Emitted IL
Static method, no target call <methodinfo> (zero-indirection, JIT-inlinable)
Instance MethodInfo + target ldc.i4 id -> LookupTarget -> castclass T -> callvirt <mi>
Any other Delegate ldc.i4 id -> LookupDelegate -> castclass Func<..> -> callvirt Invoke

Auto-conversion at the call boundary (input/output via EmitConvertTo). Process-wide DelegateSlots registry, ~5ns lookup. Cache key includes MetadataToken + ModuleVersionId to disambiguate dynamic assemblies.

Bugs caught during DSL battletest

  • IsNaN / IsFinite / IsInf silently wrote I4 0/1 into double slots (denormals instead of 1.0). Fix: UnaryNode inserts trailing EmitConvertTo(Int32, outType).
  • LogicalNot broken for Int64 / Single / Double / Decimal -- Ldc_I4_0+Ceq only valid for I4-sized operands. Fix: route through EmitComparisonOperation(Equal, outType) with properly-typed zero literal.
  • WhereNode prelude unfinished (threw at compile). Rewrote.
  • MinMaxNode did not propagate NaN -- rerouted through Math.Min / Math.Max (matches np.minimum / np.maximum, not fmin / fmax).
  • Vector256.Round / Truncate are .NET 9+ only -- excluded from SIMD path; scalar path works on net8.

Multi-Order Memory Layout (C/F/A/K)

Shape changes (src/NumSharp.Core/View/Shape.cs, +171 lines)

  • New IsFContiguous (O(1) flag check via ArrayFlags.F_CONTIGUOUS).
  • New ComputeFContiguousStrides helper.
  • New Shape(long[] dims, char order) ctor for explicit physical-order construction.
  • Aligned _UpdateContiguousFlags with NumPy -- single-pass (isC, isF) tuple, fewer call sites.
  • Fixed Shape.Order -- was hardcoded to 'C', now derives from contiguity flags.
  • Fixed empty-array semantics -- any dim==0 is both C- and F-contig per NumPy, no BROADCASTED flag.
  • OrderResolver.cs (new, 75 lines) -- centralizes C/F/A/K -> C/F mapping.

API surface wiring

API Change
np.copy, NDArray.copy(order) New overload, default 'K' (was 'C')
np.array(Array, ..., order) F-order materialization via copy('F')
np.asarray, np.asanyarray New overloads accepting Type? + order
np.asfortranarray, np.ascontiguousarray NEW thin wrappers
np.empty_like / zeros_like / ones_like / full_like New order overload, default 'K'
astype(Type, copy, order) New overload, default 'K'
flatten(order), ravel(order) F-order via copy('F') reinterpret
reshape(Shape, char order) New overload; F-order column-major fill
np.eye(..., order) New overload
np.concatenate, vstack, hstack Preserve F-contig when all inputs F-contig
np.cumsum(axis) Preserve F-contig via post-hoc copy('F')
NDArray.argsort Copies non-C-contig input to C-contig first

Post-hoc F-contig preservation across ILKernel dispatch (Group F, 41 bugs fixed)

Refactoring 27 partial files (~21K lines) of IL emitters to accept arbitrary output strides was rejected as too invasive. Instead, central dispatchers (ExecuteBinaryOp, ExecuteUnaryOp, ExecuteComparisonOp) call ShouldProduceFContigOutput(operands, resultShape) after the kernel and relay via cheap .copy('F') when every non-scalar operand is strictly F-contig. Matches NumPy's F+F->F, C+C->C, F+C->C, F*scalar->F, F+FCol->F rules.

np.modf, np.clip, np.negative, np.maximum / minimum updated individually (do not route through the central dispatchers).

TDD coverage

51 sections of OrderSupport.OpenBugs.Tests.cs (3,005 lines), each driven by side-by-side Python / NumPy 2.4.2 output. Remaining [OpenBugs] are minimal API gaps (np.tile, np.flip, np.where, np.sort).

Stride-Native MatMul

np.dot / np.matmul previously fell into a ~100x slower fallback for any non-contiguous operand (transposed view, slice). This PR ships stride-native paths for all 12 dtypes.

New files

  • SimdMatMul.Strided.cs (338 lines) -- generalized 8x16 Vector256 FMA micro-kernel for float. New packers PackAPanelsStrided / PackBPanelsStrided with fast paths for transposed-contig and row-contig.
  • SimdMatMul.Double.cs (108 lines) -- stride-aware IKJ Vector256 kernel (4 FMAs).
  • Default.MatMul.Strided.cs (357 lines) -- MatMulStridedSame<T> where T : INumber<T> (JIT specializes per type with auto-vectorization), plus MatMulStridedBool (NumPy's OR-of-ANDs short-circuit), MatMulStridedMixed<TResult> (typed pointer reads via ReadAsDouble, no boxing).

Dead code removed (~165 lines)

MatMulGeneric, MatMulCore<TResult>, MatMulSameType<T>, four MatMulContiguous overloads (float/double/int/long), MatMulMixedType<TResult>.

Performance (MLP backward shapes)

Op Before After
dot(x.T, grad) 784x64 @ 64x128 240 ms 1 ms
dot(grad, W.T) 64x128 @ 128x784 226 ms 1 ms
Lt(400,500) @ L(500,400) blocked 12 ms 8 ms (skips copy)
int32 (150,200) @ (200,150) stride-native -- 10 ms (was 11 ms copy+matmul)

The MLP example's .copy() workaround on transposed views is now removed -- kernel absorbs strides directly.

Test coverage

MatMulStridedTests (28 tests): all 4 BLAS transpose cases (NN/NT/TN/TT) x float/double x simple/blocked, per-dtype stride-native (byte/int16/uint16/int32/uint32/int64/uint64/char/decimal/bool), sliced views with Shape.offset > 0, mixed-type, MLP-shape regression tests.

Char8 Dtype

New NumSharp.Char8 -- [StructLayout(Sequential, Size=1)] readonly struct, NumPy dtype('S1') / Python bytes(1) equivalent.

Files (new, ~1,450 lines)

File Lines Purpose
Char8.cs 725 Core: Latin1CharInfo, classification, case, operators, IConvertible, IFormattable
Char8.Operators.cs 169 Mixed-type ops (char/byte/int), Span reinterpret
Char8.Conversions.cs 261 Instance/factory To* / From* for all 12 dtypes, saturating, truncating
Char8.Spans.cs 201 Search / equality / UTF-8 classification on ReadOnlySpan<Char8>
Char8.PyBytes.cs 531 Python bytes array methods (Strip/Split/Replace/Center/...)
Converts.Char8.cs 317 Converts integration parallel to Converts.Native.cs

Adapted from .NET System.Char (src/dotnet/, fetched into a reference library; Latin1CharInfo[256] table copied verbatim).

Python parity (caught by oracle diff)

3 parity bugs surfaced and fixed during 250-line Python bytes oracle comparison:

  1. Count with empty pattern -- Python returns len(s) + 1, was 0.
  2. Center asymmetric padding -- CPython formula left = pad/2 + (pad & width & 1).
  3. SplitLines too permissive -- bytes.splitlines() only recognizes \n / \r / \r\n (not \v / \f / \x1c-1e).

Status

Standalone for now -- not yet wired into NPTypeCode enum (would touch ~50 switch statements across DefaultEngine / ILKernelGenerator / NpyIter / casting / Converts; deferred to a separate PR).

MNIST MLP Example

examples/NeuralNetwork.NumSharp/MnistMlp/ -- runnable end-to-end classifier.

  • Architecture: 784 -> 128 (ReLU) -> 10, float32, He-init, Adam.
  • Forward fusion: bias + ReLU collapses into one NpyIter per layer (NpyExpr.Max(Input(0) + Input(1), 0)).
  • Backward fusion: gradOut * (y > 0) ReLU mask fused.
  • Loss: SoftmaxCrossEntropy (combined, max-subtracted, numerically stable).

Results (6000 train / 1000 test, batch 128, Adam lr=1e-3):

Phase Total Final test acc
Pre-stride-native dot 100.7 s (5 epochs) 100%
Post-copy() workaround 3.2 s (5 epochs) 100%
Final 100-epoch demo (post stride-native) ~42 s 99.89%

NN scaffolding fixes outside MnistMlp/: completed every stubbed/broken class -- Softmax (was empty Forward + sigmoid-derivative Backward), Sigmoid.Forward (empty), CategoricalCrossentropy (no clipping, wrong backward formula), BinaryCrossEntropy (did not divide by N), Accuracy / BinaryAccuacy (returned scalar/null), FullyConnected (no bias, skewed init), NeuralNet.Train (used 2-index integer selection where slicing was intended), Adam (ms / vs init was commented out), added SGD optimizer. Each verified against analytical references with finite-difference grad checks (29/29 pass).

Bug Fixes

  • NPTypeCode.Char.SizeOf() returned 1, real is 2 (UTF-16). Affected NpyIter.SetOpDType (ElementSizes[op] x stride in 8 places), 8 cast sites, np.frombuffer, np.dtype(char).itemsize, axis reductions. Survived without test failures because NumPy has no native char dtype and ASCII reads accidentally land on the right byte.
  • GetPriority(Decimal) = 5*10*32 was stale after the prior Decimal SizeOf fix -- corrected to 5*10*16=800. No behavioral change (relative ordering preserved).
  • DefaultEngine.IsInf was stubbed to return null (NRE on any IsInf call). Now wired through ExecuteUnaryOp with the existing IL kernel.
  • NDArray.Copy.cs share-by-reference bug -- new Shape(this.Shape.dimensions, 'F') aliased the source int[]; cloned now.
  • NDArray.argsort -- copies non-C-contig input to C-contig first (matches NumPy's invariant that argsort always returns C-contig).
  • flatten allocation regression -- F-order path was double-allocating (copy('F') + Array.Clone()). Fixed: reinterpret directly.

Behavioral Changes

Area Change Migration
np.copy default order 'C' -> 'K' No change for C-contig input (K preserves layout)
MaxOperands=8 removed Now unlimited (dynamic alloc) Drop-in
MaxDims=64 removed Now unlimited (~300K dims, stackalloc-bound) Drop-in
F-order iteration Now produces [0,3,1,4,2,5] for 2x3 C-contig (was [0,1,2,3,4,5]) Matches NumPy
K-order on broadcast / non-contig Falls back to C-order (was stride-sort, broken with stride=0) Matches NumPy
Negative strides Only flipped for K-order Matches NumPy FORCEDORDER rule
Empty arrays IsContiguous and IsFContiguous both true (was both false) Matches NumPy
Shape.Order Now derives from contiguity flags (transpose of C reports 'F') Was hardcoded to 'C'

Documentation

  • docs/website-src/docs/NDIter.md (1,934 lines) -- comprehensive NpyIter reference: 7-technique quick reference, decision tree, full Tier C node catalog with NumPy-equivalent column, type discipline, SIMD coverage rules, caching/auto-keys, validation, gotchas, debugging, memory model + lifetime, 19 worked examples (Swish, GELU, Heaviside, Horner polynomial, fused sigmoid, NaN replacement, etc.).
  • docs/website-src/docs/ndarray.md (537 lines) -- NDArray reference page.
  • docs/NPYITER_AUDIT.md, NPYITER_DEEP_AUDIT.md, NPYITER_NUMPY_DIFFERENCES.md, NPYITER_BUFFERED_REDUCE_ANALYSIS.md -- implementation audit reports.
  • Tier renamed A/B/C -> 3A/3B/3C to make the layer-3 sub-tier relationship explicit (100 references across 6 files).

Test Plan

  • Full suite: 6,710 tests pass on net8.0 + net10.0 with CI filter TestCategory!=OpenBugs&TestCategory!=HighMemory. Zero regressions.
  • NumPy 2.4.2 nditer parity: 566/566 scenarios pass byte-for-byte (491 random fuzz seed=42 + 75 structured). Element sequences, stride arrays, multi-indices, reduction outputs all match.
  • NpyExpr + custom-op tests: +264 tests (NpyIterCustomOpTests, NpyIterCustomOpEdgeCaseTests, NpyExprExtensiveTests, NpyExprCallTests).
  • nditer API parity tests: +94 across 10 new files (NpyIterAxisStrideArrayTests, NpyIterCreateCompatibleStridesTests, NpyIterDebugPrintTests, NpyIterGetMultiIndexFuncTests, NpyIterInnerFixedStrideArrayTests, NpyIterOverlapAssumeElementwiseTests, NpyIterReductionAxisEncodingTests, NpyIterResetBasePointersTests, NpyIterTransferFlagsTests, NpyIterWriteMaskedTests).
  • MatMul stride-native: +28 MatMulStridedTests covering all 4 BLAS transpose cases, per-dtype stride-native, sliced views, mixed-type, MLP shapes.
  • Char8: +69 cases (source-generated discovery), 250-line Python bytes oracle diff (identical), 270+ C# edge assertions.
  • Order support TDD: +150 tests across 51 sections in OrderSupport.OpenBugs.Tests.cs.
  • Shape order: +24 tests in Shape.Order.Tests.cs.
  • MLP example: 100-epoch run reaches 99.89% test accuracy; per-epoch breakdown: forward 20.3% / loss 1.5% / backward 52.5% / optimizer 25.8%; bit-exact fused-vs-naive correctness (max |diff| = 0); kernel cache delta = 6 (compiled once, cache-hit thereafter); delegate-slot count = 0.

Known Issues / Out of Scope

  • Char8 not wired into NPTypeCode -- would touch ~50 switch statements; separate PR.
  • np.tile, np.flip, np.where, np.sort still missing (4 [OpenBugs] markers remain after this PR).
  • One pre-existing SetIndicesND assertion on multi-row fancy-write with scalar/matching-array RHS -- investigation in commit 47a74aa9 showed it is a pre-existing indexing bug, not F-order specific. Reproductions added under [OpenBugs].

Migration / Compatibility

Most changes are additive. The behavioral changes table above lists the user-visible deltas -- all align NumSharp closer to NumPy 2.4.2. No deprecated APIs, no removed public surface. The MaxOperands=8 and MaxDims=64 constants are removed but were internal.

Nucs added 30 commits April 22, 2026 23:27
Implements fixes detailed in docs/NPYITER_FIXES_REQUIRED.md to improve
NumPy compatibility of the NpyIter implementation.

Fix #1: Coalescing Always Runs
- Changed NpyIterRef.Initialize() to always coalesce axes after
  construction unless MULTI_INDEX flag is set
- Matches NumPy's nditer_constr.c line 395-396 behavior

Fix #2: Inner Stride Cache
- Added InnerStrides[MaxOperands] array to NpyIterState
- Added UpdateInnerStrides() method to gather inner strides
- GetInnerStrideArray() now returns contiguous array matching
  NumPy's NpyIter_GetInnerStrideArray() format

Fix #3: op_axes Parameter Implementation
- Added ApplyOpAxes() method to support axis remapping
- Supports -1 entries for broadcast/reduction axes
- Enables reduction operations via custom axis mapping

Fix #4: Multi-Index Support
- Added GetMultiIndex(Span<long>) for coordinate retrieval
- Added GotoMultiIndex(ReadOnlySpan<long>) for coordinate jumping
- Added HasMultiIndex property
- HASMULTIINDEX flag tracked during construction

Fix #5: Ranged Iteration
- Added ResetToIterIndexRange(start, end) for parallel chunking
- Added IterStart, IterEnd, and IsRanged properties
- RANGE flag tracks ranged iteration mode

Fix #6: Buffer Copy Type Dispatch
- Added non-generic CopyToBuffer/CopyFromBuffer overloads
- Runtime dtype dispatch for all 12 NumSharp types
- Enables dtype-agnostic iteration code

Fix #7: Flag Bit Positions Documented
- Added documentation explaining NumSharp's flag bit layout
- Legacy compatibility flags use bits 0-7
- NumPy-equivalent flags use bits 8-15
- Semantic meaning matches NumPy, positions differ

Fix #8: MaxDims Increased to 64
- Changed MaxDims from 32 to 64 to match NPY_MAXDIMS
- Supports high-dimensional array iteration

Test coverage:
- 13 new tests for coalescing, multi-index, ranged iteration,
  inner strides, and MaxDims validation
- All 5666 non-OpenBugs tests pass

Note: Full axis reordering before coalescing (for complete 1D
coalescing of contiguous arrays) not yet implemented. Current
implementation coalesces adjacent compatible axes only.
BREAKING: Replaces NumPy's fixed NPY_MAXDIMS=64 limit with unlimited
dimension support via dynamic array allocation.

NumSharp Divergence Rationale:
- NumSharp's Shape uses regular managed arrays (int[] dimensions, int[] strides)
- Practical limit is ~300,000 dimensions (stackalloc limit)
- This matches NumSharp's core design philosophy of unlimited dimensions
- Memory scales with actual dimensions, not worst-case fixed allocation

Implementation:
- Removed MaxDims constant (was 64)
- Added StridesNDim field to track stride array allocation size
- Dimension-dependent arrays (Shape, Coords, Perm, Strides) are now
  dynamically allocated pointers instead of fixed arrays
- Added AllocateDimArrays(ndim, nop) for allocation
- Added FreeDimArrays() for cleanup
- All arrays allocated in single contiguous block for cache efficiency

Per-operand arrays still use fixed MaxOperands=8 limit (reasonable for
multi-operand operations).

Memory Management:
- NpyIterRef.Dispose() calls FreeDimArrays()
- Static NpyIter methods use try/finally for cleanup
- Exception handling properly frees arrays on construction failure

Updated files:
- NpyIter.State.cs: Dynamic allocation with detailed comments
- NpyIter.cs: Call allocation in Initialize(), free in Dispose()
- NpyIterCoalescing.cs: Use StridesNDim instead of MaxDims
- NpyIterBufferManager.cs: Use StridesNDim for stride indexing
- NpyIterKernels.cs: Use StridesNDim in path selection

Tests:
- Removed MaxDims_Is64 test
- Added UnlimitedDimensions_HighDimensionalArray (20D array test)
- Added UnlimitedDimensions_MaxOperands (verifies MaxOperands=8)
- All 5667 tests pass
NpyAxisIter Implementation:
- NpyAxisIter.cs: Axis-based iterator for cumsum, cumprod, var, std
- NpyAxisIter.State.cs: State struct for axis iteration
- NpyLogicalReductionKernels.cs: Generic numeric reduction kernel interfaces

Logical Reduction Refactoring:
- Default.LogicalReduction.cs: Unified logical reduction implementation
- Default.All.cs/Default.Any.cs: Simplified to use new infrastructure
- np.all.cs/np.any.cs: Cleaned up API layer

Cumulative Operation Fixes:
- Default.Reduction.CumAdd.cs: Added contiguity checks for IL kernel path
- Default.Reduction.CumMul.cs: Added contiguity checks for IL kernel path
- Falls back to NpyAxisIter for sliced/reversed/broadcast views

Variance/Std Fixes:
- Default.Reduction.Std.cs: Updated reduction implementation
- Default.Reduction.Var.cs: Updated reduction implementation

Copy Kernel Infrastructure:
- CopyKernel.cs: Copy kernel key and execution path definitions
- ILKernelGenerator.Copy.cs: IL-generated copy kernels
- np.copyto.cs: Updated to use new copy infrastructure

Utilities:
- InfoOf.cs: Added GetSize helper for dtype size lookup
- MultiIterator.cs: Minor updates
- UnmanagedStorage.Cloning.cs: Minor updates

Documentation:
- docs/NPYITER_FIXES_REQUIRED.md: NpyIter implementation requirements
- docs/NPYITER_PARITY_ANALYSIS.md: NumPy parity analysis
- docs/DEFAULTENGINE_ILKERNEL_PLAYBOOK.md: IL kernel guidelines
- docs/DEFAULTENGINE_ILKERNEL_RULEBOOK.md: IL kernel rules
- docs/plans/NDITER.md: NDIter implementation plan

Tests:
- NpyIterBattleTests.cs: Iterator battle tests
- NpyIterReductionBattleTests.cs: Reduction battle tests
- NpyIterScanBattleTests.cs: Scan operation battle tests
- np.logical_reduction.iterator.tests.cs: Logical reduction tests
- np.copyto.NpyIter.Test.cs: Copy operation tests
- Updated np.all.Test.cs, np.any.Test.cs
- NpyIterRefTests.cs: Fix ref struct lambda capture issue
- np.logical_reduction.iterator.tests.cs: Add [TestClass], replace [Test] with [TestMethod]

All 5742 tests pass (excluding OpenBugs category)
…D collapse

NumPy reorders axes by stride magnitude BEFORE coalescing, which allows
contiguous arrays to fully coalesce to ndim=1. This commit implements
the same behavior in NumSharp.

Problem:
For a C-contiguous (2,3,4) array with strides [12,4,1], the coalescing
formula stride[i]*shape[i]==stride[i+1] fails without reordering:
- (0,1): 12*2=24 != 4 → cannot coalesce

Solution:
Added ReorderAxesForCoalescing() that sorts axes by minimum stride:
- After reorder: shapes [4,3,2], strides [1,4,12]
- (0,1): 1*4=4 == 4 ✓ → coalesce to [12,2], strides [1,12]
- (0,1): 1*12=12 == 12 ✓ → coalesce to [24], strides [1]

Changes:
- NpyIterCoalescing.cs: Added ReorderAxesForCoalescing(state, order)
  - Uses insertion sort (stable, good for nearly-sorted data)
  - Respects NPY_ORDER parameter (ascending for C-order, descending for F-order)
  - Marked old ReorderAxes() as obsolete

- NpyIter.cs: Initialize() now calls ReorderAxesForCoalescing() before
  CoalesceAxes() when multi-index is not tracked

- NpyIterRefTests.cs: Updated tests to expect ndim=1 for contiguous arrays

Test results: 5742 passed, 11 skipped, 0 failed
Add missing NumPy nditer features to NpyIterRef:
- RemoveMultiIndex(): Enable coalescing after construction by calling
  ReorderAxesForCoalescing + CoalesceAxes, matching NumPy behavior
- Finished property: Check if iteration is complete
- Shape property: Get current iterator shape after coalescing
- IterRange property: Get (Start, End) tuple
- Iternext(): Advance and return bool for remaining elements
- GetValue<T>/SetValue<T>: Type-safe value access at current position
- GetDataPtr(): Raw pointer access to current operand data

Fix RemoveAxis() to recalculate IterSize after dimension removal.

Add 45 comprehensive NumPy parity tests derived from actual NumPy 2.4.2
output, covering:
- Coalescing behavior (contiguous, transposed, sliced, scalar, empty)
- C_INDEX and F_INDEX tracking (2D and 3D arrays)
- RemoveMultiIndex and RemoveAxis
- Finished property and Iternext pattern
- Shape property changes after coalescing
- Ranged iteration
- Value access (GetValue, SetValue)
- Multi-operand iteration
- Sliced and broadcast arrays
- Edge cases (empty, scalar)

Document known divergences:
- F-order with MULTI_INDEX: skips axis reordering to preserve indices
- K-order on F-contiguous with MULTI_INDEX: same limitation

Create docs/NPYITER_NUMPY_DIFFERENCES.md with complete analysis of
NumPy nditer vs NumSharp NpyIter implementation differences.

Test results: 196 NpyIter tests passing, 5796 total tests passing
…INDEX

Add complete NumPy parity for iteration order when MULTI_INDEX is set:

F-order (NPY_FORTRANORDER):
- First axis changes fastest (column-major iteration)
- Reverses axis order so original axis 0 is innermost
- Uses Perm array to map internal coords to original axis order

K-order (NPY_KEEPORDER):
- Follows memory layout (smallest stride innermost)
- Sorts axes by stride, largest first when MULTI_INDEX is set
- Enables memory-order iteration on transposed/F-contiguous arrays

Key implementation changes:
- Initialize Perm to identity [0,1,2,...] in AllocateDimArrays
- Add forCoalescing parameter to ReorderAxesForCoalescing:
  - true (no MULTI_INDEX): ascending sort for coalescing formula
  - false (with MULTI_INDEX): descending sort for iteration order
- GetMultiIndex: Apply inverse permutation (outCoords[Perm[d]] = Coords[d])
- GotoMultiIndex: Apply permutation (Coords[d] = coords[Perm[d]])
- Shape property: Return shape in original axis order when MULTI_INDEX set

Test results:
- F-order: values 0,3,1,4,2,5 on (2,3) array (matches NumPy)
- K-order on transposed: values 0,1,2,3,4,5 following memory (matches NumPy)
- 196 NpyIter tests passing, 5796 total tests passing
Add GotoIndex() method that jumps to a specific flat index position
based on C_INDEX or F_INDEX flag. This enables random access by flat
array index while iterating.

Implementation details:
- Converts flat index to original coordinates using appropriate formula:
  - C-order: decompose using row-major index strides
  - F-order: decompose using column-major index strides
- Uses Perm array to map original coords to internal coords
- Updates data pointers correctly after position change

Fix ComputeFlatIndex to use original coordinate order:
- Build original coords/shape from internal using Perm array
- Compute C or F index in original array's coordinate system
- Fixes c_index tracking during F-order iteration

Fix Advance() to compute FlatIndex AFTER coords are updated:
- FlatIndex was being computed before coord increment (off by one)
- Now correctly computes after coordinate update
- Fast path (identity perm + C_INDEX) still uses simple increment

Add comprehensive tests:
- GotoIndex with C_INDEX (2D and 3D arrays)
- GotoIndex with F_INDEX
- C_INDEX tracking during F-order iteration

Test results: 200 NpyIter tests passing, 5800 total tests passing
Add Copy() method that creates an independent copy of the iterator
at its current position, matching NumPy's NpyIter_Copy behavior:
- Allocates new NpyIterState on heap
- Copies all fixed-size fields
- Allocates new dimension arrays (Shape, Coords, Perm, Strides)
- Returns new NpyIterRef that owns the copied state
- Advancing/resetting the copy does not affect the original

Add comprehensive tests:
- Copy preserves position
- Copy is independent (advancing original doesn't affect copy)
- Copy preserves flags (MULTI_INDEX, C_INDEX)
- Resetting copy doesn't affect original

Test results: 203 NpyIter tests passing, 5803 total tests passing
…eration

NumPy's nditer flips axes with all-negative strides for cache-efficient
memory-order iteration while tracking flipped coordinates via negative
Perm entries. This implementation adds full NumPy parity.

Key changes:
- FlipNegativeStrides(): Negate all-negative axes, adjust base pointers,
  mark with negative Perm entries, set NEGPERM flag
- GetMultiIndex/GotoMultiIndex: Handle NEGPERM by reversing coords for
  flipped axes (shape - coord - 1)
- GotoIndex: Handle NEGPERM in flat index to multi-index conversion
- ComputeFlatIndex: Handle NEGPERM for correct C/F index computation
- InitializeFlatIndex: Compute initial FlatIndex after axis setup
- HasNegPerm/HasIdentPerm: New properties for perm state inspection
- DONT_NEGATE_STRIDES: Flag support to preserve view logical order

13 new NumPy parity tests covering:
- 1D/2D/3D reversed arrays
- Row/col/both reversed 2D arrays
- GotoIndex/GotoMultiIndex with flipped axes
- Mixed operands (one positive stride prevents flip)
- DONT_NEGATE_STRIDES flag behavior
- Iteration without MULTI_INDEX flag

All 214 NpyIter tests pass, 5814 total tests pass.
GetIterView returns an NDArray view of the i-th operand with the
iterator's internal axes ordering. A C-order iteration of this view
is equivalent to the iterator's iteration order.

Key features:
- Returns view with iterator's internal shape and strides (after
  coalescing/reordering)
- For coalesced arrays: returns lower-dimensional view (e.g., 3D->1D)
- For sliced/transposed arrays: reflects internal optimization
- Throws InvalidOperationException when buffering is enabled
- Validates operand index bounds

8 new NumPy parity tests covering:
- Contiguous array (coalesced to 1D)
- MULTI_INDEX (preserves original shape)
- Transposed array with K-order
- Sliced arrays with non-contiguous strides
- Multiple operands
- Buffered iterator exception
- Invalid operand index exception
- Reversed array with flipped strides

All 222 NpyIter tests pass, 5822 total tests pass.
…ered iteration

NumPy's nditer supports automatic type conversion when iterating arrays
with different dtypes. This implementation adds full NumPy parity for
casting during buffered iteration.

Key changes:

- NpyIterCasting.cs (new): Complete casting validation and conversion
  - CanCast(): Validates casting rules (no_casting, equiv, safe, same_kind, unsafe)
  - IsSafeCast(): Safe casts like smaller int -> larger int, any int -> float64
  - IsSameKindCast(): Both integers or both floats
  - ValidateCasts(): Throws InvalidCastException for invalid casts
  - FindCommonDtype(): For COMMON_DTYPE flag
  - ConvertValue(): Single value conversion via double intermediate
  - CopyWithCast(): Strided array copy with type conversion

- NpyIter.State.cs:
  - OpSrcDTypes[]: Track source array dtypes for casting
  - SrcElementSizes[]: Source element sizes for stride calculation
  - GetOpSrcDType/SetOpSrcDType accessors
  - NeedsCast(op): Check if operand requires type conversion

- NpyIterBufferManager.cs:
  - CopyToBufferWithCast(): Copy from source to buffer with conversion
  - CopyFromBufferWithCast(): Copy from buffer to destination with conversion
  - Handles 1D and multi-dimensional strided arrays

- NpyIter.cs:
  - Initialize handles COMMON_DTYPE flag to auto-find common dtype
  - Stores source dtypes and validates casting rules
  - GetDataPtr returns buffer pointer when buffering enabled
  - CRITICAL BUG FIX: Dispose was using NativeMemory.Free for buffers
    allocated with AlignedAlloc. Now correctly uses FreeBuffers which
    calls AlignedFree. This was causing memory corruption and test crashes.

13 new NumPy parity tests:
- Cast_Int32ToFloat64_SafeCasting
- Cast_Float64ToInt32_UnsafeCasting
- Cast_Float64ToInt32_SafeCasting_Throws
- Cast_Int16ToInt32_SafeCasting
- Cast_CommonDtype_TwoOperands
- Cast_WriteOutput_WithConversion
- Cast_SameKindCasting_IntToInt
- Cast_SameKindCasting_IntToFloat_Throws
- Cast_NoCasting_SameType_Allowed
- Cast_NoCasting_DifferentType_Throws
- Cast_RequiresBuffered_ThrowsWithoutBuffer
- And more...

All 233 NpyIter tests pass, 5833 total tests pass.
Adds basic reduction iteration support matching NumPy's nditer API:

- Reduction detection via op_axes with -1 entries for READWRITE operands
- REDUCE_OK flag validation: throws if reduction detected without flag
- IsFirstVisit(operand): checks if current element is first visit
  (for initialization, e.g., set to 0 before summing)
- IsReduction property: check if iterator has reduction operands
- IsOperandReduction(op): check if specific operand is reduction
- REDUCE flags set on iterator (NpyIterFlags.REDUCE) and operands
  (NpyIterOpFlags.REDUCE) when reduction is detected

Key implementation details:
- Modified ApplyOpAxes to detect reduction axes and validate REDUCE_OK
- Fixed isWriteable check to only match WRITE flag (READWRITE includes both)
- Modified ValidateIterShape to account for op_axes -1 entries
- Modified Initialize to set up strides directly from op_axes when provided
  instead of using np.broadcast_to (which fails for reduction shapes)

Tests added (7 new NumPy parity tests):
- Reduction_1DToScalar_IteratesCorrectly
- Reduction_2DToScalar_IteratesCorrectly
- Reduction_2DAlongAxis1_ProducesCorrectResult
- Reduction_IsFirstVisit_ReturnsTrueOnFirstElement
- Reduction_WithoutReduceOK_Throws
- Reduction_ReadOnlyOperand_DoesNotThrow
- Reduction_HasReduceFlag_WhenReductionDetected

All 240 NpyIter tests and 5840 total tests passing.
- Add READWRITE validation: reduction operands must have both READ and
  WRITE flags (WRITEONLY throws ArgumentException). NumPy requires this
  because reduction must read existing value before accumulating.

- Add buffer reduction fields to NpyIterState:
  - ReducePos: current position in reduce outer loop
  - ReduceOuterSize: size of reduce outer loop
  - ReduceOuterStrides: per-operand reduce outer strides
  - GetReduceOuterStride/SetReduceOuterStride accessors

- Update IsFirstVisit to check buffer reduce_pos:
  Part 1: Check coordinates (existing) - if stride=0 and coord!=0, not first
  Part 2: Check buffer reduce_pos (new) - when BUFFERED flag set, if
  ReducePos!=0 and operand's reduce outer stride is 0, not first visit

- Add Reduction_WriteOnlyOperand_Throws test

241 NpyIter tests passing, 5843 total tests passing (excluding OpenBugs)
…Py parity

Implements NumPy's double-loop pattern for efficient buffered reduction
(nditer_templ.c.src lines 131-210). This avoids re-buffering during
reduction by separating iteration into inner (core) and outer loops.

Key changes:

NpyIterState new fields:
- CoreSize: Number of inputs per output element (reduce dimension size)
- CorePos: Current position within core [0, CoreSize) for IsFirstVisit
- SetBufStride/GetBufStride: Accessors for buffer strides

SetupBufferedReduction:
- CoreSize = Shape[outerDim] (reduce axis size)
- ReduceOuterSize = transferSize / CoreSize (number of outputs)
- BufStrides: 0 for reduce ops (stay at same output), elemSize for others
- ReduceOuterStrides: elemSize for reduce ops (next output),
  elemSize*CoreSize for non-reduce ops

BufferedReduceAdvance:
- Inner loop: increment CorePos, advance by BufStrides
- Outer loop: reset CorePos to 0, advance by ReduceOuterStrides
- Returns 0 when buffer exhausted for refill

IsFirstVisit for buffered mode:
- Uses CorePos = 0 check instead of coordinates
- First visit is only at start of each output element's accumulation

CopyReduceBuffersToArrays:
- For reduce operands: copy ReduceOuterSize elements (number of outputs)
- For non-reduce operands: copy full CoreSize * ReduceOuterSize elements
- Uses ResetDataPtrs as destination (original array, not buffer)

GetDataPtr for buffered reduce:
- Returns DataPtrs directly (tracked by BufferedReduceAdvance)
- Instead of computing from IterIndex which doesn't work with double-loop

Tests added:
- BufferedReduction_1DToScalar_ProducesCorrectResult
- BufferedReduction_2DAlongAxis1_ProducesCorrectResult
- BufferedReduction_IsFirstVisit_WorksWithBuffering
- BufferedReduction_LargeArray_ExceedsBuffer (tests buffer refill)
- BufferedReduction_WithCasting_WorksCorrectly
- BufferedReduction_DoubleLoopFields_AreSetCorrectly

247 NpyIter tests passing, 5847 total tests passing (excluding OpenBugs)
…coreSize)

Problem:
When buffer size is smaller than core size (reduce dimension size), the
buffered reduction double-loop pattern broke down:
- BufIterEnd was set to CoreSize instead of bufferSize
- coreOffset tracking was misaligned with actual core boundaries
- Reduce operand reload decisions were incorrect

Root Cause:
The coreOffset tracking was based on buffer refill counts, but core boundaries
are determined by iteration coordinates. When bufferSize < coreSize, multiple
buffer refills occur per core, causing the tracking to desync.

Fix:
1. Use pointer comparison to detect new output positions:
   - After GotoIterIndex, compare current array position with previous writeback
   - Only reload reduce operand if pointer changed (new output element)

2. Add ArrayWritebackPtrs field to store writeback positions separately:
   - ResetDataPtrs must stay as base pointers for GotoIterIndex
   - ArrayWritebackPtrs stores the actual writeback destinations

3. Set BufIterEnd to min(BufferSize, CoreSize) for small buffer support

Test Results:
- All 252 NpyIter tests pass
- Small buffer test (3,8)->(3,) with bufferSize=4 now produces [28, 92, 156]
- NumPy parity confirmed for buffered reduction edge cases

Analysis documented:
- EXLOOP, BUFNEVER, BUF_REUSABLE are performance optimizations not bugs
- Current implementation is functionally correct with NumPy
Audit Summary (2026-04-16):
- 252 tests passing (101 parity, 70 battle, 41 ref tests)
- 32 NumPy APIs fully implemented
- All core features complete: iteration, indexing, buffering, casting, reduction

Key findings:
- Implementation is PRODUCTION READY
- No critical missing features for NumSharp operations
- Full NumPy parity for buffered reduction including small buffer handling
- Intentional divergences documented (unlimited dims, 8 max operands)

Remaining low-priority items (performance only):
- BUFNEVER per-operand buffer skip
- Enhanced buffer reuse logic
- EXLOOP increment optimization
BREAKING CHANGE: Removed MaxOperands=8 limit

NumPy uses NPY_MAXARGS=64 (was 32 in 1.x) as a runtime constant. NumSharp
now achieves full parity by supporting truly unlimited operands through
dynamic allocation of all per-operand arrays.

Changes:
- NpyIterState: Convert all 14 fixed per-operand arrays to dynamically
  allocated pointers (DataPtrs, ResetDataPtrs, BaseOffsets, OpItFlags,
  OpDTypes, OpSrcDTypes, ElementSizes, SrcElementSizes, InnerStrides,
  Buffers, BufStrides, ReduceOuterStrides, ReduceOuterPtrs, ArrayWritebackPtrs)
- AllocateDimArrays: Now allocates both dimension arrays AND operand arrays
  in separate contiguous blocks with proper alignment
- FreeDimArrays: Now frees both blocks
- All accessor methods simplified (no more fixed() statements needed)
- Copy method: Fixed to properly copy operand arrays for all cases
  including scalar (NDim=0)
- NpyIterCoalescing: Updated to use direct pointer access

Tests:
- Added UnlimitedOperands_100Operands_IteratesCorrectly test
- Updated TooManyOperands_Throws to ManyOperands_Works
- Updated UnlimitedDimensions_MaxOperands to verify 16 operands work
- 253 NpyIter tests passing

Memory layout for operand arrays (per NOp elements):
- 9 long* arrays (72 bytes each for 64-bit pointers)
- 2 int* arrays
- 1 ushort* array
- 2 byte* arrays
All sections 8-byte aligned for optimal cache performance.
Deep audit validates NumSharp NpyIter against NumPy 2.x using:
1. Behavioral Comparison - 55 NumPy vs NumSharp tests
2. Edge Case Matrix - 12 systematic edge cases
3. Source Code Comparison - NumPy C vs C# structural analysis
4. Property Invariants - 13 mathematical invariant tests

Total validation: 333 tests (253 unit + 80 behavioral/invariant)

Key findings:
- All tests pass confirming full NumPy parity
- NEGPERM behavior verified (reversed arrays iterate in memory order)
- Buffered reduce double-loop matches NumPy structure
- Coalescing algorithm structurally identical
- All 6 property invariants hold

New files:
- docs/NPYITER_DEEP_AUDIT.md - Comprehensive audit report
- test/audit_behavioral.cs - Runtime audit script
- test/NumSharp.UnitTest/.../NpyIterNumPyBattleTests.cs - Battle tests
Three battle tests document NumSharp's iteration order differences:

1. F-order iteration: NumSharp is C-order only (documented limitation)
   - NumPy: [0, 3, 1, 4, 2, 5] (F-order)
   - NumSharp: [0, 1, 2, 3, 4, 5] (C-order)

2. Multi-operand broadcasting: Different iteration order
   - NumPy: [(0,0), (1,1), (2,2), (0,3), (1,4), (2,5)]
   - NumSharp: [(0,0), (0,3), (1,1), (1,4), (2,2), (2,5)]

3. Non-contiguous view: Memory order vs logical C-order
   - NumPy: [1, 3, 11, 13] (logical C-order)
   - NumSharp: [1, 11, 3, 13] (memory order)

All tests now pass (277 total NpyIter tests).
NumPy's nditer coalescing strategy:
- K-order: Always coalesce for memory efficiency (sort by stride)
- C-order on C-contiguous: Coalesce → memory order (== C-order)
- F-order on F-contiguous: Coalesce → memory order (== F-order)
- F-order on C-contiguous: NO coalescing, reverse axes for F-order

Previously NumSharp was coalescing for ALL orders when array was
contiguous in any layout, which produced incorrect iteration order
for F-order on C-contiguous arrays.

Changes:
- NpyIter.cs: Add CheckAllOperandsContiguous(bool cOrder) helper
  to check if arrays are contiguous in requested order
- NpyIter.cs: Only coalesce when order matches array contiguity
- NpyIterCoalescing.cs: Add IsContiguousForCoalescing() check

Test results:
- 277 NpyIter tests passing (including 24 new battle tests)
- 5813 total tests passing
- F-order now produces [0,3,1,4,2,5] instead of [0,1,2,3,4,5]
  for a 2x3 C-contiguous array (matches NumPy)
…arrays

Problem:
- K-order iteration on broadcast arrays produced wrong order
  (stride-based sorting with stride=0 breaks axis ordering)
- K-order iteration on non-contiguous views also used wrong order
- NumPy: (3,) x (2,3) broadcast iterates C-order: [(0,0),(1,1),(2,2),(0,3),(1,4),(2,5)]
- NumSharp was producing: [(0,0),(0,3),(1,1),(1,4),(2,2),(2,5)]

Root cause:
- For K-order, we sorted axes by stride magnitude
- But GetMinStride excludes stride=0, leading to incorrect axis ordering
- Non-contiguous views similarly got wrong ordering from stride sort

Solution:
- For K-order with broadcast dimensions (stride=0), fall back to C-order
- For K-order with non-contiguous arrays, fall back to C-order
- Added HasBroadcastStrides() helper to detect broadcast dimensions
- CheckAllOperandsContiguous now uses absolute strides to handle
  reversed arrays (negative strides become positive after FlipNegativeStrides)
- Separate coalescing logic for C/F/K orders to preserve iteration semantics

Changes:
- NpyIter.cs: Added broadcast detection, fixed coalescing decision logic
- NpyIterNumPyBattleTests.cs: Updated tests to expect correct NumPy behavior
  (removed [Misaligned] attributes from Battle_MultiOperandBroadcasting and
  Battle_NonContiguousViewOrder since they now match NumPy)

All 277 NpyIter tests passing.
All 5877 project tests passing.
Deep audit against NumPy 2.4.2 source revealed 7 behavioral bugs. All fixed via TDD.

Bug #1: Negative strides always flipped regardless of order
- NumPy (nditer_constr.c:297-307) only flips when NPY_ITFLAG_FORCEDORDER not set
- FORCEDORDER is set by C, F, and A orders. Only K-order skips it.
- Fix: Only call FlipNegativeStrides for K-order
- CheckAllOperandsContiguous now takes allowFlip param (abs strides only when flipping)
- Affects: 1D/2D reversed arrays with C/F/A orders

Bug #2: NO_BROADCAST flag not enforced
- Code was skipping NO_BROADCAST operands instead of enforcing the constraint
- Fix: NO_BROADCAST operands must match iterShape without dim-1 stretching
- ValidateIterShape now always runs (not just when iterShape is provided)

Bug #3: F_INDEX returned C-order indices
- Coalescing reduces to NDim=1, losing original axis structure needed for F-index
- Fix: Disable coalescing when C_INDEX or F_INDEX is set (like MULTI_INDEX)

Bug #4: ALLOCATE with null operand threw NullReferenceException
- CalculateBroadcastShape accessed null op[i].ndim
- Fix: Skip null operands in broadcast shape calc, then allocate them after
  with correct shape (from op_axes if provided) and dtype

Bug #5,6,7: op_axes reductions broken (axis=0 gave [15,0,0], axis=1 threw)
- ApplyOpAxes was re-applying op_axes to strides that were already correctly set
  in the main operand setup loop, zeroing out non-reduce strides
- CalculateBroadcastShape didn't know about op_axes, couldn't compute iter shape
- Fix: ApplyOpAxes now only validates and sets REDUCE flags, not strides
- Fix: CalculateBroadcastShape now accepts opAxesNDim/opAxes parameters
- Uses production Shape.ResolveReturnShape API for all broadcasting

Refactoring: Uses production Shape.ResolveReturnShape / np.broadcast_to
- Replaces custom broadcast shape calculation
- User feedback: production APIs are 1-to-1 with NumPy

Testing:
- 21 new TDD tests in NpyIterParityFixTests.cs
- All 298 NpyIter tests pass
- All 5898 project tests pass
- Final battletest: 21/21 scenarios match NumPy 2.4.2 exactly

Fixed test: NullOperand_Throws now expects ArgumentException (more accurate than
NullReferenceException since null operand without ALLOCATE is an argument error).
Adds F-contiguity detection and OrderResolver for NumPy's 4 memory orders
at minimum functionality, with zero behavioral change to existing code.

Changes:
- Shape.cs: F-contig detection via ComputeIsFContiguousStatic (mirror of C-contig
  algorithm, scan left-to-right). Sets ArrayFlags.F_CONTIGUOUS flag during flag
  computation. New IsFContiguous property (O(1) flag check). New
  ComputeFContiguousStrides helper. New Shape(long[] dims, char order)
  constructor for explicit physical-order construction.
- Scalar constructor now sets both C_CONTIGUOUS and F_CONTIGUOUS (matches NumPy).
- OrderResolver.cs (NEW): Resolves NumPy order chars (C/F/A/K) to physical
  storage orders (C or F). 'A' and 'K' require a source Shape for resolution
  (matches NumPy: creation functions without source throw "only 'C' or 'F'
  order is permitted").
- np.empty.cs: New overload np.empty(shape, order, dtype) wiring
  OrderResolver through to Shape.

Key insight: transpose already produces F-contig memory layout; previously this
went undetected because F_CONTIGUOUS flag was never set. Now:
  arr = np.arange(24).reshape(2,3,4)
  arr.T.Shape.IsFContiguous  // true (previously: false / undetected)

Design:
- Only C and F are physical storage layouts; A and K are logical decisions
  that resolve to C or F based on source array layout.
- OrderResolver centralizes the C/F/A/K -> C/F mapping, letting future
  wiring of np.copy/np.array/flatten/ravel/reshape be a 1-line call.
- Existing IsContiguous callers (116 usages across 50 files) unchanged -
  they still see C_CONTIGUOUS=false for F-contig arrays and take the
  strided path (which is correct, just not yet SIMD-accelerated).

Tests (24 new in Shape.Order.Tests.cs):
- Scalar and 1-D arrays are both C and F contig
- Multi-dim C-contig is not F-contig and vice versa
- Transpose of C-contig now reports IsFContiguous=true
- Shape(dims, 'F') produces correct F-order strides (1, 2, 6 for 2x3x4)
- Shape(dims, 'A'/'X') throws ArgumentException
- OrderResolver: C/F resolve directly; A/K without source throw; A/K with
  source resolve based on source layout
- np.empty(order='C'/'F') produces correct layout
- np.empty(order='A'/'K') throws (matches NumPy)

Verification:
- 6017 tests pass on both net8.0 and net10.0 (zero regressions)
- NumPy parity verified via Python side-by-side comparison
- All order resolution semantics match NumPy 2.4.2

Future phases unblocked (each a ~1-line change):
- ILKernelGenerator fast paths can add || IsFContiguous for element-wise ops
- NpyIter.CheckAllOperandsContiguous can use Shape.IsFContiguous directly
- np.copy(order), np.array(order), flatten(order), ravel(order) wiring
- np.asfortranarray, np.ascontiguousarray
Review of initial F-order support surfaced three design issues where
NumSharp diverged from NumPy's patterns. This refactor aligns with
NumPy's flagsobject.c:_UpdateContiguousFlags exactly.

Changes:

1. Unified contiguity computation (single-pass)
   - Replaced two separate functions (ComputeIsContiguousStatic,
     ComputeIsFContiguousStatic) with one combined
     ComputeContiguousFlagsStatic returning (isC, isF) tuple.
   - Mirrors NumPy's _UpdateContiguousFlags which computes both in one
     function with a shared dim==0 early exit.
   - Fewer call sites, one traversal per contiguity check, cleaner
     shared logic.

2. Fixed Shape.Order property (was hardcoded to layout = 'C')
   - Now derives from actual contiguity flags: returns 'F' if strictly
     F-contiguous (IsFContiguous && !IsContiguous), else 'C'.
   - Transposed C-contig arrays now correctly report Order='F'.
   - 1-D and scalar shapes (both C and F contig) report 'C' by
     convention (NumPy-default reference order).
   - Non-contiguous shapes report 'C' as default reference.

3. Fixed empty array flag computation (any dim == 0)
   - NumPy short-circuits _UpdateContiguousFlags on dim==0 setting
     BOTH C_CONTIGUOUS and F_CONTIGUOUS unconditionally and NOT setting
     BROADCASTED. Empty arrays have no elements so broadcast semantics
     are meaningless.
   - Previously NumSharp computed strides like (0, 3, 1) for shape
     (2, 0, 3), triggered IsBroadcasted=true, and then skipped
     contiguity flag assignment entirely. Result was an empty array
     reporting IsContiguous=false, IsFContiguous=false.
   - Now matches NumPy: any dim=0 short-circuits to set both C and F
     contig + WRITEABLE + ALIGNED, clear BROADCASTED.

4. Clarified `layout` const documentation
   - The internal const char layout = 'C' was misleadingly named (as if
     it described the shape's physical order) but only ever used as a
     hash seed in ComputeSizeAndHash. Updated doc comment to clarify
     this is NOT the physical memory order — use Order / IsContiguous
     / IsFContiguous for actual layout info.
   - Value unchanged to preserve existing hash stability.

Additional tests (6 new):
- Order property for C, F, transpose, 1-D, scalar cases
- Empty array is both C and F contiguous (matching NumPy 2.4.2)

Test results:
- 6023 tests pass on both net8.0 and net10.0 (was 6017; 6 new tests)
- Zero regressions

NumPy source reference: numpy/_core/src/multiarray/flagsobject.c
…enarios

Ports the last NumPy nditer surface gaps identified by the audit, each with
1-to-1 semantic parity verified against NumPy 2.4.2 via Python harness.

10 items implemented (all battletested):
  1. NpyIter_ResetBasePointers  (nditer_api.c:314)
     - Populate BaseOffsets during FlipNegativeStrides so ResetBasePointers
       can recompute ResetDataPtrs[iop] = baseptrs[iop] + baseoffsets[iop].
     - Public: NpyIterRef.ResetBasePointers(ReadOnlySpan<IntPtr>) and
       ResetBasePointers(NDArray[]) convenience overload.

  2. NPY_ITFLAG_TRANSFERFLAGS_SHIFT packing  (nditer_constr.c:3542)
     - Pack NpyArrayMethodFlags into top 8 bits of ItFlags (shift=24).
     - Public: NpyIterRef.GetTransferFlags() + NpyArrayMethodFlags enum
       + NpyIterConstants.TRANSFERFLAGS_SHIFT/MASK constants.
     - REQUIRES_PYAPI never set in .NET (no Python GIL). SUPPORTS_UNALIGNED
       and NO_FLOATINGPOINT_ERRORS always set (raw pointer loops, .NET casts
       don't raise FPE). IS_REORDERABLE set for numeric casts.

  3. NpyIter_GetGetMultiIndex factory  (nditer_templ.c.src:481)
     - Specialized delegate factory returning NpyIterGetMultiIndexFunc with
       3 dispatches: IDENTPERM (direct copy), positive perm (apply perm[]),
       NEGPERM (apply perm+flip). BUFFER and HASINDEX don't affect coords
       so no specialization needed for them.
     - Public: GetMultiIndexFunc(), GetMultiIndexFunc(out errmsg),
       InvokeMultiIndex(fn, coords) — ref-struct-safe invocation.
     - Also fixes: IDENTPERM flag is now set at construction (after
       AllocateDimArrays). Previously only set post-coalescing, leaving
       MULTI_INDEX iterators without the fast-path flag.

  4. NpyIter_GetInnerFixedStrideArray  (nditer_api.c:1357)
     - Public: GetInnerFixedStrideArray(Span<long>).
     - Buffered: copies BufStrides. Non-buffered: innermost-axis stride per
       operand. Returns BYTE strides (NumPy convention), multiplying
       NumSharp's element-count strides by ElementSizes[op].

  5. NpyIter_GetAxisStrideArray  (nditer_api.c:1309)
     - Public: GetAxisStrideArray(int axis, Span<long>).
     - With HASMULTIINDEX: walks perm to find internal axis (handles both
       positive and NEGPERM-encoded entries). Without: Fortran-order
       (fastest-first) lookup via NDim-1-axis. Byte strides.

  6. NpyIter_CreateCompatibleStrides  (nditer_api.c:1058)
     - Public: CreateCompatibleStrides(long itemsize, Span<long>).
     - Requires HASMULTIINDEX, rejects flipped axes. Walks perm from
       innermost (NDim-1) outward, accumulating itemsize into outStrides[axis]
       in original (C-order) axis slots.

  7. NpyIter_DebugPrint  (nditer_api.c:1402)
     - Public: DebugPrint(), DebugPrint(TextWriter), DebugPrintToString().
     - Faithful port of NumPy's dump format: ItFlags decoded, NDim/NOp,
       IterSize/Start/End/Index, Perm, DTypes, DataPtrs, BaseOffsets,
       OpItFlags, BufferData (when BUFFER), per-axis data.

  8. NPY_ITER_REDUCTION_AXIS encoding  (common.h:347, nditer_constr.c:1431)
     - Additive encoding: axis + (1 << 30). Values >= (1<<30)-1 flagged as
       reduction axes. Value 0x40000000 for axis 0, 0x3FFFFFFF for axis -1.
     - Public: NpyIterUtils.ReductionAxis(int) encoder and GetOpAxis(int,
       out bool) decoder. NpyIterConstants.REDUCTION_AXIS_OFFSET = 1<<30.
     - Integrated into CalculateBroadcastShape (rejects length != 1 on
       reduction axes), ValidateIterShape, and ApplyOpAxes (enforces
       REDUCE_OK + sets REDUCE flag).

  9. WRITEMASKED + ARRAYMASK + check_mask_for_writemasked_reduction
     - TranslateOpFlags now maps NpyIterPerOpFlags.WRITEMASKED ->
       NpyIterOpFlags.WRITEMASKED on op flags.
     - PreCheckMaskOpPairing validates: WRITEMASKED requires one ARRAYMASK,
       ARRAYMASK requires >=1 WRITEMASKED, at most one ARRAYMASK, no
       operand with both flags.
     - SetMaskOpFromFlags sets NpyIterState.MaskOp index of ARRAYMASK operand.
     - CheckMaskForWriteMaskedReduction enforces (nditer_constr.c:1328):
       for any WRITEMASKED + REDUCE operand, no axis may have maskstride!=0
       && opstride==0 (would produce multiple mask values per reduction element).
     - Public: NpyIterRef.MaskOp, HasWriteMaskedOperand.

 10. NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE per-op flag
     - Added NpyIterPerOpFlags.OVERLAP_ASSUME_ELEMENTWISE_PER_OP = 0x40000000
       in the correct per-operand flag slot (NumPy's location). Accepted
       syntactically as a marker for COPY_IF_OVERLAP fast-path elision.

Correctness bugs fixed while battletesting:

  A. SetupBufferedReduction produced inverted strides for non-reduce operands.
     BufStride was set to elemSize (assumed linear buffer); correct value is
     the operand's stride along the REDUCE axis (inner loop = reduce axis
     traversal). ReduceOuterStride was set to elemSize*coreSize; correct is
     stride along the non-reduce axis.

  B. SetupBufferedReduction only worked for 2-axis cases (one reduce, one
     non-reduce). For 3D+ with multiple non-reduce axes, added CoreSize=0
     short-circuit that defers to regular N-D Advance() — which correctly
     carries multiple axes via Coords + per-axis strides. stride=0 on
     reduce axis naturally keeps y's pointer fixed during reduce iteration.

  C. GetDataPtr for BUFFER+REDUCE with CoreSize=0 returned a buffer pointer
     indexed by IterIndex (linear assumption). For reduce this is wrong —
     DataPtrs already track the correct position. Now returns DataPtrs
     whenever REDUCE flag is set.

  D. Reset() didn't reposition to IterStart. IterIndex was set to IterStart
     but DataPtrs/Coords were reset to array origin, desyncing the iterator
     state for ranged iterators with IterStart > 0. Now delegates to
     GotoIterIndex(IterStart) which sets all three consistently.

  E. K-order fallback to C-order was too aggressive — triggered for all
     non-contiguous arrays, defeating NumPy's K-order semantic of iterating
     in memory order. Fixed to fall back only when broadcast axes (stride=0)
     are present; merely non-contiguous (transposed, strided, negative-
     stride) now properly sorts axes by |stride| descending.

  F. CoalesceAxes rejected size-1 axes unless stride==0. Size-1 axes
     contribute no iteration and should always be absorbed into a neighbor.
     Fix restores proper 1D coalescing for shapes like (2,4,1) contiguous.

  G. FlipNegativeStrides now populates BaseOffsets[op] (previously an
     allocated-but-unused field). Prereq for item #1 (ResetBasePointers).

Battletest harness:
  - Python<->NumSharp scenario harness in a temp workspace with 3 structured
    waves (25 scenarios each) plus a 491-scenario random fuzz test with
    deterministic seed (42). All scenarios compare element sequences, stride
    arrays, multi-indices, reduce outputs, and iteration state byte-for-byte
    against NumPy 2.4.2 output.
  - Coverage: 1D-5D shapes; int8/16/32/64, uint16, float32/64 dtypes;
    contiguous, transposed (2D+3D), strided, negative-stride, size-1 axes,
    and all combinations; MULTI_INDEX, C_INDEX, F_INDEX; RANGED + goto;
    explicit/implicit reduction axes; multi-operand broadcast.
  - Result: 566/566 scenarios pass (25+25+25+491). All semantically
    equivalent to NumPy's C-level nditer output.

Added tests (94 new unit tests):
  - NpyIterAxisStrideArrayTests (12)
  - NpyIterCreateCompatibleStridesTests (9)
  - NpyIterDebugPrintTests (12)
  - NpyIterGetMultiIndexFuncTests (10)
  - NpyIterInnerFixedStrideArrayTests (9)
  - NpyIterOverlapAssumeElementwiseTests (5)
  - NpyIterReductionAxisEncodingTests (11)
  - NpyIterResetBasePointersTests (10)
  - NpyIterTransferFlagsTests (8)
  - NpyIterWriteMaskedTests (8)

Regression: 6023/6023 project tests pass (was 5898 before this work),
zero regressions. Project passes ~125 more tests than baseline because
fixes C-F unblocked test cases that were previously failing silently.
…rface

New file: OrderSupport.OpenBugs.Tests.cs (39 tests, 11 marked [OpenBugs])

Comprehensive TDD test file documenting the gap between NumSharp's
current behavior and NumPy 2.x's expected behavior for memory order
support. Each test uses NumPy's exact output as the expected value
(verified via side-by-side Python scripts).

Test sections:
1. Creation APIs (np.zeros/ones/empty/full) — 10 tests
2. Copy/conversion (np.copy, NDArray.copy) — 7 tests
3. Manipulation (flatten, ravel) — 5 tests
4. Arithmetic output layout — 3 tests
5. Reductions on F-contig (math-equivalent) — 6 tests
6. Slicing contiguity preservation — 2 tests
7. Broadcasting output layout — 1 test
8. Transpose behavior — 3 tests
9. Iteration order (C-order via GetOffset) — 1 test
10. Order property derivation — 3 tests

Results (net8.0 and net10.0):
- 28 tests pass (documents working behavior / NumPy parity)
- 11 tests fail (marked [OpenBugs], excluded from CI via filter)

Currently failing [OpenBugs] — API gaps to close in future phases:

Section 2 — np.copy / NDArray.copy ignore order parameter:
  - NpCopy_FOrder_ProducesFContig
  - NpCopy_AOrder_FSource_ProducesFContig
  - NpCopy_KOrder_FSource_ProducesFContig
  - NDArrayCopy_FOrder_ProducesFContig
  - NDArrayCopy_AOrder_FSource_ProducesFContig

Section 3 — flatten/ravel ignore/lack order:
  - Flatten_CContig_FOrder_MatchesNumPy
  - Flatten_FContig_FOrder_MatchesNumPy
  - Ravel_FOrder_ApiGap (ravel has no order parameter at all)

Section 4 — arithmetic always produces C-contig output:
  - Arithmetic_FContig_ScalarMul_PreservesFContig
  - Arithmetic_FPlusF_PreservesFContig

Section 7 — broadcast always produces C-contig output:
  - Broadcast_FContig_PlusFCol_PreservesFContig

Currently passing (NumPy-aligned behavior confirmed):
- np.zeros/ones/full preserve F-contig when given an F-Shape (workaround
  for missing order= parameter, but layout IS preserved)
- np.empty(order='C'/'F'/'A'/'K') — correct behavior; A/K throw
- All reductions (sum, mean, min, max, axis=0, axis=1) work on F-contig
- Transpose toggles C<->F contig correctly
- Slicing: 1-col slice of F-contig is both C and F contig (matches NumPy)
- Slicing: row-slice of F-contig is neither (matches NumPy)
- Shape.Order property reports correct char based on flags
- Scalar multiply on F-contig produces correct values (just loses layout)
- Indexed iteration on F-contig produces C-order logical traversal
  (matches NumPy's arr.flat semantics)

CI verification:
- Full suite with CI filter: 6051 pass, 0 fail (net8.0 and net10.0)
- With TestCategory=OpenBugs: 11 fail (as expected)
- With TestCategory!=OpenBugs: 28 pass (0 regressions)

Next steps: fix each [OpenBugs] by wiring order through the respective
API using OrderResolver. Remove the attribute after verifying the test
passes with NumPy's expected output.
Expands OrderSupport.OpenBugs.Tests.cs from 39 to 67 tests covering
every NumPy function that accepts an 'order' parameter.

NumPy functions covered (15 total that accept order=):
- Creation: empty, empty_like, zeros, zeros_like, ones, ones_like,
  full, full_like, eye
- Conversion: array, asarray, asanyarray, copy
- Manipulation: reshape, ravel
- Method: astype, flatten, copy

New sections added:
- Section 11: np.empty_like (default K, preserves source layout)
- Section 12: np.zeros_like (default K) + values-are-zero test
- Section 13: np.ones_like (default K) + values-are-one test
- Section 14: np.full_like (default K) + values-are-fill test
- Section 15: np.eye (C/F order) + identity values test
- Section 16: np.asarray / np.asanyarray API gaps
- Section 17: astype (default K, preserves source layout)
- Section 18: np.reshape with F-order (column-major fill)
- Section 19: np.ravel with C/F/A/K orders
- Section 20: np.array with order (Array input overload)
- Section 21: np.asfortranarray / np.ascontiguousarray (missing APIs)

Results (net8.0 and net10.0):
- 42 tests pass (document working behavior / NumPy parity)
- 25 tests fail (marked [OpenBugs], excluded from CI via filter)

25 [OpenBugs] documenting gaps:
- *_like don't preserve F-contig (5 tests: empty/zeros/ones/full/astype)
- np.copy/NDArray.copy order ignored (7 tests from prior commit)
- flatten/ravel order ignored (3 tests)
- arithmetic/broadcast don't preserve F-contig (3 tests)
- np.eye has no order param (1 test)
- np.reshape has no order param (1 test)
- np.array order ignored (1 test)
- np.asarray/asanyarray have no NDArray+order overload (2 tests)
- np.asfortranarray/np.ascontiguousarray don't exist (2 tests)

Confirmed NumPy parity (new passing tests):
- np.empty_like/zeros_like/ones_like/full_like on C-contig (K default -> C)
- np.zeros_like/ones_like/full_like produce correct fill values
- np.eye default produces C-contig identity matrix
- astype preserves C-contig from C source (K default)
- astype preserves values during type conversion
- np.reshape default produces row-major fill
- np.ravel default is C-order
- np.array default produces C-contig

CI verification:
- TestCategory!=OpenBugs: 6065 pass, 0 fail (net8.0 and net10.0)
- TestCategory=OpenBugs: 25 fail (as expected bug reproductions)

All NumPy order baselines verified via Python 2.4.2 side-by-side scripts.
Nucs added 3 commits April 22, 2026 23:41
`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was
delegating to `Marshal.SizeOf(_dtype)`. For every numeric dtype that
matches, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool
is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]`
is 1 byte per element, so every caller computing a byte offset as
`ptr + index * arr.dtypesize` was reading/writing 4× too far into the
buffer for bool arrays.

Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and
matches `Marshal.SizeOf` for every other type. 21 existing call sites
(matmul, binary/unary/comparison/reduction ops, nan reductions, std/var,
argmax, random shuffle, boolean mask gather, etc.) now get the right
value without any downstream change.

The bug had been latent until the Phase 2 iterator migration started
routing more code paths through NpyIter.Copy and the new NDIterator
wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the
wrong elements when the source was non-contiguous. With the root fix:

    var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
    var sliced = full["::2"];            // [T,T,T,T,T] non-contig
    var result = sliced[new_bool_mask];  // now correct per-element

np.save.cs already special-cases bool before falling through to
`Marshal.SizeOf`, so serialization was unaffected. Remaining
Marshal.SizeOf references in the codebase are either in comments that
explain this exact issue, or in the `InfoOf<T>.Size` fallback that
only runs for types outside the 12 supported dtypes (e.g. Complex).

Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter
(TestCategory!=OpenBugs&TestCategory!=HighMemory).
- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit,
  numpy differences) — information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte) — casting now
  handled by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support
  for generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring
  replacing 6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes
…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges
runtime NPTypeCode values to compile-time generic type parameters.
Hot path (cache hit) runs at ~32ns via Delegate[] array indexed by
NPTypeCode ordinal. Cold path uses MakeGenericMethod + CreateDelegate,
cached after first call per (method, typeCode) pair.

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value
  (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional,
  M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating
  NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>,
  elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>,
  elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and
  numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: -1243 lines removed across 12 files, replacing repetitive per-type
switch cases with single generic dispatch methods.
Complex does not implement IComparable<T>, so NpFunc.Invoke into
ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch
crashed with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous,
ClipNDArrayGeneral, and ClipCore that route to dedicated Complex
clip methods using lexicographic comparison (real first, then imag).
NaN handling preserves the NaN-containing element as-is (not
replaced with NaN+NaN*i), matching NumPy np.maximum/np.minimum
behavior where "NaN wins" but the original value is returned.

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar,
ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic
CompareTo path for Half, which treats NaN as less-than-all (IEEE
totalOrder) instead of propagating it. Added Half-specific scalar
methods that check Half.IsNaN explicitly before comparison.

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but
NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed
dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint
violations, 2 Half NaN propagation in maximum).
Nucs added 23 commits May 13, 2026 09:14
…ast paths

Replaces the broken `PowerInteger` fast-path (which crashed on sliced/broadcast
arrays via `Unsafe.Address`) with a stride-aware integer power emitted by the
existing IL kernel infrastructure. Adds NumPy's "Integers to negative integer
powers are not allowed." ValueError, fast paths for scalar exponents {0,1,2,
0.5,-1.0}, and switches f32 to single-precision `MathF.Pow` (no f64 round-trip).

Audit-v2 tickets resolved:
- T1.3a — np.power(sliced_int32, int32) no longer crashes
- T1.3b — np.power(broadcast_int32, int32) no longer crashes
- T1.36 — int**(-int) now raises ArgumentException matching NumPy ValueError

What changed
============

NEW: src/NumSharp.Core/Utilities/NpyIntegerPower.cs
  Public squared-exponentiation helpers for all 9 integer NumSharp dtypes
  (sbyte/byte/int16/uint16/char/int32/uint32/int64/uint64) — preserves
  dtype-native wraparound (uint8 ** 8 = 0, 15**15 = 437893890380859375).
  Caller validates non-negative exponent.

REWRITE: src/NumSharp.Core/Backends/Default/Math/Default.Power.cs
  - Removes the `Unsafe.Address`-based fast-path that crashed on
    sliced/broadcast operands and ignored strides.
  - Adds pre-scan: for `int**int` with signed-int exponent, scans rhs for
    any negative element and throws `ArgumentException("Integers to negative
    integer powers are not allowed.")`. Matches NumPy's unconditional check
    (rejects base ∈ {±1} too, per NumPy spec).
  - Adds scalar-exponent fast paths when `rhs.size == 1`:
      exp = 0   → ones_like(lhs)
      exp = 1   → lhs.copy() (or cast)
      exp = 2   → lhs * lhs (SIMD-optimized Multiply kernel)
      exp = 0.5 → np.sqrt(lhs)
      exp =-1.0 → np.reciprocal(lhs) (float base only)
    Each path verifies the resolved result dtype matches what the IL kernel
    would produce before substituting, so NEP50 promotion is preserved.
  - Falls through to `ExecuteBinaryOp` for the general case, which now
    walks strides correctly via the IL kernel paths.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  - `EmitPowerOperation(il, resultType)`: dispatches to the matching
    `NpyIntegerPower.Pow*` helper for integer result types (replaces the
    `int → double → Math.Pow → int` round-trip that lost precision for
    values outside [-2^52, 2^52]). float32 → `MathF.Pow`; float64 →
    `Math.Pow`; Boolean and other fallthrough types use the original double
    round-trip to keep the kernel verifiable.
  - Cached `MethodInfo` lookups added for all 9 integer power helpers and
    `MathF.Pow`.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Binary.cs
  - `EmitPowerOperation<T>(il)` (same-type contiguous kernel path):
    same dispatch as the mixed-type version. Generic `T` is mapped to the
    matching `NpyIntegerPower.Pow*` helper via `GetIntegerPowMethod<T>()`.

src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs
  - Updates the Power promotion comment to document NEP50 weak/strict
    behavior accurately (NumSharp matches NumPy in the common cases; the
    one documented misalignment is 0-D integer arrays explicitly constructed
    via `np.array(2, int32)`, which are indistinguishable from C# `int 2`
    after `np.asanyarray`).

Tests
=====

NEW: test/NumSharp.UnitTest/Math/NDArray.power.Comprehensive.Test.cs (24 tests)
  - Integer dtype-native wrapping (uint8/int8/int32 overflow)
  - Stride + broadcast layouts (sliced, broadcast_to, 2D-vs-1D)
  - Signed integer negative exponent throws (incl. base = ±1)
  - Unsigned integer exponent never throws
  - Float special values (0^0, NaN, ±inf base/exp, fractional neg base)
  - NEP50 promotion (f32 ** int{8,16,32}, f64 ** int*, f32 ** scalar)
  - All 9 integer dtypes smoke-tested via 2^3 = 8

REMOVED [Misaligned]: PowerEdgeCaseTests.Power_Integer_LargeValues
  Integer power now preserves exact precision; the test now asserts equality.

UPDATED: NewDtypesCoverageSweep_Arithmetic_Tests.B35_SByte_Power_NegativeExponent*
  Previously documented the wrong (silent 0/±1) behavior; now asserts the
  NumPy-correct ArgumentException.

UPDATED (removed [OpenBugs]):
  - AuditV2_MathReductions.T1_3a_Power_SlicedInt32_ShouldNotCrash
  - AuditV2_MathReductions.T1_3b_Power_BroadcastInt32_ShouldNotCrash
  - AuditV2_ILKernelSimd.T1_36_* (4 tests)

Validation
==========

  cd test/NumSharp.UnitTest
  dotnet test --no-build --framework net10.0 \
    --filter "TestCategory!=OpenBugs&TestCategory!=HighMemory"
  → Passed: 8255, Failed: 0

  dotnet test --no-build --framework net10.0 \
    --filter "FullyQualifiedName~Power"
  → Passed: 129, Failed: 0

Microbench (1M-element float32, x100 iterations):
  power(arr, 2)     121ms  (fast path → mul; matches multiply baseline 117ms)
  power(arr, 0.5)   121ms  (fast path → sqrt)
  power(arr, 2.7)   518ms  (general path via MathF.Pow)

Behavior changes vs. prior NumSharp
===================================
- int**(-int) now throws (was: silently returned 0, 1, or -1).
  Matches NumPy 2.4.2 ValueError exactly.
Adds the iterator-subsystem branch audit documents that drove this
branch's bug-fix and refactor work:

- `NDITER_BRANCH_QUALITY_AUDIT.md` — original (V1) audit walking every
  changed src file and ranking findings by severity (bugs → perf →
  parity gaps → refactors → clean review). Bug catalog includes:
  np.maximum/minimum NaN handling, np.power stride mishandling,
  np.searchsorted incompleteness, np.repeat missing axis, NpyIter
  Iternext+EXLOOP path, nan{mean,std,var} perf, np.argsort LINQ perf,
  linspace/eye boxing.

- `NDITER_BRANCH_QUALITY_AUDIT_V2.md` — V2 (fact-check) audit driven by
  8 parallel agents auditing file-by-file with results verified via
  `python -c` against NumPy 2.4.2 and `dotnet_run` against NumSharp.
  60 of 65 Tier 1 findings confirmed with failing OpenBugs reproducers
  written under `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`, plus a
  list of 4 false positives and 4 newly discovered bugs.

- `docs/plans/audit_v2/01..08*.md` — per-batch audit chapters, each
  including: file scope tables (LoC + role), reference to NumPy source,
  reproduction commands, line-precise references, and a findings table
  with severity tags (bug / parity-gap / perf / refactor / clean).
  Chapters cover Iterators, ILKernel+SIMD, Default math/reductions,
  Logic+Shape+Storage, NDArray creation, Manipulation APIs+logic,
  Math ops + selection/sorting/stats, and Casting+random+utilities.

These files are pure documentation and contain no code; they're the
reference material for the bug fixes and tests that follow on the
nditer branch.
Adds the per-batch test classes that the V2 audit fact-check pass
produced to back its Tier 1 findings with concrete failing tests.
Tests are marked `[OpenBugs]` so CI skips them until the underlying
defect is fixed; running them locally with `TestCategory=OpenBugs`
documents each bug's current behavior versus NumPy 2.4.2.

Each test references both the master `NDITER_BRANCH_QUALITY_AUDIT_V2.md`
and the matching `docs/plans/audit_v2/XX_*.md` chapter where the finding
is documented in detail, and includes the file:line of the suspected
defect plus a `python -c` NumPy 2.4.2 expectation.

Test classes added (matching the 6 untracked batches):
- `AuditV2_Iterators.cs` — NpyIter Iternext/EXLOOP issues, buffer refill,
  cast support gaps, NDIterator broadcast strides, etc. (Batch 1).
- `AuditV2_LogicShapeStorage.cs` — Shape mutating indexer on a
  readonly struct, storage and logic op edge cases (Batch 4).
- `AuditV2_NDArrayCreation.cs` — `np.array(NDArray, copy=false)` default
  aliasing, creation API edge cases (Batch 5).
- `AuditV2_ManipulationApis.cs` — np.expand_dims on empty, manipulation
  parity gaps (Batch 6).
- `AuditV2_MathSelectionSorting.cs` — SetIndicesNDNonLinear NIE,
  math/selection/sort bugs (Batch 7).
- `AuditV2_CastingRandomUtilities.cs` — NpFunc/cast/random/utilities
  bugs (Batch 8).

Batches 2 (`AuditV2_ILKernelSimd.cs`) and 3 (`AuditV2_MathReductions.cs`)
already exist on the branch; this commit fills the remaining 6.
Build is verified to pass with the new files included.
Updates `.claude/CLAUDE.md` so the project instructions match the code's
current state:

- "C-order only" entry replaced with "Order-aware layout": Shape tracks
  F-contiguity, and APIs with an `order` parameter resolve NumPy
  `C`/`F`/`A`/`K` through `OrderResolver`. Verified by:
    - `Shape.IsFContiguous` flag (`View/Shape.cs:115-118`)
    - `Shape.Order` property (`View/Shape.cs:437`)
    - F-aware construction (`View/Shape.cs:160`)
- `F_CONTIGUOUS` entry in the flags table updated from "Reserved" to
  "Data is column-major contiguous" (matches `ArrayFlags.F_CONTIGUOUS`
  bit `0x0002` in `View/Shape.cs:24`).
- Added `IsFContiguous — O(1) check via F_CONTIGUOUS flag` to the
  key Shape properties list.
- Missing Functions count corrected from 19 → 18 and `np.where` removed
  from the Selection gap because `APIs/np.where.cs` implements it; new
  `### Selection` section under "Supported np.* APIs" lists `where`.
- Iterators path updated from `MultiIterator.cs` to `NpyIter.cs` and
  `NpyExpr.cs` (verified — `MultiIterator` no longer exists; only
  `NDIterator`, `NpyIter`, `NpyExpr` are present in `Backends/Iterators`).
- Q&A entries for NDIterator and NpyIter rewritten to match the current
  legacy-wrapper / NumPy-aligned multi-operand iterator split.

Pure documentation change — no behavioral impact.
…per / Memory block

Multiple `CopyTo` overloads in the unmanaged memory layer were calling
`Buffer.MemoryCopy(...)` with source/destination swapped — the BCL
signature is `MemoryCopy(void* source, void* destination, long destBytes,
long sourceBytesToCopy)`, but the existing code passed the destination
pointer first. The result was that data was copied *from the destination
buffer into the source slice*, silently corrupting the caller's source
data instead of populating the destination.

ArraySlice`1.cs:
- `TryCopyTo(Span<T>)`, `CopyTo(Span<T>)`, `IArraySlice.CopyTo<T1>(Span<T1>)`,
  `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: swap source / destination
  pointers so data flows source→destination.
- `CopyTo(IntPtr dst)`: also fix the byte-size argument — previous code
  passed `Count` (element count) for both destination size and bytes to
  copy, leaving non-byte dtypes with under-counted bounds. Replace with
  `Count * ItemLength` for both byte arguments and flip the source /
  destination order.
- `CopyTo(IntPtr dst, long sourceOffset, long sourceCount)`: this
  overload was previously identical to `CopyTo(IntPtr dst)` (ignored the
  offset arguments). Add `sourceOffset` / `sourceCount` bounds checks,
  honour `sourceOffset` when computing the source pointer, and use
  `sourceCount * ItemLength` for the copy.
- `CopyTo(Span<T>, long sourceOffset, long sourceLength)`: previous body
  recursed into itself (`CopyTo(destination, sourceOffset, sourceLength);`)
  causing a stack overflow. Replace with a bounds-checked
  `Buffer.MemoryCopy` from `Address + sourceOffset`.
- `CopyTo(UnmanagedSpan<T>, long sourceOffset, long sourceLength)`:
  same direction swap as above.
- `IArraySlice.CopyTo<T1>(Span<T1>)` / `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`:
  bytes-based comparison (`Count * ItemLength` vs `destination.Length *
  InfoOf<T1>.Size`) instead of element-count comparison, fixing the
  "destination too short" check for reinterpret-cast cases.
- `IArraySlice.Clone<T1>()`: previous code used `UnmanagedMemoryBlock<T1>.
  Copy(Address, Count)` which treats `Count` as the *T1* element count
  while reading from a `T`-element buffer. Now compute the byte size
  and divide by `InfoOf<T1>.Size` so the clone preserves the whole byte
  payload (with a hard error if the byte count is not a multiple of the
  target element size).

UnmanagedHelper.cs:
- `CopyTo(IMemoryBlock src, IMemoryBlock dst, long countOffsetDestination)`:
  validate `countOffsetDestination` against `dst.Count` and ensure the
  source fits in the *remaining* destination capacity. Fix the
  destination-size argument to `(dst.Count - countOffsetDestination) *
  dst.ItemLength` instead of the source byte length (which under-counts
  by the offset when the destination is just big enough).

UnmanagedMemoryBlock`1.cs:
- `CopyTo(UnmanagedMemoryBlock<T> memoryBlock, long arrayIndex)`: swap
  pointers so data is copied source→destination (`memoryBlock.Address +
  arrayIndex` as dst), add null + bounds checks, and use the remaining
  destination capacity for the destination size argument.

All fixes are direct corrections of misuses of `Buffer.MemoryCopy`'s
signature; behavior for legitimate callers now matches the docstrings.
The added regression tests live under
`test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs` (separate
commit) and call each repaired overload to lock the contract in place.
….Clone bugs

Shape.cs:
The `Clone(deep, unview, unbroadcast)` branch logic was inconsistent and
dropped the `offset`/`bufferSize` of scalar views in the most common
call (`Clone()` with default args). Rewrite the cascade so behavior is
predictable:
- Empty shape → `default`.
- Scalar shape:
    - `unview` or `unbroadcast` → return the static `Scalar` (offset=0).
    - Otherwise honour `deep`: copy-constructor preserves both `offset`
      and `bufferSize` for sliced scalar views like `np.arange(10)["5"]`.
- Non-scalar shape:
    - `!deep && !unview && !unbroadcast` → return `this` (the readonly
      struct copy is itself a value-copy).
    - `unview` or `unbroadcast` → `new Shape((long[])dimensions.Clone())`,
      which the constructor fills with C-contiguous strides (no offset).
      This replaces the previous one-off `ComputeContiguousStrides` /
      `deep && !unbroadcast` mixed branches that produced different
      shapes depending on call combination.
    - Plain `deep` → deep copy via the copy constructor.

Old behavior failure: `scalar.Shape.Clone()` on `np.arange(10)["5"]`
returned the canonical `Scalar` shape with `offset == 0`, hiding the
fact that the data lives at index 5. The regression test
`Shape_Clone_PreservesScalarViewOffset` in `CloneRegressionTests`
locks the fix.

ArrayConvert.cs:
- `Clone(Array sourceArray)` had two issues:
  1. It walked `elementType.IsArray` past the array's actual element
     type, so a jagged `int[][]` was treated as a flat `int[]` and the
     subsequent `Array.Copy` produced wrong results (or threw). Now the
     immediate element type is used, preserving jaggedness.
  2. Arrays with a non-zero lower bound (created via
     `Array.CreateInstance(elementType, lengths, lowerBounds)`) were
     not supported — they fell through to the multi-dim branch with
     all-zero lower bounds. Capture each axis' lower bound and use
     `Array.CreateInstance(elementType, lengths, lowerBounds)` whenever
     the source is multi-dim or has any non-zero lower bound.
- `Clone<T>(T[,,,] sourceArray)` had a `GetLength(4)` typo for what
  should be the fourth (zero-indexed: 3) dimension. `GetLength(4)`
  throws `IndexOutOfRangeException` for any 4-D array. Changed to
  `GetLength(3)`. (Coverage: `CloneRegressionTests
  .ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`.)
…ontig

NDArray (`Backends/NDArray.cs`):
- All three `UnmanagedStorage`-based constructors now back-fill the
  engine when storage doesn't already have one, and mirror the chosen
  engine onto `Storage.Engine` so the array and storage stay in sync.
  Previously `Storage.Engine` could be null while the NDArray reported a
  valid `TensorEngine`, leaking back through chained constructors that
  read storage.Engine directly.
- `TensorEngine` setter now propagates the resolved engine to
  `Storage.Engine` so changing the engine on an NDArray cascades to
  storage-side consumers.
- `Clone()` is now `virtual` and uses the property setter (instead of
  the private field) so engine assignment propagates to storage.
  `NDArray<TDType>.Clone()` overrides it to preserve the typed wrapper —
  before this commit, `((NDArray<int>)x).Clone()` returned the
  non-generic NDArray base type, breaking generic callers (see
  `CloneRegressionTests.NDArray_Clone_PreservesGenericRuntimeType`).
- `View`/`GetData(int[])`/`GetData(long[])`/the foreach yield path all
  switch from setting the private `tensorEngine` field to the property,
  so storage gets the engine too.

UnmanagedStorage (`Backends/Unmanaged/UnmanagedStorage.cs`):
- `CreateBroadcastedUnsafe(...)` now copies `storage.Engine` onto the
  new broadcast view.

UnmanagedStorage cloning (`Backends/Unmanaged/UnmanagedStorage.Cloning.cs`):
- All `Alias(...)` overloads, the `Slice` builder, both `Cast<T>` /
  `Cast(typeCode)`, both `CastIfNecessary<T>` / `CastIfNecessary(typeCode)`,
  and the empty-storage clone now propagate `Engine`.
- Cast correctness fix: `Cast<T>` / `Cast(typeCode)` /
  `CastIfNecessary<T>` / `CastIfNecessary(typeCode)` used to cast the
  raw backing array via `InternalArray.CastTo(...)`. For strided or
  F-contiguous views that buffer holds elements in the *physical*
  layout, so the cast result contained values in the wrong logical
  order. They now run `CloneData()` first — which materializes the
  logical element sequence (via `NpyIter.Copy` for non-contiguous
  paths) — and cast that, so casting an F-contiguous view of
  `np.arange(6).reshape(2,3).T` yields the same values NumPy produces.
  (Verified by `CloneRegressionTests
  .UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
  and siblings.)
- `Clone()` gains a fast `CanCloneRawLayout()` path: when the storage
  owns its buffer (no offset, no broadcast, no buffer/size mismatch)
  and is either C- or F-contiguous, the underlying ArraySlice is
  cloned raw and the same `Shape` is reused. Non-trivial slices and
  scalar views still fall back to `CloneData()`. Empty storages return
  a new typed empty storage and preserve the engine instead of trying
  to clone a null buffer.
- `CastIfNecessary` early-return for same-dtype skips the
  `IsEmpty` check so empty storages of the requested dtype don't
  re-materialize.
The DefaultEngine helpers for `astype` and `transpose` created new
`NDArray` instances via the `UnmanagedStorage`-only constructor, which
falls back to `BackendFactory.GetEngine()`. Code that explicitly set
`nd.TensorEngine` on the source (e.g. tests pinning a custom engine)
would silently see its engine swapped for the default after a cast or
transpose.

`Default.Cast.cs` (`DefaultEngine.Cast`):
- Capture `nd.TensorEngine` once at the top.
- Empty/scalar/`(1,)` early returns now carry `engine` forward both on
  the returned `NDArray` and on `nd.Storage` (when reused in-place).
- Both `copy` and in-place branches of the generic cast attach
  `TensorEngine = engine` to the resulting NDArray and to the
  re-assigned `nd.Storage`.

`Default.Transpose.cs` (`DefaultEngine.TransposeAlias`):
- The transpose alias returned via `Storage.Alias(newShape)` now carries
  `nd.TensorEngine` so transposed views keep their engine. Without this
  the call dropped back to the global default, breaking propagation
  through compounded operations.

Coverage: `CloneRegressionTests.NDArray_AstypeCopyPath_PreservesTensorEngine`
and the engine-propagation siblings.
…/ ravel paths

All paths that build a fresh `NDArray` from an existing storage now
preserve the caller's `TensorEngine`. Previously the `NDArray
(UnmanagedStorage)` constructor would fall back to
`BackendFactory.GetEngine()` when the supplied storage didn't carry an
engine (which is common after `Clone()`/`Alias()`/`CloneData()`).

`Creation/NDArray.Copy.cs` (`copy(char physical)`):
- The C-order shortcut now requires the source to already be
  C-contiguous. Before, `copy('C')` on an F-contiguous view returned a
  `Clone()` whose shape preserved the F-strides — yielding a
  non-C-contiguous "copy". Now any non-C-contiguous source falls
  through to the iterator-driven materialization path.
- The destination shape uses the requested `physical` order instead of
  hard-coding `'F'`. Combined with the fix above this gives correct
  C/F selection regardless of source layout.
- Destination NDArray carries `TensorEngine = TensorEngine` of the
  source. Coverage:
  `CloneRegressionTests.NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
  and `NDArray_CopyFOrder_PreservesTensorEngine`.

`Creation/NdArray.ReShape.cs`:
- The F-order reshape return (`fFlat.Storage.InternalArray`-backed
  storage) and both non-contiguous fallback paths
  (`new NDArray(CloneData(), Shape.Clean())`) now attach the source
  `TensorEngine`. Coverage:
  `CloneRegressionTests.NDArray_ReshapeCopyPath_PreservesTensorEngine`.

`Creation/np.array.cs`:
- `np.array(nd, copy)` propagates `nd.TensorEngine` for both the
  alias (`copy=false`) and clone (`copy=true`) paths. Coverage:
  `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`.

`Manipulation/np.expand_dims.cs`, `Manipulation/np.ravel.cs`,
`Manipulation/NDArray.flatten.cs`:
- The view (`Storage.Alias(...)`) and materialize (`CloneData()`)
  branches both forward the source `TensorEngine`.

No semantic API changes other than the `copy('C')` correctness fix
above; engine propagation is a transparent improvement.
NumPy's np.where iterator allocates the result with an order chosen
from the *full-size* operands' contiguity flags:
- Any full-size, multi-dim operand that is C-contiguous (but not F)
  → output is C-contiguous.
- All full-size, multi-dim operands that are F-contiguous (and at least
  one is strictly F, not also C) → output is F-contiguous.
- Operands that are scalar, 1-D, or broadcasted do not vote.
- Mixed C/F (or any full-size non-contiguous operand) → fall back to C.

Verified against NumPy 2.4.2:

    f = np.arange(12).reshape(3,4).T            # F-contig view
    np.where(f > 5, f, 0)              .flags   # F_CONTIGUOUS=True
    np.where(f > 5, f.copy('C'), f)    .flags   # C_CONTIGUOUS=True
    np.where(np.array([True,False,True]), f, 0).flags  # F_CONTIGUOUS=True

Previously `np.where` always allocated the output as C-contiguous,
losing the F layout that NumPy preserves for F-contiguous inputs.

`APIs/np.where.cs`:
- New `ResolveWhereOrder(params NDArray[] operands)` mirrors the rules
  above. Returns 'C' or 'F'.
- The result `Shape` is now constructed via `new Shape((long[])cond
  .shape.Clone(), resultOrder)` so the resulting strides match the
  resolved order.
- Drop the `NpFunc.Invoke(outType, WhereImpl<int>, ...)` generic
  dispatch: the actual `WhereImpl` body never used its `T` type
  parameter (the iterator-driven IL kernel keys off the runtime dtype
  string), so the switch-per-dtype indirection was dead weight. Replace
  with a direct non-generic `WhereImpl(cond, x, y, result)` call.

`test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs`:
- New "Output Layout" region with three NumPy-anchored tests:
    * `Where_FContiguousInputs_ResultIsFContiguous`
    * `Where_MixedCAndFInputs_ResultFallsBackToC`
    * `Where_BroadcastConditionWithFInput_ResultIsFContiguous`
…sh order tests

NumPy's `np.tile(A, reps)` keeps the source memory order on the "no
expansion" shortcut (`tup == (1,)*outDim`): F-contiguous input stays
F-contiguous, C-contiguous input stays C-contiguous, and views with
strides outside C/F materialize as C-contiguous. Verified against
NumPy 2.4.2:

    src = np.arange(12).reshape(3, 4).T          # F-contig
    np.tile(src, (1, 1)).flags     # F_CONTIGUOUS=True
    np.tile(src, ()).flags         # F_CONTIGUOUS=True
    np.tile(np.arange(12).reshape(3, 4)[:, ::-1], (1, 1)).flags
                                                  # C_CONTIGUOUS=True

`Manipulation/np.tile.cs`:
- The all-ones shortcut previously called `A.copy()` which defaults to
  `'C'` — silently flipping F-contiguous inputs to C-contiguous output.
  Replace with `A.copy('K')` (and the reshape variant gets the same
  treatment) so `OrderResolver.Resolve('K', shape)` picks the source's
  physical order. The comment is updated to describe the keep-order
  semantics.

`test/NumSharp.UnitTest/Manipulation/np.tile.Test.cs`:
- Three new tests covering the F-contig preservation, the `np.tile(a)`
  no-reps overload, and the non-contiguous fall-back. Each test also
  verifies element values against the source via index-based reads to
  guard against logical-order regressions.

`test/NumSharp.UnitTest/View/OrderSupport.OpenBugs.Tests.cs`:
- `Tile_ApiGap` is renamed to `Tile_RepeatsLastAxis_ValuesMatchNumPy`
  and its assertion stays — the API gap comment is replaced with the
  matching NumPy reference output. Header rewritten from
  "Missing functions" to "Manipulation helpers" since this section now
  documents working APIs.
- `Where_ApiGap` (previously `[OpenBugs]` because np.where was thought
  missing) is now `Where_FContig_PreservesFContig`. It asserts that
  `np.where(f_arr > 5, f_arr, 0)` returns an F-contiguous result on
  F-contiguous input — the same property covered by the new where
  layout tests in the prior commit. The `[OpenBugs]` attribute is
  removed because the feature exists and now matches NumPy.
…IterBattleTests

`benchmark/NumSharp.Benchmark.Exploration/Program.cs`:
- `Options.Clone()` reused the same `RemainingArgs` `string[]` reference
  on the cloned `Options` instance. Any post-clone mutation of the
  array (or its elements via index assignment) would have leaked back
  to the original `Options`. Clone the array (`(string[])RemainingArgs
  .Clone()`) so the two `Options` instances are independent.

`test/NumSharp.UnitTest/Backends/Iterators/NpyIterBattleTests.cs`:
- Remove a single trailing blank line at end of file. No code change.
… after RemoveAxis

`NpyIter.Clone()` (in `Backends/Iterators/NpyIter.cs`) previously copied
the `Buffers[op]` pointer field directly from the source state, so the
original and the cloned iterator shared the *same* per-operand buffer.
After `Iternext()` on either iterator the writes from one would clobber
the other's data, and disposing one would free the buffer out from
under the other.

The fix:
- After copying the operand metadata (`ElementSizes`, `BufStrides`, etc.),
  allocate a fresh buffer per operand via `NpyIterBufferManager
  .AllocateAligned(BufferSize, opDtype)` and `Buffer.MemoryCopy` the
  source bytes into it. If allocation fails the catch block calls
  `NpyIterBufferManager.FreeBuffers` for buffered states before
  releasing dim arrays + state memory.
- `DataPtrs[op]` is fixed up: if the source `DataPtrs[op]` pointed into
  the original `Buffers[op]` byte range we translate the offset onto
  the newly allocated buffer so iteration continues at the same logical
  position.
- The clone now calls `AllocateDimArrays(_state->NDim, _state->NOp,
  _state->StridesNDim)` — see below.

`NpyIterState.AllocateDimArrays(int ndim, int nop, int stridesNDim)`:
- Previously, the strides block was always sized as `ndim * nop`. After
  `RemoveAxis` lowers `NDim` but leaves `StridesNDim` at its original
  width, cloning the iterator allocated a strides block that was too
  small, causing later reads from `Strides[k]` (where `k >= NDim*NOp`)
  to access freed or unrelated memory.
- The third parameter defaults to `ndim` (preserving the existing
  contract for all other call sites) but accepts an explicit
  `stridesNDim >= ndim` so `Clone()` can carry the original allocated
  stride width forward. `StridesNDim` is now stored on the state and
  the strides allocation uses `stridesNDim * nop * sizeof(long)`. The
  scalar fast-path now requires both `ndim == 0` and `stridesNDim == 0`
  to skip the allocation.

Also moves the `GetInnerFixedStrideArray` docblock so it sits directly
above its method (it had drifted onto an unrelated method when the
preceding doc was edited).

Coverage:
- `CloneRegressionTests.NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
  asserts the two iterators have distinct `DataPtr` addresses and that
  advancing one does not advance the other.
- `CloneRegressionTests.NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
  builds an iterator over `(2,3,4)`, removes axis 1, clones it, and
  checks `NDim`, `Shape`, and the first value match.
… clone fixes

Adds `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs`,
which locks in the contracts established by the preceding fix commits.
Each test reproduces a specific bug or contract that previously
regressed and asserts the corrected behavior. 27 tests; all pass on
net8.0 and net10.0.

Coverage map (each pair = test → fix commit):

ArraySlice CopyTo direction / range fixes
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `ArraySlice_CopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_TryCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_CopyToSpan_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_CopyToIntPtr_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_InterfaceCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_InterfaceCloneGeneric_ReinterpretsWholeBytePayload`

ArrayConvert.Clone jagged / non-zero lower-bound / 4-D GetLength fixes
→ `fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert.Clone bugs`
- `ArrayConvert_Clone_PreservesJaggedElementType`
- `ArrayConvert_Clone_PreservesNonZeroLowerBounds`
- `ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`

Shape.Clone scalar view offset preservation
→ same commit as above
- `Shape_Clone_PreservesScalarViewOffset`

UnmanagedStorage.Clone empty + F-contig + engine
→ `fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-contig`
- `UnmanagedStorage_Clone_DtypeOnlyStorage_DoesNotDereferenceMissingData`
- `UnmanagedStorage_Clone_PreservesEngineAndFContiguousShape`

UnmanagedStorage.Cast / CastIfNecessary uses CloneData + propagates engine
→ same commit
- `UnmanagedStorage_CastTypeCode_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastIfNecessary_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastEmptyStorage_PreservesEngine`

UnmanagedMemoryBlock.CopyTo arrayIndex offset
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `UnmanagedMemoryBlock_CopyToWithIndex_CopiesToDestinationOffset`

UnmanagedHelper.CopyTo destination-offset bounds
→ same commit
- `UnmanagedHelper_CopyToWithInvalidDestinationOffset_Throws`

NDArray.Clone / engine propagation
→ `fix(storage+ndarray): ...` + `fix(creation+manipulation): ...` +
  `fix(default-engine): ...`
- `NDArray_Clone_PreservesGenericRuntimeType`
- `NDArray_Clone_PreservesTensorEngineOnArrayAndStorage`
- `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`
- `NDArray_CopyFOrder_PreservesTensorEngine`
- `NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
- `NDArray_ReshapeCopyPath_PreservesTensorEngine`
- `NDArray_AstypeCopyPath_PreservesTensorEngine`

NpyIter.Clone buffered deep copy + RemoveAxis stride width
→ `fix(npyiter): deep-copy buffered Clone buffers; preserve stride
  width after RemoveAxis`
- `NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
- `NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
…aths

Promotes SByte, Half (float16), and Complex from "partially supported" to
first-class dtypes, matching what NPTypeCode already declares and what
NumPy 2.4.2 ships.

The audit (NDITER_BRANCH_QUALITY_AUDIT_V2.md, Tier 1) flagged 9 production
crash sites and 5 perf gaps where these three dtypes silently fell out of
12-dtype switches. After this commit every np.power(lhs, rhs) combination
across the 15x15 dtype matrix works end-to-end, and the existing 12-dtype
fast paths remain intact.

CRASH FIXES (Tier 1):

* NpyIterCasting (T1.9, T1.12, T1.38, T1.39): IsSafeCast / ReadAsDouble /
  WriteFromDouble / ConvertValue / PromoteTypes now handle SByte / Half /
  Complex. Complex routes through a dedicated Complex intermediate so the
  imaginary component is preserved on Complex->Complex copies and dropped
  cleanly (per NumPy's ComplexWarning) on Complex->real. Adds Half/SByte
  to IsFloatingPoint/IsSignedInteger predicates.

* NpyIterBufferManager (related to T1.12): same-type buffered iteration
  was throwing for Complex base case. Adds SByte/Half/Complex branches to
  CopyToBuffer/CopyFromBuffer dtype dispatch.

* UnmanagedStorage (T1.13, T1.57): SetValue(object,int[]/long[]),
  SetData(NDArray,long[]) scalar fast path, and the void*/IMemoryBlock
  CopyTo overloads all gained the three missing dtype branches.

* ArrayConvert.cs (T1.30): 13 ToX(Array) destination switches were
  missing SByte/Half/Complex source cases. Plus ~40 new typed converter
  methods covering the previously-missing (Src -> Dst) corners. Total
  ~550 LOC added.

* np.asanyarray (T1.49): adds IEnumerable<sbyte>, IEnumerable<Half>,
  IEnumerable<Complex> case branches; corresponding Memory<T>/
  ReadOnlyMemory<T> dispatch; ConvertObjectListToNDArray branches;
  and FindCommonNumericType expansion (the seenMask bitset was bounded
  to 12 dtypes; Complex's typecode=128 also previously aliased bit 0
  due to unbounded shift -- now masked by `(int)code & 31`).

* np.copyto T1.55: now passes via the NpyIterCasting fix.

* ILKernelGenerator.EmitDecimalConversion: Half<->Decimal and
  Complex<->Decimal routes were missing. np.power(Half, Decimal) now
  works (was the only np.power(15x15) failure after the casting fixes).

PERF FIXES (Tier 2):

* ILKernelGenerator.Binary.IsSimdSupported<T> (R9): adds sbyte.
  Vector*<sbyte> arithmetic is natively supported in .NET.

* Converts.FindConverter (R18, R33): 12x12 type-pair fast-path ladder
  expanded to 15x15 (225 entries). Eliminates the IConvertible-interface
  boxing and object-cast boxing that the CreateFallbackConverter path
  imposes for SByte/Half/Complex sources or destinations.

* Default.Reduction.ArgMax (R23): the per-slice NDArray view allocation
  in the Half/Complex axis fallback was costing one new NDArray per slice
  (1000 allocations for a (1000,1000) axis-reduce). Replaced with a
  stride-aware loop driven from a stackalloc coord vector. SByte path is
  removed from the fallback entirely since the IL kernel already handles
  it via CreateAxisArgReductionKernelTyped<sbyte>.

* Default.BooleanMask gather (T1.58): the strided/broadcast fallback was
  calling Buffer.MemoryCopy(_, _, elemSize, elemSize) per matched element
  (~1us/element). Specialized on element size (1/2/4/8/16 bytes); all 15
  dtypes hit a typed pointer write now, including Half (2B) and Complex
  (16B as two longs).

VERIFICATION:

* test/Math/NDArray.power.DtypeMatrix.Test.cs (new):
  - 15x15 dtype matrix smoke test (225 (lhs, rhs) combinations).
  - SByte ** -SByte raises ValueError-style ArgumentException.
  - Half ** Half preserves Half.
  - Complex ** Complex preserves Complex.
  - Float ** Complex promotes to Complex.
  - Half ** Single promotes to Single (NEP50).
  - SByte/Half/Complex List/IEnumerable inputs no longer throw.

* Removed [OpenBugs] attribute from 11 AuditV2 tests that are now CI-green:
  T1_9 (3x), T1_12 (2x), T1_13 (2x), T1_30 (3x), T1_49 (3x), T1_55,
  T1_57, T1_58. They now run as regular tests.

* Full suite: 8281 passed, 0 failed (was 8255 before this commit, including
  the new dtype-matrix tests and 26 promoted-from-OpenBugs tests).

DOCS:

* .claude/CLAUDE.md: "Supported Types (12)" -> "Supported Types (15)".
  Adds Half/SByte/Complex rows and a "Perf notes" section that documents
  Half/Complex/Decimal as scalar paths (no Vector<Half> arithmetic in
  .NET BCL; Complex.Pow is the BCL routine).

OUT OF SCOPE FOR THIS COMMIT:

* T1.34 NpyExpr Const/Where/Call SByte/Half/Complex support: not on
  np.power's critical path; left for a separate pass.
* T1.39 Int64/UInt64 -> double precision loss above 2^53: separate
  audit item, unrelated to the three target dtypes.

For full audit context see docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md
section "Major themes" item 2 (missing SByte/Half/Complex).
…s 3000× copy

Audit V2 finding (Section 1.6 / src/NumSharp.Core/Manipulation/np.ravel.cs:30-34):
np.ravel(a, 'F') unconditionally routed through a.flatten('F'), which allocates
fresh F-contiguous memory and runs NpyIter.Copy over the source. NumPy, in
contrast, returns a 1-D view sharing the underlying buffer whenever the source
is already F-contiguous (np.shares_memory(np.ravel(aF, 'F'), aF) == True).

The audit reports a 3000× performance regression on the hot F-order path
(np.arange(12).reshape(3,4).copy('F') -> np.ravel(.,'F')): an O(1) shape-only
aliasing was replaced with an O(N) buffered copy.

Root cause
----------
ravel's F-branch had no fast path for the IsFContiguous case. flatten('F') is
documented to "ALWAYS return a copy" by design, so calling it from ravel forced
materialization even when the linear memory walk would already reproduce the
column-major read-out.

Why a 1-D view is correct for F-contiguous sources
--------------------------------------------------
An F-contiguous array has strides[0] == 1 and strides[i] == dims[i-1] *
strides[i-1] for i > 0, with no broadcast/stride-0 dimensions. Walking the
underlying buffer linearly from `offset` for `size` elements visits values in
F-order (first axis varies fastest), which is exactly what ravel('F') is
specified to produce.

For non-F-contig sources we still fall back to flatten('F') — a strided / C-
contig / sliced source needs the column-major copy to reproduce F-order
correctly.

Implementation
--------------
ravel(a, 'F') with NDim > 1 and size > 1:
  * a.Shape.IsFContiguous → build Shape(new[]{size}, new[]{1}, a.Shape.offset,
    a.Shape.bufferSize) and return new NDArray(a.Storage.Alias(vec)). offset and
    bufferSize are preserved so sliced F-views remain correct; size becomes the
    1-D shape's logical and physical extent.
  * Otherwise → existing flatten('F') copy path (unchanged).

The new shape's flags are recomputed by ComputeFlagsStatic over the 1-D
dims/strides, which trivially marks the result as both C- and F-contiguous and
writeable (a 1-D dense vector is both orders). Storage.Alias chains _baseStorage
to the ultimate owner, so view tracking and the @base property continue to work.

Test coverage
-------------
* AuditV2_ManipulationApis.Ravel_FContiguous_FOrder_ReturnsView is no longer
  marked [OpenBugs(audit-v2-ravel-fcont-fview)] — the documented NumPy
  np.shares_memory invariant is now asserted directly in CI.
* test/Manipulation/np.ravel.Test.cs gains 10 new tests:
    - Ravel_FOrder_FContig2D_IsView                — write-through verifies aliasing.
    - Ravel_FOrder_FContig2D_ValuesMatchColumnMajor — read-out sequence matches NumPy.
    - Ravel_FOrder_FContig3D_IsView                — 3-D F-flat-index decomposition.
    - Ravel_FOrder_CContig_IsCopy                  — C-contig source still copies.
    - Ravel_FOrder_Transpose2D_IsView              — a.T (F-contig view) also aliases.
    - Ravel_FOrder_KOrder_FContigSource_IsView     — 'K' resolves to 'F' for F-source.
    - Ravel_FOrder_AOrder_FContigSource_IsView     — 'A' resolves to 'F' for strict F.
    - Ravel_FOrder_FContig_DtypeFloat              — dtype preserved across the view.
    - Ravel_FOrder_FContig_EquivalentToFlattenF_Values — values match flatten('F').
    - Ravel_FOrder_FContig_PreservesSize           — 2-D / 3-D / 4-D size invariants.

Verified
--------
* New tests pass on net8.0 and net10.0.
* Full CI-filtered suite (TestCategory!=OpenBugs&TestCategory!=HighMemory)
  passes 8292/8292 on both target frameworks.
* The 17 pre-existing F-contig OpenBugs failures (np.flip, np.sort, np.repeat
  axis, reduction F-preservation, save/load fortran_order, etc.) remain
  unchanged — they live in test/View/OrderSupport.OpenBugs.Tests.cs and are
  excluded by the CI filter.

References
----------
* NumPy: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
* docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md — Section 1.6
* Spec: np.shares_memory(np.ravel(aF, 'F'), aF) == True for IsFContiguous source
…aths

Audit of np.ravel call paths flagged two cases that the initial fix relied on
but did not directly assert in tests. Add explicit coverage so regressions are
caught:

1. Ravel_FOrder_FContigColumnSlice_PreservesOffset_IsView
   aF[:, 1:3] on F-contig (4,5) yields (4,2) F-contig with offset=4. The view
   path must preserve offset and bufferSize so the 1-D Alias reads memory
   [4..11]. Verified:
     * shape (8,)
     * F-order values [1, 6, 11, 16, 2, 7, 12, 17] (column-major read-out)
     * write-through r[0] → s[0,0] and aF[0,1] both updated (shared memory)

2. Ravel_FOrder_FContig_BothCAndFContig_IsView
   A (1, N) shape is simultaneously C- and F-contiguous. ravel('F') enters the
   F-branch (NDim>1, size>1, IsFContiguous=true) and returns an Alias view; this
   was already covered by the implementation but not by an explicit test.
     * shape (4,)
     * values [10, 20, 30, 40] (linear memory walk)
     * write-through r[0] propagates to both[0, 0]

Both cases pass on net8.0 and net10.0 (64/64 tests in the ravel suite).

Background — full ravel coverage matrix audited manually:

  Order  Source layout                              Branch         Result
  -----  -----------------------------------------  -------------  -------------
  'F'    strict F-contig, NDim>1, size>1            view path      view
  'F'    both C+F contig (e.g. (1,N)), NDim>1       view path      view
  'F'    F-contig col-slice, offset!=0              view path      view (offset preserved)
  'F'    transpose of C-contig 2-D (→ F-contig)     view path      view
  'F'    C-contig only, NDim>1                      flatten('F')   copy
  'F'    broadcast / strided / non-contig           flatten('F')   copy
  'F'    1-D (NDim==1)                              C-order path   view if contig
  'F'    scalar / empty / size<=1                   C-order path   trivial
  'C'    C-contig                                   reshape        view
  'C'    F-contig only                              CloneData      C-order copy
  'A'    F-contig (strict)                          resolves to F  view
  'A'    otherwise                                  resolves to C  view/copy
  'K'    F-contig                                   resolves to F  view
  'K'    C-contig                                   resolves to C  view/copy

All 15 dtypes (Boolean, Byte, SByte, Int16, UInt16, Int32, UInt32, Int64, UInt64,
Char, Half, Single, Double, Decimal, Complex) verified end-to-end via in-process
buffer-address comparison and dtype assertion.

NDArray.ravel() and NDArray.ravel(char) instance methods delegate to np.ravel,
so the fix covers both call sites.
…efault-None bounds

Brings np.clip up to NumPy 2.x signature parity. Two missing capabilities are
addressed at the API surface; the underlying engine (Default.ClipNDArray.cs)
already supported null bounds for both legs of the interval.

NumPy 2.x signature mirrored:
    clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None)

Changes:
- src/NumSharp.Core/Math/np.clip.cs
  * Replace the trio of legacy 4-arg overloads with a single unified entry
    point exposing all parameters as optional. Callers may now write:
      np.clip(a)                        — no bounds, returns copy
      np.clip(a, min: 3)                — lower bound only (NEP-rebrand)
      np.clip(a, max: 5)                — upper bound only
      np.clip(a, min: lo, max: hi)      — both via aliases
      np.clip(a, a_min: null, a_max: 5) — explicit None
      np.clip(a, min: 3.5, dtype: NPTypeCode.Double)
    a_min/a_max still accepted (NumPy keeps both names; min=/max= were added
    in 2.0 as keyword-only aliases).
  * Conflict detection mirrors NumPy: passing both a_min and min (or both
    a_max and max) raises ArgumentException rather than silently picking one.
  * Type-dtype overload preserved separately (Type != NPTypeCode?, no merge
    possible). Existing positional-3 call sites (np.clip(a, lo, hi)) and
    named-arg call sites in np.maximum/np.minimum compile unchanged.

- test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
  * 9 new tests covering the NumPy 2.x surface:
      - min=/max= keyword aliases (lower-only, upper-only, both)
      - Explicit a_min=null / a_max=null
      - Bare np.clip(a) returns a copy (verifies distinct backing storage)
      - min= keyword with array bound (broadcast verification)
      - Conflict detection (a_min+min, a_max+max throw)
      - min= combined with dtype= promotes result dtype

Verification:
- Reference outputs cross-checked against NumPy 2.4.2 via Python; all 9
  documented behaviors match byte-for-byte.
- ClipNDArrayTests: 26/26 pass (was 17, +9 new).
- ClipEdgeCaseTests + np.maximum/np.minimum suite: 105/105 pass — no
  regressions (np.maximum/minimum use np.clip via named a_min:/a_max:).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  7202 pass, 0 fail, 11 pre-existing skips.

Audit reference: audit_v2/07_math_ops_selection_sorting_stats.md (Batch 7,
item 12).
…n-int upcast

Brings the np.clip engine path up to NumPy 2.x ufunc parity. Three latent
bugs surfaced while battle-testing edge cases for the min=/max= alias work:

1. Dtype promotion silently demoted to lhs.typecode
   * Before: outType = typeCode ?? lhs.typecode
     - clip(int32, min=3.5) → int32 (NumPy: float64)
     - clip(int32, min=float32) → int32 (NumPy: float64)
   * After: weak-scalar promotion consistent with NumSharp's binary-op
     engine and NEP 50 — a 0-d bound of the same kind (int/float/complex
     /decimal) as lhs is "weak" and does not promote; cross-kind or array
     bounds promote via np.result_type.
   * Examples now matching NumPy:
       clip(int32, min=3.5)              → float64  (was int32)
       clip(int32, min=3.0f)             → float64  (was int32)
       clip(uint8, 50, 75)               → uint8    (preserved, NEP 50 weak)
       clip(int32, min=long_arr)         → int64    (array promotes)
       clip(float32, 3, 7)               → float32  (preserved)
   * NaN bound on int array now upcasts to float64 with all-NaN result
     (was: silently a no-op, value unchanged).

2. @out= with mismatched dtype silently wrote garbage
   * Before: cast lhs/bounds to outType, blit through copyto into @out
     which retained its own (often narrower) dtype — produced truncated
     or pattern-aliased values.
   * After: validate @out.GetTypeCode == outType up front. Mismatch
     raises ArgumentException mirroring NumPy's _UFuncOutputCastingError
     ("Cannot cast ufunc 'clip' output from dtype('X') to dtype('Y')
     with casting rule 'same_kind'").

3. Engine refactor for the both-null + dtype= case
   * np.clip(arr, dtype=Single) with no bounds now properly casts the
     output and respects @out when supplied (previously dtype= without
     bounds returned plain lhs.Clone()).

Implementation details:
- Added PromoteClipBound(outType, bound): no-promotion shortcut for
  0-d same-kind bounds; falls back to np.result_type otherwise.
- Added IsSameKind(a, b): groups Byte/Char/signed-int/unsigned-int as
  integer kind; floats/decimals/complex compare by NPTypeCode group.
- @out validation now runs before any work, so shape/dtype errors fail
  fast without partial mutation of @out.
- np.copyto(@out, Cast(lhs, outType, copy: false)) handles the case
  where lhs needs casting to the promoted output type before writing.

Test additions (test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs)
— 30 new tests across 8 categories all cross-checked against NumPy 2.4.2:

  Dtype Promotion (NEP-50):
    - uint8 + int scalars preserves uint8
    - int32 + float scalar → float64 (also float32 scalar → float64)
    - float64 + int scalars preserves float64
    - int32 + int64 array bound → int64
    - dtype= with no bounds casts input
    - dtype= override forces narrower type even when bounds promote
    - NaN bound on int array upcasts to float64

  @out= Edge Cases:
    - in-place out=src returns same buffer & mutates
    - out= separate buffer leaves src unchanged
    - shape mismatch throws
    - dtype mismatch throws (previously silent garbage)
    - out= with no bounds copies src

  Special Float Values via kwarg:
    - min=-inf / max=+inf no-op
    - min=NaN / max=NaN propagates

  0-d (Scalar) Input:
    - clip(scalar, lo, hi) preserves ndim=0
    - clip(scalar, max:hi) preserves ndim=0
    - clip(scalar) preserves ndim=0

  Half / Complex via Kwarg:
    - Half min/max preserves Half
    - Complex min= (lex ordering, scalar bound)
    - Complex array min/max bounds (lex ordering)

  Broadcasting via Kwarg:
    - 2D + row vector min → broadcasts along axis 0
    - 2D + column vector max → broadcasts along axis 1
    - 2D + mixed row min + column max

  Strided Inputs via Kwarg:
    - Reversed-slice (negative stride) clipped via min=/max=

  Empty Arrays via Kwarg:
    - Empty + min= only
    - Empty + max= only
    - Empty + dtype= cast

Verification:
- ClipNDArrayTests: 56/56 pass (was 26; +30 new).
- np.clip + np.maximum + np.minimum + ClipEdgeCase + np.clip.Test suites:
  85 pass on net8.0, 55 pass on net10.0 (frameworks differ in shared-class
  counts).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  7232 pass, 0 fail, 11 pre-existing skips (was 7202 before this commit).
…x bug

Benchmarking np.clip against NumPy 2.4.2 revealed a 48-80× slowdown on
the common case `clip(arr, lo, hi)` with scalar literal bounds. Root
cause: the engine was materializing every scalar bound via
`np.broadcast_to(scalar, lhs.Shape).astype(outType)`, which for a 10M
int32 input allocated and memset two 40MB bound arrays per call (then
ran an element-wise array-bounds kernel that re-read both buffers).

Investigation also surfaced a pre-existing kernel bug exposed once the
new fast path routed scalar-bound calls through ClipScalar / ClipStrided
/ ClipScalarTail: the integer scalar fallbacks used `if / else if` to
apply the two clamps, so when `minVal > maxVal` values below `minVal`
incorrectly stuck at `minVal` instead of capping to `maxVal` (NumPy
guarantees `min(max(x, lo), hi)` — i.e. `maxVal` wins when bounds are
inverted). SIMD paths and Math.Min(Math.Max,...) float paths were
already correct.

Changes
=======

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
- Add scalar-bounds fast path: detect 0-d min/max (or null) and
  dispatch directly to ClipUnified / ClipMinUnified / ClipMaxUnified
  (the kernel family that broadcasts the scalar inside the vector loop).
  Skips broadcast_to + astype materialization entirely.
- ClipNDArrayScalarBounds: type-switch on outType to call the right
  generic kernel; uses a small delegate-based helper (ClipScalar<T>) so
  the dispatch logic isn't duplicated 12 times.
- ClipNDArrayScalarBoundsFallback: Half and Complex still go through
  the array-bounds path — their scalar SIMD kernels aren't wired and
  Complex has lex-ordering NaN semantics already implemented there.
  Cost is just the 0-d→shape broadcast (stride-0 view, O(1)) plus a
  1-element astype.
- Array bounds (any non-0-d min or max) flow into the existing path
  unchanged.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
- ClipScalar<T> (generic integer fallback, called by ClipHelper): replace
  `if (val > maxVal) val = maxVal; else if (val < minVal) val = minVal;`
  with two sequential `if`s. Now matches NumPy semantics when min > max.
- ClipScalarTail<T> (non-float tail after SIMD bulk loop): same fix.
- ClipStrided<T> (coordinate-iterated path for non-contiguous arrays):
  same fix.
- Added comments explaining why two sequential clamps are required.

Performance (Windows 11, .NET 10.0.1, AVX2-class CPU; 50 iterations,
min of timings reported; same array shapes/dtypes on both runtimes):

Scalar bounds, contiguous
                                NumPy 2.4.2   NumSharp BEFORE   NumSharp AFTER
  int32  size=1K                    3.4 µs         37.8 µs           3.3 µs
  int32  size=100K                  8.4 µs       2980.4 µs          66.5 µs
  int32  size=10M               6 741   µs    323 557   µs      10 094   µs
  int64  size=10M              14 519   µs    698 077   µs      34 860   µs
  float32 size=10M              6 917   µs    570 707   µs      22 441   µs
  float64 size=10M             14 228   µs    612 228   µs      30 926   µs

Single-sided scalar bound (min= or max= only)
  int32  size=10M min=         12 451   µs    285 434   µs      10 532   µs   (faster than NumPy)
  int32  size=10M max=         12 024   µs    294 756   µs      10 720   µs   (faster than NumPy)
  float64 size=10M min=        23 155   µs    300 770   µs      23 043   µs   (parity)

out= parameter
  int32 10M, out=arr in-place   7 038   µs    562 393   µs       7 465   µs   (parity)
  int32 10M, out=preallocated   7 794   µs    557 192   µs      12 539   µs

No bounds (clip(a))
  int32 10M                    12 126   µs      7 437   µs       7 158   µs   (faster than NumPy — Cast(copy:true))

Speedups range 20-75× over the previous NumSharp implementation; the
common `clip(arr, lo, hi)` path now sits at 1.5-3× NumPy or matches it
for small arrays. Remaining gaps:
  * Array bounds (lo_arr, hi_arr same-size): 3.5× slower — kernel is
    memory-bandwidth bound on three arrays; expected gap given .NET
    Vector<T> vs hand-tuned NumPy AVX2 inner loop.
  * Strided input (a[::2], a[::-1]): 15-20× slower — ClipStrided uses
    Shape.TransformOffset per element; NumPy's ufunc has a strided
    inner loop with stride-aware SIMD where possible.
  * Half (float16): 11× slower — .NET's `Vector<Half>` arithmetic is
    not supported, scalar Half→double→Half path required.
  * 2D broadcast (row vec): 33× slower — still goes through array path
    after broadcast_to materializes the row vector.

These remaining gaps are tracked for future kernel work and are not
addressed in this commit.

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory),
  including the regression test for min > max which now exercises the
  scalar fast path through the fixed ClipScalar/ClipStrided kernels.
- Bench harness: $CLAUDE_JOB_DIR/clip_bench.py and clip_bench.cs (50
  iterations each, min of timings).
Two further optimizations on top of the scalar-bounds fast path. Both
target the gap to NumPy that benchmarking surfaced.

Findings from the breakdown profiler ($CLAUDE_JOB_DIR/clip_breakdown.cs)
on int32 size=10M:

  Step                                            Time (µs)
  ──────────────────────────────────────────────  ─────────
  Pure ClipArrayBounds kernel (3R + 1W)             ~7,700
  Cast(lhs, int32, copy:true)  alloc + 1R+1W       ~6,100
  np.broadcast_to(lo_arr, same_shape)              ~negligible
  np.broadcast_to(lo_arr).astype(int32) — same dtype  ~7,700  ← wasted clone
  np.clip(arr, lo_arr, hi_arr) full path           37,700
  np.clip(arr, 2, 7) scalar fast path              17,100
  ClipHelper kernel only (1R + 1W in-place)         ~9,800

The two wasted passes: (1) `astype(same_dtype)` cloning the bounds even
when no cast is needed (15ms wasted on two array bounds), (2) the
Cast-then-clip pattern doing 4 memory streams (2R + 2W) when 2 streams
(1R src + 1W dst) suffice.

Changes
=======

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs

1. PrepareBound(bound, targetShape, outType) helper:
   When the bound is already same-shape, same-dtype, contiguous, offset
   zero, return it directly instead of running broadcast_to + astype
   (which clones via UnmanagedStorage.Clone). Wins for the common case
   where users pass arrays that already match the input layout.

2. ClipNDArrayFusedScalarBounds: new fast-fast path for the dominant
   `np.clip(a, lo, hi)` shape — contiguous lhs, scalar literal bounds,
   no @out, no dtype promotion. Allocates a fresh `np.empty(shape)` and
   runs the fused CopyAndClip kernel in a single pass. Replaces the
   classic Cast-then-clip pattern (which ran two passes over memory).
   Falls through to the existing in-place scalar path when @out is
   supplied or the lhs needs casting.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs

3. CopyAndClip / CopyAndClipMin / CopyAndClipMax (and their *Simd256 /
   *Scalar / *ScalarTail variants for 10 SIMD-supported dtypes): fused
   read-clip-write kernels. Each loop iteration loads a Vector256, runs
   Min(Max(v, lo), hi) in registers, and stores to the destination
   buffer — never spilling intermediate values to memory. Halves the
   memory bandwidth requirement vs the in-place "copy then clip"
   pattern on memory-bandwidth-bound input sizes.

Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2-only CPU,
50 iterations, min reported)

                                          NumPy    NumSharp BEFORE   NumSharp AFTER   AFTER/NumPy
Scalar bounds, contiguous
  int32  size=1K                          3.4 µs        37.8 µs           3.1 µs       0.91× (FASTER)
  int32  size=100K                        8.4 µs      2980   µs          68.2 µs       8.1×
  int32  size=10M                       6741   µs   323557   µs        9336   µs       1.4×
  int64  size=10M                      14519   µs   698077   µs       19287   µs       1.3×
  float32 size=10M                      6917   µs   570707   µs       11002   µs       1.6×
  float64 size=10M                     14228   µs   612228   µs       26969   µs       1.9×

Array bounds, contiguous (PrepareBound win)
  int32  size=10M                       9488   µs    38259   µs       13898   µs       1.5×    (was 4.0×)
  float64 size=10M                     24712   µs    83863   µs       42137   µs       1.7×    (was 3.4×)

Single-sided
  int32 10M min=                       12451   µs   285434   µs       11200   µs       0.90×   (FASTER)
  int32 10M max=                       12024   µs   294756   µs       11351   µs       0.94×   (FASTER)

out=  (in-place / preallocated)
  10M in-place out=arr                  7038   µs   562393   µs        4567   µs       0.65×   (35% FASTER than NumPy)
  10M out=preallocated                  7794   µs   557192   µs       10101   µs       1.3×

Both bounds None
  10M, clip(a)                         12126   µs     7437   µs        5778   µs       0.48×   (2× FASTER)

Combined effect of all four perf commits (3505edc, 79c1894, 9334bd7,
this one): the common `np.clip(arr, lo, hi)` path went from 48-80×
slower than NumPy to within 1.4-1.9× across dtypes, with several cases
matching or beating NumPy outright.

Discussion of the user's two questions
──────────────────────────────────────

1. Vector<T> vs Vector256<T> — measured both on this CPU; identical
   wall time (5527 vs 5559 µs/10M int32 in micro-bench, see
   $CLAUDE_JOB_DIR/clip_micro.cs). Vector<T> picks the widest hardware
   register at JIT time, so on AVX-512 hardware it'd be 512 bits = 2×
   throughput. On THIS AVX2 machine, no gain. Switching the existing
   Vector256<T> kernels to Vector<T> is a low-risk forward-compat move
   for AVX-512 hosts but no measurable win here. Not changed in this
   commit (would touch the whole kernel file ecosystem; out of scope).

2. IL Generation via DynamicMethod — the existing binary kernels
   (ILKernelGenerator.Binary.cs) emit 4× unrolled SIMD loops via
   System.Reflection.Emit. Tested whether porting that pattern to clip
   would help: micro-benchmarked manually-unrolled 4× and 8×
   Vector256<int> loops against the simple 1× variant. Results
   (10M int32):
     1× unrolled:   5559 µs
     4× unrolled:   6494 µs   (SLOWER — register pressure)
     8× unrolled:   5428 µs   (2% faster — within noise)
   The .NET JIT already auto-unrolls the simple SIMD loop well
   enough that hand-unrolling doesn't help and can hurt. IL emission
   for this op would add significant complexity for ~no perf win.
   Not pursued.

   The wins came from algorithmic changes (fused single-pass kernel,
   skipping redundant clones) — not from instruction-level tuning.

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- Includes the regression test for `min > max` semantics through the
  new fused kernel path (which goes through CopyAndClip's scalar tail
  for the size<32 case).
- Bench harness: $CLAUDE_JOB_DIR/{clip_bench.py,clip_bench.cs,
  clip_breakdown.cs,clip_micro.cs}.

Remaining gaps
==============
- Strided / negative-stride / broadcast inputs: ~12-15× slower than
  NumPy. ClipStrided iterates with Shape.TransformOffset per element
  (~50ns/element overhead). NumPy ufunc has stride-aware SIMD inner
  loops. Would require a stride-aware clip kernel similar to NumPy's.
- Half / float16: ~9× slower. .NET's Vector<Half> arithmetic is not
  supported; falls back to scalar Half-via-double round-trip.
- 100K size scalar bounds (8.1×): allocation/dispatch overhead is
  amortized over fewer elements; gap shrinks at larger sizes.
…-adaptive

Per the user's directive, the entire clip code path now goes through a
single ILKernelGenerator entry point that dispatches internally and
emits all loops as DynamicMethod IL using the runtime-detected vector
width (V128 / V256 / V512). No hardcoded Vector256 references remain;
no hand-written C# loops remain in the engine or kernel files.

Files
=====

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
  Rewritten from scratch (~1900 → ~360 lines, -81%).
  Public surface — what the engine actually calls:

      public enum ClipMode      { BothBounds, MinOnly, MaxOnly }
      public enum ClipBoundsKind { Scalar, Array }
      public unsafe delegate void ClipKernel(
          void* src, void* dst, long size, void* lo, void* hi);

      public static unsafe void Clip(
          NPTypeCode dtype, ClipMode mode, ClipBoundsKind kind,
          void* src, void* dst, long size, void* lo, void* hi);

  All dispatch happens inside ILKernelGenerator:
    - Cache key = (dtype << 16) | (mode << 8) | kind
    - On first miss, Generate(dtype, mode, kind) builds a DynamicMethod
      and stores the resulting delegate in a ConcurrentDictionary.

  The IL emitter:
    - Uses GetVectorContainerType() / GetVectorType() / GetVectorCount()
      so the SIMD loop body adapts to V512 on AVX-512 hosts, V256 on
      AVX2, V128 on SSE2. There is no `Vector256` or `Vector128` token
      anywhere in the kernel file.
    - Hoists the scalar bound load and Vector.Create() out of the SIMD
      loop (one broadcast per kernel call, not per iteration).
    - Computes `byteOff = i * sz` once per iteration into a local and
      reuses it for src/lo/hi/dst pointer arithmetic — avoids the
      O(N × pointer_count) multiplications the prior C# kernels had.
    - Falls back to a pure scalar IL loop for dtypes without
      Vector<T>.Min/Max (Char, Decimal, Half, Complex). Half and
      Complex delegate the per-element clamp to static helper methods
      (NaN-aware / lex-order); the loop is still IL.

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
  Stripped from 1222 → 207 lines (-83%). Everything but policy is gone:
    - dtype promotion (NEP-50 weak scalar via PromoteClipBound)
    - @out validation (shape, writeable, dtype)
    - scalar-vs-array kind detection (min.ndim == 0)
    - NaN-in-scalar-bound short-circuit for float dtypes
    - dst materialization choice: in-place vs fused-fresh vs cast-copy
    - single call to ILKernelGenerator.Clip(...)

  The previous ClipNDArrayContiguous / ClipNDArrayGeneral /
  ClipNDArrayScalarBounds / ClipNDArrayFusedScalarBounds / 12 per-dtype
  switches / delegate-based generic dispatchers / 14 Generated*Core
  methods are all deleted. One call site, one cache, one IL emitter.

src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
  Deleted (251 lines). Dead code — internal `ClipScalar(NDArray, object,
  object)` had no callers anywhere in the codebase, was a parallel
  hand-coded path that the IL kernel now subsumes.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  Added EmitVectorMinOrMax() to the existing emit-primitives section
  (sibling of EmitVectorOperation). Resolves Vector{Width}.Min<T> /
  Max<T> by reflection on whatever container the runtime selected at
  startup — same width-adaptive pattern used by the binary kernels.

Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2 CPU, 50 iters,
min reported)
                                              NumPy      Before IL    After IL
Scalar bounds, contiguous
  int32  size=1K                              3.4 µs       3.1 µs       2.9 µs   (0.85× — faster)
  int32  size=100K                            8.4 µs      68.2 µs      47.3 µs
  int32  size=10M                          6 741 µs     9 336 µs     7 509 µs    (1.11×)
  int64  size=10M                         14 519 µs    19 287 µs    22 279 µs
  float32 size=10M                         6 917 µs    11 002 µs    13 057 µs
  float64 size=10M                        14 228 µs    26 969 µs    28 842 µs

Single-sided (min= or max= only)
  int32 10M min=                          12 451 µs    11 200 µs    10 944 µs   (0.88× — faster)
  int32 10M max=                          12 024 µs    11 351 µs     8 009 µs   (0.67× — faster)
  float64 10M min=                        23 155 µs    23 043 µs    19 776 µs   (0.85× — faster)

out=
  10M in-place out=arr                     7 038 µs     4 567 µs     3 954 µs   (0.56× — faster)
  10M out=preallocated                     7 794 µs    10 101 µs     9 175 µs

Both bounds None
  10M clip(a)                             12 126 µs     5 778 µs     6 025 µs   (0.50× — faster)

Half (float16) — IL emit cut the gap by 3×
  10M                                     66 969 µs   602 219 µs   212 024 µs   (was 9×, now 3.2×)

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- The 85 clip-family tests (ClipNDArrayTests, ClipEdgeCaseTests,
  np.clip.Test, NewDtypes Half/Complex clip tests) cover:
    * Scalar literal bounds, array bounds, both-None, min-only, max-only
    * 14 dtypes (Byte, SByte, Int16/32/64, UInt16/32/64, Char, Half,
      Single, Double, Decimal, Complex)
    * Contiguous, transposed, strided (every other), reversed slices
    * Broadcast inputs (the OpenBug test)
    * NaN propagation in float arrays
    * NaN in scalar bound → all-NaN result (short-circuited in engine)
    * min > max → result all = max
    * @out= validation (shape & dtype mismatch throws)
    * NEP-50 weak-scalar promotion (uint8 + 50 stays uint8)
    * Cross-kind promotion (int32 + 3.5 → float64)
- Cache correctness: each (dtype, mode, kind) combination generates
  its kernel once on first call, then reuses the cached delegate. Re-
  running the test suite a second time keeps the same delegates (no
  re-emit per call).

Remaining gaps (deferred)
=========================
- Strided / negative-stride contiguity (~15-20× NumPy): the engine
  materializes a contiguous copy first via Cast(copy:true). A proper
  fix would IL-emit a stride-aware kernel, but that doubles the code
  size and is rarely the hot path.
- Array-bounds slightly worse than the prior hand-coded V256 inner
  loop (~2× NumPy vs ~1.5× before): the IL emit doesn't 4×-unroll
  like the binary kernels do. Measured earlier in the conversation,
  manual 4× unroll on the simple clip loop hurt rather than helped
  on the JIT auto-unrolled C# baseline; effect on IL-emitted code
  may differ but not investigated.
…) into nditer

Brings in 5 commits from worktree-clip-min-max-aliases that rebuild np.clip
end-to-end. Replaces and supersedes the in-flight clip work on nditer
(c3bbe9a "fix(clip): Complex IComparable + Half NaN propagation") whose
root cause — generic CompareTo / NpFunc routing for clip — no longer
exists after this merge.

Summary of incoming work (3505edc..10064ab)
=============================================

1. feat(np.clip): NumPy 2.x parity — min=/max= keyword aliases and
   default-None bounds. New signature mirrors NumPy 2.x:
     clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None,
          dtype=None)

2. fix(np.clip): NumPy 2.x dtype promotion (NEP 50 weak-scalar via
   np.result_type), out= dtype validation, NaN-on-int upcast.

3. perf(np.clip): scalar-bounds fast path + fixed a latent ClipScalar
   min>max kernel bug (`if/else if` instead of two sequential clamps).

4. perf(np.clip): fused copy+clip kernel + skip the redundant astype
   clone when the bound already matches lhs shape/dtype/contiguity.

5. refactor(np.clip): all kernels routed through a single
   ILKernelGenerator.Clip() entry. Every loop is now emitted as a
   DynamicMethod via System.Reflection.Emit. The SIMD width is
   resolved at runtime (V128/V256/V512) — no Vector256 token remains
   anywhere in the clip path.

Conflict resolution
===================

* src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
    Deleted in worktree. nditer's c3bbe9a modified it (Complex pre-
    checks). The IL kernels in this merge handle Complex natively via
    ComplexMaxNaN/ComplexMinNaN helpers called from the generated loop,
    so the Default.Clip.cs path becomes redundant. Took the deletion.

* src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
    Both sides modified. nditer's version had c3bbe9a + 574a0d8
    refactor (NpFunc generic dispatch, ~400 switch cases replaced).
    Worktree's version (this branch) is the IL-routed engine (207 lines
    of pure policy + one ILKernelGenerator.Clip call). Took worktree.
    Half / Complex correctness preserved by the new IL kernel — verified
    via the existing battletest suite (NewDtypes Half + Complex tests
    all pass).

* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
    Both modified. nditer added Half-specific scalar paths in the
    old kernel API. Worktree rewrote the file from ~1900 → ~360 lines
    of IL emission. Took worktree — Half NaN handling now lives inside
    the IL-emitted scalar tail via HalfMaxNaN/HalfMinNaN helper calls.

* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
    Auto-merged cleanly. Worktree added EmitVectorMinOrMax helper
    alongside nditer's dtype-parity expansion.

* src/NumSharp.Core/Math/np.clip.cs
    Manually merged: kept worktree's new public signature
    (a_min/a_max/out/dtype/min/max with NumPy-2.x semantics) and
    nditer's PreserveFContigFromSource wrapper (39ef08c "F-contig
    preservation across ILKernel dispatch"). Output now keeps F-order
    when the input was F-contiguous.

* test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
    Auto-merged cleanly — 39 new tests from worktree (NEP-50 promotion,
    min/max aliases, out= edge cases, etc.) sit alongside nditer's
    existing coverage.

Verification
============
- Full build: 0 errors, 17 warnings (unchanged from each side).
- Test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  8334 pass, 0 fail, 11 pre-existing skips. nditer was at ~7232 tests
  pre-merge in the worktree's view; the actual count on nditer-only is
  higher and the merge brings the combined total up.
- All 85 clip-family tests pass (39 new + 46 pre-existing).
- The Complex IComparable issue that c3bbe9a addressed is verified
  fixed by the merge: the failing tests in that commit's "Fixes 14
  test failures" list all pass through the new IL kernel (Complex
  takes the scalar-IL path with ComplexMaxNaN/ComplexMinNaN helpers).

API behaviour for callers
=========================
- np.clip(arr, lo, hi)  — works exactly as before.
- np.clip(arr, lo, hi, dtype: NPTypeCode.Double)  — new dtype= override.
- np.clip(arr, lo, hi, @out: dst)  — supported as named arg.
- np.clip(arr, min: 3)        — NEW: NumPy 2.x kwarg alias.
- np.clip(arr, max: 7)        — NEW: kwarg alias.
- np.clip(arr, a_min: null, a_max: 5)  — NEW: explicit None bound.
- Promotion: clip(int32, 3.5) → float64 (was int32 — bug pre-merge).
- Out= dtype mismatch now throws ArgumentException (was silent
  garbage pre-merge).
@Nucs Nucs mentioned this pull request May 17, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant