feat: support pyarrow float16 by widening to float on read/write #3590
Open
anxkhn wants to merge 1 commit into
PyArrow's float16 (halffloat) raised UnsupportedPyArrowTypeException during schema conversion because _ConvertToIceberg.primitive only handled float32/float64. Iceberg has no half-precision float, but float16 -> float32 is lossless, mirroring how int8/int16 already widen to IntegerType. Map float16 to FloatType, and widen smaller float arrays to the target type in ArrowProjectionVisitor._cast_if_needed (parallel to the integer-widening branch) so float16 columns write as float32.
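The losslessness claim can be checked exhaustively, since there are only 65536 half-precision bit patterns. The following is a minimal sketch using only the Python standard library (`struct` format codes `e` for binary16 and `f` for binary32); the helper names `half_bits_to_float` and `survives_float32` are made up for this illustration and are not part of PyIceberg or PyArrow.

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 binary16 (half) value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def survives_float32(value: float) -> bool:
    """Return True if value is unchanged after a round trip through binary32."""
    return struct.unpack("<f", struct.pack("<f", value))[0] == value


# Walk every possible half-precision bit pattern.
checked = 0
for bits in range(1 << 16):
    value = half_bits_to_float(bits)
    if value != value:
        # NaN never compares equal to itself, so skip NaN payloads.
        continue
    assert survives_float32(value), f"pattern {bits:#06x} changed"
    checked += 1

# 0x7BFF is the largest finite half value.
max_half = half_bits_to_float(0x7BFF)
print(checked, max_half)
```

Every non-NaN half value, including subnormals, signed zeros, and infinities, passes the round trip, which is the property the widening relies on.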
Rationale for this change
PyArrow float16 (halffloat) currently raises UnsupportedPyArrowTypeException during schema conversion, because _ConvertToIceberg.primitive only handles float32 and float64.
Iceberg has no half-precision float, but float16 -> float32 is lossless: every IEEE 754 half value (including the maximum finite value 65504) is exactly representable in single precision. This mirrors how the same method already widens int8/int16 to IntegerType, and how ArrowProjectionVisitor._cast_if_needed already widens smaller integers up for cross-platform compatibility. Mapping float16 -> FloatType is the float analogue, so float16 columns round-trip instead of erroring.
Changes (pyiceberg/io/pyarrow.py):
- _ConvertToIceberg.primitive: map pa.float16() -> FloatType().
- ArrowProjectionVisitor._cast_if_needed: widen smaller float types to the target type on write (parallel to the existing integer-widening branch), so float16 arrays are cast to float32. Narrowing falls through to the existing promote() handling.
No dependency changes; pyproject.toml/uv.lock are untouched and the imports used were already present.
A note on a design choice, deferring to maintainers: widening float16 silently (rather than erroring or gating behind a config flag) follows the existing int8/int16 -> Integer precedent. Happy to gate it behind a config option instead if you'd prefer. The new float-widening branch also makes float32 -> DoubleType actually cast the array (parallel to int widening), so it slightly tightens float promotion in general, not just float16.
Are these changes tested?
Yes:
- tests/io/test_pyarrow_visitor.py::test_pyarrow_float16_to_iceberg asserts the schema mapping pa.float16() -> FloatType().
- tests/io/test_pyarrow.py::test__to_requested_schema_float_promotion is parametrized over f16 -> Float, f16 -> Double, and f32 -> Double, asserting both the written PyArrow type and that the values are preserved.
Both pass locally, the surrounding visitor suite and the sibling integer-promotion test still pass, and make lint (ruff, ruff-format, mypy, pydocstyle, codespell, uv-lock) is clean. The integration suite (Docker/Spark) was not run locally.
Are there any user-facing changes?
Yes. PyArrow tables with float16 columns can now be converted/written through PyIceberg (they map to Iceberg float and are stored as float32), where they previously raised UnsupportedPyArrowTypeException. This is purely additive; existing float32/float64 behavior is unchanged.
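For reviewers who want a compact picture of the widening rule described for ArrowProjectionVisitor._cast_if_needed, here is a toy model of the decision only. It is not the PyIceberg implementation: `FloatKind`, `ICEBERG_TARGET`, and `widening_target` are hypothetical names invented for this sketch, and the rule shown (cast only when the source float is strictly narrower than the target) is inferred from the description above, not copied from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FloatKind:
    """Hypothetical stand-in for an Arrow floating-point type."""

    name: str
    bit_width: int


FLOAT16 = FloatKind("float16", 16)
FLOAT32 = FloatKind("float32", 32)
FLOAT64 = FloatKind("float64", 64)

# Hypothetical lookup from an Iceberg type name to the Arrow type it is stored as.
ICEBERG_TARGET = {"float": FLOAT32, "double": FLOAT64}


def widening_target(source: FloatKind, iceberg_type: str) -> Optional[FloatKind]:
    """Return the type to cast to when the source is narrower, else None.

    None means "no widening here"; in the PR's description, narrowing and
    equal-width cases fall through to the existing promote() handling.
    """
    target = ICEBERG_TARGET[iceberg_type]
    if source.bit_width < target.bit_width:
        return target
    return None


# The three cases the parametrized test is described as covering.
cases = [(FLOAT16, "float"), (FLOAT16, "double"), (FLOAT32, "double")]
for source, iceberg_type in cases:
    print(source.name, "->", widening_target(source, iceberg_type).name)
```

Under this model, float32 written to an Iceberg float column and float64 written to an Iceberg float column both return None, which corresponds to the "no cast" and "falls through" paths respectively.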