feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703
feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703plusplusjiajia wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates iceberg::arrow::ArrowFileSystemFileIO::ResolvePath to better handle S3-compatible (and other “foreign”) URI schemes by resolving them to the underlying Arrow filesystem path, avoiding failures when the wrapped Arrow FileSystem only accepts its native scheme.
Changes:
- Adjusted
ArrowFileSystemFileIO::ResolvePathto fall back fromPathFromUri()to a scheme-less path when URI parsing fails. - Added a regression test ensuring reads succeed when a non-native scheme is used with the underlying filesystem.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/iceberg/arrow/arrow_io.cc | Implements URI scheme fallback behavior in ResolvePath. |
| src/iceberg/test/arrow_io_test.cc | Adds test coverage for resolving a foreign scheme to the underlying path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (auto pos = file_location.find("://"); pos != std::string::npos) { | ||
| auto path = arrow_fs_->PathFromUri(file_location); | ||
| if (path.ok()) { | ||
| return path.ValueOrDie(); | ||
| } | ||
| // PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://); | ||
| // fall back to the scheme-less bucket/key. | ||
| return file_location.substr(pos + 3); | ||
| } |
There was a problem hiding this comment.
Addressed: the fallback now triggers only on Arrow's scheme-mismatch error; other PathFromUri failures (malformed URI, unsupported authority, …) propagate. ?query/#fragment are stripped. Added tests for the propagate path and for preserving percent-encoded keys.
| } | ||
| // PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://); | ||
| // fall back to the scheme-less bucket/key. | ||
| return file_location.substr(pos + 3); |
There was a problem hiding this comment.
I think this fallback is too broad. As implemented, any URI scheme rejected by PathFromUri will be stri
pped and interpreted as a path in the wrapped filesystem. For example, with an S3FileSystem, a mismatch ed URI such as file://bucket/prefix would be interpreted as bucket/prefix instead of failing fast.
Can we make the fallback explicit and narrow?
- The scheme should be in an allowlist for the wrapped filesystem. For S3FileSystem, I think we can
support well-known aliases such as s3a and s3n. Vendor schemes such as oss or gs should probably be opt- in or discussed separately, because they may also represent native OSS/GCS locations. - The PathFromUri failure should match the expected scheme-mismatch / unsupported-scheme case. Other errors, such as malformed URI, authority handling errors, or local-path errors, should still be returned to the caller.
In other words, this should be guarded by both a scheme allowlist and the expected error condition,
rather than falling back for every PathFromUri failure.
There was a problem hiding this comment.
@MisterRaindrop Agreed — the scheme allowlist belongs at the selection layer (like Java's ResolvingFileIO.SCHEME_TO_FILE_IO), so DetectBuiltinFileIO now accepts s3a/s3n. For the error condition: ResolvePath now only falls back on Arrow's scheme-mismatch error — any other PathFromUri failure (malformed URI, unsupported authority, …) is propagated — and it strips ?query/#fragment. oss/gs left for separate discussion, as you suggested.
0b65bf6 to
c6886c7
Compare
c6886c7 to
5c7656b
Compare
| } | ||
| // Scheme-less bucket/key, dropping any ?query / #fragment. | ||
| std::string bucket_key = file_location.substr(pos + 3); | ||
| return bucket_key.substr(0, bucket_key.find_first_of("?#")); |
There was a problem hiding this comment.
One subtle concern: s3:// and s3a:// currently use different parsing semantics. s3:// goes through Arrow's PathFromUri, while s3a/s3n use the fallback substring parser. This means percent-encoded keys may resolve differently.
example:
s3://bucket/a%20b -> Arrow PathFromUri -> bucket/a b
s3a://bucket/a%20b -> fallback -> bucket/a%20b
There was a problem hiding this comment.
@MisterRaindrop Good catch — fixed. The fallback now percent-decodes the key via arrow::util::UriUnescape, so s3a/s3n resolve identically to native s3:// (a%20b → a b).
ArrowFileSystemFileIO::ResolvePathpassed the full URI to the wrapped FileSystem'sPathFromUri, which only accepts that FileSystem's native scheme(s3://forS3FileSystem). S3-compatible object stores are commonly addressed with other schemes —s3a/s3n, or vendor schemes such asgs://(GCS) andoss://(Alibaba OSS) — served by anarrow::fs::S3FileSystemconfigured with anendpoint_override. Every read and write of such a location failed with: