Skip to content

feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703

Open
plusplusjiajia wants to merge 3 commits into
apache:mainfrom
plusplusjiajia:fix-arrow-s3-compatible-scheme
Open

feat(arrow): resolve S3-compatible schemes in ArrowFileSystemFileIO#703
plusplusjiajia wants to merge 3 commits into
apache:mainfrom
plusplusjiajia:fix-arrow-s3-compatible-scheme

Conversation

@plusplusjiajia
Copy link
Copy Markdown
Member

ArrowFileSystemFileIO::ResolvePath passed the full URI to the wrapped FileSystem's PathFromUri, which only accepts that FileSystem's native scheme(s3:// for S3FileSystem). S3-compatible object stores are commonly addressed with other schemes — s3a/s3n, or vendor schemes such as gs:// (GCS) andoss:// (Alibaba OSS) — served by an arrow::fs::S3FileSystem configured with an endpoint_override. Every read and write of such a location failed with:

expected a URI with one of the schemes (s3) but received <scheme>://...

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates iceberg::arrow::ArrowFileSystemFileIO::ResolvePath to better handle S3-compatible (and other “foreign”) URI schemes by resolving them to the underlying Arrow filesystem path, avoiding failures when the wrapped Arrow FileSystem only accepts its native scheme.

Changes:

  • Adjusted ArrowFileSystemFileIO::ResolvePath to fall back from PathFromUri() to a scheme-less path when URI parsing fails.
  • Added a regression test ensuring reads succeed when a non-native scheme is used with the underlying filesystem.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/iceberg/arrow/arrow_io.cc Implements URI scheme fallback behavior in ResolvePath.
src/iceberg/test/arrow_io_test.cc Adds test coverage for resolving a foreign scheme to the underlying path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/arrow/arrow_io.cc Outdated
Comment on lines 476 to 484
if (auto pos = file_location.find("://"); pos != std::string::npos) {
auto path = arrow_fs_->PathFromUri(file_location);
if (path.ok()) {
return path.ValueOrDie();
}
// PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://);
// fall back to the scheme-less bucket/key.
return file_location.substr(pos + 3);
}
Copy link
Copy Markdown
Member Author

@plusplusjiajia plusplusjiajia Jun 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed: the fallback now triggers only on Arrow's scheme-mismatch error; other PathFromUri failures (malformed URI, unsupported authority, …) propagate. ?query/#fragment are stripped. Added tests for the propagate path and for preserving percent-encoded keys.

Comment thread src/iceberg/arrow/arrow_io.cc Outdated
}
// PathFromUri rejects S3-compatible schemes (s3a/s3n, gs://, oss://);
// fall back to the scheme-less bucket/key.
return file_location.substr(pos + 3);
Copy link
Copy Markdown
Contributor

@MisterRaindrop MisterRaindrop Jun 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this fallback is too broad. As implemented, any URI scheme rejected by PathFromUri will be stri
pped and interpreted as a path in the wrapped filesystem. For example, with an S3FileSystem, a mismatch ed URI such as file://bucket/prefix would be interpreted as bucket/prefix instead of failing fast.

Can we make the fallback explicit and narrow?

  1. The scheme should be in an allowlist for the wrapped filesystem. For S3FileSystem, I think we can
    support well-known aliases such as s3a and s3n. Vendor schemes such as oss or gs should probably be opt- in or discussed separately, because they may also represent native OSS/GCS locations.
  2. The PathFromUri failure should match the expected scheme-mismatch / unsupported-scheme case. Other errors, such as malformed URI, authority handling errors, or local-path errors, should still be returned to the caller.

In other words, this should be guarded by both a scheme allowlist and the expected error condition,
rather than falling back for every PathFromUri failure.

Copy link
Copy Markdown
Member Author

@plusplusjiajia plusplusjiajia Jun 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterRaindrop Agreed — the scheme allowlist belongs at the selection layer (like Java's ResolvingFileIO.SCHEME_TO_FILE_IO), so DetectBuiltinFileIO now accepts s3a/s3n. For the error condition: ResolvePath now only falls back on Arrow's scheme-mismatch error — any other PathFromUri failure (malformed URI, unsupported authority, …) is propagated — and it strips ?query/#fragment. oss/gs left for separate discussion, as you suggested.

@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch 4 times, most recently from 0b65bf6 to c6886c7 Compare June 8, 2026 09:12
@plusplusjiajia plusplusjiajia force-pushed the fix-arrow-s3-compatible-scheme branch from c6886c7 to 5c7656b Compare June 8, 2026 09:32
Comment thread src/iceberg/arrow/arrow_io.cc Outdated
}
// Scheme-less bucket/key, dropping any ?query / #fragment.
std::string bucket_key = file_location.substr(pos + 3);
return bucket_key.substr(0, bucket_key.find_first_of("?#"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One subtle concern: s3:// and s3a:// currently use different parsing semantics. s3:// goes through Arrow's PathFromUri, while s3a/s3n use the fallback substring parser. This means percent-encoded keys may resolve differently.
example:
s3://bucket/a%20b -> Arrow PathFromUri -> bucket/a b
s3a://bucket/a%20b -> fallback -> bucket/a%20b

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MisterRaindrop Good catch — fixed. The fallback now percent-decodes the key via arrow::util::UriUnescape, so s3a/s3n resolve identically to native s3:// (a%20b → a b).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants