feat(connectors): add S3 sink connector #3103

Open
atharvalade wants to merge 10 commits into apache:master from atharvalade:feat/s3-sink-connector

@atharvalade
Contributor

Which issue does this PR close?

Closes #2956

Rationale

Iggy lacks a native way to write stream messages to Amazon S3 and S3-compatible stores (MinIO, Cloudflare R2, Backblaze B2, DigitalOcean Spaces). This is a frequently requested capability for data lake ingestion and long-term archival pipelines.

What changed?

There was no connector for persisting Iggy messages to object storage. Users had to build custom consumers and upload logic to get data into S3.

This PR adds a new iggy_connector_s3_sink crate that implements the Sink trait. It buffers messages in-memory per stream/topic/partition, rotates files by size or message count, renders S3 keys from a configurable path template ({stream}/{topic}/{date}/{hour}/...), and uploads with retry + exponential backoff. Supports json_lines, json_array, and raw output formats with optional Iggy metadata and header embedding. Uses rust-s3 (already in workspace) with path-style addressing auto-enabled for custom endpoints.
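
The size-or-count rotation described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in buffer.rs: `PartitionBuffer`, `max_bytes`, `max_messages`, and the newline-delimited layout are hypothetical names and choices made for illustration.

```rust
/// Hypothetical per-partition buffer; names and fields are illustrative only.
pub struct PartitionBuffer {
    data: Vec<u8>,
    message_count: usize,
    max_bytes: usize,
    max_messages: usize,
}

impl PartitionBuffer {
    pub fn new(max_bytes: usize, max_messages: usize) -> Self {
        Self { data: Vec::new(), message_count: 0, max_bytes, max_messages }
    }

    /// Appends one payload as a line and reports whether the buffer should rotate.
    pub fn push(&mut self, payload: &[u8]) -> bool {
        self.data.extend_from_slice(payload);
        self.data.push(b'\n');
        self.message_count += 1;
        self.should_rotate()
    }

    /// Rotation triggers when either threshold is reached.
    pub fn should_rotate(&self) -> bool {
        self.data.len() >= self.max_bytes || self.message_count >= self.max_messages
    }

    /// Hands the accumulated bytes to the caller and resets the buffer.
    pub fn take(&mut self) -> Vec<u8> {
        self.message_count = 0;
        std::mem::take(&mut self.data)
    }
}
```

A caller would upload the bytes returned by `take()` whenever `push()` returns `true`, and once more on shutdown for any remainder.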

Key implementation details:

  • 6 source modules: lib.rs (config + entry point), client.rs (S3 client init + bucket verification), buffer.rs (in-memory accumulation + rotation logic), formatter.rs (JSON/raw output + metadata/header inclusion), path.rs (template engine for S3 keys with offset-based filenames), sink.rs (Sink trait: open/consume/close lifecycle)
  • 36 unit tests covering config deserialization, buffer rotation, path template rendering, all output formats, credential validation, and edge cases
  • CI integration: added to _build_rust_artifacts.yml and edge-release.yml for cdylib plugin builds and release notes
  • Error handling: warnings logged on invalid config fallbacks, explicit buffer management on upload failure, close() warns if S3 client was never initialized
  • End-to-end tested locally with MinIO in Docker, Iggy server, CLI producer, and connector runtime — verified messages flow from Iggy stream into S3 bucket as properly formatted JSON
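
To illustrate the path template mentioned for path.rs, here is a minimal sketch of placeholder substitution. It is an assumption about the general shape, not the PR's template engine: `KeyContext` and `render_key` are hypothetical names, and the date and hour formats are example values only.

```rust
/// Values substituted into the key template; field names are illustrative.
pub struct KeyContext<'a> {
    pub stream: &'a str,
    pub topic: &'a str,
    pub date: &'a str, // example value: "2026-04-13" (format is an assumption)
    pub hour: &'a str, // example value: "01" (format is an assumption)
}

/// Replaces each known `{placeholder}` with its value.
/// Unknown placeholders are left untouched. Replacement is sequential, so a
/// value that itself contains a placeholder would be expanded again; a real
/// template engine would parse the template once instead.
pub fn render_key(template: &str, ctx: &KeyContext) -> String {
    template
        .replace("{stream}", ctx.stream)
        .replace("{topic}", ctx.topic)
        .replace("{date}", ctx.date)
        .replace("{hour}", ctx.hour)
}
```

With the stream and topic from the local test run, the template `{stream}/{topic}/{date}/{hour}` would render to a prefix beginning with `application_logs/api_requests/`.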

Local Execution

  • Passed
  • Pre-commit hooks ran
  • Full CI checklist passed locally:
    • cargo fmt --check -- pass
    • cargo clippy --tests -D warnings -- pass (zero warnings)
    • cargo test -p iggy_connector_s3_sink -- 36/36 pass
    • markdownlint --check -- pass
    • trailing-whitespace -- pass
    • trailing-newline -- pass
    • license-headers -- pass

AI Usage

  1. Opus 4.6
  2. used for scaffolding boilerplate and initial file structure, all logic was reviewed and iterated manually
  3. Verified through full local compilation, 36 unit tests, clippy with -D warnings, and end-to-end testing with MinIO Docker + Iggy server + CLI producer + connector runtime
  4. Yes

Here are all the relevant screenshots:

  • MinIO Docker container running and accessible at localhost:9000
  • MinIO web console showing the created iggy-test bucket
  • Iggy server started with root credentials configured
  • Iggy CLI creating stream application_logs and topic api_requests
  • Iggy CLI sending test messages to the topic
  • Connector runtime loading the S3 sink plugin and connecting to MinIO
  • Connector runtime consuming messages and uploading to S3
  • MinIO console showing the uploaded .jsonl file in the correct path structure (application_logs/api_requests/{date}/{hour}/)
  • Contents of the uploaded file showing properly formatted JSON lines with metadata (offset, timestamp, stream, topic, partition_id, payload)
  • All 36 unit tests passing
  • cargo clippy --tests -D warnings passing with zero warnings

@codecov

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 71.29032% with 267 lines in your changes missing coverage. Please review.
✅ Project coverage is 13.93%. Comparing base (7aa4539) to head (bb233a2).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/connectors/sinks/s3_sink/src/sink.rs | 0.00% | 120 Missing ⚠️ |
| core/connectors/sinks/s3_sink/src/lib.rs | 69.60% | 69 Missing ⚠️ |
| core/connectors/sinks/s3_sink/src/formatter.rs | 83.40% | 38 Missing and 3 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/client.rs | 75.00% | 21 Missing and 6 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/path.rs | 93.96% | 4 Missing and 3 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/buffer.rs | 97.32% | 3 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3103       +/-   ##
=============================================
- Coverage     74.16%   13.93%   -60.23%     
  Complexity      943      943               
=============================================
  Files          1237     1241        +4     
  Lines        112641    98557    -14084     
  Branches      89201    75148    -14053     
=============================================
- Hits          83536    13736    -69800     
- Misses        26309    84676    +58367     
+ Partials       2796      145     -2651     

| Components | Coverage Δ |
| --- | --- |
| Rust Core | 1.25% <71.29%> (-74.05%) ⬇️ |
| Java SDK | 58.44% <ø> (ø) |
| C# SDK | 19.71% <ø> (-50.94%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.53% <ø> (+0.12%) ⬆️ |
| Go SDK | 13.11% <ø> (-26.80%) ⬇️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| core/connectors/sinks/s3_sink/src/buffer.rs | 97.32% <97.32%> (ø) |
| core/connectors/sinks/s3_sink/src/path.rs | 93.96% <93.96%> (ø) |
| core/connectors/sinks/s3_sink/src/client.rs | 75.00% <75.00%> (ø) |
| core/connectors/sinks/s3_sink/src/formatter.rs | 83.40% <83.40%> (ø) |
| core/connectors/sinks/s3_sink/src/lib.rs | 69.60% <69.60%> (ø) |
| core/connectors/sinks/s3_sink/src/sink.rs | 0.00% <0.00%> (ø) |

... and 829 files with indirect coverage changes

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs.

If you need a review, please ensure CI is green and the PR is rebased on the latest master. Don't hesitate to ping the maintainers - either @core on Discord or by mentioning them directly here on the PR.

Thank you for your contribution!

@github-actions github-actions Bot added the S-stale Inactive issue or pull request label Apr 21, 2026
Contributor

@slbotbm slbotbm left a comment

I left some comments.
Also, do you plan to support using parquet files as well in the future?

Comment thread core/connectors/sinks/s3_sink/src/lib.rs Outdated
Comment thread core/connectors/sinks/s3_sink/src/sink.rs Outdated
Comment thread core/connectors/sinks/s3_sink/src/client.rs Outdated
Comment thread core/connectors/sinks/s3_sink/src/lib.rs Outdated
@slbotbm
Contributor

slbotbm commented Apr 21, 2026

I also feel data loss due to maximum retries being exceeded should be mentioned in readme.md as a precaution.

@github-actions github-actions Bot removed the S-stale Inactive issue or pull request label Apr 22, 2026
@atharvalade atharvalade force-pushed the feat/s3-sink-connector branch from 96ed8d1 to 87e0cc0 Compare April 24, 2026 15:16
@atharvalade
Contributor Author

> I left some comments. Also, do you plan to support using parquet files as well in the future?

Oh yes, absolutely. Parquet support is on the roadmap as a future output_format option.

@atharvalade
Contributor Author

> I also feel data loss due to maximum retries being exceeded should be mentioned in readme.md as a precaution.

I agree, I'll add that

@atharvalade atharvalade force-pushed the feat/s3-sink-connector branch from 0a37619 to 00fc2df Compare April 28, 2026 17:34
@github-actions github-actions Bot added S-stale Inactive issue or pull request and removed S-stale Inactive issue or pull request labels May 6, 2026
@github-actions github-actions Bot added the S-stale Inactive issue or pull request label May 14, 2026
@hubcio
Contributor

hubcio commented May 14, 2026

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-stale Inactive issue or pull request labels May 14, 2026
Comment thread core/connectors/sinks/s3_sink/README.md Outdated
- Buffered uploads with configurable file rotation (by size or message count)
- Multiple output formats: JSON Lines, JSON Array, Raw
- Configurable path templates with variables for stream, topic, date, hour, partition
- Deterministic S3 keys based on offset ranges for idempotent crash recovery

Contributor

"deterministic S3 keys ... idempotent crash recovery" is false on three grounds:

either remove the claim, or remove {timestamp} from the template and document the in-memory loss path honestly.

Contributor Author

Removed claim, documented honestly instead

Comment thread core/connectors/sinks/s3_sink/README.md Outdated
- Deterministic S3 keys based on offset ranges for idempotent crash recovery
- Optional metadata and header inclusion in output
- Support for custom endpoints (MinIO, R2) and path-style addressing
- Retry with exponential backoff on upload failures

Contributor

"retry with exponential backoff" - code at sink.rs:260 is retry_delay * attempts, which is linear (1s, 2s, 3s). either update the doc to "linear backoff" or implement retry_delay * 2u32.pow(attempts - 1) with jitter. also AFAIR there is backoff in connectors SDK, please check it.

Contributor Author

Now uses SDK exponential_backoff + jitter

Comment thread core/connectors/sinks/s3_sink/README.md Outdated

## Data Delivery Guarantees

This connector provides **at-least-once** delivery under normal operation. However, **data loss can occur** if all upload retries are exhausted (controlled by `max_retries`). When an upload fails after all retry attempts, the affected messages are dropped and an error is logged. Monitor your connector logs for `failed to upload` errors in production. Increase `max_retries` and `retry_delay` if transient S3 failures are common in your environment.

Contributor

this paragraph claims at-least-once and then admits "data loss can occur" in the same sentence - that is self-contradictory. given #2927 + #2928, no sink connector can deliver at-least-once today. the canonical in-tree wording is http_sink/README.md:790-800 which honestly documents at-most-once and cites both bugs. recommend copying that section verbatim.

Contributor Author

Replaced with http_sink's at-most-once wording, citing #2927/#2928

chrono = { workspace = true }
dashmap = { workspace = true }
humantime = { workspace = true }
iggy_common = { workspace = true }

Contributor

iggy_common is declared under [dependencies] but the only usage in this crate is formatter.rs:233-235 under #[cfg(test)]. move it to [dev-dependencies] so it does not bloat the cdylib build.

Contributor Author

Moved to [dev-dependencies]

Contributor Author

Correction to my earlier reply: iggy_common remains in [dependencies] (not [dev-dependencies]) because the header-serialization fix (your comment on formatter.rs:75) introduced runtime usage of HeaderKey, HeaderValue, and HeaderKind via serialize_headers at formatter.rs:124-169. The original premise (test-only) was accurate for v1 but is no longer true after the HeaderKind match dispatch was added

# under the License.

[package]
name = "iggy_connector_s3_sink"

Contributor

missing publish = false. every other connector sink (postgres_sink, delta_sink, http_sink, elasticsearch_sink, stdout_sink) declares publish = false because they are cdylib plugins not meant for crates.io. without it the crate would publish on the next workspace release.

Contributor Author

Added, matches all other sinks

}

let mut state = self.state.lock().await;
state.messages_processed += messages.len() as u64;

Contributor

state.messages_processed += messages.len() as u64 runs unconditionally outside the rotate loop, so when a mid-batch flush dropped N messages this counter still claims it processed them. upload_errors increments separately at :209. result: the close-log at :155-156 reports more messages processed than actually landed in S3. either decrement on drop or split into messages_buffered / messages_uploaded / messages_lost.

Contributor Author

Split into messages_received / messages_uploaded / messages_lost
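
A rough sketch of what the three-way split could look like is shown here. The field names follow the reply above; the `SinkStats` type and its methods are hypothetical and are not taken from the PR.

```rust
/// Hypothetical counters; field names follow the reply above, the rest is illustrative.
#[derive(Debug, Default, PartialEq)]
pub struct SinkStats {
    pub messages_received: u64,
    pub messages_uploaded: u64,
    pub messages_lost: u64,
}

impl SinkStats {
    /// Counts a batch as received as soon as it arrives, before any processing.
    pub fn record_received(&mut self, n: u64) {
        self.messages_received += n;
    }

    /// Records the outcome of one flush covering `n` messages.
    pub fn record_flush(&mut self, n: u64, uploaded: bool) {
        if uploaded {
            self.messages_uploaded += n;
        } else {
            self.messages_lost += n;
        }
    }

    /// Messages received but neither uploaded nor lost yet (still buffered).
    pub fn in_flight(&self) -> u64 {
        self.messages_received
            .saturating_sub(self.messages_uploaded + self.messages_lost)
    }
}
```

Keeping the three counters separate means a close-time log can report uploaded and lost messages independently instead of a single "processed" total.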

// Reset buffer even on failure to prevent unbounded growth.
// Messages are lost but offsets will be re-delivered by the
// runtime on next poll since consume() returned Ok.
buffer.reset();

Contributor

this is the worst data-loss path in the PR. on retry-exhaust the failure branch logs an error and then calls buffer.reset() at :214 to drop the messages, while consume at :126 still returns Ok(()). the comment at :211-213 claims "offsets will be re-delivered by the runtime on next poll since consume() returned Ok" - that is doubly false:

net result: a single transient S3 hiccup that exhausts max_retries (default 3) permanently loses every buffered message. the README at line 139 acknowledges this but still claims at-least-once on the same line.

minimum: drop the false comment, propagate Err to the runtime, and align the README with http_sink/README.md:790-800 (at-most-once + cite #2927 / #2928).

Contributor Author

Removed false re-delivery comment, propagate error, aligned README with known runtime limitations

return Ok(());
}
attempts += 1;
if attempts >= max_retries {

Contributor

attempts starts at 0 and increments before the >= max_retries check, so max_retries = 3 yields 3 total attempts (2 retries past the initial one). this matches the postgres_sink pattern but the field name is misleading. either rename to max_attempts or use attempts > max_retries so the field name lines up with semantics.

Contributor Author

Renamed to max_attempts for clarity
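
The attempts-versus-retries distinction discussed in this thread can be made concrete with a small synchronous sketch. `run_with_attempts` is a hypothetical helper, not the PR's upload loop; it only shows that `max_attempts = 3` means three calls in total, that is, the initial call plus two retries.

```rust
/// Runs `op` until it succeeds or `max_attempts` total calls have been made.
/// `op` receives the 1-based attempt number. At least one call is always made,
/// even if `max_attempts` is 0.
/// On success, returns the value together with the number of calls made.
pub fn run_with_attempts<T, E>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<(T, u32), E> {
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match op(attempts) {
            Ok(value) => return Ok((value, attempts)),
            // Budget exhausted: surface the last error to the caller.
            Err(err) if attempts >= max_attempts => return Err(err),
            // A real sink would sleep here before the next attempt.
            Err(_) => continue,
        }
    }
}
```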

let mut attempts = 0u32;

loop {
match bucket.put_object(s3_key, data).await {

Contributor

the retry loop treats every non-2xx status uniformly. AWS permanent failures - AccessDenied (403), NoSuchBucket (404), InvalidBucketName (400), MalformedPolicy - get retried 3 times, wasting retry_delay * (1 + 2) = 3s before the final Err. classify retriable vs not: 5xx + 408 + 429 + 503 SlowDown -> retry; other 4xx -> fail fast.

Contributor Author

Added is_retriable_status: only 5xx/408/429 retry, other 4xx fail fast
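
For reference, a classifier along the lines described in this thread might look like the sketch below. The function name is taken from the reply above, but the body is an assumption about its shape, not the PR's code.

```rust
/// Returns true for HTTP statuses that are usually worth retrying:
/// 408 (request timeout), 429 (too many requests), and any 5xx.
/// Every other status, including permanent 4xx errors such as 403 or 404,
/// is treated as fail-fast.
pub fn is_retriable_status(status: u16) -> bool {
    matches!(status, 408 | 429 | 500..=599)
}
```

Under this rule a 503 response is retried, while an AccessDenied (403) or NoSuchBucket (404) response returns an error on the first attempt.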

);
}
}
tokio::time::sleep(retry_delay * attempts).await;

Contributor

retry_delay * attempts is linear backoff (1s, 2s, 3s), but the README at line 13 advertises exponential. either implement retry_delay * 2u32.pow(attempts - 1) with jitter, or update the doc to "linear".

separately, Duration::Mul<u32> panics on overflow - default config is safe but pathological values (e.g. retry_delay = "1h" + large max_retries) would panic. saturating_mul is cheap defensive practice.

Contributor Author

Implemented true exponential with jitter, capped at 60s via SDK helpers
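
The difference between the linear delay and a capped exponential one can be sketched with the standard library alone. This is a stand-in for illustration, not the connectors SDK helper mentioned above: `backoff_delay` is a hypothetical function, and jitter is left out because it would need a random number source.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1),
/// capped at `max_delay`. Overflow saturates instead of panicking.
pub fn backoff_delay(base: Duration, attempt: u32, max_delay: Duration) -> Duration {
    let exponent = attempt.saturating_sub(1);
    // checked_pow returns None on overflow; treat that as the largest factor.
    let factor = 2u32.checked_pow(exponent).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max_delay)
}
```

With a 1s base and a 60s cap this yields 1s, 2s, 4s for the first three attempts, and pathological inputs such as a 1h base with a very large attempt number resolve to the cap rather than panicking.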

@hubcio
Contributor

hubcio commented May 20, 2026

/author

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels May 20, 2026
Write Iggy stream messages to Amazon S3 and S3-compatible stores with buffered uploads, configurable rotation, and deterministic offset-based keys.
@atharvalade atharvalade force-pushed the feat/s3-sink-connector branch from 980d64d to c0a5722 Compare May 23, 2026 05:17
@atharvalade atharvalade force-pushed the feat/s3-sink-connector branch from c0a5722 to 099de6f Compare May 23, 2026 05:19
…ctness

DashMap per-partition buffers (no lock held during upload), contiguous
buffer Vec<u8>+sidecar, SecretString credentials, HeaderKind-aware
serialization, owned_value_to_serde_json, byte-concat JsonArray finalize,
20-digit offset padding, partition in filename, deterministic timestamps,
strict config validation, .lost marker on flush failure, error propagation.
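
As a note on the "20-digit offset padding" item: a `u64` has at most 20 decimal digits, so zero-padding offsets to that width makes lexicographic key order match numeric offset order. The sketch below assumes a file-name layout for illustration; `object_file_name` and the `p{partition}-{first}-{last}.{ext}` shape are hypothetical, and only the padding width comes from the commit message.

```rust
/// Builds a file name from a partition id and an inclusive offset range.
/// The layout is illustrative; only the 20-digit padding is from the commit message.
pub fn object_file_name(
    partition_id: u32,
    first_offset: u64,
    last_offset: u64,
    ext: &str,
) -> String {
    // {:020} pads with leading zeros to 20 characters, the width of u64::MAX.
    format!("p{partition_id}-{first_offset:020}-{last_offset:020}.{ext}")
}
```

Without padding, a key for offset 10 would sort before a key for offset 9 in an S3 listing; with padding, listing order follows offset order.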
@atharvalade atharvalade force-pushed the feat/s3-sink-connector branch from f15cc53 to 6cfac12 Compare May 23, 2026 06:02
@atharvalade
Contributor Author

The biggest thing I caught was a credential leak where the derived Debug on S3Sink would dump the full AWS access key and secret key into logs whenever the struct got printed. I replaced that with a manual Debug impl that just shows the bucket name and added a regression test so it never sneaks back in. On the metrics side, the received count was being incremented after processing the batch, which meant if something failed halfway through you could lose track of how many messages actually got dropped. I moved that counter up front and added logic to correctly mark the unprocessed remainder as lost. I also cleaned up the dependency situation since once_cell was listed but never used, simd-json was only needed in tests but lived in regular deps, and the cargo-machete ignore list was papering over all of it.
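
The credential-redaction fix described above can be sketched as a hand-written `Debug` impl. The struct below is a hypothetical stand-in with assumed field names, not the PR's `S3Sink`.

```rust
use std::fmt;

/// Hypothetical stand-in for the sink struct; field names are illustrative.
pub struct S3SinkConfig {
    pub bucket: String,
    pub access_key_id: String,
    pub secret_access_key: String,
}

impl fmt::Debug for S3SinkConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only the bucket is printed; credentials never reach the formatter.
        f.debug_struct("S3SinkConfig")
            .field("bucket", &self.bucket)
            .finish_non_exhaustive()
    }
}
```

A regression test for this kind of change typically formats the struct with `{:?}` and asserts that neither key appears in the output.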

The max_retries field got renamed to max_attempts with a serde alias so existing configs still work, and the README had a contradiction where it said timestamp was wall clock time when the code actually derives it from the first message in the buffer, so I fixed the docs to match reality. For the lost marker file that records data gaps, I routed it through the retry logic so a transient S3 hiccup does not silently eat your gap record. Then I wrote about twenty unit tests covering the pure functions like retry status classification, flush payload extraction, config validation, and the Debug redaction. Finally I stood up a full integration test suite using MinIO in a container that validates the sink actually writes jsonl to S3 with the correct path layout and rotates files properly when the message count threshold is hit.

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-waiting-on-author PR is waiting on author response labels May 23, 2026
@codecov

codecov Bot commented May 23, 2026

Codecov Report

❌ Patch coverage is 71.29032% with 267 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.27%. Comparing base (7aa4539) to head (cd1102b).
⚠️ Report is 36 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/connectors/sinks/s3_sink/src/sink.rs | 0.00% | 120 Missing ⚠️ |
| core/connectors/sinks/s3_sink/src/lib.rs | 69.60% | 69 Missing ⚠️ |
| core/connectors/sinks/s3_sink/src/formatter.rs | 83.40% | 38 Missing and 3 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/client.rs | 75.00% | 21 Missing and 6 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/path.rs | 93.96% | 4 Missing and 3 partials ⚠️ |
| core/connectors/sinks/s3_sink/src/buffer.rs | 97.32% | 3 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3103       +/-   ##
=============================================
- Coverage     74.16%   18.27%   -55.89%     
  Complexity      943      943               
=============================================
  Files          1237     1241        +4     
  Lines        112641    98557    -14084     
  Branches      89201    75149    -14052     
=============================================
- Hits          83536    18010    -65526     
- Misses        26309    80107    +53798     
+ Partials       2796      440     -2356     

| Components | Coverage Δ |
| --- | --- |
| Rust Core | 1.25% <71.29%> (-74.05%) ⬇️ |
| Java SDK | 58.44% <ø> (ø) |
| C# SDK | 70.12% <ø> (-0.53%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.53% <ø> (+0.12%) ⬆️ |
| Go SDK | 39.91% <ø> (ø) |

| Files with missing lines | Coverage Δ |
| --- | --- |
| core/connectors/sinks/s3_sink/src/buffer.rs | 97.32% <97.32%> (ø) |
| core/connectors/sinks/s3_sink/src/path.rs | 93.96% <93.96%> (ø) |
| core/connectors/sinks/s3_sink/src/client.rs | 75.00% <75.00%> (ø) |
| core/connectors/sinks/s3_sink/src/formatter.rs | 83.40% <83.40%> (ø) |
| core/connectors/sinks/s3_sink/src/lib.rs | 69.60% <69.60%> (ø) |
| core/connectors/sinks/s3_sink/src/sink.rs | 0.00% <0.00%> (ø) |

... and 712 files with indirect coverage changes

@atharvalade
Contributor Author

/author

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels May 24, 2026
@github-actions github-actions Bot added the S-stale Inactive issue or pull request label Jun 1, 2026
Labels

S-stale Inactive issue or pull request S-waiting-on-author PR is waiting on author response

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Amazon S3 Sink Connector

3 participants