Skip to content

fix(ingestion): refresh RDS IAM auth token per connection for MySQL#28730

Merged
harshsoni2024 merged 4 commits into
mainfrom
refresh_iam_auth_token
Jun 8, 2026
Merged

fix(ingestion): refresh RDS IAM auth token per connection for MySQL#28730
harshsoni2024 merged 4 commits into
mainfrom
refresh_iam_auth_token

Conversation

@harshsoni2024

@harshsoni2024 harshsoni2024 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Problem

Fix: #28686

  • MySQL ingestion against AWS RDS using IAM authentication fails partway
    through long runs with Access denied.

  • The IAM auth token is generated once, at engine-build time, and baked into
    the SQLAlchemy connection URL (get_password_secretget_connection_url_common
    in builders.py). RDS IAM tokens expire after ~15 minutes, and the token is
    never refreshed for the lifetime of the engine. Because the engine is created with
    max_overflow=-1 (unlimited overflow connections) and ingestion is multi-threaded,
    any new pooled connection opened after the token expires — large schemas, the
    profiler, reconnects — authenticates with a stale credential and dies mid-run.

  • The failure is time-dependent and intermittent: a small/single-connection run can
    reuse one still-valid connection and pass, which masks the bug; a large catalog
    reliably hits it.

Fix

  • clients/aws_client.py — add RdsIamAuthTokenManager: generates the RDS IAM
    token, derives expiry from the presigned-URL params (X-Amz-Date / X-Amz-Expires),
    caches it, and regenerates shortly before expiry. Falls back to a conservative TTL
    if the token can't be parsed so it still refreshes rather than living forever.
  • source/database/mysql/connection.py — route IamAuthConfigurationSource to a
    new _get_iam_engine that:
    • builds the engine with no token in the URL (scheme://user@host:port), and
    • attaches a SQLAlchemy do_connect event listener that injects a fresh token on
      every connection
      and enables SSL (PyMySQL requires SSL for RDS IAM), preserving
      any existing SSL config.

This keeps the connector-specific wiring in the MySQL connection handler while the
reusable token manager lives with AWSClient.

Scope / follow-up

Scoped to MySQL. The shared builders.py IAM path remains token-frozen for the
other RDS connectors that go through it (Postgres, Redshift, Greenplum, Timescale).
tests/unit/connections/test_iam_token_refresh.py documents that remaining gap and is
set up to flip to green once the shared path adopts RdsIamAuthTokenManager.

Tests

  • tests/unit/source/database/test_mysql_iam.py (new):
    • Connector: token not baked into the URL, do_connect listener registered,
      fresh token minted per connection, SSL enabled, existing SSL preserved,
      host/port split correctly.
    • Token manager: first call generates, cached within TTL, refreshed after
      expiry
      , expiry parsed from the presigned URL, malformed-token fallback.
  • tests/unit/connections/test_iam_token_refresh.py (new): characterizes the shared
    builders.py IAM path as the remaining gap for the non-MySQL RDS connectors.

RDS IAM auth tokens expire after ~15 minutes. The token was generated
once in get_connection_url_common and baked into the SQLAlchemy URL, so
a single engine reused one frozen token for every pooled connection and
never refreshed it. With max_overflow=-1 and multi-threaded extraction,
any connection opened after the 15-minute TTL — common on large catalogs
and during profiling — authenticated with a stale token and failed mid
run with "Access denied".

Add RdsIamAuthTokenManager (caches the token, parses its expiry from the
presigned-URL params, refreshes before it lapses) and wire a do_connect
event listener in the MySQL connection handler that injects a fresh token
on every new connection instead of embedding it in the URL. SSL is forced
on for IAM (PyMySQL requires it) while preserving any existing SSL config.

Scoped to the MySQL connector. The shared builders.py IAM path is still
token-frozen for the other RDS connectors (Postgres, Redshift, Greenplum,
Timescale); the reusable token manager lives in aws_client.py so they can
adopt it next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 05:49
@harshsoni2024 harshsoni2024 requested a review from a team as a code owner June 5, 2026 05:49
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Jun 5, 2026
Comment thread ingestion/src/metadata/clients/aws_client.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes time-dependent failures in MySQL ingestion against AWS RDS using IAM authentication by ensuring an RDS IAM auth token is refreshed for each new SQLAlchemy connection (instead of being generated once and embedded into the engine URL).

Changes:

  • Added RdsIamAuthTokenManager to cache and refresh RDS IAM tokens based on presigned-token expiry.
  • Updated MySQL connection handling to build an IAM-specific engine and inject a fresh token via a SQLAlchemy do_connect listener (and enable SSL for PyMySQL).
  • Added unit tests covering MySQL IAM engine wiring and token-manager refresh behavior, plus characterization tests documenting the remaining shared builders.py gap.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
ingestion/src/metadata/ingestion/source/database/mysql/connection.py Adds IAM-specific engine path with do_connect listener to inject fresh IAM token per connection.
ingestion/src/metadata/clients/aws_client.py Introduces RdsIamAuthTokenManager that parses token expiry and refreshes before expiry.
ingestion/tests/unit/source/database/test_mysql_iam.py New unit tests validating MySQL IAM engine URL behavior, listener registration, token injection, and SSL handling.
ingestion/tests/unit/connections/test_iam_token_refresh.py Characterization tests documenting current shared IAM URL behavior (token baked into URL) as a follow-up gap.

Comment thread ingestion/src/metadata/ingestion/source/database/mysql/connection.py Outdated
Comment thread ingestion/tests/unit/source/database/test_mysql_iam.py
Comment thread ingestion/tests/unit/connections/test_iam_token_refresh.py
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

🟡 Playwright Results — all passed (17 flaky)

✅ 4256 passed · ❌ 0 failed · 🟡 17 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🟡 Shard 2 801 0 2 9
🟡 Shard 3 798 0 6 8
🟡 Shard 4 853 0 1 12
🟡 Shard 5 717 0 3 47
🟡 Shard 6 788 0 5 8
🟡 17 flaky test(s) (passed on retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › EditAll User: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Resolving incident & re-run pipeline (shard 3, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test upvote and downvote buttons (shard 3, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 2 retries)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Should clear search and show all properties for apiCollection in right panel (shard 4, 1 retry)
  • Pages/EntityDataConsumer.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/EntityDataSteward.spec.ts › User as Owner Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ODCSImportExport.spec.ts › Multi-object ODCS contract - object selector shows all schema objects (shard 6, 1 retry)
  • Pages/Tag.spec.ts › Verify Owner Add Delete (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings June 8, 2026 06:47
Comment thread ingestion/src/metadata/ingestion/source/database/mysql/connection.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings June 8, 2026 07:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread ingestion/tests/unit/source/database/test_mysql_iam.py
The do_connect listener set cparams["ssl"] = {} when no SSL config was
present. An empty dict is falsy, so PyMySQL only enables TLS in PREFERRED
mode (self.ssl=True, _ssl_required=False), which silently falls back to
plaintext if the server doesn't offer TLS. RDS IAM auth mandates TLS, so
this is the wrong guarantee.

Inject {"check_hostname": True} instead: a truthy dict makes PyMySQL treat
SSL as required (_ssl_required=True) and verifies the RDS server cert.
Explicitly provided ssl config is still preserved.

Update test_listener_enables_ssl_required_by_pymysql_for_iam to assert a
truthy value, and add test_injected_ssl_makes_pymysql_require_tls which
drives the value through real PyMySQL and asserts conn.ssl is True and
conn._ssl_required is True, so the test verifies actual TLS-required
behavior rather than the mock value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@harshsoni2024 harshsoni2024 merged commit c850111 into main Jun 8, 2026
60 of 62 checks passed
@harshsoni2024 harshsoni2024 deleted the refresh_iam_auth_token branch June 8, 2026 13:29
@gitar-bot

gitar-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Code Review ✅ Approved 3 resolved / 3 findings

Implements a per-connection RDS IAM token refresh mechanism for MySQL to prevent authentication timeouts during long ingestion runs. Resolves thread-safety concerns, connection string parsing issues, and TLS configuration gaps.

✅ 3 resolved
Bug: RdsIamAuthTokenManager token cache is not thread-safe

📄 ingestion/src/metadata/clients/aws_client.py:300-314 📄 ingestion/src/metadata/ingestion/source/database/mysql/connection.py:116-120
RdsIamAuthTokenManager.get_token() is invoked from the SQLAlchemy do_connect listener registered in _get_iam_engine. The PR's own description notes the engine is built with max_overflow=-1 and ingestion is multi-threaded, so do_connect (and thus get_token) fires concurrently from many threads. get_token -> _needs_refresh -> _refresh_token mutate the shared self._token / self._expires_at with no lock.

Consequences:

  • When the token nears expiry, multiple threads can simultaneously evaluate _needs_refresh() as True and each call _refresh_token(), issuing redundant generate_db_auth_token calls (each spins up a fresh AWSClient/boto3 client).
  • _refresh_token updates _token and _expires_at non-atomically; a concurrent reader can observe a new _token paired with a stale _expires_at (or vice-versa), undermining the refresh accounting this manager exists to guarantee.

Guard the read-check-refresh sequence with a threading.Lock so refresh happens once and state is updated atomically.

Edge Case: hostPort split and unencoded username are fragile in _get_iam_engine

📄 ingestion/src/metadata/ingestion/source/database/mysql/connection.py:101 📄 ingestion/src/metadata/ingestion/source/database/mysql/connection.py:109
In _get_iam_engine, host, port = connection.hostPort.split(":") will raise ValueError if hostPort lacks a port or contains more than one colon (e.g. an IPv6 literal or an accidental trailing colon), producing an opaque unpacking error instead of a clear configuration message. Additionally, base_url = f"{scheme}://{connection.username}@{connection.hostPort}" interpolates the username without URL-encoding; usernames containing reserved characters (@, /, :) would corrupt the URL. Consider rsplit(":", 1) with a validation/error message and urllib.parse.quote for the username.

Bug: Empty ssl dict disables TLS, breaking RDS IAM auth

📄 ingestion/src/metadata/ingestion/source/database/mysql/connection.py:114-118 📄 ingestion/tests/unit/source/database/test_mysql_iam.py:163-167
In inject_iam_token, when no SSL config is present the code now sets cparams["ssl"] = {} (changed from the previous {"ssl": True}). PyMySQL's Connection.__init__ gates SSL setup on if ssl:, and an empty dict is falsy — so ssl={} is passed to pymysql.connect() and TLS is not enabled (self.ssl stays False).

RDS IAM authentication requires a TLS connection; without it the server rejects the login. For the default IAM path (no explicit ssl in cparams) this is a guaranteed failure, which defeats the purpose of this PR. The comment even says "PyMySQL requires SSL for RDS IAM auth", but the value chosen does not enable it.

Note also that test_listener_enables_ssl_required_by_pymysql_for_iam asserts cparams["ssl"] == {}, so the test passes while the actual connection has SSL disabled — the test is misleading and should assert a truthy/SSL-enabling value.

Fix: set a truthy dict that enables SSL (e.g. restore {"ssl": True}, or better, supply a real config such as the RDS CA bundle / {"check_hostname": True}).

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud

sonarqubecloud Bot commented Jun 8, 2026

Copy link
Copy Markdown

harshsoni2024 added a commit that referenced this pull request Jun 17, 2026
…28730)

* fix(ingestion): refresh RDS IAM auth token per connection for MySQL

RDS IAM auth tokens expire after ~15 minutes. The token was generated
once in get_connection_url_common and baked into the SQLAlchemy URL, so
a single engine reused one frozen token for every pooled connection and
never refreshed it. With max_overflow=-1 and multi-threaded extraction,
any connection opened after the 15-minute TTL — common on large catalogs
and during profiling — authenticated with a stale token and failed mid
run with "Access denied".

Add RdsIamAuthTokenManager (caches the token, parses its expiry from the
presigned-URL params, refreshes before it lapses) and wire a do_connect
event listener in the MySQL connection handler that injects a fresh token
on every new connection instead of embedding it in the URL. SSL is forced
on for IAM (PyMySQL requires it) while preserving any existing SSL config.

Scoped to the MySQL connector. The shared builders.py IAM path is still
token-frozen for the other RDS connectors (Postgres, Redshift, Greenplum,
Timescale); the reusable token manager lives in aws_client.py so they can
adopt it next.

* fix: Thread-safety in RdsIamAuthTokenManager, username not URL-encoded

* fix: base_url drops databaseSchema + connectionOptions

* fix(ingestion): enforce required TLS for MySQL RDS IAM connections

The do_connect listener set cparams["ssl"] = {} when no SSL config was
present. An empty dict is falsy, so PyMySQL only enables TLS in PREFERRED
mode (self.ssl=True, _ssl_required=False), which silently falls back to
plaintext if the server doesn't offer TLS. RDS IAM auth mandates TLS, so
this is the wrong guarantee.

Inject {"check_hostname": True} instead: a truthy dict makes PyMySQL treat
SSL as required (_ssl_required=True) and verifies the RDS server cert.
Explicitly provided ssl config is still preserved.

Update test_listener_enables_ssl_required_by_pymysql_for_iam to assert a
truthy value, and add test_injected_ssl_makes_pymysql_require_tls which
drives the value through real PyMySQL and asserts conn.ssl is True and
conn._ssl_required is True, so the test verifies actual TLS-required
behavior rather than the mock value.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(ingestion): RDS IAM auth token generated once and never refreshed. Long MySQL ingestions fail mid-run with "Access denied"

3 participants