ConnectionError from _cancel() during CancelledError not caught, crashes callers #1310

@yemreck

Environment

  • asyncpg version: 0.31.0 (also reproduced on 0.30.x)
  • PostgreSQL version: 16
  • Python version: 3.11.14
  • Platform: Linux (Kubernetes)
  • pgbouncer: No
  • SQLAlchemy: 2.0.23

Summary

When an asyncpg operation is cancelled via asyncio.CancelledError while mid-query, the cancellation mechanism in connect_utils._cancel can raise a built-in ConnectionError that escapes to the caller. This is problematic because:

  1. Callers (e.g. SQLAlchemy) expect asyncpg-specific exception types and don't handle built-in ConnectionError
  2. The cancel operation is inherently best-effort — if the cancel connection fails, the error should be suppressed or wrapped, not propagated

This is related to #1211 but occurs on non-direct_tls connections via the cancel request code path.

Reproduction flow

  1. An asyncpg connection is executing a query (e.g. inside SQLAlchemy's session.execute())
  2. The asyncio task is cancelled (task.cancel())
  3. CancelledError propagates into protocol.query() / bind_execute
  4. asyncpg's cancellation handler tries to send a PostgreSQL cancel request by opening a new SSL connection via connect_utils._cancel_create_ssl_connection
  5. The new connection fails (server already closed the original, or network issue)
  6. TLSUpgradeProto.connection_lost() sets a built-in ConnectionError('unexpected connection_lost() call') on the on_data future, and it is raised where _create_ssl_connection awaits pr.on_data
  7. This escapes through connect_utils._cancel (which has no error handling around _create_ssl_connection)
  8. Caller receives ConnectionError instead of CancelledError
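
The flow above can be imitated without a database. The following is a minimal sketch using only asyncio, not asyncpg's actual code: FakeUpgradeProto, fake_cancel_request and fake_query are hypothetical stand-ins, and only the connection_lost() logic mirrors the snippet quoted in this issue. It shows how an exception raised while handling CancelledError replaces the cancellation the caller was expecting.

```python
import asyncio


class FakeUpgradeProto(asyncio.Protocol):
    """Hypothetical stand-in for TLSUpgradeProto (not asyncpg's class)."""

    def __init__(self, loop):
        self.on_data = loop.create_future()

    def connection_lost(self, exc):
        # Same logic as the connection_lost() quoted in this issue.
        if not self.on_data.done():
            if exc is None:
                exc = ConnectionError('unexpected connection_lost() call')
            self.on_data.set_exception(exc)


async def fake_cancel_request():
    # Steps 4-6: the cancel connection is dropped before any data arrives.
    loop = asyncio.get_running_loop()
    proto = FakeUpgradeProto(loop)
    loop.call_soon(proto.connection_lost, None)
    await proto.on_data


async def fake_query():
    try:
        await asyncio.sleep(3600)  # stands in for a long-running query
    except asyncio.CancelledError:
        # Step 7: nothing guards the cancel request, so its error escapes.
        await fake_cancel_request()
        raise


async def main():
    task = asyncio.create_task(fake_query())
    await asyncio.sleep(0)  # let the task reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return 'CancelledError'
    except ConnectionError as exc:
        return f'ConnectionError: {exc}'


result = asyncio.run(main())
print(result)
```

In this simulation the awaiting code sees ConnectionError rather than CancelledError, which matches step 8.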

Traceback

asyncio.exceptions.CancelledError  (original exception)

During handling of the above exception, another exception occurred:

  File "asyncpg/transaction.py", line 206, in __rollback
    await self._connection.execute(query)
  File "asyncpg/connection.py", line 350, in execute
    result = await self._protocol.query(query, timeout)
  File "asyncpg/connection.py", line 1584, in _cancel
    await connect_utils._cancel(
  File "asyncpg/connect_utils.py", line 1040, in _cancel
    tr, pr = await _create_ssl_connection(
  File "asyncpg/connect_utils.py", line 752, in _create_ssl_connection
    do_ssl_upgrade = await pr.on_data
                     ^^^^^^^^^^^^^^^^
ConnectionError: unexpected connection_lost() call

Root cause

Two issues in connect_utils.py:

1. _cancel() has no error handling around _create_ssl_connection

async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    if params.ssl and params.sslmode != SSLMode.allow:
        tr, pr = await _create_ssl_connection(...)  # ← no try/except!
    ...

The cancel request is best-effort (we're telling PostgreSQL to cancel a query on a connection that may already be dead). If opening the cancel connection fails, the error should be suppressed or wrapped in asyncpg.InterfaceError, not propagated as a raw ConnectionError.

2. TLSUpgradeProto.connection_lost() raises built-in ConnectionError

def connection_lost(self, exc):
    if not self.on_data.done():
        if exc is None:
            exc = ConnectionError('unexpected connection_lost() call')
        self.on_data.set_exception(exc)

This sets a built-in Python ConnectionError on the on_data future rather than an asyncpg exception type, and the error surfaces wherever on_data is awaited. Callers like SQLAlchemy check for asyncpg.InterfaceError or asyncpg.PostgresError to detect disconnects. A built-in ConnectionError bypasses all those checks, which means:

  • SQLAlchemy's is_disconnect() doesn't recognize it
  • SQLAlchemy's pool pre-ping handler (_do_ping_w_event) only catches self.loaded_dbapi.Error, so ConnectionError escapes
  • The pool's retry logic (which would create a fresh connection) never triggers
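
The mismatch can be illustrated with plain exception classes. This is only a sketch: DriverError and DriverInterfaceError are hypothetical stand-ins for a driver's error hierarchy (they are not asyncpg or SQLAlchemy classes), and classify() imitates a handler that recognises only driver-specific errors.

```python
class DriverError(Exception):
    """Hypothetical stand-in for a driver's DBAPI-style base error."""


class DriverInterfaceError(DriverError):
    """Hypothetical stand-in for a driver-specific interface error."""


def classify(exc):
    """Imitate a handler that only recognises driver-specific errors."""
    try:
        raise exc
    except DriverError:
        return 'handled'
    except Exception:
        return 'escaped'


results = {
    'driver error': classify(DriverInterfaceError('connection is closed')),
    'built-in ConnectionError': classify(
        ConnectionError('unexpected connection_lost() call')
    ),
}
print(results)
```

The built-in ConnectionError derives from OSError, not from any driver base class, so a handler written against the driver's hierarchy never sees it.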

Suggested fix

Option A (minimal): Catch OSError (parent of ConnectionError) in connect_utils._cancel() and suppress it — cancel is best-effort:

async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    try:
        if params.ssl and params.sslmode != SSLMode.allow:
            tr, pr = await _create_ssl_connection(...)
        ...
    except OSError:
        # Cancel is best-effort. If we can't reach the server, the
        # connection is dead anyway.
        return
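
As a rough illustration of what Option A would change for callers, here is a self-contained asyncio sketch. It does not use asyncpg: failing_cancel_connection, best_effort_cancel and query_with_cancel_handling are hypothetical names, and the failing connection is simulated by raising the same ConnectionError seen in the traceback.

```python
import asyncio


async def failing_cancel_connection():
    # Hypothetical stand-in for opening the cancel connection; always fails.
    raise ConnectionError('unexpected connection_lost() call')


async def best_effort_cancel(open_connection):
    """Attempt the cancel request and swallow transport-level failures."""
    try:
        await open_connection()
    except OSError:
        # ConnectionError is a subclass of OSError, so it is caught here.
        return False
    return True


async def query_with_cancel_handling():
    try:
        await asyncio.sleep(3600)  # stands in for a long-running query
    except asyncio.CancelledError:
        await best_effort_cancel(failing_cancel_connection)
        raise  # the original cancellation is what propagates


async def main():
    task = asyncio.create_task(query_with_cancel_handling())
    await asyncio.sleep(0)  # let the task reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return 'CancelledError'
    except ConnectionError:
        return 'ConnectionError'


outcome = asyncio.run(main())
print(outcome)
```

With the failure suppressed inside the cancel helper, the awaiting code receives CancelledError, which is what callers of a cancelled task normally expect.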

Option B (comprehensive): Also change TLSUpgradeProto.connection_lost() to raise asyncpg.InterfaceError instead of built-in ConnectionError, so callers can handle it consistently:

def connection_lost(self, exc):
    if not self.on_data.done():
        if exc is None:
            exc = InterfaceError('unexpected connection_lost() call')
        self.on_data.set_exception(exc)

Impact

This causes process crashes in production services. When a task is cancelled during a DB query, the ConnectionError escapes all exception handlers (which expect either CancelledError or asyncpg-specific exceptions) and terminates the process.

This is 100% correlated with CancelledError in our logs: every "ConnectionError: unexpected connection_lost() call" we've seen was triggered by task cancellation.
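
Until a fix lands in asyncpg, a caller-side guard is one possible stopgap. The sketch below is an assumption-laden workaround, not something asyncpg or SQLAlchemy provides: raised_during_cancellation and guard_cancellation are hypothetical helpers. It relies on the exception chaining visible in the traceback above ("During handling of the above exception...") to detect a ConnectionError that was raised while a CancelledError was being handled, and re-raises the cancellation instead.

```python
import asyncio


def raised_during_cancellation(exc):
    """Return True if a CancelledError appears in exc's cause/context chain."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, asyncio.CancelledError):
            return True
        seen.add(id(exc))
        exc = exc.__cause__ or exc.__context__
    return False


async def guard_cancellation(awaitable):
    """Await a DB call, restoring CancelledError if cancellation was masked."""
    try:
        return await awaitable
    except ConnectionError as exc:
        if raised_during_cancellation(exc):
            # The task was being cancelled; surface that, keeping exc as cause.
            raise asyncio.CancelledError() from exc
        raise  # a ConnectionError unrelated to cancellation is left alone
```

A call site would wrap the awaitable, for example `await guard_cancellation(session.execute(stmt))`, where `stmt` is whatever statement the caller was already executing. Whether converting the error this way is appropriate depends on the application, since it hides a failed cancel request.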

Additional context

We use Google CloudSQL, and the PostgreSQL server is accessed over SSL (non-direct_tls). As a result, the cancel code path goes through _create_ssl_connection to establish a new SSL connection for sending the cancel request.
