Skip to content

perf: set TCP_NODELAY, cheap allocs, socket tuning, postprocess deferral on engine_* path#27

Open
KA-ROM wants to merge 13 commits into
mainfrom
perf/cheap-allocs
Open

perf: set TCP_NODELAY, cheap allocs, socket tuning, postprocess deferral on engine_* path#27
KA-ROM wants to merge 13 commits into
mainfrom
perf/cheap-allocs

Conversation

@KA-ROM

@KA-ROM KA-ROM commented Jun 5, 2026

Copy link
Copy Markdown

Summary

Hot-path tuning for the engine_* proxy path: removes per-request allocations and redundant lookups, tunes backend TCP socket options, and defers/skips postprocess when not needed. One file changed (proxy.rs) plus a small helper. No flag or config surface added.

Removed from the hot path

  • No heap alloc per header check: is_hop_by_hop_header matches HeaderName constants directly; per-call lowercased String is gone.
  • ~5 fewer hashmap lookups per request: in-flight Gauge/Counter + proxy_failure_count handles resolved once per worker, cached on ProxyHttp.
  • No metric lookup on UA path: per-worker scc::HashMap<String, Counter> keyed by &str, soft-capped at 100 entries to bound memory under hostile UAs.
  • connection_info() bound once/request instead of 5x.
  • No per-header revalidation on append: headers_mut().append(...) skips reparsing names through awc's builder.
  • No re-parse on response forwarding: dropped HeaderName::from_str (backend names already valid).
  • No full HeaderMap clone: capture only content-encoding (the one field read downstream).
  • JSON parse reuse: parsed JrpcRequestMetaMaybeBatch stashed on ProxiedHttpRequest, reused by finalise_proxying.

Behavior changes (observable on wire / in ops)

  • TCP_NODELAY on backend conns: disables Nagle; cuts up to ~40ms tail on small engine_* pairs (Nagle + Linux TCP_DELACK_MIN=40ms).
  • TCP_KEEPALIVE on backend conns (60s): catches silently-dead backends so the idle pool stops handing out broken sockets.
  • awc disconnect_timeout tightened to 100ms: faster idle-conn reclamation. Safe for loopback (op-rbuilder); code comment flags it if backend moves off-host. Per awc 3.8.1, this only affects the idle pool, not in-flight requests.
  • Defer + skip postprocess: bookkeeping insert reordered after stream_to_client (intra-handler reorder, not a spawn; relies on actix not polling response bodies until the handler returns). Response parse is skipped entirely when neither logging nor mirroring is enabled; metrics path preserved.

Unchanged

  • Which requests get intercepted, mirrored, or logged: only when postprocess runs.
  • No flag, no config surface.

History

Squashed three explore-then-revert commit pairs (17=>13) for review clarity.

@KA-ROM KA-ROM marked this pull request as draft June 5, 2026 13:10
@KA-ROM KA-ROM changed the title Perf/cheap allocs Small optimizations Jun 5, 2026
@KA-ROM KA-ROM force-pushed the perf/cheap-allocs branch from d7d7f94 to f86fd2d Compare June 9, 2026 12:40
@KA-ROM KA-ROM changed the title Small optimizations perf: set TCP_NODELAY, cheap allocs, socket tuning, postprocess deferral on engine_* path Jun 9, 2026
@KA-ROM KA-ROM requested a review from 0x416e746f6e June 9, 2026 12:54
@KA-ROM KA-ROM marked this pull request as ready for review June 9, 2026 12:54
@KA-ROM KA-ROM requested review from akundaz, avalonche and julio4 June 9, 2026 12:56
// Per-worker cache of (user_agent -> Counter) so that we only pay the
// `Family::get_or_create` cost the first time a UA is seen on this
// worker. Lookup is `&str`-keyed (no allocation on hit).
client_info_cache: HashMap<String, Counter>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the user agent just be anything the client sets? Do you need a metric that uses that as a label?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally, yes, it can.

practically, it's what rollup-boost sends (which is its version that doesn't change often)

.conn_keep_alive(2 * timeout)
// 100ms grace is fine for loopback (op-rbuilder); raise if backend
// is ever non-loopback (would risk truncating responses on shutdown).
.disconnect_timeout(Duration::from_millis(100))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be configurable?

Comment on lines +83 to +88
// Per-worker cached metric handles. These are cheap to clone (each
// wraps an `Arc<AtomicU64>` internally) and bypass the per-request
// `Family::get_or_create` lookup on the hot path.
in_flight_client: Gauge,
in_flight_backend: Gauge,
proxy_failure_count: Counter,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

perhaps better to create ProxyHttpMetrics type, intantiate it from shared.metrics, and make it a member of ProxyHttp => this way adding new metrics will require less boiler-plate.

Comment on lines 9 to +14
matches!(
name.as_str().to_ascii_lowercase().as_str(),
"connection" | "host" | "keep-alive" | "transfer-encoding"
)
name,
&header::CONNECTION |
&header::HOST |
&header::TRANSFER_ENCODING
) || name.as_str().eq_ignore_ascii_case("keep-alive")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

HTTP headers are case-insensitive. (e.g. CONNECTION, connection, and CoNnEcTiOn are all valid and all refer to the same header).

does this change respect that?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, doc says:

Header names are normalized to lower case. In other words, when creating a HeaderName with a string, even if upper case characters are included, when getting a string representation of the HeaderName, it will be all lower case.

https://docs.rs/http/latest/http/header/index.html

Comment on lines +433 to +455
// Hot path: read-only lookup keyed by &str (no allocation).
// Cold path: allocate the owned String key and resolve the
// counter from the metrics family exactly once per worker per
// distinct UA.
if let Some(entry) = this.client_info_cache.read_sync(user_agent, |_, c| c.clone()) {
entry.inc();
} else {
let counter = metrics
.client_info
.get_or_create(&LabelsProxyClientInfo {
proxy: P::name(),
user_agent: user_agent.to_string(),
})
.clone();
counter.inc();
// soft cap to prevent unbounded growth from hostile UA values
if this.client_info_cache.len() < 100 {
// Best-effort insert; races with another task on the same
// worker thread shouldn't happen given !Send actix workers,
// but tolerate insert errors regardless.
let _ = this.client_info_cache.insert_sync(user_agent.to_string(), counter);
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

this would be nice to encapsulate in ProxyHttpMetrics too

Comment on lines +1409 to +1419
let tcp_nodelay = actix_tls::connect::Connector::new(
actix_tls::connect::Resolver::default(),
)
.service()
.map(|conn: actix_tls::connect::Connection<awc::http::Uri, tokio::net::TcpStream>| {
let _ = conn.io_ref().set_nodelay(true);
let _ = socket2::SockRef::from(conn.io_ref()).set_tcp_keepalive(
&socket2::TcpKeepalive::new().with_time(Duration::from_secs(60)),
);
conn
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

w.d.y.t. about putting this behind CLI-switch? (so that the user could choose on case-by-case).

(TCP_NODELAY could still be the default).


// Build the inner TCP connector ourselves so we can flip
// TCP_NODELAY on after each connect
use actix_service::ServiceExt as _;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

imports should go to the top

.connector(Connector::new().conn_keep_alive(2 * timeout).limit(connections_limit))
.connector(
Connector::new()
.connector(tcp_nodelay)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

what about client side? originally I thought about setting TCP_NODELAY up there

Comment on lines +708 to +718
// Initiate the response stream first so the client sees no extra
// bookkeeping latency on the critical path, then file the request
// into the in-flight map for the eventual response postprocessor.
// The insert must complete before the response body stream finishes
// (which is when `postprocess_backend_response` calls `remove_sync`);
// this is guaranteed because actix won't begin polling the streaming
// body until after this synchronous handler returns.
let this_clone = this.clone();
let res = Self::stream_to_client(this, req_id, conn_id, bknd_res);
this_clone.postprocess_client_request(req);
res

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue:

client request is post-processed (i.e. inserted into in-flight requests hashmap) first (i.e. before streaming the request to the backend) for a reason.

otherwise, there is a potential for race-condition when the response arrives while the request info is still not inserted.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm ok, good catch, will revert. + realized I disabled dedup half way through iterations, so its not worth the risk at all as the improvements seen in my tests weren't related to this change

Comment on lines +802 to 842
// Fast path: when nobody is logging the proxied request/response and
// there are no mirror peers configured, skip the (expensive) response
// decompress + parse and the (always-evaluated) `info!` formatting in
// `maybe_log_proxied_request_and_response`. We still parse the request
// for the jrpc method label so metrics keep their correct dimensions.
let log_off =
!inner.config().log_proxied_requests() && !inner.config().log_proxied_responses();
let no_mirror = mirroring_peers.is_empty();
if log_off && no_mirror {
// If the request body has already been parsed (intercept path)
// we can avoid a second decompress + parse.
if let Some(jrpc) = clnt_req.jrpc_meta.clone() {
Self::emit_metrics_on_proxy_success(&jrpc, &clnt_req, &bknd_res, metrics);
return;
}
if clnt_req.decompressed_size < clnt_req.size {
(clnt_req.decompressed_body, clnt_req.decompressed_size) = decompress(
clnt_req.body.clone(),
clnt_req.size,
clnt_req.info.content_encoding(),
);
}
match serde_json::from_slice::<JrpcRequestMetaMaybeBatch>(&clnt_req.decompressed_body) {
Ok(jrpc) => {
Self::emit_metrics_on_proxy_success(&jrpc, &clnt_req, &bknd_res, metrics);
}
Err(err) => {
warn!(
proxy = P::name(),
request_id = %clnt_req.info.req_id,
connection_id = %clnt_req.info.conn_id,
worker_id = %worker_id,
error = ?err,
"Failed to parse json-rpc request",
);
}
}
return;
}

if clnt_req.decompressed_size < clnt_req.size {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question/issue:

finalise_proxying happens off the hot-path (request received/forwarded, response received/forwarded).

why do we need to optimise things here? (it makes things more complicated, but what do we win with this?)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. + I missed that both are gated on configs we don't run currently (0bf93f2 only fires with logging fully off, bcef4af only when dedup is on). when/if either flips, will revisit.

Comment on lines +651 to +653
// Stash the parsed jrpc so `finalise_proxying` can avoid
// re-parsing the same body.
Some(Arc::new(jrpc))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

how often is this code-path hit in practice?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah will revert, for same reason as here: #27 (comment)

let mut clnt_res = Self::to_client_response(&bknd_res);

let preallocate = this.shared.config().prealloacated_response_buffer_size();
let content_encoding = bknd_res.headers().get(header::CONTENT_ENCODING).cloned();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

are we sure content-encoding is the only header we want to send back to the client?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still forward every header to the client (the loop in to_client_response). this change only affects what we keep around internally for our own processing. before, it copied the whole header bag, now we only copy the one header (content-encoding), to decompress the body for logging. nothing the client sees should be different

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants