feat: label SessionTime histogram by process type#173

Closed
GregTheGreek wants to merge 1 commit into main from feat/label-mpc-metrics-by-type

@GregTheGreek

Summary

`relayer.SessionTime` has always lumped all three TSS workloads into one distribution, which makes its quantiles meaningless:

| Workload | Typical duration | Dominant cost |
| --- | --- | --- |
| Keygen | tens of seconds | Paillier key generation |
| Signing | hundreds of ms | CMP rounds + libp2p |
| Resharing | seconds | reshared key derivation |

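Any dashboard p50/p95 computed over this histogram is a quantile of a mixture of three wildly different distributions, so it mostly reflects how many sessions of each type ran rather than how long any one type takes.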
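To make this concrete, here is a minimal sketch in Go. The sample counts and durations are made up for illustration (90 signing sessions at 0.2 s, 5 resharing at 3 s, 5 keygen at 30 s), not measured, and `percentile` and `repeat` are hypothetical helpers that use a simple nearest-rank rule rather than Prometheus bucket interpolation.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank percentile of a non-empty sample set.
// pct is an integer in (0, 100].
func percentile(samples []float64, pct int) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := (pct*len(sorted) + 99) / 100 // integer ceil(pct/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// repeat builds n identical samples of value v (illustrative data only).
func repeat(v float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	signing := repeat(0.2, 90) // assumed: hundreds of ms
	resharing := repeat(3, 5)  // assumed: seconds
	keygen := repeat(30, 5)    // assumed: tens of seconds

	mixed := append(append(append([]float64(nil), signing...), resharing...), keygen...)

	// prints: mixed   p50=0.2s p95=3.0s
	fmt.Printf("mixed   p50=%.1fs p95=%.1fs\n", percentile(mixed, 50), percentile(mixed, 95))
	// prints: signing p95=0.2s
	fmt.Printf("signing p95=%.1fs\n", percentile(signing, 95))
	// prints: keygen  p95=30.0s
	fmt.Printf("keygen  p95=%.1fs\n", percentile(keygen, 95))
}
```

With these assumed numbers the mixed p95 lands on the resharing duration, which describes neither signing nor keygen latency.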

Change

  • Added `Type() string` to `TssProcess` (returns `"keygen"`, `"signing"`, `"resharing"`)
  • Widened `Metrics.StartProcess` to `StartProcess(sessionID, processType string)`
  • `MpcMetrics` now stores `{at, processType}` per session and emits the `type` attribute on `SessionTime.Record`
  • Regenerated coordinator mock; updated `CoordinatorTestSuite.SetupTest` expectation

No change to `cache.Metrics` or the `EndProcess` signature: the type is recovered from the stored session entry.
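The shape of the change can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: the `recorder` interface is a hypothetical stand-in for the real histogram instrument, `memRecorder` is a test fake, and the field names, locking, and deletion of the entry on `EndProcess` are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// recorder is a hypothetical stand-in for the SessionTime histogram.
type recorder interface {
	Record(d time.Duration, processType string)
}

// sessionEntry mirrors the {at, processType} pair stored per session.
type sessionEntry struct {
	at          time.Time
	processType string
}

// MpcMetrics here is a simplified sketch, not the real struct.
type MpcMetrics struct {
	mu       sync.Mutex
	sessions map[string]sessionEntry
	hist     recorder
}

func NewMpcMetrics(hist recorder) *MpcMetrics {
	return &MpcMetrics{sessions: make(map[string]sessionEntry), hist: hist}
}

// StartProcess remembers when the session began and which workload it is.
func (m *MpcMetrics) StartProcess(sessionID, processType string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.sessions[sessionID] = sessionEntry{at: time.Now(), processType: processType}
}

// EndProcess keeps its one-argument signature: the type comes from the
// stored entry. Unknown sessions emit nothing (the real code logs a warning).
func (m *MpcMetrics) EndProcess(sessionID string) {
	m.mu.Lock()
	entry, ok := m.sessions[sessionID]
	delete(m.sessions, sessionID) // assumption: entries are dropped once ended
	m.mu.Unlock()
	if !ok {
		return
	}
	m.hist.Record(time.Since(entry.at), entry.processType)
}

// point and memRecorder are an in-memory fake used only for this demo.
type point struct {
	d           time.Duration
	processType string
}

type memRecorder struct{ points []point }

func (r *memRecorder) Record(d time.Duration, processType string) {
	r.points = append(r.points, point{d, processType})
}

func main() {
	rec := &memRecorder{}
	m := NewMpcMetrics(rec)
	m.StartProcess("session-1", "signing")
	m.EndProcess("session-1")
	m.EndProcess("never-started") // skipped: no stored entry
	for _, p := range rec.points {
		fmt.Printf("type=%s non-negative=%v\n", p.processType, p.d >= 0)
	}
}
```

Keeping the type in the stored entry is what lets callers of `EndProcess` stay unchanged.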

Grafana note

After this lands, query p95 as:

```
histogram_quantile(0.95, sum by (le) (rate(relayer_SessionTime_bucket{type="signing"}[5m])))
```

Dashboards that don't filter on `type` will silently aggregate all three workloads, same as today.

Test plan

  • `go build ./...` clean
  • Go tests in affected packages pass locally
  • CI green
  • Staging: confirm `SessionTime` metric shows `type` label in the collector

Notes for reviewers

  • `EndProcess` emits only the type recorded at `StartProcess`. If `EndProcess` is called for a session that was never started, nothing is emitted (existing warning path).
  • Process types are plain strings rather than a typed enum to keep cardinality obvious and avoid a cross-package dep for consumers.
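As a rough illustration of the plain-string approach, the sketch below trims `TssProcess` down to the new method only. The concrete type names (`keygen`, `signing`, `resharing`) and the constant names are hypothetical; only the three returned strings come from this PR.

```go
package main

import "fmt"

// Untyped string constants: the full set of label values is visible in one
// place, and consumers compare against plain strings without importing a type.
const (
	TypeKeygen    = "keygen"
	TypeSigning   = "signing"
	TypeResharing = "resharing"
)

// TssProcess is trimmed to the method added here; the real interface
// has other methods that are omitted from this sketch.
type TssProcess interface {
	Type() string
}

// Hypothetical stand-ins for the three process implementations.
type keygen struct{}
type signing struct{}
type resharing struct{}

func (keygen) Type() string    { return TypeKeygen }
func (signing) Type() string   { return TypeSigning }
func (resharing) Type() string { return TypeResharing }

func main() {
	for _, p := range []TssProcess{keygen{}, signing{}, resharing{}} {
		fmt.Println(p.Type())
	}
}
```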

relayer.SessionTime currently lumps keygen, signing, and resharing
into a single distribution, so any dashboard quantile is a mix of
three very different workloads (Paillier keygen in the tens of
seconds, signing in the hundreds of milliseconds, resharing in
between).

Add Type() to TssProcess, have each implementation return a stable
string, and thread it through Metrics.StartProcess so EndProcess
can emit the histogram with a "type" attribute. Grafana queries
should now filter by type (e.g. keygen|signing|resharing) to get
meaningful p50/p95.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 53.3% ✨ ✨ ✨

@mpetrun5
Collaborator

mpetrun5 commented Apr 20, 2026

The other workloads basically never happen so I don't think for the purposes of metrics it does anything.
