feat(experimentation): experiment results model, task and endpoints by gagantrivedi · Pull Request #7796 · Flagsmith/flagsmith

gagantrivedi · 2026-06-16T10:44:00Z

Thanks for submitting a PR! Please check the boxes below:

I have read the Contributing Guide.
I have added information to docs/ if required so people know about the feature.
I have filled in the "Changes" section below.
I have filled in the "How did you test this code" section below.

Changes

Contributes to the experiment results stats layer (stacked on #7781; merge after it).

So the experiment detail page can show per-metric results — lift, chance-to-win, and a sample-ratio-mismatch check — this computes them from the warehouse on demand and stores one row per experiment, updated in place (mirroring the exposures panel).

ExperimentResults model (migration 0008), on a new abstract ExperimentComputation base now shared with ExperimentExposures: as_of / payload / last_error_at / refresh_requested_at, is_final, and record_refresh / record_failure / record_refresh_request.
compute_results_summary (services.py): derives the metric specs and the expected SRM split from the environment's multivariate allocations (control = unallocated remainder), then runs the feat(experimentation): results aggregation query and payload builder #7781 aggregation and feat(experimentation): Bayesian stats kernel #7769 kernel. SRM is skipped (and srm.unkeyed_variant logged) when an option has no variant key to attribute its share to.
compute_experiment_results task: recomputes the full window; on warehouse failure keeps the last good payload and logs results.compute_failed.
GET …/results/ and POST …/results/refresh/: read the row, or enqueue a refresh (202). _validate_refresh_request raises ValidationError before start / once final (400) and Throttled within the refresh interval (429 + Retry-After).

How did you test this code?

make test for the experimentation app — unit tests for the model, task, compute_results_summary / _experiment_metric_specs / _expected_variant_shares, and both endpoints, at 100% diff coverage. mypy and ruff clean.

vercel · 2026-06-16T10:44:07Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
docs	Ready	Preview, Comment	Jun 17, 2026 10:38am

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
flagsmith-frontend-preview	Ignored	Preview	Jun 17, 2026 10:38am
flagsmith-frontend-staging	Ignored	Preview	Jun 17, 2026 10:38am

codecov · 2026-06-16T10:50:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.58%. Comparing base (1cdb491) to head (17e9032).

Additional details and impacted files

@@                       Coverage Diff                       @@
##           feat/experiment-results-query    #7796    +/-   ##
===============================================================
  Coverage                          98.57%   98.58%            
===============================================================
  Files                               1462     1463     +1     
  Lines                              56762    57084   +322     
===============================================================
+ Hits                               55955    56277   +322     
  Misses                               807      807

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Persist Bayesian results in an ExperimentResults row (the ExperimentExposures pattern: OneToOne, as_of/payload/last_error_at/refresh_requested_at, is_final freezing completed experiments before the warehouse TTL). compute_results_summary orchestrates the warehouse aggregation: it derives the metric specs from the attached metrics and the expected variant split from the environment's multivariate allocations (control takes the unallocated remainder), then runs the kernel. compute_experiment_results runs it off the refresh endpoint and records the row, preserving the last good payload on warehouse failure. GET/POST .../results/ and .../results/refresh/ clone the exposures pair; the shared guard ladder (pre-start, finality, throttle) moves into _refresh_panel.

ExperimentExposures and ExperimentResults had identical fields and lifecycle. Hoist them into a generic abstract ExperimentComputation[SummaryT]: the subclass binds which summary record_refresh stores, so it stays type-safe per panel. Each concrete model keeps only its experiment OneToOneField (and so its related_name); no schema change.

- Rename the view-layer PanelT/_refresh_panel to ComputationT/ _refresh_computation so the shared abstraction has one name everywhere (ExperimentComputation), rather than reintroducing UI "panel" vocabulary. - _refresh_computation takes the finished 400 messages instead of a noun, dropping the unwritten assumption that the noun is a plural subject. - _expected_variant_shares selects the current feature state by highest id (matching Environment's Max("id") convention) instead of relying on the default ascending ordering, which picked the oldest version; note the coupling to features' multivariate representation. - Declare experiment on ExperimentComputation under TYPE_CHECKING so is_final is type-checked rather than suppressed with a type: ignore.

compute_experiment_exposures and compute_experiment_results had identical bodies. Hoist the lifecycle (load experiment, skip if not started or final, compute over the full window, preserve the last good payload on failure) into _refresh_computation_row; each task becomes a thin adapter supplying its model, summary function and failure logger. The helper is generic over SummaryT so record_refresh stays checked against the payload type. The failure log stays in the per-task adapter, not the helper, so its event name remains a literal the docs generator can discover.

Inline the two refresh endpoints but extract the precondition checks into _validate_refresh_request, which raises ValidationError (not started / final) or Throttled (cooling down) and lets DRF render the response — including the Retry-After header. Keeps the drift-prone validation in one place while each endpoint's happy path stays inline, with no callbacks or generic TypeVar.

Emit srm.unkeyed_variant before bailing out of _expected_variant_shares so the skipped check is visible to operators. Trim the docstring to the useful part and note the TODO to source the split from the percentage-split segment override feature state once that exists.

Add is_final to ExperimentExposuresSerializer and ExperimentResultsSerializer so the UI can decide whether to show the refresh button without re-deriving it from the experiment's ended_at.

A misconfigured feature whose multivariate options over-allocate would give control a negative expected share. Bail out of _expected_variant_shares with an srm.overallocated log instead of relying on the downstream zero-share guard.

github-actions · 2026-06-17T10:42:26Z

Docker builds report

Image	Build Status	Security report
`ghcr.io/flagsmith/flagsmith-e2e:pr-7796`	Finished ✅	Skipped
`ghcr.io/flagsmith/flagsmith-api-test:pr-7796`	Finished ✅	Skipped
`ghcr.io/flagsmith/flagsmith:pr-7796`	Finished ✅	Results ✅
`ghcr.io/flagsmith/flagsmith-private-cloud:pr-7796`	Finished ✅	Results ✅
`ghcr.io/flagsmith/flagsmith-api:pr-7796`	Finished ✅	Results ✅

github-actions · 2026-06-17T10:48:52Z

Playwright Test Results (oss - depot-ubuntu-latest-16)

1 passed

Details

1 test across 1 suite
41.5 seconds
17e9032
🔄 Run: #17579 (attempt 1)

Playwright Test Results (oss - depot-ubuntu-latest-arm-16)

1 passed

Details

1 test across 1 suite
37 seconds
17e9032
🔄 Run: #17579 (attempt 1)

Playwright Test Results (private-cloud - depot-ubuntu-latest-16)

1 failed