VPR-141 feat(healthchecks): add /health endpoints and UI dashboard #159
rlorenzo commented Apr 22, 2026
- /health (anonymous liveness for Jenkins) + /health/detail (tagged "ready", IP-gated to SVM /20 and infra /24, not CAS-gated so it stays reachable when auth is degraded).
- HealthChecks.UI dashboard at /healthchecks with UC Davis branding, duration humanizer, and a campus-status banner that appears when any campus-* check is non-healthy.
- Adaptive polling decorator on campus checks (LDAP/CAS/SMTP/VMACs): healthy results cached 1 hour, failures re-probe every 5 min (one UI poll cycle). Cuts external probes from 12/hour to 1/hour per check per instance while healthy.
- Real LDAPS bind, MailKit SMTP connect, AWS SSM probe, disk checks for app/photos/CMS/logs, EF DbContext checks for all contexts.
- Adopts DotNetDiag.HealthChecks.UI 10.0.7 (fork of abandoned Xabaril packages; upstream does not build on .NET 10). Pinned exactly.
- Jenkins Deploy stages now poll /health post-deploy.
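The adaptive polling policy for campus checks can be sketched in JavaScript. This is a minimal sketch, not the PR's C# decorator. It assumes a probe is an async function that resolves to "Healthy", "Degraded" or "Unhealthy", and that the clock is injectable for testing. `createAdaptivePoller` is a hypothetical name.

```javascript
// Sketch of the adaptive caching policy described for campus checks.
// Assumption: a probe is an async function resolving to "Healthy",
// "Degraded", or "Unhealthy"; the real decorator wraps a C# IHealthCheck.
const HEALTHY_TTL_MS = 60 * 60 * 1000; // healthy results reused for 1 hour
const FAILURE_TTL_MS = 5 * 60 * 1000;  // anything else re-probed after 5 minutes

function createAdaptivePoller(probe, now = () => Date.now()) {
  let cached = null; // { status, expiresAt }

  return async function check() {
    const t = now();
    if (cached && t < cached.expiresAt) {
      return cached.status; // serve the cached result, no external traffic
    }
    const status = await probe();
    const ttl = status === "Healthy" ? HEALTHY_TTL_MS : FAILURE_TTL_MS;
    cached = { status, expiresAt: t + ttl };
    return status;
  };
}
```

Suppose the UI polls every 5 minutes. A check that stays healthy then makes one real probe per hour instead of twelve. A failing check is re-probed on every poll, so recovery appears within one cycle.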
Codecov Report
❌ Patch coverage is

| Metric | main | #159 | +/- |
|---|---|---|---|
| Coverage | 43.27% | 42.87% | -0.41% |
| Files | 862 | 869 | +7 |
| Lines | 50319 | 50839 | +520 |
| Branches | 4696 | 4735 | +39 |
| Hits | 21777 | 21795 | +18 |
| Misses | 28019 | 28521 | +502 |
| Partials | 523 | 523 | |

Flags with carried forward coverage won't be shown.
Pull request overview
Adds operational health checking to the web app: anonymous liveness, IP-gated readiness details, and an internal HealthChecks.UI dashboard with UC Davis branding and reduced external probe traffic via adaptive polling.
Changes:
- Introduces /health and /health/detail endpoints plus the /healthchecks UI, including IP allowlisting and CSP bypass for the UI bundle.
- Adds multiple new health checks (DB contexts, disk space, LDAP, SMTP, CAS/VMACs HTTP probes, AWS SSM) and an adaptive polling decorator to reduce probe frequency when healthy.
- Updates Jenkins deploy stages to poll /health after deploy; adds UI branding assets and a small injected JS enhancer.
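Post-deploy polling of /health is a bounded retry loop. The real stage lives in the JenkinsFile, and its attempt count, delay and URL are not shown here. This JavaScript sketch uses hypothetical values and injects the probe and sleep functions so the loop can be exercised without a network.

```javascript
// Hypothetical retry loop for post-deploy /health polling. The real logic
// lives in the JenkinsFile; attempt count and delay here are assumptions.
async function waitForHealthy(probe, { attempts = 10, delayMs = 5000, sleep } = {}) {
  const pause = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const statusCode = await probe();
      if (statusCode === 200) return attempt; // liveness endpoint answered OK
    } catch {
      // e.g. connection refused while the app restarts; keep waiting
    }
    if (attempt < attempts) await pause(delayMs);
  }
  throw new Error(`/health did not return 200 after ${attempts} attempts`);
}
```

Against a deployed instance, the probe could be `() => fetch("https://viper.example.edu/2/health").then((r) => r.status)`. The host is a placeholder. Failing the build after the last attempt keeps a broken deploy from passing silently.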
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/wwwroot/js/healthchecks-ui-extras.js | Injected UI script to humanize duration cells and show a campus-status banner. |
| web/wwwroot/healthchecks-ui-logo.png | New logo asset for HealthChecks.UI branding. |
| web/wwwroot/css/healthchecks-ui-branding.css | UC Davis palette + UI CSS tweaks (contrast, layout, banner styling). |
| web/appsettings.json | Expands InternalAllowlist to CIDR ranges for health detail/UI access. |
| web/Viper.csproj | Adds DotNetDiag HealthChecks.UI packages and EF health checks package reference. |
| web/Program.cs | Hooks health check DI + pipeline wiring; conditionally applies CSP outside UI paths. |
| web/Classes/HealthChecks/SmtpHealthCheck.cs | MailKit-based SMTP reachability/TLS probe. |
| web/Classes/HealthChecks/LdapHealthCheck.cs | Real LDAPS bind probe for directory health. |
| web/Classes/HealthChecks/HttpEndpointHealthCheck.cs | Generic HTTP endpoint reachability probe. |
| web/Classes/HealthChecks/HealthCheckExtensions.cs | Centralized health checks registration + endpoint/UI mapping + UI HTML injection. |
| web/Classes/HealthChecks/DiskSpaceHealthCheck.cs | Drive space (and optional writability) checks for app/photos/CMS/log paths. |
| web/Classes/HealthChecks/AwsSsmHealthCheck.cs | AWS SSM reachability probe using a lightweight DescribeParameters call. |
| web/Classes/HealthChecks/AdaptivePollingHealthCheck.cs | Caches health results with status-dependent TTLs to reduce probe load. |
| JenkinsFile | Adds post-deploy /health polling for test and prod. |
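The InternalAllowlist change moves from a single IP to CIDR ranges, so membership becomes a masked-prefix comparison. This is an IPv4-only sketch in JavaScript. The ranges are placeholders, not the real SVM /20 or infra /24 values, and the PR's C# matching code is not shown here.

```javascript
// IPv4-only sketch of CIDR allowlist matching (IPv6 not handled).
function ipv4ToInt(ip) {
  const octets = ip.split(".");
  if (octets.length !== 4 || octets.some((o) => !/^\d{1,3}$/.test(o) || Number(o) > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  // >>> 0 keeps every intermediate value an unsigned 32-bit integer
  return octets.reduce((acc, o) => ((acc << 8) | Number(o)) >>> 0, 0);
}

function inCidr(ip, cidr) {
  const [network, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  // JS shifts are mod 32, so a /0 mask must be special-cased
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(network) & mask) >>> 0);
}

function isAllowed(ip, allowlist) {
  return allowlist.some((cidr) => inCidr(ip, cidr));
}

// Hypothetical ranges standing in for the /20 and /24 in appsettings.json.
const internalAllowlist = ["10.10.16.0/20", "192.0.2.0/24"];
```

A /20 covers 4,096 addresses (10.10.16.0 through 10.10.31.255 in this example). Compared with listing single IPs, this lets a whole staff subnet reach /health/detail and the dashboard.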
- Dispose StreamReader with leaveOpen so the response buffer stays usable after the HTML-injection middleware reads it.
- Accept a trailing slash on the UI path via StartsWithSegments so the extras script is injected for "/healthchecks/" (Xabaril serves both forms); injection is still gated on the text/html content type.
- Collapse the empty UnauthorizedAccessException catch into the IOException best-effort handler in DiskSpaceHealthCheck cleanup.
- Fix the formatDuration "1m60s" rollover: round to whole seconds first, then split, so 59.6s promotes to the next minute.
- Use DateTime.Now (DateTimeKind.Local per project convention) for cache timestamps and the injected-script cache-buster, with a scoped S6561 pragma where it is used for elapsed-time math.
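The formatDuration fix comes down to operation order: rounding the remainder after splitting can yield 60 seconds, while rounding first lets the carry reach the minutes. This hedged sketch assumes the input is a number of seconds and copies the "XmYs" shape of the quoted "1m60s" string. The real script's input format and sub-second handling are not shown here.

```javascript
// Fixed order: round to whole seconds first, then split.
function formatDuration(totalSeconds) {
  const rounded = Math.round(totalSeconds);
  const minutes = Math.floor(rounded / 60);
  const seconds = rounded % 60;
  return minutes > 0 ? `${minutes}m${seconds}s` : `${seconds}s`;
}

// Buggy order: split first, then round the remainder (can produce "60s").
function formatDurationBuggy(totalSeconds) {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = Math.round(totalSeconds % 60);
  return minutes > 0 ? `${minutes}m${seconds}s` : `${seconds}s`;
}
```

For 119.6 seconds the buggy version yields "1m60s" and the fixed version yields "2m0s".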
Pull request overview
Adds first-class health check endpoints and an operator-facing HealthChecks.UI dashboard to VIPER, including custom probes (DB, disk, LDAP/CAS/SMTP/SSM) and Jenkins post-deploy verification.
Changes:
- Introduces /health liveness, /health/detail readiness JSON, and the /healthchecks UI with IP allowlisting.
- Adds multiple custom IHealthCheck implementations plus an adaptive polling decorator to reduce external traffic.
- Updates Jenkins deploy stages to poll /health after deployment.
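The liveness/readiness split relies on filtering registered checks by tag and reporting the worst result. This JavaScript sketch mimics ASP.NET Core's `HealthCheckOptions.Predicate` and its `HealthStatus` ordering. A predicate that matches nothing is one common way to get bare liveness, but the PR's exact /health configuration is not shown. The check names are hypothetical.

```javascript
// Ordering mirrors ASP.NET Core's HealthStatus enum; the overall report
// takes the worst (lowest) status among the checks that ran.
const Status = { Unhealthy: 0, Degraded: 1, Healthy: 2 };

function runReport(checks, predicate) {
  const entries = checks.filter(predicate).map((c) => ({ name: c.name, status: c.run() }));
  const overall = entries.reduce((worst, e) => Math.min(worst, Status[e.status]), Status.Healthy);
  return { status: Object.keys(Status).find((k) => Status[k] === overall), entries };
}

const checks = [ // hypothetical registrations
  { name: "db-viper", tags: ["ready"], run: () => "Healthy" },
  { name: "campus-ldap", tags: ["ready"], run: () => "Degraded" },
];

const liveness = runReport(checks, () => false);                      // like /health
const readiness = runReport(checks, (c) => c.tags.includes("ready")); // like /health/detail
```

By default ASP.NET Core maps Healthy and Degraded to HTTP 200 and Unhealthy to 503. A degraded campus dependency therefore does not fail the Jenkins liveness poll.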
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| web/wwwroot/js/healthchecks-ui-extras.js | UI-side DOM tweaks (duration humanizer + campus-status banner). |
| web/wwwroot/healthchecks-ui-logo.png | Adds UC Davis-branded logo asset for the dashboard. |
| web/wwwroot/css/healthchecks-ui-branding.css | Custom palette/branding + minor UI layout/accessibility tweaks. |
| web/appsettings.json | Expands InternalAllowlist to CIDR ranges for readiness/UI access. |
| web/Viper.csproj | Adds HealthChecks.UI + EF Core health check package references. |
| web/Program.cs | Hooks health check DI + pipeline wiring; skips CSP on UI paths. |
| web/Classes/HealthChecks/SmtpHealthCheck.cs | Adds SMTP relay probe via MailKit connect/noop/disconnect. |
| web/Classes/HealthChecks/LdapHealthCheck.cs | Adds LDAPS bind probe matching existing LDAP service settings. |
| web/Classes/HealthChecks/HttpEndpointHealthCheck.cs | Adds generic HTTP reachability probe for CAS/VMACs. |
| web/Classes/HealthChecks/HealthCheckExtensions.cs | Centralizes health check registration, endpoint mapping, UI config, and response-body script injection. |
| web/Classes/HealthChecks/DiskSpaceHealthCheck.cs | Adds disk free-space (and optional writability) probe for key volumes. |
| web/Classes/HealthChecks/AwsSsmHealthCheck.cs | Adds lightweight SSM reachability probe. |
| web/Classes/HealthChecks/AdaptivePollingHealthCheck.cs | Adds status-based caching to reduce expensive probe frequency. |
| JenkinsFile | Adds post-deploy /health polling for test and prod stages. |
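Skipping CSP on UI paths depends on segment-aware prefix matching. ASP.NET Core's `PathString.StartsWithSegments` matches "/healthchecks" and "/healthchecks/..." case-insensitively, but not "/healthchecksfoo". This JavaScript approximation assumes a hypothetical list of UI prefixes. Note that in ASP.NET Core the request path is relative to PathBase (for example, /2).

```javascript
// Approximates PathString.StartsWithSegments: the prefix must end at a "/"
// boundary (or the end of the path), and the comparison ignores case.
function startsWithSegments(path, prefix) {
  const p = path.toLowerCase();
  const pre = prefix.toLowerCase();
  return p === pre || p.startsWith(pre + "/");
}

// Hypothetical list of dashboard prefixes that must not receive the CSP header.
const UI_PREFIXES = ["/healthchecks", "/healthchecks-api"];

function shouldApplyCsp(path, uiPrefixes = UI_PREFIXES) {
  return !uiPrefixes.some((prefix) => startsWithSegments(path, prefix));
}
```

Because matching stops at segment boundaries, a sibling path such as "/healthchecks-api" is not covered by "/healthchecks" and must be listed on its own.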
Bundle Report: Bundle size has no change ✅
Root-relative URLs in the dashboard wiring broke in TEST/PROD, where the app is hosted under /2: the collector 404'd /health/detail, the injected script 404'd /js/healthchecks-ui-extras.js, and the logo 404'd /healthchecks-ui-logo.png.
- The script src injected into the dashboard HTML is now prefixed with ctx.Request.PathBase, so it resolves to /2/js/... in TEST/PROD and /js/... in dev.
- The health-detail endpoint URL is built from EmailSettings:BaseUrl (already configured per environment with the /2 path base); dev falls back to a relative URL, which Xabaril resolves against the Kestrel listening address.
- The logo is inlined as a base64 data URI in the custom stylesheet, so the path base is no longer part of its URL; the now-unused PNG was dropped from wwwroot.
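The path-base fix comes down to joining a configurable prefix onto each URL instead of hard-coding a leading "/". In this JavaScript sketch, `pathBase` stands in for ctx.Request.PathBase and `baseUrl` for EmailSettings:BaseUrl. The helper names, the placeholder host and the exact relative dev fallback ("/health/detail") are assumptions.

```javascript
// Join a base and a path with exactly one "/" between them.
function joinUrl(base, path) {
  return base.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// pathBase is "/2" in TEST/PROD and "" in dev (hypothetical inputs).
function extrasScriptSrc(pathBase) {
  return joinUrl(pathBase || "", "/js/healthchecks-ui-extras.js");
}

// Without a configured BaseUrl (dev), return a relative URL for the collector.
function healthDetailEndpoint(baseUrl) {
  return baseUrl ? joinUrl(baseUrl, "health/detail") : "/health/detail";
}
```

Normalizing slashes on both sides means a BaseUrl with or without a trailing slash produces the same endpoint.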
Force-pushed from 80e474c to 24f416c.
Pull request overview
Adds first-class health checking to VIPER, including liveness/readiness endpoints for deploy automation and an IP-gated HealthChecks.UI dashboard tailored to campus ops needs.
Changes:
- Introduces /health (anonymous liveness) and /health/detail (IP-gated readiness with tagged checks) plus HealthChecks.UI at /healthchecks.
- Adds health check implementations (LDAP, SMTP, HTTP endpoint probes, disk space, AWS SSM) and an adaptive polling decorator to reduce external probe traffic.
- Updates Jenkins deploy stages to poll /2/health post-deploy; adds UC Davis branding + UI tweaks (duration humanizer + campus-status banner).
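The campus-status banner shows when any campus-* check is non-healthy. Its decision logic can be isolated from the DOM. This JavaScript sketch assumes each dashboard row has already been read into `{ name, status }`. The row shape and the check names are hypothetical, and the real script reads rows from the rendered page.

```javascript
// Returns the names of campus-* checks that are not Healthy; an empty
// array means the banner stays hidden.
function failingCampusChecks(rows) {
  return rows
    .filter((row) => row.name.startsWith("campus-") && row.status !== "Healthy")
    .map((row) => row.name);
}
```

In the browser, a function like this would be re-run from a MutationObserver callback each time the UI re-renders its table. Non-campus failures (for example, a database) do not trigger the banner.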
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| web/wwwroot/js/healthchecks-ui-extras.js | Injected UI tweaks (duration humanizer, campus-status banner) via MutationObserver. |
| web/wwwroot/css/healthchecks-ui-branding.css | UC Davis palette + UI readability adjustments + campus-status banner styling. |
| web/appsettings.json | Replaces a single internal allowlisted IP with CIDR ranges for staff + infra. |
| web/Viper.csproj | Adds DotNetDiag HealthChecks.UI packages and EF Core health check package. |
| web/Program.cs | Hooks health check DI + pipeline wiring; skips CSP on HealthChecks.UI paths. |
| web/Classes/HealthChecks/SmtpHealthCheck.cs | MailKit-based SMTP reachability probe. |
| web/Classes/HealthChecks/LdapHealthCheck.cs | Real LDAPS bind probe using existing LDAP service credentials. |
| web/Classes/HealthChecks/HttpEndpointHealthCheck.cs | HTTP(S) reachability probe (treats non-5xx as healthy). |
| web/Classes/HealthChecks/HealthCheckExtensions.cs | Centralizes health check registration, endpoint mapping, UI wiring, and IP gating. |
| web/Classes/HealthChecks/DiskSpaceHealthCheck.cs | Disk free-space (and optional writability) probe for key volumes/paths. |
| web/Classes/HealthChecks/AwsSsmHealthCheck.cs | AWS SSM reachability probe via DescribeParameters. |
| web/Classes/HealthChecks/AdaptivePollingHealthCheck.cs | Caches healthy vs unhealthy results for different durations to reduce probe load. |
| JenkinsFile | Adds post-deploy polling of /2/health for TEST and PROD. |
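The HTTP endpoint probe treats any non-5xx response as healthy. A redirect or 401 from CAS or VMACs still proves the server is up. This JavaScript sketch keeps that rule and adds a timeout through AbortController. The timeout value, the function name and the choice to map network errors to Unhealthy are assumptions. `fetchImpl` is injectable so the logic can be tested offline.

```javascript
// Hypothetical reachability probe: status < 500 means reachable.
async function probeEndpoint(url, { timeoutMs = 10000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    return res.status >= 500 ? "Unhealthy" : "Healthy";
  } catch {
    return "Unhealthy"; // DNS failure, refused connection, or timeout abort
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after the probe
  }
}
```

Counting 4xx as healthy is deliberate. The probe checks reachability, not whether an anonymous request is authorized.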
builder.Services.AddScoped<Viper.EmailTemplates.Services.IEmailTemplateRenderer, Viper.EmailTemplates.Services.EmailTemplateRenderer>();

// All health-check DI wiring lives in HealthCheckExtensions; see that file
// (and PLAN-hangfire.md PR 0) for design rationale.

/// See PLAN-hangfire.md PR 0 for the design rationale (liveness vs IP-gated
/// detail, CSP branching, Xabaril UI fork choice, etc.).

/// run on /health/detail; /health is bare liveness. Hangfire checks layer
/// onto the "ready" tag in PR 3.

// reachability with the same SDK.
builder.AddCheck(
    "aws-ssm",
    new AwsSsmHealthCheck(),
Keeping healthyWhenMissing: false here. AWS SSM is a hard dependency in local dev: the app fetches database passwords from SSM Parameter Store at startup via .AddSystemsManager in Program.cs, so developers must have AWS credentials configured. If SSM is unreachable in dev, DB connections fail, so Unhealthy is the correct signal. The photos/CMS healthyWhenMissing=true pattern is specifically for network drives that aren't mounted on dev machines; SSM is a different case.
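The healthyWhenMissing flag separates optional dependencies from required ones. This JavaScript sketch shows that decision for a path-based check. Only the flag's meaning comes from the discussion above. The function name, the injected `exists` callback, the descriptions and the "/mnt/photos" path are hypothetical.

```javascript
// healthyWhenMissing=true suits network shares that are not mounted on dev
// machines (photos, CMS); false suits hard dependencies that must be present.
function checkPathAvailability(path, { healthyWhenMissing, exists }) {
  if (exists(path)) {
    return { status: "Healthy", description: `${path} is reachable` };
  }
  return healthyWhenMissing
    ? { status: "Healthy", description: `${path} not present; treated as optional` }
    : { status: "Unhealthy", description: `${path} is missing` };
}
```

A dev machine without the photos share then reports Healthy. A hard dependency such as SSM, configured with the flag off, still surfaces as Unhealthy.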
Removed references to PLAN-hangfire.md and future PR numbers from comments in HealthCheckExtensions.cs and Program.cs. The plan file is an untracked working note and won't exist on main; PR-number references rot once the branch is merged.