feat: close preprod observability loop

2026-05-08 15:48:56 -04:00
parent 8bcff96821
commit 986c7efea6
14 changed files with 618 additions and 2 deletions
--- a/docs/FEATURES/observability.md
+++ b/docs/FEATURES/observability.md
@@ -78,3 +78,17 @@ Initial alerts should cover:
 - email delivery failures
 - blob storage failures
 - background job failures
+
+## Workflow Health Gauges
+
+Database-derived workflow health metrics should be sampled periodically instead of emitted per request.
+
+Initial gauges should cover:
+
+- content item counts by status
+- feedback report counts by status
+- pending workspace invites
+- content items stale in approval
+- active workspace counts over 24-hour and 7-day windows
+
+These are operator health signals. Keep them aggregated, labeled by status or time window rather than by individual workspace or user, to avoid high-cardinality metric labels.
--- a/docs/OPERATIONS/observability-runbook.md
+++ b/docs/OPERATIONS/observability-runbook.md
@@ -0,0 +1,163 @@
+# Observability Runbook
+
+## Purpose
+
+This runbook is for preproduction operation of Socialize's self-hosted observability stack.
+
+The goal is to answer:
+
+- Is the app reachable?
+- Is the API healthy?
+- Are errors or latency rising?
+- Are users exercising core workflows?
+- Are emails, blob storage, and background jobs failing?
+- Is work getting stuck?
+
+## Start The Stack
+
+Run from the repository root on the preproduction host:
+
+```bash
+docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
+```
+
+Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0` to listen on all
+interfaces only when access to Grafana is restricted by a reverse proxy, VPN, firewall rule, or SSH tunnel.
+
+Set these before exposing Grafana:
+
+```bash
+GRAFANA_ADMIN_USER=admin
+GRAFANA_ADMIN_PASSWORD=<strong-password>
+```
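+
+With the default loopback bind, an SSH tunnel is one way to reach Grafana from a workstation without changing `GRAFANA_HTTP_BIND`. A minimal sketch, assuming SSH access to the preproduction host; `deploy@preprod.example.com` is a placeholder, not a real account or hostname:
+
+```bash
+# Forward local port 3000 to Grafana's loopback listener on the preprod host.
+# -N: do not run a remote command; -L: set up a local port forward.
+ssh -N -L 3000:127.0.0.1:3000 deploy@preprod.example.com
+```
+
+While the tunnel is open, Grafana should be reachable at `http://127.0.0.1:3000` on the workstation.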
+
+## Alert Delivery
+
+Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook
+configured by:
+
+```bash
+ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
+```
+
+If no webhook URL is configured, Alertmanager still starts, but alerts are sent to a
+local discard endpoint and are not delivered to anyone.
+
+Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
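+
+To check the delivery path end to end, a synthetic alert can be posted to Alertmanager's v2 API. This is a sketch under assumptions: Alertmanager must be reachable on its default port `9093` from where the command runs (the overlay may not publish that port on the host), the alert name `RunbookDeliveryTest` is made up for the test, and the `severity` label may or may not match how routing is configured:
+
+```bash
+# Post one synthetic alert; it should arrive at the configured webhook.
+curl -sS -X POST http://127.0.0.1:9093/api/v2/alerts \
+  -H 'Content-Type: application/json' \
+  -d '[{"labels":{"alertname":"RunbookDeliveryTest","severity":"warning"}}]'
+```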
+
+## Secure Grafana With Caddy
+
+An optional Caddy snippet is available at:
+
+```txt
+deploy/observability/caddy/grafana.Caddyfile
+```
+
+Generate a Caddy password hash:
+
+```bash
+caddy hash-password --plaintext '<password>'
+```
+
+Configure:
+
+```bash
+OBSERVABILITY_HOST=observability.example.com
+GRAFANA_BASIC_AUTH_USER=<user>
+GRAFANA_BASIC_AUTH_HASH=<hash>
+```
+
+Keep Grafana bound to `127.0.0.1` unless the public hostname is protected by basic auth or an equivalent access control.
+
+## First Bring-Up Checks
+
+1. Confirm containers are running:
+
+```bash
+docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
+```
+
+2. Check API health:
+
+```bash
+curl -i http://127.0.0.1:8080/health
+curl -i http://127.0.0.1:8080/health/ready
+```
+
+3. Open Grafana and check the `Socialize Overview` dashboard.
+
+4. Generate a few real actions:
+
+- log in
+- create a content item
+- add a comment
+- submit feedback
+- create a workspace invite
+
+5. Confirm metrics appear in the dashboard:
+
+- API request rate
+- usage signals
+- workflow backlog
+- operational events
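+
+For step 2, the API may need a short time to become ready after the stack starts. A small polling loop can stand in for repeated manual `curl` calls. This is a sketch; the 30 attempts and 2-second delay are arbitrary choices, not project defaults:
+
+```bash
+# Poll the readiness endpoint for up to about 60 seconds.
+# curl -f exits non-zero on HTTP errors, so the loop only stops on success.
+ready=no
+for attempt in $(seq 1 30); do
+  if curl -fsS -o /dev/null http://127.0.0.1:8080/health/ready; then
+    ready=yes
+    break
+  fi
+  sleep 2
+done
+echo "API ready: ${ready}"
+```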
+
+## Alert Triage
+
+The `docker compose` commands below are abbreviated. Pass the same `-f` flags used to start the stack.
+
+`SocializePreprodEndpointDown`
+
+- Check `docker compose ps`.
+- Check `docker compose logs api web`.
+- Check `/health/ready`.
+
+`SocializeApiTelemetryMissing`
+
+- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
+- Check `docker compose logs alloy`.
+- Check whether the API is receiving traffic.
+
+`SocializeApiHighErrorRate`
+
+- Open the API logs panel.
+- Filter by recent `5xx` requests.
+- Open Tempo traces for slow or failing requests if available.
+
+`SocializeApiHighLatency`
+
+- Check the p95 latency by endpoint panel.
+- Inspect slow traces.
+- Check database health and recent deploy activity.
+
+`SocializeEmailDeliveryFailures`
+
+- Check API logs for Resend failures.
+- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL` are set and valid.
+- Confirm Resend service status outside this stack if needed.
+
+`SocializeBlobStorageFailures`
+
+- Confirm `./blob-storage` volume permissions on the preprod host.
+- Check local disk space.
+- Check API logs for validation or filesystem errors.
+
+`SocializeBackgroundJobFailures`
+
+- Check the operational events panel for the failing job name.
+- Check API logs for the same time window.
+
+`SocializeContentStaleInApproval`
+
+- Use the app to inspect content currently in approval.
+- Contact the relevant internal owner or client contact outside the app if needed.
+
+`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`
+
+- Confirm whether quiet usage is expected for the period.
+- If not expected, check login events and API reachability.
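+
+During triage it can help to list which alerts Prometheus currently considers pending or firing. A sketch using the Prometheus HTTP API, assuming Prometheus is reachable on its default port `9090` from where the command runs (the overlay may not publish it on the host) and that `jq` is installed:
+
+```bash
+# Print each active alert's name and state (pending or firing).
+curl -sS http://127.0.0.1:9090/api/v1/alerts \
+  | jq -r '.data.alerts[] | "\(.labels.alertname) \(.state)"'
+```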
+
+## Retention Defaults
+
+- Prometheus keeps 15 days of metrics by default, configurable through `PROMETHEUS_RETENTION`.
+- Tempo keeps traces for 168 hours (7 days).
+- Loki uses local filesystem storage for preproduction.
+
+Tune retention before heavy customer usage or long-running demos.
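+
+As an illustration, a longer Prometheus window could be set in the environment before starting the stack. This assumes `PROMETHEUS_RETENTION` is passed through to Prometheus's retention-time setting, which accepts durations such as `30d`; the value is an example, not a recommendation:
+
+```bash
+# Example only: keep roughly 30 days of metrics instead of the default 15.
+PROMETHEUS_RETENTION=30d
+```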
--- a/docs/TASKS/observability/003-preprod-operations-loop.md
+++ b/docs/TASKS/observability/003-preprod-operations-loop.md
@@ -0,0 +1,34 @@
+# Observability 003: Preprod Operations Loop
+
+## Goal
+
+Close the preproduction operations loop by adding alert delivery scaffolding, uptime probes, workflow health gauges, secured Grafana guidance, and an operator runbook.
+
+## Feature Spec
+
+- `docs/FEATURES/observability.md`
+
+## Scope
+
+- Add Alertmanager to the optional observability compose overlay.
+- Add Blackbox Exporter uptime probes for the web container and API readiness endpoint.
+- Add backend database-derived workflow health gauges.
+- Add Prometheus alerts for uptime probes and workflow health.
+- Add an optional Caddy snippet for protected Grafana exposure.
+- Add an operator runbook for bring-up, alert triage, and security defaults.
+
+## Out Of Scope
+
+- Operating the remote preproduction host.
+- Choosing the final alert destination.
+- Client-facing status page.
+- External third-party uptime monitoring.
+
+## Validation
+
+```bash
+dotnet build backend/Socialize.slnx
+dotnet test backend/Socialize.slnx
+docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
+jq empty deploy/observability/grafana/dashboards/socialize-overview.json
+```