# Observability Runbook

## Purpose

This runbook is for preproduction operation of Socialize's self-hosted observability stack.

The goal is to answer:

- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?

## Start The Stack

Run from the repository root on the preproduction host:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```

Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0` only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=
```

## Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook configured by:

```bash
ALERTMANAGER_WEBHOOK_URL=
```

If no webhook URL is configured, Alertmanager still starts but alert delivery points to a local discard endpoint.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.

## Secure Grafana With Caddy

An optional Caddy snippet is available at:

```txt
deploy/observability/caddy/grafana.Caddyfile
```

Generate a Caddy password hash:

```bash
caddy hash-password --plaintext ''
```

Configure:

```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=
GRAFANA_BASIC_AUTH_HASH=
```

Keep Grafana private unless the hostname is protected.

## First Bring-Up Checks

1. Confirm containers are running:

   ```bash
   docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
   ```

2. Check API health:

   ```bash
   curl -i http://127.0.0.1:8080/health
   curl -i http://127.0.0.1:8080/health/ready
   ```

3. Open Grafana and check the `Socialize Overview` dashboard.

4. Generate a few real actions:

   - log in
   - create a content item
   - add a comment
   - submit feedback
   - create a workspace invite

5. Confirm metrics appear in the dashboard:

   - API request rate
   - usage signals
   - workflow backlog
   - operational events

## Alert Triage

`SocializePreprodEndpointDown`

- Check `docker compose ps`.
- Check `docker compose logs api web`.
- Check `/health/ready`.

`SocializeApiTelemetryMissing`

- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.

`SocializeApiHighErrorRate`

- Open the API logs panel.
- Filter by recent `5xx` requests.
- Open Tempo traces for slow or failing requests if available.

`SocializeApiHighLatency`

- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.

`SocializeEmailDeliveryFailures`

- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL`.
- Confirm Resend service status outside this stack if needed.

`SocializeBlobStorageFailures`

- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.

`SocializeBackgroundJobFailures`

- Check the operational events panel for the failing job name.
- Check API logs for the same time window.

`SocializeContentStaleInApproval`

- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.

`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`

- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.

## Retention Defaults

- Prometheus keeps 15 days by default through `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours.
- Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.
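
## Scripted Health Check (Sketch)

The two `curl` probes under First Bring-Up Checks, which also cover the `/health/ready` step in `SocializePreprodEndpointDown` triage, could be wrapped in a small helper. This is a minimal sketch, not a script that ships with the repository: the function names and the `API_BASE` variable are hypothetical, and it assumes a healthy endpoint answers with a 2xx status.

```bash
#!/usr/bin/env bash
# Hypothetical helper: probe the API health endpoints named in this runbook.
# Assumption: a healthy endpoint returns a 2xx HTTP status.

# Base URL of the API; the default matches the curl commands in this runbook.
API_BASE="${API_BASE:-http://127.0.0.1:8080}"

# Print the HTTP status code for a URL (curl prints 000 if the request fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Map a status code to "ok" (2xx) or "fail" (anything else).
classify_status() {
  case "$1" in
    2??) echo ok ;;
    *)   echo fail ;;
  esac
}

# Probe both endpoints, print one line per endpoint,
# and return non-zero if either probe is not "ok".
check_health() {
  local rc=0 path code result
  for path in /health /health/ready; do
    code="$(http_status "${API_BASE}${path}")"
    result="$(classify_status "$code")"
    printf '%s %s %s\n' "$path" "$code" "$result"
    [ "$result" = ok ] || rc=1
  done
  return "$rc"
}
```

After sourcing the file on the preproduction host, run `check_health`. A non-zero return suggests starting with the `SocializePreprodEndpointDown` steps above.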