Observability Runbook
Purpose
This runbook is for preproduction operation of Socialize's self-hosted observability stack.
The goal is to answer:
- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?
Start The Stack
Run from the repository root on the preproduction host:
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
Grafana listens on 127.0.0.1:3000 by default. Set GRAFANA_HTTP_BIND=0.0.0.0
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.
Set these before exposing Grafana:
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
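The bind and password rules above can be checked before the stack is restarted. The following is a minimal sketch, not part of the repository: the function name `check_grafana_exposure` and the list of rejected placeholder passwords are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical pre-flight guard; not part of the Socialize repository.
# Returns 0 when the settings look safe to apply, 1 otherwise.
check_grafana_exposure() {
    bind="${GRAFANA_HTTP_BIND:-127.0.0.1}"   # documented default bind address
    pass="${GRAFANA_ADMIN_PASSWORD:-}"

    # Loopback-only binding needs no further checks here.
    if [ "$bind" = "127.0.0.1" ]; then
        echo "ok: Grafana bound to loopback"
        return 0
    fi

    # Any other bind address requires a non-placeholder admin password.
    # The rejected values below are an assumption, not a project rule.
    case "$pass" in
        ""|admin|"<strong-password>")
            echo "refuse: set GRAFANA_ADMIN_PASSWORD before binding to $bind"
            return 1
            ;;
    esac

    echo "ok: Grafana bound to $bind with an admin password set"
    return 0
}
```

Calling `check_grafana_exposure` in the shell that holds the deployment variables prints one line and returns non-zero when Grafana would be exposed without a real password. It does not verify that a reverse proxy, VPN, firewall rule, or SSH tunnel is actually in place.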
Alert Delivery
Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook configured by:
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
If no webhook URL is configured, Alertmanager still starts, but alerts are routed to a local discard endpoint and are not delivered anywhere.
Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
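To confirm that delivery reaches the webhook, a synthetic alert can be posted to Alertmanager's v2 API. This is a sketch under stated assumptions: the `severity` label name, the `RunbookDeliveryTest` alert name, the `ALERTMANAGER_URL` variable, and the reachability of Alertmanager on port 9093 (its upstream default) are not confirmed by this runbook.

```shell
#!/bin/sh
# Builds a one-alert JSON payload for Alertmanager's v2 API.
# The "severity" label name is an assumption about how this stack routes alerts.
build_test_alert() {
    severity="${1:-warning}"
    printf '[{"labels":{"alertname":"RunbookDeliveryTest","severity":"%s"},"annotations":{"summary":"Manual delivery test from the runbook"}}]\n' "$severity"
}

# Posts the payload to Alertmanager.
# ALERTMANAGER_URL is a hypothetical variable; 9093 is Alertmanager's default
# port, but this stack may not publish it on the host.
send_test_alert() {
    am_url="${ALERTMANAGER_URL:-http://127.0.0.1:9093}"
    build_test_alert "$1" | curl -fsS -X POST \
        -H 'Content-Type: application/json' \
        --data-binary @- "$am_url/api/v2/alerts"
}
```

Running `send_test_alert critical` should result in one webhook delivery after Alertmanager's grouping delay. If nothing arrives, check whether `ALERTMANAGER_WEBHOOK_URL` is set, since an unset value sends alerts to the discard endpoint.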
Secure Grafana With Caddy
An optional Caddy snippet is available at:
deploy/observability/caddy/grafana.Caddyfile
Generate a Caddy password hash:
caddy hash-password --plaintext '<password>'
Configure:
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
Keep Grafana bound to 127.0.0.1 unless the hostname is protected by this basic-auth configuration or an equivalent control.
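Whether the hostname is actually protected can be probed from outside with two requests, one anonymous and one with credentials. The sketch below assumes the Caddy snippet serves Grafana over HTTPS at the root of `OBSERVABILITY_HOST`; the function names are hypothetical.

```shell
#!/bin/sh
# Interprets two HTTP status codes: one from an anonymous request and one
# from a request that carried basic-auth credentials.
auth_gate_verdict() {
    anon="$1"
    authed="$2"
    if [ "$anon" != "401" ]; then
        echo "unprotected: anonymous request returned $anon"
        return 1
    fi
    case "$authed" in
        2??|3??)
            echo "protected: anonymous 401, authenticated $authed"
            ;;
        *)
            echo "broken: credentials rejected or upstream failing ($authed)"
            return 1
            ;;
    esac
}

# Probes the public hostname. Assumes HTTPS at the root path.
# Usage: probe_grafana_auth <user> <password>
probe_grafana_auth() {
    url="https://${OBSERVABILITY_HOST}/"
    anon=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
    authed=$(curl -s -o /dev/null -m 10 -w '%{http_code}' -u "$1:$2" "$url")
    auth_gate_verdict "$anon" "$authed"
}
```

Passing the password as an argument exposes it in the process list, so run this only from a trusted workstation.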
First Bring-Up Checks
- Confirm containers are running:
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
- Check API health:
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
- Open Grafana and check the Socialize Overview dashboard.
- Generate a few real actions:
  - log in
  - create a content item
  - add a comment
  - submit feedback
  - create a workspace invite
- Confirm metrics appear in the dashboard:
  - API request rate
  - usage signals
  - workflow backlog
  - operational events
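The two API health requests from the checklist above can be wrapped so that each endpoint yields a one-line verdict. This is a sketch: treating any 2xx status as healthy is an assumption about the API, and `API_BASE_URL` is a hypothetical override.

```shell
#!/bin/sh
# Maps an HTTP status code to a coarse verdict.
# curl reports 000 when no HTTP response was received at all.
classify_status() {
    case "$1" in
        2??) echo "healthy" ;;
        000) echo "unreachable" ;;
        *)   echo "unhealthy" ;;
    esac
}

# Requests both health endpoints and prints "<path> <code> <verdict>".
check_health() {
    base="${API_BASE_URL:-http://127.0.0.1:8080}"
    for path in /health /health/ready; do
        code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$base$path")
        echo "$path $code $(classify_status "$code")"
    done
}
```

Run `check_health` on the preproduction host after the containers report as running. A healthy `/health` with an unhealthy `/health/ready` suggests the API process is up but a dependency is not ready.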
Alert Triage
SocializePreprodEndpointDown
- Check docker compose ps.
- Check docker compose logs api web.
- Check /health/ready.
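When many containers are involved, it can be quicker to list only the services that are defined but not running. The sketch below assumes a Docker Compose v2 release that supports `config --services` and `ps --services --status running`; the helper names are hypothetical.

```shell
#!/bin/sh
# Prints each name from the first whitespace-separated list that is absent
# from the second list, one per line.
missing_services() {
    for svc in $1; do
        case " $2 " in
            *" $svc "*) ;;
            *) echo "$svc" ;;
        esac
    done
}

# Wrapper around the compose invocation used elsewhere in this runbook.
compose() {
    docker compose -f deploy/compose.yml \
        -f deploy/observability/compose.observability.yml "$@"
}

# Lists services that are defined in the compose files but not running.
report_down_services() {
    expected=$(compose config --services | tr '\n' ' ')
    running=$(compose ps --services --status running | tr '\n' ' ')
    missing_services "$expected" "$running"
}
```

Run `report_down_services` from the repository root, then pass any names it prints to `compose logs` to see why they stopped.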
SocializeApiTelemetryMissing
- Check that api has OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317.
- Check docker compose logs alloy.
- Check whether the API is receiving traffic.
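The endpoint check can be made explicit by comparing the value inside the running container with the expected one. This sketch assumes the api image ships a `printenv` binary, which this runbook does not confirm; the function name is hypothetical.

```shell
#!/bin/sh
# Compares an observed OTLP endpoint value with the one this runbook expects.
# Returns 0 on a match, 1 otherwise.
check_otlp_endpoint() {
    expected="http://alloy:4317"
    actual="$1"
    if [ "$actual" = "$expected" ]; then
        echo "ok"
        return 0
    fi
    echo "mismatch: got '${actual:-<unset>}', want '$expected'"
    return 1
}

# Example call (assumes printenv exists in the api container image):
#   check_otlp_endpoint "$(docker compose exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT)"
```

A match only shows that the API is configured to export. If the alert persists, the Alloy logs and the traffic check remain the next steps.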
SocializeApiHighErrorRate
- Open the API logs panel.
- Filter by recent 5xx requests.
- Open Tempo traces for slow or failing requests if available.
SocializeApiHighLatency
- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.
SocializeEmailDeliveryFailures
- Check API logs for Resend failures.
- Confirm RESEND_API_KEY and RESEND_FROM_EMAIL.
- Confirm Resend service status outside this stack if needed.
SocializeBlobStorageFailures
- Confirm ./blob-storage volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.
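The permission and disk checks above can be combined into one host-side probe. This is a sketch: the 1 GiB free-space threshold is an arbitrary assumption, and the check runs as the current host user, which may differ from the user the API container runs as.

```shell
#!/bin/sh
# Checks that a directory exists, is writable by the current user, and has
# at least a minimum amount of free space (in KiB) on its filesystem.
# Usage: check_blob_dir [dir] [min_free_kb]
check_blob_dir() {
    dir="${1:-./blob-storage}"
    min_free_kb="${2:-1048576}"   # hypothetical threshold: 1 GiB

    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable by $(id -un): $dir"
        return 1
    fi

    # -P forces one line per filesystem; -k reports 1024-byte blocks.
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ "$free_kb" -lt "$min_free_kb" ]; then
        echo "low disk: $free_kb KiB free, want at least $min_free_kb KiB"
        return 1
    fi

    echo "ok: $dir writable, $free_kb KiB free"
    return 0
}
```

Run `check_blob_dir` from the directory that contains `./blob-storage` on the preprod host. If it passes but uploads still fail, compare the directory owner with the user ID used inside the container.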
SocializeBackgroundJobFailures
- Check the operational events panel for the failing job name.
- Check API logs for the same time window.
SocializeContentStaleInApproval
- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.
SocializeCoreUsageQuiet or SocializeNoActiveWorkspaces
- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.
Retention Defaults
- Prometheus keeps 15 days of metrics by default; change this through PROMETHEUS_RETENTION.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.
Tune retention before heavy customer usage or long-running demos.