Observability Runbook

Purpose

This runbook is for preproduction operation of Socialize's self-hosted observability stack.

The goal is to answer:

  • Is the app reachable?
  • Is the API healthy?
  • Are errors or latency rising?
  • Are users exercising core workflows?
  • Are emails, blob storage, and background jobs failing?
  • Is work getting stuck?

Start The Stack

Run from the repository root on the preproduction host:

docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
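The same pair of -f flags is needed for every compose command against this stack. A minimal sketch of a shell wrapper that keeps them consistent is shown here; the function name dc is hypothetical and is not defined anywhere in the repository.

```shell
# Hypothetical helper: wraps docker compose with both compose files used by this runbook.
# Run it from the repository root so the relative paths resolve.
dc() {
  docker compose \
    -f deploy/compose.yml \
    -f deploy/observability/compose.observability.yml \
    "$@"
}

# Examples (not executed here):
#   dc up -d
#   dc ps
#   dc logs --tail=100 api web
```

Defining the wrapper in the operator's shell session avoids accidentally running a command against only one of the two compose files.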

Grafana listens on 127.0.0.1:3000 by default. Set GRAFANA_HTTP_BIND=0.0.0.0 only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
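A preflight check can catch an exposed bind address paired with a missing admin password. The following is a sketch only: the function name grafana_exposure_check and the 16-character minimum are illustrative assumptions, not project rules.

```shell
# Sketch: refuse to proceed when Grafana would listen beyond loopback
# with a missing or short admin password.
# The 16-character minimum is an assumed threshold for illustration.
grafana_exposure_check() {
  local bind="${GRAFANA_HTTP_BIND:-127.0.0.1}"
  local password="${GRAFANA_ADMIN_PASSWORD:-}"

  # Loopback binding is the default described above.
  if [ "$bind" = "127.0.0.1" ]; then
    echo "ok: Grafana bound to loopback"
    return 0
  fi

  if [ -z "$password" ] || [ "${#password}" -lt 16 ]; then
    echo "refusing: set GRAFANA_ADMIN_PASSWORD (16+ characters) before binding to $bind" >&2
    return 1
  fi

  echo "ok: Grafana bound to $bind with an admin password set"
}
```

Running grafana_exposure_check before starting the stack returns a non-zero status when the password is missing, so it can gate a deploy script. It does not verify that a reverse proxy, VPN, or firewall is actually in place.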

Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook configured by:

ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>

If no webhook URL is configured, Alertmanager still starts, but alerts are routed to a local discard endpoint and are not delivered anywhere.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
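To confirm that the webhook actually receives alerts, a synthetic alert can be posted to Alertmanager's v2 API. This is a hedged sketch: the alert name RunbookDeliveryTest, the severity label, and the ALERTMANAGER_URL variable are assumptions for illustration, and this runbook does not state whether the compose file publishes Alertmanager's default port 9093 on the host.

```shell
# Sketch: build a synthetic alert payload for Alertmanager's v2 API.
# The alert name and the severity label are hypothetical; match them to the
# routing labels actually used by this stack.
build_test_alert() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"RunbookDeliveryTest","severity":"%s"},"annotations":{"summary":"Manual delivery test from the observability runbook"}}]\n' "$severity"
}

# Assumes Alertmanager is reachable at ALERTMANAGER_URL (hypothetical variable),
# falling back to the default port 9093 on loopback.
send_test_alert() {
  build_test_alert "$1" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL:-http://127.0.0.1:9093}/api/v2/alerts"
}

# Example (not executed here):
#   send_test_alert critical
```

If the webhook stays silent after a test alert, check whether ALERTMANAGER_WEBHOOK_URL was set when the stack started.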

Secure Grafana With Caddy

An optional Caddy snippet is available at:

deploy/observability/caddy/grafana.Caddyfile

Generate a Caddy password hash:

caddy hash-password --plaintext '<password>'

Configure:

OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>

Do not expose Grafana publicly unless the hostname is protected, for example by the basic auth configured above.

First Bring-Up Checks

  1. Confirm containers are running:
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
  2. Check API health:
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
  3. Open Grafana and check the Socialize Overview dashboard.
  4. Generate a few real actions:
  • log in
  • create a content item
  • add a comment
  • submit feedback
  • create a workspace invite
  5. Confirm metrics appear in the dashboard:
  • API request rate
  • usage signals
  • workflow backlog
  • operational events
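Right after startup the readiness endpoint may fail for a short while, so the health check in the steps above can be wrapped in a retry loop. The sketch below assumes the function names probe and wait_for_ready, and the defaults of 30 attempts with a 2-second delay; none of these come from the repository.

```shell
# Sketch: poll the readiness endpoint until curl reports success
# (with -f, curl fails on HTTP status 400 and above).
# probe is split out so it can be replaced with a different check.
probe() {
  curl -fsS -o /dev/null --max-time 5 "$1"
}

wait_for_ready() {
  local url="${1:-http://127.0.0.1:8080/health/ready}"
  local attempts="${2:-30}"   # assumed default
  local delay="${3:-2}"       # seconds between attempts, assumed default
  local i=1

  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done

  echo "not ready after $attempts attempt(s): $url" >&2
  return 1
}
```

Calling wait_for_ready with no arguments polls http://127.0.0.1:8080/health/ready; a non-zero exit status means the API never became ready within the allotted attempts.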

Alert Triage

SocializePreprodEndpointDown

  • Check docker compose ps (with the same -f flags used to start the stack).
  • Check docker compose logs api web.
  • Check /health/ready.

SocializeApiTelemetryMissing

  • Check that api has OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317.
  • Check docker compose logs alloy.
  • Check whether the API is receiving traffic.
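The first check above can be scripted by reading the variable from the running container and comparing it with the expected value. This is a sketch under assumptions: the function name check_otlp_endpoint is hypothetical, and the commented example assumes the api image ships a printenv binary.

```shell
# Sketch: compare the API container's OTLP endpoint with the value this runbook expects.
check_otlp_endpoint() {
  local actual="$1"
  local expected="http://alloy:4317"

  if [ "$actual" = "$expected" ]; then
    echo "ok: OTLP endpoint is $expected"
    return 0
  fi

  echo "mismatch: expected $expected, got '${actual:-<unset>}'" >&2
  return 1
}

# Example (not executed here; add the stack's -f flags to docker compose as needed):
#   check_otlp_endpoint "$(docker compose exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT)"
```

A mismatch only shows that the variable is wrong inside the container; a correct value with missing telemetry points toward alloy itself or a lack of traffic.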

SocializeApiHighErrorRate

  • Open the API logs panel.
  • Filter by recent 5xx requests.
  • Open Tempo traces for slow or failing requests if available.

SocializeApiHighLatency

  • Check the p95 latency by endpoint panel.
  • Inspect slow traces.
  • Check database health and recent deploy activity.

SocializeEmailDeliveryFailures

  • Check API logs for Resend failures.
  • Confirm RESEND_API_KEY and RESEND_FROM_EMAIL.
  • Confirm Resend service status outside this stack if needed.

SocializeBlobStorageFailures

  • Confirm ./blob-storage volume permissions on the preprod host.
  • Check local disk space.
  • Check API logs for validation or filesystem errors.
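The permission and disk-space checks above can be combined into one host-side function. The following sketch assumes the name check_blob_dir and a 1 GiB free-space threshold, both illustrative. It tests writability as the current host user, which may differ from the user the API container runs as.

```shell
# Sketch: verify that the blob storage directory exists, is writable by the
# current user, and has a minimum amount of free space.
check_blob_dir() {
  local dir="${1:-./blob-storage}"
  local min_free_kb="${2:-1048576}"   # 1 GiB, an assumed threshold
  local free_kb

  if [ ! -d "$dir" ]; then
    echo "missing: $dir" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable by $(id -un): $dir" >&2
    return 1
  fi

  # df -Pk prints POSIX-format output in KiB; column 4 of the data row is free space.
  free_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  if [ "$free_kb" -lt "$min_free_kb" ]; then
    echo "low disk: ${free_kb} KiB free under $dir" >&2
    return 1
  fi

  echo "ok: $dir is writable with ${free_kb} KiB free"
}
```

If this check passes on the host but the alert persists, the API logs remain the better source for validation or filesystem errors raised inside the container.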

SocializeBackgroundJobFailures

  • Check the operational events panel for the failing job name.
  • Check API logs for the same time window.

SocializeContentStaleInApproval

  • Use the app to inspect content currently in approval.
  • Contact the relevant internal owner or client contact outside the app if needed.

SocializeCoreUsageQuiet or SocializeNoActiveWorkspaces

  • Confirm whether quiet usage is expected for the period.
  • If not expected, check login events and API reachability.

Retention Defaults

  • Prometheus keeps 15 days of metrics by default; override this with PROMETHEUS_RETENTION.
  • Tempo keeps traces for 168 hours (7 days).
  • Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.