# Observability Runbook
## Purpose
This runbook is for preproduction operation of Socialize's self-hosted observability stack.
The goal is to answer:
- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?
## Start The Stack
Run from the repository root on the preproduction host:
```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```
Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0`
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.
Set these before exposing Grafana:
```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```
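When Grafana stays bound to `127.0.0.1`, an SSH tunnel is one of the access options named above. The helper below is a minimal sketch of that approach; the function name `open_grafana_tunnel` and the SSH target are hypothetical, and it assumes Grafana is listening on port 3000 on the preproduction host.

```bash
#!/usr/bin/env bash
# Sketch: forward a local port to Grafana's loopback listener on the preprod host.
# ASSUMPTION: the function name and the SSH target are illustrative only.

# open_grafana_tunnel <ssh-target> [local-port]
open_grafana_tunnel() {
  local target="${1:?usage: open_grafana_tunnel <ssh-target> [local-port]}"
  local local_port="${2:-3000}"
  # -N: run no remote command; -L: forward local_port to 127.0.0.1:3000 on the host.
  ssh -N -L "${local_port}:127.0.0.1:3000" "$target"
}
```

For example, `open_grafana_tunnel <user>@<preprod-host>` would make Grafana available at `http://127.0.0.1:3000` on the operator's machine for as long as the command runs.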
## Alert Delivery
Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook
configured by:
```bash
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
```
If no webhook URL is configured, Alertmanager still starts, but alerts are sent to a
local discard endpoint and nobody is notified.
Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
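To confirm that the webhook actually receives alerts, one option is to push a synthetic alert into Alertmanager through its v2 HTTP API. This is a sketch under assumptions: it presumes Alertmanager is reachable at `127.0.0.1:9093` (its default port, which the compose file may not publish on the host), that routing keys on a `severity` label, and the alert name `RunbookDeliveryTest` and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: send one synthetic alert to Alertmanager to exercise webhook delivery.
# ASSUMPTION: Alertmanager is reachable on its default port 9093 from this host.
ALERTMANAGER_URL="${ALERTMANAGER_URL:-http://127.0.0.1:9093}"

# Print the JSON body for POST /api/v2/alerts.
# The alert name and the severity label are hypothetical test values.
test_alert_payload() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"RunbookDeliveryTest","severity":"%s"},"annotations":{"summary":"Manual delivery test from the runbook"}}]' "$severity"
}

# send_test_alert [severity]
send_test_alert() {
  # -f makes curl exit non-zero on HTTP 4xx/5xx; the payload is read from stdin.
  test_alert_payload "${1:-warning}" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL}/api/v2/alerts"
}
```

Running `send_test_alert critical` and watching the webhook receiver is a quick way to check delivery end to end. The alert carries no end time, so Alertmanager resolves it on its own after its configured resolve timeout.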
## Secure Grafana With Caddy
An optional Caddy snippet is available at:
```txt
deploy/observability/caddy/grafana.Caddyfile
```
Generate a Caddy password hash:
```bash
caddy hash-password --plaintext '<password>'
```
Configure:
```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
```
Keep Grafana bound to `127.0.0.1` unless the hostname is protected by this basic auth or an equivalent control.
## First Bring-Up Checks
1. Confirm containers are running:
```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
```
2. Check API health:
```bash
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
```
3. Open Grafana and check the `Socialize Overview` dashboard.
4. Generate a few real actions:
- log in
- create a content item
- add a comment
- submit feedback
- create a workspace invite
5. Confirm metrics appear in the dashboard:
- API request rate
- usage signals
- workflow backlog
- operational events
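Right after `up -d` the API may need a few seconds before `/health/ready` succeeds, so the health check in step 2 can be wrapped in a small polling loop. This is a sketch only; the function name `wait_for_ready`, the `API_BASE_URL` variable, and the default of 30 attempts two seconds apart are assumptions, not part of the stack.

```bash
#!/usr/bin/env bash
# Sketch: poll the readiness endpoint until it returns 2xx or attempts run out.
# ASSUMPTION: API_BASE_URL and the retry defaults are illustrative choices.
API_BASE_URL="${API_BASE_URL:-http://127.0.0.1:8080}"

# wait_for_ready [attempts] [delay_seconds]
wait_for_ready() {
  local attempts="${1:-30}" delay="${2:-2}" i
  for ((i = 1; i <= attempts; i++)); do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null "${API_BASE_URL}/health/ready"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

Calling `wait_for_ready` before opening Grafana avoids mistaking a slow start for a failed deploy; a non-zero exit is the signal to move on to the triage steps below.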
## Alert Triage
`SocializePreprodEndpointDown`
- Check `docker compose ps`, passing the same `-f` files used to start the stack.
- Check `docker compose logs api web` with the same `-f` files.
- Check `/health/ready`.
`SocializeApiTelemetryMissing`
- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.
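The first check above can be scripted by reading the variable from inside the running container. The sketch below assumes the compose files and the `api` service name shown elsewhere in this runbook; the function name `check_otlp_endpoint` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: compare the OTLP endpoint inside the api container with the expected value.
COMPOSE_FILES=(-f deploy/compose.yml -f deploy/observability/compose.observability.yml)
EXPECTED_OTLP_ENDPOINT="http://alloy:4317"

check_otlp_endpoint() {
  local actual
  # -T disables pseudo-TTY allocation so the output can be captured cleanly.
  # printenv exits non-zero when the variable is unset in the container.
  actual="$(docker compose "${COMPOSE_FILES[@]}" exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT)" || {
    echo "could not read OTEL_EXPORTER_OTLP_ENDPOINT (unset, or api is not running)" >&2
    return 1
  }
  if [ "$actual" = "$EXPECTED_OTLP_ENDPOINT" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: got '$actual', expected '$EXPECTED_OTLP_ENDPOINT'" >&2
    return 1
  fi
}
```

Run it from the repository root so the relative compose paths resolve.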
`SocializeApiHighErrorRate`
- Open the API logs panel.
- Filter by recent `5xx` requests.
- Open Tempo traces for slow or failing requests if available.
`SocializeApiHighLatency`
- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.
`SocializeEmailDeliveryFailures`
- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL`.
- Confirm Resend service status outside this stack if needed.
`SocializeBlobStorageFailures`
- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.
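The permission and disk-space checks can be combined into one helper. This is a rough sketch: the function names and the 90% threshold are assumptions, and the writability test runs as the current shell user, which may differ from the user the API container runs as.

```bash
#!/usr/bin/env bash
# Sketch: check that the blob directory exists, is writable, and has disk headroom.
# ASSUMPTION: the 90% default threshold is an illustrative choice.
BLOB_DIR="${BLOB_DIR:-./blob-storage}"

# Print the used-space percentage (without the % sign) for the filesystem holding a path.
disk_used_pct() {
  # -P forces the POSIX single-line format, so the capacity is always field 5 of line 2.
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# check_blob_dir [dir] [threshold_pct]
check_blob_dir() {
  local dir="${1:-$BLOB_DIR}" threshold="${2:-90}" pct
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  # Writability is tested for the current user, not the container user.
  [ -w "$dir" ] || { echo "not writable by $(id -un): $dir" >&2; return 1; }
  pct="$(disk_used_pct "$dir")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "disk ${pct}% used (threshold ${threshold}%)" >&2
    return 1
  fi
  echo "ok: $dir writable, disk ${pct}% used"
}
```

Run `check_blob_dir` from the directory that contains `./blob-storage` on the preprod host; a failure message points at which of the three conditions to investigate first.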
`SocializeBackgroundJobFailures`
- Check the operational events panel for the failing job name.
- Check API logs for the same time window.
`SocializeContentStaleInApproval`
- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.
`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`
- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.
## Retention Defaults
- Prometheus keeps 15 days of metrics by default, configured through `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.
Tune retention before heavy customer usage or long-running demos.