social-media/docs/OPERATIONS/observability-runbook.md

# Observability Runbook

## Purpose

This runbook is for preproduction operation of Socialize's self-hosted observability stack.

It should help an operator answer:

- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?

## Start The Stack

Run from the repository root on the preproduction host:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```

Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0`
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```
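
A strong admin password can be generated on the host instead of being invented by hand. The following is a minimal sketch, assuming `/dev/urandom` and the `base64` utility are available; the function name `gen_password` is illustrative and not part of this repository.

```bash
# Hypothetical helper (not part of the repo): print a random password
# derived from 32 bytes of kernel entropy.
gen_password() {
  # 32 random bytes, base64-encoded, with any trailing newline stripped.
  head -c 32 /dev/urandom | base64 | tr -d '\n'
}

# Print a line that can be pasted into the environment configuration.
printf 'GRAFANA_ADMIN_PASSWORD=%s\n' "$(gen_password)"
```

Store the generated value in a password manager; the script does not persist it anywhere.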

## Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager sends alerts to the webhook
configured by:

```bash
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
```

If no webhook URL is configured, Alertmanager still starts, but alerts are routed
to a local discard endpoint and no notification is delivered.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
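
Because a missing webhook fails silently, a quick pre-flight check can help before relying on alerts. This is a sketch only: it assumes `ALERTMANAGER_WEBHOOK_URL` is exported in the current shell (it may instead live in an env file), and `check_webhook_url` is a hypothetical helper name.

```bash
# Hypothetical pre-flight check: succeed only when the webhook URL is set
# and looks like an http(s) URL. It does not contact the webhook.
check_webhook_url() {
  case "${ALERTMANAGER_WEBHOOK_URL:-}" in
    http://*|https://*)
      echo "webhook configured"
      ;;
    "")
      echo "ALERTMANAGER_WEBHOOK_URL is empty: alerts will be discarded" >&2
      return 1
      ;;
    *)
      echo "ALERTMANAGER_WEBHOOK_URL is not an http(s) URL" >&2
      return 1
      ;;
  esac
}
```

Run `check_webhook_url` before starting the stack; a non-zero exit status means alerts would not leave the host.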

## Secure Grafana With Caddy

An optional Caddy snippet is available at:

```txt
deploy/observability/caddy/grafana.Caddyfile
```

Generate a Caddy password hash:

```bash
caddy hash-password --plaintext '<password>'
```

Configure:

```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
```

Keep Grafana bound to `127.0.0.1` unless the hostname is protected by this basic-auth configuration or an equivalent control.
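
One way to confirm the hostname is protected is to send an unauthenticated request and expect a rejection. The sketch below assumes the Caddy snippet enforces HTTP basic auth on every path, in which case a request without credentials should receive `401`; `grafana_is_protected` is a hypothetical helper name.

```bash
# Hypothetical check: an unauthenticated request to the Grafana hostname
# should be rejected with HTTP 401 when basic auth is active.
grafana_is_protected() {
  host="$1"
  # -s silences progress, -o discards the body, -w prints only the status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/")
  [ "$code" = "401" ]
}
```

For example, `grafana_is_protected "$OBSERVABILITY_HOST" && echo "protected"` prints `protected` only when the request was rejected.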

## First Bring-Up Checks

1. Confirm containers are running:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
```

2. Check API health:

```bash
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
```

3. Open Grafana and check the `Socialize Overview` dashboard.

4. Generate a few real actions:

- log in
- create a content item
- add a comment
- submit feedback
- create a workspace invite

5. Confirm metrics appear in the dashboard:

- API request rate
- usage signals
- workflow backlog
- operational events
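
The health check in step 2 can be turned into a small polling loop, which is handy right after `up -d` while containers are still starting. This is a sketch under the assumption that `/health/ready` returns HTTP 200 once the API is ready; `wait_for_ready` and its defaults are illustrative.

```bash
# Hypothetical helper: poll the readiness endpoint until it returns HTTP 200
# or the attempt budget runs out.
wait_for_ready() {
  url="${1:-http://127.0.0.1:8080/health/ready}"
  attempts="${2:-30}"   # maximum number of requests
  delay="${3:-2}"       # seconds to wait between requests
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints 000 as the status code when the connection fails.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts (last status: $code)" >&2
  return 1
}
```

Calling `wait_for_ready` with no arguments polls the local API for up to about a minute.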

## Alert Triage

`SocializePreprodEndpointDown`

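- Check `docker compose ps`, passing the same `-f` flags used in Start The Stack unless `COMPOSE_FILE` is set (this applies to every `docker compose` command in this section).
- Check `docker compose logs api web`.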
- Check `/health/ready`.

`SocializeApiTelemetryMissing`

- Check that the `api` service has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317` set.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.
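
The endpoint check can be scripted by reading the variable from inside the running container. This sketch assumes the `api` image ships a `printenv` binary, which may not hold for minimal images; `check_otel_endpoint` and `COMPOSE_FILES` are hypothetical names.

```bash
# Compose file flags used throughout this runbook.
COMPOSE_FILES="-f deploy/compose.yml -f deploy/observability/compose.observability.yml"

# Hypothetical check: compare the exporter endpoint seen inside the api
# container with the value this runbook expects.
check_otel_endpoint() {
  expected="http://alloy:4317"
  # -T disables TTY allocation so the output can be captured cleanly.
  # $COMPOSE_FILES is intentionally unquoted so it splits into arguments.
  actual=$(docker compose $COMPOSE_FILES exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT)
  if [ "$actual" = "$expected" ]; then
    echo "OTLP endpoint OK"
  else
    echo "OTLP endpoint mismatch: got '${actual:-<unset>}', want '$expected'" >&2
    return 1
  fi
}
```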

`SocializeApiHighErrorRate`

- Open the API logs panel.
- Filter for recent `5xx` responses.
- Open Tempo traces for slow or failing requests if available.

`SocializeApiHighLatency`

- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.

`SocializeEmailDeliveryFailures`

- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL` are set and correct.
- Confirm Resend service status outside this stack if needed.

`SocializeBlobStorageFailures`

- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.
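
The first two checks can be combined into one host-side script. This is a sketch with assumptions: the 90% threshold is arbitrary, `check_blob_storage` is a hypothetical name, and the writability test applies to the user running the script, which may differ from the user the API container runs as.

```bash
# Hypothetical check: verify the blob directory exists, is writable by the
# current user, and sits on a filesystem below a usage threshold.
check_blob_storage() {
  dir="${1:-./blob-storage}"
  max_pct="${2:-90}"
  if [ ! -d "$dir" ]; then
    echo "$dir does not exist" >&2
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "$dir is not writable by $(id -un)" >&2
    return 1
  fi
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. 42%.
  used=$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  case "$used" in
    ''|*[!0-9]*) echo "could not read disk usage for $dir" >&2; return 1 ;;
  esac
  if [ "$used" -ge "$max_pct" ]; then
    echo "$dir filesystem is ${used}% full (threshold ${max_pct}%)" >&2
    return 1
  fi
  echo "$dir OK: ${used}% used, writable"
}
```

Run it from the directory that contains `./blob-storage`, or pass the path explicitly as the first argument.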

`SocializeBackgroundJobFailures`

- Check the operational events panel for the failing job name.
- Check API logs for the same time window.
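
Narrowing the API logs to the alert's time window can be done with the `--since` option of `docker compose logs`. The helper below is a sketch; `api_logs_since` is a hypothetical name, and the job name in the usage example is invented for illustration.

```bash
# Hypothetical helper: print API logs for a recent window, optionally
# filtered by a job name taken from the operational events panel.
api_logs_since() {
  window="${1:-30m}"   # relative duration accepted by --since, e.g. 15m or 2h
  pattern="${2:-}"     # optional literal text to match, case-insensitive
  if [ -n "$pattern" ]; then
    docker compose -f deploy/compose.yml \
      -f deploy/observability/compose.observability.yml \
      logs --since "$window" --no-color api | grep -iF -- "$pattern"
  else
    docker compose -f deploy/compose.yml \
      -f deploy/observability/compose.observability.yml \
      logs --since "$window" --no-color api
  fi
}
```

For example, `api_logs_since 15m nightly-digest` would show only the lines from the last 15 minutes that mention a job called `nightly-digest`.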

`SocializeContentStaleInApproval`

- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.

`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`

- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.

## Retention Defaults

- Prometheus keeps 15 days of metrics by default; override this with `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.