feat: close preprod observability loop
docs/OPERATIONS/observability-runbook.md
# Observability Runbook

## Purpose

This runbook is for preproduction operation of Socialize's self-hosted observability stack.

The goal is to answer:

- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?

## Start The Stack

Run from the repository root on the preproduction host:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```

Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0`
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```

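A small pre-flight check can catch an unset or default admin password before Grafana is bound to a non-loopback address. The following is a minimal sketch: the function name `grafana_exposure_ok` and the list of rejected passwords are illustrative assumptions, not part of the repository.

```bash
# Hypothetical pre-flight helper (not part of the repository).
# Usage: grafana_exposure_ok <bind-address> <admin-password>
# Returns 0 when the combination looks safe to start, 1 otherwise.
grafana_exposure_ok() {
  local bind="${1:-127.0.0.1}"
  local password="${2:-}"

  # Loopback binding is the default and is not reachable from other hosts.
  case "$bind" in
    127.0.0.1|localhost|::1) return 0 ;;
  esac

  # Any other binding may be reachable: require a non-default password.
  case "$password" in
    ""|admin|password|changeme|"<strong-password>")
      echo "refusing: Grafana bound to $bind with a missing or default password" >&2
      return 1
      ;;
  esac
  return 0
}
```

It could be chained in front of the start command, for example `grafana_exposure_ok "$GRAFANA_HTTP_BIND" "$GRAFANA_ADMIN_PASSWORD" && docker compose ... up -d`. Note that an SSH tunnel such as `ssh -N -L 3000:127.0.0.1:3000 <user>@<preprod-host>` reaches Grafana through the default loopback binding, so that option does not require changing `GRAFANA_HTTP_BIND`.
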
## Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager forwards them to the webhook
configured by:

```bash
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
```

If no webhook URL is configured, Alertmanager still starts, but alerts are sent
to a local discard endpoint and never reach anyone.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.

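To confirm that delivery works end to end, one option is to post a synthetic alert to Alertmanager's v2 API and watch for it at the webhook. This is a sketch under assumptions: `ALERTMANAGER_URL`, the function names, the `RunbookTestAlert` name, and the use of a `severity` label for routing are all hypothetical and should be checked against the stack's actual configuration. Port 9093 is Alertmanager's default, but it may not be published on the host.

```bash
# Hypothetical helper: build a one-alert payload for Alertmanager's v2 API.
# Usage: build_test_alert <severity>
build_test_alert() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"RunbookTestAlert","severity":"%s"},"annotations":{"summary":"Manual delivery test from the runbook"}}]\n' "$severity"
}

# Send the payload. ALERTMANAGER_URL is an assumption: point it at wherever
# Alertmanager is reachable from the shell you are using.
send_test_alert() {
  local url="${ALERTMANAGER_URL:-http://127.0.0.1:9093}"
  build_test_alert "$1" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$url/api/v2/alerts"
}
```

Running `send_test_alert critical` should produce one notification at the configured webhook. Because the payload carries no end time, Alertmanager treats the alert as resolved once it stops being re-sent and its resolve timeout passes.
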
## Secure Grafana With Caddy

An optional Caddy snippet is available at:

```txt
deploy/observability/caddy/grafana.Caddyfile
```

Generate a Caddy password hash:

```bash
caddy hash-password --plaintext '<password>'
```

Configure:

```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
```

Keep Grafana private unless the hostname is protected.

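Once the hostname is live, it is worth verifying that an unauthenticated request is actually rejected. The sketch below assumes the Caddy snippet enforces HTTP basic auth (as the `GRAFANA_BASIC_AUTH_*` variables suggest), in which case a request without credentials should return `401`. The function names are hypothetical.

```bash
# Hypothetical check: does the Grafana hostname reject unauthenticated requests?
# Usage: http_status <url>   (prints the HTTP status code)
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Usage: is_auth_required <status-code>
# 401 means the proxy demanded credentials before forwarding to Grafana.
is_auth_required() {
  [ "$1" = "401" ]
}
```

For example: `is_auth_required "$(http_status "https://$OBSERVABILITY_HOST/")" || echo "Grafana answered without basic auth"`. Any status other than `401` (such as a redirect to Grafana's own login page) suggests the proxy is not enforcing credentials.
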
## First Bring-Up Checks

1. Confirm containers are running:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
```

2. Check API health:

```bash
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
```

3. Open Grafana and check the `Socialize Overview` dashboard.

4. Generate a few real actions:

- log in
- create a content item
- add a comment
- submit feedback
- create a workspace invite

5. Confirm metrics appear in the dashboard:

- API request rate
- usage signals
- workflow backlog
- operational events

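Right after `up -d`, the API may need a moment before `/health/ready` succeeds. A generic retry helper avoids re-running `curl` by hand. This is a minimal sketch; the `retry` function is hypothetical, and the attempt count and delay in the usage example are arbitrary.

```bash
# Hypothetical helper: retry a command until it succeeds.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}
```

For example, `retry 30 2 curl -fsS -o /dev/null http://127.0.0.1:8080/health/ready` polls the readiness endpoint for up to about a minute and returns non-zero if it never succeeds.
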
## Alert Triage

The `docker compose` commands in this section need the same `-f` flags used in Start The Stack.

`SocializePreprodEndpointDown`

- Check `docker compose ps`.
- Check `docker compose logs api web`.
- Check `/health/ready`.

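One way to avoid retyping the `-f` flags during triage is the `COMPOSE_FILE` environment variable, which Docker Compose reads as a list of files (separated by `:` on Linux and macOS). A sketch, assuming the shell is in the repository root:

```bash
# Run from the repository root. Docker Compose reads COMPOSE_FILE as the
# list of compose files to load; ':' is the separator on Linux and macOS.
export COMPOSE_FILE=deploy/compose.yml:deploy/observability/compose.observability.yml

# With the variable set, the short forms work as written, for example:
#   docker compose ps
#   docker compose logs --since 15m api web
```
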
`SocializeApiTelemetryMissing`

- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.

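The endpoint check can be made explicit by comparing the value the running container sees against the expected one. This is a sketch: `check_otlp_endpoint` is a hypothetical helper, and the usage example assumes the `api` image ships `printenv`, which minimal images may not.

```bash
# Hypothetical helper: compare the OTLP endpoint the API container sees
# with the expected value.
# Usage: check_otlp_endpoint <actual> [expected]
check_otlp_endpoint() {
  local actual="$1"
  local expected="${2:-http://alloy:4317}"
  if [ "$actual" = "$expected" ]; then
    echo "ok: OTLP endpoint is $actual"
  else
    echo "mismatch: got '${actual:-<unset>}', expected '$expected'"
    return 1
  fi
}
```

For example: `check_otlp_endpoint "$(docker compose exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT)"`. If `printenv` is unavailable in the image, `docker compose config` shows the environment Compose resolved for the service.
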
`SocializeApiHighErrorRate`

- Open the API logs panel.
- Filter by recent `5xx` requests.
- Open Tempo traces for slow or failing requests if available.

`SocializeApiHighLatency`

- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.

`SocializeEmailDeliveryFailures`

- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL`.
- Confirm Resend service status outside this stack if needed.

`SocializeBlobStorageFailures`

- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.

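For the blob storage checks, the disk-space step can be scripted with POSIX `df`. The helpers below are a hypothetical sketch; the 90% default threshold is an arbitrary choice, not a value taken from the alert rules.

```bash
# Hypothetical helper: print the used-space percentage of the filesystem
# holding the given path (fifth column of POSIX `df -P` output).
disk_usage_percent() {
  df -P "${1:-.}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: disk_is_low <used-percent> [threshold-percent]
# Succeeds when usage is at or above the threshold (default 90).
disk_is_low() {
  [ "$1" -ge "${2:-90}" ]
}
```

For example: `disk_is_low "$(disk_usage_percent ./blob-storage)" && echo "blob storage filesystem is nearly full"`. Ownership and mode of the directory can be inspected with `ls -ld ./blob-storage`.
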
`SocializeBackgroundJobFailures`

- Check the operational events panel for the failing job name.
- Check API logs for the same time window.

`SocializeContentStaleInApproval`

- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.

`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`

- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.

## Retention Defaults

- Prometheus keeps 15 days of metrics by default; override with `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.
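
As an illustration, a longer Prometheus retention could be set in the environment that Compose reads. This sketch assumes `PROMETHEUS_RETENTION` is passed through to Prometheus's `--storage.tsdb.retention.time` flag, which accepts durations such as `15d`; confirm that in `deploy/observability/compose.observability.yml` before relying on it.

```bash
# Example override for the environment used by Compose.
# "30d" is an illustrative value, not a recommendation.
PROMETHEUS_RETENTION=30d
```

Longer retention increases disk usage on the preproduction host; `docker system df -v` gives a per-volume view of how much space the stack currently occupies.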