# Observability Runbook

## Purpose

This runbook covers operating Socialize's self-hosted observability stack in preproduction.

The stack should answer these questions:

- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?

## Start The Stack

Run from the repository root on the preproduction host:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```

Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0`
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```

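The exposure rule above can be turned into a pre-flight check. The following is a minimal sketch, not part of the repository: the function name `check_grafana_exposure` is hypothetical, and treating `admin` as the default password to reject is an assumption. It reads the same variables described above from the current shell.

```bash
# Hypothetical pre-flight guard: refuse a non-loopback Grafana bind unless an
# admin password is set and is not the assumed default ("admin").
check_grafana_exposure() {
  bind="${GRAFANA_HTTP_BIND:-127.0.0.1}"
  password="${GRAFANA_ADMIN_PASSWORD:-}"

  # A loopback bind is reachable only from the host itself.
  case "$bind" in
    127.0.0.1|localhost)
      echo "ok: Grafana bound to loopback"
      return 0
      ;;
  esac

  if [ -z "$password" ] || [ "$password" = "admin" ]; then
    echo "refuse: GRAFANA_HTTP_BIND=$bind but GRAFANA_ADMIN_PASSWORD is unset or default"
    return 1
  fi
  echo "ok: Grafana bound to $bind with a non-default admin password"
}
```

Running `check_grafana_exposure && docker compose ... up -d` would then block an accidental public bind. The check does not prove that a proxy, VPN, or firewall is actually in place; that still has to be verified separately.
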
## Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager forwards them to the webhook
configured by:

```bash
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
```

If no webhook URL is configured, Alertmanager still starts, but alerts go to a local
discard endpoint, so nothing is delivered.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.

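To confirm that the webhook actually receives something, a synthetic alert can be posted to Alertmanager's v2 API. This is a sketch under assumptions: the alert name `RunbookDeliveryTest`, the `severity` label, and the `ALERTMANAGER_URL` variable are all hypothetical, and port 9093 is only Alertmanager's upstream default, which this stack may not publish on the host.

```bash
# Build the JSON body for a synthetic alert. The alert name and the "severity"
# label are assumptions; match them to the labels the routing rules use.
build_test_alert() {
  severity="${1:-warning}"
  printf '[{"labels":{"alertname":"RunbookDeliveryTest","severity":"%s"},"annotations":{"summary":"Manual delivery test from the runbook"}}]\n' "$severity"
}

# Post the alert to Alertmanager's v2 API. ALERTMANAGER_URL is a hypothetical
# variable; the default assumes Alertmanager is reachable on its upstream port.
send_test_alert() {
  url="${ALERTMANAGER_URL:-http://127.0.0.1:9093}"
  build_test_alert "$1" | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$url/api/v2/alerts"
}
```

Calling `send_test_alert critical` and then watching the webhook receiver shows whether delivery works end to end. If Alertmanager is not published on the host, the same request would have to run from a container on the stack's network.
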
## Secure Grafana With Caddy

An optional Caddy snippet is available at:

```txt
deploy/observability/caddy/grafana.Caddyfile
```

Generate a Caddy password hash:

```bash
caddy hash-password --plaintext '<password>'
```

Configure:

```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
```

Keep Grafana private unless the hostname is protected.

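One way to check that the hostname is protected is to request it without credentials and look at the status code. The sketch below assumes the Caddy snippet enforces HTTP basic auth (suggested by the variable names above, but not confirmed here) and serves the host over HTTPS. Both function names are hypothetical.

```bash
# Map the HTTP status of an unauthenticated request to a verdict.
# 401 means a basic-auth challenge was issued; 2xx or 3xx means the request
# got through to something (for example Grafana's own login redirect).
classify_unauthenticated_status() {
  case "$1" in
    401) echo "protected" ;;
    2??|3??) echo "unprotected" ;;
    *) echo "unknown" ;;
  esac
}

# Request the hostname without credentials and print the verdict.
# curl reports 000 when the connection fails, which maps to "unknown".
check_grafana_protected() {
  host="${OBSERVABILITY_HOST:?set OBSERVABILITY_HOST first}"
  status=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/")
  classify_unauthenticated_status "$status"
}
```

A result of `unprotected` means the proxy layer is not challenging requests, even if Grafana's own login page still appears.
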
## First Bring-Up Checks

1. Confirm containers are running:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
```

2. Check API health:

```bash
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
```

3. Open Grafana and check the `Socialize Overview` dashboard.

4. Generate a few real actions:

- log in
- create a content item
- add a comment
- submit feedback
- create a workspace invite

5. Confirm metrics appear in the dashboard:

- API request rate
- usage signals
- workflow backlog
- operational events

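Right after `up -d` the API may need a few seconds before the health checks in step 2 pass. A small retry loop avoids re-running `curl` by hand. This is a sketch: `probe` and `wait_for_ready` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```bash
# Probe one URL. curl -f makes HTTP 400 and above count as a failure.
probe() {
  curl -fs -o /dev/null --max-time 5 "$1"
}

# Retry probe until it succeeds or the attempts run out.
wait_for_ready() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_for_ready http://127.0.0.1:8080/health/ready 20 3` polls the readiness endpoint for up to about a minute before giving up.
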
## Alert Triage

`SocializePreprodEndpointDown`

- Check `docker compose ps`.
- Check `docker compose logs api web`.
- Check `/health/ready`.

`SocializeApiTelemetryMissing`

- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.

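The first `SocializeApiTelemetryMissing` check can be scripted. This sketch assumes the `api` image ships `printenv` and that the stack was started with the two compose files used elsewhere in this runbook. The function names are hypothetical.

```bash
EXPECTED_OTLP_ENDPOINT="http://alloy:4317"

# Compare a reported endpoint value with the one this runbook expects.
check_otlp_endpoint() {
  actual="$1"
  if [ "$actual" = "$EXPECTED_OTLP_ENDPOINT" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: got '${actual:-<unset>}', expected '$EXPECTED_OTLP_ENDPOINT'"
    return 1
  fi
}

# Read the variable from the running api container.
# Assumption: printenv exists in the image; -T disables TTY allocation so the
# output can be captured cleanly.
read_api_otlp_endpoint() {
  docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml \
    exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT
}
```

Combined, `check_otlp_endpoint "$(read_api_otlp_endpoint)"` reports whether the running container points at Alloy.
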
`SocializeApiHighErrorRate`

- Open the API logs panel.
- Filter by recent `5xx` requests.
- Open Tempo traces for slow or failing requests if available.

`SocializeApiHighLatency`

- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.

`SocializeEmailDeliveryFailures`

- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL`.
- Confirm Resend service status outside this stack if needed.

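Confirming the Resend variables without printing secrets can be done with a helper that reports only the names of missing variables. This is a sketch: `missing_vars` is a hypothetical function, and it inspects the current shell, so it assumes the stack's environment has been loaded there (how the stack loads its environment is not specified in this runbook).

```bash
# Print the names of any listed variables that are unset or empty.
# Only names are printed, so secret values never reach the terminal or logs.
# The names are passed to eval, so call this only with trusted, literal names.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}
```

An empty result from `missing_vars RESEND_API_KEY RESEND_FROM_EMAIL` means both are set. It does not prove that the key is valid, which only the API logs or Resend itself can confirm.
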
`SocializeBlobStorageFailures`

- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.

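The first two blob storage checks can be combined into one helper. This is a sketch with assumptions stated: `check_blob_dir` is a hypothetical name, the 1 GiB default threshold is arbitrary, and the writability test applies to the user running the script, which may differ from the user inside the API container.

```bash
# Report whether a directory exists, is writable by the current user, and has
# at least min_free_kb KiB free. The 1 GiB default is an arbitrary assumption.
check_blob_dir() {
  dir="$1"
  min_free_kb="${2:-1048576}"

  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }

  # df -P gives one POSIX-format line per filesystem; -k reports 1024-byte
  # units. Column 4 of the data row is the available space.
  free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  if [ "$free_kb" -lt "$min_free_kb" ]; then
    echo "low disk: ${free_kb} KiB free under $dir"
    return 1
  fi
  echo "ok: $dir writable, ${free_kb} KiB free"
}
```

Running `check_blob_dir ./blob-storage` from the directory that contains the volume gives a quick first signal before digging into the API logs.
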
`SocializeBackgroundJobFailures`

- Check the operational events panel for the failing job name.
- Check API logs for the same time window.

`SocializeContentStaleInApproval`

- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.

`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`

- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.

## Retention Defaults

- Prometheus keeps 15 days of data by default; change this through `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.