feat: close preprod observability loop
@@ -78,3 +78,17 @@ Initial alerts should cover:
- email delivery failures
- blob storage failures
- background job failures

## Workflow Health Gauges

Database-derived workflow health metrics should be sampled periodically instead of emitted per request.

Initial gauges should cover:

- content item counts by status
- feedback report counts by status
- pending workspace invites
- content stale in approval
- active workspace counts over 24-hour and 7-day windows

These are operator health signals. They should stay aggregate enough to avoid high-cardinality metric labels.
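To illustrate why aggregate labels matter: the number of time series behind one gauge is the product of the distinct values of each of its labels. The sketch below only does that arithmetic. The counts (5 statuses, 200 workspaces) and the function name are hypothetical and do not come from the Socialize codebase.

```bash
# Series count for one gauge = product of the distinct values of each label.
# All counts passed in are hypothetical, for illustration only.
series_count() {
  total=1
  for n in "$@"; do
    total=$((total * n))
  done
  echo "$total"
}

series_count 5        # status label only: prints 5
series_count 5 200    # status x per-workspace label: prints 1000
```

Under these assumed numbers, adding a per-workspace label multiplies the series count by 200, which is the kind of growth the aggregate gauges are meant to avoid.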
163
docs/OPERATIONS/observability-runbook.md
Normal file
@@ -0,0 +1,163 @@
# Observability Runbook

## Purpose

This runbook is for preproduction operation of Socialize's self-hosted observability stack.

The goal is to answer:

- Is the app reachable?
- Is the API healthy?
- Are errors or latency rising?
- Are users exercising core workflows?
- Are emails, blob storage, and background jobs failing?
- Is work getting stuck?

## Start The Stack

Run from the repository root on the preproduction host:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml up -d
```

Grafana listens on `127.0.0.1:3000` by default. Set `GRAFANA_HTTP_BIND=0.0.0.0`
only when Grafana is protected by a reverse proxy, VPN, firewall rule, or SSH tunnel.

Set these before exposing Grafana:

```bash
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```
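One possible way to produce a value for `GRAFANA_ADMIN_PASSWORD` is to encode random bytes from the kernel. This is a minimal sketch that assumes a Linux host with coreutils. The function name is illustrative and not part of the repository.

```bash
# Generate a random password from 24 bytes of /dev/urandom.
# base64 turns 24 bytes into 32 characters with no padding.
generate_password() {
  head -c 24 /dev/urandom | base64
}

generate_password
```

Paste the printed value into the environment file or secret store used by the deployment rather than committing it.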

## Alert Delivery

Prometheus sends alerts to Alertmanager. Alertmanager forwards them to the webhook
configured by:

```bash
ALERTMANAGER_WEBHOOK_URL=<private-alert-webhook-url>
```

If no webhook URL is configured, Alertmanager still starts, but it routes alerts to a
local discard endpoint, so no notifications are delivered.

Critical alerts repeat every 30 minutes. Other alerts repeat every 4 hours.
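To check delivery end to end, a synthetic alert can be posted to Alertmanager's v2 HTTP API (`POST /api/v2/alerts`). The sketch below makes several assumptions: Alertmanager is reachable on its default port 9093 from where the command runs, routing keys on a `severity` label, and `ALERTMANAGER_URL` is a hypothetical override variable. The alert name is a made-up test value.

```bash
# Build a minimal Alertmanager v2 alert payload.
# $1 = alert name, $2 = severity (both are hypothetical test values).
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Manual delivery test"}}]\n' "$1" "$2"
}

# Post the payload to Alertmanager.
# Assumes the default port 9093 unless ALERTMANAGER_URL (hypothetical) is set.
send_test_alert() {
  test_alert_payload "RunbookDeliveryTest" "critical" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${ALERTMANAGER_URL:-http://127.0.0.1:9093}/api/v2/alerts"
}
```

After running `send_test_alert`, confirm that the notification reaches the destination behind `ALERTMANAGER_WEBHOOK_URL`.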

## Secure Grafana With Caddy

An optional Caddy snippet is available at:

```txt
deploy/observability/caddy/grafana.Caddyfile
```

Generate a Caddy password hash:

```bash
caddy hash-password --plaintext '<password>'
```

Configure:

```bash
OBSERVABILITY_HOST=observability.example.com
GRAFANA_BASIC_AUTH_USER=<user>
GRAFANA_BASIC_AUTH_HASH=<hash>
```

Keep Grafana private unless the hostname is protected.

## First Bring-Up Checks

1. Confirm containers are running:

```bash
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml ps
```

2. Check API health:

```bash
curl -i http://127.0.0.1:8080/health
curl -i http://127.0.0.1:8080/health/ready
```

3. Open Grafana and check the `Socialize Overview` dashboard.

4. Generate a few real actions:

- log in
- create a content item
- add a comment
- submit feedback
- create a workspace invite

5. Confirm metrics appear in the dashboard:

- API request rate
- usage signals
- workflow backlog
- operational events
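Right after `up -d`, the readiness endpoint from step 2 may not answer yet. A small retry helper can wait for it instead of re-running `curl` by hand. This is a sketch: the name `wait_until` and the attempt and delay values are illustrative.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Succeed as soon as the command exits with status 0.
    if "$@"; then
      return 0
    fi
    # Give up once the last attempt has failed.
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (run on the preproduction host):
#   wait_until 30 2 curl -fsS -o /dev/null http://127.0.0.1:8080/health/ready
```

The helper takes the probe as an ordinary command, so the same function can wait on `/health` or any other check.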

## Alert Triage

`SocializePreprodEndpointDown`

- Check `docker compose ps`.
- Check `docker compose logs api web`.
- Check `/health/ready`.
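The triage steps write compose commands in short form. If the stack was started with the two-file invocation from Start The Stack, the same `-f` flags are presumably needed for compose to see the observability services. A small wrapper keeps the commands short. The name `dc` is hypothetical.

```bash
# Wrap docker compose with both compose files used to start the stack.
dc() {
  docker compose \
    -f deploy/compose.yml \
    -f deploy/observability/compose.observability.yml \
    "$@"
}

# Examples (run from the repository root):
#   dc ps
#   dc logs --since 15m api web
```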

`SocializeApiTelemetryMissing`

- Check that `api` has `OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy:4317`.
- Check `docker compose logs alloy`.
- Check whether the API is receiving traffic.
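The first check can be scripted by reading the variable from inside the running container. This sketch assumes the `api` image ships `printenv` and that it is run from the repository root. The function name is illustrative.

```bash
# Compare the OTLP endpoint inside the api container with the expected value.
# Assumes the container image provides `printenv`.
check_otlp_endpoint() {
  expected="http://alloy:4317"
  # `|| true` keeps the function going when the variable is unset.
  actual=$(docker compose \
    -f deploy/compose.yml \
    -f deploy/observability/compose.observability.yml \
    exec -T api printenv OTEL_EXPORTER_OTLP_ENDPOINT) || true
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: got '${actual:-<unset>}', expected '$expected'"
    return 1
  fi
}
```

A mismatch or an unset value points at the compose environment. A correct value shifts attention to the `alloy` logs.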

`SocializeApiHighErrorRate`

- Open the API logs panel.
- Filter by recent `5xx` requests.
- Open Tempo traces for slow or failing requests if available.

`SocializeApiHighLatency`

- Check the p95 latency by endpoint panel.
- Inspect slow traces.
- Check database health and recent deploy activity.

`SocializeEmailDeliveryFailures`

- Check API logs for Resend failures.
- Confirm `RESEND_API_KEY` and `RESEND_FROM_EMAIL` are set correctly.
- Confirm Resend service status outside this stack if needed.

`SocializeBlobStorageFailures`

- Confirm `./blob-storage` volume permissions on the preprod host.
- Check local disk space.
- Check API logs for validation or filesystem errors.
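The disk space check can be reduced to one number with `df -P`, which prints a stable, single-line-per-filesystem format. In this sketch the 90% threshold and the function names are hypothetical. The default path is the `./blob-storage` directory named above.

```bash
# Print the used-space percentage (integer, no % sign) of the filesystem holding a path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem is at or above a hypothetical 90% threshold.
check_blob_disk() {
  pct=$(disk_used_pct "${1:-./blob-storage}")
  if [ "$pct" -ge 90 ]; then
    echo "warning: ${pct}% used"
    return 1
  fi
  echo "ok: ${pct}% used"
}
```

For the permissions check, `ls -ld ./blob-storage` shows the owner and mode of the directory on the host.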

`SocializeBackgroundJobFailures`

- Check the operational events panel for the failing job name.
- Check API logs for the same time window.

`SocializeContentStaleInApproval`

- Use the app to inspect content currently in approval.
- Contact the relevant internal owner or client contact outside the app if needed.

`SocializeCoreUsageQuiet` or `SocializeNoActiveWorkspaces`

- Confirm whether quiet usage is expected for the period.
- If not expected, check login events and API reachability.

## Retention Defaults

- Prometheus keeps 15 days by default through `PROMETHEUS_RETENTION`.
- Tempo keeps traces for 168 hours (7 days).
- Loki uses local filesystem storage for preproduction.

Tune retention before heavy customer usage or long-running demos.
34
docs/TASKS/observability/003-preprod-operations-loop.md
Normal file
@@ -0,0 +1,34 @@
# Observability 003: Preprod Operations Loop

## Goal

Close the preproduction operations loop by adding alert delivery scaffolding, uptime probes, workflow health gauges, secured Grafana guidance, and an operator runbook.

## Feature Spec

- `docs/FEATURES/observability.md`

## Scope

- Add Alertmanager to the optional observability compose overlay.
- Add Blackbox Exporter uptime probes for the web container and API readiness endpoint.
- Add backend database-derived workflow health gauges.
- Add Prometheus alerts for uptime probes and workflow health.
- Add an optional Caddy snippet for protected Grafana exposure.
- Add an operator runbook for bring-up, alert triage, and security defaults.

## Out Of Scope

- Operating the remote preproduction host.
- Choosing the final alert destination.
- Client-facing status page.
- External third-party uptime monitoring.

## Validation

```bash
dotnet build backend/Socialize.slnx
dotnet test backend/Socialize.slnx
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
jq empty deploy/observability/grafana/dashboards/socialize-overview.json
```