feat: add preprod observability foundation

2026-05-08 15:45:31 -04:00
parent 1ca6ab7117
commit 8bcff96821
35 changed files with 1627 additions and 56 deletions

@@ -0,0 +1,44 @@
# Observability 001: Preprod Foundation
## Goal
Add an initial preproduction observability foundation for Socialize so the operator can tell whether the app is healthy and whether core workflows are being used.
## Feature Spec
- `docs/FEATURES/observability.md`
## Scope
- Add backend OpenTelemetry registration for traces and metrics.
- Add structured JSON console logging with request correlation context.
- Add aggregate custom counters for core usage events.
- Expand health endpoints with liveness and readiness checks.
- Add an optional Docker Compose observability overlay for Grafana, Prometheus, Loki, Tempo, and Alloy.
- Add basic Grafana datasource/dashboard provisioning.
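
As a rough illustration of how the liveness and readiness checks might be exercised from a shell, the sketch below probes an endpoint with `curl` and reports its status. The paths `/health/live` and `/health/ready`, the base URL, and the helper name `check_endpoint` are assumptions for illustration; this task does not name the actual routes.

```bash
#!/usr/bin/env bash
# Hypothetical smoke check; check_endpoint is an illustrative helper, not part of the commit.

# Print "ok URL" when the endpoint answers HTTP 200, otherwise "fail URL (STATUS)".
check_endpoint() {
  local url="$1" status
  # -s hides progress, -o discards the body, -w prints only the status code.
  # curl prints 000 when it cannot connect at all.
  status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")" || true
  if [ "$status" = "200" ]; then
    echo "ok $url"
  else
    echo "fail $url (${status:-000})"
    return 1
  fi
}

# Example (assumed base URL and paths):
#   check_endpoint "http://localhost:8080/health/live"
#   check_endpoint "http://localhost:8080/health/ready"
```

Probing the two endpoints separately is useful because a process can be alive while a dependency it needs is not yet ready to serve traffic.
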
## Likely Files
- `backend/src/Socialize.Api/Program.cs`
- `backend/src/Socialize.Api/ApplicationRegistration.cs`
- `backend/src/Socialize.Api/Infrastructure/Observability/*`
- selected backend handlers for usage counters
- `backend/src/Socialize.Api/Socialize.Api.csproj`
- `deploy/observability/*`
- `README.md`
## Out Of Scope
- Client-facing analytics or status page.
- Frontend behavioral analytics.
- Cloud telemetry providers.
- Long-term telemetry retention policy.
- Full product analytics warehouse.
## Validation
```bash
dotnet build backend/Socialize.slnx
dotnet test backend/Socialize.slnx
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
```

@@ -0,0 +1,32 @@
# Observability 002: Alerts And Dashboard Hardening
## Goal
Make the preproduction observability stack actionable by adding alert rules, better operator dashboards, pinned image versions, and operational counters for services that commonly fail silently.
## Feature Spec
- `docs/FEATURES/observability.md`
## Scope
- Pin Grafana, Prometheus, Loki, Tempo, and Alloy image tags in the observability compose overlay.
- Add Prometheus alert rules for API health, error rate, latency, usage silence, feedback bugs, email failures, blob failures, and background job failures.
- Expand the Grafana dashboard with panels for health, usage, operational failures, alerts, logs, and traces.
- Add backend counters for email delivery, blob storage operations, and background job runs.
- Document alerting and safe Grafana exposure expectations.
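
One way to guard the pinned-image requirement is a small check that flags `image:` entries with no tag or with the floating `latest` tag. The sketch below is a line-based heuristic, not a YAML parser, and it does not resolve variable interpolation; `find_unpinned_images` is a hypothetical helper and is not part of this commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list image references in a compose file that are not pinned.

find_unpinned_images() {
  local compose_file="$1"
  # Keep only the value of each `image:` key, without comments or quotes.
  grep -E '^[[:space:]]*image:' "$compose_file" |
    sed -E 's/^[[:space:]]*image:[[:space:]]*//; s/[[:space:]]*(#.*)?$//' |
    tr -d "\"'" |
    while IFS= read -r image; do
      # Inspect only the final path segment so a registry port is not mistaken for a tag.
      last_segment="${image##*/}"
      case "$last_segment" in
        *@sha256:*) ;;                # pinned by digest
        *:latest)   echo "$image" ;;  # floating tag
        *:*)        ;;                # explicit tag
        *)          echo "$image" ;;  # no tag at all
      esac
    done
}

# Example, using the overlay path from the Validation section:
#   find_unpinned_images deploy/observability/compose.observability.yml
```

An empty result means every image the heuristic found carries an explicit tag or digest; any printed line is a candidate for pinning.
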
## Out Of Scope
- Notification delivery integration for alerts.
- Client-facing status page.
- Cloud observability backends.
- Full product analytics or session tracking.
## Validation
```bash
dotnet build backend/Socialize.slnx
dotnet test backend/Socialize.slnx
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
```