feat: add preprod observability foundation

2026-05-08 15:45:31 -04:00
parent 1ca6ab7117
commit 8bcff96821
35 changed files with 1627 additions and 56 deletions
--- a/docs/FEATURES/observability.md
+++ b/docs/FEATURES/observability.md
@@ -0,0 +1,80 @@
+# Observability
+
+## Status
+
+Draft
+
+## Goal
+
+Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.
+
+This feature is operator-facing. It is not a client-facing analytics suite or status page.
+
+## Initial Scope
+
+- structured backend logs suitable for centralized log search
+- OpenTelemetry traces and metrics emitted by the API
+- self-hosted Grafana observability stack for preproduction
+- health, readiness, and liveness endpoints
+- aggregate product usage counters for core workflow actions
+- dashboards and alerts for app health and adoption signals
+
+## Operational Signals
+
+Health signals should cover:
+
+- API availability
+- Postgres connectivity
+- request rate, latency, and error rate
+- slow endpoints
+- outbound HTTP failures
+- background job failures
+- email delivery failures
+- blob storage failures
+- authentication failures
+
+Usage signals should cover aggregate counts for:
+
+- login attempts and successful logins
+- organizations and workspaces created
+- content items created
+- comments created
+- approval decisions submitted
+- feedback reports submitted
+- workspace invites created
+
+## Privacy And Safety Rules
+
+- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
+- Usage metrics are aggregate operational signals, not behavioral tracking.
+- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
+- The first implementation targets preproduction and self-hosted Docker infrastructure only.
+
+## Deployment Shape
+
+The application emits OpenTelemetry over OTLP to a local collector.
+
+The preproduction observability stack runs as an optional Docker Compose overlay with:
+
+- Grafana for dashboards and alerting
+- Prometheus for metrics
+- Loki for logs
+- Tempo for traces
+- Grafana Alloy for log collection and telemetry routing
+
+The normal application compose file must remain usable without the observability overlay.
+
+## Alerting
+
+Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step, because the first preproduction target may deliver alerts by email, chat, or a private incident channel.
+
+Initial alerts should cover:
+
+- app telemetry missing
+- high API error rate
+- high API p95 latency
+- core usage unexpectedly quiet
+- feedback bug reports submitted
+- email delivery failures
+- blob storage failures
+- background job failures
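
The "Deployment Shape" section in the spec above requires that the base compose file keeps working when the overlay is left out. A minimal sketch of how an operator script might select the files, assuming the paths used in the task validation commands; the `compose_args` helper and its `0`/`1` switch are hypothetical and not part of this commit:

```bash
#!/usr/bin/env bash
# Sketch: build the `docker compose` file arguments with or without the
# optional observability overlay. The helper name and its argument are
# illustrative assumptions; only the two file paths come from the task docs.
set -euo pipefail

compose_args() {
  # The base application file is always included.
  local args=(-f deploy/compose.yml)
  # Add the overlay only when explicitly requested with "1".
  if [ "${1:-0}" = "1" ]; then
    args+=(-f deploy/observability/compose.observability.yml)
  fi
  printf '%s\n' "${args[*]}"
}

# Print the commands instead of running them, so the sketch is safe to try.
echo "docker compose $(compose_args 0) up -d"
echo "docker compose $(compose_args 1) up -d"
```

Because the overlay is passed as a second `-f` file, the application-only invocation stays unchanged.
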
--- a/docs/TASKS/observability/001-observability-foundation.md
+++ b/docs/TASKS/observability/001-observability-foundation.md
@@ -0,0 +1,44 @@
+# Observability 001: Preprod Foundation
+
+## Goal
+
+Add the first preproduction observability foundation for Socialize so the operator can tell whether the app is healthy and whether core workflows are being used.
+
+## Feature Spec
+
+- `docs/FEATURES/observability.md`
+
+## Scope
+
+- Add backend OpenTelemetry registration for traces and metrics.
+- Add structured JSON console logging with request correlation context.
+- Add aggregate custom counters for core usage events.
+- Expand health endpoints with liveness and readiness checks.
+- Add an optional Docker Compose observability overlay for Grafana, Prometheus, Loki, Tempo, and Alloy.
+- Add basic Grafana datasource/dashboard provisioning.
+
+## Likely Files
+
+- `backend/src/Socialize.Api/Program.cs`
+- `backend/src/Socialize.Api/ApplicationRegistration.cs`
+- `backend/src/Socialize.Api/Infrastructure/Observability/*`
+- selected backend handlers for usage counters
+- `backend/src/Socialize.Api/Socialize.Api.csproj`
+- `deploy/observability/*`
+- `README.md`
+
+## Out Of Scope
+
+- Client-facing analytics or status page.
+- Frontend behavioral analytics.
+- Cloud telemetry providers.
+- Long-term telemetry retention policy.
+- Full product analytics warehouse.
+
+## Validation
+
+```bash
+dotnet build backend/Socialize.slnx
+dotnet test backend/Socialize.slnx
+docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
+```
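
The scope above adds liveness and readiness checks, but the validation commands only cover build, tests, and compose config. One possible smoke check after the stack starts is to poll the readiness endpoint until it responds. This is a sketch only: the `wait_for_ready` helper is hypothetical, and the route and port depend on how the API registers its health checks.

```bash
#!/usr/bin/env bash
# Sketch: poll a readiness endpoint until it answers with a successful status.
# The helper name and its defaults are assumptions, not part of this commit.
set -euo pipefail

wait_for_ready() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null --max-time 2 "$url"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

A call might look like `wait_for_ready "http://localhost:8080/health/ready" 30 2`, where the URL is a placeholder for whatever readiness route the API exposes.
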
--- a/docs/TASKS/observability/002-alerts-dashboard-hardening.md
+++ b/docs/TASKS/observability/002-alerts-dashboard-hardening.md
@@ -0,0 +1,32 @@
+# Observability 002: Alerts And Dashboard Hardening
+
+## Goal
+
+Make the preproduction observability stack actionable by adding alert rules, better operator dashboards, pinned image versions, and operational counters for services that commonly fail silently.
+
+## Feature Spec
+
+- `docs/FEATURES/observability.md`
+
+## Scope
+
+- Pin Grafana, Prometheus, Loki, Tempo, and Alloy image tags in the observability compose overlay.
+- Add Prometheus alert rules for API health, error rate, latency, usage silence, feedback bugs, email failures, blob failures, and background job failures.
+- Expand the Grafana dashboard with health, usage, operational failure, alert, log, and trace-oriented panels.
+- Add backend counters for email delivery, blob storage operations, and background job runs.
+- Document alerting and safe Grafana exposure expectations.
+
+## Out Of Scope
+
+- Notification delivery integration for alerts.
+- Client-facing status page.
+- Cloud observability backends.
+- Full product analytics or session tracking.
+
+## Validation
+
+```bash
+dotnet build backend/Socialize.slnx
+dotnet test backend/Socialize.slnx
+docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
+```
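
The scope above calls for pinned image tags, which `docker compose ... config` does not check. A rough sketch of a check that lists `image:` entries that are untagged or use `:latest`; the `unpinned_images` helper is hypothetical, and it only handles simple single-line `image:` values:

```bash
#!/usr/bin/env bash
# Sketch: flag `image:` entries in a compose file that are untagged or use
# `:latest`. The helper name is an assumption, not part of this commit.
set -euo pipefail

unpinned_images() {
  local file="$1"
  # Print the value after `image:`, strip quotes, then filter in awk.
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$file" \
    | tr -d "\"'" \
    | awk '{
        ref = $1
        name = ref
        sub(/.*\//, "", name)            # drop registry and repository path
        if (ref ~ /@sha256:/) next       # digest references count as pinned
        if (name !~ /:/ || name ~ /:latest$/) print ref
      }'
}
```

Running `unpinned_images deploy/observability/compose.observability.yml` should print nothing once every image in the overlay carries an explicit tag.
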