feat: add preprod observability foundation

2026-05-08 15:45:31 -04:00
parent 1ca6ab7117
commit 8bcff96821
35 changed files with 1627 additions and 56 deletions
--- a/docs/FEATURES/observability.md
+++ b/docs/FEATURES/observability.md
@@ -0,0 +1,80 @@
+# Observability
+
+## Status
+
+Draft
+
+## Goal
+
+Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.
+
+This feature is operator-facing. It is not a client-facing analytics suite or status page.
+
+## Initial Scope
+
+- structured backend logs suitable for centralized log search
+- OpenTelemetry traces and metrics emitted by the API
+- self-hosted Grafana observability stack for preproduction
+- health, readiness, and liveness endpoints
+- aggregate product usage counters for core workflow actions
+- dashboards and alerts for app health and adoption signals
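
The split between liveness and readiness in the scope list above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the response shape, and the `checks` mapping are hypothetical, and the sketch assumes liveness reports only that the process can respond while readiness also probes dependencies such as Postgres.

```python
from typing import Callable, Dict, Tuple

# Hypothetical probe type: a real service would ping Postgres, blob storage, etc.
Check = Callable[[], bool]


def liveness() -> Tuple[int, dict]:
    """Liveness only says the process can answer; it checks no dependencies."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every dependency check; any failure or exception means not ready."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A probe that raises is treated the same as a probe that fails.
            results[name] = False
    ok = all(results.values())
    status = "ready" if ok else "not_ready"
    return (200 if ok else 503), {"status": status, "checks": results}
```

Keeping liveness free of dependency checks means a Postgres outage marks the API as not ready without also making it look dead to a container supervisor.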
+
+## Operational Signals
+
+Health signals should cover:
+
+- API availability
+- Postgres connectivity
+- request rate, latency, and error rate
+- slow endpoints
+- outbound HTTP failures
+- background service failures
+- email delivery failures
+- blob storage failures
+- authentication failures
+
+Usage signals should cover aggregate counts for:
+
+- login attempts and successful logins
+- organizations and workspaces created
+- content items created
+- comments created
+- approval decisions submitted
+- feedback reports submitted
+- workspace invites created
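
One way to picture the aggregate usage counts listed above is an in-process counter keyed only by action and outcome. The class name, method names, and action labels below are hypothetical, and in the real API an OpenTelemetry counter instrument would presumably take this role; the sketch only shows that nothing beyond the action name and outcome needs to be stored.

```python
import threading
from collections import Counter


class UsageCounters:
    """Counts workflow actions in aggregate; stores no user or content data."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, action: str, outcome: str = "success") -> None:
        # The key is (action, outcome) only, so counts cannot be tied to a person.
        with self._lock:
            self._counts[(action, outcome)] += 1

    def snapshot(self) -> dict:
        # Return a copy so callers cannot mutate the live counters.
        with self._lock:
            return dict(self._counts)


counters = UsageCounters()
counters.record("login", "attempt")
counters.record("login", "success")
counters.record("comment_created")
```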
+
+## Privacy And Safety Rules
+
+- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
+- Usage metrics are aggregate operational signals, not behavioral tracking.
+- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
+- The first implementation targets preproduction and self-hosted Docker infrastructure only.
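
The logging rule above could be backed by a redaction step in the log formatter. The following is a sketch under stated assumptions: the key names in `SENSITIVE_KEYS` and the `fields` attribute on the log record are hypothetical, and redaction is meant as a backstop, since the safer course is never to pass such values to the logger at all.

```python
import json
import logging

# Assumed key names; the real list would mirror the fields the API actually handles.
SENSITIVE_KEYS = {
    "password", "access_token", "refresh_token", "request_body", "file_contents",
}


def redact(fields: dict) -> dict:
    """Replace the value of any sensitive key with a fixed placeholder."""
    return {
        key: ("[REDACTED]" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, with structured fields redacted."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)
```

A handler configured with `JsonFormatter()` then produces lines that a log store such as Loki can index by field.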
+
+## Deployment Shape
+
+The application emits OpenTelemetry over OTLP to a local collector.
+
+The preproduction observability stack runs as an optional Docker Compose overlay with:
+
+- Grafana for dashboards and alerting
+- Prometheus for metrics
+- Loki for logs
+- Tempo for traces
+- Grafana Alloy for log collection and telemetry routing
+
+The normal application compose file must remain usable without the observability overlay.
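
The requirement that the application stays usable without the overlay suggests that telemetry export should be switched on by configuration rather than assumed. The sketch below uses the standard OpenTelemetry environment variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SDK_DISABLED`; whether Socialize reads these variables is an assumption, and the helper name `otlp_endpoint` and the example hostname are hypothetical.

```python
import os
from typing import Mapping, Optional


def otlp_endpoint(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the collector endpoint, or None when export should stay off."""
    env = os.environ if env is None else env
    if env.get("OTEL_SDK_DISABLED", "").lower() == "true":
        return None
    # An unset or empty endpoint means no overlay is attached.
    return env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None
```

With this shape, the base compose file leaves the endpoint unset and the overlay supplies it, so the API starts the same way in both cases.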
+
+## Alerting
+
+Preproduction alerting should start with local Prometheus alert rules. Notification routing is configured as a separate operational step, because the first preproduction target may deliver notifications by email, chat, or a private incident channel.
+
+Initial alerts should cover:
+
+- app telemetry missing
+- high API error rate
+- high API p95 latency
+- core usage unexpectedly quiet
+- bug reports submitted via feedback
+- email delivery failures
+- blob storage failures
+- background job failures
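
The error-rate and p95-latency alerts above reduce to simple arithmetic over a time window. In the actual stack these conditions would be written as Prometheus alert rules; the Python below only illustrates the calculation. The thresholds (5% errors, 1000 ms) and all function names are hypothetical placeholders, not values taken from this feature.

```python
from typing import List, Sequence


def error_rate(total: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, computed with integer arithmetic."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def firing_alerts(
    total: int,
    errors: int,
    latencies_ms: Sequence[float],
    max_error_rate: float = 0.05,   # assumed threshold
    max_p95_ms: float = 1000.0,     # assumed threshold
) -> List[str]:
    """Return the names of the alerts whose condition holds for this window."""
    alerts = []
    if error_rate(total, errors) > max_error_rate:
        alerts.append("high API error rate")
    if p95(latencies_ms) > max_p95_ms:
        alerts.append("high API p95 latency")
    return alerts
```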