# Observability

## Status

Draft

## Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

## Initial Scope

- structured backend logs suitable for centralized log search
- OpenTelemetry traces and metrics emitted by the API
- self-hosted Grafana observability stack for preproduction
- health, readiness, and liveness endpoints
- aggregate product usage counters for core workflow actions
- dashboards and alerts for app health and adoption signals

## Operational Signals

Health signals should cover:

- API availability
- Postgres connectivity
- request rate, latency, and error rate
- slow endpoints
- outbound HTTP failures
- background service failures
- email delivery failures
- blob storage failures
- authentication failures

Usage signals should cover aggregate counts for:

- login attempts and successful logins
- organizations and workspaces created
- content items created
- comments created
- approval decisions submitted
- feedback reports submitted
- workspace invites created

## Privacy And Safety Rules

- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
- Usage metrics are aggregate operational signals, not behavioral tracking.
- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
- The first implementation targets preproduction and self-hosted Docker infrastructure only.

## Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.
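
Before any log or span attribute leaves the process for the collector, the privacy rules above could be enforced with a small attribute filter. The following is a minimal Python sketch under stated assumptions: the function name `sanitize_attributes` and every key name in `FORBIDDEN_KEYS` are hypothetical, and the real backend may use a different language and different field names.

```python
from typing import Any, Mapping

# Hypothetical deny-list derived from the privacy rules.
# The actual field names used by the backend are not specified in this document.
FORBIDDEN_KEYS = frozenset({
    "request_body",
    "access_token",
    "refresh_token",
    "password",
    "file_contents",
    "screenshot",
    "content",
})


def sanitize_attributes(attrs: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of attrs with forbidden keys dropped, including inside nested mappings.

    Identifier attributes (for example user, organization, or workspace IDs)
    are not on the deny-list, so they pass through unchanged.
    """
    clean: dict[str, Any] = {}
    for key, value in attrs.items():
        if key.lower() in FORBIDDEN_KEYS:
            continue  # drop the sensitive field entirely rather than masking it
        if isinstance(value, Mapping):
            value = sanitize_attributes(value)
        clean[key] = value
    return clean


event = {
    "event": "login_succeeded",
    "user_id": "u_123",
    "password": "hunter2",
    "http": {"status": 200, "request_body": "raw payload"},
}
print(sanitize_attributes(event))
# {'event': 'login_succeeded', 'user_id': 'u_123', 'http': {'status': 200}}
```

A deny-list only catches fields it knows about, and this sketch does not descend into lists. An allow-list of permitted attribute names would be stricter, at the cost of more maintenance.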

The preproduction observability stack runs as an optional Docker Compose overlay with:

- Grafana for dashboards and alerting
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.

## Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

- app telemetry missing
- high API error rate
- high API p95 latency
- core usage unexpectedly quiet
- feedback bug reports submitted
- email delivery failures
- blob storage failures
- background job failures
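
The real rules would live in Prometheus, but the conditions behind the first three alerts can be illustrated in plain Python. This is only a sketch under stated assumptions: the thresholds (5% error rate, 500 ms p95) are placeholders that do not come from this document, `WindowStats` is a hypothetical shape for one evaluation window, and an empty window is treated as missing telemetry, which is a simplification.

```python
from dataclasses import dataclass
from typing import Sequence

# Placeholder thresholds; this document does not specify any values.
ERROR_RATE_THRESHOLD = 0.05
P95_LATENCY_THRESHOLD_MS = 500.0


@dataclass(frozen=True)
class WindowStats:
    """Request observations for one evaluation window (hypothetical shape)."""
    latencies_ms: Sequence[float]
    error_count: int


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def firing_alerts(window: WindowStats) -> list[str]:
    """Return the names of the alerts whose condition holds for this window."""
    total = len(window.latencies_ms)
    if total == 0:
        # Simplification: no observations at all is treated as missing telemetry.
        return ["app telemetry missing"]
    alerts: list[str] = []
    if window.error_count / total > ERROR_RATE_THRESHOLD:
        alerts.append("high API error rate")
    if p95(window.latencies_ms) > P95_LATENCY_THRESHOLD_MS:
        alerts.append("high API p95 latency")
    return alerts


window = WindowStats(latencies_ms=[100.0] * 18 + [900.0] * 2, error_count=2)
print(firing_alerts(window))
# ['high API error rate', 'high API p95 latency']
```

Prometheus typically estimates percentiles from histogram buckets rather than raw samples, so a production rule would behave approximately, not exactly, like `p95` here. A quiet preproduction environment can also have zero requests while telemetry is healthy, so the missing-telemetry alert is better keyed on a heartbeat-style metric than on request volume.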