social-media/docs/FEATURES/observability.md

# Observability

## Status

Draft

## Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

## Initial Scope

- structured backend logs suitable for centralized log search
- OpenTelemetry traces and metrics emitted by the API
- self-hosted Grafana observability stack for preproduction
- health, readiness, and liveness endpoints
- aggregate product usage counters for core workflow actions
- dashboards and alerts for app health and adoption signals
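
As an illustration of the first scope item, here is a minimal sketch of structured logging that emits one JSON object per line, a shape log collectors such as Loki can ingest and search. The document does not name the backend language, so Python is used purely for illustration. The logger name `socialize.api` and the `attrs` convention for passing structured attributes are assumptions, not existing project code.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        # Hypothetical convention: callers pass structured attributes
        # via `extra={"attrs": {...}}` on the logging call.
        payload = dict(getattr(record, "attrs", {}))
        # Core fields are written last so attributes cannot overwrite them.
        payload.update(
            {
                "timestamp": datetime.fromtimestamp(
                    record.created, tz=timezone.utc
                ).isoformat(),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )
        return json.dumps(payload, default=str)


def build_logger(name: str = "socialize.api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("content item created", extra={"attrs": {"workspace_id": "ws_123"}})` would then produce a single JSON line carrying the message and the `workspace_id` attribute.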

## Operational Signals

Health signals should cover:

- API availability
- Postgres connectivity
- request rate, latency, and error rate
- slow endpoints
- outbound HTTP failures
- background service failures
- email delivery failures
- blob storage failures
- authentication failures
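
One way dependency signals such as Postgres connectivity or blob storage reachability could feed a readiness endpoint is to run a set of named checks and report a combined status. The sketch below assumes each check is a callable that raises on failure. The function name `run_readiness` and the check names are hypothetical, and wiring the result into an HTTP handler is left to whichever framework the API uses.

```python
from typing import Callable, Dict, Tuple

# A check takes no arguments and raises an exception when the dependency is down.
Check = Callable[[], None]


def run_readiness(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Run every check and return an (HTTP status code, response body) pair."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the exception type so error messages cannot
            # leak connection strings or other secrets.
            results[name] = f"failed: {type(exc).__name__}"
    ready = all(value == "ok" for value in results.values())
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), body
```

A liveness endpoint would typically skip dependency checks entirely and only confirm that the process can respond, so that a database outage does not cause the container to be restarted.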

Usage signals should cover aggregate counts for:

- login attempts and successful logins
- organizations and workspaces created
- content items created
- comments created
- approval decisions submitted
- feedback reports submitted
- workspace invites created
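
The usage signals above could be modeled as a fixed set of named counters. The following is a minimal in-process sketch, assuming event names derived from the list; the names and the `UsageCounters` class are hypothetical. In the real API these would more likely be OpenTelemetry counter instruments, and the standard library is used here only to keep the example runnable on its own.

```python
import threading
from collections import Counter

# Hypothetical event names, one per usage signal in the list above.
USAGE_EVENTS = frozenset(
    {
        "login_attempt",
        "login_success",
        "organization_created",
        "workspace_created",
        "content_item_created",
        "comment_created",
        "approval_decision_submitted",
        "feedback_report_submitted",
        "workspace_invite_created",
    }
)


class UsageCounters:
    """Thread-safe aggregate counters with a closed set of event names."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, event: str) -> None:
        # Rejecting unknown names keeps the metric set bounded and
        # prevents ad hoc, per-user event names from creeping in.
        if event not in USAGE_EVENTS:
            raise ValueError(f"unknown usage event: {event}")
        with self._lock:
            self._counts[event] += 1

    def snapshot(self) -> dict:
        """Return a copy of the current counts."""
        with self._lock:
            return dict(self._counts)
```

Keeping the event set closed is a deliberate choice: the counters stay aggregate and carry no per-user dimension.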

## Privacy And Safety Rules

- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
- Usage metrics are aggregate operational signals, not behavioral tracking.
- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
- The first implementation targets preproduction and self-hosted Docker infrastructure only.
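
As a backstop for the first rule, structured log attributes could be passed through a redaction step before they are emitted. This is a minimal sketch under assumptions: the key names in `SENSITIVE_KEYS` and the `redact` helper are hypothetical, and a denylist like this complements, rather than replaces, not passing sensitive values to the logger in the first place.

```python
from typing import Any, Mapping

# Hypothetical attribute keys that must never reach the logs.
SENSITIVE_KEYS = frozenset(
    {
        "password",
        "access_token",
        "refresh_token",
        "authorization",
        "body",
        "file_contents",
        "screenshot",
        "content",
    }
)
REDACTED = "[REDACTED]"


def redact(attrs: Mapping[str, Any]) -> dict:
    """Return a copy of attrs with sensitive values masked.

    Nested mappings are handled recursively. Lists are passed through
    unchanged, so callers should not place sensitive mappings inside lists.
    """
    clean = {}
    for key, value in attrs.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, Mapping):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```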

## Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.

The preproduction observability stack runs as an optional Docker Compose overlay with:

- Grafana for dashboards and alerting
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.

## Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

- app telemetry missing
- high API error rate
- high API p95 latency
- core usage unexpectedly quiet
- feedback bug reports submitted
- email delivery failures
- blob storage failures
- background job failures

## Workflow Health Gauges

Database-derived workflow health metrics should be sampled periodically instead of emitted per request.

Initial gauges should cover:

- content item counts by status
- feedback report counts by status
- pending workspace invites
- content stale in approval
- active workspace counts over 24-hour and 7-day windows

These are operator health signals. Keep them aggregate, labeled by status or time window rather than by user, organization, or workspace, so that metric labels stay low-cardinality.
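
Periodic sampling of a database-derived gauge can be sketched as a single grouped query whose result replaces the previous gauge values. The example below uses `sqlite3` from the standard library as a stand-in for Postgres so that it runs on its own. The `content_items` table and its `status` column are assumptions about the schema, not confirmed names.

```python
import sqlite3
from typing import Dict


def sample_content_status_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return the number of content items per status.

    Intended to be called on a timer (for example once a minute),
    not on every request. The only label is `status`, which keeps
    cardinality low.
    """
    rows = conn.execute(
        # Hypothetical table and column names.
        "SELECT status, COUNT(*) FROM content_items GROUP BY status"
    ).fetchall()
    return {status: count for status, count in rows}
```

The other gauges in the list would follow the same pattern: one aggregate query per gauge, run by a background sampler, with results published as gauge values.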