Observability

Status

Draft

Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

Initial Scope

  • structured backend logs suitable for centralized log search
  • OpenTelemetry traces and metrics emitted by the API
  • self-hosted Grafana observability stack for preproduction
  • health, readiness, and liveness endpoints
  • aggregate product usage counters for core workflow actions
  • dashboards and alerts for app health and adoption signals
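
As an illustration of the first scope item, structured logs could be emitted as one JSON object per line so that a log store can index them by field. This is a minimal sketch using only the Python standard library; the backend language is not stated in this document, and the logger name and the attribute names in `EXTRA_FIELDS` are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonLogFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Hypothetical attribute names; the real field set is not specified here.
    EXTRA_FIELDS = ("trace_id", "user_id", "organization_id", "workspace_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the known structured attributes that were attached
        # to the record (for example via the `extra=` argument).
        for field in self.EXTRA_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, sort_keys=True)


def configure_logging() -> logging.Logger:
    """Attach the JSON formatter to a stream handler on a named logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonLogFormatter())
    logger = logging.getLogger("socialize.api")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.info("content item created", extra={"workspace_id": "ws_1"})` would then produce a line with `message`, `level`, and `workspace_id` as separate searchable fields.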

Operational Signals

Health signals should cover:

  • API availability
  • Postgres connectivity
  • request rate, latency, and error rate
  • slow endpoints
  • outbound HTTP failures
  • background service failures
  • email delivery failures
  • blob storage failures
  • authentication failures
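
One way to expose several of these signals through a readiness endpoint is to run a set of named dependency checks and report the combined result. The sketch below is framework-agnostic and assumes each check is a callable that raises on failure; the check names and the 200/503 mapping are illustrative choices, not decisions recorded in this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ReadinessReport:
    ready: bool
    checks: Dict[str, str]


def evaluate_readiness(checks: Dict[str, Callable[[], None]]) -> ReadinessReport:
    """Run each dependency check; a check passes if it returns without raising."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Report only the exception type so connection strings or
            # credentials in the message never reach the response body.
            results[name] = f"failed: {type(exc).__name__}"
    ready = all(status == "ok" for status in results.values())
    return ReadinessReport(ready=ready, checks=results)


def http_status(report: ReadinessReport) -> int:
    """Map a readiness report to an HTTP status code."""
    return 200 if report.ready else 503
```

A liveness endpoint would normally skip the dependency checks and only confirm that the process can respond, so that a Postgres outage marks the API as not ready without triggering a restart.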

Usage signals should cover aggregate counts for:

  • login attempts and successful logins
  • organizations and workspaces created
  • content items created
  • comments created
  • approval decisions submitted
  • feedback reports submitted
  • workspace invites created
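
The aggregate nature of these counts can be sketched as counters keyed by event name only, with no per-user dimension. The event names below are hypothetical labels derived from the list above, and the in-memory store stands in for whatever metrics API the implementation ends up using.

```python
from collections import Counter
from threading import Lock
from typing import Dict

# Hypothetical event names mirroring the usage signals listed above.
ALLOWED_EVENTS = frozenset({
    "login_attempt",
    "login_success",
    "organization_created",
    "workspace_created",
    "content_item_created",
    "comment_created",
    "approval_decision_submitted",
    "feedback_report_submitted",
    "workspace_invite_created",
})


class UsageCounters:
    """Thread-safe aggregate counters keyed by event name only."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._lock = Lock()

    def record(self, event: str, amount: int = 1) -> None:
        """Increment the counter for a known event by a positive amount."""
        if event not in ALLOWED_EVENTS:
            raise ValueError(f"unknown usage event: {event}")
        if amount < 1:
            raise ValueError("amount must be a positive integer")
        with self._lock:
            self._counts[event] += amount

    def snapshot(self) -> Dict[str, int]:
        """Return a copy of the current totals for export or inspection."""
        with self._lock:
            return dict(self._counts)
```

Rejecting unknown event names keeps the set of exported series small and fixed, which helps avoid accidental high-cardinality metrics.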

Privacy And Safety Rules

  • Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
  • Usage metrics are aggregate operational signals, not behavioral tracking.
  • User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
  • The first implementation targets preproduction and self-hosted Docker infrastructure only.
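
The first rule could be backed by a redaction step applied to structured log fields before they are written. The sketch below is a safety net under stated assumptions: the key names in `SENSITIVE_KEYS` are guesses at how such fields might be named, and the preferred approach remains not passing sensitive values to the logger at all.

```python
from collections.abc import Mapping
from typing import Any, Dict

# Assumed key names; the real field names depend on the backend code.
SENSITIVE_KEYS = frozenset({
    "password",
    "access_token",
    "refresh_token",
    "authorization",
    "body",
    "file_contents",
    "screenshot",
    "content",
})

REDACTED = "[redacted]"


def redact(fields: Mapping) -> Dict[str, Any]:
    """Return a copy of `fields` with sensitive values replaced.

    Matching is by key name (case-insensitive). Nested mappings, and
    mappings inside lists or tuples, are redacted recursively.
    """
    clean: Dict[str, Any] = {}
    for key, value in fields.items():
        if str(key).lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, Mapping):
            clean[key] = redact(value)
        elif isinstance(value, (list, tuple)):
            clean[key] = [
                redact(item) if isinstance(item, Mapping) else item
                for item in value
            ]
        else:
            clean[key] = value
    return clean
```

Because matching is by key name, a secret stored under an unexpected key or embedded in a free-text message would pass through unchanged, which is why this can only complement the rule rather than enforce it.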

Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.

The preproduction observability stack runs as an optional Docker Compose overlay with:

  • Grafana for dashboards and alerting
  • Prometheus for metrics
  • Loki for logs
  • Tempo for traces
  • Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.
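
One way to keep the base compose file usable on its own is to make telemetry export conditional on configuration. The sketch below reads the standard OpenTelemetry environment variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SERVICE_NAME`; treating an unset endpoint as "export disabled" and the default service name `socialize-api` are assumptions of this example, not behavior specified by this document.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetrySettings:
    enabled: bool
    endpoint: Optional[str]
    service_name: str


def load_telemetry_settings(env: Mapping[str, str]) -> TelemetrySettings:
    """Derive telemetry settings from environment variables.

    Export is enabled only when an OTLP endpoint is configured, so the
    application starts cleanly when no collector is running.
    """
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip() or None
    # Hypothetical default service name for this sketch.
    service_name = env.get("OTEL_SERVICE_NAME", "").strip() or "socialize-api"
    return TelemetrySettings(
        enabled=endpoint is not None,
        endpoint=endpoint,
        service_name=service_name,
    )
```

Under this approach the observability overlay would set the endpoint, for example to `http://alloy:4317` (a hypothetical hostname; 4317 is the default OTLP/gRPC port), while the base compose file would leave it unset.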

Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

  • app telemetry missing
  • high API error rate
  • high API p95 latency
  • core usage unexpectedly quiet
  • bug reports submitted through feedback
  • email delivery failures
  • blob storage failures
  • background job failures
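
The threshold logic behind the error-rate and p95 latency alerts can be illustrated in plain code. This is only a sketch of the arithmetic: actual Prometheus rules would be written as expressions over the collected metrics, and the threshold values and alert names here are placeholders rather than agreed targets.

```python
from typing import List, Sequence


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests <= 0:
        return 0.0
    return failed_requests / total_requests


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    if not latencies_ms:
        raise ValueError("latency sample is empty")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def firing_alerts(
    total_requests: int,
    failed_requests: int,
    latencies_ms: Sequence[float],
    max_error_rate: float = 0.05,  # placeholder threshold
    max_p95_ms: float = 1000.0,  # placeholder threshold
) -> List[str]:
    """Return the names of alerts whose thresholds are exceeded."""
    alerts: List[str] = []
    if error_rate(total_requests, failed_requests) > max_error_rate:
        alerts.append("high_api_error_rate")
    if latencies_ms and p95(latencies_ms) > max_p95_ms:
        alerts.append("high_api_p95_latency")
    return alerts
```

A window with no requests produces no alert from this function, which is why "app telemetry missing" and "core usage unexpectedly quiet" need their own rules based on the absence of data.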