Observability

Status

Draft

Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

Initial Scope

structured backend logs suitable for centralized log search
OpenTelemetry traces and metrics emitted by the API
self-hosted Grafana observability stack for preproduction
health, readiness, and liveness endpoints
aggregate product usage counters for core workflow actions
dashboards and alerts for app health and adoption signals
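
As a rough illustration of how the liveness and readiness endpoints could differ, the sketch below maps a probe path to an HTTP status and a JSON body. The paths `/health/live` and `/health/ready`, the function names, and the response shape are all assumptions made for this example; the spec does not define them.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical routes; the spec does not name the actual paths.
LIVENESS_PATH = "/health/live"
READINESS_PATH = "/health/ready"

Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> Dict[str, str]:
    """Run each dependency check, treating any exception as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"
    return results


def handle_probe(path: str, checks: Dict[str, Check]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a probe request."""
    if path == LIVENESS_PATH:
        # Liveness only reports that the process is running.
        # It deliberately does not touch any dependency.
        return 200, json.dumps({"status": "alive"})
    if path == READINESS_PATH:
        results = run_checks(checks)
        ready = all(value == "ok" for value in results.values())
        body = json.dumps(
            {"status": "ready" if ready else "not_ready", "checks": results}
        )
        return (200 if ready else 503), body
    return 404, json.dumps({"status": "unknown_probe"})
```

A readiness check for Postgres would be passed in as a callable, for example `handle_probe("/health/ready", {"postgres": ping_postgres})`, where `ping_postgres` is a hypothetical function that returns `True` when a trivial query succeeds.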

Operational Signals

Health signals should cover:

API availability
Postgres connectivity
request rate, latency, and error rate
slow endpoints
outbound HTTP failures
background service failures
email delivery failures
blob storage failures
authentication failures
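
To make the request rate, latency, error rate, and slow endpoint signals concrete, here is a minimal sketch that derives them from a window of request samples. In the planned stack these numbers would more likely come from OpenTelemetry metrics queried in Prometheus; the `RequestSample` type, the function names, the choice of 5xx as the error definition, and the nearest-rank percentile method are assumptions for illustration only.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestSample:
    endpoint: str
    status_code: int
    duration_ms: float


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: List[RequestSample], window_seconds: float) -> Dict[str, float]:
    """Request rate, 5xx error rate, and p95 latency for one time window."""
    if not samples:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
    errors = sum(1 for s in samples if s.status_code >= 500)
    return {
        "rate_per_s": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "p95_ms": percentile([s.duration_ms for s in samples], 95),
    }


def slow_endpoints(samples: List[RequestSample], threshold_ms: float) -> List[str]:
    """Endpoints whose own p95 latency exceeds threshold_ms."""
    by_endpoint = defaultdict(list)
    for s in samples:
        by_endpoint[s.endpoint].append(s.duration_ms)
    return sorted(
        endpoint
        for endpoint, durations in by_endpoint.items()
        if percentile(durations, 95) > threshold_ms
    )
```

Grouping by endpoint before taking the percentile keeps one slow route from hiding behind a fast overall p95.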

Usage signals should cover aggregate counts for:

login attempts and successful logins
organizations and workspaces created
content items created
comments created
approval decisions submitted
feedback reports submitted
workspace invites created
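
One way to keep usage signals aggregate is to count against a fixed list of action names and nothing else. The sketch below assumes hypothetical action names derived from the list above and a hypothetical `UsageCounters` class; a real implementation would probably register OpenTelemetry counters instead of holding counts in memory.

```python
import threading
from collections import Counter
from typing import Dict

# Hypothetical action names derived from the usage list above.
CORE_ACTIONS = frozenset({
    "login_attempt",
    "login_success",
    "organization_created",
    "workspace_created",
    "content_item_created",
    "comment_created",
    "approval_decision_submitted",
    "feedback_report_submitted",
    "workspace_invite_created",
})


class UsageCounters:
    """Thread-safe aggregate counters keyed only by action name."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, action: str) -> None:
        # Rejecting unknown names keeps ad hoc, per-user events out of the metrics.
        if action not in CORE_ACTIONS:
            raise ValueError(f"unknown usage action: {action}")
        with self._lock:
            self._counts[action] += 1

    def snapshot(self) -> Dict[str, int]:
        """Return every known action with its count, including zeros."""
        with self._lock:
            return {action: self._counts[action] for action in sorted(CORE_ACTIONS)}
```

Reporting zeros for actions that never fired makes a quiet workflow visible on a dashboard instead of simply absent.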

Privacy And Safety Rules

Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
Usage metrics are aggregate operational signals, not behavioral tracking.
User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
The first implementation targets preproduction and self-hosted Docker infrastructure only.
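
The logging rule above could be enforced at the point where structured log lines are built. The following is a minimal sketch under stated assumptions: the field names in `DENIED_FIELDS` and the `build_log_record` helper are hypothetical, and a deny list only catches fields whose names are known in advance.

```python
import json
from typing import Any

# Hypothetical field names that mirror the categories the rule forbids.
DENIED_FIELDS = frozenset({
    "request_body",
    "access_token",
    "refresh_token",
    "password",
    "file_contents",
    "screenshot",
    "content",
})


def build_log_record(event: str, **fields: Any) -> str:
    """Serialize one structured log line, dropping denied fields entirely."""
    safe = {k: v for k, v in fields.items() if k.lower() not in DENIED_FIELDS}
    dropped = sorted(k for k in fields if k.lower() in DENIED_FIELDS)
    record = {"event": event, **safe}
    if dropped:
        # Record only the names of dropped fields, never their values.
        record["dropped_fields"] = dropped
    return json.dumps(record, sort_keys=True, default=str)
```

An allow list of permitted attributes would be stricter than this deny list, at the cost of having to update it whenever a new safe attribute is introduced.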

Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.
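
A sketch of how the API might resolve its collector settings at startup is shown below. `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_SERVICE_NAME`, and `OTEL_SDK_DISABLED` are standard OpenTelemetry environment variables, and `http://localhost:4317` is the conventional OTLP gRPC default. The `TelemetryConfig` type, the `load_telemetry_config` function, and the `socialize-api` service name are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetryConfig:
    enabled: bool
    endpoint: str
    service_name: str


def load_telemetry_config(env: Optional[Mapping[str, str]] = None) -> TelemetryConfig:
    """Read collector settings from the environment, with local defaults."""
    env = os.environ if env is None else env
    disabled = env.get("OTEL_SDK_DISABLED", "false").strip().lower() == "true"
    return TelemetryConfig(
        enabled=not disabled,
        # Conventional OTLP gRPC port on a local collector.
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        # Hypothetical service name; the spec does not define one.
        service_name=env.get("OTEL_SERVICE_NAME", "socialize-api"),
    )
```

Keeping these values in environment variables means the collector address can change per environment without a code change.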

The preproduction observability stack runs as an optional Docker Compose overlay with:

Grafana for dashboards and alerting
Prometheus for metrics
Loki for logs
Tempo for traces
Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.

Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

app telemetry missing
high API error rate
high API p95 latency
core usage unexpectedly quiet
bug reports submitted through feedback
email delivery failures
blob storage failures
background job failures
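
Two of the alerts above, "app telemetry missing" and "core usage unexpectedly quiet", fire on the absence of data and not on a high value. In Prometheus this kind of rule is commonly written with `absent()` or a zero-rate comparison. The sketch below shows the same logic in plain Python, using a hypothetical `stale_signals` function and hypothetical signal names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_signals(
    last_seen: Dict[str, Optional[datetime]],
    max_age: timedelta,
    now: datetime,
) -> List[str]:
    """Names of signals with no data point newer than max_age.

    A value of None means the signal has never reported at all.
    """
    return sorted(
        name
        for name, seen in last_seen.items()
        if seen is None or now - seen > max_age
    )
```

The quiet period that counts as "unexpected" depends on real traffic patterns, so `max_age` is left as a parameter and would need tuning per signal.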
