# Observability

## Status

Draft

## Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

## Initial Scope

- structured backend logs suitable for centralized log search
- OpenTelemetry traces and metrics emitted by the API
- self-hosted Grafana observability stack for preproduction
- health, readiness, and liveness endpoints
- aggregate product usage counters for core workflow actions
- dashboards and alerts for app health and adoption signals

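
The first scope item, structured backend logs, can be illustrated with a minimal sketch. It assumes a Python backend and uses only the standard library; the spec does not name the API's language, and the logger name `socialize.api` and the `workspace_id` attribute are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries by default; anything else came from `extra=`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy structured attributes supplied through `extra=`.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str = "socialize.api") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical name)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger().info("content item created", extra={"workspace_id": "ws_123"})` would then emit one searchable JSON line that a log collector can ship without extra parsing rules.
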
## Operational Signals

Health signals should cover:

- API availability
- Postgres connectivity
- request rate, latency, and error rate
- slow endpoints
- outbound HTTP failures
- background service failures
- email delivery failures
- blob storage failures
- authentication failures

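
As a rough sketch of how dependency health such as Postgres connectivity could feed a readiness endpoint, the function below runs a set of checks and maps the result to an HTTP status. The function name, the check names, and the response shape are all assumptions for illustration, not part of this spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""


def run_readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every dependency check and return (http_status, json_body).

    A check signals failure by raising; any exception marks it unhealthy.
    """
    results = []
    for name, check in checks.items():
        try:
            check()
            results.append(CheckResult(name, True))
        except Exception as exc:
            # Report only the exception type so error text never leaks details.
            results.append(CheckResult(name, False, type(exc).__name__))
    ready = all(r.healthy for r in results)
    body = {
        "status": "ready" if ready else "unavailable",
        "checks": {r.name: {"healthy": r.healthy, "detail": r.detail} for r in results},
    }
    return (200 if ready else 503), body
```

A liveness endpoint would typically skip the dependency checks entirely and only confirm that the process can answer, so a Postgres outage marks the API as not ready without triggering a restart.
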
Usage signals should cover aggregate counts for:

- login attempts and successful logins
- organizations and workspaces created
- content items created
- comments created
- approval decisions submitted
- feedback reports submitted
- workspace invites created

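
One way to keep these usage signals aggregate is to count against a small, fixed set of event names. The sketch below is an in-memory stand-in, not the OpenTelemetry metrics API; the event names and the `UsageCounters` class are hypothetical.

```python
import threading
from collections import Counter

# Hypothetical event names, one per usage signal listed above.
USAGE_EVENTS = frozenset({
    "login_attempt",
    "login_success",
    "organization_created",
    "workspace_created",
    "content_item_created",
    "comment_created",
    "approval_decision_submitted",
    "feedback_report_submitted",
    "workspace_invite_created",
})


class UsageCounters:
    """Thread-safe aggregate counters keyed only by event name."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, event: str) -> None:
        # Reject unknown names so the metric set stays small and fixed.
        if event not in USAGE_EVENTS:
            raise ValueError(f"unknown usage event: {event}")
        with self._lock:
            self._counts[event] += 1

    def snapshot(self) -> dict:
        """Return a copy of the current counts."""
        with self._lock:
            return dict(self._counts)
```

Because the counter key is the event name alone, no per-user series can appear by accident, which keeps the counts operational rather than behavioral.
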
## Privacy And Safety Rules

- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
- Usage metrics are aggregate operational signals, not behavioral tracking.
- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
- The first implementation targets preproduction and self-hosted Docker infrastructure only.

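
The logging rule above is best enforced by never passing sensitive values to the logger at all. As a backstop, a scrubber over structured attributes could look like the following sketch; the key names in `SENSITIVE_KEYS` are assumptions and would need to match the fields the backend actually handles.

```python
from typing import Any, Mapping

# Hypothetical key names covering the categories the rules forbid logging.
SENSITIVE_KEYS = frozenset({
    "body",
    "request_body",
    "access_token",
    "refresh_token",
    "password",
    "file_contents",
    "screenshot",
    "content",
})
REDACTED = "[redacted]"


def scrub(attributes: Mapping[str, Any]) -> dict:
    """Return a copy of `attributes` with sensitive values replaced."""
    clean = {}
    for key, value in attributes.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, Mapping):
            clean[key] = scrub(value)  # recurse into nested attribute maps
        else:
            clean[key] = value
    return clean
```

Identifiers such as a workspace ID pass through untouched, which matches the rule that user, organization, and workspace identifiers may appear as structured attributes.
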
## Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.

The preproduction observability stack runs as an optional Docker Compose overlay with:

- Grafana for dashboards and alerting
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.

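
To keep the application usable without the overlay, export can be made opt-in through configuration. The sketch below reads the standard OpenTelemetry variables `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_SERVICE_NAME`; treating an unset endpoint as "telemetry off" is this sketch's assumption rather than SDK behavior, and the default service name `socialize-api` is hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TelemetrySettings:
    enabled: bool
    endpoint: Optional[str]
    service_name: str


def load_telemetry_settings(env: Mapping[str, str] = os.environ) -> TelemetrySettings:
    """Enable OTLP export only when a collector endpoint is configured."""
    # An empty or missing endpoint means the overlay is not running,
    # so the app should start normally with telemetry export disabled.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip() or None
    service_name = env.get("OTEL_SERVICE_NAME", "socialize-api")
    return TelemetrySettings(
        enabled=endpoint is not None,
        endpoint=endpoint,
        service_name=service_name,
    )
```

Under this arrangement the overlay would set the endpoint variable for the API container (for example to a collector on port 4317, the conventional OTLP gRPC port), while the base compose file leaves it unset.
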
## Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

- app telemetry missing
- high API error rate
- high API p95 latency
- core usage unexpectedly quiet
- feedback bug reports submitted
- email delivery failures
- blob storage failures
- background job failures

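
The real rules would be written as Prometheus alert rules, but the conditions behind the error-rate and p95 alerts can be sketched in plain Python to make the arithmetic explicit. The thresholds and alert names below are hypothetical placeholders, not values chosen by this spec.

```python
from typing import List, Sequence

# Hypothetical thresholds; tune them against real preproduction traffic.
MAX_ERROR_RATE = 0.05
MAX_P95_SECONDS = 1.0


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return failed_requests / total_requests if total_requests else 0.0


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def firing_alerts(total: int, failed: int, latencies: Sequence[float]) -> List[str]:
    """Return the names of alert conditions met over one evaluation window."""
    alerts = []
    if error_rate(total, failed) > MAX_ERROR_RATE:
        alerts.append("high_api_error_rate")
    if latencies and p95(latencies) > MAX_P95_SECONDS:
        alerts.append("high_api_p95_latency")
    return alerts
```

Prometheus would normally estimate p95 from histogram buckets rather than raw samples, so this shows the intent of the rule rather than its exact evaluation.
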
## Workflow Health Gauges

Database-derived workflow health metrics should be sampled periodically instead of emitted per request.

Initial gauges should cover:

- content item counts by status
- feedback report counts by status
- pending workspace invites
- content stale in approval
- active workspace counts over 24-hour and 7-day windows

These are operator health signals. They should stay aggregate enough to avoid high-cardinality metric labels.

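
A periodic sampler for the first gauge might look like the sketch below. It uses `sqlite3` as a stand-in for Postgres so the example runs anywhere; the `content_items` table, its columns, and the status values are hypothetical.

```python
import sqlite3


def sample_content_status_counts(conn: sqlite3.Connection) -> dict:
    """Return {status: count} using one aggregate query per sampling tick."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM content_items GROUP BY status"
    ).fetchall()
    return {status: count for status, count in rows}


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE content_items (id INTEGER PRIMARY KEY, workspace_id TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO content_items (workspace_id, status) VALUES (?, ?)",
        [
            ("ws_1", "draft"),
            ("ws_1", "in_approval"),
            ("ws_2", "in_approval"),
            ("ws_2", "approved"),
        ],
    )
    return conn
```

A background task would call the sampler on a timer and publish each count as a gauge labeled only by `status`. Grouping by status alone, and not by workspace, is what keeps the label set low-cardinality.
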