feat: add preprod observability foundation
80
docs/FEATURES/observability.md
Normal file
@@ -0,0 +1,80 @@
# Observability

## Status

Draft

## Goal

Give the SaaS operator preproduction visibility into whether Socialize is healthy and whether real users are exercising core workflows.

This feature is operator-facing. It is not a client-facing analytics suite or status page.

## Initial Scope

- structured backend logs suitable for centralized log search
- OpenTelemetry traces and metrics emitted by the API
- self-hosted Grafana observability stack for preproduction
- health, readiness, and liveness endpoints
- aggregate product usage counters for core workflow actions
- dashboards and alerts for app health and adoption signals

## Operational Signals

Health signals should cover:

- API availability
- Postgres connectivity
- request rate, latency, and error rate
- slow endpoints
- outbound HTTP failures
- background service failures
- email delivery failures
- blob storage failures
- authentication failures

Usage signals should cover aggregate counts for:

- login attempts and successful logins
- organizations and workspaces created
- content items created
- comments created
- approval decisions submitted
- feedback reports submitted
- workspace invites created

## Privacy And Safety Rules

- Do not log request bodies, access tokens, refresh tokens, passwords, uploaded file contents, screenshots, or raw customer content.
- Usage metrics are aggregate operational signals, not behavioral tracking.
- User, organization, and workspace identifiers may be included as structured attributes when already available to backend code.
- The first implementation targets preproduction and self-hosted Docker infrastructure only.

## Deployment Shape

The application emits OpenTelemetry over OTLP to a local collector.

The preproduction observability stack runs as an optional Docker Compose overlay with:

- Grafana for dashboards and alerting
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana Alloy for log collection and telemetry routing

The normal application compose file must remain usable without the observability overlay.

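As a rough sketch of the overlay shape described above, `deploy/observability/compose.observability.yml` could look something like the following. The `api` service name, the image tags, the mounted config paths, and the `alloy:4317` OTLP endpoint are assumptions made for illustration; none of them are taken from the repository, and the tags in particular should be checked against the registries before use.

```yaml
# Hypothetical overlay, merged on top of deploy/compose.yml via a second -f flag.
# Relative paths resolve against the directory of the first -f file (deploy/).
services:
  api:                                    # assumed name of the API service in deploy/compose.yml
    environment:
      OTEL_SERVICE_NAME: socialize-api    # assumed service name
      OTEL_EXPORTER_OTLP_ENDPOINT: http://alloy:4317   # OTLP/gRPC to the local collector

  alloy:
    image: grafana/alloy:v1.3.1           # example tag only
    volumes:
      # Alloy's default command reads /etc/alloy/config.alloy; the config must
      # define an OTLP receiver listening on 4317 for the endpoint above to work.
      - ./observability/alloy/config.alloy:/etc/alloy/config.alloy:ro

  prometheus:
    image: prom/prometheus:v2.54.1        # example tag only
    volumes:
      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro

  loki:
    image: grafana/loki:3.1.1             # example tag only; uses the image's bundled config

  tempo:
    image: grafana/tempo:2.6.0            # example tag only
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo/tempo.yaml:/etc/tempo.yaml:ro

  grafana:
    image: grafana/grafana:11.2.0         # example tag only
    ports:
      - "127.0.0.1:3000:3000"             # published on loopback only
    depends_on: [prometheus, loki, tempo]
```

Because the overlay only adds services and environment variables, running `docker compose -f deploy/compose.yml up` on its own should be unaffected, which is the property the last sentence above asks for.
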
## Alerting

Preproduction alerting should start with local Prometheus alert rules. Notification routing is a separate operational setup step because the first preproduction target may use email, chat, or a private incident channel.

Initial alerts should cover:

- app telemetry missing
- high API error rate
- high API p95 latency
- core usage unexpectedly quiet
- feedback bug reports submitted
- email delivery failures
- blob storage failures
- background job failures

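To make the first three alerts in the list concrete, a Prometheus rules file might look roughly like the sketch below. It assumes the API's request metrics arrive in Prometheus under the OpenTelemetry-derived name `http_server_request_duration_seconds` with an `http_response_status_code` label; the file name, alert names, thresholds, and `for` durations are all placeholders, not values from the repository.

```yaml
# Hypothetical rules file, for example deploy/observability/prometheus/alerts.yml.
groups:
  - name: socialize-api-health
    rules:
      - alert: AppTelemetryMissing
        # Fires when no request-duration series is currently present in Prometheus.
        expr: absent(http_server_request_duration_seconds_count)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No API request metrics have been received for 10 minutes.

      - alert: HighApiErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
            /
          sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: More than 5% of API requests are failing with 5xx responses.

      - alert: HighApiP95Latency
        # 95th percentile request duration, estimated from histogram buckets.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: API p95 latency has been above 1 second for 10 minutes.
```

Rules like these only evaluate and surface in the Prometheus and Grafana UIs; where a firing alert is delivered stays with the separate notification routing step.
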
44
docs/TASKS/observability/001-observability-foundation.md
Normal file
@@ -0,0 +1,44 @@
# Observability 001: Preprod Foundation

## Goal

Add the first preproduction observability foundation for Socialize so the operator can tell whether the app is healthy and whether core workflows are being used.

## Feature Spec

- `docs/FEATURES/observability.md`

## Scope

- Add backend OpenTelemetry registration for traces and metrics.
- Add structured JSON console logging with request correlation context.
- Add aggregate custom counters for core usage events.
- Expand health endpoints with liveness and readiness checks.
- Add an optional Docker Compose observability overlay for Grafana, Prometheus, Loki, Tempo, and Alloy.
- Add basic Grafana datasource/dashboard provisioning.

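For the datasource provisioning item, a minimal sketch of a Grafana provisioning file is shown below. The hostnames assume the Compose service names `prometheus`, `loki`, and `tempo`, and the ports assume each backend's commonly used HTTP port (Tempo's in particular depends on its own config file); the file location is also an assumption.

```yaml
# Hypothetical file mounted into the Grafana container under
# /etc/grafana/provisioning/datasources/.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries the datasource
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200        # assumes Tempo's HTTP listener is configured on 3200
```

Provisioning from a file keeps the datasources reproducible across `docker compose up` runs instead of relying on manual setup in the Grafana UI.
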
## Likely Files

- `backend/src/Socialize.Api/Program.cs`
- `backend/src/Socialize.Api/ApplicationRegistration.cs`
- `backend/src/Socialize.Api/Infrastructure/Observability/*`
- selected backend handlers for usage counters
- `backend/src/Socialize.Api/Socialize.Api.csproj`
- `deploy/observability/*`
- `README.md`

## Out Of Scope

- Client-facing analytics or status page.
- Frontend behavioral analytics.
- Cloud telemetry providers.
- Long-term telemetry retention policy.
- Full product analytics warehouse.

## Validation

```bash
dotnet build backend/Socialize.slnx
dotnet test backend/Socialize.slnx
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
```

32
docs/TASKS/observability/002-alerts-dashboard-hardening.md
Normal file
@@ -0,0 +1,32 @@
# Observability 002: Alerts And Dashboard Hardening

## Goal

Make the preproduction observability stack actionable by adding alert rules, better operator dashboards, pinned image versions, and operational counters for services that commonly fail silently.

## Feature Spec

- `docs/FEATURES/observability.md`

## Scope

- Pin Grafana, Prometheus, Loki, Tempo, and Alloy image tags in the observability compose overlay.
- Add Prometheus alert rules for API health, error rate, latency, usage silence, feedback bugs, email failures, blob failures, and background job failures.
- Expand the Grafana dashboard with health, usage, operational failure, alert, log, and trace-oriented panels.
- Add backend counters for email delivery, blob storage operations, and background job runs.
- Document alerting and safe Grafana exposure expectations.

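The usage-silence and silent-failure rules in this scope depend on the new backend counters. The sketch below shows one way such rules could be written; the counter names (`socialize_usage_events_total`, `socialize_email_delivery_total`, `socialize_background_job_runs_total`), the `outcome` label, and the time windows are hypothetical and would need to match whatever the backend actually exports.

```yaml
# Hypothetical rules; metric and label names are placeholders.
groups:
  - name: socialize-operational
    rules:
      - alert: CoreUsageUnexpectedlyQuiet
        # No core usage events counted over the last 24 hours.
        # Returns nothing (and so does not fire) if the counter was never exported.
        expr: sum(increase(socialize_usage_events_total[24h])) == 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: No core workflow activity recorded in the last 24 hours.

      - alert: EmailDeliveryFailures
        # At least one failed email delivery in the last 15 minutes.
        expr: sum(increase(socialize_email_delivery_total{outcome="failure"}[15m])) > 0
        labels:
          severity: warning
        annotations:
          summary: Email deliveries are failing.

      - alert: BackgroundJobFailures
        # At least one failed background job run in the last 15 minutes.
        expr: sum(increase(socialize_background_job_runs_total{outcome="failure"}[15m])) > 0
        labels:
          severity: warning
        annotations:
          summary: Background job runs are failing.
```

Recording both successes and failures on one counter with an `outcome` label is a design choice assumed here; separate failure counters would work equally well with simpler selectors.
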
## Out Of Scope

- Notification delivery integration for alerts.
- Client-facing status page.
- Cloud observability backends.
- Full product analytics or session tracking.

## Validation

```bash
dotnet build backend/Socialize.slnx
dotnet test backend/Socialize.slnx
docker compose -f deploy/compose.yml -f deploy/observability/compose.observability.yml config
```
