Observability Design
Purpose
Design a comprehensive observability strategy covering metrics, logging, tracing, alerting, and SLI/SLO definitions. Produces a monitoring architecture that enables rapid incident detection, diagnosis, and resolution.
Inputs
- System architecture (services, databases, APIs, third-party dependencies)
- Current monitoring setup (existing tools, dashboards, alerts)
- Reliability requirements (SLA commitments, uptime targets)
- Team structure (on-call rotation, escalation paths)
Process
Step 1: Define Observability Pillars
Establish the three pillars for this system:
- Metrics: What to measure — request rate, error rate, latency, saturation, business KPIs
- Logs: What to record — request lifecycle, state changes, errors, audit events
- Traces: What to follow — cross-service request flows, database queries, external API calls
- Map each pillar to specific use cases: debugging, alerting, capacity planning, business intelligence
Step 2: Design Metric Collection
Define the metric taxonomy:
- Application metrics: Request count, error count, latency histograms, queue depth, cache hit rate
- Infrastructure metrics: CPU, memory, disk I/O, network throughput, connection pool utilization
- Business metrics: Sign-ups, conversions, revenue events, feature adoption rates
- Custom instrumentation: Counters (events), gauges (current values), histograms (distributions); see the instrumentation sketch after this list
- Specify metric naming conventions, label/tag strategy, and cardinality limits
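A minimal sketch of the custom instrumentation above, using the Python `prometheus_client` library. The metric names, labels, and bucket boundaries here are illustrative assumptions, not a prescribed catalog:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: monotonically increasing event count.
# Use route templates (e.g. "/orders/{id}") as the path label to keep cardinality bounded.
HTTP_REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests handled",
    ["method", "path", "status"],
)

# Histogram: latency distribution; bucket edges placed around the latency SLO threshold.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds",
    ["method", "path"],
    buckets=(0.05, 0.1, 0.2, 0.5, 1.0, 2.5, 5.0),
)

# Gauge: a current value that can move in either direction.
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs currently waiting in the worker queue")


def record_request(method: str, path: str, status: int, duration_s: float) -> None:
    """Record one handled request against the counter and histogram."""
    HTTP_REQUESTS.labels(method=method, path=path, status=str(status)).inc()
    REQUEST_DURATION.labels(method=method, path=path).observe(duration_s)


if __name__ == "__main__":
    start_http_server(9100)   # expose /metrics for the scraper
    QUEUE_DEPTH.set(12)       # gauges are set, not incremented
    record_request("GET", "/orders/{id}", 200, 0.042)
```

Seconds (rather than milliseconds) for the duration metric follows the Prometheus base-unit naming convention; either unit works as long as the metric name states it.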
Step 3: Define Alert Thresholds and Escalation
Design the alerting strategy:
- Warning alerts: Early indicators — elevated error rate, latency creep, resources approaching limits
- Critical alerts: Immediate action required — service down, error rate spike, SLO burn rate exceeded (see the burn-rate sketch after this list)
- Escalation paths: Primary on-call → secondary → engineering lead → incident commander
- Runbook links: Every alert includes a link to its diagnosis and remediation runbook
- Alert fatigue prevention: Grouping, deduplication, silence windows, alert quality reviews
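To make the burn-rate alerting concrete, here is a hedged Python sketch of the evaluation logic. The window/threshold pairs are illustrative assumptions loosely based on common multi-window burn-rate practice, not a recommendation for any particular service:

```python
from dataclasses import dataclass


@dataclass
class BurnRateAlert:
    """One window of a multi-window burn-rate alert.

    burn_rate = observed error rate / error budget rate allowed by the SLO.
    A burn rate of 1.0 consumes exactly the 30-day budget over 30 days.
    """
    window: str       # e.g. "1h" (fast burn) or "3d" (slow burn)
    threshold: float  # burn rate above which the alert fires
    severity: str     # "critical" pages, "warning" opens a ticket


# Illustrative thresholds for a 99.9% availability SLO (an assumption).
ALERTS = [
    BurnRateAlert(window="1h", threshold=14.4, severity="critical"),  # ~2% of budget per hour
    BurnRateAlert(window="6h", threshold=6.0, severity="critical"),
    BurnRateAlert(window="3d", threshold=1.0, severity="warning"),
]


def should_fire(error_rate: float, slo_target: float, alert: BurnRateAlert) -> bool:
    """Return True if the observed error rate exceeds this alert's burn-rate threshold."""
    budget_rate = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    burn_rate = error_rate / budget_rate
    return burn_rate > alert.threshold


# Example: 1.5% errors over the last hour against a 99.9% SLO -> burn rate 15 -> page.
print(should_fire(0.015, 0.999, ALERTS[0]))  # True
```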
Step 4: Plan Structured Logging
Design the logging architecture:
- Log levels: DEBUG (development only), INFO (normal operations), WARN (unexpected but handled), ERROR (requires attention)
- Structured fields: timestamp, service, request_id, user_id, action, duration_ms, status
- Correlation IDs: Request ID propagation across services for distributed request tracing
- PII redaction: Identify sensitive fields, implement automatic redaction/masking (see the logging sketch after this list)
- Log aggregation: Collection, indexing, retention periods, search capabilities
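A minimal sketch of the structured-logging pieces above using only the Python standard library. The field names follow the logging schema in the output format; the service name and the PII field list are assumptions for illustration:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation ID: set once at the service edge, carried via a context variable,
# and forwarded to downstream services (e.g. in a request header).
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

# Fields that must never be logged verbatim (illustrative, not exhaustive).
PII_FIELDS = {"email", "phone", "ssn", "card_number"}


def redact(fields: dict) -> dict:
    """Mask the values of known-sensitive fields before emission."""
    return {k: ("***REDACTED***" if k in PII_FIELDS else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, matching the logging schema."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "api",
            "request_id": request_id_var.get(),
            "message": record.getMessage(),
        }
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload)


logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One request lifecycle event with structured fields (the PII value gets masked).
request_id_var.set(str(uuid.uuid4()))
logger.info("signup completed", extra={"fields": {
    "action": "signup", "email": "user@example.com", "duration_ms": 41, "status": 201,
}})
```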
Step 5: Design Distributed Tracing
Plan request flow visibility:
- Span naming conventions: `service.operation` format, consistent across services
- Context propagation: How trace context passes between services (headers, message metadata)
- Sampling strategy: Head-based vs tail-based sampling, sampling rate by endpoint or error status
- Trace enrichment: Adding business context (user tier, feature flag state) to spans (see the tracing sketch after this list)
- Critical paths: Which request flows must always be traced (payments, auth, data mutations)
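A hedged sketch of these tracing decisions using the OpenTelemetry Python SDK, assumed here as the tracing library. The service name, sampling rate, span attributes, and payment endpoint are illustrative assumptions:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based sampling: keep 10% of traces, but honor the parent's decision so a
# sampled request stays sampled across services. (The 10% rate is illustrative.)
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
)
tracer = trace.get_tracer("checkout-service")


def charge_customer(user_tier: str, amount_cents: int) -> None:
    # Span name follows the service.operation convention from Step 5.
    with tracer.start_as_current_span("checkout.charge_customer") as span:
        # Trace enrichment: attach business context to the span.
        span.set_attribute("user.tier", user_tier)
        span.set_attribute("payment.amount_cents", amount_cents)

        # Context propagation: inject the current trace context into outgoing
        # request headers so the downstream service can continue the trace.
        headers: dict = {}
        inject(headers)  # adds the W3C traceparent header
        # http_client.post("https://payments.internal/charge", headers=headers, ...)


charge_customer("premium", 4999)
```

For flows that must always be traced (payments, auth, data mutations), a tail-based or rule-based sampler would replace the simple ratio sampler shown here.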
Step 6: Specify Dashboard Requirements
Define dashboard hierarchy:
- Operational dashboards: Service health overview, real-time traffic, error rates, latency percentiles
- Business dashboards: User activity, feature adoption, conversion funnels, revenue metrics
- SLO dashboards: Error budget remaining, burn rate, SLO compliance history
- Incident dashboards: Pre-built investigation views for common failure modes
- Specify dashboard layout, refresh intervals, time range defaults, and access controls
Step 7: Define SLIs/SLOs
Establish reliability targets:
- Availability SLI: Successful requests / total requests (define "successful")
- Latency SLI: Proportion of requests faster than threshold (p50, p95, p99 targets)
- Error rate SLI: Proportion of requests without errors (define "error")
- SLO targets: e.g., 99.9% availability, p95 latency < 200ms, error rate < 0.1%
- Error budgets: Calculate the error budget from the SLO, define burn rate alerts (fast burn, slow burn); see the calculation sketch after this list
- SLO review cadence: Weekly error budget check, monthly SLO review, quarterly target adjustment
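The arithmetic behind the error budget and burn-rate figures can be made explicit with a short Python sketch; the example numbers mirror the SLI/SLO table in the output format below:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def availability_sli(successful: int, total: int) -> float:
    """Availability SLI: successful requests / total requests."""
    return successful / total if total else 1.0


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_rate / (1.0 - slo_target)


# 99.9% over 30 days leaves 43.2 minutes of downtime budget,
# matching the SLI/SLO table in the output format.
print(round(error_budget_minutes(0.999), 1))            # 43.2
print(round(availability_sli(999_000, 1_000_000), 4))   # 0.999
print(round(burn_rate(0.005, 0.999), 1))                # 5.0 -> slow-burn territory
```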
Output Format
````markdown
# Observability Design: [Service/Feature Name]

## Observability Architecture

```
[Application] → [Metrics Agent] → [Metrics Store] → [Dashboards]
        ↓                               ↓
[Structured Logs] → [Log Aggregator] → [Log Search]    [Alerts] → [On-call]
        ↓
[Trace SDK] → [Trace Collector] → [Trace UI]
```
## Metric Catalog
| Metric Name | Type | Labels | Description | Alert Threshold |
|-------------|------|--------|-------------|-----------------|
| http_requests_total | counter | method, path, status | Request count | N/A |
| http_request_duration_ms | histogram | method, path | Request latency | p95 > 500ms |
| ... | ... | ... | ... | ... |
## Alert Catalog
| Alert Name | Severity | Condition | Duration | Runbook |
|------------|----------|-----------|----------|---------|
| HighErrorRate | critical | error_rate > 5% | 5m | [link] |
| LatencyDegraded | warning | p95 > 500ms | 10m | [link] |
| ... | ... | ... | ... | ... |
## Logging Schema
```json
{
"timestamp": "ISO8601",
"level": "INFO",
"service": "api",
"request_id": "uuid",
"user_id": "string (optional)",
"action": "string",
"duration_ms": "number",
"status": "number",
"message": "string"
}
```

## SLI/SLO Definitions
| SLI | Measurement | SLO Target | Error Budget (30d) |
|---|---|---|---|
| Availability | successful requests / total | 99.9% | 43.2 min downtime |
| Latency | requests < 200ms / total | 99.0% | 432 min slow |
| Error Rate | non-error requests / total | 99.9% | 0.1% errors |
## Dashboard Specifications
| Dashboard | Audience | Key Panels | Refresh |
|---|---|---|---|
| Service Health | On-call | Traffic, errors, latency, saturation | 30s |
| SLO Status | Engineering | Error budget, burn rate, compliance | 5m |
| Business Metrics | Product | Adoption, conversions, revenue | 1h |
````
Quality Checks
- [ ] All three observability pillars (metrics, logs, traces) are covered
- [ ] Every alert has a defined severity, threshold, and linked runbook
- [ ] Structured logging schema includes correlation IDs for distributed tracing
- [ ] PII fields are identified with a redaction strategy
- [ ] SLIs are measurable and SLO targets are realistic for the service tier
- [ ] Error budgets are calculated with burn rate alert thresholds
- [ ] Dashboard hierarchy covers operational, business, and SLO views
- [ ] Sampling strategy balances trace coverage with storage costs
Evolution Notes
<!-- Observations appended after each use -->