Monitoring & Observability
Overview
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
- •Setting up monitoring for new services
- •Designing alerts and dashboards
- •Troubleshooting performance issues
- •Implementing SLO tracking and error budgets
- •Choosing between monitoring tools
- •Integrating OpenTelemetry instrumentation
- •Analyzing metrics, logs, and traces
- •Optimizing Datadog costs and finding waste
- •Migrating from Datadog to open-source stack
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
1. Design Metrics Strategy
Start with The Four Golden Signals
Every service should monitor:
- •Latency: Response time (p50, p95, p99)
- •Traffic: Requests per second
- •Errors: Failure rate
- •Saturation: Resource utilization
For request-driven services, use the RED Method:
- •Rate: Requests/sec
- •Errors: Error rate
- •Duration: Response time
For infrastructure resources, use the USE Method:
- •Utilization: % time busy
- •Saturation: Queue depth
- •Errors: Error count
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))
# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Duration (p95 latency)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
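To make the p95 query less of a black box, here is a minimal Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. The function name estimate_quantile and the bucket values are hypothetical illustration data, and the sketch assumes the lowest bucket starts at 0; Prometheus itself handles more edge cases.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound
    and ending with the +Inf bucket, mirroring Prometheus `le` buckets.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for http_request_duration_seconds_bucket
example_buckets = [
    (0.1, 50.0),
    (0.25, 80.0),
    (0.5, 95.0),
    (1.0, 99.0),
    (math.inf, 100.0),
]
p95 = estimate_quantile(0.95, example_buckets)
p90 = estimate_quantile(0.90, example_buckets)
print(f"p95 = {p95:.3f}s, p90 = {p90:.3f}s")
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about (for example, your SLO threshold).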
Deep Dive: Metric Design
For comprehensive metric design guidance including:
- •Metric types (counter, gauge, histogram, summary)
- •Cardinality best practices
- •Naming conventions
- •Dashboard design principles
→ Read: references/metrics_design.md
Automated Metric Analysis
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
→ Script: scripts/analyze_metrics.py
2. Log Aggregation & Analysis
Structured Logging Checklist
Every log entry should include:
- •✅ Timestamp (ISO 8601 format)
- •✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- •✅ Message (human-readable)
- •✅ Service name
- •✅ Request ID (for tracing)
Example structured log (JSON):
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
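One way to produce entries shaped like the example above is a small formatter on top of Python's standard logging module. This is a minimal sketch, not the implementation from references/logging_guide.md; the JsonFormatter and build_logger names, and the choice to pass context fields through `extra=`, are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC, with the offset written as "Z"
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
            # request_id is expected via `extra=`; None when the caller omits it
            "request_id": getattr(record, "request_id", None),
        }
        # Optional context fields (names taken from the example entry)
        for key in ("user_id", "order_id", "error_type", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger


# Capture output in memory so the result can be inspected
buffer = io.StringIO()
logger = build_logger("payment-service", stream=buffer)
logger.error(
    "Payment processing failed",
    extra={
        "request_id": "550e8400-e29b-41d4-a716-446655440000",
        "error_type": "GatewayTimeout",
        "duration_ms": 5000,
    },
)
line = buffer.getvalue().strip()
print(line)
```

Keeping the required fields in the formatter rather than at each call site means every entry carries them, while request-specific context stays optional.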
Log Aggregation Patterns
ELK Stack (Elasticsearch, Logstash, Kibana):
- •Best for: Deep log analysis, complex queries
- •Cost: High (infrastructure + operations)
- •Complexity: High
Grafana Loki:
- •Best for: Cost-effective logging, Kubernetes
- •Cost: Low
- •Complexity: Medium
CloudWatch Logs:
- •Best for: AWS-centric applications
- •Cost: Medium
- •Complexity: Low
Log Analysis
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
Deep Dive: Logging
For comprehensive logging guidance including:
- •Structured logging implementation examples (Python, Node.js, Go, Java)
- •Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- •Query patterns and best practices
- •PII redaction and security
- •Sampling and rate limiting
→ Read: references/logging_guide.md
3. Alert Design
Alert Design Principles
- •Every alert must be actionable - If you can't do something, don't alert
- •Alert on symptoms, not causes - Alert on user experience, not components
- •Tie alerts to SLOs - Connect to business impact
- •Reduce noise - Only page for critical issues
Alert Severity Levels
| Severity | Response | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical
# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
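The thresholds 14.4 and 6 follow from simple arithmetic: burn rate is the observed error rate divided by the error budget, and the share of budget consumed is the burn rate multiplied by the window's share of the SLO period. A short sketch of that arithmetic, assuming a 30-day period (function names are hypothetical):

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring.

    error_rate and slo are fractions, e.g. 0.0144 and 0.999.
    """
    budget = 1.0 - slo
    return error_rate / budget


def budget_consumed(rate, window_hours, period_days=30):
    """Fraction of the period's error budget used if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def hours_to_exhaustion(rate, period_days=30):
    """Hours until the whole budget is gone at a constant burn rate."""
    return period_days * 24 / rate


fast = budget_consumed(14.4, window_hours=1)  # fast-burn alert threshold
slow = budget_consumed(6, window_hours=6)     # slow-burn alert threshold
print(f"fast burn uses {fast:.1%} of the budget in 1h")
print(f"slow burn uses {slow:.1%} of the budget in 6h")
print(f"at 14.4x the budget lasts {hours_to_exhaustion(14.4):.0f}h")
```

With these inputs the fast-burn rule fires at roughly 2% of the monthly budget spent in one hour, and the slow-burn rule at roughly 5% spent in six hours.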
Alert Quality Checker
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
Checks for:
- •Alert naming conventions
- •Required labels (severity, team)
- •Required annotations (summary, description, runbook_url)
- •PromQL expression quality
- •'for' clause to prevent flapping
→ Script: scripts/alert_quality_checker.py
Alert Templates
Production-ready alert rule templates:
→ Templates:
- •assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
- •assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts
Deep Dive: Alerting
For comprehensive alerting guidance including:
- •Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- •Alert annotation best practices
- •Alert routing (severity-based, team-based, time-based)
- •Inhibition rules
- •Runbook structure
- •On-call best practices
→ Read: references/alerting_best_practices.md
Runbook Template
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
4. Dashboard & Visualization
Dashboard Design Principles
- •Top-down layout: Most important metrics first
- •Color coding: Red (critical), yellow (warning), green (healthy)
- •Consistent time windows: All panels use same time range
- •Limit panels: 8-12 panels per dashboard maximum
- •Include context: Show related metrics together
Recommended Dashboard Structure
┌─────────────────────────────────────┐
│  Overall Health (Single Stats)      │
│  [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Request Rate & Errors (Graphs)     │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Latency Distribution (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Resource Usage (Graphs)            │
└─────────────────────────────────────┘
Generate Grafana Dashboards
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
Supports:
- •Web applications (requests, errors, latency, resources)
- •Kubernetes (pods, nodes, resources, network)
- •Databases (PostgreSQL, MySQL)
→ Script: scripts/dashboard_generator.py
5. SLO & Error Budgets
SLO Fundamentals
SLI (Service Level Indicator): Measurement of service quality
- •Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
- •Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
- •Example: 99.9% SLO = 0.1% error budget = 43.2 minutes of downtime per 30-day month
Common SLO Targets
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
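The downtime figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month; the function names are hypothetical and it is not a substitute for scripts/slo_calculator.py.

```python
def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of full downtime permitted by an availability SLO over the period."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * period_days * 24 * 60


def allowed_failures(total_requests, slo_percent):
    """Number of failed requests the error budget permits."""
    return total_requests * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min ({minutes / 60:.2f} h)")

# Request-based view: 1,000,000 requests at 99.9% allow about 1,000 failures,
# so 1,500 failed requests would overspend the budget.
budget = allowed_failures(1_000_000, 99.9)
overspent = 1_500 > budget
print(f"allowed failures: {budget:.0f}, overspent: {overspent}")
```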
SLO Calculator
Calculate compliance, error budgets, and burn rates:
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
→ Script: scripts/slo_calculator.py
Deep Dive: SLO/SLA
For comprehensive SLO/SLA guidance including:
- •Choosing appropriate SLIs
- •Setting realistic SLO targets
- •Error budget policies
- •Burn rate alerting
- •SLA structure and contracts
- •Monthly reporting templates
→ Read: references/slo_sla_guide.md
6. Distributed Tracing
When to Use Tracing
Use distributed tracing when you need to:
- •Debug performance issues across services
- •Understand request flow through microservices
- •Identify bottlenecks in distributed systems
- •Find N+1 query problems
OpenTelemetry Implementation
Python example:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        # payment_service is assumed to be defined elsewhere in the application
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
Sampling Strategies
- •Development: 100% (ALWAYS_ON)
- •Staging: 50-100%
- •Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
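from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # A head sampler only sees attributes that are set when the span starts
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if (trace_id & 0xFF) < 3:  # 3 of 256 values, ~1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"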
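The bitmask check above fixes the success sampling rate at 3/256. A more general form compares the low bits of the trace ID against a bound derived from the target rate, which keeps the decision deterministic per trace. The sketch below is stdlib-only and illustrative; the should_sample function here is hypothetical, and in a real deployment the SDK's built-in ratio-based sampler is likely the better starting point.

```python
import random

# Compare only the low 64 bits of the 128-bit trace ID
TRACE_ID_MASK = (1 << 64) - 1


def should_sample(trace_id, rate, is_error=False):
    """Keep every error; keep a deterministic `rate` fraction of everything else."""
    if is_error:
        return True
    bound = round(rate * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Simulate random trace IDs to check the effective sampling rate
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, rate=0.01) for t in trace_ids)
kept_fraction = kept / len(trace_ids)
print(f"kept {kept_fraction:.2%} of traces at a 1% target")
```

Because the decision depends only on the trace ID, every service that applies the same rate keeps or drops the same traces, so sampled traces stay complete.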
OTel Collector Configuration
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
- •Receives OTLP, Prometheus, and host metrics
- •Batching and memory limiting
- •Tail sampling (error-based, latency-based, probabilistic)
- •Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
Deep Dive: Tracing
For comprehensive tracing guidance including:
- •OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- •Span attributes and semantic conventions
- •Context propagation (W3C Trace Context)
- •Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- •Analysis patterns (finding slow traces, N+1 queries)
- •Integration with logs
→ Read: references/tracing_guide.md
7. Datadog Cost Optimization & Migration
Scenario 1: I'm Using Datadog and Costs Are Too High
If your Datadog bill is growing out of control, start by identifying waste:
Cost Analysis Script
Automatically analyze your Datadog usage and find cost optimization opportunities:
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
What it checks:
- •Infrastructure host count and cost
- •Custom metrics usage and high-cardinality metrics
- •Log ingestion volume and trends
- •APM host usage
- •Unused or noisy monitors
- •Container vs VM optimization opportunities
→ Script: scripts/datadog_cost_analyzer.py
Common Cost Optimization Strategies
1. Custom Metrics Optimization (typical savings: 20-40%):
- •Remove high-cardinality tags (user IDs, request IDs)
- •Delete unused custom metrics
- •Aggregate metrics before sending
- •Use metric prefixes to identify teams/services
2. Log Management (typical savings: 30-50%):
- •Implement log sampling for high-volume services
- •Use exclusion filters for debug/trace logs in production
- •Archive cold logs to S3/GCS after 7 days
- •Set log retention policies (15 days instead of 30)
3. APM Optimization (typical savings: 15-25%):
- •Reduce trace sampling rates (10% → 5% in prod)
- •Use head-based sampling instead of complete sampling
- •Remove APM from non-critical services
- •Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
- •Switch from VM-based to container-based pricing where possible
- •Remove agents from ephemeral instances
- •Use Datadog's host reduction strategies
- •Consolidate staging environments
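To see why high-cardinality tags dominate custom metric costs, note that the worst-case number of time series for a metric is the product of the distinct values of each tag. A rough sketch with made-up tag counts (all numbers and names below are hypothetical):

```python
from math import prod


def series_count(tag_cardinalities):
    """Worst-case time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


# Hypothetical tag cardinalities for a single request-latency metric
with_user_id = {"endpoint": 50, "status": 5, "region": 4, "user_id": 10_000}
without_user_id = {k: v for k, v in with_user_id.items() if k != "user_id"}

before = series_count(with_user_id)
after = series_count(without_user_id)
print(f"with user_id: {before:,} series")
print(f"without user_id: {after:,} series")
```

Real series counts are usually lower than the worst case because not every tag combination occurs, but the multiplication explains why dropping one unbounded tag has a larger effect than trimming several small ones.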
Scenario 2: Migrating Away from Datadog
If you're considering migrating to a more cost-effective open-source stack:
Migration Overview
From Datadog → To Open Source Stack:
- •Metrics: Datadog → Prometheus + Grafana
- •Logs: Datadog Logs → Grafana Loki
- •Traces: Datadog APM → Tempo or Jaeger
- •Dashboards: Datadog → Grafana
- •Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)
Migration Strategy
Phase 1: Run Parallel (Month 1-2):
- •Deploy open-source stack alongside Datadog
- •Migrate metrics first (lowest risk)
- •Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
- •Convert Datadog dashboards to Grafana
- •Translate alert rules (use DQL → PromQL guide below)
- •Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
- •Set up Loki for log aggregation
- •Deploy Tempo/Jaeger for tracing
- •Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
- •Confirm all functionality migrated
- •Cancel Datadog subscription
Query Translation: DQL → PromQL
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
→ Full Translation Guide: references/dql_promql_translation.md
Cost Comparison
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
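The totals in the table can be recomputed, or adapted to your own figures, with a short script. The numbers below are copied from the table, with "Included" treated as $0; the variable names are illustrative.

```python
# (Datadog annual cost, open-source annual cost) in USD, from the table above
costs = {
    "Infrastructure": (18_000, 10_000),
    "Custom Metrics": (600, 0),  # "Included" treated as no extra cost
    "Logs": (24_000, 3_000),
    "APM/Traces": (37_200, 5_000),
}

datadog_total = sum(datadog for datadog, _ in costs.values())
oss_total = sum(oss for _, oss in costs.values())
savings = datadog_total - oss_total
savings_pct = savings / datadog_total * 100

for component, (datadog, oss) in costs.items():
    print(f"{component}: saves ${datadog - oss:,}")
print(f"Total: ${datadog_total:,} vs ${oss_total:,}, saves ${savings:,} ({savings_pct:.0f}%)")
```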
Deep Dive: Datadog Migration
For comprehensive migration guidance including:
- •Detailed cost comparison and ROI calculations
- •Step-by-step migration instructions
- •Infrastructure sizing recommendations (CPU, RAM, storage)
- •Dashboard conversion tools and examples
- •Alert rule translation patterns
- •Application instrumentation changes (DogStatsD → Prometheus client)
- •Python scripts for exporting Datadog dashboards and monitors
- •Common challenges and solutions
→ Read: references/datadog_migration.md
8. Tool Selection & Comparison
Decision Matrix
Choose Prometheus + Grafana if:
- •✅ Using Kubernetes
- •✅ Want control and customization
- •✅ Have ops capacity
- •✅ Budget-conscious
Choose Datadog if:
- •✅ Want ease of use
- •✅ Need full observability now
- •✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
- •✅ Want open source full stack
- •✅ Cost-effective solution
- •✅ Cloud-native architecture
Choose ELK Stack if:
- •✅ Heavy log analysis needs
- •✅ Need powerful search
- •✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
- •✅ Single cloud provider
- •✅ Simple needs
- •✅ Want minimal setup
Cost Comparison (100 hosts, 1TB logs/month)
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
Deep Dive: Tool Comparison
For comprehensive tool comparison including:
- •Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- •Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- •Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- •Full-stack observability comparison
- •Recommendations by company size
→ Read: references/tool_comparison.md
9. Troubleshooting & Analysis
Health Check Validation
Validate health check endpoints against best practices:
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
Checks for:
- •✓ Returns 200 status code
- •✓ Response time < 1 second
- •✓ Returns JSON format
- •✓ Contains 'status' field
- •✓ Includes version/build info
- •✓ Checks dependencies
- •✓ Disables caching
→ Script: scripts/health_check_validator.py
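The checks listed above can be expressed as a pure function over an already-fetched response, which makes them easy to unit test. This is a sketch of the idea, not the logic of scripts/health_check_validator.py; the JSON field names it looks for (version, build, dependencies, checks) and the exact header match are assumptions.

```python
import json


def validate_health_response(status_code, headers, body, elapsed_seconds):
    """Return a list of failed checks for one health endpoint response.

    headers is a plain dict; this sketch looks up "Cache-Control" with that
    exact casing, so normalize header names before calling if needed.
    """
    failures = []
    if status_code != 200:
        failures.append("status code is not 200")
    if elapsed_seconds >= 1.0:
        failures.append("response took 1 second or longer")
    cache_control = headers.get("Cache-Control", "").lower()
    if "no-cache" not in cache_control and "no-store" not in cache_control:
        failures.append("caching is not disabled")
    try:
        payload = json.loads(body)
    except ValueError:
        failures.append("body is not valid JSON")
        return failures
    if not isinstance(payload, dict):
        failures.append("JSON body is not an object")
        return failures
    if "status" not in payload:
        failures.append("missing 'status' field")
    if not any(key in payload for key in ("version", "build")):
        failures.append("missing version/build info")
    if not any(key in payload for key in ("dependencies", "checks")):
        failures.append("no dependency checks reported")
    return failures


good_body = json.dumps(
    {"status": "ok", "version": "1.4.2", "dependencies": {"database": "ok"}}
)
good = validate_health_response(200, {"Cache-Control": "no-store"}, good_body, 0.12)
bad = validate_health_response(503, {}, "<html>down</html>", 2.5)
print("good:", good)
print("bad:", bad)
```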
Common Troubleshooting Workflows
High Latency Investigation:
- •Check dashboards for latency spike
- •Query traces for slow operations
- •Check database slow query log
- •Check external API response times
- •Review recent deployments
- •Check resource utilization (CPU, memory)
High Error Rate Investigation:
- •Check error logs for patterns
- •Identify affected endpoints
- •Check dependency health
- •Review recent deployments
- •Check resource limits
- •Verify configuration
Service Down Investigation:
- •Check if pods/instances are running
- •Check health check endpoint
- •Review recent deployments
- •Check resource availability
- •Check network connectivity
- •Review logs for startup errors
Quick Reference Commands
Prometheus Queries
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Kubernetes Commands
# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
Log Queries
Elasticsearch:
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
Loki (LogQL):
{job="app", level="error"} |= "error" | json
CloudWatch Insights:
# Set the window (for example, the last hour) with the query's time range rather than in the query text
fields @timestamp, level, message
| filter level = "error"
Resources Summary
Scripts (automation and analysis)
- •analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- •alert_quality_checker.py - Audit alert rules against best practices
- •slo_calculator.py - Calculate SLO compliance and error budgets
- •log_analyzer.py - Parse logs for errors and patterns
- •dashboard_generator.py - Generate Grafana dashboards from templates
- •health_check_validator.py - Validate health check endpoints
- •datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste
References (deep-dive documentation)
- •metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- •alerting_best_practices.md - Alert design, runbooks, on-call practices
- •logging_guide.md - Structured logging, aggregation patterns
- •tracing_guide.md - OpenTelemetry, distributed tracing
- •slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- •tool_comparison.md - Comprehensive comparison of monitoring tools
- •datadog_migration.md - Complete guide for migrating from Datadog to OSS stack
- •dql_promql_translation.md - Datadog Query Language to PromQL translation reference
Templates (ready-to-use configurations)
- •prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- •prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- •otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- •runbooks/incident-runbook-template.md - Incident response template
Best Practices
Metrics
- •Start with Four Golden Signals
- •Use appropriate metric types (counter, gauge, histogram)
- •Keep cardinality low (avoid high-cardinality labels)
- •Follow naming conventions
Logging
- •Use structured logging (JSON)
- •Include request IDs for tracing
- •Set appropriate log levels
- •Redact PII before logging
Alerting
- •Make every alert actionable
- •Alert on symptoms, not causes
- •Use multi-window burn rate alerts
- •Include runbook links
Tracing
- •Sample appropriately (1-10% in production)
- •Always record errors
- •Use semantic conventions
- •Propagate context between services
SLOs
- •Start with current performance
- •Set realistic targets
- •Define error budget policies
- •Review and adjust quarterly