Health Aggregation Skill
Aggregate system health from discovered observability sources with deployment correlation.
When to Use
- •Overall system health check
- •Pre/post deployment validation
- •SLO/SLA compliance monitoring
- •Capacity assessment
- •Proactive issue detection
- •User asks "health", "status", "how is production"
Health Assessment Approach
1. Discover Observability Sources
Use capability-based discovery to find available integrations:
# List all MCP servers ListMcpResourcesTool # Scan for capabilities by analyzing server descriptions and resources: # - error-tracking capability: Exception/error tracking and reporting # - apm capability: Application performance monitoring and metrics # - logging capability: Log aggregation, search, and analysis # - tracing capability: Distributed tracing and service mapping # - telemetry capability: Metrics collection and custom instrumentation
2. Aggregate Health Metrics
For each discovered source, collect:
- •Error rates: Current vs baseline
- •Performance: Latency (p50, p95, p99)
- •Throughput: Requests per second
- •SLO status: Availability, latency, error budgets
- •Alerts: Active alerts and severity
3. Calculate Health Score
HEALTHY: All metrics within SLO, no active alerts, stable trends DEGRADED: Some metrics elevated, minor alerts, or negative trends CRITICAL: SLO violations, critical alerts, or severe degradation
4. Correlate with Changes
Check for recent changes that might impact health:
- •Deployments (via CI/CD integrations or git history)
- •Code changes (via wicked-search)
- •Infrastructure changes
- •Traffic patterns
5. Provide Actionable Recommendations
Based on health status:
- •Immediate actions for critical issues
- •Investigation paths for degradations
- •Optimization opportunities
- •Capacity planning insights
Integration Discovery
This skill discovers integrations at runtime based on capability:
| Capability | What to Look For | Provides |
|---|---|---|
| error-tracking | Exception tracking, error reporting, crash analytics | Error rates, stack traces, user impact |
| apm | Performance monitoring, service metrics, observability | Latency, throughput, service health |
| logging | Log aggregation, log search, log analysis | Log aggregation, search, patterns |
| tracing | Distributed tracing, request tracing, trace analysis | Distributed traces, dependencies |
| telemetry | Metrics collection, custom instrumentation, time-series data | Custom metrics, instrumentation |
Fallback: If no integrations found, perform local analysis via wicked-search for error patterns in code.
See refs/sources.md for detailed capability discovery patterns.
Output Format
## System Health Report
**Overall Status**: [HEALTHY | DEGRADED | CRITICAL]
**Assessment Time**: {timestamp}
**Data Sources**: {list of integrations used}
### Health Summary
| Service | Status | Error Rate | Latency (p95) | SLO Status |
|---------|--------|------------|---------------|------------|
| {service} | {status} | {rate} | {latency} | {✓ or ✗} |
### Issues Detected
[For each issue]
**{Service}: {Issue Description}**
- Severity: [CRITICAL | HIGH | MEDIUM | LOW]
- Started: {timestamp}
- Metric: {specific metric and values}
- Pattern: {error pattern or behavior}
- Correlation: {deployment or change if found}
- Blast Radius: {impact scope}
### Trends (24h)
- Error Rates: {trend with percentage}
- Latency: {trend with percentage}
- Traffic: {trend with percentage}
### Recommendations
**Immediate**:
{critical actions needed now}
**Short-term**:
{optimizations and improvements}
**Capacity**:
{capacity planning insights}
Common Health Patterns
Post-Deployment Degradation
Error rates or latency increase after deployment. Correlate metrics with deployment time and consider rollback.
Gradual Performance Decline
Metrics slowly degrading over hours/days. Investigate memory leaks, growing data, cache efficiency.
Traffic-Related Issues
Performance degrades with traffic spikes. Check capacity utilization and scaling policies.
Cascading Failures
Single service failure causes downstream issues. Use traces to identify root cause and implement circuit breakers.
Integration with wicked-crew
When crew enters build phase:
- •Capture baseline health metrics
- •Monitor during deployment
- •Validate post-deployment health
- •Alert if degradation detected
Emit events:
- •
observe:health:checked:success - •
observe:health:degraded:warning - •
observe:health:critical:failure
Integration with wicked-engineering
When debugging issues, provide observability context:
- •Recent errors from discovered sources
- •Performance metrics around the issue
- •Correlation with code changes
- •Trace data for request flow
Notes
- •Prioritize data from real observability tools over static analysis
- •Always correlate health changes with deployments or code changes
- •Consider traffic patterns when assessing metrics
- •Document baseline metrics for comparison
- •Alert on SLO burn rate, not just absolute values