Log Analyzer

Goal: Produce a concise root-cause analysis (RCA) report with a summary, timeline, root cause, and suggested fixes, using logs from Loki, Docker, or the filesystem.

Description

Parses logs for error patterns, root causes, and correlations across services. Suggests fixes based on common patterns and runbooks.

Usage

  • "Why is [service] failing?"
  • "Analyze logs for [error]"
  • "What happened at [timestamp]?"
  • "Debug [service] errors in the last hour"

When run by a pipeline, also expose the structured fields below in your reply (e.g. in a final JSON block).

Pipeline Contract (rca-debug.lobster)

Inputs (from the pipeline or the user; the bin wrapper passes --service, --timeframe, and --error):

  • service (string): Service or container name to analyze.
  • timeframe (string, optional): e.g. "1h", "24h", "7d". Default "1h".
  • error_pattern (string, optional): Substring or regex to filter on (e.g. "500"); passed as --error by the wrapper.
  • log_level (string, optional): "ERROR" or "WARN". Default "ERROR".

Output: Return the full markdown report (see Output Format below). When invoked by a pipeline or with --json, also output a single JSON object with: error_count, top_error, top_error_type, first_error_at, timeline, root_cause, related_services. This allows a future wrapper or pipeline to parse the result.
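
For example, the JSON object might look like this (values are illustrative and mirror the sample report under Output Format):

```json
{
  "error_count": 247,
  "top_error": "Database connection timeout",
  "top_error_type": "connection timeout",
  "first_error_at": "2026-02-07T14:23:45Z",
  "timeline": [
    {"at": "2026-02-07T14:23:45Z", "type": "connection timeout", "message": "psycopg2.OperationalError: could not connect to server"},
    {"at": "2026-02-07T14:24:30Z", "type": "cascading failure", "message": "api_gateway returned 503: downstream auth-service unavailable"}
  ],
  "root_cause": "Database connection pool exhaustion under load",
  "related_services": ["api_gateway"]
}
```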


Implementation

Execute the steps in order. Do not run the Lobster tool or any pipeline.

Step 1: Locate and Query Logs

  • Normalize timeframe before querying: "1h" → last 1 hour, "24h" → last 24 hours, "7d" → last 7 days. Use this duration for every log source.
  • Prefer Loki MCP: if a Loki MCP server is available, use its tool (e.g. mcp_loki_* or the configured Loki MCP tool) with a LogQL query such as {job=~".+"} |= "<service>", optionally with a level filter, over the time range derived from timeframe. Parse log lines and timestamps from the response.
  • Fallback order if Loki is not available or returns nothing (a minimal sketch of these fallbacks follows this list):
    1. Docker: Run docker ps to find the container matching service, then docker logs --since <timeframe> <container> (e.g. --since 1h). Capture both stdout and stderr.
    2. journalctl: Run journalctl -u <service> --since "<normalized timeframe>" -p err (or -p warning), e.g. --since "1 hour ago". Capture the output.
    3. Workspace log files: Look under the workspace (e.g. logs/, var/log/) or paths given by the user; grep -E "error|Error|ERROR", deriving the timeframe from file mtime or timestamps in the log content.
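
A minimal sketch of the non-Loki fallbacks, assuming Python is available in the execution environment (the Loki MCP path depends on the configured tool and is omitted; the workspace grep fallback is likewise left out for brevity):

```python
# Sketch only: normalize the timeframe and collect logs from the Docker and
# journalctl fallbacks. Helper names and error handling are illustrative.
import subprocess

def normalize_timeframe(tf: str = "1h") -> tuple[str, str]:
    """Map '1h'/'24h'/'7d' to (docker, journalctl) --since values."""
    hours = {"1h": 1, "24h": 24, "7d": 24 * 7}.get(tf, 1)
    return f"{hours}h", ("1 hour ago" if hours == 1 else f"{hours} hours ago")

def collect_logs(service: str, timeframe: str = "1h") -> str:
    docker_since, journal_since = normalize_timeframe(timeframe)
    # Fallback 1: Docker -- find a container whose name contains the service,
    # then read its logs (docker logs emits the container's stdout and stderr).
    ps = subprocess.run(["docker", "ps", "--format", "{{.Names}}"],
                        capture_output=True, text=True)
    for name in ps.stdout.split():
        if service in name:
            out = subprocess.run(["docker", "logs", "--since", docker_since, name],
                                 capture_output=True, text=True)
            return out.stdout + out.stderr
    # Fallback 2: systemd journal for the service unit, priority err and above.
    out = subprocess.run(["journalctl", "-u", service, "--since", journal_since,
                          "-p", "err", "--no-pager"],
                         capture_output=True, text=True)
    return out.stdout
```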

Step 2: Filter and Parse

  • Filter by log_level (ERROR vs WARN). If error_pattern is given, keep only lines matching it.
  • Extract a timestamp from each line (ISO or common log formats), sort by time, and count occurrences per error type (normalize messages, e.g. by stripping IDs and numbers, so that variants of the same message count as one type). A minimal parsing sketch follows this list.
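
A minimal parsing sketch, assuming ISO-style timestamps and digit/ID collapsing as the normalization rule (both are assumptions; adapt to the actual log format):

```python
# Sketch only: filter by level and pattern, extract timestamps, and count
# occurrences per normalized error type.
import re
from collections import Counter
from datetime import datetime

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def parse_errors(raw: str, log_level: str = "ERROR", error_pattern: str = ""):
    events = []
    for line in raw.splitlines():
        if log_level.upper() not in line.upper():
            continue
        if error_pattern and not re.search(error_pattern, line):
            continue
        m = TS_RE.search(line)
        ts = datetime.fromisoformat(m.group(0).replace(" ", "T")) if m else None
        # Drop the timestamp, then collapse numbers and hex IDs so that variants
        # of the same error normalize to the same message.
        msg = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", TS_RE.sub("", line)).strip()
        events.append((ts, msg, line))
    events.sort(key=lambda e: (e[0] is None, e[0] or datetime.min))
    counts = Counter(msg for _, msg, _ in events)
    return events, counts  # events are time-ordered; counts drive top_error
```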

Step 3: Pattern Matching and Timeline

  • For each distinct error message, classify it: connection timeout, OOM, 404, 500, DB error, auth failure, etc. Set top_error and top_error_type from the most frequent type (a classification sketch follows this list).
  • Build the timeline: the first occurrence of each type, in time order. Set first_error_at to the earliest timestamp. Expose this as timeline (list or markdown).
  • Infer related_services from log content (e.g. "calling auth-service", "downstream api_gateway"), or leave it as [] if unknown.
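
A keyword-based classification sketch; the keyword table mirrors the Common Error Patterns section and is an assumption, not an exhaustive ruleset:

```python
# Sketch only: map raw log lines to coarse error types, then take the first
# occurrence of each type (in time order) as the timeline.
CLASSIFIERS = [
    ("connection timeout", ("connection timeout", "timed out", "could not connect")),
    ("OOM",                ("out of memory", "oomkilled", "memoryerror")),
    ("auth failure",       ("unauthorized", "forbidden", "invalid token")),
    ("DB error",           ("psycopg2", "sqlstate", "deadlock")),
    ("500",                ("500", "internal server error")),
    ("404",                ("404", "not found")),
]

def classify(message: str) -> str:
    low = message.lower()
    for label, needles in CLASSIFIERS:
        if any(n in low for n in needles):
            return label
    return "other"

def build_timeline(events):
    """events: time-sorted (timestamp, normalized_message, raw_line) tuples."""
    seen, timeline = set(), []
    for ts, msg, raw in events:
        etype = classify(raw)  # classify the raw line, not the normalized message
        if etype not in seen:
            seen.add(etype)
            timeline.append({"at": ts.isoformat() if ts else None,
                             "type": etype, "message": msg})
    return timeline
```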

Step 4: Root Cause and Fixes

  • Set root_cause: one paragraph summarizing the most likely cause (e.g. "Database connection pool exhaustion under load").
  • Append Suggested Fixes using the Common Error Patterns below. Optionally suggest a runbook path (e.g. /docs/runbooks/<topic>.md).
  • When invoked by a pipeline or with --json: after the markdown report, output a clearly delimited JSON block (a fenced code block marked json) with the keys error_count, top_error, top_error_type, first_error_at, timeline, root_cause, and related_services, as sketched below.
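
A minimal sketch of the delimited output, assuming the report and structured fields have already been assembled:

```python
# Sketch only: print the markdown report first, then a fenced JSON block that a
# wrapper or pipeline can locate and parse.
import json

def emit(report_md: str, fields: dict) -> None:
    print(report_md)
    print("\n```json")
    print(json.dumps(fields, indent=2, default=str))  # default=str renders datetimes
    print("```")
```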

Output Format

```
Log Analysis: auth-service (last 1 hour)

Summary:
- Total Errors: 247
- Top Error: "Database connection timeout" (189 occurrences, 76%)
- First Occurrence: 2026-02-07 14:23:45 UTC

Timeline:
14:23:45 - First "connection timeout" error
14:24:12 - Database pool exhausted (related)
14:24:30 - Cascading failures to api_gateway (503 errors)
14:25:00 - Load balancer health check failing

Error Breakdown:
1. Database connection timeout (189x)
   - "psycopg2.OperationalError: could not connect to server"
   - Pattern: Spike in concurrent requests
2. API Gateway 503 (42x)
   - Downstream dependency (auth_service) unavailable
3. Health check failures (16x)
   - Service unresponsive during database issues

Root Cause: Database Connection Pool Exhaustion
- Database unable to handle spike in concurrent connections
- Connection pool size (10) insufficient for traffic

Suggested Fixes:
1. Immediate: Scale database OR reduce connection pool size temporarily
2. Short-term: Increase connection pool size to 50
3. Long-term: Implement connection pooling with PgBouncer
4. Monitor: Set up alerts for connection pool saturation

Runbook: /docs/runbooks/database-connection-timeout.md
```

Common Error Patterns

  • Connection timeout → Database overload or network issue
  • Out of Memory → Memory leak or insufficient resources
  • 404 errors → Routing issue or missing endpoint
  • 500 errors → Unhandled exceptions in code
  • Cascading failures → Dependency chain breakage