# Log Analyzer

Goal: Produce a concise RCA report with summary, timeline, root cause, and suggested fixes using logs from Loki, Docker, or the filesystem.
## Description

Parses logs for error patterns, root causes, and correlations across services. Suggests fixes based on common patterns and runbooks.
## Usage

- "Why is [service] failing?"
- "Analyze logs for [error]"
- "What happened at [timestamp]?"
- "Debug [service] errors in the last hour"

When run by a pipeline, also expose the structured fields below in your reply (e.g. in a final JSON block).
## Pipeline Contract (rca-debug.lobster)

Inputs (from the pipeline or the user; the bin wrapper passes `--service`, `--timeframe`, `--error`):

- `service` (string): Service or container name to analyze.
- `timeframe` (string, optional): e.g. "1h", "24h", "7d". Default "1h".
- `error_pattern` (string, optional): Substring or regex to filter on (e.g. "500"); passed as `--error` by the wrapper.
- `log_level` (string, optional): "ERROR" or "WARN". Default "ERROR".

Output: Return the full markdown report (see Output Format below). When invoked by a pipeline or with `--json`, also output a single JSON object with: `error_count`, `top_error`, `top_error_type`, `first_error_at`, `timeline`, `root_cause`, `related_services`. This allows a future wrapper or pipeline to parse the result.
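The flag-to-input mapping above can be sketched as a small argparse shim. This is a hypothetical sketch, not the actual bin wrapper: the function name `parse_args` and the `--log-level` flag are assumptions (the contract only documents `--service`, `--timeframe`, `--error`, and `--json`).

```python
import argparse

def parse_args(argv=None):
    """Map the documented wrapper flags onto the pipeline contract inputs."""
    p = argparse.ArgumentParser(prog="rca-debug")
    p.add_argument("--service", required=True,
                   help="service or container name to analyze")
    p.add_argument("--timeframe", default="1h", help='e.g. "1h", "24h", "7d"')
    p.add_argument("--error", dest="error_pattern", default=None,
                   help="substring or regex filter; maps to error_pattern")
    p.add_argument("--log-level", default="ERROR", choices=["ERROR", "WARN"],
                   help="assumed flag for log_level; not in the documented wrapper")
    p.add_argument("--json", action="store_true",
                   help="also emit the structured JSON block")
    return p.parse_args(argv)
```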
## Implementation

Execute the steps in order. Do not run the Lobster tool or any pipeline.
### Step 1: Locate and Query Logs

- Normalize `timeframe` before querying: "1h" → 1 hour ago, "24h" → 24 hours, "7d" → 7 days. Use this duration for all log sources.
- Prefer Loki MCP first: If a Loki MCP is available, use the Loki tool (e.g. `mcp_loki_*` or the configured Loki MCP tool) with a query like `{job=~".*"} |= "<service>"` or `level=error` and the time range derived from `timeframe`. Parse log lines and timestamps from the response.
- Fallback order if Loki is not available or returns nothing:
  - Docker: Run `docker ps` to find the container matching `service`; then `docker logs --since <timeframe> <container>` (e.g. `--since 1h`). Capture stderr and stdout.
  - journalctl: Run `journalctl -u <service> --since "<timeframe>" -p err` (or `-p warning`). Capture output.
  - Workspace log files: Look under the workspace (e.g. `logs/`, `var/log/`) or paths given by the user; `grep -E "error|Error|ERROR"`, with the timeframe taken from file mtime or log content.
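The timeframe normalization and fallback command order can be sketched in Python. `parse_timeframe` and `fallback_commands` are hypothetical helper names; the commands mirror the Docker and journalctl bullets above.

```python
import re
from datetime import timedelta

UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_timeframe(tf: str = "1h") -> timedelta:
    """Convert shorthand like '1h', '24h', or '7d' into a timedelta."""
    m = re.fullmatch(r"(\d+)([smhd])", tf.strip())
    if not m:
        raise ValueError(f"unrecognized timeframe: {tf!r}")
    return timedelta(**{UNITS[m.group(2)]: int(m.group(1))})

def fallback_commands(service: str, tf: str = "1h") -> list[list[str]]:
    """Commands to try in order when Loki is unavailable: Docker, then journalctl.
    Docker accepts Go durations like '1h' directly; journalctl wants a
    systemd timestamp, so the duration is rewritten as 'N minutes ago'."""
    minutes = int(parse_timeframe(tf).total_seconds() // 60)
    return [
        ["docker", "logs", "--since", tf, service],
        ["journalctl", "-u", service, "--since", f"{minutes} minutes ago", "-p", "err"],
    ]
```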
### Step 2: Filter and Parse

- Filter by `log_level` (ERROR vs WARN). If `error_pattern` is given, keep only lines matching it.
- Extract timestamps from each line (ISO or common log formats). Sort by time. Count occurrences per error type (normalize: same message = same type).
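A minimal sketch of the filter-parse-count step, assuming ISO-style timestamps. The name `filter_and_count` and the digit-masking normalization are illustrative choices, not a prescribed implementation.

```python
import re
from collections import Counter

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def normalize(message: str) -> str:
    """Same message = same type: mask digits so only the error's shape remains."""
    return re.sub(r"\d+", "N", message).strip()

def filter_and_count(lines, log_level="ERROR", error_pattern=None):
    """Keep matching lines, sort them by timestamp, count per error type."""
    pat = re.compile(error_pattern) if error_pattern else None
    kept = []
    for line in lines:
        if log_level.lower() not in line.lower():
            continue                      # wrong level
        if pat and not pat.search(line):
            continue                      # does not match error_pattern
        ts = TS_RE.search(line)
        kept.append((ts.group(0) if ts else "", line.strip()))
    kept.sort(key=lambda pair: pair[0])   # ISO timestamps sort lexicographically
    counts = Counter(normalize(line) for _, line in kept)
    return kept, counts
```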
### Step 3: Pattern Matching and Timeline

- For each distinct error message, classify it: connection timeout, OOM, 404, 500, DB error, auth failure, etc. Set `top_error` and `top_error_type` from the most frequent.
- Build the timeline: first occurrence of each type, then by time order. Set `first_error_at` to the earliest timestamp. Expose as `timeline` (list or markdown).
- Infer `related_services` from log content (e.g. "calling auth-service", "downstream api_gateway") or leave `[]` if unknown.
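The classification and timeline-building step could look like the sketch below. The regex heuristics are illustrative examples covering the error types named above, not an exhaustive classifier.

```python
import re

# Heuristic (pattern, type) pairs; order matters, first match wins.
CLASSIFIERS = [
    (r"timed? ?out|timeout", "connection timeout"),
    (r"out of memory|oom", "OOM"),
    (r"\b404\b|not found", "404"),
    (r"\b500\b|internal server error", "500"),
    (r"psycopg2|sqlstate|database", "DB error"),
    (r"unauthorized|forbidden|auth", "auth failure"),
]

def classify(message: str) -> str:
    low = message.lower()
    for pattern, label in CLASSIFIERS:
        if re.search(pattern, low):
            return label
    return "other"

def build_timeline(events):
    """events: (timestamp, message) pairs already sorted by time.
    Returns the first occurrence of each error type, in time order
    (dicts preserve insertion order, so sorted input gives a sorted timeline)."""
    first_seen: dict[str, str] = {}
    for ts, msg in events:
        first_seen.setdefault(classify(msg), ts)
    return [{"time": ts, "error_type": etype} for etype, ts in first_seen.items()]
```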
### Step 4: Root Cause and Fixes

- Set `root_cause`: one paragraph summarizing the most likely cause (e.g. "Database connection pool exhaustion under load").
- Append Suggested Fixes using the Common Error Patterns below. Optionally suggest a runbook path (e.g. `/docs/runbooks/<topic>.md`).
- When invoked by a pipeline or with `--json`: after the markdown report, output a clearly delimited JSON block (e.g. a fenced `json` code block) with keys: `error_count`, `top_error`, `top_error_type`, `first_error_at`, `timeline`, `root_cause`, `related_services`.
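Emitting the delimited JSON block might be sketched like this; `json_block` is a hypothetical helper that validates the contract keys and wraps the payload in a fenced block.

```python
import json

REQUIRED = ["error_count", "top_error", "top_error_type",
            "first_error_at", "timeline", "root_cause", "related_services"]

def json_block(fields: dict) -> str:
    """Render the delimited JSON block appended after the markdown report,
    failing loudly if any contract key is missing."""
    missing = [k for k in REQUIRED if k not in fields]
    if missing:
        raise ValueError(f"missing contract fields: {missing}")
    return "```json\n" + json.dumps(fields, indent=2) + "\n```"
```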
## Output Format

```
Log Analysis: auth-service (last 1 hour)

Summary:
- Total Errors: 247
- Top Error: "Database connection timeout" (189 occurrences, 76%)
- First Occurrence: 2026-02-07 14:23:45 UTC

Timeline:
14:23:45 - First "connection timeout" error
14:24:12 - Database pool exhausted (related)
14:24:30 - Cascading failures to api_gateway (503 errors)
14:25:00 - Load balancer health check failing

Error Breakdown:
1. Database connection timeout (189x)
   - "psycopg2.OperationalError: could not connect to server"
   - Pattern: Spike in concurrent requests
2. API Gateway 503 (42x)
   - Downstream dependency (auth_service) unavailable
3. Health check failures (16x)
   - Service unresponsive during database issues

Root Cause: Database Connection Pool Exhaustion
- Database unable to handle spike in concurrent connections
- Connection pool size (10) insufficient for traffic

Suggested Fixes:
1. Immediate: Scale database OR reduce connection pool size temporarily
2. Short-term: Increase connection pool size to 50
3. Long-term: Implement connection pooling with PgBouncer
4. Monitor: Set up alerts for connection pool saturation

Runbook: /docs/runbooks/database-connection-timeout.md
```
## Common Error Patterns

- Connection timeout → Database overload or network issue
- Out of Memory → Memory leak or insufficient resources
- 404 errors → Routing issue or missing endpoint
- 500 errors → Unhandled exceptions in code
- Cascading failures → Dependency chain breakage