Workflow Debugger

When a Gorgon workflow fails at step 4 of 7, this skill figures out why and what to do about it. Think of it as the technical-debt-auditor for workflow runs instead of repositories.

When to Activate

•"Why did this workflow fail?"
•"The builder agent produced garbage output"
•"This workflow ran out of budget at step 3"
•"The pipeline hung and never completed"
•"Debug the last run"
•Post-mortem analysis after any workflow execution

Failure Taxonomy

Workflows fail for a finite set of reasons. The category a failure falls into determines the fix.

Category	Symptoms	Root Cause	Fix
Contract Violation	Agent output doesn't match expected schema	Prompt ambiguity, missing output spec	Tighten agent instructions, add validation
Budget Exhaustion	Agent hits token limit mid-response	Task too large for budget, or agent is rambling	Increase budget or decompose task
Timeout	Agent doesn't complete in allotted time	Task too complex, or infinite loop in tool use	Increase timeout or simplify task
Dependency Failure	Upstream agent output missing or malformed	Previous agent failed silently	Add output validation between stages
Context Overflow	Agent loses track of instructions in long context	Too much injected context, or conversation too long	Compress context, split workflow
Hallucination	Agent fabricates files, APIs, or capabilities	Insufficient grounding in context map	Better context mapping, add verification
Checkpoint Corruption	Resume from checkpoint produces different results	State not fully captured at checkpoint	Review checkpoint serialization
External Failure	API rate limit, Docker timeout, network error	Infrastructure, not workflow logic	Retry with backoff, or fix infrastructure
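As a first triage step, the taxonomy above can be encoded as an enum with a keyword-based guess. This is a minimal sketch: `FailureCategory`, `HINTS`, and `guess_category` are hypothetical names, and the keyword lists are illustrative assumptions, not messages Gorgon is known to emit. A guess only narrows the search and still has to be confirmed against logs and outputs.

```python
from enum import Enum


class FailureCategory(Enum):
    CONTRACT_VIOLATION = "Contract Violation"
    BUDGET_EXHAUSTION = "Budget Exhaustion"
    TIMEOUT = "Timeout"
    DEPENDENCY_FAILURE = "Dependency Failure"
    CONTEXT_OVERFLOW = "Context Overflow"
    HALLUCINATION = "Hallucination"
    CHECKPOINT_CORRUPTION = "Checkpoint Corruption"
    EXTERNAL_FAILURE = "External Failure"


# Hypothetical keyword hints; real error messages will vary by deployment.
# Order matters: infrastructure hints are checked before generic timeouts.
HINTS = [
    (FailureCategory.EXTERNAL_FAILURE, ("rate limit", "network", "docker")),
    (FailureCategory.TIMEOUT, ("timed out", "timeout")),
    (FailureCategory.BUDGET_EXHAUSTION, ("token limit", "budget")),
    (FailureCategory.DEPENDENCY_FAILURE, ("dependency failed",)),
    (FailureCategory.CONTRACT_VIOLATION, ("keyerror", "schema", "validation")),
]


def guess_category(error_message):
    """Return the first category whose hint appears in the message, else None."""
    text = (error_message or "").lower()
    for category, keywords in HINTS:
        if any(keyword in text for keyword in keywords):
            return category
    return None
```

Categories such as Hallucination or Context Overflow rarely show up in an error string, so this sketch deliberately returns `None` for them rather than forcing a match.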

Diagnostic Procedure

Step 1: Locate the Failure Point

bash

# Read Gorgon checkpoint database
sqlite3 .gorgon/checkpoints.db "
  SELECT agent_role, status, started_at, completed_at, error_message
  FROM checkpoints
  WHERE workflow_run_id = '{run_id}'
  ORDER BY started_at IS NULL, started_at  -- skipped agents have a NULL start time; sort them last
"

Expected output:

code

scanner     | completed | 2026-02-12 10:00:01 | 2026-02-12 10:00:45 | NULL
executor    | completed | 2026-02-12 10:00:46 | 2026-02-12 10:02:12 | NULL
analyzer    | failed    | 2026-02-12 10:02:13 | 2026-02-12 10:02:58 | "KeyError: 'tests'"
reporter    | skipped   | NULL                 | NULL                 | "dependency failed"

→ Failure at analyzer, caused by missing key in executor output.
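The same lookup can be scripted. The sketch below assumes the `checkpoints` table and column names shown in the query above; `first_failure` is a hypothetical helper, not part of Gorgon.

```python
import sqlite3


def first_failure(conn, run_id):
    """Return (agent_role, error_message) for the earliest failed checkpoint, or None."""
    return conn.execute(
        """
        SELECT agent_role, error_message
        FROM checkpoints
        WHERE workflow_run_id = ? AND status = 'failed'
        ORDER BY started_at
        LIMIT 1
        """,
        (run_id,),  # bound parameter instead of string interpolation
    ).fetchone()


# Usage (database path taken from the sqlite3 command above):
# conn = sqlite3.connect(".gorgon/checkpoints.db")
# print(first_failure(conn, "run_2026-02-12_001"))
```

Filtering on `status = 'failed'` skips agents marked `skipped`, which are a consequence of the failure rather than its origin.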

Step 2: Examine Agent Outputs

bash

# Pretty-print the upstream output that the failing agent consumed
python3 -m json.tool .gorgon/runs/{run_id}/executor/execution-results.json

# Compare against expected schema
# Does execution-results.json have the keys the analyzer expects?
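The schema comparison can be made mechanical. This is a minimal sketch that checks only for top-level keys; `missing_keys` and `ANALYZER_REQUIRES` are hypothetical names, and the required key set is an assumption that should be replaced with whatever the downstream agent actually reads.

```python
import json


def missing_keys(output, required_keys):
    """Return the required keys absent from a parsed agent output, sorted."""
    return sorted(set(required_keys) - set(output))


# Assumed contract: top-level keys the analyzer reads from the executor output.
ANALYZER_REQUIRES = {"tests"}

# Usage:
# with open(".gorgon/runs/{run_id}/executor/execution-results.json") as f:
#     print(missing_keys(json.load(f), ANALYZER_REQUIRES))
```

An empty list means the top-level contract holds; nested structure and value types would need a fuller schema check.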

Step 3: Check Budget Consumption

bash

sqlite3 .gorgon/checkpoints.db "
  SELECT agent_role, tokens_used, token_budget, 
         ROUND(tokens_used * 100.0 / token_budget, 1) as pct_used
  FROM budget_log
  WHERE workflow_run_id = '{run_id}'
"

code

scanner     | 823  | 1500 | 54.9%
executor    | 412  | 500  | 82.4%   ← Running hot
analyzer    | 1987 | 2000 | 99.4%   ← Budget exhaustion likely
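The annotations in that output can be produced automatically. In the sketch below, the 80% and 95% thresholds are illustrative assumptions chosen to match the annotations above, and `budget_status` is a hypothetical helper.

```python
def budget_status(tokens_used, token_budget, hot_pct=80.0, critical_pct=95.0):
    """Label an agent's budget usage; the thresholds are illustrative assumptions."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    pct_used = tokens_used * 100.0 / token_budget
    if pct_used >= critical_pct:
        return "exhaustion likely"
    if pct_used >= hot_pct:
        return "running hot"
    return "ok"


# Rows copied from the budget_log output above: (agent_role, tokens_used, token_budget)
rows = [("scanner", 823, 1500), ("executor", 412, 500), ("analyzer", 1987, 2000)]
for role, used, budget in rows:
    print(f"{role:<10} {budget_status(used, budget)}")
```

A high percentage is a hint, not a verdict: an agent can finish cleanly at 99% of budget, so confirm truncation in the agent's log before classifying the run as Budget Exhaustion.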

Step 4: Read Agent Logs

bash

# Structured JSON logs per agent
cat .gorgon/runs/{run_id}/analyzer/agent.log | \
  python3 -c "import sys,json; [print(json.dumps(json.loads(l), indent=2)) for l in sys.stdin]" | \
  head -100

Look for:

•Repeated tool calls (looping)
•"I don't have enough context" messages
•Truncated outputs (hit token limit)
•Unexpected tool errors
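Looping is the easiest of these signals to detect programmatically. The sketch below counts identical tool calls in a line-delimited JSON log. The `tool` and `args` field names are assumptions about the log schema, not a documented Gorgon format, and `repeated_tool_calls` is a hypothetical helper.

```python
import json
from collections import Counter


def repeated_tool_calls(log_lines, threshold=3):
    """Return {(tool, args_json): count} for calls seen at least `threshold` times."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        # "tool" and "args" are assumed field names; adjust to the real schema.
        if "tool" in entry:
            key = (entry["tool"], json.dumps(entry.get("args"), sort_keys=True))
            counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Usage:
# with open(".gorgon/runs/{run_id}/analyzer/agent.log") as f:
#     print(repeated_tool_calls(f))
```

Serializing the arguments with `sort_keys=True` makes two calls compare equal even when their argument keys were logged in a different order.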

Step 5: Classify and Report

Produce a diagnostic report:

code

WORKFLOW DEBUG REPORT
═════════════════════

Workflow:  technical_debt_audit
Run ID:    run_2026-02-12_001
Status:    FAILED at analyzer (step 3 of 4)
Duration:  2m 57s (of 10m budget)

ROOT CAUSE: Contract Violation
  The executor agent produced execution-results.json without the
  'tests' key because Docker was not available on the host. The
  executor's on_failure:continue policy meant it returned a partial
  result, but the analyzer expected a complete schema.

EVIDENCE:
  1. executor output missing 'tests' key (expected by analyzer)
  2. executor log shows: "Docker not found, skipping runtime checks"
  3. analyzer crashes at: analysis.py line 42, KeyError('tests')

FIX OPTIONS:
  1. [Quick] Make analyzer handle missing executor fields gracefully
     Effort: 15 min | Prevents: this exact failure
  2. [Proper] Add output schema validation between stages
     Effort: 1 hour | Prevents: all contract violations
  3. [Infrastructure] Install Docker on host
     Effort: 5 min | Prevents: executor partial results

BUDGET ANALYSIS:
  Total spent: 3,222 / 5,500 tokens (58.6%)
  Waste: ~1,987 tokens on analyzer that crashed
  If fixed: run would cost ~4,500 tokens

RECOMMENDATION: Fix #2 (schema validation) — it's a systemic fix
  that prevents an entire category of failures.

Post-Mortem Mode

For completed (successful) workflows, analyze efficiency:

code

WORKFLOW POST-MORTEM
════════════════════

Workflow:  document_analysis
Status:    COMPLETED (all 5 stages)
Duration:  4m 12s
Budget:    4,800 / 6,000 tokens (80%)

STAGE BREAKDOWN:
  context_mapper  | 0:32 |  800 tokens | ✅ Clean
  scanner         | 1:05 | 1,200 tokens | ⚠️ Scanned 3 languages, only Python present
  executor        | 1:45 |   400 tokens | ✅ Clean
  analyzer        | 0:35 | 1,800 tokens | ⚠️ 60% of budget on scoring justifications
  reporter        | 0:15 |   600 tokens | ✅ Clean

OPTIMIZATION OPPORTUNITIES:
  1. Scanner: Restrict scanning to languages actually present in the repository → save ~300 tokens
  2. Analyzer: Shorten justifications (not user-facing) → save ~600 tokens
  3. Context mapper cache hit possible for repeated runs → save 800 tokens

POTENTIAL SAVINGS: ~1,700 tokens (35% reduction)
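The savings figure is the sum of the per-stage estimates divided by the tokens actually spent. A small sketch of that arithmetic, using the numbers from the report above (`savings_summary` is a hypothetical helper):

```python
def savings_summary(estimated_savings, tokens_spent):
    """Return (total tokens saved, fraction of the tokens actually spent)."""
    total = sum(estimated_savings.values())
    return total, total / tokens_spent


# Per-stage estimates and total spend copied from the post-mortem above.
estimates = {"scanner": 300, "analyzer": 600, "context_mapper": 800}
total, fraction = savings_summary(estimates, tokens_spent=4800)
print(f"{total} tokens ({fraction:.0%} reduction)")
```

The reduction is measured against tokens spent (4,800), not against the 6,000-token budget.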

Gorgon Integration

The workflow debugger itself can be a Gorgon agent:

yaml

# Add to any workflow as an error handler
workflow:
  error_handler:
    role: workflow_debugger
    agent_ref: skills/workflow-debugger/SKILL.md
    trigger: "any agent fails"
    inputs:
      run_id: "{{ workflow.run_id }}"
      checkpoint_db: "{{ workflow.checkpoint_path }}"
      agent_logs: "{{ workflow.log_path }}"
    output: debug-report.md

Constraints

•Read-only — never modifies workflow state, checkpoints, or outputs
•Non-blocking — debugger runs after failure, doesn't interfere with retry logic
•Evidence-based — every diagnosis must reference specific log lines or data
•Actionable — every report includes concrete fix options with effort estimates
•No guessing — if root cause is uncertain, say so and list possibilities ranked by likelihood