Fixing Bugs Systematically

Structured protocol for isolating root causes and implementing focused fixes in existing features.

When to Use

•Something is broken and needs diagnosis and repair
•Error messages or unexpected behavior occur
•Performance degradation in existing functionality
•Intermittent or hard-to-reproduce issues

Core Steps

1. Context & Reproduction

Read relevant documentation:

•docs/feature-spec/F-##-*.md for affected feature
•docs/user-stories/US-###-*.md for expected behavior and acceptance criteria
•docs/api-contracts.yaml if API-related
•docs/system-design.md for architecture context

Document the bug:

•Expected behavior (cite story AC or spec)
•Actual behavior (what's broken)
•Reproduction steps
•Feature ID (F-##) and Story ID (US-###) if known

2. Investigation

Simple bugs (obvious entry point)

Use direct investigation:

•Grep to locate error messages or related code
•Read suspected files to examine implementation
•Trace function calls and data transformations
•Check related files for connected logic
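
The direct-investigation steps above can be approximated with a small script when a grep tool is not at hand. This is a minimal sketch, not a prescribed tool: the function name `find_message` and the default file suffixes are assumptions made for illustration.

```python
from pathlib import Path


def find_message(root, needle, suffixes=(".py", ".ts", ".js")):
    """Return (path, line_number, line) for every source line containing `needle`.

    `suffixes` is an assumed filter; adjust it to the languages in the codebase.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip rather than abort the search
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_message("src", "Session expired")` would list each file and line where that (hypothetical) error string is produced, which gives the entry points to read and trace next.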

Complex bugs (multiple subsystems or unclear origin)

Delegate to async agents in parallel:

Spawn senior-engineer agents to:

•Trace error flow through specific subsystem
•Analyze related failure patterns
•Investigate runtime conditions

Spawn Explore agents to:

•Map data flow across multiple files
•Find all error handling for specific operation
•Locate configuration and integration points

Example: for an authentication bug, spawn:

•Agent 1: "Trace auth flow from login endpoint to session creation"
•Agent 2: "Find all error handling and validation in auth module"
•Agent 3: "Locate session storage config and related code"

Wait for results using ./agent-responses/await {agent_id}

3. Root Cause Analysis

Generate hypotheses:

•List 3-8 potential root causes from investigation
•Rank by probability (evidence from code) and impact
•Select most likely cause(s)
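
One way to make the ranking explicit is to score each hypothesis. The sketch below assumes a simple probability × impact product and a 1-5 impact scale; the protocol itself does not prescribe a formula, and the `Hypothesis` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    probability: float  # 0.0-1.0, estimated from evidence in the code
    impact: int         # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> float:
        # Assumed scoring rule: probability weighted by impact.
        return self.probability * self.impact


def rank(hypotheses):
    """Return hypotheses ordered from most to least worth checking first."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)
```

The top one or two entries of `rank(...)` are the candidates to confirm or rule out before writing a fix.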

Decision point:

•Fix immediately if root cause is obvious and confirmed
•Add validation (step 4) if multiple plausible causes remain or behavior is runtime-dependent

4. Validation (if needed)

Add minimal debugging:

•Logging at decision points
•Data inspection at boundaries
•Input/output logging at integration points

Test to confirm root cause before proceeding to fix.
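
As an illustration of logging at decision points and boundaries, here is a minimal sketch using Python's standard `logging` module. The function `apply_discount`, the coupon code, and the `TEMP-DEBUG` marker are hypothetical; the marker is only a suggested convention that makes the temporary lines easy to find and remove later.

```python
import logging
from typing import Optional

log = logging.getLogger("bugfix.debug")


def apply_discount(total: float, coupon: Optional[str]) -> float:
    # Boundary: record the inputs exactly as they arrive.
    log.debug("TEMP-DEBUG apply_discount in total=%r coupon=%r", total, coupon)
    if coupon == "SAVE10":
        # Decision point: record which branch was taken.
        log.debug("TEMP-DEBUG branch=coupon-applied")
        result = round(total * 0.9, 2)
    else:
        log.debug("TEMP-DEBUG branch=no-coupon")
        result = total
    # Boundary: record the output before it leaves the function.
    log.debug("TEMP-DEBUG apply_discount out result=%r", result)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    apply_discount(100.0, "save10")  # lowercase coupon exposes the branch taken
```

Running the reproduction steps with this logging enabled shows which branch executes and with what data, which is usually enough to confirm or reject a hypothesis.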

5. Implementation

Fix the confirmed root cause:

•Keep changes minimal and focused
•Maintain API stability unless approved
•Follow existing patterns in codebase

Update documentation if needed:

•Add note in feature spec or changelog
•Update docs/api-contracts.yaml if contract changed (requires approval)

Slash commands for these updates:

•/manage-project/update/update-feature to correct the spec
•/manage-project/update/update-story if ACs were ambiguous
•/manage-project/update/update-api if the API changed (with approval)

6. Validation & Testing

Verify fix against acceptance criteria:

•Test all ACs from affected user stories
•Check 1-2 key edge cases and error states
•Run contract tests if API changed
•Verify events in docs/data-plan.md still fire correctly
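
A regression test that maps one test case to each acceptance criterion keeps this verification repeatable. The sketch below uses the standard `unittest` module; `normalize_email` and the AC numbering are hypothetical stand-ins for the fixed code and the story's real criteria.

```python
import unittest


def normalize_email(raw: str) -> str:
    """Hypothetical fixed function: trims whitespace and lowercases."""
    return raw.strip().lower()


class NormalizeEmailRegression(unittest.TestCase):
    # One test per acceptance criterion, plus a key edge case.
    def test_ac1_mixed_case_is_lowercased(self):
        self.assertEqual(normalize_email("User@Example.COM"), "user@example.com")

    def test_ac2_surrounding_whitespace_is_trimmed(self):
        self.assertEqual(normalize_email("  a@b.io \n"), "a@b.io")

    def test_edge_empty_string(self):
        self.assertEqual(normalize_email(""), "")


def run_regression() -> bool:
    """Run the suite programmatically and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeEmailRegression)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Naming each test after the criterion it covers makes it clear which AC regressed if the bug returns.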

7. Cleanup

•Remove all temporary debugging and logging code added during validation
•Verify no temporary files remain
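
The temporary-file check can be automated with a short scan. This is a sketch under assumed naming conventions: the patterns in `TEMP_PATTERNS` are examples only and should be replaced with whatever names were actually used for scratch files during the investigation.

```python
from fnmatch import fnmatch
from pathlib import Path

# Assumed patterns for scratch files; adjust per project.
TEMP_PATTERNS = ("*.tmp", "*.orig", "debug_*.log")


def leftover_temp_files(root, patterns=TEMP_PATTERNS):
    """Return sorted paths (relative to `root`) of files matching any pattern."""
    return sorted(
        str(p.relative_to(root))
        for p in Path(root).rglob("*")
        if p.is_file() and any(fnmatch(p.name, pat) for pat in patterns)
    )
```

An empty list from `leftover_temp_files(".")` indicates nothing matching those patterns remains in the working tree.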

Investigation Strategy

For direct investigation:

•Use grep and read_file to understand the subsystem
•Trace flows manually through related files
•Focus on specific area where bug manifests

When to validate before fixing:

•Multiple plausible root causes exist
•Runtime-dependent behavior
•Intermittent or hard-to-reproduce issues

For async investigation:

•Each agent investigates independent subsystem
•Run in parallel for speed
•Maximum 6 agents (diminishing returns)

Artifacts

Inputs:

•docs/feature-spec/F-##-*.md — Feature specs
•docs/user-stories/US-###-*.md — Expected behavior and ACs
•docs/api-contracts.yaml — API specs
•docs/system-design.md — Architecture context

Outputs:

•Investigation findings (inline notes or agent reports)
•Updated feature spec with bug resolution notes
•Fixed code with accompanying tests

Quick Reference

Scenario	Approach
Single subsystem, obvious entry	Direct investigation → immediate fix
Multiple subsystems, unclear origin	Spawn 2-4 agents in parallel → synthesize findings → fix
Runtime-dependent or intermittent	Add targeted logging → reproduce → analyze logs → fix
Multiple independent fixes needed	Pass investigation results to fix agents via artifact files