Using Buggify

When to Use This Skill

Invoke this skill when you are:

•Adding fault injection to force edge cases and error paths
•Debugging why simulation isn't finding bugs in specific code paths
•Placing chaos strategically to increase test coverage
•Tuning buggify probabilities for rare events
•Understanding how deterministic fault injection works

Related Skills

•designing-simulation-workloads: Create workloads that benefit from buggify chaos
•using-chaos-assertions: Track that buggify actually triggers expected paths
•validating-with-invariants: Ensure system maintains correctness under buggify chaos

Philosophy: Deterministic Fault Injection

Traditional chaos testing has a problem: random failures alone won't find rare bug combinations.

Buggify solves this by deterministically injecting faults at strategic locations:

•Each buggify!() call site is randomly activated or left inactive once per simulation seed
•An active site then fires probabilistically (default 25%) each time it executes
•Same seed = same activation + same firing sequence = reproducible bugs
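The two-stage decision described above can be sketched with nothing but the standard library. This is a minimal illustration, not the library's implementation: `BuggifyState`, `should_fire`, `trace`, the SplitMix64 generator, and the 50% activation probability are all assumptions made for the example (the text above does not state the activation probability).

```rust
use std::collections::HashMap;

/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical per-simulation buggify state.
struct BuggifyState {
    rng: SplitMix64,
    activation_prob: f64, // assumed knob; not specified by the text above
    active_sites: HashMap<&'static str, bool>,
}

impl BuggifyState {
    fn new(seed: u64, activation_prob: f64) -> Self {
        Self { rng: SplitMix64(seed), activation_prob, active_sites: HashMap::new() }
    }

    /// Stage 1: decide once per site whether it is active for this seed.
    /// Stage 2: if active, fire with probability `fire_prob` on this execution.
    fn should_fire(&mut self, site: &'static str, fire_prob: f64) -> bool {
        let known = self.active_sites.get(site).copied();
        let active = match known {
            Some(a) => a,
            None => {
                let a = self.rng.next_f64() < self.activation_prob;
                self.active_sites.insert(site, a);
                a
            }
        };
        active && self.rng.next_f64() < fire_prob
    }
}

/// Runs a fixed sequence of calls at two sites and records every decision.
fn trace(seed: u64) -> Vec<bool> {
    let mut state = BuggifyState::new(seed, 0.5);
    let sites = ["peer.rs:123", "queue.rs:45"];
    (0..32).map(|i| state.should_fire(sites[i % 2], 0.25)).collect()
}
```

Because every decision is drawn from the seeded generator, `trace(42)` returns the same vector on every run, which is the property that makes a failing seed replayable.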

The Architectural Trick

FoundationDB's key insight: Interface swapping

code

Production:
  INetwork → Net2 (real TCP sockets)

Simulation:
  INetwork → Sim2 (simulated network with buggify)

The same production code runs in simulation. Buggify calls inject chaos when the code runs against the Sim2 providers and are no-ops in production.
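In Rust, this kind of interface swapping is usually expressed with a trait. The sketch below is illustrative only: `Network`, `RealNetwork`, `SimNetwork`, and `connect_with_retry` are hypothetical names, the "real" provider does not open a socket, and the simulated provider fails on a fixed schedule rather than through buggify, to keep the example short.

```rust
/// Hypothetical network abstraction; a real interface would be much richer.
trait Network {
    fn connect(&mut self, addr: &str) -> Result<String, String>;
}

/// Stand-in for the production provider (which would open a real TCP socket).
struct RealNetwork;

impl Network for RealNetwork {
    fn connect(&mut self, addr: &str) -> Result<String, String> {
        Ok(format!("tcp://{}", addr))
    }
}

/// Simulated provider: every `fail_every`-th call fails (0 disables failures).
struct SimNetwork {
    fail_every: u32,
    calls: u32,
}

impl Network for SimNetwork {
    fn connect(&mut self, addr: &str) -> Result<String, String> {
        self.calls += 1;
        if self.fail_every != 0 && self.calls % self.fail_every == 0 {
            return Err(format!("simulated connection failure to {}", addr));
        }
        Ok(format!("sim://{}", addr))
    }
}

/// "Production" logic, written once against the trait and unaware of
/// which provider it is running on.
fn connect_with_retry<N: Network>(
    net: &mut N,
    addr: &str,
    attempts: u32,
) -> Result<String, String> {
    let mut last_err = String::from("no attempts made");
    for _ in 0..attempts {
        match net.connect(addr) {
            Ok(conn) => return Ok(conn),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

The retry logic is identical in both configurations; only the provider passed in changes, so error paths exercised in simulation are the same code that ships.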

Combinatorial Explosion Strategy

Buggify shrinks timeouts 600x and randomizes configuration:

•Production: 60s timeout
•Simulation with buggify: 0.1s timeout

This creates time pressure and lets the simulation explore thousands of timing combinations that are impossible to test manually.
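As a small worked example of the numbers above, the timeout choice and the resulting shrink factor could be written as follows. The constant and function names are hypothetical; only the 60s and 0.1s values come from the text.

```rust
use std::time::Duration;

// Values taken from the bullets above; the names are illustrative.
const PRODUCTION_TIMEOUT: Duration = Duration::from_secs(60);
const BUGGIFIED_TIMEOUT: Duration = Duration::from_millis(100);

/// Picks the timeout for one request. `buggified` stands in for the
/// result of a buggify decision made by the simulation.
fn request_timeout(buggified: bool) -> Duration {
    if buggified {
        BUGGIFIED_TIMEOUT
    } else {
        PRODUCTION_TIMEOUT
    }
}

/// Ratio between the production and the buggified timeout.
fn shrink_factor() -> u128 {
    PRODUCTION_TIMEOUT.as_millis() / BUGGIFIED_TIMEOUT.as_millis()
}
```

Here `shrink_factor()` evaluates to 600, matching the 600x figure: 60,000 ms divided by 100 ms.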

How Buggify Works

Deterministic Randomness

rust

// Each call site gets unique location ID (file:line)
buggify!()  // Location: peer.rs:123

// Simulation decides at start: "Activate peer.rs:123? Yes"
// Then fires ~25% of times when executed

Key property: Same seed → identical decisions → reproducible bugs

Macro API

rust

// Default 25% probability when active
buggify!()

// Custom probability when active (50%)
buggify_with_prob!(0.5)

// Higher probability for rare paths (75%)
buggify_with_prob!(0.75)
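For intuition, macros with this shape can be built on `file!()` and `line!()` plus a per-thread simulation state. The following is a sketch under stated assumptions, not the actual macro definitions: `SimState`, `set_simulation`, `buggify_at`, `count_fires`, the xorshift generator, and the 50% activation probability are all invented for illustration.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical per-thread simulation state: a seeded xorshift64 generator
/// plus the activation decision already made for each call site.
struct SimState {
    rng: u64,
    sites: HashMap<(&'static str, u32), bool>,
}

impl SimState {
    /// xorshift64 step; returns a float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        (self.rng >> 11) as f64 / (1u64 << 53) as f64
    }
}

thread_local! {
    // `None` models production: no simulation installed, buggify never fires.
    static SIM: RefCell<Option<SimState>> = RefCell::new(None);
}

/// Installs a seeded simulation for this thread, or clears it with `None`.
fn set_simulation(seed: Option<u64>) {
    SIM.with(|s| {
        *s.borrow_mut() = seed.map(|seed| SimState {
            rng: seed | 1, // xorshift state must be non-zero
            sites: HashMap::new(),
        });
    });
}

/// Decides whether the call site identified by (file, line) fires now.
fn buggify_at(file: &'static str, line: u32, prob: f64) -> bool {
    SIM.with(|s| {
        let mut guard = s.borrow_mut();
        let sim = match &mut *guard {
            Some(sim) => sim,
            None => return false, // production: always a no-op
        };
        let known = sim.sites.get(&(file, line)).copied();
        let active = match known {
            Some(a) => a,
            None => {
                let a = sim.next_f64() < 0.5; // assumed activation probability
                sim.sites.insert((file, line), a);
                a
            }
        };
        active && sim.next_f64() < prob
    })
}

macro_rules! buggify_with_prob {
    ($p:expr) => {
        buggify_at(file!(), line!(), $p)
    };
}

macro_rules! buggify {
    () => {
        buggify_with_prob!(0.25)
    };
}

/// Example call site: counts how often the injected fault fires.
fn count_fires(calls: u32) -> u32 {
    (0..calls).filter(|_| buggify!()).count() as u32
}
```

With no simulation installed, `count_fires` always returns 0. After `set_simulation(Some(seed))`, repeated runs with the same seed return the same count, because the activation decision and every firing decision come from the same seeded stream.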

Basic Usage Pattern

rust

if buggify!() {
    // Inject failure
    return Err(NetworkError::ConnectionFailed);
}

// Normal path continues
let connection = establish_connection().await?;

Strategic Placement Guide

Place buggify calls where failures should be handled gracefully but rarely occur in normal testing.

1. Error Handling Paths

Force errors that are normally rare:

rust

async fn connect_to_peer(&self, addr: Address) -> Result<Connection> {
    if buggify!() {
        tracing::warn!("Buggify: Simulating connection failure");
        return Err(NetworkError::ConnectionFailed);
    }

    let conn = self.transport.connect(addr).await?;

    Ok(conn)
}

Why: Tests error handling code that might never execute otherwise.

2. Timeout Triggers

Shrink timeouts to create time pressure:

rust

async fn send_with_timeout(&self, msg: Message) -> Result<Response> {
    let timeout = if buggify!() {
        Duration::from_millis(1)   // Very short
    } else {
        Duration::from_secs(5)     // Normal
    };

    self.time.timeout(timeout, self.send(msg)).await?
}

Why: Exposes timeout handling bugs and race conditions.

3. State Transitions

Add delays during critical state changes:

rust

async fn activate_actor(&mut self, actor_id: ActorId) -> Result<()> {
    // Check if already activating
    if self.catalog.is_activating(&actor_id) {
        return Err(ActorError::AlreadyActivating);
    }

    self.catalog.mark_activating(&actor_id);

    // Race condition window
    if buggify!() {
        self.time.sleep(Duration::from_millis(50)).await;
    }

    // Create actor instance
    let actor = self.factory.create(actor_id.clone()).await?;
    self.catalog.mark_active(&actor_id, actor);

    Ok(())
}

Why: Widens race condition windows, making timing bugs reproducible.

4. Resource Limits

Force resource pressure:

rust

fn create_message_queue(&self) -> MessageQueue {
    let capacity = if buggify!() {
        2      // Very small - force overflow
    } else {
        1000   // Normal
    };

    MessageQueue::with_capacity(capacity)
}

Why: Tests queue overflow, backpressure, resource exhaustion handling.

5. Network Operations

Simulate send failures:

rust

async fn send_message(&mut self, msg: Message) -> Result<()> {
    if buggify!() {
        tracing::warn!("Buggify: Simulating send failure");
        return Err(TransportError::SendFailed);
    }

    self.transport.send(msg).await?;

    Ok(())
}

Why: Tests retry logic, message loss handling, failover.

6. Registration/Lookup Failures

Force directory/catalog failures:

rust

async fn register_actor(&mut self, actor_id: ActorId, node_id: NodeId) -> Result<()> {
    if buggify!() {
        tracing::warn!("Buggify: Simulating registration failure");
        return Err(DirectoryError::RegistrationFailed);
    }

    self.directory.register(actor_id, node_id).await?;

    Ok(())
}

Why: Tests fallback logic, retry mechanisms, inconsistency handling.

Actor System-Specific Patterns

MessageBus Routing

rust

async fn route_message(&mut self, target: ActorId, msg: Message) -> Result<()> {
    // Force lookup failure
    if buggify!() {
        tracing::warn!("Buggify: Directory lookup failure");
        return Err(RoutingError::ActorNotFound);
    }

    let node = self.directory.lookup(&target).await?;

    // Force wrong node routing
    if buggify!() {
        let random_node = self.select_random_node();
        tracing::warn!("Buggify: Routing to wrong node");
        self.send_to_node(random_node, target, msg).await?;
    } else {
        self.send_to_node(node, target, msg).await?;
    }

    Ok(())
}

ActorCatalog Activation

rust

async fn get_or_activate(&mut self, actor_id: ActorId) -> Result<ActorHandle> {
    if let Some(handle) = self.actors.get(&actor_id) {
        return Ok(handle.clone());
    }

    // Delay to widen activation race window
    if buggify!() {
        self.time.sleep(Duration::from_millis(100)).await;
    }

    // Check again (someone else might have activated)
    if let Some(handle) = self.actors.get(&actor_id) {
        return Ok(handle.clone());
    }

    // Activate
    let handle = self.activate_new(actor_id.clone()).await?;

    // Force activation failure
    if buggify!() {
        self.actors.remove(&actor_id);
        return Err(ActorError::ActivationFailed);
    }

    Ok(handle)
}

Connection Management

rust

async fn connect_with_backoff(&mut self, peer: PeerId) -> Result<Connection> {
    let mut backoff = Duration::from_millis(10);

    loop {
        match self.try_connect(peer).await {
            Ok(conn) => return Ok(conn),
            Err(_) => {
                // Shrink backoff to create rapid retries
                if buggify!() {
                    backoff = Duration::from_millis(1);
                }

                self.time.sleep(backoff).await;

                // Exponential backoff
                backoff = backoff * 2;

                // Force give up early
                if buggify!() {
                    return Err(NetworkError::ConnectionFailed);
                }
            }
        }
    }
}

Probability Tuning

Default 25% is good for common paths. Adjust for specific needs:

High Probability (50-75%) - Rare Events

Use higher probability when the code path rarely executes:

rust

// This path only executes when queue is nearly full
if queue.len() > queue.capacity() * 9 / 10 {
    // Force overflow more aggressively
    if buggify_with_prob!(0.75) {
        return Err(QueueError::Full);
    }
}

Low Probability (10-25%) - Common Paths

Use the default or a lower probability when the path executes frequently:

rust

// This executes on every message send
async fn send(&mut self, msg: Message) -> Result<()> {
    // Hot path: keep the injected failure rate low
    if buggify_with_prob!(0.1) {
        return Err(TransportError::SendFailed);
    }

    // ... normal send
}
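The trade-off between the two ranges follows from simple arithmetic. Assuming each execution of an active site fires independently with probability p, the chance of at least one fire in n executions is 1 - (1 - p)^n, and the expected number of fires is p * n. The helper names below are hypothetical.

```rust
/// Probability that an active site fires at least once in `executions`
/// runs, assuming each run fires independently with probability `p`.
fn prob_fires_at_least_once(p: f64, executions: u32) -> f64 {
    1.0 - (1.0 - p).powi(executions as i32)
}

/// Expected number of firings over `executions` runs of an active site.
fn expected_fires(p: f64, executions: u32) -> f64 {
    p * f64::from(executions)
}
```

For a path that runs only twice per simulation, `prob_fires_at_least_once(0.75, 2)` is about 0.94, while at 0.1 it is only 0.19, which is why rare paths warrant higher values. For a path that runs 10,000 times, `expected_fires(0.1, 10_000)` is already about 1,000 injected failures, so a low probability is plenty.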

Conditional Buggify

Combine with regular conditions:

rust

// Only inject chaos when under load
if self.queue.len() > 100 && buggify!() {
    return Err(OverloadError::BackpressureActive);
}

Tracking Buggify Coverage

Use assertions to verify buggify actually triggers:

rust

if buggify!() {
    sometimes_assert!(
        buggify_connection_failure,
        true,
        "Buggify triggered connection failure path"
    );
    return Err(NetworkError::ConnectionFailed);
}

Why: If the buggify_connection_failure assertion never succeeds across a run, this buggify site was either never activated or never fired.
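The bookkeeping behind this kind of check can be pictured as a counter per named path. The sketch below is a hypothetical stand-in, not how sometimes_assert! is implemented: `Coverage`, `register`, `record`, and `never_reached` are names invented for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical coverage registry: counts how often each named buggify
/// path was taken during a run.
#[derive(Default)]
struct Coverage {
    hits: BTreeMap<&'static str, u64>,
}

impl Coverage {
    /// Declares a path up front so that "never reached" shows up as a zero.
    fn register(&mut self, name: &'static str) {
        self.hits.entry(name).or_insert(0);
    }

    /// Records that the named path executed once.
    fn record(&mut self, name: &'static str) {
        *self.hits.entry(name).or_insert(0) += 1;
    }

    /// Names of registered paths that never executed, in sorted order.
    fn never_reached(&self) -> Vec<&'static str> {
        self.hits
            .iter()
            .filter(|(_, count)| **count == 0)
            .map(|(name, _)| *name)
            .collect()
    }
}
```

A non-empty `never_reached()` at the end of a run points at buggify sites whose error paths were not exercised, which is the signal to raise the probability or add iterations.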

Performance Considerations

Don't Overuse

Too many buggify calls slow simulation:

rust

// ❌ Bad: Buggify in tight loop
for i in 0..10000 {
    if buggify!() {  // Evaluated on every one of the 10,000 iterations!
        delay().await;
    }
    process_item(i);
}

// ✅ Good: Buggify before loop
if buggify!() {
    delay().await;
}
for i in 0..10000 {
    process_item(i);
}

Strategic Over Comprehensive

Focus on:

•Error paths that matter
•Race condition windows
•Resource boundaries

Skip:

•Trivial getters/setters
•Pure computation (no I/O)
•Already well-tested paths

Integration Checklist

When adding buggify to a module:

Common Patterns

Connection Failure + Retry

rust

async fn send_with_retry(&mut self, msg: Message) -> Result<()> {
    for attempt in 0..3 {
        if buggify!() {
            return Err(TransportError::SendFailed);
        }

        match self.transport.send(msg.clone()).await {
            Ok(_) => return Ok(()),
            Err(_) if attempt < 2 => {
                let delay = if buggify!() {
                    Duration::from_millis(1)
                } else {
                    Duration::from_millis(100 * (attempt + 1))
                };
                self.time.sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }

    Err(TransportError::RetriesExhausted)
}

Resource Exhaustion + Backpressure

rust

async fn enqueue_message(&mut self, msg: Message) -> Result<()> {
    let capacity = if buggify!() {
        5  // Force small queue
    } else {
        1000
    };

    if self.queue.len() >= capacity {
        if buggify!() {
            // Reject immediately
            return Err(QueueError::Full);
        }

        // Wait for space
        self.wait_for_space().await?;
    }

    self.queue.push(msg);
    Ok(())
}

Timing-Dependent State

rust

async fn two_phase_operation(&mut self) -> Result<()> {
    // Phase 1: Prepare
    self.state = State::Preparing;

    if buggify!() {
        self.time.sleep(Duration::from_millis(50)).await;
    }

    self.prepare().await?;

    // Phase 2: Commit
    self.state = State::Committing;

    if buggify!() {
        self.time.sleep(Duration::from_millis(50)).await;
    }

    self.commit().await?;

    self.state = State::Committed;
    Ok(())
}

Debugging Buggify Issues

Issue: Buggify Not Triggering

Symptom: Assertions show certain buggify paths never execute

Solutions:

•Increase probability: buggify_with_prob!(0.75)
•Check if code path is reached at all (add logging)
•Run more iterations: UntilAllSometimesReached(10_000)

Issue: Too Much Chaos

Symptom: All tests fail, can't make progress

Solutions:

•Reduce probability temporarily
•Remove buggify from hot paths
•Check for missing error handling

Issue: Non-Deterministic Failures

Symptom: Same seed produces different results

Solutions:

•Verify that all randomness comes from SimRandomProvider, not rand::random()
•Check for stray tokio::spawn() calls; use task_provider.spawn_task() instead
•Ensure sleeps go through time.sleep(), not tokio::time::sleep()
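One way to catch these problems early is to run the same seed twice and compare event traces. The sketch below shows the idea with a toy simulation. `RandomProvider`, `SeededRandom`, `run_simulation`, and `first_divergence` are hypothetical names for illustration and are not the SimRandomProvider API.

```rust
/// Hypothetical injected randomness source; simulation code draws from
/// this instead of a global, unseeded generator.
trait RandomProvider {
    fn next_u64(&mut self) -> u64;
}

/// Seeded xorshift64* generator: same seed, same sequence.
struct SeededRandom(u64);

impl SeededRandom {
    fn new(seed: u64) -> Self {
        SeededRandom(seed | 1) // xorshift state must be non-zero
    }
}

impl RandomProvider for SeededRandom {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        self.0.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Toy simulation: each step either delivers or drops a message, and
/// every decision is drawn from the injected provider.
fn run_simulation<R: RandomProvider>(rng: &mut R, steps: usize) -> Vec<String> {
    (0..steps)
        .map(|step| {
            if rng.next_u64() % 4 == 0 {
                format!("step {}: dropped", step)
            } else {
                format!("step {}: delivered", step)
            }
        })
        .collect()
}

/// Runs the same seed twice and returns the first step at which the two
/// traces differ, or `None` if they are identical.
fn first_divergence(seed: u64, steps: usize) -> Option<usize> {
    let a = run_simulation(&mut SeededRandom::new(seed), steps);
    let b = run_simulation(&mut SeededRandom::new(seed), steps);
    a.iter().zip(b.iter()).position(|(x, y)| x != y)
}
```

With all randomness injected, `first_divergence` returns `None`. If some code path bypassed the provider (an unseeded random call, a real timer, an untracked task), the index it returns would show where the two runs first split.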

Best Practices

•Document intent: Add comment explaining what chaos is being injected
•Use assertions: Track that buggify actually triggers
•Start conservative: Begin with low probability, increase if needed
•Target error paths: Focus on rarely-tested code
•Widen race windows: Add delays during state transitions
•Shrink resources: Force pressure on queues, connections, buffers
•Test incrementally: Add buggify gradually, verify each addition

Key Takeaways

•Buggify biases toward edge cases: Random chaos alone won't find rare combinations
•Deterministic: Same seed = same failures = reproducible bugs
•Strategic placement: Error paths, state transitions, resource limits
•Time pressure: Shrink timeouts 600x to explore timing combinations
•Track coverage: Use assertions to verify buggify triggers

The goal: Force the system through error paths that production will eventually encounter!

Additional Resources

See separate reference files:

•PLACEMENT-GUIDE.md: Detailed decision tree for buggify placement
•EXAMPLES.md: Annotated real-world examples from moonpool-foundation
•TROUBLESHOOTING.md: Common issues and solutions