Feature Flags and A/B Testing

When to Use This Skill

Use when implementing feature toggles, gradual rollouts, or controlled experiments. Applies to provider selection, flag lifecycle management, and experiment design with statistical rigor.

Provider Selection

Criteria	LaunchDarkly	Unleash	Flagsmith	Custom (DB/Config)
Cost at 10K MAU	~$833/mo	Free (self-host)	Free tier	Infrastructure only
Cost at 1M MAU	~$3,333/mo	Free (self-host)	~$45/mo	Infrastructure only
Setup time	Hours	Half-day	Hours	Days-weeks
Targeting complexity	Advanced	Moderate	Moderate	Build it yourself
Edge evaluation	Yes	Via proxy	Via proxy	No
Audit trail	Built-in	Built-in	Built-in	Build it yourself
Best for	Funded startup, complex rules	Self-host preference, privacy	Budget-conscious teams	<5 flags, full control

Recommendation

Start with Flagsmith or Unleash if budget-constrained. Move to LaunchDarkly when targeting rules get complex or you need sub-50ms edge evaluation. Custom only if you have fewer than 5 flags and no rollout needs.

Flag SDK Integration

LaunchDarkly (TypeScript)

typescript

import * as ld from "launchdarkly-node-server-sdk";

const client = ld.init(process.env.LD_SDK_KEY\!);
await client.waitForInitialization();

async function getFlag<T>(key: string, user: ld.LDUser, fallback: T): Promise<T> {
  return client.variation(key, user, fallback) as Promise<T>;
}

// Usage with context
const user: ld.LDUser = {
  key: userId,
  email: userEmail,
  custom: { plan: "pro", signupDate: "2025-01-15" },
};

const showNewCheckout = await getFlag("new-checkout-flow", user, false);

Unleash (Self-Hosted)

typescript

import { initialize, isEnabled } from "unleash-client";

const unleash = initialize({
  url: "https://unleash.internal.company.com/api",
  appName: "web-app",
  customHeaders: { Authorization: process.env.UNLEASH_API_KEY\! },
});

// Percentage rollout with stickiness
function isFeatureEnabled(flag: string, userId: string): boolean {
  return isEnabled(flag, { userId, sessionId: userId });
}

Custom Flag System (Minimal)

typescript

interface FeatureFlag {
  key: string;
  enabled: boolean;
  rolloutPercent: number; // 0-100
  allowList: string[];    // user IDs with forced access
}

function evaluateFlag(flag: FeatureFlag, userId: string): boolean {
  if (\!flag.enabled) return false;
  if (flag.allowList.includes(userId)) return true;
  // Deterministic hash for consistent assignment
  const hash = murmurHash3(flag.key + userId) % 100;
  return hash < flag.rolloutPercent;
}

// Murmur3 gives uniform distribution; don't use Math.random()
function murmurHash3(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return Math.abs(h);
}

Gradual Rollout Strategy

Phased Rollout Pattern

code

Phase 1: Internal team (allowList)          -> 0% rollout, team IDs only
Phase 2: Beta cohort                        -> 0% rollout, beta user IDs
Phase 3: 5% canary                          -> 5% rollout, monitor errors
Phase 4: 25% early majority                 -> 25% rollout, monitor metrics
Phase 5: 50% broad rollout                  -> 50% rollout, A/B comparison
Phase 6: 100% general availability          -> 100% rollout
Phase 7: Remove flag, delete dead code path -> Flag archived

Kill Switch Pattern

typescript

// Wrap risky features with circuit-breaker flags
async function processPayment(order: Order) {
  const useNewProcessor = await getFlag("new-payment-processor", order.user, false);

  if (useNewProcessor) {
    try {
      return await newPaymentProcessor.charge(order);
    } catch (error) {
      // Auto-disable on repeated failures
      await reportFlagIncident("new-payment-processor", error);
      return await legacyPaymentProcessor.charge(order);
    }
  }
  return await legacyPaymentProcessor.charge(order);
}

A/B Test Design

Assignment and Bucketing

typescript

interface Experiment {
  key: string;
  variants: { id: string; weight: number }[];
  salt: string; // Unique per experiment to decorrelate
}

function assignVariant(experiment: Experiment, userId: string): string {
  const hash = murmurHash3(experiment.salt + userId) % 10000;
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight * 100; // weights are 0-100
    if (hash < cumulative) return variant.id;
  }
  return experiment.variants[0].id; // fallback to control
}

// Usage
const checkoutExperiment: Experiment = {
  key: "checkout-redesign-q1",
  variants: [
    { id: "control", weight: 50 },
    { id: "single-page", weight: 50 },
  ],
  salt: "checkout-redesign-q1-v1",
};

const variant = assignVariant(checkoutExperiment, userId);
analytics.track("experiment_assigned", {
  experiment: checkoutExperiment.key,
  variant,
  userId,
});

Statistical Significance Calculation

python

import numpy as np
from scipy import stats

def check_significance(
    control_conversions: int,
    control_total: int,
    treatment_conversions: int,
    treatment_total: int,
    alpha: float = 0.05,
) -> dict:
    """Two-proportion z-test for A/B experiment."""
    p_control = control_conversions / control_total
    p_treatment = treatment_conversions / treatment_total
    p_pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)

    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_total + 1/treatment_total))
    z_stat = (p_treatment - p_control) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    return {
        "control_rate": round(p_control, 4),
        "treatment_rate": round(p_treatment, 4),
        "lift": round((p_treatment - p_control) / p_control * 100, 2),
        "p_value": round(p_value, 4),
        "significant": p_value < alpha,
        "sample_size_adequate": min(control_total, treatment_total) >= 1000,
    }

# Example: 3.2% vs 3.8% conversion
result = check_significance(320, 10000, 380, 10000)
# {"control_rate": 0.032, "treatment_rate": 0.038, "lift": 18.75,
#  "p_value": 0.0234, "significant": True, "sample_size_adequate": True}

Sample Size Calculator

python

def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,  # relative, e.g., 0.05 = 5% lift
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Per-variant sample size needed."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha * np.sqrt(2 * p1 * (1-p1)) +
          z_beta * np.sqrt(p1*(1-p1) + p2*(1-p2))) / (p2 - p1)) ** 2
    return int(np.ceil(n))

# 3% baseline, want to detect 10% relative lift -> need ~35K per variant
required_sample_size(0.03, 0.10)  # 34,742

Gotchas and Anti-Patterns

Flag Debt Accumulation

•Problem: Flags stay in code long after rollout completes. Codebases accumulate hundreds of stale flags.
•Fix: Set expiration dates at creation. Add lint rules that flag (pun intended) flags older than 90 days. Track flag lifecycle in a spreadsheet or flag provider dashboard. Schedule monthly flag cleanup sprints.

Sample Ratio Mismatch (SRM)

•Problem: 50/50 split shows 51.2/48.8 actual distribution. Assignment logic has a bug, or redirects/bot filtering disproportionately affect one variant.
•Fix: Run SRM checks daily using chi-squared test. If p < 0.001, halt the experiment and investigate. Never trust results from an experiment with SRM.

The Peeking Problem

•Problem: Checking results daily and stopping when p < 0.05. This inflates false positive rate to 20-30%.
•Fix: Pre-register experiment duration based on sample size calculation. Use sequential testing (e.g., always-valid p-values) if you must monitor continuously. Never call a winner early without sequential correction.