Numerical Validation for Scientific Code
Overview
Verify that changes to mathematical/algorithmic code maintain numerical accuracy and mathematical properties.
Core principle: Capture baseline, make change, compare numerically, verify invariants, provide full analysis.
Announce at start: "I'm using the numerical-validation skill to verify mathematical correctness."
When to Use This Skill
MUST use when modifying:
- •src/non_local_detector/core.py (HMM algorithms)
- •src/non_local_detector/likelihoods/ (likelihood models)
- •src/non_local_detector/continuous_state_transitions.py
- •src/non_local_detector/discrete_state_transitions.py
- •src/non_local_detector/initial_conditions.py
- •Any code involving JAX transformations or numerical computations
Also use when:
- •Refactoring mathematical code (tolerance: 1e-14)
- •Optimizing algorithms (tolerance: 1e-10)
- •Changing convergence criteria or tolerances
- •Updating numerical dependencies
Process Checklist
Copy to TodoWrite:
Numerical Validation Progress:
- [ ] Capture baseline outputs before change
- [ ] Make the code change
- [ ] Capture new outputs after change
- [ ] Compare numerical differences
- [ ] Verify mathematical invariants
- [ ] Run property-based tests
- [ ] Run golden regression tests
- [ ] Generate full analysis report
- [ ] Present analysis and request approval (if differences found)
Detailed Steps
Step 1: Capture Baseline Outputs
Before making any changes:
```bash
# Run tests and capture output
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py \
    -v > /tmp/baseline_output.txt 2>&1

# Run property tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m property -v > /tmp/baseline_property.txt 2>&1

# Run snapshot tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m snapshot -v > /tmp/baseline_snapshot.txt 2>&1
```
Save output: Keep baseline files for comparison
Step 2: Make Code Change
Implement your modification to the mathematical/algorithmic code.
Step 3: Capture New Outputs
After making changes:
```bash
# Run same tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py \
    -v > /tmp/new_output.txt 2>&1

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m property -v > /tmp/new_property.txt 2>&1

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m snapshot -v > /tmp/new_snapshot.txt 2>&1
```
Step 4: Compare Numerical Differences
Difference tolerances:
- •Refactoring (no behavior change): Max 1e-14 (floating-point noise only)
- •Intentional algorithm changes: Max 1e-10 (must be justified)
- •Any larger difference: Requires investigation and explanation
Compare outputs:
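```bash
# Check if outputs differ
# (pytest timing lines differ between runs; focus on test outcomes and reported values)
diff /tmp/baseline_output.txt /tmp/new_output.txt
```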
For each difference:
- •Is it expected based on the change?
- •What is the magnitude? (< 1e-14 is floating-point noise, < 1e-10 is acceptable for algorithm changes)
- •Does it affect scientific conclusions?
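A text `diff` shows that something changed but not by how much. The following is a minimal sketch of a magnitude check, assuming the baseline and new outputs can be loaded as NumPy arrays of the same shape. The function name `summarize_difference` and its return keys are hypothetical and not part of non_local_detector.

```python
import numpy as np


def summarize_difference(baseline, new):
    """Report the largest absolute difference between two output arrays."""
    baseline = np.asarray(baseline, dtype=float)
    new = np.asarray(new, dtype=float)
    if baseline.shape != new.shape:
        raise ValueError(f"shape mismatch: {baseline.shape} vs {new.shape}")
    if baseline.size == 0:
        raise ValueError("cannot compare empty arrays")
    # Non-finite entries are counted separately so that a NaN does not
    # propagate into the maximum and hide where the largest change is.
    finite = np.isfinite(baseline) & np.isfinite(new)
    abs_diff = np.where(finite, np.abs(new - baseline), 0.0)
    worst = np.unravel_index(np.argmax(abs_diff), abs_diff.shape)
    return {
        "max_abs_diff": float(abs_diff[worst]),
        "worst_index": tuple(int(i) for i in worst),
        "n_non_finite": int(finite.size - np.count_nonzero(finite)),
    }
```

The `max_abs_diff` value can then be read against the tolerances above, and `worst_index` points at the element to inspect first. Any nonzero `n_non_finite` should be treated as a failure regardless of the reported difference.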
Step 5: Verify Mathematical Invariants
Critical invariants that must ALWAYS hold:
- •Probability distributions sum to 1.0:

```python
assert np.allclose(probabilities.sum(axis=-1), 1.0, atol=1e-10)
```

- •Transition matrices are stochastic:

```python
assert np.allclose(transition_matrix.sum(axis=-1), 1.0, atol=1e-10)
assert np.all(transition_matrix >= 0)
assert np.all(transition_matrix <= 1)
```

- •Log-probabilities are finite:

```python
assert np.all(np.isfinite(log_probs))
```

- •Covariance matrices are positive semi-definite:

```python
eigenvalues = np.linalg.eigvalsh(covariance)
assert np.all(eigenvalues >= -1e-10)
```

- •Likelihoods are non-negative:

```python
assert np.all(likelihood >= 0)
```
Verify these with tests or spot checks after changes.
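For the analysis report it helps to record how far each invariant is from its ideal value rather than a bare pass/fail. The sketch below bundles the checks above into one function. The name `invariant_report`, its argument names, and its return keys are assumptions made for illustration, not an existing API in the project.

```python
import numpy as np


def invariant_report(probabilities, transition_matrix, log_probs,
                     covariance, likelihood):
    """Return deviations for each invariant so they can be quoted in a report."""
    probabilities = np.asarray(probabilities, dtype=float)
    transition_matrix = np.asarray(transition_matrix, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    covariance = np.asarray(covariance, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    return {
        # Largest distance of any distribution's total mass from 1.0
        "prob_sum_dev": float(np.max(np.abs(probabilities.sum(axis=-1) - 1.0))),
        # Largest distance of any transition-matrix row sum from 1.0
        "row_sum_dev": float(
            np.max(np.abs(transition_matrix.sum(axis=-1) - 1.0))
        ),
        "transition_in_unit_interval": bool(
            np.all((transition_matrix >= 0) & (transition_matrix <= 1))
        ),
        "log_probs_finite": bool(np.all(np.isfinite(log_probs))),
        # Most negative eigenvalue; the checks above accept values >= -1e-10
        "min_eigenvalue": float(np.min(np.linalg.eigvalsh(covariance))),
        "likelihood_non_negative": bool(np.all(likelihood >= 0)),
    }
```

The deviation entries map directly onto the "max deviation" figures in the validation section of the report.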
Step 6: Run Property-Based Tests
```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -m property -v
```
Expected: All property tests pass
Property tests verify:
- •Invariants hold across many random inputs (hypothesis library)
- •Edge cases are handled correctly
- •Mathematical properties are maintained
If failures: Investigate why the property was violated. The most likely cause is a bug in your change.
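To illustrate what "invariants hold across many random inputs" means in practice, here is a minimal sketch of a randomized check. The project's property tests use the hypothesis library; this example swaps in a seeded NumPy random generator so that it runs without extra dependencies. The function under test, `normalize`, is hypothetical.

```python
import numpy as np


def normalize(weights):
    """Hypothetical function under test: scale positive weights to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum(axis=-1, keepdims=True)


def check_normalization_property(n_trials=200, seed=0):
    """Check the sum-to-one invariant over many random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        n_states = int(rng.integers(2, 10))
        # Span many orders of magnitude to exercise poorly scaled inputs
        weights = 10.0 ** rng.uniform(-12, 12, size=n_states)
        probs = normalize(weights)
        assert np.all(probs >= 0), "negative probability"
        assert np.allclose(probs.sum(), 1.0, atol=1e-10), "does not sum to 1"
    return n_trials
```

A hypothesis-based test expresses the same idea with generated inputs and automatic shrinking of failing cases, which makes failures easier to diagnose than a fixed-seed loop.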
Step 7: Run Golden Regression Tests
```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py -v
```
Golden regression tests:
- •Use real scientific data
- •Compare against validated reference outputs
- •Catch subtle numerical changes that affect scientific results
Expected for refactoring: Exact match (or < 1e-14 difference)
Expected for algorithm changes: Document and justify any differences
Step 8: Generate Full Analysis Report
Create a comprehensive report with:
1. Diff - What Changed:
```
Snapshot changes:
- test_model_output: posterior probabilities differ by max 2.3e-11
- test_transition_matrix: no changes

Test output changes:
- Golden regression: 3 values differ by < 1e-10
```
2. Explanation - Why It Changed:
```
Changed optimizer tolerance from 1e-6 to 1e-8, resulting in:
- More precise convergence
- Slight differences in final parameter estimates
- Differences are within acceptable scientific tolerance
```
3. Validation - Invariants Still Hold:
```
Verified:
✓ All probabilities sum to 1.0 (max deviation: 3.4e-15)
✓ Transition matrices stochastic (max row sum deviation: 1.2e-14)
✓ No NaN or Inf values in any outputs
✓ All property tests pass (42/42)
✓ Covariance matrices positive semi-definite
```
4. Test Case - Demonstrate Correctness:
```python
# Before change:
old_result = [0.342156000000, 0.657844000000]  # Posterior at time 10

# After change:
new_result = [0.342156000023, 0.657843999977]  # Posterior at time 10

# Difference: 2.3e-11 (acceptable, below the 1e-10 tolerance)
# Scientific interpretation: No change to conclusions
# Both results favor state 2
```
Step 9: Present Analysis and Request Approval
If differences are within tolerance (< 1e-14 for refactoring):
- •Present analysis for information
- •Proceed with change
- •No approval needed
If differences are 1e-14 to 1e-10:
- •Present full analysis
- •Explain why differences are acceptable
- •Request approval: "These differences are expected and scientifically acceptable. Approve?"
- •Wait for user response
If differences are > 1e-10:
- •Present full analysis
- •Explain significance of differences
- •Provide scientific justification
- •Request explicit approval
- •If rejected: Investigate further or revert change
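The three cases above can be expressed as a small decision function. This is a sketch under the thresholds stated in this step. The function name `approval_level` and the returned strings are hypothetical, and the handling of values exactly on a boundary is an assumption.

```python
import math

REFACTOR_TOL = 1e-14   # below this: floating-point noise
ALGORITHM_TOL = 1e-10  # above this: explicit approval required


def approval_level(max_abs_diff):
    """Map the largest observed difference to the approval step it triggers."""
    if not math.isfinite(max_abs_diff) or max_abs_diff < 0:
        # NaN/Inf in outputs is never acceptable; stop instead of classifying
        raise ValueError("max_abs_diff must be a finite, non-negative number")
    if max_abs_diff < REFACTOR_TOL:
        return "no approval needed"
    if max_abs_diff <= ALGORITHM_TOL:
        return "request approval"
    return "request explicit approval"
```

For example, `approval_level(4.2e-15)` falls in the first case and `approval_level(2.3e-11)` in the second.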
Approval Process
For snapshot updates with numerical changes:
- •Generate full analysis (all 4 sections above)
- •Present to user
- •Ask: "These changes are [expected/acceptable/significant]. Approve snapshot update?"
- •If approved: User will set approval flag, then run:

```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest --snapshot-update
```
Integration with Other Skills
- •Use with scientific-tdd: After implementing new feature, validate numerics
- •Use with safe-refactoring: Verify no numerical changes during refactoring
- •Use with jax: After JAX optimizations, verify numerical equivalence
Tolerance Guidelines
| Change Type | Max Acceptable Difference | Approval Required |
|---|---|---|
| Pure refactoring | 1e-14 | No |
| Code optimization | 1e-10 | Yes (informational) |
| Algorithm modification | 1e-10 | Yes (justification) |
| Any change | > 1e-10 | Yes (strong justification) |
Example Workflow
Task: Refactor HMM filtering to use scan instead of for loop
```
1. Capture baseline:
   - Run golden regression: All pass
   - Run property tests: 42 pass
   - Save outputs to /tmp/baseline_*

2. Make change:
   - Replace for loop with jax.lax.scan
   - Maintain identical logic

3. Capture new outputs:
   - Run same tests: All pass
   - Save outputs to /tmp/new_*

4. Compare:
   - Max difference: 4.2e-15 (floating-point noise)
   - Within refactoring tolerance

5. Verify invariants:
   ✓ Probabilities sum to 1.0
   ✓ No NaN/Inf
   ✓ Property tests pass

6. Report:
   "Refactoring complete. Max numerical difference: 4.2e-15
   (floating-point noise). All invariants verified.
   No approval needed."
```
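The compare step of such a refactoring can be sketched on a toy problem. The example below is not the project's filter and does not use `jax.lax.scan`. It is a plain NumPy stand-in that keeps the same idea: two implementations of one forward filter, an explicit-loop reference and a refactored version, are run on identical inputs and their outputs compared. All function names and input values are invented for illustration.

```python
import numpy as np


def filter_loop(initial, transition, likelihood):
    """Reference forward filter with explicit loops over time and states."""
    n_time, n_states = likelihood.shape
    posterior = np.zeros((n_time, n_states))
    prior = np.asarray(initial, dtype=float)
    for t in range(n_time):
        unnormalized = np.zeros(n_states)
        for j in range(n_states):
            unnormalized[j] = prior[j] * likelihood[t, j]
        posterior[t] = unnormalized / unnormalized.sum()
        # Predict: prior[j] = sum_i posterior[t, i] * transition[i, j]
        prior = np.zeros(n_states)
        for i in range(n_states):
            for j in range(n_states):
                prior[j] += posterior[t, i] * transition[i, j]
    return posterior


def filter_vectorized(initial, transition, likelihood):
    """Refactored filter: same logic, state loops replaced by array operations."""
    n_time, n_states = likelihood.shape
    posterior = np.zeros((n_time, n_states))
    prior = np.asarray(initial, dtype=float)
    for t in range(n_time):
        unnormalized = prior * likelihood[t]
        posterior[t] = unnormalized / unnormalized.sum()
        prior = posterior[t] @ transition
    return posterior


# Made-up inputs: two states, 100 time steps
rng = np.random.default_rng(0)
transition = np.array([[0.95, 0.05], [0.10, 0.90]])
likelihood = rng.uniform(0.1, 1.0, size=(100, 2))
initial = np.array([0.5, 0.5])

old = filter_loop(initial, transition, likelihood)
new = filter_vectorized(initial, transition, likelihood)
max_diff = float(np.max(np.abs(new - old)))
print(f"max difference: {max_diff:.1e}")
```

Reordering floating-point operations can change the last bits of a result, so a refactoring like this is judged against the 1e-14 tolerance rather than required to be bit-identical.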
Red Flags
Don't:
- •Skip baseline capture
- •Ignore numerical differences > 1e-14
- •Assume "small" differences don't matter
- •Update snapshots without analysis
- •Skip property or golden regression tests
- •Proceed with NaN/Inf in outputs
Do:
- •Always capture baseline before changes
- •Investigate all unexpected differences
- •Verify mathematical invariants explicitly
- •Provide full analysis for any differences
- •Get approval before snapshot updates
- •Document tolerance justifications