# Debug Training Issues Skill
Systematically diagnose and resolve ML-Agents training problems.
## When to Use

- Agent not learning or reward staying flat
- Unity environment connection failures
- Training running but no progress
- GPU not being utilized
- Out-of-memory errors
## Quick Diagnostics

### 1. Check Training Is Running

```bash
# Monitor GPU usage
nvidia-smi

# Check TensorBoard
tensorboard --logdir=results/<run-id>

# Look for:
# - Environment/Cumulative Reward updates
# - Policy/Learning Rate changes
# - Losses/Policy Loss values
```
### 2. Verify Environment Connection

```bash
# Check if the Unity environment build is accessible
ls -la ./builds/  # or your env path

# Test port availability
netstat -an | grep 5005    # Linux/Mac
netstat -an | findstr 5005 # Windows

# Try with an explicit path
mlagents-learn config.yaml --run-id=test --env=./builds/MyEnv/MyEnv.x86_64
```
### 3. Check Reward Signal

In your Unity Agent script (C#, not Python):

```csharp
public override void OnEpisodeBegin()
{
    Debug.Log("Episode started");
}

public override void CollectObservations(VectorSensor sensor)
{
    Debug.Log($"Observations: {sensor.ObservationSize()}");
}

public override void OnActionReceived(ActionBuffers actions)
{
    Debug.Log($"Cumulative reward: {GetCumulativeReward()}");
}
```
## Common Problems & Solutions

### Problem: Agent Not Learning

Symptoms:

- Flat cumulative reward
- Random behavior
- No improvement over time
Diagnosis:

```bash
# Check TensorBoard metrics
tensorboard --logdir=results

# Verify observations are being collected
# Check Unity logs for reward values
```
Solutions:

- Adjust reward shaping (add intermediate rewards)
- Normalize observations to [-1, 1] or [0, 1]
- Reduce the learning rate (try 1e-4 instead of 3e-4)
- Increase max_steps (try 2M instead of 500K)
- Check the reward scale (too small or too large)
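The normalization bullet above can be sketched in plain Python/NumPy. (ML-Agents can also normalize vector observations for you via `normalize: true` under `network_settings` in the trainer config; the function below is only an illustration, not part of the ML-Agents API.)

```python
import numpy as np

def normalize_observation(obs, low, high):
    """Scale raw observation values from [low, high] into [-1, 1]."""
    obs = np.asarray(obs, dtype=np.float32)
    return 2.0 * (obs - low) / (high - low) - 1.0

# Example: a position in [0, 10] and a velocity in [-5, 5]
raw = np.array([7.5, -2.5])
low = np.array([0.0, -5.0])
high = np.array([10.0, 5.0])
print(normalize_observation(raw, low, high))  # -> [ 0.5 -0.5]
```

Keeping all observation components on the same scale prevents one large-magnitude feature (e.g. a world position) from dominating the policy network's inputs.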
### Problem: Connection Timeout

Symptoms:

```
UnityTimeOutException: The Unity environment took too long to respond
```
Solutions:

```bash
# 1. Check that the Unity build exists
ls ./builds/MyEnv/

# 2. Increase the timeout (in Python):
#    from mlagents_envs.environment import UnityEnvironment
#    env = UnityEnvironment(file_name="path", timeout_wait=120)

# 3. Check firewall rules
# 4. Verify no other process is using port 5005
```
### Problem: GPU Not Used

Symptoms:

- Training is slow
- GPU utilization at 0%
- CPU maxed out
Solutions:

```bash
# 1. Verify CUDA is available
python -c "import torch; print(torch.cuda.is_available())"

# 2. Check the PyTorch version
pip show torch

# 3. Reinstall with CUDA support
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu118

# 4. Set the environment variable
export CUDA_VISIBLE_DEVICES=0
```
### Problem: Out of Memory

Symptoms:

```
RuntimeError: CUDA out of memory
```
Solutions:

```yaml
# In config.yaml, reduce:
hyperparameters:
  batch_size: 512    # down from 1024
  buffer_size: 5120  # down from 10240

# Also train with a single environment first
# (pass --num-envs=1 on the mlagents-learn command line)
```
### Problem: NaN Loss

Symptoms:

- Policy loss shows NaN in TensorBoard
- Training crashes or produces invalid models
Solutions:

- Reduce the learning rate significantly (e.g. 1e-5)
- Check for NaN in observations
- Enable gradient clipping in the config
- Normalize inputs properly
- Check the reward scale (avoid extremely large rewards)
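To make the "check for NaN in observations" step concrete, here is a minimal sketch (plain NumPy, independent of the ML-Agents API; `check_observations` is a hypothetical helper name) that fails fast when an observation vector contains non-finite values:

```python
import numpy as np

def check_observations(obs, name="obs"):
    """Raise early if an observation vector contains NaN or Inf."""
    obs = np.asarray(obs, dtype=np.float32)
    bad = ~np.isfinite(obs)
    if bad.any():
        raise ValueError(
            f"{name} has non-finite values at indices {np.flatnonzero(bad).tolist()}"
        )
    return obs

check_observations([0.1, -0.5, 2.0])  # passes silently
try:
    check_observations([0.1, float("nan")])
except ValueError as e:
    print(e)  # obs has non-finite values at indices [1]
```

Calling a guard like this on the data you feed into the trainer (or logging the equivalent from C# before `AddObservation`) localizes the bad feature instead of letting NaN propagate into the loss.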
## Debug Mode

Enable verbose logging:

```bash
# Run with the debug flag
mlagents-learn config.yaml --run-id=debug --debug

# Set the Python logging level
export MLAGENTS_LOG_LEVEL=DEBUG
mlagents-learn config.yaml --run-id=debug
```
## Profiling

```bash
# Profile the Python training loop
python -m cProfile -o training.prof \
    -m mlagents.trainers.learn \
    config.yaml --run-id=profile

# Analyze the profile
python -m pstats training.prof
```
## Testing with Example Environments

Verify your setup works with a known-good environment:

```bash
# Train on 3DBall (should converge quickly)
mlagents-learn config/ppo/3DBall.yaml --run-id=baseline_test

# Watch for the reward increasing within 10K steps
tensorboard --logdir=results
```
If the baseline works but your environment does not, the issue is in your custom environment.
## Logs to Check

- TensorBoard logs: `results/<run-id>/`
- Unity Player logs:
  - Windows: `%USERPROFILE%\AppData\LocalLow\CompanyName\ProductName\Player.log`
  - Mac: `~/Library/Logs/Company Name/Product Name/Player.log`
  - Linux: `~/.config/unity3d/CompanyName/ProductName/Player.log`
- Python stdout/stderr: terminal output
## Related Resources

- Runbook: `docs/runbooks/training-convergence.md`
- Runbook: `docs/runbooks/gpu-memory.md`
- Unity Discussions: https://discussions.unity.com/tag/ml-agents
- GitHub Issues: https://github.com/Unity-Technologies/ml-agents/issues
## Related Skills

- `train-ml-agent` - Basic training workflow
- `optimize-performance` - Speed up training
- `export-models` - Export after successful training