Model Evaluation
Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.
When to use this skill
- •Model training complete — need performance assessment
- •Comparing multiple models/algorithms
- •Diagnosing overfitting/underfitting
- •Hyperparameter tuning
- •Production readiness check
Evaluation workflow
- •
Cross-validation strategy
- •K-fold (default for most cases)
- •Stratified K-fold (classification with imbalance)
- •TimeSeriesSplit (temporal data)
- •GroupKFold (grouped/clustered data)
- •
Choose appropriate metrics
- •Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
- •Regression: MAE, RMSE, R², MAPE
- •Ranking: NDCG, MAP
- •Business: custom metrics tied to outcomes
- •
Analyze performance
- •Cross-validation mean ± std
- •Validation curve (bias-variance tradeoff)
- •Learning curves (data sufficiency)
- •Error analysis by segment
- •
Model comparison
- •Statistical significance (paired t-test, McNemar)
- •Calibration (for probability outputs)
- •Speed vs accuracy tradeoffs
Quick tool selection
| Task | Default choice | Notes |
|---|---|---|
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series |
| Metrics | sklearn.metrics | Comprehensive metric suite |
| Hyperparameter tuning | Optuna or Ray Tune | Efficient search algorithms |
| Model comparison | scikit-learn + statistical tests | Paired comparisons |
| Experiment tracking | MLflow or Weights & Biases | Track runs, metrics, artifacts |
Core implementation rules
1) Always use proper validation
python
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
2) Match metrics to problem
python
# Classification with imbalance
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
# Focus on F1, precision/recall for minority class
# Regression
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
3) Analyze errors systematically
python
# Error by segment
errors = y_pred != y_true
error_df = X_test[errors]
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]
# Analyze patterns in errors
print(error_df.groupby('category').size())
4) Track experiments
python
import mlflow
with mlflow.start_run():
mlflow.log_params(params)
mlflow.log_metrics({'auc': auc, 'f1': f1})
mlflow.sklearn.log_model(model, 'model')
Common anti-patterns
- •❌ Single train/test split without CV
- •❌ Optimizing wrong metric (accuracy on imbalanced data)
- •❌ Data leakage in preprocessing
- •❌ Not checking calibration for probability outputs
- •❌ Ignoring inference speed/memory constraints
- •❌ No error analysis or debugging bad predictions
Progressive disclosure
- •
../references/cross-validation.md— CV strategies for different data types - •
../references/metrics-guide.md— Choosing and interpreting metrics - •
../references/hyperparameter-tuning.md— Optuna, Ray Tune patterns - •
../references/experiment-tracking.md— MLflow, W&B setup
Related skills
- •
@data-science-feature-engineering— Features to evaluate - •
@data-engineering-orchestration— Production model deployment - •
@data-engineering-observability— Model monitoring in production