Model Evaluation

Name: data-science-model-evaluation
Rating: 92
Author: legout

Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.

When to use this skill

•Model training complete — need performance assessment
•Comparing multiple models/algorithms
•Diagnosing overfitting/underfitting
•Hyperparameter tuning
•Production readiness check

Evaluation workflow

•Cross-validation strategy
  - K-fold (default for most cases)
  - Stratified K-fold (classification with imbalance)
  - TimeSeriesSplit (temporal data)
  - GroupKFold (grouped/clustered data)
•Choose appropriate metrics
  - Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
  - Regression: MAE, RMSE, R², MAPE
  - Ranking: NDCG, MAP
  - Business: custom metrics tied to outcomes
•Analyze performance
  - Cross-validation mean ± std
  - Validation curve (bias-variance tradeoff)
  - Learning curves (data sufficiency)
  - Error analysis by segment
•Model comparison
  - Statistical significance (paired t-test, McNemar)
  - Calibration (for probability outputs)
  - Speed vs accuracy tradeoffs
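
As a rough illustration of the paired comparison step, the sketch below scores two models on identical folds and runs a paired t-test on the per-fold AUCs. The synthetic dataset and the two estimators are assumptions chosen for the example, not a recommendation. Fold scores share training data and are not independent, so the p-value is best read as a rough guide.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One splitter with a fixed seed gives both models identical folds,
# which is what makes the comparison paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

model_a = LogisticRegression(max_iter=1000)
model_b = RandomForestClassifier(n_estimators=100, random_state=0)

scores_a = cross_val_score(model_a, X, y, cv=cv, scoring="roc_auc")
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring="roc_auc")

# Per-fold differences and a paired t-test on the matched scores.
diff = scores_a - scores_b
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

print(f"Mean AUC difference (A - B): {diff.mean():.3f} ± {diff.std():.3f}")
print(f"Paired t-test: t={t_stat:.2f}, p={p_value:.3f}")
```

For comparing two classifiers' hard predictions on a single test set, McNemar's test is the usual alternative to the fold-based t-test.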

Quick tool selection

Task	Default choice	Notes
Cross-validation	sklearn.model_selection	Standard CV, stratified, time series
Metrics	sklearn.metrics	Comprehensive metric suite
Hyperparameter tuning	Optuna or Ray Tune	Efficient search algorithms
Model comparison	scikit-learn + statistical tests	Paired comparisons
Experiment tracking	MLflow or Weights & Biases	Track runs, metrics, artifacts

Core implementation rules

1) Always use proper validation

python

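from sklearn.model_selection import cross_val_score, StratifiedKFold

# Assumes `model`, `X`, and `y` are already defined.
# Shuffled, stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
# Report mean ± one standard deviation across folds.
print(f"CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")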
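
Where rows are grouped or ordered in time, the stratified splitter above can be swapped for GroupKFold or TimeSeriesSplit. The following is a minimal sketch, assuming toy arrays and hypothetical group ids. It only prints the splits so the behaviour of each splitter is visible.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data: 12 rows, assumed to be in time order for the TimeSeriesSplit part.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
# Hypothetical group ids, e.g. 4 customers with 3 rows each.
groups = np.repeat(np.arange(4), 3)

# GroupKFold keeps every group entirely on one side of each split.
group_splits = list(GroupKFold(n_splits=4).split(X, y, groups=groups))
for train_idx, test_idx in group_splits:
    print("held-out groups:", sorted(set(groups[test_idx])))

# TimeSeriesSplit trains only on rows that come before the test rows.
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in ts_splits:
    print(f"train up to row {train_idx.max()}, test rows {test_idx.min()}-{test_idx.max()}")
```

Either splitter can be passed as `cv=` to `cross_val_score`. For GroupKFold, the group labels also need to be passed through its `groups=` argument.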

2) Match metrics to problem

python

# Classification with imbalance
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
# Focus on F1, precision/recall for minority class

# Regression (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
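
For imbalanced classification, threshold-free metrics computed from predicted probabilities can complement the report above. The sketch below is one way to do this, assuming a synthetic dataset with roughly 5% positives and a logistic regression as a stand-in model. It uses `average_precision_score` as one common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption for illustration): about 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
baseline = y_test.mean()  # average precision of a random ranking is about this

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f} (positive rate: {baseline:.3f})")
```

Comparing PR-AUC against the positive rate, rather than against 0.5, gives a fairer sense of how much the model improves on chance for the minority class.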

3) Analyze errors systematically

python

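# Error by segment (assumes X_test is a DataFrame whose rows align with y_true and y_pred)
errors = y_pred != y_true
error_df = X_test[errors].copy()  # copy so the new columns are not written to a view of X_test
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]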

# Analyze patterns in errors
print(error_df.groupby('category').size())
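
Raw error counts per segment can mislead when segments differ in size, so an error rate per segment is often more informative. This is a small sketch with made-up rows. The `category`, `true`, and `pred` column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical predictions for six rows across three segments.
results = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "c"],
    "true":     [1, 0, 1, 1, 0, 1],
    "pred":     [1, 0, 0, 0, 1, 1],
})
results["error"] = results["true"] != results["pred"]

# Segment size, error count, and error rate, worst segment first.
by_segment = (
    results.groupby("category")["error"]
    .agg(n="size", errors="sum", error_rate="mean")
    .sort_values("error_rate", ascending=False)
)
print(by_segment)
```

Keeping `n` next to the rate makes it easier to tell a genuinely weak segment from one that simply has very few rows.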

4) Track experiments

python

import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics({'auc': auc, 'f1': f1})
    mlflow.sklearn.log_model(model, 'model')

Common anti-patterns

•❌ Single train/test split without CV
•❌ Optimizing wrong metric (accuracy on imbalanced data)
•❌ Data leakage in preprocessing
•❌ Not checking calibration for probability outputs
•❌ Ignoring inference speed/memory constraints
•❌ No error analysis or debugging bad predictions
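
One common way to avoid the preprocessing leakage listed above is to wrap the preprocessing and the model in a single scikit-learn pipeline, so that each cross-validation fold fits the scaler on its training portion only. This is a minimal sketch, assuming synthetic data and a scaler plus logistic regression as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is re-fitted inside every fold on the training rows only,
# so no statistics from the held-out rows reach the model.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"Leakage-safe CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")
```

Fitting the scaler on the full dataset before splitting would let test-fold statistics influence training, which is the leakage this pattern avoids.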

Progressive disclosure

•../references/cross-validation.md — CV strategies for different data types
•../references/metrics-guide.md — Choosing and interpreting metrics
•../references/hyperparameter-tuning.md — Optuna, Ray Tune patterns
•../references/experiment-tracking.md — MLflow, W&B setup

Related skills

•@data-science-feature-engineering — Features to evaluate
•@data-engineering-orchestration — Production model deployment
•@data-engineering-observability — Model monitoring in production
