AgentSkillsCN

ML Workflow

设计机器学习工作流——实验跟踪、特征存储、模型训练、模型推理、漂移监测。

SKILL.md
--- frontmatter
name: "ML Workflow"
department: "alchemist"
description: "Design ML workflows — experiment tracking, feature stores, model training, serving, monitoring for drift"
version: 1
triggers:
  - "ML"
  - "machine learning"
  - "model"
  - "training"
  - "feature store"
  - "MLflow"
  - "experiment"
  - "drift"
  - "serving"
  - "inference"
  - "W&B"
  - "Weights & Biases"
  - "model deployment"

ML Workflow

Purpose

Design end-to-end ML workflows covering experiment tracking, feature engineering and storage, model training pipelines, model serving and deployment, A/B testing for models, and monitoring for data and model drift. Produces a workflow architecture, tool selection rationale, and operational runbook.

Inputs

  • ML problem type (classification, regression, ranking, recommendation, NLP, CV)
  • Data sources and feature candidates
  • Model complexity range (linear/tree-based vs deep learning)
  • Serving requirements (batch predictions, real-time inference, edge deployment)
  • Team size and ML maturity (first model vs established ML platform)
  • Infrastructure constraints (cloud provider, GPU availability, budget)

Process

Step 1: Define the ML Problem Clearly

Before any tooling decisions, formalize:

  • What is the prediction target? What does "correct" look like?
  • What is the business metric this model optimizes? (Not just accuracy — revenue, conversion, engagement)
  • What is the baseline? (Rule-based heuristic, current model, random chance)
  • What is the minimum viable performance to ship?

Document the problem statement, target variable, evaluation metric, and success threshold.

Step 2: Design the Feature Engineering Pipeline

Map raw data to model-ready features:

  • Feature identification: Which raw fields become features? What transformations are needed (encoding, scaling, windowing, embedding)?
  • Temporal features: Aggregations over time windows (last 7 days, last 30 days). Guard against leakage — never use future data to predict the past.
  • Feature store evaluation: Does this project warrant a feature store (Feast, Tecton, Hopsworks)? Feature stores add value when: features are shared across models, real-time features are needed, or training-serving skew is a risk.
  • Feature documentation: Each feature should have: name, description, data type, source, transformation logic, and expected distribution.

Step 3: Design Experiment Tracking

Set up reproducible experiment management:

  • Tool selection: MLflow (open-source, self-hosted), Weights & Biases (managed, rich visualization), Neptune, or ClearML.
  • What to track: Hyperparameters, metrics (train/val/test), dataset version, code version (git SHA), environment (dependencies), artifacts (model files, plots).
  • Experiment organization: Project → Experiment group → Individual runs. Name runs meaningfully (not "run_42").
  • Comparison workflow: How does the team compare runs? Dashboard? Automated reports?

Step 4: Design the Training Pipeline

Build a reproducible, automated training workflow:

  • Data split strategy: Time-based splits for temporal data, stratified splits for imbalanced classes. Never random-split time-series data.
  • Training orchestration: Single script, or DAG-based (Airflow, Kubeflow Pipelines, SageMaker Pipelines)?
  • Hyperparameter tuning: Grid search, random search, Bayesian optimization (Optuna, Ray Tune)?
  • Validation strategy: Cross-validation, holdout, or time-series walk-forward?
  • Model registry: Where are trained models stored? How are they versioned? Who approves promotion to production?

Step 5: Design Model Serving

Plan how predictions reach users:

  • Batch serving: Run predictions on a schedule, store results in a table. Best for recommendations, risk scores, daily reports.
  • Real-time serving: Model behind an API endpoint. Best for search ranking, fraud detection, dynamic pricing.
  • Streaming serving: Model embedded in a stream processor. Best for event-driven predictions on Kafka/Kinesis streams.
  • Edge serving: Model deployed to device/browser. Best for latency-critical or offline-capable applications.

For real-time serving, specify: latency SLA (p50/p99), throughput (requests/second), scaling strategy (auto-scale triggers), and fallback behavior (what happens if the model is unavailable?).

Step 6: Design A/B Testing for Models

Plan controlled rollout of model changes:

  • Traffic splitting: How is traffic divided between control (current model) and treatment (new model)?
  • Metric selection: Primary metric (business KPI), guardrail metrics (latency, error rate), and minimum detectable effect.
  • Duration calculation: How long must the test run to reach statistical significance?
  • Rollback criteria: What triggers an automatic rollback?

Step 7: Design Monitoring and Drift Detection

Plan ongoing model health monitoring:

  • Data drift: Monitor input feature distributions for shifts. Tool options: Evidently, WhyLabs, Great Expectations.
  • Model drift: Monitor prediction distribution and performance metrics over time. Alert when performance degrades below threshold.
  • Concept drift: Monitor the relationship between features and target. Retrain triggers when the world changes (seasonality, market shifts).
  • Operational monitoring: Latency, error rates, throughput, GPU utilization for serving infrastructure.

Define retraining policy: scheduled (weekly/monthly), triggered (drift detected), or continuous (online learning).

Output Format

markdown
# ML Workflow: [Project/Model Name]

## Problem Definition

| Aspect | Detail |
|--------|--------|
| Problem type | ... |
| Target variable | ... |
| Business metric | ... |
| Evaluation metric | ... |
| Baseline performance | ... |
| Success threshold | ... |

## Feature Engineering

| Feature | Source | Transformation | Type | Leakage Risk |
|---------|--------|---------------|------|-------------|
| ...     | ...    | ...           | ...  | Low/Med/High |

**Feature store:** [Yes/No — tool choice and rationale]

## Experiment Tracking

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| Tool | ... | ... |
| What's tracked | ... | ... |
| Organization | ... | ... |

## Training Pipeline

[ASCII diagram showing data → features → train → evaluate → register]

code

| Stage | Tool/Method | Notes |
|-------|------------|-------|
| Data split | ... | ... |
| Training | ... | ... |
| Tuning | ... | ... |
| Validation | ... | ... |
| Registry | ... | ... |

## Model Serving

| Aspect | Detail |
|--------|--------|
| Serving mode | Batch / Real-time / Streaming / Edge |
| Latency SLA | ... |
| Throughput | ... |
| Scaling | ... |
| Fallback | ... |

## A/B Testing

| Aspect | Detail |
|--------|--------|
| Traffic split | ... |
| Primary metric | ... |
| Guardrail metrics | ... |
| Min duration | ... |
| Rollback criteria | ... |

## Monitoring and Drift

| Monitor | Tool | Threshold | Action |
|---------|------|-----------|--------|
| Data drift | ... | ... | ... |
| Model drift | ... | ... | ... |
| Concept drift | ... | ... | ... |
| Operational | ... | ... | ... |

**Retraining policy:** [Scheduled / Triggered / Continuous — details]

Quality Checks

  • Problem definition includes a clear business metric, not just an ML metric
  • Feature engineering documents leakage risk for every temporal feature
  • Experiment tracking captures enough metadata to reproduce any run
  • Training pipeline uses appropriate split strategy (time-based for temporal data)
  • Model registry has a clear promotion workflow (dev → staging → production)
  • Serving architecture matches latency and throughput requirements
  • A/B testing plan includes statistical power calculation and guardrail metrics
  • Drift monitoring covers data, model, and concept drift with defined thresholds
  • Retraining policy is documented with clear triggers and automation level
  • Fallback behavior is defined for model unavailability

Evolution Notes

<!-- Observations appended after each use -->