python-data-science

Enforces Python data science standards, including type hints, Google-style docstrings, defensive programming, Pandas/NumPy best practices, and configuration patterns. Use when writing or reviewing Python code for data analysis, forecasting pipelines, or scientific computing.

SKILL.md
---
name: python-data-science
description: Enforces Python data science standards including type hints, Google-style docstrings, defensive programming, Pandas/NumPy best practices, and configuration patterns. Use when writing or reviewing Python code for data analysis, forecasting pipelines, or scientific computing.
---

Python Data Science Expert

When to Apply

  • Writing Python for data analysis, forecasting, or scientific computing
  • Reviewing data science or ML pipeline code
  • Adding type hints, docstrings, or error handling to data code

Code Style & Structure

1. Type Hints Everywhere

Use typing extensively: List, Dict, Optional, Union, Tuple. Use dataclasses for structured data. Never omit return type hints.

python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

import numpy as np
import pandas as pd

@dataclass
class TimeSeriesPoint:
    timestamp: datetime
    value: float
    confidence: Optional[float] = None

def process_data(
    df: pd.DataFrame,
    window_size: int = 24
) -> Tuple[np.ndarray, np.ndarray]:
    """Process time series with rolling window."""
    ...
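The elided body could be completed, for example, with rolling statistics. This is a sketch only: the `'value'` column name and the choice of mean/std are assumptions, not part of the standard.

```python
from typing import Tuple

import numpy as np
import pandas as pd

def process_data(
    df: pd.DataFrame,
    window_size: int = 24
) -> Tuple[np.ndarray, np.ndarray]:
    """Process time series with a rolling window.

    Returns the rolling mean and rolling standard deviation of the
    'value' column (column name and statistics chosen for illustration).
    """
    rolling = df['value'].rolling(window=window_size, min_periods=1)
    return rolling.mean().to_numpy(), rolling.std().to_numpy()
```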

2. Docstrings (Google Style)

Every public function needs a docstring with Args, Returns, Raises, and Example when helpful.

python
def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str]
) -> Dict[str, float]:
    """
    Calculate forecasting error metrics.

    Args:
        predictions: Predicted values array (n_samples,)
        actuals: Actual observed values (n_samples,)
        metric_types: List of metrics to compute ['mae', 'mape', 'rmse']

    Returns:
        Dictionary mapping metric names to computed values

    Raises:
        ValueError: If predictions and actuals have different lengths

    Example:
        >>> pred = np.array([100, 105, 110])
        >>> actual = np.array([102, 103, 112])
        >>> metrics = calculate_metrics(pred, actual, ['mae', 'rmse'])
    """
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    ...
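One possible completion of the elided body: the metric formulas (MAE, RMSE, MAPE) are standard, but the dictionary-dispatch structure is an assumption for illustration.

```python
import numpy as np
from typing import Dict, List

def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str]
) -> Dict[str, float]:
    """Calculate forecasting error metrics (see docstring above)."""
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    errors = predictions - actuals
    metric_fns = {
        'mae': lambda: float(np.mean(np.abs(errors))),
        'rmse': lambda: float(np.sqrt(np.mean(errors ** 2))),
        'mape': lambda: float(np.mean(np.abs(errors / actuals)) * 100),
    }
    unknown = set(metric_types) - set(metric_fns)
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return {name: metric_fns[name]() for name in metric_types}
```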

3. Error Handling

Use specific exception types, not bare except:. Log errors with context. Provide helpful messages.

python
import logging
logger = logging.getLogger(__name__)

try:
    data = load_csv(filepath)
except FileNotFoundError:
    logger.error(f"Data file not found: {filepath}")
    raise
except pd.errors.EmptyDataError:
    logger.error(f"CSV file is empty: {filepath}")
    raise ValueError(f"Cannot process empty file: {filepath}")
except Exception as e:
    logger.exception(f"Unexpected error loading {filepath}")
    raise RuntimeError(f"Failed to load data: {e}") from e

4. Defensive Programming

Validate inputs early. Use assertions for invariants. Check edge cases (empty data, constant columns, nulls). Take a defensive copy before modifying an input frame.

python
def normalize_timeseries(
    data: pd.DataFrame,
    columns: List[str]
) -> pd.DataFrame:
    """Normalize specified columns to [0, 1] range."""
    assert len(data) > 0, "Cannot normalize empty dataframe"
    assert all(col in data.columns for col in columns), \
        f"Missing columns: {set(columns) - set(data.columns)}"

    null_counts = data[columns].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values detected: {null_counts[null_counts > 0]}")

    result = data.copy()
    for col in columns:
        col_min, col_max = result[col].min(), result[col].max()
        if col_max == col_min:
            logger.warning(f"Column {col} is constant, setting to 0.5")
            result[col] = 0.5
        else:
            result[col] = (result[col] - col_min) / (col_max - col_min)
    return result
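A quick usage check. The function is repeated here in condensed form so the snippet runs standalone; the column names are illustrative.

```python
import pandas as pd

def normalize_timeseries(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    # Condensed copy of the definition above, repeated for a runnable demo
    result = data.copy()
    for col in columns:
        col_min, col_max = result[col].min(), result[col].max()
        if col_max == col_min:
            result[col] = 0.5  # constant column: no range to scale
        else:
            result[col] = (result[col] - col_min) / (col_max - col_min)
    return result

df = pd.DataFrame({'load': [10.0, 20.0, 30.0], 'flag': [1.0, 1.0, 1.0]})
out = normalize_timeseries(df, ['load', 'flag'])
# 'load' maps to 0.0, 0.5, 1.0; constant 'flag' becomes 0.5;
# df itself is unchanged thanks to the defensive copy
```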

5. Pandas Best Practices

Use vectorized operations; avoid row loops. Chain operations when clear. Use .loc / .iloc explicitly. Be explicit about copies vs views.

python
# GOOD: Vectorized, explicit
df = (
    pd.read_csv('data.csv')
    .assign(
        timestamp=lambda x: pd.to_datetime(x['timestamp']),
        hour=lambda x: x['timestamp'].dt.hour,
        is_weekend=lambda x: x['timestamp'].dt.dayofweek >= 5
    )
    .sort_values('timestamp')
    .reset_index(drop=True)
)
weekend_data = df.loc[df['is_weekend'], ['timestamp', 'load']]

# BAD: Loops, implicit behavior
# for i in range(len(df)): df['hour'][i] = ...  # SettingWithCopyWarning!
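The copy-vs-view point in practice: take an explicit `.copy()` when a filtered subset will be mutated on its own, and use a single `.loc` assignment to update the original in place (column names here are illustrative).

```python
import pandas as pd

df = pd.DataFrame({'hour': [1, 2, 3], 'load': [10.0, 20.0, 30.0]})

# Explicit copy: mutate a subset without touching (or warning about) df
evening = df.loc[df['hour'] >= 2].copy()
evening['load'] = evening['load'] * 1.1

# Explicit in-place update on the original: one .loc assignment on df itself
df.loc[df['hour'] >= 2, 'load'] *= 1.1
```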

6. NumPy Efficiency

Preallocate when size is known. Use appropriate dtypes (float32 vs float64). Leverage broadcasting.

python
n_samples = 1000
results = np.zeros(n_samples, dtype=np.float32)

data = np.random.rand(100, 10)
means = data.mean(axis=0)
normalized = data - means  # Broadcasting

row_sums = large_array.sum(axis=1)  # Prefer over loops
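The dtype choice is easy to quantify: halving precision halves memory, and broadcasting removes the need for an explicit column loop.

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n)                      # float64 by default: 8 bytes/element
a32 = np.zeros(n, dtype=np.float32)    # float32: 4 bytes/element
# a64.nbytes == 8_000_000, a32.nbytes == 4_000_000

# Broadcasting: center each column without a Python loop
data = np.arange(12, dtype=np.float32).reshape(4, 3)
centered = data - data.mean(axis=0)    # (4, 3) minus (3,) broadcasts row-wise
```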

7. Configuration Management

Use dataclasses or Pydantic for config. Load secrets from environment variables. Provide sensible defaults and validate in __post_init__.

python
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    context_window_hours: int = 168
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])
    gemini_api_key: str = field(default_factory=lambda: os.getenv('GEMINI_API_KEY', ''))

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6
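A quick instantiation check: overrides keep the validated defaults, and the secret is picked up from the environment at construction time. The class is repeated in condensed form so the snippet runs standalone; the key value is hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import List

# Condensed copy of ForecastConfig above, repeated for a runnable demo
@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])
    gemini_api_key: str = field(default_factory=lambda: os.getenv('GEMINI_API_KEY', ''))

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6

os.environ['GEMINI_API_KEY'] = 'demo-key'   # hypothetical value for the demo
config = ForecastConfig(forecast_horizon=48)  # override; other defaults validated
```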

8. Code Organization

Order: standard imports → third-party → local. Constants at module level. No magic numbers; use named constants.

python
# Standard imports first
import os
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from datetime import datetime

# Third-party
import numpy as np
import pandas as pd
from loguru import logger

# Local
from .config import ForecastConfig

# Constants
DEFAULT_FORECAST_HORIZON = 24
MAX_CONTEXT_WINDOW = 720

Checklist When Writing Code

  • Always: Type hints on function signatures, docstrings for public functions, explicit error handling with helpful messages, logging of important events/errors, input validation before processing
  • Prefer: Vectorized operations over loops, explicit over implicit (e.g. .copy() when modifying)
  • Avoid: Bare except:, magic numbers