Dataset Management
Storage and Versioning Approach
| Approach | Best For | Version Control | Scale | Collaboration |
|---|---|---|---|---|
| DVC + S3/GCS | Large datasets, ML pipelines | Git-like (content-addressed) | TB+ | Good |
| HF Datasets + Hub | Public datasets, sharing | Hub revisions | GB-TB | Best |
| Delta Lake | Data lake, Spark workflows | Time travel | PB | Enterprise |
| Git LFS | Small datasets (<1 GB) | Native git | GB | Native |
| Lakehouse (Iceberg) | Analytics + ML combined | Snapshots | PB | Enterprise |
Decision rule: ML project with existing git workflow -> DVC. Sharing/community -> HF Hub. Enterprise data platform -> Delta/Iceberg. Small and simple -> Git LFS.
DVC Pipeline
Setup
bash
# Initialize DVC in an existing git repo
pip install dvc dvc-s3   # or dvc-gs, dvc-azure
dvc init
dvc remote add -d myremote s3://my-bucket/dvc-store

# Track a dataset
dvc add data/training.jsonl
git add data/training.jsonl.dvc data/.gitignore
git commit -m "Track training dataset v1"
dvc push
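Once a dataset is tracked and pushed, any pinned revision can be read back programmatically. A minimal sketch using DVC's Python API, assuming the remote configured above and an illustrative git tag v1.0:
python
import dvc.api

# Stream the tracked file as it existed at a given git revision,
# pulling bytes from the configured remote rather than the local workspace.
with dvc.api.open("data/training.jsonl", rev="v1.0", remote="myremote") as f:
    first_record = f.readline()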
DVC Pipeline Definition
yaml
# dvc.yaml
stages:
  prepare:
    cmd: python scripts/prepare_data.py
    deps:
      - scripts/prepare_data.py
      - data/raw/
    outs:
      - data/processed/train.jsonl
      - data/processed/eval.jsonl
    params:
      - prepare.max_length
      - prepare.min_quality_score
  train:
    cmd: python scripts/train.py
    deps:
      - scripts/train.py
      - data/processed/train.jsonl
    outs:
      - models/adapter/
    params:
      - train.learning_rate
      - train.epochs
    metrics:
      - metrics/eval_results.json:
          cache: false
  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - scripts/evaluate.py
      - models/adapter/
      - data/processed/eval.jsonl
    metrics:
      - metrics/benchmark_results.json:
          cache: false
    plots:
      - metrics/confusion_matrix.csv
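The params entries refer to keys in params.yaml. Inside the stage scripts they can be read back with DVC's Python API; a small sketch for scripts/train.py, with key names mirroring the stage definition above:
python
import dvc.api

# Reads params.yaml (and any other tracked params files) as a nested dict.
params = dvc.api.params_show()
learning_rate = params["train"]["learning_rate"]
epochs = params["train"]["epochs"]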
HF Datasets Patterns
Creating a Dataset from Scratch
python
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel

# From a list of dicts -- convert to a column-oriented dict first
data = [
    {"text": "Great product!", "label": "positive"},
    {"text": "Terrible experience.", "label": "negative"},
]
data_dict = {k: [d[k] for d in data] for k in data[0]}
ds = Dataset.from_dict(data_dict)

# With an explicit schema (data_dict must contain every declared column)
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["negative", "neutral", "positive"]),
    "metadata": {
        "source": Value("string"),
        "timestamp": Value("string"),
    },
})
ds = Dataset.from_dict(data_dict, features=features)

# Create train/test split
dataset = DatasetDict({
    "train": ds.select(range(0, int(len(ds) * 0.9))),
    "test": ds.select(range(int(len(ds) * 0.9), len(ds))),
})

# Push to Hub
dataset.push_to_hub("myorg/my-dataset", private=True)
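To consume the pushed dataset elsewhere, load it back from the Hub. A sketch, assuming a token with read access to the private repo (via huggingface-cli login or HF_TOKEN):
python
from datasets import load_dataset

# Pinning a revision (branch, tag, or commit hash) keeps training runs reproducible.
dataset = load_dataset("myorg/my-dataset", revision="main")
train_ds = dataset["train"]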
Processing Pipeline
python
import hashlib
from datasets import load_dataset

ds = load_dataset("json", data_files="data/raw/*.jsonl", split="train")

# Chain transformations (clean_text and tokenize_and_align are project-specific functions)
processed = (
    ds
    .filter(lambda x: len(x["text"]) > 50)                    # Remove short
    .filter(lambda x: x["language"] == "en")                   # Language filter
    .map(clean_text, num_proc=8)                               # Parallel cleaning
    .map(tokenize_and_align, batched=True, batch_size=1000)    # Batched tokenization
    .remove_columns(["raw_html", "metadata"])                  # Drop unneeded cols
    .shuffle(seed=42)
)

# Deduplicate by content hash
def add_hash(example):
    example["content_hash"] = hashlib.md5(example["text"].encode()).hexdigest()
    return example

processed = processed.map(add_hash)

unique_hashes = set()

def dedup(example):
    if example["content_hash"] in unique_hashes:
        return False
    unique_hashes.add(example["content_hash"])
    return True

processed = processed.filter(dedup)
Gotcha: running filter with num_proc > 1 is not safe when the predicate relies on external mutable state (like unique_hashes above), because each worker process gets its own copy of the set. Dedup must run single-process or use a different approach (e.g., sort + batched dedup); see the sketch below.
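A minimal single-process sketch of the batched alternative, assuming the content_hash column added above; batched filtering cuts per-row overhead while the shared set stays correct because num_proc defaults to 1:
python
seen = set()

def dedup_batch(batch):
    # Return one boolean per row in the batch: keep only first occurrences.
    keep = []
    for h in batch["content_hash"]:
        keep.append(h not in seen)
        seen.add(h)
    return keep

processed = processed.filter(dedup_batch, batched=True, batch_size=10_000)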
Data Quality Checks
python
from collections import Counter

def quality_report(dataset):
    """Generate data quality report. Run before every training run."""
    report = {
        "total_examples": len(dataset),
        "empty_texts": sum(1 for x in dataset if not x["text"].strip()),
        "avg_length": sum(len(x["text"]) for x in dataset) / len(dataset),
        "length_p95": sorted(len(x["text"]) for x in dataset)[int(len(dataset) * 0.95)],
    }

    # Label distribution
    if "label" in dataset.column_names:
        labels = Counter(dataset["label"])
        report["label_distribution"] = dict(labels)
        report["label_imbalance_ratio"] = max(labels.values()) / max(min(labels.values()), 1)

    # Duplicate detection
    texts = dataset["text"]
    unique = len(set(texts))
    report["duplicates"] = len(texts) - unique
    report["duplicate_pct"] = (len(texts) - unique) / len(texts) * 100

    return report

def assert_data_quality(report, max_dup_pct=5.0, max_imbalance=10.0, min_examples=100):
    """Hard gates -- fail the pipeline if data is bad."""
    assert report["total_examples"] >= min_examples, f"Too few examples: {report['total_examples']}"
    assert report["empty_texts"] == 0, f"Found {report['empty_texts']} empty texts"
    assert report["duplicate_pct"] <= max_dup_pct, f"Duplicate rate too high: {report['duplicate_pct']:.1f}%"
    if "label_imbalance_ratio" in report:
        assert report["label_imbalance_ratio"] <= max_imbalance, \
            f"Label imbalance too high: {report['label_imbalance_ratio']:.1f}x"
Annotation Pipeline
Label Studio Setup
python
# label_studio_config.xml -- labeling config for text classification
LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text" choice="single">
    <Choice value="positive"/>
    <Choice value="negative"/>
    <Choice value="neutral"/>
  </Choices>
</View>
"""

# Import tasks programmatically
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key="your-key")
project = ls.start_project(title="Sentiment v2", label_config=LABEL_CONFIG)

# Import data (unlabeled_data: any iterable of examples with a "text" field)
tasks = [{"data": {"text": example["text"]}} for example in unlabeled_data]
project.import_tasks(tasks)
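Once annotation is done, results can be pulled back for training. A sketch against the same SDK client, assuming the single-choice sentiment config above (field names follow Label Studio's JSON export format):
python
# Export completed tasks and convert them to (text, label) pairs.
exported = project.export_tasks(export_type="JSON")

labeled = []
for task in exported:
    annotations = task.get("annotations", [])
    if not annotations:
        continue  # skip tasks nobody has labeled yet
    choice = annotations[0]["result"][0]["value"]["choices"][0]
    labeled.append({"text": task["data"]["text"], "label": choice})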
Gotchas and Anti-Patterns
Data Leakage
- Always split before any processing that uses global statistics (normalization, TF-IDF)
- Check for near-duplicates across train/test, not just exact duplicates
- Time-series data: split by time, never randomly
Train/Test Contamination
- If your eval set overlaps with a public training corpus, your benchmarks are meaningless
- Use n-gram overlap detection between your training data and standard benchmarks (see the sketch below)
- Keep a held-out "canary" test set that is never used during development
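A minimal n-gram overlap sketch; the choice of n and the whitespace tokenization are illustrative simplifications of real decontamination pipelines:
python
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_examples(train_texts, benchmark_texts, n=8):
    # Flag any training example that shares an n-gram with a benchmark item.
    bench_grams = set()
    for t in benchmark_texts:
        bench_grams |= ngrams(t, n)
    return [t for t in train_texts if ngrams(t, n) & bench_grams]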
Common Mistakes
- Shuffling before splitting grouped data (e.g., the same user's reviews in both splits; see the group-split sketch below)
- Not versioning the preprocessing code alongside the data
- Filtering after splitting (can remove all examples of a class from test)
- Using num_proc > 1 with stateful filter functions (race conditions)
- Not tracking dataset lineage (which raw data produced which processed version)
- Oversampling the minority class in the test set (inflates metrics)
- Annotator fatigue in long labeling sessions (quality drops after ~2 hours)
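A sketch of a deterministic group-aware split that keeps all of one user's rows on the same side; ds is a datasets.Dataset and the user_id column is an assumption about the schema:
python
import hashlib

def assign_split(example, test_pct=10):
    # Hash the group key so the assignment is stable across runs and machines.
    bucket = int(hashlib.md5(example["user_id"].encode()).hexdigest(), 16) % 100
    return "test" if bucket < test_pct else "train"

train_ds = ds.filter(lambda x: assign_split(x) == "train")
test_ds = ds.filter(lambda x: assign_split(x) == "test")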