# Eval and Benchmarking

## Metric Selection by Task Type
| Task | Primary Metric | Secondary | Avoid |
|---|---|---|---|
| Classification | F1 (macro) | Precision, Recall, AUC-ROC | Accuracy (misleading on imbalanced classes) |
| Text generation | Human eval, LLM-as-judge | BLEU, ROUGE-L | BLEU alone (poor correlation) |
| Summarization | ROUGE-L, BERTScore | Faithfulness (NLI-based) | ROUGE-1 alone |
| QA (extractive) | Exact Match, F1 (token) | Has-answer accuracy | Accuracy without normalization |
| QA (generative) | LLM-as-judge, semantic sim | ROUGE-L | Exact match (too strict) |
| Code generation | pass@k (estimator below) | Functional correctness | BLEU (meaningless for code) |
| Translation | COMET, BLEURT | chrF, BLEU | BLEU without normalization |
| Reasoning | Accuracy (MCQ) | Chain-of-thought analysis | Single-benchmark conclusions |
**Decision rule:** If human eval is feasible, do it. If not, use LLM-as-judge plus at least one automatic metric. Never rely on a single metric.
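For pass@k specifically, sampling k completions and checking whether any passes is a high-variance estimate. The standard unbiased estimator from the Codex paper (Chen et al., 2021) instead generates n ≥ k samples per problem, counts the c that pass, and computes pass@k analytically:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), in numerically stable form."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For example, with n=20 samples of which c=3 pass, `pass_at_k(20, 3, 1)` gives 0.15 and `pass_at_k(20, 3, 5)` about 0.60.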
## lm-evaluation-harness

### Running Standard Benchmarks
```bash
# Install
pip install lm-eval

# Run MMLU (5-shot) on a local model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/

# Run multiple benchmarks
lm_eval --model hf \
  --model_args pretrained=./my_model \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --batch_size auto \
  --output_path ./results/

# Evaluate a vLLM server
lm_eval --model local-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1 \
  --tasks mmlu \
  --num_fewshot 5
```
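The harness writes a timestamped JSON file under `--output_path`. A sketch for pulling scores back out; the file-name pattern and the metric key format (e.g. `acc,none`) are assumptions based on recent harness versions, so adjust for yours:

```python
import json
from pathlib import Path

# Grab the newest results file. The harness names them results_<timestamp>.json
# inside a per-model subdirectory -- an assumption; adjust to your layout.
results_file = sorted(Path("./results").rglob("results_*.json"))[-1]
data = json.loads(results_file.read_text())

for task, metrics in data["results"].items():
    for key, value in metrics.items():
        if isinstance(value, float):
            print(f"{task:30s} {key:25s} {value:.4f}")
```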
### Custom Eval Task
```yaml
# tasks/my_task/my_task.yaml
task: my_custom_qa
dataset_path: json
dataset_name: null
dataset_kwargs:
  data_files: "./data/eval.jsonl"
output_type: generate_until
training_split: null
test_split: train  # json datasets load as a single "train" split
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
generation_kwargs:
  until: ["\n", "Question:"]
  max_gen_toks: 64
  temperature: 0
  do_sample: false
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```
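The `doc_to_text`/`doc_to_target` templates above imply that `eval.jsonl` holds one object per line with `question` and `answer` fields. A throwaway script to build a toy file in that shape (the example rows are invented):

```python
import json

rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Nineteen Eighty-Four?", "answer": "George Orwell"},
]
with open("./data/eval.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```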
```bash
# Run custom task
lm_eval --model hf \
  --model_args pretrained=./my_model \
  --tasks my_custom_qa \
  --include_path ./tasks/ \
  --output_path ./results/
```
### Custom Metric in Python
```python
# tasks/my_task/utils.py
import re


def normalize_answer(s):
    """Lowercase, strip articles/punctuation/whitespace."""
    s = s.lower().strip()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(s.split())


def custom_exact_match(references, predictions, **kwargs):
    correct = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(predictions)
```
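To register the function, recent harness versions let the task YAML reference Python with a `!function` tag (e.g. `metric: !function utils.custom_exact_match`); check the docs for your version, since the hook has changed across releases. A quick sanity check of the normalization on toy inputs:

```python
print(normalize_answer("The Eiffel Tower!"))  # -> "eiffel tower"
# Articles, punctuation, and casing no longer break the match:
print(custom_exact_match(["The Eiffel Tower"], ["eiffel tower."]))  # -> 1.0
```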
## Statistical Significance Testing

### Bootstrap Confidence Intervals
```python
import numpy as np


def bootstrap_ci(scores, n_bootstrap=10000, confidence=0.95):
    """Bootstrap confidence interval for a metric."""
    rng = np.random.default_rng(42)
    bootstrapped = np.array([
        np.mean(rng.choice(scores, size=len(scores), replace=True))
        for _ in range(n_bootstrap)
    ])
    alpha = (1 - confidence) / 2
    return np.percentile(bootstrapped, [100 * alpha, 100 * (1 - alpha)])


def paired_bootstrap_test(scores_a, scores_b, n_bootstrap=10000):
    """Test if model A is significantly better than model B."""
    rng = np.random.default_rng(42)
    diff = np.array(scores_a) - np.array(scores_b)
    observed_diff = np.mean(diff)
    count = 0
    for _ in range(n_bootstrap):
        sample = rng.choice(diff, size=len(diff), replace=True)
        if np.mean(sample) <= 0:
            count += 1
    p_value = count / n_bootstrap
    return {"observed_diff": observed_diff, "p_value": p_value,
            "significant": p_value < 0.05}
```
```python
# Usage
model_a_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # per-example correctness
model_b_scores = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]

ci = bootstrap_ci(model_a_scores)
print(f"Model A: {np.mean(model_a_scores):.3f} [{ci[0]:.3f}, {ci[1]:.3f}]")

result = paired_bootstrap_test(model_a_scores, model_b_scores)
print(f"Diff: {result['observed_diff']:.3f}, p={result['p_value']:.3f}")
```
## LLM-as-Judge
```python
import json

JUDGE_PROMPT = """Rate the following response on a scale of 1-5.

Criteria:
- Accuracy: Does the response correctly answer the question?
- Completeness: Does it cover all relevant aspects?
- Conciseness: Is it free of unnecessary information?

Question: {question}
Reference Answer: {reference}
Model Response: {response}

Output ONLY a JSON object: {{"score": <1-5>, "reasoning": "<brief explanation>"}}"""


def llm_judge(client, question, reference, response):
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, response=response
        )}],
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
```
**Gotcha:** LLM judges have position bias: they tend to prefer the first response in a pairwise comparison. Always run comparisons in both orders and average, as in the sketch below.
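A minimal sketch of that both-orders protocol, reusing the OpenAI-style `client` from `llm_judge` above. `PAIRWISE_PROMPT`, the judge model, and the tie policy are illustrative assumptions:

```python
PAIRWISE_PROMPT = """Which response better answers the question? Reply with ONLY "A" or "B".

Question: {question}

Response A: {first}

Response B: {second}"""


def pairwise_judge(client, question, first, second):
    # Hypothetical judge call mirroring llm_judge above; returns "A" or "B".
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
            question=question, first=first, second=second)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip()[:1].upper()


def debiased_preference(client, question, resp_a, resp_b):
    """Judge in both orders so position bias cancels out."""
    first = pairwise_judge(client, question, resp_a, resp_b)   # "A" means resp_a wins
    second = pairwise_judge(client, question, resp_b, resp_a)  # "A" means resp_b wins
    wins_a = (first == "A") + (second == "B")
    if wins_a == 2:
        return "A"
    if wins_a == 0:
        return "B"
    return "tie"  # verdict flipped with order: call it a tie
```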
## Contamination Detection

### N-gram Overlap Check
```python
def check_contamination(train_texts, eval_texts, n=13):
    """Check for n-gram overlap between training and eval data.

    n=13 follows the GPT-3-style decontamination convention.
    """
    def get_ngrams(text, n):
        words = text.lower().split()
        return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    train_ngrams = set()
    for text in train_texts:
        train_ngrams.update(get_ngrams(text, n))

    contaminated = []
    for i, text in enumerate(eval_texts):
        eval_ngrams = get_ngrams(text, n)
        overlap = eval_ngrams & train_ngrams
        if overlap:
            contaminated.append({
                "index": i,
                "overlap_ratio": len(overlap) / max(len(eval_ngrams), 1),
                "sample_overlap": list(overlap)[:3],
            })
    return contaminated
```
### Canary String Method
```python
# Insert unique canaries into training data, then check if the model memorized them
CANARY = "The quick brown fox benchmark canary 7f3a9b2e"


def test_memorization(model, tokenizer, canary_prefix, full_canary):
    """If the model completes the canary, training data leaked into eval."""
    inputs = tokenizer(canary_prefix, return_tensors="pt").to(model.device)
    # Greedy decoding: do_sample=False (temperature is ignored when not sampling)
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    return full_canary in generated
```
## Gotchas and Anti-Patterns

### Benchmark Gaming
- **Prompt format sensitivity:** A model scoring 70% on MMLU with one prompt template can score 60% with another. Always report the exact prompt template.
- **Few-shot vs. zero-shot:** Not comparable. A 5-shot result is not better than a 0-shot result from a different model. Always compare like-for-like.
- **Subset selection:** Reporting results only on "selected" benchmarks is cherry-picking. Use standard suites (e.g., the Open LLM Leaderboard tasks).
### Lost-in-the-Middle

In long-context evals, models perform best on information near the beginning and end of the context and worst in the middle. An eval must test retrieval at all positions; see the sketch below.
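A minimal sketch of such a position sweep, assuming a `generate(prompt) -> str` callable for the model under test (hypothetical); it plants a needle fact at several relative depths in filler text and checks recall at each:

```python
NEEDLE = "The secret code is 4817."
QUESTION = "What is the secret code?"
FILLER = "Grass is green. " * 400  # padding; size this to the context window under test


def position_sweep(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Plant the needle at several relative depths and test recall at each."""
    results = {}
    for depth in depths:
        cut = int(len(FILLER) * depth)
        context = FILLER[:cut] + NEEDLE + " " + FILLER[cut:]
        answer = generate(f"{context}\n\n{QUESTION}")
        results[depth] = "4817" in answer
    return results
```

A healthy long-context model passes at every depth; a pronounced dip around 0.5 is the lost-in-the-middle signature.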
### Common Mistakes
- Comparing models evaluated with different tokenizers (affects token-level metrics)
- Using `temperature > 0` during eval (introduces variance between runs)
- Not normalizing answers before exact match (whitespace, articles, casing)
- Reporting the mean without confidence intervals on small eval sets
- Evaluating instruction-tuned models with base-model prompts (or vice versa)
- Running benchmarks on data the model was trained on (contamination)