# Data Check
Inspect and validate a data file before analysis. Report findings with suggested fixes. Do not auto-fix.
## When to Use

- Before any analysis or visualization on a new data file
- When parsing errors or unexpected values appear
- When switching to a different dataset mid-session
- User says "check", "inspect", "what's in this", or "profile" a data file

## When NOT to Use

- File has already been checked in this session and nothing changed
- User just wants to read a specific value (use the Read tool directly)
## Workflow

### Step 1: Raw File Inspection

Before loading into Python, check the raw file:

```bash
file <filename>                   # encoding detection
head -c 500 <filename> | cat -v   # hidden chars: ^M (CR), BOM, non-UTF8
wc -l <filename>                  # row count sanity check
head -3 <filename>                # peek at delimiter and header
```
Flag these issues:

| Symptom | Meaning |
|---|---|
| ^M at line ends | Classic Mac \r or Windows \r\n line endings |
| \xEF\xBB\xBF at start | UTF-8 BOM marker |
| Only 1 line from wc -l | Entire file on one line (wrong line endings) |
| Mixed delimiters | Inconsistent separator characters |
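If shell tools aren't available, a minimal Python sketch (standard library only; the filename is a placeholder) can surface the same symptoms:

```python
from pathlib import Path

# Peek at the raw bytes before any parsing. "data.csv" is a placeholder filename.
raw = Path("data.csv").read_bytes()
head = raw[:500]

if head.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM detected - load with encoding='utf-8-sig'")
if b"\r\n" in head:
    print("Windows (CRLF) line endings")
elif b"\r" in head:
    print("Classic Mac (CR) line endings - file may appear to be one long line")

print("approx. line count:", raw.count(b"\n") + 1)
print("first line:", head.split(b"\n", 1)[0][:120])
```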
### Step 2: Load and Summarize

For CSV/TSV files:

```python
import pandas as pd
df = pd.read_csv("file.csv")  # or sep='\t' for TSV
```

For Excel files:

```python
import pandas as pd
xls = pd.ExcelFile("file.xlsx")
print(f"Sheets: {xls.sheet_names}")
df = pd.read_excel("file.xlsx", sheet_name=0)  # or specific sheet name
```

For JSON files:

```python
df = pd.read_json("file.json")
```
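If a CSV/TSV load fails with a UnicodeDecodeError, a small fallback loop like the sketch below keeps the check moving (the encoding list is an illustrative assumption, not a rule; 'utf-8-sig' also strips a BOM):

```python
import pandas as pd

# Try likely encodings in order; the list is illustrative, not exhaustive.
df = None
for enc in ("utf-8-sig", "utf-8", "latin-1"):
    try:
        df = pd.read_csv("file.csv", encoding=enc)
        print(f"Loaded with encoding={enc}")
        break
    except UnicodeDecodeError:
        continue
```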
Report this table:
| Item | Value |
|---|---|
| Dimensions | rows x cols |
| Column names | list all |
| Column types | int, float, object, datetime, bool |
| Total missing | count and % |
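One way to fill in that table, assuming df is already loaded (a sketch, not a required part of the workflow):

```python
# High-level summary for the report table above.
n_rows, n_cols = df.shape
n_missing = int(df.isna().sum().sum())
pct_missing = 100 * n_missing / (n_rows * n_cols)

print(f"Dimensions: {n_rows} rows x {n_cols} cols")
print(f"Columns:    {list(df.columns)}")
print(f"Types:      {df.dtypes.astype(str).value_counts().to_dict()}")
print(f"Missing:    {n_missing} ({pct_missing:.1f}%)")
```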
Then per column:

```text
Column       Type      Missing    Unique   Example Values
─────────────────────────────────────────────────────────────
revenue      float64   12 (2%)    845      19.99, 49.50, 129.00
category     object    0 (0%)     8        Electronics, Clothing, Home
order_date   datetime  3 (0.5%)   365      2024-01-15, 2024-02-20
```
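A sketch that would produce the per-column view above, assuming df is loaded (the column names in the example are from sample data, not requirements):

```python
import pandas as pd

# Per-column profile: type, missing count and %, unique count, example values.
profile = pd.DataFrame({
    "Type": df.dtypes.astype(str),
    "Missing": df.isna().sum(),
    "Missing %": (100 * df.isna().mean()).round(1),
    "Unique": df.nunique(),
    "Examples": [", ".join(map(str, df[c].dropna().unique()[:3])) for c in df.columns],
})
print(profile.to_string())
```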
### Step 3: Excel-Specific Checks

When the file is .xlsx or .xls, also report:

- Sheet names: List all sheets and row counts per sheet
- Merged cells: Flag if any exist (they break pandas parsing)
- Header row position: Check if the header is row 0 or if there are title rows above it
- Formula columns: Columns that had formulas (now resolved to values)
- Named ranges: If any relevant data is in named ranges
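A sketch of the sheet, merged-cell, and named-range checks using openpyxl (the engine pandas already uses for .xlsx); formula detection would need a second pass with data_only=False and is omitted here:

```python
from openpyxl import load_workbook

# data_only=True returns cached values where cells held formulas.
wb = load_workbook("file.xlsx", data_only=True)

for name in wb.sheetnames:
    ws = wb[name]
    print(f"Sheet '{name}': {ws.max_row} rows x {ws.max_column} cols")
    if ws.merged_cells.ranges:  # merged cells break pandas parsing
        print(f"  merged cells: {[str(r) for r in ws.merged_cells.ranges]}")

# Named ranges (dict-like in recent openpyxl versions).
print("Named ranges:", list(wb.defined_names))
```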
### Step 4: Data Quality Checks
Scan ALL columns for common issues:
| Check | What to Look For |
|---|---|
| Coded missings | 97/98/99, -1/-9/-99 in numeric columns with lower range |
| Placeholder values | "N/A", "n/a", "#N/A", "NULL", "-", "" in string columns |
| Date parsing | Dates stored as strings, mixed formats (MM/DD vs DD/MM) |
| Currency strings | "$1,234.56" stored as text instead of numeric |
| Duplicate rows | Exact duplicates and near-duplicates (same key columns) |
| ID uniqueness | If a column looks like an ID, check it's actually unique |
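A sketch covering a few of these scans, assuming df is the loaded frame; the placeholder list, suspect numeric codes, and currency regex are illustrative choices to adapt per dataset:

```python
import pandas as pd

issues = []

for col in df.columns:
    s = df[col]
    if s.dtype == "object":
        # Placeholder missings and currency-like strings stored as text.
        placeholders = s.isin(["N/A", "n/a", "#N/A", "NULL", "-", ""]).sum()
        if placeholders:
            issues.append(f"{col}: {placeholders} placeholder values")
        currency = s.dropna().astype(str).str.match(r"^\$[\d,]+(\.\d+)?$").sum()
        if currency:
            issues.append(f"{col}: {currency} currency-like strings stored as text")
    elif pd.api.types.is_numeric_dtype(s):
        # Suspect coded missings in numeric columns.
        coded = s.isin([97, 98, 99, -1, -9, -99]).sum()
        if coded:
            issues.append(f"{col}: {coded} values matching common missing codes")

dups = int(df.duplicated().sum())
if dups:
    issues.append(f"{dups} exact duplicate rows")

print("\n".join(issues) if issues else "no obvious issues")
```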
Report as:

```text
DATA QUALITY ISSUES:

revenue:    15 values are "$X,XXX.XX" strings.
            Needs: df['revenue'] = df['revenue'].str.replace('[$,]', '', regex=True).astype(float)
status:     contains {'N/A', '-', ''} (n=23). Likely missing values.
order_date: mixed formats detected (2024-01-15 and 01/15/2024). Needs standardization.
```
### Step 5: Light Profiling

After the structural check:

- Numeric columns: Basic stats (mean, median, min, max, std) + correlation matrix (top 10 strongest pairs)
- Categorical columns: Value counts (top 10 levels per column, flag if >50 unique levels)
- Flag: Numeric columns stored as object (likely parsing issue), zero-variance columns, constant columns (see the sketch after the profiling code below)
```python
# Quick profiling
print(df.describe(include='all'))
print(df.nunique())

# Correlation matrix for numeric columns
numeric_cols = df.select_dtypes(include='number')
if len(numeric_cols.columns) > 1:
    corr = numeric_cols.corr()
    # Show the top correlated pairs (strongest absolute correlations first)
    pairs = corr.abs().unstack().sort_values(ascending=False)
    pairs = pairs[pairs < 1.0]  # drop self-correlations
    print(pairs.drop_duplicates().head(10))
```
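And a short sketch for the flags in the last bullet (the 90% coercion threshold is an assumption, not a fixed rule):

```python
# Numeric data stored as object -> likely a parsing issue upstream.
for col in df.select_dtypes(include="object"):
    coerced = pd.to_numeric(df[col], errors="coerce")
    if coerced.notna().mean() > 0.9:  # assumption: mostly parseable means "really numeric"
        print(f"{col}: object column that is mostly numeric")

# Constant / zero-variance columns carry no information for analysis.
for col in df.columns:
    if df[col].nunique(dropna=False) <= 1:
        print(f"{col}: constant column")
```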
### Step 6: Report Summary

Present a concise summary with three sections:

Clean: Columns ready for analysis (no issues)

Needs attention: Columns with data quality issues. Include suggested fix code but do NOT execute.

```python
# SUGGESTED FIX for revenue currency strings:
df['revenue'] = df['revenue'].str.replace('[$,]', '', regex=True).astype(float)

# SUGGESTED FIX for placeholder missings:
df['status'] = df['status'].replace({'N/A': pd.NA, '-': pd.NA, '': pd.NA})
```

Notes: Observations about the data structure (time series detected, panel structure, likely join keys, etc.)
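A couple of heuristics for those notes, purely as a sketch (treating uniqueness plus completeness as a key signal is an assumption):

```python
# Candidate key / join columns: unique and fully populated.
for col in df.columns:
    if df[col].notna().all() and df[col].is_unique:
        print(f"{col}: candidate key / join column")

# Time-series hint: report the span of any datetime columns.
for col in df.select_dtypes(include="datetime").columns:
    print(f"{col}: datetime from {df[col].min()} to {df[col].max()}")
```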
## Quick Reference

| File type | Load with | Extra checks |
|---|---|---|
| .csv | pd.read_csv() | encoding, delimiter, line endings |
| .tsv | pd.read_csv(sep='\t') | encoding, line endings |
| .xlsx / .xls | pd.read_excel() | sheets, merged cells, header row |
| .json | pd.read_json() | nested structure, normalize if needed |
| .parquet | pd.read_parquet() | just structure check |
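For the .json row's "normalize if needed" note, a hedged sketch of the normalization (the "records" path is a placeholder for wherever the rows actually live):

```python
import json
import pandas as pd

# pd.read_json handles flat records; nested structures usually need json_normalize.
with open("file.json") as f:
    raw = json.load(f)

# "records" is a placeholder record path, not a required key name.
df = pd.json_normalize(raw, record_path="records")
```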
## Common Mistakes
| Mistake | Fix |
|---|---|
| Loading Excel without checking sheets | Always check pd.ExcelFile(f).sheet_names first |
| Assuming column types are correct | Check with df.dtypes — revenue as object means string |
| Ignoring encoding issues | ALWAYS run file and cat -v before loading |
| Auto-fixing without asking | Report + suggest. Never auto-recode values |
| Not checking for duplicate rows | Always run df.duplicated().sum() |