JSONL Data Analyzer
Comprehensive analysis of humanizer benchmark JSONL files for data quality validation.
Quick Start
    # Full analysis (auto-detects field, shows all statistics)
    python .claude/skills/analyze-jsonl/scripts/analyzer.py data/processed/your_file.jsonl

    # Analyze specific field with comparison
    python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --text-field humanized
    # → Auto-compares text_ai_humanized vs text_ai_base

    # Filter to specific issues only
    python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --filter length_short
Core Features
1. Automatic Field Detection & Comparison
The analyzer automatically:
- Detects which text field to analyze (priority: humanized > ai_base > human)
- Compares to the previous stage:
  - humanized → compares to ai_base
  - ai_base → compares to human
  - human → no comparison
Example:
    # Analyzes humanized field, compares to ai_base
    python analyzer.py squad_humanized_stealthgpt.jsonl
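The detection and comparison rules above can be sketched roughly as follows. This is a minimal illustration, not the analyzer's actual code: the key names text_ai_humanized, text_ai_base and text_human come from this document, while the helper names and the "first non-empty field wins" rule are assumptions.

```python
# Hypothetical sketch of field detection; helper names are illustrative only.
FIELD_PRIORITY = ["humanized", "ai_base", "human"]
PREVIOUS_STAGE = {"humanized": "ai_base", "ai_base": "human", "human": None}
FIELD_KEYS = {
    "humanized": "text_ai_humanized",
    "ai_base": "text_ai_base",
    "human": "text_human",
}


def detect_field(record):
    """Return the highest-priority field that is present and non-empty (assumed rule)."""
    for name in FIELD_PRIORITY:
        if record.get(FIELD_KEYS[name]):
            return name
    return None


def comparison_field(name):
    """Return the previous stage to compare against, or None for the human baseline."""
    return PREVIOUS_STAGE.get(name)


record = {"text_human": "Original text.", "text_ai_base": "Generated text."}
field = detect_field(record)
print(field, "->", comparison_field(field))
```

A record that carries only text_human is analyzed on its own, since its comparison target is None.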
2. Comprehensive Statistics (Without Filtering)
When run without --filter, the analyzer reports on ALL entries, with the following breakdowns:
🚨 Error Summary:
- Total errors count and percentage
- Breakdown by type: api_error, length_short, specific exceptions
- Separate counts for each error category
📏 Length Analysis:
- Character/word count differences (mean, median, min, max)
- Change direction: expanded, shortened, unchanged (with percentages)
- Detailed distribution buckets:
  - Shortened: >50%, 20-50%, 10-20%
  - Changed: <±10%
  - Expanded: 10-20%, 20-50%, 50-100%, >100%
- Top 10 entries with largest changes
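As an illustration of the distribution buckets, here is a small sketch that maps a percentage length change to one of the buckets listed above. The function names and the handling of exact boundary values (for example, whether exactly 20% falls in 10-20% or 20-50%) are assumptions; the analyzer may draw the lines differently.

```python
def percent_change(base_text, new_text):
    """Character-length change of new_text relative to base_text, in percent."""
    if not base_text:
        raise ValueError("base_text must be non-empty")
    return (len(new_text) - len(base_text)) / len(base_text) * 100.0


def bucket(pct):
    """Assign a percentage change to a distribution bucket (boundaries are assumed)."""
    if pct < -50:
        return "shortened >50%"
    if pct < -20:
        return "shortened 20-50%"
    if pct < -10:
        return "shortened 10-20%"
    if pct <= 10:
        return "changed <±10%"
    if pct <= 20:
        return "expanded 10-20%"
    if pct <= 50:
        return "expanded 20-50%"
    if pct <= 100:
        return "expanded 50-100%"
    return "expanded >100%"


change = percent_change("abcd", "abcdef")
print(change, bucket(change))
```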
✍️ Text Completion:
- Incomplete entries count
- Ends with punctuation vs mid-sentence
- Suspicious endings (comma, colon, semicolon)
- Sample incomplete entries
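A completion check along these lines could look like the sketch below. It is an assumption-laden illustration: the set of terminal punctuation marks, the stripping of trailing quotes and brackets, and the label names are all hypothetical choices, not the analyzer's documented behavior.

```python
TERMINAL = (".", "!", "?")
SUSPICIOUS = (",", ":", ";")
CLOSERS = "\"')]”’"  # trailing quotes/brackets ignored before the check (assumed)


def classify_ending(text):
    """Label how a text ends: complete, suspicious, mid_sentence, or empty."""
    stripped = text.rstrip()
    if not stripped:
        return "empty"
    core = stripped.rstrip(CLOSERS) or stripped
    last = core[-1]
    if last in TERMINAL:
        return "complete"
    if last in SUSPICIOUS:
        return "suspicious"
    return "mid_sentence"


for sample in ["Done.", 'He said "stop."', "First, ", "The model was"]:
    print(repr(sample), classify_ending(sample))
```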
📈 Status Breakdown:
- Count of each status: ok, length_short, api_error, etc.
- Percentage for each category
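For readers who want to reproduce the status breakdown outside the analyzer, a minimal sketch follows. It assumes each JSONL line is an object with a `status` key; that key name and the function name are assumptions made for illustration.

```python
import json
from collections import Counter


def status_breakdown(lines):
    """Map each status to (count, percentage) over the non-blank JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    counts = Counter(r.get("status", "unknown") for r in records)
    total = len(records)
    return {
        status: (n, round(100.0 * n / total, 1))
        for status, n in counts.most_common()
    }


sample_lines = [
    '{"status": "ok"}',
    '{"status": "ok"}',
    '{"status": "length_short"}',
    "",
    '{"status": "api_error"}',
]
print(status_breakdown(sample_lines))
```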
🔧 Regeneration Commands:
- Ready-to-use --regen_ids commands for:
  - All failed IDs
  - Length short IDs
  - Incomplete text IDs
  - API error IDs
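The --regen_ids argument takes a quoted, comma-separated list of IDs. A sketch of how such an argument might be assembled from records is shown here; the `id` and `status` key names and the helper name are assumptions, and the IDs in the example are made up.

```python
def regen_ids_argument(records, predicate):
    """Build a --regen_ids argument from the records matching predicate."""
    ids = [r["id"] for r in records if predicate(r)]
    if not ids:
        return ""
    return '--regen_ids "' + ",".join(ids) + '"'


records = [
    {"id": "example_001", "status": "ok"},
    {"id": "example_002", "status": "length_short"},
    {"id": "example_003", "status": "api_error"},
]
print(regen_ids_argument(records, lambda r: r["status"] != "ok"))
```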
3. Pre-Filtering (Focus on Specific Issues)
Use --filter to analyze ONLY entries matching criteria:
    # Only analyze length_short entries
    python analyzer.py file.jsonl --filter length_short
    # → Shows 107 entries, statistics for just these entries

    # Multiple filters (OR logic)
    python analyzer.py file.jsonl --filter length_short api_error
    # → Shows entries that are EITHER length_short OR api_error
Available Filters:
- api_error - Status code ≠ 200, or error field not null, or status is api_error
- length_short - Status is length_short
- incomplete - Text doesn't end with proper punctuation
- truncated - The was-truncated flag is True
- ok - Status is ok
- failed - Any non-ok status EXCEPT api_error and length_short (other failures only)
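The filter definitions and their OR combination can be sketched as a table of predicates. This is only an illustration under stated assumptions: the key names `status`, `status_code`, `error` and `was_truncated` are guesses, a missing status code is treated as not an error, and "proper punctuation" is taken to mean a period, exclamation mark, or question mark.

```python
def _is_incomplete(record, text_key="text_ai_humanized"):
    # Assumed rule: text must end with ., ! or ? after trailing whitespace is removed.
    return not record.get(text_key, "").rstrip().endswith((".", "!", "?"))


FILTERS = {
    "api_error": lambda r: (
        r.get("status_code") not in (None, 200)
        or r.get("error") is not None
        or r.get("status") == "api_error"
    ),
    "length_short": lambda r: r.get("status") == "length_short",
    "incomplete": _is_incomplete,
    "truncated": lambda r: r.get("was_truncated") is True,
    "ok": lambda r: r.get("status") == "ok",
    # Other failures only: excludes ok, api_error and length_short.
    "failed": lambda r: r.get("status") not in ("ok", "api_error", "length_short"),
}


def apply_filters(records, names):
    """Keep records matching ANY of the named filters (OR logic)."""
    return [r for r in records if any(FILTERS[n](r) for n in names)]


records = [
    {"id": "a", "status": "ok", "status_code": 200, "error": None, "text_ai_humanized": "Fine."},
    {"id": "b", "status": "length_short", "status_code": 200, "error": None, "text_ai_humanized": "Short"},
    {"id": "c", "status": "api_error", "status_code": 500, "error": "timeout", "text_ai_humanized": ""},
    {"id": "d", "status": "exception", "status_code": 200, "error": None, "text_ai_humanized": "Oops."},
]
print([r["id"] for r in apply_filters(records, ["length_short", "api_error"])])
```

Under these assumptions, record d is the only one matched by failed, because its status is neither ok nor one of the two categories that have their own filters.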
What Changes When Filtering:
- Report shows: "Filtered entries: 107" vs "Total entries: 300"
- All statistics computed ONLY on filtered subset
- Length analysis shows distribution within that subset
- Regeneration commands provided for the filtered entries
Common Usage Patterns
    # 1. After humanization - get full overview
    python analyzer.py squad_humanized_stealthgpt.jsonl
    # → See: 107 length_short, 2 api_error, overall length changes +48.6%

    # 2. Investigate length_short issues specifically
    python analyzer.py squad_humanized_stealthgpt.jsonl --filter length_short
    # → See: Within 107 length_short entries, 72% actually expanded!

    # 3. After AI generation - compare to human baseline
    python analyzer.py xsum_group_a.jsonl --text-field ai_base
    # → Compares text_ai_base vs text_human

    # 4. Find other failed entries
    python analyzer.py file.jsonl --filter failed
    # → Shows non-ok entries other than api_error and length_short, with regeneration commands

    # 5. Analyze human baseline (no comparison)
    python analyzer.py xsum_group_h.jsonl --text-field human
    # → Shows text quality metrics without comparison
Key Insights from Filtering
Understanding the difference:
    # Without filter: See the big picture
    python analyzer.py file.jsonl
    # Output: "Total errors: 109 (36.3%)"
    #         "status:length_short: 107"
    #         "status:api_error: 2"
    #         "Overall length change: +48.6%"

    # With filter: Drill into specific subset
    python analyzer.py file.jsonl --filter length_short
    # Output: "Filtered entries: 107"
    #         "Within these 107 entries:"
    #         "  - 72% expanded (got longer!)"
    #         "  - 28% shortened"
    #         "  - Mean change: +12.8%"
This reveals: Many length_short entries actually expanded, but are still marked as short because they did not grow enough relative to the original input.
Typical Workflow
    # 1. Get full overview (no filter)
    python analyzer.py squad_humanized.jsonl
    # See: error counts, overall trends, all statistics

    # 2. If issues found, drill into specific category
    python analyzer.py squad_humanized.jsonl --filter length_short
    # See: detailed stats for just the problematic subset

    # 3. Copy regeneration command from output
    --regen_ids "squad_570955...,squad_57097697..."
Documentation
📝 GUIDE.md - Report interpretation and workflow
🔧 ADVANCED.md - Filter and comparison features (detailed)
💻 API.md - Python API reference
Examples
See project root:
- examples/skills/data_analyzer_demo.py
- examples/skills/advanced_analyzer_demo.py