JSONL Data Analyzer

Comprehensive analysis of humanizer benchmark JSONL files for data quality validation.

Quick Start

bash

# Full analysis (auto-detects field, shows all statistics)
python .claude/skills/analyze-jsonl/scripts/analyzer.py data/processed/your_file.jsonl

# Analyze specific field with comparison
python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --text-field humanized
# → Auto-compares text_ai_humanized vs text_ai_base

# Filter to specific issues only
python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --filter length_short

Core Features

1. Automatic Field Detection & Comparison

The analyzer automatically:

•Detects which text field to analyze (priority: humanized > ai_base > human)
•Compares to the previous stage:
  •humanized → compares to ai_base
  •ai_base → compares to human
  •human → no comparison

Example:

bash

# Analyzes humanized field, compares to ai_base
python analyzer.py squad_humanized_stealthgpt.jsonl
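
The detection priority and the stage-to-stage comparison described above can be sketched roughly as follows. This is an illustrative sketch, not the analyzer's actual code: the key names text_ai_humanized, text_ai_base and text_human are taken from the comparison comments in this README, and the helper detect_fields is hypothetical.

```python
from typing import Optional, Tuple

# Priority order from this README: humanized > ai_base > human.
# The exact key names in real records are an assumption.
PRIORITY = ["text_ai_humanized", "text_ai_base", "text_human"]

# Each stage is compared to the one before it; human has no baseline.
PREVIOUS_STAGE = {
    "text_ai_humanized": "text_ai_base",
    "text_ai_base": "text_human",
    "text_human": None,
}


def detect_fields(record: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (field to analyze, field to compare against) for one record."""
    for field in PRIORITY:
        if record.get(field):  # first non-empty field wins
            return field, PREVIOUS_STAGE[field]
    return None, None
```

A record that carries both text_human and text_ai_base would be analyzed on text_ai_base and compared against text_human.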

2. Comprehensive Statistics (Without Filtering)

When run without --filter, reports ALL entries with breakdowns:

🚨 Error Summary:

•Total error count and percentage
•Breakdown by type: api_error, length_short, specific exceptions
•Separate counts for each error category

📏 Length Analysis:

•Character/word count differences (mean, median, min, max)
•Change direction: expanded, shortened, unchanged (with percentages)
•Detailed distribution buckets:
  •Shortened: >50%, 20-50%, 10-20%
  •Changed: <±10%
  •Expanded: 10-20%, 20-50%, 50-100%, >100%
•Top 10 entries with largest changes
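
To make the bucket boundaries concrete, here is a minimal sketch of how a percentage change could be computed and assigned to one of the buckets listed above. The function names are hypothetical, the change is measured in characters, and the handling of exact boundary values (for example, whether 50% falls in 20-50% or 50-100%) is an assumption; the analyzer may draw these lines differently.

```python
from statistics import mean, median


def pct_change(before: str, after: str) -> float:
    """Percentage change in character count from `before` to `after`."""
    if not before:
        raise ValueError("baseline text is empty")
    return (len(after) - len(before)) / len(before) * 100.0


def bucket(change: float) -> str:
    """Map a percentage change to one of the README's distribution buckets."""
    magnitude = abs(change)
    if magnitude < 10:
        return "changed <±10%"
    direction = "expanded" if change > 0 else "shortened"
    if magnitude < 20:
        return f"{direction} 10-20%"
    if magnitude <= 50:  # assumption: exactly 50% counts as 20-50%
        return f"{direction} 20-50%"
    if direction == "shortened":
        return "shortened >50%"
    if magnitude <= 100:  # assumption: exactly 100% counts as 50-100%
        return "expanded 50-100%"
    return "expanded >100%"


def summarize(changes):
    """Mean, median, min and max of a non-empty list of percentage changes."""
    return {
        "mean": mean(changes),
        "median": median(changes),
        "min": min(changes),
        "max": max(changes),
    }
```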

✍️ Text Completion:

•Incomplete entries count
•Ends with punctuation vs mid-sentence
•Suspicious endings (comma, colon, semicolon)
•Sample incomplete entries
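
The completion check can be pictured as a small classifier over the last meaningful character of the text. The sketch below is an assumption about how such a check might work, not the analyzer's implementation; classify_ending and its category names are hypothetical.

```python
TERMINAL = (".", "!", "?")    # treated as a proper sentence ending
SUSPICIOUS = (",", ":", ";")  # endings the README flags as suspicious
CLOSERS = "\"')]"             # trailing quotes/brackets ignored before checking


def classify_ending(text: str) -> str:
    """Classify how a text ends: complete, suspicious, mid_sentence, or empty."""
    stripped = text.rstrip().rstrip(CLOSERS)
    if not stripped:
        return "empty"
    if stripped.endswith(TERMINAL):
        return "complete"
    if stripped.endswith(SUSPICIOUS):
        return "suspicious"
    return "mid_sentence"
```

Under this sketch, both "suspicious" and "mid_sentence" results would count as incomplete entries.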

📈 Status Breakdown:

•Count of each status: ok, length_short, api_error, etc.
•Percentage for each category
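
A status breakdown of this kind is essentially a counter over one field. The following sketch assumes each record stores its status under a "status" key (the key name is an assumption) and returns a count and percentage per status.

```python
from collections import Counter


def status_breakdown(records):
    """Return {status: (count, percentage)} ordered by descending count."""
    counts = Counter(r.get("status", "unknown") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        status: (n, round(100 * n / total, 1))
        for status, n in counts.most_common()
    }
```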

🔧 Regeneration Commands:

•Ready-to-use --regen_ids commands for:
  •All failed IDs
  •Length short IDs
  •Incomplete text IDs
  •API error IDs
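
For illustration, a --regen_ids argument like the ones in the report could be assembled from the matching records as shown below. The "id" key and the helper name regen_ids_argument are assumptions; only the comma-separated, quoted format is taken from the sample output in this README.

```python
def regen_ids_argument(records, predicate):
    """Build a --regen_ids argument from the records matching `predicate`."""
    ids = [r["id"] for r in records if predicate(r)]
    if not ids:
        return ""  # nothing to regenerate
    return '--regen_ids "' + ",".join(ids) + '"'
```

For example, passing `lambda r: r.get("status") == "length_short"` as the predicate would collect only the length-short IDs.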

3. Pre-Filtering (Focus on Specific Issues)

Use --filter to analyze ONLY entries matching criteria:

bash

# Only analyze length_short entries
python analyzer.py file.jsonl --filter length_short
# → Shows 107 entries, statistics for just these entries

# Multiple filters (OR logic)
python analyzer.py file.jsonl --filter length_short api_error
# → Shows entries that are EITHER length_short OR api_error

Available Filters:

•api_error - Status code ≠ 200, or error field not null, or status is api_error
•length_short - Status is length_short
•incomplete - Text doesn't end with proper punctuation
•truncated - The "was truncated" flag is True
•ok - Status is ok
•failed - Any non-ok status EXCEPT api_error and length_short (other failures only)
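
The filter definitions above, combined with OR logic, can be sketched as a table of predicates. This is a rough sketch under stated assumptions: the record keys status, status_code, error and was_truncated are hypothetical names for the fields the list describes, a missing status code is treated as "no error signal", and "proper punctuation" is taken to mean a final period, exclamation mark, or question mark.

```python
TERMINAL_PUNCTUATION = (".", "!", "?")


def is_api_error(record, text_field):
    code = record.get("status_code")  # assumed key name
    return (
        (code is not None and code != 200)
        or record.get("error") is not None
        or record.get("status") == "api_error"
    )


def is_incomplete(record, text_field):
    text = (record.get(text_field) or "").rstrip()
    return not text.endswith(TERMINAL_PUNCTUATION)


def is_other_failure(record, text_field):
    # Non-ok statuses, excluding the two categories that have their own filters.
    return record.get("status") not in ("ok", "api_error", "length_short")


FILTERS = {
    "api_error": is_api_error,
    "length_short": lambda r, f: r.get("status") == "length_short",
    "incomplete": is_incomplete,
    "truncated": lambda r, f: r.get("was_truncated") is True,  # assumed key name
    "ok": lambda r, f: r.get("status") == "ok",
    "failed": is_other_failure,
}


def matches(record, names, text_field="text_ai_humanized"):
    """True if the record satisfies ANY of the named filters (OR logic)."""
    return any(FILTERS[name](record, text_field) for name in names)
```

Because failed excludes api_error and length_short, selecting every non-ok entry under this sketch means naming all three filters together.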

What Changes When Filtering:

•Report shows: "Filtered entries: 107" vs "Total entries: 300"
•All statistics computed ONLY on filtered subset
•Length analysis shows distribution within that subset
•Regeneration commands provided for the filtered entries

Common Usage Patterns

bash

# 1. After humanization - get full overview
python analyzer.py squad_humanized_stealthgpt.jsonl
# → See: 107 length_short, 2 api_error, overall length change +48.6%

# 2. Investigate length_short issues specifically
python analyzer.py squad_humanized_stealthgpt.jsonl --filter length_short
# → See: Within 107 length_short entries, 72% actually expanded!

# 3. After AI generation - compare to human baseline
python analyzer.py xsum_group_a.jsonl --text-field ai_base
# → Compares text_ai_base vs text_human

# 4. Find other failures only
python analyzer.py file.jsonl --filter failed
# → Shows non-ok entries other than api_error and length_short, with regeneration commands

# 5. Analyze human baseline (no comparison)
python analyzer.py xsum_group_h.jsonl --text-field human
# → Shows text quality metrics without comparison

Key Insights from Filtering

Understanding the difference:

bash

# Without filter: See the big picture
python analyzer.py file.jsonl
# Output: "Total errors: 109 (36.3%)"
#         "status:length_short: 107"
#         "status:api_error: 2"
#         "Overall length change: +48.6%"

# With filter: Drill into specific subset
python analyzer.py file.jsonl --filter length_short
# Output: "Filtered entries: 107"
#         "Within these 107 entries:"
#         "  - 72% expanded (got longer!)"
#         "  - 28% shortened"
#         "  - Mean change: +12.8%"

This reveals that many length_short entries actually expanded but were still marked as short, because they did not grow enough relative to the original input.

Typical Workflow

bash

# 1. Get full overview (no filter)
python analyzer.py squad_humanized.jsonl
# See: error counts, overall trends, all statistics

# 2. If issues found, drill into specific category
python analyzer.py squad_humanized.jsonl --filter length_short
# See: detailed stats for just the problematic subset

# 3. Copy the regeneration command from the output
--regen_ids "squad_570955...,squad_57097697..."

Documentation

📝 GUIDE.md - Report interpretation and workflow

🔧 ADVANCED.md - Filter and comparison features (detailed)

💻 API.md - Python API reference

Examples

See project root:

•examples/skills/data_analyzer_demo.py
•examples/skills/advanced_analyzer_demo.py