
analyze-jsonl

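Analyzes JSONL data files from the humanizer benchmark to identify errors, length issues, and text quality. Use it when checking data quality, finding failed entries, or validating pipeline output. Supports pre-filtering with --filter to focus on specific issues (api_error, length_short, incomplete, truncated, ok, failed). Note: "failed" excludes api_error and length_short.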

SKILL.md
---
name: analyze-jsonl
description: Analyze JSONL data files from humanizer benchmark to identify errors, length issues, and text quality. Use when checking data quality, finding failed entries, or validating pipeline outputs. Supports pre-filtering with --filter to focus on specific issues (api_error, length_short, incomplete, truncated, ok, failed). Note: 'failed' excludes api_error and length_short.
---

JSONL Data Analyzer

Comprehensive analysis of humanizer benchmark JSONL files for data quality validation.

Quick Start

bash
# Full analysis (auto-detects field, shows all statistics)
python .claude/skills/analyze-jsonl/scripts/analyzer.py data/processed/your_file.jsonl

# Analyze specific field with comparison
python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --text-field humanized
# → Auto-compares text_ai_humanized vs text_ai_base

# Filter to specific issues only
python .claude/skills/analyze-jsonl/scripts/analyzer.py file.jsonl --filter length_short

Core Features

1. Automatic Field Detection & Comparison

The analyzer automatically:

  • Detects which text field to analyze (priority: humanized > ai_base > human)
  • Compares to the previous stage:
    • humanized → compares to ai_base
    • ai_base → compares to human
    • human → no comparison

Example:

bash
# Analyzes humanized field, compares to ai_base
python analyzer.py squad_humanized_stealthgpt.jsonl
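
The detection and comparison rules above can be sketched roughly as follows. This is an illustrative sketch, not the analyzer's actual code: the function names and the "field has text in at least one entry" rule are assumptions; only the field names text_ai_humanized, text_ai_base, and text_human come from this document.

```python
# Hypothetical sketch of priority-based field detection.
# Helper names and the detection rule are assumptions, not analyzer internals.
FIELD_PRIORITY = ["text_ai_humanized", "text_ai_base", "text_human"]

# Each field is compared against the stage that precedes it in the pipeline.
PREVIOUS_STAGE = {
    "text_ai_humanized": "text_ai_base",
    "text_ai_base": "text_human",
    "text_human": None,  # first stage, nothing to compare against
}


def detect_text_field(entries):
    """Return the highest-priority field that holds text in at least one entry."""
    for field in FIELD_PRIORITY:
        if any(entry.get(field) for entry in entries):
            return field
    return None


def comparison_field(field):
    """Return the previous stage's field, or None when there is none."""
    return PREVIOUS_STAGE.get(field)


entries = [
    {"id": "a", "text_human": "Original.", "text_ai_base": "Generated."},
    {"id": "b", "text_human": "Another.", "text_ai_base": "More text."},
]
field = detect_text_field(entries)
print(field, "vs", comparison_field(field))
```

With no humanized text present, the sketch falls back to text_ai_base and pairs it with text_human, mirroring the ai_base → human rule.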

2. Comprehensive Statistics (Without Filtering)

When run without --filter, the analyzer reports on ALL entries, with the following breakdowns:

🚨 Error Summary:

  • Total errors count and percentage
  • Breakdown by type: api_error, length_short, specific exceptions
  • Separate counts for each error category

📏 Length Analysis:

  • Character/word count differences (mean, median, min, max)
  • Change direction: expanded, shortened, unchanged (with percentages)
  • Detailed distribution buckets:
    • Shortened: >50%, 20-50%, 10-20%
    • Unchanged: within ±10%
    • Expanded: 10-20%, 20-50%, 50-100%, >100%
  • Top 10 entries with largest changes
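
As a rough illustration of how a per-entry length change could be assigned to these buckets, here is a minimal sketch. The function names are hypothetical, the change is measured in characters, and how values exactly on a boundary (for example exactly 50%) are assigned is an assumption; the analyzer may draw the lines differently.

```python
def percent_change(before, after):
    """Relative character-length change in percent; `before` must be non-empty."""
    return (len(after) - len(before)) / len(before) * 100.0


def bucket(change):
    """Map a percent change to a distribution bucket (boundary handling assumed)."""
    magnitude = abs(change)
    if magnitude < 10:
        return "unchanged (within ±10%)"
    if change < 0:
        if magnitude < 20:
            return "shortened 10-20%"
        if magnitude <= 50:
            return "shortened 20-50%"
        return "shortened >50%"
    if magnitude < 20:
        return "expanded 10-20%"
    if magnitude < 50:
        return "expanded 20-50%"
    if magnitude <= 100:
        return "expanded 50-100%"
    return "expanded >100%"


# A 40-character text humanized into a 52-character text grew by 30%.
change = percent_change("x" * 40, "x" * 52)
print(round(change, 1), bucket(change))
```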

✍️ Text Completion:

  • Incomplete entries count
  • Ends with punctuation vs mid-sentence
  • Suspicious endings (comma, colon, semicolon)
  • Sample incomplete entries
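
One plausible way to classify endings is sketched below. The punctuation sets, the handling of trailing quotes and brackets, and the category names are all assumptions made for illustration; the analyzer's real heuristics are not documented here.

```python
TERMINAL = (".", "!", "?")     # endings treated as a finished sentence
SUSPICIOUS = (",", ":", ";")   # endings that suggest the text was cut off
CLOSERS = "\"'”’)]"            # trailing quotes/brackets ignored before checking


def classify_ending(text):
    """Classify how a text ends: complete, suspicious, mid-sentence, or empty."""
    stripped = text.rstrip().rstrip(CLOSERS)
    if not stripped:
        return "empty"
    if stripped.endswith(TERMINAL):
        return "complete"
    if stripped.endswith(SUSPICIOUS):
        return "suspicious"
    return "mid-sentence"


for sample in ["It works.", "First, second,", "The model was"]:
    print(repr(sample), "->", classify_ending(sample))
```

Under this sketch, both "suspicious" and "mid-sentence" would count toward the incomplete total.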

📈 Status Breakdown:

  • Count of each status: ok, length_short, api_error, etc.
  • Percentage for each category
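
A status breakdown of this kind is essentially a frequency count. The sketch below assumes each entry carries a status key (a hypothetical schema) and reuses the figures quoted elsewhere in this document (300 entries, 107 length_short, 2 api_error); the 191 ok entries are inferred on the assumption that every remaining entry is ok.

```python
from collections import Counter


def status_breakdown(entries):
    """Return {status: (count, percentage)} ordered by descending count."""
    counts = Counter(entry.get("status", "unknown") for entry in entries)
    total = sum(counts.values())
    return {
        status: (n, 100.0 * n / total)
        for status, n in counts.most_common()
    }


# Synthetic entries matching the counts quoted in this document (ok count assumed).
entries = (
    [{"status": "ok"}] * 191
    + [{"status": "length_short"}] * 107
    + [{"status": "api_error"}] * 2
)

for status, (count, pct) in status_breakdown(entries).items():
    print(f"{status}: {count} ({pct:.1f}%)")
```

Summing length_short and api_error gives 109 of 300 entries, which is the 36.3% error rate quoted in the example output later in this document.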

🔧 Regeneration Commands:

  • Ready-to-use --regen_ids commands for:
    • All failed IDs
    • Length short IDs
    • Incomplete text IDs
    • API error IDs
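
The regeneration argument appears to be a quoted, comma-separated list of entry IDs. A minimal sketch of assembling one is shown below; the helper name, the id and status keys, and the sample IDs are hypothetical.

```python
def regen_ids_argument(entries, predicate):
    """Build a --regen_ids argument for entries matching `predicate`.

    Returns None when nothing matches, so callers can skip the command.
    """
    ids = [entry["id"] for entry in entries if predicate(entry)]
    if not ids:
        return None
    return '--regen_ids "{}"'.format(",".join(ids))


# Made-up IDs for illustration only.
entries = [
    {"id": "squad_0001", "status": "ok"},
    {"id": "squad_0002", "status": "length_short"},
    {"id": "squad_0003", "status": "api_error"},
    {"id": "squad_0004", "status": "length_short"},
]

arg = regen_ids_argument(entries, lambda e: e["status"] == "length_short")
print(arg)
```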

3. Pre-Filtering (Focus on Specific Issues)

Use --filter to analyze ONLY entries matching criteria:

bash
# Only analyze length_short entries
python analyzer.py file.jsonl --filter length_short
# → Shows 107 entries, statistics for just these entries

# Multiple filters (OR logic)
python analyzer.py file.jsonl --filter length_short api_error
# → Shows entries that are EITHER length_short OR api_error

Available Filters:

  • api_error - Status code ≠ 200, or error field not null, or status is api_error
  • length_short - Status is length_short
  • incomplete - Text doesn't end with proper punctuation
  • truncated - The "was truncated" flag is True
  • ok - Status is ok
  • failed - Any non-ok status EXCEPT api_error and length_short (other failures only)
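
To make the filter definitions and the OR logic concrete, here is a sketch of how they might be expressed as predicates. The entry keys (status, status_code, error, was_truncated, text) and the "exception" status are assumptions about the schema, not confirmed field names.

```python
def is_api_error(entry):
    """Status code other than 200, a non-null error, or status api_error."""
    return (
        entry.get("status_code", 200) != 200
        or entry.get("error") is not None
        or entry.get("status") == "api_error"
    )


# Filter name -> predicate over a single entry (hypothetical schema).
FILTERS = {
    "api_error": is_api_error,
    "length_short": lambda e: e.get("status") == "length_short",
    "incomplete": lambda e: not e.get("text", "").rstrip().endswith((".", "!", "?")),
    "truncated": lambda e: e.get("was_truncated") is True,
    "ok": lambda e: e.get("status") == "ok",
    # "failed" deliberately excludes api_error and length_short.
    "failed": lambda e: e.get("status") not in ("ok", "api_error", "length_short"),
}


def apply_filters(entries, names):
    """Keep entries that match ANY of the named filters (OR logic)."""
    predicates = [FILTERS[name] for name in names]
    return [e for e in entries if any(p(e) for p in predicates)]


entries = [
    {"id": "1", "status": "ok", "status_code": 200, "error": None, "text": "Done."},
    {"id": "2", "status": "length_short", "status_code": 200, "error": None, "text": "Short."},
    {"id": "3", "status": "api_error", "status_code": 500, "error": "timeout", "text": ""},
    {"id": "4", "status": "exception", "status_code": 200, "error": None, "text": "Cut off mid"},
]

print([e["id"] for e in apply_filters(entries, ["length_short", "api_error"])])
print([e["id"] for e in apply_filters(entries, ["failed"])])
```

Keeping each filter as an independent predicate makes the OR combination a one-liner and keeps the narrower meaning of failed visible in one place.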

What Changes When Filtering:

  • Report shows: "Filtered entries: 107" vs "Total entries: 300"
  • All statistics computed ONLY on filtered subset
  • Length analysis shows distribution within that subset
  • Regeneration commands provided for the filtered entries

Common Usage Patterns

bash
# 1. After humanization - get full overview
python analyzer.py squad_humanized_stealthgpt.jsonl
# → See: 107 length_short, 2 api_error, overall length change +48.6%

# 2. Investigate length_short issues specifically
python analyzer.py squad_humanized_stealthgpt.jsonl --filter length_short
# → See: Within 107 length_short entries, 72% actually expanded!

# 3. After AI generation - compare to human baseline
python analyzer.py xsum_group_a.jsonl --text-field ai_base
# → Compares text_ai_base vs text_human

# 4. Find only problematic entries
python analyzer.py file.jsonl --filter failed
# → Shows non-ok entries other than api_error and length_short, with regeneration commands

# 5. Analyze human baseline (no comparison)
python analyzer.py xsum_group_h.jsonl --text-field human
# → Shows text quality metrics without comparison

Key Insights from Filtering

Understanding the difference:

bash
# Without filter: See the big picture
python analyzer.py file.jsonl
# Output: "Total errors: 109 (36.3%)"
#         "status:length_short: 107"
#         "status:api_error: 2"
#         "Overall length change: +48.6%"

# With filter: Drill into specific subset
python analyzer.py file.jsonl --filter length_short
# Output: "Filtered entries: 107"
#         "Within these 107 entries:"
#         "  - 72% expanded (got longer!)"
#         "  - 28% shortened"
#         "  - Mean change: +12.8%"

This reveals that many length_short entries actually expanded, but are still marked as short because they did not grow enough relative to the original input.

Typical Workflow

bash
# 1. Get full overview (no filter)
python analyzer.py squad_humanized.jsonl
# See: error counts, overall trends, all statistics

# 2. If issues found, drill into specific category
python analyzer.py squad_humanized.jsonl --filter length_short
# See: detailed stats for just the problematic subset

# 3. Copy regeneration command from output
--regen_ids "squad_570955...,squad_57097697..."

Documentation

📝 GUIDE.md - Report interpretation and workflow

🔧 ADVANCED.md - Filter and comparison features (detailed)

💻 API.md - Python API reference

Examples

See the project root:

  • examples/skills/data_analyzer_demo.py
  • examples/skills/advanced_analyzer_demo.py