PDF to LLM Converter
Convert large PDF documents into structured formats optimized for language model processing.
Features
- Document metadata extraction: Title, entity, date period, type
- Section detection: Identifies document sections and headers
- Table extraction: Preserves table structures with headers and rows
- Key figure extraction: Extracts numerical values with labels and units
- Multiple output formats: JSON, JSONL, Markdown
Usage
```
/pdf-to-llm <path-or-url> [--format <format>] [--mode <mode>] [--pages <pages>]
```
Arguments
- `path-or-url`: Local file path or URL to the PDF document
- `--format`: Output format: `json` (default), `jsonl`, or `markdown`
- `--mode`: Processing mode:
  - `auto` (default): Uses LLM if API key available, otherwise basic
  - `llm`: Requires `ANTHROPIC_API_KEY` for intelligent extraction
  - `basic`: No API key needed, uses pattern-based extraction
- `--pages`: Specific pages to process (e.g., "1,2,10-20,50")
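
The `--pages` value mixes single pages and inclusive ranges. As a minimal sketch of how such a spec could be expanded, assuming 1-based page numbers and inclusive ranges (the helper name `parse_pages` is hypothetical and not taken from `scripts/pdf_to_llm.py`):

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a spec such as "1,2,10-20,50" into sorted, unique page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            # Inclusive range, e.g. "10-12" -> 10, 11, 12
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are assumed to be 1-based")
    return sorted(pages)


print(parse_pages("1,2,10-12,50"))  # [1, 2, 10, 11, 12, 50]
```

How the actual script treats duplicates, whitespace, or out-of-range pages is not documented here, so those choices are assumptions of this sketch.
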
Instructions
When the user invokes this skill:
- Run the enhanced processing script:

  ```bash
  python scripts/pdf_to_llm.py --file <path> --output output/<name>.json --mode auto
  ```

  Or for URLs:

  ```bash
  python scripts/pdf_to_llm.py --url <url> --output output/<name>.md --format markdown
  ```

- For LLM mode, ensure ANTHROPIC_API_KEY is set:

  ```bash
  export ANTHROPIC_API_KEY=your-key-here
  python scripts/pdf_to_llm.py --file <path> --mode llm --output result.json
  ```

- Process specific pages for faster testing:

  ```bash
  python scripts/pdf_to_llm.py --file <path> --pages "1,2,10-15" --output sample.json
  ```
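
The choice between `--file` and `--url` in the commands above can be made mechanically. The following is a rough sketch of a command builder, assuming that a source starting with `http://` or `https://` is a URL and that the optional flags can be combined freely. The function name `build_command` is hypothetical; only the flags themselves come from the examples above.

```python
from typing import List, Optional


def build_command(
    source: str,
    output: str,
    mode: Optional[str] = None,
    fmt: Optional[str] = None,
    pages: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for scripts/pdf_to_llm.py."""
    # Assumption: anything with an http(s) scheme is a URL, everything else a local path.
    is_url = source.startswith(("http://", "https://"))
    command = ["python", "scripts/pdf_to_llm.py"]
    command += ["--url" if is_url else "--file", source]
    command += ["--output", output]
    if mode:
        command += ["--mode", mode]
    if fmt:
        command += ["--format", fmt]
    if pages:
        command += ["--pages", pages]
    return command


print(build_command("report.pdf", "output/report.json", mode="auto"))
```

The resulting list could be passed to `subprocess.run(command, check=True)`. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.
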
Output Structure
JSON
```json
{
  "metadata": {
    "title": "LATAM Airlines Group Financial Statements",
    "document_type": "financial_report",
    "entity": "LATAM Airlines Group S.A.",
    "date_period": "December 31, 2025",
    "total_pages": 149,
    "source_file": "financial-statements.pdf",
    "processed_at": "2025-02-06T...",
    "executive_summary": "Consolidated financial statements showing...",
    "key_sections": ["Balance Sheet", "Income Statement", "Notes"]
  },
  "pages": [
    {
      "page_number": 1,
      "section": "Balance Sheet",
      "summary": "Total assets of $17.6 billion...",
      "key_figures": [
        {"label": "Total Assets", "value": "17,640,891", "unit": "ThUS$", "context": "As of Dec 31, 2025"}
      ],
      "tables": [
        {
          "title": "Assets",
          "headers": ["Item", "Note", "2025", "2024"],
          "rows": [["Cash", "6-7", "2,150,113", "1,957,788"]],
          "context": "Current and non-current assets"
        }
      ],
      "raw_text": "..."
    }
  ]
}
```
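
Key figure values in this structure are strings with thousands separators, so a consumer has to convert them before doing arithmetic. Below is a minimal sketch of reading the JSON output and normalizing those values. It assumes values are comma-grouped numbers as in the example; the helpers `to_number` and `iter_key_figures` are hypothetical, and other notations (such as parentheses for negatives) are not handled.

```python
import json
from decimal import Decimal

# Trimmed copy of the JSON example; a real run would load the --output file instead.
SAMPLE = """
{
  "metadata": {"title": "LATAM Airlines Group Financial Statements", "total_pages": 149},
  "pages": [
    {
      "page_number": 1,
      "section": "Balance Sheet",
      "key_figures": [
        {"label": "Total Assets", "value": "17,640,891", "unit": "ThUS$",
         "context": "As of Dec 31, 2025"}
      ]
    }
  ]
}
"""


def to_number(value: str) -> Decimal:
    """Convert a comma-grouped string such as "17,640,891" to a Decimal."""
    return Decimal(value.replace(",", ""))


def iter_key_figures(document: dict):
    """Yield (page_number, label, numeric value, unit) for every key figure."""
    for page in document.get("pages", []):
        for figure in page.get("key_figures", []):
            yield (
                page["page_number"],
                figure["label"],
                to_number(figure["value"]),
                figure["unit"],
            )


document = json.loads(SAMPLE)
figures = list(iter_key_figures(document))
for page_number, label, value, unit in figures:
    print(f"page {page_number}: {label} = {value} {unit}")
```

`Decimal` is used instead of `float` so that financial amounts are not subject to binary rounding.
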
JSONL
```jsonl
{"type": "metadata", "title": "...", "entity": "...", ...}
{"type": "page", "page_number": 1, "section": "...", "summary": "...", ...}
{"type": "page", "page_number": 2, "section": "...", "summary": "...", ...}
```
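
Because each JSONL line is an independent record tagged with `type`, the output can be consumed as a stream without loading the whole document. Here is a minimal sketch of such a reader, assuming exactly the two record types shown (`metadata` and `page`). The function name `read_jsonl` and the sample records are illustrative only.

```python
import io
import json

# Illustrative records; real output lines carry more fields.
SAMPLE = """\
{"type": "metadata", "title": "LATAM Airlines Group Financial Statements"}
{"type": "page", "page_number": 1, "section": "Balance Sheet"}
{"type": "page", "page_number": 2, "section": "Income Statement"}
"""


def read_jsonl(stream):
    """Split a JSONL stream into (metadata record, list of page records)."""
    metadata = None
    pages = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("type") == "metadata":
            metadata = record
        elif record.get("type") == "page":
            pages.append(record)
    return metadata, pages


metadata, pages = read_jsonl(io.StringIO(SAMPLE))
print(metadata["title"], [page["page_number"] for page in pages])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, which is also iterated line by line.
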
Markdown
```markdown
---
title: "LATAM Airlines Group Financial Statements"
entity: "LATAM Airlines Group S.A."
document_type: financial_report
period: "December 31, 2025"
total_pages: 149
---

# Executive Summary

Consolidated financial statements showing...

---

# Balance Sheet

## Page 1

Summary of page content...

### Key Figures

- **Total Assets**: 17,640,891 ThUS$ (As of Dec 31, 2025)

### Assets

| Item | Note | 2025 | 2024 |
|------|------|------|------|
| Cash | 6-7 | 2,150,113 | 1,957,788 |
```
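
The Markdown table corresponds directly to a `tables` entry in the JSON output: `title` becomes a heading, `headers` the header row, and `rows` the body. A sketch of that mapping follows, assuming all cells are strings. `table_to_markdown` is a hypothetical helper, not necessarily how the script renders tables.

```python
def table_to_markdown(table: dict) -> str:
    """Render a {"title", "headers", "rows"} table as a Markdown heading plus pipe table."""

    def escape(cell: str) -> str:
        # A literal pipe inside a cell would otherwise split the column.
        return cell.replace("|", "\\|")

    headers = [escape(header) for header in table["headers"]]
    lines = [f"### {table['title']}", ""]
    lines.append("| " + " | ".join(headers) + " |")
    lines.append("|" + "|".join("------" for _ in headers) + "|")
    for row in table["rows"]:
        lines.append("| " + " | ".join(escape(cell) for cell in row) + " |")
    return "\n".join(lines)


assets = {
    "title": "Assets",
    "headers": ["Item", "Note", "2025", "2024"],
    "rows": [["Cash", "6-7", "2,150,113", "1,957,788"]],
}
print(table_to_markdown(assets))
```

For the `Assets` table this prints the same heading and three table lines as the Markdown example.
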
Dependencies
- Python 3.8+
- `pdftotext` (poppler-utils)
- `anthropic` (optional, for LLM mode)
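
Before running the script, it can help to confirm that these dependencies are present. A minimal sketch of such a check, using only the standard library (the function name `check_dependencies` and the shape of the returned dictionary are assumptions of this example):

```python
import importlib.util
import shutil
import sys


def check_dependencies() -> dict:
    """Report which of the listed dependencies are available in this environment."""
    return {
        # Interpreter version requirement from the list above.
        "python_3_8_plus": sys.version_info >= (3, 8),
        # pdftotext is an external binary (poppler-utils), so look it up on PATH.
        "pdftotext": shutil.which("pdftotext") is not None,
        # anthropic is optional; it is only needed for LLM mode.
        "anthropic": importlib.util.find_spec("anthropic") is not None,
    }


status = check_dependencies()
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

A missing `anthropic` package is not an error on its own, since basic mode does not need it.
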
Environment Variables
- `ANTHROPIC_API_KEY`: Required for LLM mode, enables intelligent extraction
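
The interaction between `--mode` and `ANTHROPIC_API_KEY` described in this document can be summarized as a small decision function. This is a sketch under stated assumptions: `resolve_mode` is a hypothetical name, an empty key is treated as absent, and raising an error when `llm` is requested without a key is a guess at reasonable behavior rather than documented behavior of the script.

```python
import os
from typing import Mapping, Optional


def resolve_mode(requested: str = "auto", env: Optional[Mapping[str, str]] = None) -> str:
    """Return the effective processing mode: "llm" or "basic"."""
    env = os.environ if env is None else env
    has_key = bool(env.get("ANTHROPIC_API_KEY"))  # empty string counts as missing
    if requested == "auto":
        # auto: use the LLM when a key is available, otherwise fall back to basic
        return "llm" if has_key else "basic"
    if requested == "llm":
        if not has_key:
            raise RuntimeError("llm mode requires ANTHROPIC_API_KEY to be set")
        return "llm"
    if requested == "basic":
        return "basic"
    raise ValueError(f"unknown mode: {requested!r}")


print(resolve_mode("auto", env={}))                                # basic
print(resolve_mode("auto", env={"ANTHROPIC_API_KEY": "example"}))  # llm
```

Accepting `env` as a parameter keeps the function testable without modifying the real process environment.
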