Argument Graph Extraction
Given a source (URL or PDF path), extract a structured argument map: a graph of atomic claims and evidence nodes connected by logical edges, grounded in specific text spans.
Inputs
- `SOURCE`: a URL or an absolute path to a PDF file
- `OUTPUT_PATH` (optional): where to write the final JSON (default: `argument_graph.json`)
Step 0 — Ingest the source
Run the ingestion script, redirecting its output to `source.md`:

```shell
uv run --with trafilatura --with pymupdf4llm python3 scripts/ingest.py "$SOURCE" > source.md
```
Progress and errors go to stderr. If the exit code is non-zero or source.md is empty, report the error and stop.
The script wraps each paragraph in <span id="p-N">...</span> so that grounding citations in the extracted graph map directly to stable HTML anchors in the rendered document.
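The wrapping step can be sketched in a few lines, assuming paragraphs in the extracted text are separated by blank lines (the actual logic in `scripts/ingest.py` may split differently):

```python
def wrap_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated paragraph in a stable <span id="p-N"> anchor."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(
        f'<span id="p-{i}">{p}</span>' for i, p in enumerate(paragraphs, start=1)
    )
```

Because ids are assigned in document order, changing how the script splits paragraphs would invalidate the grounding of previously extracted graphs, so the numbering scheme should stay fixed.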
Step 1 — Full extraction (Call 1)
Read source.md in full. Then produce a structured extraction of all claims and edges in a single pass.
Reason through the text section by section in a scratchpad before emitting any JSON. For each section ask: what is the author asserting here, what are they assuming, and what evidence are they offering?
Node types
| type | meaning |
|---|---|
| `thesis` | The central claim the section or text is arguing for |
| `supporting_claim` | A claim that directly supports a thesis or another claim |
| `empirical_finding` | A result from a study, experiment, or dataset cited as evidence |
| `definition` | A concept being formally defined that other claims depend on |
| `assumption` | A premise the argument relies on but does not argue for |
Edge types
| type | meaning |
|---|---|
| `supports` | Source node provides evidence or reasoning for target |
| `contradicts` | Source node is in tension with target |
| `elaborates` | Source node adds detail or nuance to target without adding new support |
| `is_evidence_for` | Source is an empirical finding offered as direct evidence for target |
| `assumes` | Target claim is a prerequisite assumption of source |
| `follows_from` | Source is a logical consequence of target |
Output schema (Call 1)
```json
{
  "reasoning": "<scratchpad: walk through the text section by section>",
  "nodes": [
    {
      "id": "n1",
      "type": "<node type>",
      "statement": "<single assertable sentence>",
      "grounding": {
        "paragraph_id": "<id from the <span> tag in source.md, e.g. 'p-14'>",
        "quote": "<short verbatim or near-verbatim span from that paragraph — mandatory unless the node spans multiple paragraphs>"
      }
    }
  ],
  "edges": [
    {
      "id": "e1",
      "source": "n1",
      "target": "n2",
      "type": "<edge type>",
      "reasoning": "<one sentence justifying this edge>"
    }
  ]
}
```
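Before the critique pass, a quick structural check on the Call 1 output can catch dangling references. This helper is a sketch, not part of the pipeline itself:

```python
def check_graph(graph: dict) -> list[str]:
    """Return structural problems: duplicate node ids and edges whose endpoints are not nodes."""
    node_ids = [n["id"] for n in graph["nodes"]]
    problems = []
    if len(node_ids) != len(set(node_ids)):
        problems.append("duplicate node ids")
    known = set(node_ids)
    for edge in graph["edges"]:
        for end in ("source", "target"):
            if edge[end] not in known:
                problems.append(f"edge {edge['id']}: unknown {end} {edge[end]!r}")
    return problems
```

Run it as `check_graph(json.load(open("graph_v1.json")))` and treat any output as extraction errors worth flagging before Call 2.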
Write this to graph_v1.json.
Step 2 — Critique pass (Call 2)
Read source.md and graph_v1.json.
Act as a critic. Do not rewrite the graph — produce a structured list of proposed changes only. For each issue give a type, a description, and a concrete suggested action.
Issue types to look for
- `redundant`: two nodes express essentially the same claim and should be merged
- `too_abstract`: a node's statement is not grounded in anything specific in the text; it needs re-grounding or splitting
- `missing_assumption`: an edge A→B requires an implicit premise C that is not represented as a node
- `implausible_edge`: an edge is not well supported given the text
- `ungrounded_chunk`: a section of the text has no node grounded in it (possible extraction gap)
Output schema (Call 2)
```json
{
  "reasoning": "<overall assessment of the graph's quality>",
  "issues": [
    {
      "id": "i1",
      "type": "<issue type>",
      "affects": ["n2", "n5"],
      "description": "<what is wrong>",
      "suggested_action": "<what to do: merge n2 into n5 / add node for X / remove edge e3 / etc.>"
    }
  ]
}
```
Write this to critique.json.
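It is also worth verifying that every id the critique cites actually exists in the graph, since the critic can name ids that were never extracted. A sketch:

```python
def invalid_affects(graph: dict, critique: dict) -> list[str]:
    """Ids of critique issues whose 'affects' list names ids missing from the graph."""
    known = {n["id"] for n in graph["nodes"]} | {e["id"] for e in graph["edges"]}
    return [
        issue["id"]
        for issue in critique["issues"]
        if any(ref not in known for ref in issue.get("affects", []))
    ]
```

Issues flagged by this check should be skipped (with that reason) in Call 3 rather than applied blindly.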
Step 3 — Revision (Call 3)
Read graph_v1.json and critique.json.
Apply the suggested changes from the critique. For each issue, either apply it or explicitly skip it with a reason. Produce the final revised graph.
Output schema (Call 3)
```json
{
  "reasoning": "<summary of what was changed and what was skipped and why>",
  "changes_applied": ["i1", "i3"],
  "changes_skipped": [
    { "issue_id": "i2", "reason": "<why this was not applied>" }
  ],
  "nodes": [ ... ],
  "edges": [ ... ]
}
```
Write the final graph to $OUTPUT_PATH (default: argument_graph.json).
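Since the schema requires every issue to be either applied or explicitly skipped, a simple bookkeeping check confirms that Call 3 accounted for all of them. A sketch:

```python
def unaccounted_issues(critique: dict, revision: dict) -> set[str]:
    """Issue ids from the critique that the revision neither applied nor explicitly skipped."""
    all_ids = {issue["id"] for issue in critique["issues"]}
    handled = set(revision["changes_applied"]) | {
        s["issue_id"] for s in revision["changes_skipped"]
    }
    return all_ids - handled
```

An empty return set means the revision is complete; anything else is grounds for re-running Call 3.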
Step 4 — Report
Print a brief summary:
- Number of nodes by type
- Number of edges by type
- Any critique issues that were skipped and why
- Path to the output file
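The counts can be computed mechanically from the final graph rather than asked of the model. A sketch:

```python
from collections import Counter

def summarize(graph: dict) -> dict:
    """Count nodes and edges by type for the final report."""
    return {
        "nodes_by_type": dict(Counter(n["type"] for n in graph["nodes"])),
        "edges_by_type": dict(Counter(e["type"] for e in graph["edges"])),
    }
```

Call it as `summarize(json.load(open("argument_graph.json")))` and print the result alongside the skipped issues and the output path.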
Notes
- Keep `source.md`, `graph_v1.json`, and `critique.json` as intermediate artifacts — they are useful for debugging and for re-running individual steps.
- For texts under ~5000 words, all three LLM calls can use the full `source.md` in context without chunking.
- For longer texts, segment `source.md` by heading before Call 1 and run extraction per segment, then merge before Call 2.
- The `reasoning` field in each call is load-bearing — it forces the model to build up understanding before committing to structured output. Do not omit it or move it after the structured fields.
- Grounding design: `paragraph_id` is the ground truth for viewer navigation — it maps directly to a `<span id="p-N">` anchor in `source.md`. The `quote` field is for human readability and loose correctness checking (if the quote doesn't appear in or near the cited paragraph, the LLM has likely cited the wrong id). A viewer can render `source.md` as HTML and jump to `#p-N` with a single `scrollIntoView` call. The `quote` field is mandatory except when a node's grounding genuinely spans multiple paragraphs, in which case cite the id of the most relevant one and note the spread.
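The loose correctness check on quotes can be sketched as follows, assuming paragraphs in source.md look exactly like `<span id="p-N">...</span>`:

```python
import re

def quote_matches(source_md: str, paragraph_id: str, quote: str) -> bool:
    """Loosely verify that a node's quote appears inside its cited paragraph."""
    match = re.search(
        rf'<span id="{re.escape(paragraph_id)}">(.*?)</span>', source_md, re.DOTALL
    )
    if match is None:
        return False  # the cited paragraph id does not exist at all
    return quote.lower() in match.group(1).lower()
```

A case-insensitive substring test is deliberately loose, but near-verbatim quotes with changed punctuation will still fail it, so treat a mismatch as a flag for manual review rather than a hard error.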