Data Extraction Skill

Use this guide to collect structured data from dynamic pages with predictable output quality.

Extraction Planning

Define the schema before interacting with the page:

• Define output fields and required keys.
• Identify page regions that contain those fields.
• Capture a fresh snapshot and map refs to schema fields.

opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
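
As an illustration of defining the schema first, the sketch below keeps the output fields and their required flags in one place and reports which required keys a record lacks. It assumes the host script is written in Python; the field names (`title`, `price`, `url`, `rating`) and the helper `missing_required` are hypothetical and not part of opendevbrowser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One output field in the extraction schema (hypothetical helper)."""
    name: str
    required: bool = True


# Illustrative schema: these field names are assumptions for the example.
SCHEMA = (
    Field("title"),
    Field("price"),
    Field("url"),
    Field("rating", required=False),
)


def missing_required(record: dict) -> list:
    """Return the names of required fields that are absent or None."""
    return [f.name for f in SCHEMA if f.required and record.get(f.name) is None]
```

A record can then be checked right after extraction, before any further page interaction, so schema gaps surface early.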

Table Extraction

For semantic HTML tables:

• Wait for the table to become visible.
• Snapshot and identify table/container refs.
• Extract the targeted table HTML.
• Parse rows and cells in the host script.

opendevbrowser_wait sessionId="<session-id>" until="networkidle"
opendevbrowser_dom_get_html sessionId="<session-id>" ref="<table-ref>"

For virtualized or grid UIs, which render only the rows currently in view, extract refs per row or card and normalize the results in post-processing.
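
For semantic tables, the host-script parsing step might look like the sketch below. It assumes a Python host script and that the HTML returned by `opendevbrowser_dom_get_html` is available as a string; `TableParser` and `parse_table` are hypothetical names. Only the standard library `html.parser` module is used, and the first row is treated as the header row.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html: str) -> list:
    """Turn table HTML into a list of dicts keyed by the first row's cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

This sketch does not handle nested tables or `colspan`/`rowspan`; pages that use them need a fuller parser.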

List and Card Extraction

For repeated list/card content:

• Snapshot and identify repeating item refs.
• Extract only the nodes needed per item (title, price, meta, url).
• Normalize records to a stable schema.

opendevbrowser_dom_get_text sessionId="<session-id>" ref="<item-title-ref>"
opendevbrowser_get_attr sessionId="<session-id>" ref="<item-link-ref>" name="href"
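
Normalizing to a stable schema could be sketched as follows. It assumes a Python host script in which each raw item is a dict of the strings returned by the text and attribute calls above; the keys `title`, `price`, and `href` and the function names are hypothetical. Prices are assumed to use a comma as thousands separator and a dot as decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional
from urllib.parse import urljoin


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Pull the first number out of a price string; None if there is none."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


def normalize_item(raw: dict, base_url: str) -> dict:
    """Map one raw item (strings or None) onto a fixed set of keys."""
    title = (raw.get("title") or "").strip() or None
    href = raw.get("href")
    return {
        "title": title,
        "price": parse_price(raw.get("price")),
        # Resolve relative links against the page URL.
        "url": urljoin(base_url, href) if href else None,
    }
```

Every output record carries the same three keys, with `None` marking a value that could not be extracted.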

Pagination Patterns

Numbered or Next/Previous Pagination

• Extract the current page's records.
• Click the next or page-number ref.
• Wait for the page to load.
• Re-snapshot and repeat until a terminal state, such as a missing or disabled next control.

opendevbrowser_click sessionId="<session-id>" ref="<next-ref>"
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
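
The loop above can be sketched in a Python host script as shown below. The three callables are hypothetical wrappers around the tool calls (extract, re-snapshot and locate the next ref, click and wait); how they invoke opendevbrowser is left to the caller. The `max_pages` cap is an added safety assumption, not something the tool requires.

```python
from typing import Callable, Optional


def paginate(
    extract_page: Callable[[], list],
    find_next_ref: Callable[[], Optional[str]],
    click_and_wait: Callable[[str], None],
    max_pages: int = 100,
) -> list:
    """Collect records page by page until no next control remains."""
    records = []
    for page_number in range(1, max_pages + 1):
        for record in extract_page():
            # Tag each record with the page it came from.
            records.append({**record, "page": page_number})
        next_ref = find_next_ref()  # expected to come from a fresh snapshot
        if next_ref is None:        # terminal state: nothing left to click
            break
        click_and_wait(next_ref)
    return records
```

Resolving the next ref from a fresh snapshot on every iteration avoids clicking a ref that belonged to the previous page.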

Infinite Scroll

• Extract visible records.
• Scroll incrementally.
• Wait for newly loaded items.
• Stop when no new unique records appear.

opendevbrowser_scroll sessionId="<session-id>" dy=1000
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
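
The stop condition can be made concrete with a set of seen keys. In the sketch below, written for a Python host script, `extract_visible` and `scroll_and_wait` are hypothetical wrappers around the tool calls. The `patience` parameter (how many consecutive rounds without new records to tolerate) and the `max_rounds` cap are assumptions added for robustness against slow loads.

```python
from typing import Callable, Hashable


def collect_until_stable(
    extract_visible: Callable[[], list],
    scroll_and_wait: Callable[[], None],
    key: Callable[[dict], Hashable],
    patience: int = 2,
    max_rounds: int = 200,
) -> list:
    """Scroll and collect until `patience` rounds in a row add nothing new."""
    seen = {}  # key -> first record seen with that key, in insertion order
    idle_rounds = 0
    for _ in range(max_rounds):
        before = len(seen)
        for record in extract_visible():
            seen.setdefault(key(record), record)
        idle_rounds = idle_rounds + 1 if len(seen) == before else 0
        if idle_rounds >= patience:
            break
        scroll_and_wait()
    return list(seen.values())
```

Passing a click-and-wait callable in place of the scroll callable gives the same loop shape for a load-more button.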

Load More Button

• Extract visible records.
• Click the load-more ref.
• Wait and re-snapshot.
• Repeat until the button disappears or no new data arrives.

Structured Data Shortcuts

When available, prefer embedded structured data:

• JSON-LD scripts
• Microdata attributes (itemscope, itemprop)

opendevbrowser_dom_get_text sessionId="<session-id>" ref="<json-ld-script-ref>"

Parse JSON-LD in the host script and merge with extracted UI records if needed.
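
A minimal sketch of that parsing and merging step, assuming a Python host script and that the script body arrives as a string. `parse_json_ld`, `merge_record`, and the field mapping are hypothetical; the merge policy shown (UI values win, JSON-LD fills gaps) is one reasonable choice and can be inverted.

```python
import json


def parse_json_ld(script_text: str) -> list:
    """Return every JSON-LD object in a script body as a flat list of dicts."""
    try:
        data = json.loads(script_text)
    except json.JSONDecodeError:
        return []
    # A script may hold one object or an array of objects.
    nodes = data if isinstance(data, list) else [data]
    flat = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        graph = node.get("@graph")
        if isinstance(graph, list):
            # Unwrap "@graph" containers into their member nodes.
            flat.extend(n for n in graph if isinstance(n, dict))
        else:
            flat.append(node)
    return flat


def merge_record(ui_record: dict, ld_node: dict, mapping: dict) -> dict:
    """Fill fields the UI record lacks from JSON-LD; UI values win when present."""
    merged = dict(ui_record)
    for field, ld_key in mapping.items():
        if merged.get(field) is None and ld_node.get(ld_key) is not None:
            merged[field] = ld_node[ld_key]
    return merged
```

The `mapping` argument pairs each schema field with the JSON-LD property that supplies it, for example `{"title": "name"}`.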

Quality Controls

Apply quality checks during extraction:

• Deduplicate by a stable key (URL, ID, or composite key).
• Track page number and source URL per record.
• Record null or missing fields explicitly.
• Validate record counts per page before continuing.
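
The checks above might be implemented along these lines in a Python host script. All function names are hypothetical, and the minimum record count per page is an assumption the caller supplies based on what the site normally shows.

```python
def dedupe(records, key_fields):
    """Keep the first record for each composite key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def tag_source(record, page_number, source_url):
    """Return a copy of the record annotated with where it was found."""
    return {**record, "page": page_number, "source_url": source_url}


def with_explicit_nulls(record, fields):
    """Return a copy in which every schema field is present, None if missing."""
    return {**record, **{f: None for f in fields if f not in record}}


def check_page_count(records, expected_min):
    """Raise if a page yielded fewer records than expected."""
    if len(records) < expected_min:
        raise ValueError(
            f"expected at least {expected_min} records, got {len(records)}"
        )
```

Raising on a short page stops the run before a layout change or a failed load silently truncates the dataset.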

Use opendevbrowser_network_poll when extraction depends on an API request completing before the data is present on the page.

opendevbrowser_network_poll sessionId="<session-id>" max=50
