Scrape Webpage
Extract content, metadata, and images from a webpage for import/migration.
When to Use This Skill
Use this skill when:
- •You are starting a page import and need to extract content from the source URL
- •You need webpage analysis with local image downloads
- •You want metadata extraction (Open Graph, JSON-LD, etc.)
Invoked by: page-import skill (Step 1)
Prerequisites
Before using this skill, ensure:
- •✅ Node.js is available
- •✅ The playwright npm package is installed (npm install playwright)
- •✅ Chromium browser is installed (npx playwright install chromium)
- •✅ Sharp image library is installed (cd .claude/skills/scrape-webpage/scripts && npm install)
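The package prerequisites above can be checked before running the scraper. The following is a minimal sketch, not part of the skill itself: findMissingModules is a hypothetical helper name, and it only confirms that a package can be resolved by Node.js. It does not verify that the Chromium browser binary has been downloaded.

```javascript
const path = require('node:path');

// Hypothetical helper: returns the subset of `names` that Node.js cannot
// resolve when starting the lookup from `fromDir`.
function findMissingModules(names, fromDir = process.cwd()) {
  const startDir = path.resolve(fromDir);
  return names.filter((name) => {
    try {
      require.resolve(name, { paths: [startDir] });
      return false; // resolvable, so not missing
    } catch (err) {
      return true; // resolution failed, treat as missing
    }
  });
}
```

For example, findMissingModules(['playwright']) checks the current project, and findMissingModules(['sharp'], '.claude/skills/scrape-webpage/scripts') checks the scripts directory where the Sharp install command is run. An empty array means nothing is missing.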
Related Skills
- •page-import - Orchestrator that invokes this skill
- •identify-page-structure - Uses this skill's output (screenshot, HTML, metadata)
- •generate-import-html - Uses image mapping and paths from this skill
Scraping Workflow
Step 1: Run Analysis Script
Command:
bash
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
What the script does:
- •Sets up network interception to capture all images
- •Loads page in headless Chromium
- •Scrolls through entire page to trigger lazy-loaded images
- •Downloads all images locally (converts WebP/AVIF/SVG to PNG)
- •Captures full-page screenshot for visual reference
- •Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
- •Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
- •Extracts cleaned HTML (removes scripts/styles)
- •Replaces image URLs in HTML with local paths (./images/...)
- •Generates document paths (sanitized, lowercase, no .html extension)
- •Saves complete analysis with image mapping to metadata.json
For detailed explanation: See resources/web-page-analysis.md
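To make the document-path step above more concrete (sanitized, lowercase, no .html extension), here is a rough sketch of how such paths could be derived from a URL. It is an illustration under assumptions, not the code in analyze-webpage.js: the function names toDocumentPath and derivePaths, the hyphen replacement rule, and the "index" fallback for the site root are all hypothetical. The derived field names follow the paths object that the script writes to metadata.json.

```javascript
const path = require('node:path');

// Hypothetical sanitizer: lowercase, drop a trailing .html/.htm,
// and replace runs of other characters with a single hyphen.
function toDocumentPath(pageUrl) {
  const { pathname } = new URL(pageUrl);
  const segments = decodeURIComponent(pathname)
    .toLowerCase()
    .replace(/\.html?$/, '')
    .split('/')
    .filter(Boolean)
    .map((segment) => segment.replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, ''))
    .filter(Boolean);
  // Assumption: the site root maps to "/index".
  return '/' + (segments.length ? segments.join('/') : 'index');
}

// Derives the remaining path fields from a document path.
function derivePaths(documentPath) {
  const relative = documentPath.replace(/^\//, '');
  return {
    documentPath,
    htmlFilePath: `${relative}.plain.html`,
    mdFilePath: `${relative}.md`,
    dirPath: path.posix.dirname(relative),
    filename: path.posix.basename(relative),
  };
}
```

Under these assumptions, a source URL ending in /US/en/About%20Us.html would yield the document path /us/en/about-us.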
Step 2: Verify Output
Output files:
- •./import-work/metadata.json - Complete analysis with paths and image mapping
- •./import-work/screenshot.png - Visual reference for layout comparison
- •./import-work/cleaned.html - Main content HTML with local image paths
- •./import-work/images/ - All downloaded images (WebP/AVIF/SVG converted to PNG)
Verify files exist:
bash
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
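When the check needs to run from a Node.js script rather than a shell, the same verification could be sketched as follows. The helper name missingOutputs is hypothetical; the expected file names are the four outputs listed for this step.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// The four outputs this step is expected to produce.
const EXPECTED_OUTPUTS = ['metadata.json', 'screenshot.png', 'cleaned.html', 'images'];

// Hypothetical helper: returns the expected outputs that do not exist in outputDir.
function missingOutputs(outputDir) {
  return EXPECTED_OUTPUTS.filter((name) => !fs.existsSync(path.join(outputDir, name)));
}
```

Calling missingOutputs('./import-work') returns an empty array when all outputs are present.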
Step 3: Review Metadata JSON
Output JSON structure:
json
{
"url": "https://example.com/page",
"timestamp": "2025-01-12T10:30:00.000Z",
"paths": {
"documentPath": "/us/en/about",
"htmlFilePath": "us/en/about.plain.html",
"mdFilePath": "us/en/about.md",
"dirPath": "us/en",
"filename": "about"
},
"screenshot": "./import-work/screenshot.png",
"html": {
"filePath": "./import-work/cleaned.html",
"size": 45230
},
"metadata": {
"title": "Page Title",
"description": "Page description",
"og:image": "https://example.com/image.jpg",
"canonical": "https://example.com/page"
},
"images": {
"count": 15,
"mapping": {
"https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
"https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
},
"stats": {
"total": 15,
"converted": 3,
"skipped": 12,
"failed": 0
}
}
}
Key fields:
- •paths.documentPath - Used for browser preview URL
- •paths.htmlFilePath - Where to save final HTML file
- •images.mapping - Original URLs → local paths
- •metadata - Extracted page metadata
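As an illustration of how images.mapping can be consumed downstream, the sketch below applies the mapping to an HTML string by plain text substitution. This is an assumption about one possible approach, not a description of how analyze-webpage.js rewrites URLs; rewriteImageUrls is a hypothetical name.

```javascript
// Hypothetical helper: replaces every original image URL found in `html`
// with its local path from `mapping` (original URL -> local path).
function rewriteImageUrls(html, mapping) {
  // Process longer URLs first so that a URL which is a prefix of another
  // (for example the same image with a query string) is not replaced too early.
  const urls = Object.keys(mapping).sort((a, b) => b.length - a.length);
  return urls.reduce((out, url) => out.split(url).join(mapping[url]), html);
}
```

With the example mapping above, an img tag pointing at https://example.com/hero.jpg would end up pointing at ./images/a1b2c3d4e5f6.jpg.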
Output
This skill provides:
- •✅ metadata.json with paths, metadata, image mapping
- •✅ screenshot.png for visual reference
- •✅ cleaned.html with local image references
- •✅ images/ folder with all downloaded images
Next step: Pass these outputs to identify-page-structure skill
Troubleshooting
Browser not installed:
bash
npx playwright install chromium
Sharp not installed:
bash
cd .claude/skills/scrape-webpage/scripts && npm install
Image download failures:
- •Check images.stats.failed count in metadata.json
- •Some images may require authentication or be blocked by CORS
- •Failed images will be noted but won't stop the scraping process
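The failed count can also be inspected programmatically. The sketch below takes an already parsed metadata object and returns a one-line summary; describeImageFailures is a hypothetical helper, and reading total as the number of images the script attempted is an assumption based on the example JSON.

```javascript
// Hypothetical helper: summarizes images.stats from a parsed metadata.json object.
function describeImageFailures(metadata) {
  const stats = metadata?.images?.stats;
  if (!stats) return 'images.stats not found in metadata';
  const { total = 0, failed = 0 } = stats;
  return failed === 0
    ? `no failures reported for ${total} images`
    : `${failed} of ${total} images failed to download`;
}
```

A typical call would pass JSON.parse(fs.readFileSync('./import-work/metadata.json', 'utf8')) as the argument, with fs loaded from node:fs.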
Lazy-loaded images not captured:
- •The script scrolls through the page to trigger lazy loading
- •Some advanced lazy-loading implementations may need customization in scripts/analyze-webpage.js
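If the scrolling needs to be customized, one possible shape for a slower, incremental scroll is sketched below. This is not the scroll logic that ships in analyze-webpage.js: autoScroll and its options (stepPx, pauseMs, maxSteps) are hypothetical, and the defaults are arbitrary starting points. It uses only the Playwright page.evaluate and page.waitForTimeout calls.

```javascript
// Hypothetical helper: scrolls a Playwright page down in fixed steps,
// re-measuring the page height each time because lazy-loaded content
// can make the page grow while scrolling.
async function autoScroll(page, { stepPx = 600, pauseMs = 250, maxSteps = 200 } = {}) {
  let y = 0;
  for (let i = 0; i < maxSteps; i++) {
    const height = await page.evaluate(() => document.documentElement.scrollHeight);
    if (y >= height) break; // reached the bottom and the page stopped growing
    y = Math.min(y + stepPx, height);
    await page.evaluate((top) => window.scrollTo(0, top), y);
    await page.waitForTimeout(pauseMs); // give lazy loaders time to fire
  }
  // Return to the top so a later full-page screenshot starts from a known state.
  await page.evaluate(() => window.scrollTo(0, 0));
  return y; // the furthest scroll offset reached
}
```

A smaller stepPx or a longer pauseMs may help with pages that load images only when they are well inside the viewport; maxSteps bounds the loop on pages with infinite scroll.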