RAG Content Chunker Skill

Description

This actor splits text and Markdown into optimally sized, token-counted chunks for RAG pipelines. It supports multiple chunking strategies (recursive, Markdown-aware, sentence-based), produces deterministic chunk IDs for incremental vector DB updates, and outputs flat chunk objects ready for embedding. Designed to slot between crawlers (e.g., Website Content Crawler) and vector store integrations (e.g., Pinecone, Qdrant) in a RAG pipeline.

Inputs

  • text: String (optional). Plain text or Markdown to chunk. Maximum 500,000 characters.
  • dataset_id: String (optional). Apify dataset ID from a previous actor run (e.g., Website Content Crawler). Each item is chunked individually. Takes priority over text.
  • dataset_field: String (optional). Field to read from each dataset item. Defaults to "text". Supports dot notation for nested fields (e.g., "metadata.content").
  • strategy: String (optional). Chunking strategy: "recursive" (default), "markdown", or "sentence".
  • chunk_size: Integer (optional). Target chunk size in tokens (cl100k_base). Range: 64-8192. Default: 512.
  • chunk_overlap: Integer (optional). Overlapping tokens between consecutive chunks. Range: 0-2048. Default: 64.

At least one of text or dataset_id must be provided.
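The dot-notation lookup for dataset_field can be sketched as a simple nested getter. This is a minimal illustration, not the actor's actual implementation; the function name and default value are assumptions.

```python
def get_field(item: dict, path: str, default=""):
    """Resolve a dot-notation path like "metadata.content" against a dataset item.

    Walks one key per dot; returns `default` if any step is missing
    or the intermediate value is not a dict.
    """
    cur = item
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur

item = {"metadata": {"content": "Page body here"}}
get_field(item, "metadata.content")  # resolves the nested field
```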

Outputs

Each chunk is pushed as a separate dataset item:

  • chunk_id: String. Deterministic SHA-256 ID (16 hex chars) for deduplication.
  • chunk_index: Integer. Zero-based position within the source document.
  • text: String. The chunk content.
  • token_count: Integer. Token count using cl100k_base encoding.
  • source_url: String. Source URL from crawler metadata (if available).
  • page_title: String. Page title from crawler metadata (if available).
  • section_heading: String. Markdown section heading (markdown strategy only).
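A deterministic chunk ID of this shape can be sketched as a truncated SHA-256 over the chunk's identifying fields. Which fields are hashed (here: source URL, index, and text) is an assumption; the point is that identical input always yields the identical 16-hex-character ID, which is what makes incremental vector DB upserts possible.

```python
import hashlib

def make_chunk_id(source_url: str, chunk_index: int, text: str) -> str:
    """Derive a stable 16-hex-char ID: same inputs -> same ID on every run.

    NOTE: the hashed fields are an assumption for illustration, not the
    actor's documented formula.
    """
    payload = f"{source_url}|{chunk_index}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

cid = make_chunk_id("https://example.com/docs", 0, "# Introduction\n\nSample.")
```

Because the ID is content-derived rather than random, re-chunking an unchanged page produces the same IDs, so a downstream upsert overwrites stale vectors instead of duplicating them.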

A summary item is also pushed:

  • _summary: Boolean. Always true (use to filter out the summary row).
  • total_chunks: Integer. Total chunks produced.
  • total_tokens: Integer. Total tokens across all chunks.
  • strategy: String. Strategy used.
  • chunk_size: Integer. Chunk size setting used.
  • chunk_overlap: Integer. Overlap setting used.
  • processing_time: Float. Wall-clock seconds.
  • billing: Object. Contains total_chunks, amount, and rate_per_chunk.

Pricing Model

  • Pay-Per-Event: $0.0005 per chunk produced ($0.50 per 1,000 chunks).
  • A typical 10-page website produces 50-100 chunks, costing $0.025-$0.05.
  • Billing is deterministic and included in the output summary.
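The billing arithmetic is simple enough to verify by hand; a sketch of the summary's billing object (field names taken from the Outputs section above, rounding behavior assumed):

```python
RATE_PER_CHUNK = 0.0005  # $0.50 per 1,000 chunks

def billing_summary(total_chunks: int) -> dict:
    """Compute the deterministic billing object included in the summary item."""
    return {
        "total_chunks": total_chunks,
        "rate_per_chunk": RATE_PER_CHUNK,
        "amount": round(total_chunks * RATE_PER_CHUNK, 6),
    }

billing_summary(75)  # a typical 10-page site in the middle of the 50-100 chunk range
```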

Chunking Strategies

recursive (default)

Splits on paragraphs first, then sentences, then words. Falls back to hard token splits as a last resort. Best for general-purpose text and mixed content.
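The paragraph-then-sentence-then-word cascade can be sketched as follows. This simplified version measures characters instead of cl100k_base tokens and ignores overlap; the separator list and merge logic are assumptions chosen to illustrate the fallback order, not the actor's exact algorithm.

```python
def recursive_split(text, max_len, seps=("\n\n", ". ", " ")):
    """Greedily pack pieces split on coarse separators first (paragraphs),
    falling back to finer ones (sentences, words), and finally to a hard
    split when a single piece still exceeds max_len.
    """
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for p in parts:
                candidate = (buf + sep + p) if buf else p
                if len(candidate) <= max_len:
                    buf = candidate
                else:
                    if buf:
                        chunks.append(buf)
                    buf = p
            if buf:
                chunks.append(buf)
            # any piece still too large gets sub-split with finer separators
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, seps)]
    # last resort: hard split at max_len boundaries
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```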

markdown

Splits on Markdown headers (h1-h6), preserving document structure. Sections exceeding chunk_size are sub-split recursively. Each chunk retains its section_heading in metadata. Best for structured documentation and crawled web pages.
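Header-based splitting can be sketched with a regex over ATX headers. This is an illustrative assumption of the approach, not the actor's implementation; it omits the recursive sub-splitting of oversized sections.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)")  # ATX headers: # through ######

def split_on_headers(md: str):
    """Split Markdown into (section_heading, body) pairs at h1-h6 headers."""
    sections, heading, buf = [], "", []
    for line in md.splitlines():
        m = HEADER.match(line)
        if m:
            if buf:
                sections.append((heading, "\n".join(buf).strip()))
            heading, buf = m.group(2), []
        else:
            buf.append(line)
    sections.append((heading, "\n".join(buf).strip()))
    return sections

doc = "# Introduction\n\nThis is a sample document.\n\n## Details\n\nMore text."
split_on_headers(doc)  # two sections, each carrying its heading
```

Each resulting heading would populate the chunk's section_heading field described under Outputs.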

sentence

Splits strictly on sentence boundaries. Best for conversational content, Q&A pairs, and prose where sentence integrity matters.
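A minimal sentence-boundary splitter, assuming boundaries at `.`, `!`, or `?` followed by whitespace (real sentence segmentation handles abbreviations and more; this is a sketch of the idea only):

```python
import re

def sentence_split(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace,
    keeping the punctuation attached to its sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sentence_split("Hello world. How are you? Fine!")
```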

Constraints

  • Maximum text input: 500,000 characters.
  • Maximum dataset items: 10,000.
  • chunk_overlap must be strictly less than chunk_size.
  • No external network calls (pure local computation).
  • No LLM calls; no API keys required.
  • Input is sanitized: null bytes and control characters are stripped.
  • Dataset IDs and field names are regex-validated to prevent injection.
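The sanitization step can be sketched as a regex that strips null bytes and non-printing control characters while preserving the whitespace (tabs, newlines) that chunking strategies depend on. The exact character class is an assumption.

```python
import re

# C0 control chars and DEL, excluding \t (\x09), \n (\x0a), and \r (\x0d)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def sanitize(text: str) -> str:
    """Strip null bytes and control characters; keep tabs and line breaks."""
    return CONTROL_CHARS.sub("", text)

sanitize("ab\x00c")  # null byte removed, visible text intact
```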

Example

Input:

```json
{
  "text": "# Introduction\n\nThis is a sample document.\n\n## Details\n\nIt contains multiple sections for demonstration.",
  "strategy": "markdown",
  "chunk_size": 256,
  "chunk_overlap": 32
}
```

Output (one chunk item):

```json
{
  "chunk_id": "a1b2c3d4e5f67890",
  "chunk_index": 0,
  "text": "# Introduction\n\nThis is a sample document.",
  "token_count": 10,
  "source_url": "",
  "page_title": "",
  "section_heading": "Introduction"
}
```

Pipeline Position

```
Crawl (Website Content Crawler)
  -> Clean (optional)
    -> Chunk (this actor)
      -> Embed (OpenAI, Cohere, etc.)
        -> Store (Pinecone, Qdrant, Weaviate)
```