rag-knowledge-idexer

Overview

Automates the extraction, chunking, and metadata tagging of textbook content for the Physical AI & Humanoid Robotics RAG chatbot. It scans the Docusaurus .mdx files and generates a structured JSON payload ready for ingestion into the Qdrant vector database.

Activation Triggers

File Pattern Triggers

  • Saving files in docs/**/*.mdx (if configured for auto-watch)

Keyword Triggers

  • "update rag index"
  • "prepare embeddings"
  • "index chapter"
  • "scan docs for chatbot"
  • "generate vector payload"

Command Triggers

bash
# Explicit activation
index_rag_content --week 3
prepare_embeddings --all

Core Functionality

1. Semantic Chunking

  • Parses the Markdown structure and splits content into logical sections at headings.
  • Preserves code blocks within their explanatory context.
  • Cleans MDX-specific syntax (imports, tabs) that adds noise to LLM context.
  • Keeps each chunk within a size limit (default ~500 words, used as a rough proxy for tokens).
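
The chunking rules above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names, the noise patterns (ESM import lines and `<Tabs>`/`<TabItem>` tags), and the greedy packing strategy are all assumptions. Fenced code blocks are treated as indivisible, so a code sample stays in the same section as the text that explains it.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")
FENCE = re.compile(r"^\s*(```|~~~)")
# Assumed noise patterns: ESM import lines and <Tabs>/<TabItem> tags.
MDX_NOISE = re.compile(r"^\s*(import\s.+?['\"];?|</?(Tabs|TabItem)\b[^>]*>)\s*$")


def split_sections(text):
    """Return [(heading, [block, ...])]; a block is a paragraph or a whole code fence."""
    sections, heading, blocks, current, in_fence = [], "", [], [], False

    def end_block():
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            if not in_fence:
                end_block()          # the fence starts a block of its own
            current.append(line)
            if in_fence:
                end_block()          # closing fence completes the block
            in_fence = not in_fence
            continue
        if in_fence:
            current.append(line)     # "#" lines inside code are not headings
            continue
        if MDX_NOISE.match(line):
            continue
        match = HEADING.match(line)
        if match:
            end_block()
            if blocks:
                sections.append((heading, blocks))
            heading, blocks = match.group(2).strip(), []
        elif line.strip():
            current.append(line)
        else:
            end_block()
    end_block()
    if blocks:
        sections.append((heading, blocks))
    return sections


def chunk_markdown(text, max_words=500):
    """Greedily pack each section's blocks into chunks of at most max_words words."""
    chunks = []
    for heading, blocks in split_sections(text):
        body, count = [], 0
        for block in blocks:
            words = len(block.split())
            if body and count + words > max_words:
                chunks.append({"heading": heading, "text": "\n\n".join(body)})
                body, count = [], 0
            body.append(block)       # a single oversized block is kept whole
            count += words
        chunks.append({"heading": heading, "text": "\n\n".join(body)})
    return chunks
```

Counting words rather than tokens keeps the sketch dependency-free; a tokenizer-based count could be swapped in where exact token budgets matter.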

2. Metadata Extraction

  • Frontmatter Parsing: Extracts title, week, difficulty, and tags.
  • Context Awareness: Appends parent hierarchy (Module -> Chapter -> Section) to every chunk.
  • Hardware Tagging: Identifies whether a chunk requires specific hardware (e.g., "Requires: Jetson Orin").
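
As a rough sketch of the three steps above, assuming flat `key: value` frontmatter and a literal "Requires: ..." marker in the text (both assumptions; the function names and the output field names are hypothetical as well). A real implementation would more likely use a full YAML parser.

```python
import re

# Assumed marker format, taken from the "Requires: Jetson Orin" example.
HARDWARE = re.compile(r"Requires:\s*([^\n.]+)")


def parse_frontmatter(text):
    """Split a document into (metadata dict, body). Handles flat key: value pairs only."""
    meta = {}
    if not text.startswith("---\n"):
        return meta, text
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:
        return meta, text            # unterminated frontmatter: treat it all as body
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not colon or not key:
            continue
        if value.startswith("[") and value.endswith("]"):
            # Inline list such as: tags: [ros2, nodes]
            meta[key] = [v.strip().strip("'\"") for v in value[1:-1].split(",") if v.strip()]
        elif value.isdigit():
            meta[key] = int(value)
        else:
            meta[key] = value.strip("'\"")
    return meta, body


def chunk_metadata(meta, module, section, chunk_text):
    """Combine frontmatter fields, the parent hierarchy, and a hardware tag for one chunk."""
    hardware = HARDWARE.search(chunk_text)
    return {
        "title": meta.get("title"),
        "week": meta.get("week"),
        "difficulty": meta.get("difficulty"),
        "tags": meta.get("tags", []),
        "hierarchy": " -> ".join([module, str(meta.get("title", "")), section]),
        "hardware": hardware.group(1).strip() if hardware else None,
    }
```

Here the chapter name is taken from the frontmatter title, and the module name is passed in by the caller, for example derived from the directory under docs/.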

3. Payload Generation

  • Validates data against metadata_schema.json.
  • Generates unique IDs for every text chunk.
  • Outputs a single qdrant_payload.json file used by the backend ingestion script.
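
One possible shape for this step is sketched below. The required-field list, the `{"id", "payload"}` record layout, and the function names are assumptions made for illustration; the authoritative field list lives in metadata_schema.json. IDs are derived with `uuid.uuid5`, since Qdrant point IDs must be unsigned integers or UUIDs, and a deterministic ID means a re-run produces the same ID for the same chunk.

```python
import json
import uuid
from pathlib import Path

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "rag-knowledge-idexer")
# Assumed required fields; the real list is defined in metadata_schema.json.
REQUIRED_FIELDS = ("source", "chunk_index", "text")


def chunk_id(source, chunk_index):
    """Deterministic UUID for a (file, chunk position) pair."""
    return str(uuid.uuid5(NAMESPACE, f"{source}#{chunk_index}"))


def build_payload(chunks):
    """Check required fields and wrap each chunk as {"id": ..., "payload": ...}."""
    points = []
    for chunk in chunks:
        missing = [f for f in REQUIRED_FIELDS if chunk.get(f) in (None, "")]
        if missing:
            raise ValueError(f"chunk from {chunk.get('source')!r} is missing {missing}")
        points.append({
            "id": chunk_id(chunk["source"], chunk["chunk_index"]),
            "payload": chunk,
        })
    return points


def write_payload(points, output_file="rag_data/qdrant_payload.json"):
    """Write all points to a single JSON file, creating the directory if needed."""
    path = Path(output_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(points, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

The field check here stands in for full schema validation; a JSON Schema validator could replace it if metadata_schema.json is a JSON Schema document.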

Inputs

Required Parameters

None. By default, the skill scans docs/ recursively.

Optional Parameters

python
{
  "target_week": int,       # Index only the given week
  "output_file": str,       # Custom output path (default: rag_data/qdrant_payload.json)
  "force_reindex": bool     # Ignore cache/checksums and reindex every file
}
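
To illustrate what force_reindex might control, here is a minimal sketch of checksum-based file selection. How the skill actually stores its cache is not documented here, so the in-memory `cache` dict and the names `file_checksum` and `select_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """SHA-256 of the file contents, used to detect changes between runs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def select_files(docs_dir="docs", cache=None, force_reindex=False):
    """Return .mdx files whose checksum differs from the cached one (all files when forced)."""
    cache = {} if cache is None else cache
    selected = []
    for path in sorted(Path(docs_dir).rglob("*.mdx")):
        digest = file_checksum(path)
        if force_reindex or cache.get(str(path)) != digest:
            selected.append(path)
        cache[str(path)] = digest    # remember the latest checksum for the next run
    return selected
```

With an unchanged docs/ tree, a second call with the same cache returns an empty list, while force_reindex=True returns every file again.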

Outputs

Generated Files

code
rag_data/
└── qdrant_payload.json     # The master dataset for the vector DB

Console Output

  • Statistics on processed files.
  • Number of chunks generated.
  • Warnings for missing metadata or empty sections.

Integration Points

  • Input: Reads from docs/ (generated by docusaurus-chapter-builder).
  • Output: Feeds into the FastAPI/Qdrant backend (Course Requirement #2).