# Skill: Connect Data

## Purpose
Guided wizard to connect a new dataset. Walks the user through selecting a connection type, configuring credentials, validating the connection, profiling the schema, and setting up the knowledge brain.
## When to Use

- User says `/connect-data`, "connect my database", or "add a new dataset"
- First-run welcome suggests connecting data
- After `/switch-dataset` when the target dataset doesn't exist yet
## Invocation

- `/connect-data` — start the connection wizard
- `/connect-data type=postgres` — skip type selection

## Instructions

### Step 1: Choose Connection Type
Present options:
- CSV files — "I have CSV files in a local directory"
- DuckDB — "I have a local DuckDB database file"
- MotherDuck — "I have a MotherDuck cloud database"
- PostgreSQL — "I have a PostgreSQL database"
- BigQuery — "I have a Google BigQuery dataset"
- Snowflake — "I have a Snowflake warehouse"
### Step 2: Collect Connection Details
For CSV:

- Ask: "What's the path to your CSV directory? (relative to this repo)"
- Verify the directory exists and contains `.csv` files (see the sketch after this list)
- List the files found and ask the user to confirm
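A minimal sketch of the CSV check, using only `pathlib`; the function name is illustrative, not an existing helper in this repo:

```python
from pathlib import Path

def find_csv_files(directory: str) -> list[Path]:
    """Verify the directory exists and list its .csv files (illustrative helper)."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    files = sorted(root.glob("*.csv"))
    if not files:
        # Matches the edge case below: fall back to other tabular formats.
        files = sorted(p for ext in ("*.parquet", "*.json") for p in root.glob(ext))
    return files
```

Present the returned list to the user for confirmation before proceeding.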
For DuckDB:

- Ask: "Path to your `.duckdb` file?"
- Verify the file exists
- Test the connection with `SELECT 1` (sketched below)
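A sketch of the smoke test using the `duckdb` Python package; how the wizard surfaces the failure message is an assumption:

```python
from pathlib import Path

import duckdb

def test_duckdb(path: str) -> bool:
    """Open the .duckdb file read-only and run SELECT 1 as a smoke test."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"No DuckDB file at {path}")
    conn = duckdb.connect(path, read_only=True)
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except duckdb.Error as exc:
        print(f"Connection test failed: {exc}")
        return False
    finally:
        conn.close()
```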
For MotherDuck:

- Ask: "Database name and schema?"
- Note: "MotherDuck connects via MCP. Make sure your token is configured."
For PostgreSQL / BigQuery / Snowflake:

- Copy the appropriate template from `connection_templates/`
- Ask the user to fill in the required fields
- IMPORTANT: Never ask for or store passwords directly. Guide the user to use environment variables (e.g., `$PG_PASSWORD`), as in the sketch below.
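One way to honor that rule, sketched under the assumption that the stored config holds `$VAR` placeholders (the field names are illustrative): keep only the placeholder on disk and resolve it from the environment at connect time.

```python
import os

# Illustrative connection fragment: the password field holds a placeholder,
# never the secret itself.
connection = {
    "host": "db.example.com",
    "port": 5432,
    "user": "analyst",
    "password": "$PG_PASSWORD",  # placeholder; the secret lives in the environment
}

def resolve_env_placeholders(config: dict) -> dict:
    """Expand $VAR values from the environment; fail loudly if one is unset."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("$"):
            name = value[1:]
            if name not in os.environ:
                raise KeyError(f"Environment variable {name} is not set")
            resolved[key] = os.environ[name]
        else:
            resolved[key] = value
    return resolved
```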
### Step 3: Create Dataset Brain

- Generate a `dataset_id` from the display name (lowercase, hyphens)
- Create the `.knowledge/datasets/{id}/` directory (see the scaffolding sketch below)
- Write `manifest.yaml` from the connection template + user inputs
- Create an empty `quirks.md` with section headers
- Create an empty `metrics/index.yaml`
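A minimal sketch of the scaffolding, assuming PyYAML is available; the manifest keys and `quirks.md` section headers are placeholders, since their exact shape isn't specified here:

```python
import re
from pathlib import Path

import yaml  # PyYAML

def create_dataset_brain(display_name: str, connection: dict) -> str:
    """Create .knowledge/datasets/{id}/ with manifest, quirks, and metrics index."""
    # Slugify: lowercase, runs of non-alphanumerics become single hyphens.
    dataset_id = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    root = Path(".knowledge/datasets") / dataset_id
    (root / "metrics").mkdir(parents=True, exist_ok=True)

    manifest = {
        "dataset_id": dataset_id,
        "display_name": display_name,
        "connection": connection,
    }
    (root / "manifest.yaml").write_text(yaml.safe_dump(manifest, sort_keys=False))
    (root / "quirks.md").write_text("# Quirks\n\n## Data Gaps\n\n## Gotchas\n")
    (root / "metrics" / "index.yaml").write_text("metrics: []\n")
    return dataset_id
```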
### Step 4: Test Connection

Use `ConnectionManager` from `helpers/connection_manager.py`:

- Instantiate it with the new config
- Call `test_connection()`
- If it fails: show the error and offer to retry or edit the config
- If it passes: proceed (see the retry sketch below)
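A sketch of that flow; the `ConnectionManager(config)` constructor and the assumption that `test_connection()` raises on failure are mine, since only the module and method names are given above:

```python
from helpers.connection_manager import ConnectionManager

def connect_with_retry(config: dict, max_attempts: int = 3):
    """Test the connection, showing the error and retrying on failure."""
    for attempt in range(1, max_attempts + 1):
        manager = ConnectionManager(config)  # constructor signature is an assumption
        try:
            manager.test_connection()  # assumed to raise on failure
            return manager
        except Exception as exc:
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
            # In the wizard, pause here and offer to retry or edit the config.
    return None
```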
### Step 5: Profile Schema

- Call `list_tables()` to enumerate tables
- For each table: get column names and types via `get_table_schema()`
- Generate `schema.md` using `schema_to_markdown()` from `helpers/data_helpers.py` (sketched below)
- Write it to `.knowledge/datasets/{id}/schema.md`
- Offer to run full data profiling: "Want me to deep-profile this dataset?"
### Step 6: Set Active

- Update `.knowledge/active.yaml` to point to the new dataset (see the sketch below)
- Confirm: "Connected! {display_name} is now your active dataset."
- Show: table count, estimated row count, date range (if detected)
- Suggest next steps: `/explore` to browse, `/metrics` to define metrics, or just ask a question
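A sketch of the pointer update, assuming `active.yaml` holds a single `dataset_id` key (its actual schema isn't specified here):

```python
from pathlib import Path

import yaml  # PyYAML

def set_active_dataset(dataset_id: str) -> None:
    """Point .knowledge/active.yaml at the newly connected dataset."""
    Path(".knowledge/active.yaml").write_text(
        yaml.safe_dump({"dataset_id": dataset_id})
    )
```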
## Rules

- Never store credentials in plain text in manifest files
- Always test the connection before declaring success
- Always generate a `schema.md` — it's required for analysis
- Create the full `.knowledge/datasets/{id}/` tree even if profiling fails
- If the user already has this dataset, ask before overwriting
## Edge Cases

- Directory doesn't exist: offer to create it
- No CSV files found: check for other formats (`.parquet`, `.json`)
- Connection fails repeatedly: suggest checking credentials, firewall, and VPN
- Schema too large (>100 tables): record table names only; skip per-table details
- Dataset name collision: append a number (e.g., "mydata-2"), as in the sketch below
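A sketch of the collision rule: probe for the first free numeric suffix before creating the directory.

```python
from pathlib import Path

def unique_dataset_id(base_id: str, root: Path = Path(".knowledge/datasets")) -> str:
    """Return base_id, or base_id-2, base_id-3, ... until nothing collides."""
    candidate, n = base_id, 2
    while (root / candidate).exists():
        candidate = f"{base_id}-{n}"
        n += 1
    return candidate
```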