# Skill: Connect Data

## Purpose
Guided wizard to connect a new dataset. Walks the user through selecting a connection type, configuring credentials, validating the connection, profiling the schema, and setting up the knowledge brain.
## When to Use

- User says `/connect-data`, "connect my database", or "add a new dataset"
- First-run welcome suggests connecting data
- After `/switch-dataset` when the target dataset doesn't exist yet
## Invocation

- `/connect-data` — start the connection wizard
- `/connect-data type=postgres` — skip type selection

## Instructions

### Step 1: Choose Connection Type
Present options:
- CSV files — "I have CSV files in a local directory"
- DuckDB — "I have a local DuckDB database file"
- MotherDuck — "I have a MotherDuck cloud database"
- PostgreSQL — "I have a PostgreSQL database"
- BigQuery — "I have a Google BigQuery dataset"
- Snowflake — "I have a Snowflake warehouse"
### Step 2: Collect Connection Details
For CSV:

- Ask: "What's the path to your CSV directory? (relative to this repo)"
- Verify the directory exists and contains `.csv` files (see the sketch after this list)
- List the files found and ask the user to confirm
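A minimal sketch of the CSV check, using only `pathlib`; the function name is illustrative, not an existing helper in this repo:

```python
from pathlib import Path

def find_csv_files(directory: str) -> list[Path]:
    """Verify the directory exists and list its .csv files (illustrative helper)."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Directory not found: {root}")
    files = sorted(root.glob("*.csv"))
    if not files:
        # Matches the edge case below: fall back to other tabular formats.
        files = sorted(p for ext in ("*.parquet", "*.json") for p in root.glob(ext))
    return files
```

Present the returned list to the user for confirmation before proceeding.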
For DuckDB:

- Ask: "Path to your `.duckdb` file?"
- Verify the file exists
- Test the connection with `SELECT 1` (sketched below)
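A sketch of the smoke test using the `duckdb` Python package; how the wizard surfaces the failure message is an assumption:

```python
from pathlib import Path

import duckdb

def test_duckdb(path: str) -> bool:
    """Open the .duckdb file read-only and run SELECT 1 as a smoke test."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"No DuckDB file at {path}")
    conn = duckdb.connect(path, read_only=True)
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except duckdb.Error as exc:
        print(f"Connection test failed: {exc}")
        return False
    finally:
        conn.close()
```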
For MotherDuck:

- Ask: "Database name and schema?"
- Note: "MotherDuck connects via MCP. Make sure your token is configured."
For PostgreSQL / BigQuery / Snowflake:

- Copy the appropriate template from `connection_templates/`
- Ask the user to fill in the required fields
- IMPORTANT: Never ask for or store passwords directly. Guide the user to use environment variables (e.g., `$PG_PASSWORD`), as in the sketch below.
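One way to honor that rule, sketched under the assumption that the stored config holds `$VAR` placeholders (the field names are illustrative): keep only the placeholder on disk and resolve it from the environment at connect time.

```python
import os

# Illustrative connection fragment: the password field holds a placeholder,
# never the secret itself.
connection = {
    "host": "db.example.com",
    "port": 5432,
    "user": "analyst",
    "password": "$PG_PASSWORD",  # placeholder; the secret lives in the environment
}

def resolve_env_placeholders(config: dict) -> dict:
    """Expand $VAR values from the environment; fail loudly if one is unset."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith("$"):
            name = value[1:]
            if name not in os.environ:
                raise KeyError(f"Environment variable {name} is not set")
            resolved[key] = os.environ[name]
        else:
            resolved[key] = value
    return resolved
```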
### Step 3: Create Dataset Brain

- Generate a `dataset_id` from the display name (lowercase, hyphens)
- Create the `.knowledge/datasets/{id}/` directory (see the scaffolding sketch below)
- Write `manifest.yaml` from the connection template + user inputs
- Create an empty `quirks.md` with section headers
- Create an empty `metrics/index.yaml`
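A minimal sketch of the scaffolding, assuming PyYAML is available; the manifest keys and `quirks.md` section headers are placeholders, since their exact shape isn't specified here:

```python
import re
from pathlib import Path

import yaml  # PyYAML

def create_dataset_brain(display_name: str, connection: dict) -> str:
    """Create .knowledge/datasets/{id}/ with manifest, quirks, and metrics index."""
    # Slugify: lowercase, runs of non-alphanumerics become single hyphens.
    dataset_id = re.sub(r"[^a-z0-9]+", "-", display_name.lower()).strip("-")
    root = Path(".knowledge/datasets") / dataset_id
    (root / "metrics").mkdir(parents=True, exist_ok=True)

    manifest = {
        "dataset_id": dataset_id,
        "display_name": display_name,
        "connection": connection,
    }
    (root / "manifest.yaml").write_text(yaml.safe_dump(manifest, sort_keys=False))
    (root / "quirks.md").write_text("# Quirks\n\n## Data Gaps\n\n## Gotchas\n")
    (root / "metrics" / "index.yaml").write_text("metrics: []\n")
    return dataset_id
```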
### Step 4: Test Connection

Use `ConnectionManager` from `helpers/connection_manager.py`:

- Instantiate it with the new config
- Call `test_connection()`
- If it fails: show the error and offer to retry or edit the config
- If it passes: proceed (see the retry sketch below)
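A sketch of that flow; the `ConnectionManager(config)` constructor and the assumption that `test_connection()` raises on failure are mine, since only the module and method names are given above:

```python
from helpers.connection_manager import ConnectionManager

def connect_with_retry(config: dict, max_attempts: int = 3):
    """Test the connection, showing the error and retrying on failure."""
    for attempt in range(1, max_attempts + 1):
        manager = ConnectionManager(config)  # constructor signature is an assumption
        try:
            manager.test_connection()  # assumed to raise on failure
            return manager
        except Exception as exc:
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
            # In the wizard, pause here and offer to retry or edit the config.
    return None
```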
### Step 5: Profile Schema

- Call `list_tables()` to enumerate tables
- For each table: get column names and types via `get_table_schema()`
- Generate `schema.md` using `schema_to_markdown()` from `helpers/data_helpers.py` (sketched below)
- Write it to `.knowledge/datasets/{id}/schema.md`
- Offer to run full data profiling: "Want me to deep-profile this dataset?"
### Step 6: Set Active

- Update `.knowledge/active.yaml` to point to the new dataset (see the sketch below)
- Confirm: "Connected! {display_name} is now your active dataset."
- Show: table count, estimated row count, date range (if detected)
- Suggest next steps: `/explore` to browse, `/metrics` to define metrics, or just ask a question
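A sketch of the pointer update, assuming `active.yaml` holds a single `dataset_id` key (its actual schema isn't specified here):

```python
from pathlib import Path

import yaml  # PyYAML

def set_active_dataset(dataset_id: str) -> None:
    """Point .knowledge/active.yaml at the newly connected dataset."""
    Path(".knowledge/active.yaml").write_text(
        yaml.safe_dump({"dataset_id": dataset_id})
    )
```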
## Rules

- Never store credentials in plain text in manifest files
- Always test the connection before declaring success
- Always generate a `schema.md` — it's required for analysis
- Create the full `.knowledge/datasets/{id}/` tree even if profiling fails
- If the user already has this dataset, ask before overwriting
## Edge Cases

- Directory doesn't exist: offer to create it
- No CSV files found: check for other formats (`.parquet`, `.json`)
- Connection fails repeatedly: suggest checking credentials, firewall, and VPN
- Schema too large (>100 tables): record table names only; skip per-table details
- Dataset name collision: append a number (e.g., "mydata-2"), as in the sketch below
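A sketch of the collision rule: probe for the first free numeric suffix before creating the directory.

```python
from pathlib import Path

def unique_dataset_id(base_id: str, root: Path = Path(".knowledge/datasets")) -> str:
    """Return base_id, or base_id-2, base_id-3, ... until nothing collides."""
    candidate, n = base_id, 2
    while (root / candidate).exists():
        candidate = f"{base_id}-{n}"
        n += 1
    return candidate
```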