Table of Contents
- Purpose
- Quick Start
- Core Concepts
- Instructions
  - Step 1: Choose Your Aggregation Pattern
  - Step 2: Select the Right Destination Engine
  - Step 3: Design Your Aggregation Query
  - Step 4: Implement State/Merge Functions
  - Step 5: Chain Views for Multi-Level Aggregation
  - Step 6: Handle Schema Evolution
  - Step 7: Monitor Performance
  - Step 8: Backfill Historical Data
- Examples & Patterns
- Requirements
- References
ClickHouse Materialized Views Skill
Purpose
Materialized views in ClickHouse are insert triggers that automatically transform and aggregate incoming data in real time. Unlike materialized views in traditional databases, which must be refreshed periodically, ClickHouse materialized views process each block of newly inserted data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.
When to Use This Skill
Use when asked to:
- Build real-time data aggregation pipelines ("create real-time metrics")
- Implement streaming analytics or continuous aggregation
- Create pre-aggregated tables for fast dashboard queries
- Set up automatic data transformation as events arrive
- Chain multi-level aggregations (hourly → daily → monthly)
Do NOT use when:
- Data is batch-processed infrequently (use scheduled jobs instead)
- Aggregation logic changes often (materialized views are harder to modify)
- Source data is small and queries are already fast
Quick Start
Create a simple real-time aggregation pipeline:
```sql
-- Step 1: Create source table (raw events)
CREATE TABLE events (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
    date Date,
    total_revenue Decimal(18, 2),
    event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;

-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    toDate(timestamp) AS date,
    SUM(revenue) AS total_revenue,
    COUNT() AS event_count
FROM events
GROUP BY date;

-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);

SELECT * FROM daily_revenue; -- Result: 2024-01-15 | 99.99 | 1
```
Instructions
Step 1: Choose Your Aggregation Pattern
Materialized views support three main patterns depending on your use case:
Incremental Views (Most Common)
- Process only new data as it arrives
- Best for streaming event data and continuous updates
- Pair with `SummingMergeTree` so partial aggregates are summed during background merges
Refreshable Views
- Recalculate the entire view on a schedule (e.g., hourly)
- Best for complex queries that need full recalculation
- Syntax: `CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR AS ...`
Populate-Based Views
- Backfill existing data when the view is created
- Warning: can be expensive on large tables
- Syntax: `CREATE MATERIALIZED VIEW ... POPULATE AS ...`
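As a sketch of the refreshable form (the `top_products` destination table and the query are illustrative, and refreshable views require a ClickHouse version that supports them):

```sql
-- Hypothetical hourly refresh that fully rebuilds a top-products rollup
CREATE MATERIALIZED VIEW top_products_mv
REFRESH EVERY 1 HOUR
TO top_products AS
SELECT
    product_id,
    SUM(revenue) AS total_revenue
FROM events
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 100;
```

Because each refresh recomputes the whole result, this form tolerates query changes (like a new `LIMIT`) that incremental views cannot absorb.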
See examples/aggregation-patterns.md for complete pattern examples.
Step 2: Select the Right Destination Engine
The destination table engine determines how your aggregated data behaves:
SummingMergeTree (Simple Aggregates)
- Use for additive aggregates such as COUNT and SUM (derive AVG at query time as SUM/COUNT; averages are not safely summable)
- Automatically sums numeric columns during background merges
- Rows sharing the same ORDER BY key collapse into one as merges complete
- Example: Hourly sales totals, daily revenue
AggregatingMergeTree (Complex Aggregates)
- Use with State/Merge functions for uniq, quantile, topK
- Stores intermediate aggregation states
- Write with `uniqState()`, `quantileState()`, `topKState()` in views
- Read with `uniqMerge()`, `quantileMerge()`, `topKMerge()` in queries
- Example: Unique user counts, percentile calculations
ReplacingMergeTree (Deduplication)
- Use when you need to replace/update existing rows
- An optional version column determines which record is the latest
- Example: User profile updates, entity state snapshots
MergeTree (Raw Events)
- Use when storing raw events without aggregation
- No automatic deduplication
- Example: Event audit trail, raw clickstream data
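A minimal ReplacingMergeTree sketch (the table and column names are illustrative):

```sql
-- Keep only the latest row per user_id, decided by the updated_at version column
CREATE TABLE user_profiles (
    user_id UInt32,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Deduplication happens at merge time; use FINAL to force it at query time
SELECT * FROM user_profiles FINAL WHERE user_id = 42;
```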
See references/engine-comparison.md for detailed comparison.
Step 3: Design Your Aggregation Query
Follow these principles:
GROUP BY Cardinality
- Keep cardinality moderate (not millions of unique combinations)
- Example: Good = `GROUP BY date, product_id`; Bad = `GROUP BY user_id, timestamp`
ORDER BY for Query Patterns
- Design ORDER BY around your most common query filters
- If queries filter by date first: `ORDER BY (date, product_id)`
- If queries filter by product first: `ORDER BY (product_id, date)`
Include Necessary Columns
- Include columns needed for post-aggregation filtering
- Store raw values alongside aggregates when needed for flexibility
- Example: Store `timestamp` and `revenue` alongside hourly aggregates
Use Conditional Aggregation
- Use `countIf()`, `sumIf()`, `avgIf()` to filter events
- Filter within the aggregation rather than creating a separate view per filter
- Example: `countIf(event_type = 'purchase') AS purchase_count`
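Putting these principles together, a sketch (the `product_daily` table and its columns are illustrative):

```sql
-- Destination designed for date-first dashboard filters, moderate cardinality
CREATE TABLE product_daily (
    date Date,
    product_id UInt32,
    view_count UInt64,
    purchase_count UInt64,
    total_revenue Decimal(18, 2)
) ENGINE = SummingMergeTree()
ORDER BY (date, product_id);

-- One view covers several metrics via conditional aggregation
CREATE MATERIALIZED VIEW product_daily_mv TO product_daily AS
SELECT
    toDate(timestamp) AS date,
    product_id,
    countIf(event_type = 'view') AS view_count,
    countIf(event_type = 'purchase') AS purchase_count,
    sumIf(revenue, event_type = 'purchase') AS total_revenue
FROM events
GROUP BY date, product_id;
```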
See examples/aggregation-patterns.md for SQL patterns.
Step 4: Implement State/Merge Functions for Complex Aggregates
When using AggregatingMergeTree, use special State/Merge function pairs:
In Materialized View (State Functions)
```sql
CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
    toDate(timestamp) AS date,
    uniqState(user_id) AS unique_users,
    quantileState(0.95)(response_time) AS p95_latency
FROM events
GROUP BY date;
```
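For this view to work, the destination table must declare matching AggregateFunction columns. A sketch, assuming UInt32 user IDs and Float64 response times:

```sql
CREATE TABLE user_stats (
    date Date,
    unique_users AggregateFunction(uniq, UInt32),
    p95_latency AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY date;
```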
In Queries (Merge Functions)
```sql
SELECT
    date,
    uniqMerge(unique_users) AS total_unique_users,
    quantileMerge(0.95)(p95_latency) AS latency_p95
FROM user_stats
GROUP BY date;
```
Common function pairs:
- `uniq()` ↔ `uniqState()` / `uniqMerge()`
- `sum()` ↔ `sumState()` / `sumMerge()`
- `avg()` ↔ `avgState()` / `avgMerge()`
- `quantile(0.95)` ↔ `quantileState(0.95)` / `quantileMerge(0.95)`
- `topK(10)` ↔ `topKState(10)` / `topKMerge(10)`
See references/state-merge-reference.md for all function pairs.
Step 5: Chain Views for Multi-Level Aggregation
Build hierarchical aggregations to reduce redundant computation:
```sql
-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY (user_id, timestamp);

-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree() ORDER BY (hour, product_id);

CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS event_count,
    SUM(revenue) AS total_revenue
FROM events
GROUP BY hour, product_id;

-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...) ENGINE = SummingMergeTree() ORDER BY (date, product_id);

CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
    toDate(hour) AS date,
    product_id,
    SUM(event_count) AS event_count,
    SUM(total_revenue) AS total_revenue
FROM hourly_stats
GROUP BY date, product_id;
```
Benefits:
- Reduces redundant computation (daily stats read hourly rollups, not raw events)
- Each level is optimized for different query patterns
- Easier to debug and modify intermediate aggregations
Step 6: Handle Schema Evolution and Versioning
When modifying aggregation logic:
```sql
-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT user_id) AS unique_customers -- New column
FROM orders
GROUP BY hour, product_id;

-- Migrate queries gradually to v2, then drop the old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;
```
Avoid ALTER TABLE on materialized view destinations when possible: if the destination schema no longer matches the view's SELECT, inserts can fail and data can be lost.
Step 7: Monitor and Optimize View Performance
Check View Execution Performance
```sql
-- Per-view execution stats live in system.query_views_log
-- (populated when query logging and log_query_views are enabled)
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage
FROM system.query_views_log
WHERE view_duration_ms > 1000
ORDER BY view_duration_ms DESC;
```
Common Performance Issues and Solutions
- High memory usage: Reduce GROUP BY cardinality or add WHERE filters
- Slow execution: Simplify aggregation logic, or chain views instead of one complex view
- Data lag: Increase `kafka_max_block_size` if consuming from Kafka
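Another signal worth watching is the number of unmerged parts on destination tables, since high part counts slow queries and indicate merge pressure. A sketch:

```sql
-- Destination tables with the most active (unmerged) parts
SELECT
    table,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = currentDatabase()
GROUP BY table
ORDER BY active_parts DESC;
```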
See references/troubleshooting.md for debugging guide.
Step 8: Backfill Historical Data
When adding new views to existing data:
Option 1: Direct Insert (Safest)
```sql
-- Backfill historical data directly into the destination
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE date < '2024-01-01'
GROUP BY hour, product_id;
```
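A common way to combine this with a live view, without double counting, is to create the view with a cutoff filter and backfill only rows before the cutoff (the cutoff timestamp here is illustrative):

```sql
-- New inserts at or after the cutoff flow through the view;
-- the direct INSERT above covers everything strictly before it
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE timestamp >= '2024-01-01 00:00:00'
GROUP BY hour, product_id;
```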
Option 2: Use POPULATE (for empty tables)
```sql
-- Only use POPULATE when the destination is empty. Note that POPULATE
-- cannot be combined with TO, so the view's storage is defined inline:
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, product_id)
POPULATE AS
SELECT ... FROM orders;
```
⚠️ Avoid POPULATE on large source tables (it can be very slow), and be aware that rows inserted while POPULATE runs may be missed.
Examples & Patterns
The following patterns are common use cases for materialized views:
Real-Time Dashboard Metrics
Pre-aggregate metrics to enable instant dashboard queries:
```sql
CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
    toStartOfHour(timestamp) AS hour,
    COUNT() AS total_events,
    uniq(user_id) AS unique_users,
    countIf(event_type = 'purchase') AS purchases
FROM events
GROUP BY hour;
```
User Segmentation
Automatically segment users based on behavior:
```sql
CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
    user_id,
    COUNT() AS event_count,
    SUM(revenue) AS lifetime_value,
    CASE
        WHEN lifetime_value > 1000 THEN 'vip'
        WHEN lifetime_value > 100 THEN 'regular'
        ELSE 'casual'
    END AS segment
FROM events
GROUP BY user_id;
```
Note that a materialized view aggregates each insert block independently, so `lifetime_value` and `segment` reflect only the rows in that block; for true lifetime values, sum the partials and derive the segment at query time.
Multi-Destination Routing
Route data to multiple aggregations:
```sql
-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;

-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;
```
Complete Examples
For comprehensive, production-ready examples covering real-world scenarios, see the examples/ directory:
- Real-time dashboards and metrics aggregation
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Kafka streaming pipelines with end-to-end examples
Requirements
- ClickHouse (any modern release): Incremental materialized views and State/Merge combinators
- ClickHouse 23.12+: Refreshable materialized views (experimental at introduction)
- Python client (for querying views): `pip install clickhouse-driver`
- Kafka/Redpanda: For streaming pipelines (optional)
References
Engine Comparison
references/engine-comparison.md - Detailed comparison of ClickHouse table engines for materialized views:
- SummingMergeTree vs AggregatingMergeTree: When to use each for aggregation
- ReplacingMergeTree: For deduplication and entity updates
- Engine characteristics: Memory usage, merge behavior, query performance
- Decision matrix: Choosing the right engine for your use case
State/Merge Functions Reference
references/state-merge-reference.md - Complete reference for all State/Merge function pairs:
- Function pair mapping (State ↔ Merge)
- Supported aggregation functions
- Parameter specifications and usage
- Performance considerations
- Examples for each function
Kafka Streaming Pipeline
examples/kafka-streaming-pipeline.md - Real-world Kafka integration patterns:
- Consuming from Kafka with the Kafka table engine
- Chaining materialized views with Kafka sources
- Error handling and data validation
- Complete end-to-end pipeline example
- Monitoring Kafka consumption in ClickHouse
Aggregation Patterns
examples/aggregation-patterns.md - SQL pattern examples and common scenarios:
- Real-time dashboards and metrics
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Copy-paste ready SQL examples
Complex Aggregates with State/Merge
examples/state-merge-aggregates.md - Advanced aggregation techniques:
- Using uniq, quantile, and topK in views
- Multi-stage aggregation with State/Merge
- Accuracy vs performance tradeoffs
- Real-world performance tuning examples
- Working with approximate functions
Troubleshooting Guide
references/troubleshooting.md - Comprehensive debugging guide for common issues:
- Performance problems and optimization strategies
- Data correctness and consistency issues
- Schema evolution and migration challenges
- Monitoring and health checks
- Common error messages and solutions