Table of Contents
- Purpose
- Quick Start
- Core Concepts
- Instructions
  - Step 1: Choose Your Aggregation Pattern
  - Step 2: Select the Right Destination Engine
  - Step 3: Design Your Aggregation Query
  - Step 4: Implement State/Merge Functions
  - Step 5: Chain Views for Multi-Level Aggregation
  - Step 6: Handle Schema Evolution
  - Step 7: Monitor Performance
  - Step 8: Backfill Historical Data
- Examples & Patterns
- Requirements
- References
ClickHouse Materialized Views Skill
Purpose
Materialized views in ClickHouse are insert triggers that automatically transform and aggregate incoming data in real time. Unlike materialized views in traditional databases, which must be refreshed periodically, ClickHouse materialized views process each block of newly inserted data as it arrives, enabling efficient streaming analytics, real-time dashboards, and continuous aggregation without scheduled batch jobs.
When to Use This Skill
Use when asked to:
- Build real-time data aggregation pipelines ("create real-time metrics")
- Implement streaming analytics or continuous aggregation
- Create pre-aggregated tables for fast dashboard queries
- Set up automatic data transformation as events arrive
- Chain multi-level aggregations (hourly → daily → monthly)
Do NOT use when:
- Data is batch-processed infrequently (use scheduled jobs instead)
- Aggregation logic changes often (materialized views are harder to modify)
- Source data is small and queries are already fast
Quick Start
Create a simple real-time aggregation pipeline:
```sql
-- Step 1: Create source table (raw events)
CREATE TABLE events (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    revenue Decimal(10, 2)
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp);

-- Step 2: Create destination table (aggregated)
CREATE TABLE daily_revenue (
    date Date,
    total_revenue Decimal(18, 2),
    event_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY date;

-- Step 3: Create materialized view (automatic aggregation)
CREATE MATERIALIZED VIEW daily_revenue_mv TO daily_revenue AS
SELECT
    toDate(timestamp) AS date,
    SUM(revenue) AS total_revenue,
    COUNT() AS event_count
FROM events
GROUP BY date;

-- Now every INSERT into events automatically updates daily_revenue
INSERT INTO events VALUES (12345, 'purchase', '2024-01-15 10:30:00', 99.99);

SELECT * FROM daily_revenue; -- Result: 2024-01-15 | 99.99 | 1
```
Instructions
Step 1: Choose Your Aggregation Pattern
Materialized views support three main patterns depending on your use case:
Incremental Views (Most Common)
- Process only new data as it arrives
- Best for streaming event data and continuous updates
- Pair with `SummingMergeTree` so partial aggregates are summed during background merges
Refreshable Views
- Recalculate the entire view on a schedule (e.g., hourly)
- Best for complex queries that need full recalculation
- Syntax: `CREATE MATERIALIZED VIEW ... REFRESH EVERY 1 HOUR AS ...`
Populate-Based Views
- Backfill existing data when the view is created
- Warning: can be expensive on large tables
- Syntax: `CREATE MATERIALIZED VIEW ... POPULATE AS ...`
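As a sketch of the refreshable form (the `top_products` destination table and the query are illustrative, and refreshable views require a ClickHouse version that supports them):

```sql
-- Hypothetical hourly refresh that fully rebuilds a top-products rollup
CREATE MATERIALIZED VIEW top_products_mv
REFRESH EVERY 1 HOUR
TO top_products AS
SELECT
    product_id,
    SUM(revenue) AS total_revenue
FROM events
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 100;
```

Because each refresh recomputes the whole result, this form tolerates query changes (like a new `LIMIT`) that incremental views cannot absorb.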
See examples/aggregation-patterns.md for complete pattern examples.
Step 2: Select the Right Destination Engine
The destination table engine determines how your aggregated data behaves:
SummingMergeTree (Simple Aggregates)
- Use for additive aggregates such as COUNT and SUM (derive AVG at query time as SUM/COUNT; averages are not safely summable)
- Automatically sums numeric columns during background merges
- Rows sharing the same ORDER BY key collapse into one as merges complete
- Example: Hourly sales totals, daily revenue
AggregatingMergeTree (Complex Aggregates)
- Use with State/Merge functions for uniq, quantile, topK
- Stores intermediate aggregation states
- Write with `uniqState()`, `quantileState()`, `topKState()` in views
- Read with `uniqMerge()`, `quantileMerge()`, `topKMerge()` in queries
- Example: Unique user counts, percentile calculations
ReplacingMergeTree (Deduplication)
- Use when you need to replace/update existing rows
- An optional version column determines which record is the latest
- Example: User profile updates, entity state snapshots
MergeTree (Raw Events)
- Use when storing raw events without aggregation
- No automatic deduplication
- Example: Event audit trail, raw clickstream data
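A minimal ReplacingMergeTree sketch (the table and column names are illustrative):

```sql
-- Keep only the latest row per user_id, decided by the updated_at version column
CREATE TABLE user_profiles (
    user_id UInt32,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- Deduplication happens at merge time; use FINAL to force it at query time
SELECT * FROM user_profiles FINAL WHERE user_id = 42;
```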
See references/engine-comparison.md for detailed comparison.
Step 3: Design Your Aggregation Query
Follow these principles:
GROUP BY Cardinality
- Keep cardinality moderate (not millions of unique combinations)
- Example: Good = `GROUP BY date, product_id`; Bad = `GROUP BY user_id, timestamp`
ORDER BY for Query Patterns
- Design ORDER BY around your most common query filters
- If queries filter by date first: `ORDER BY (date, product_id)`
- If queries filter by product first: `ORDER BY (product_id, date)`
Include Necessary Columns
- Include columns needed for post-aggregation filtering
- Store raw values alongside aggregates when needed for flexibility
- Example: Store `timestamp` and `revenue` alongside hourly aggregates
Use Conditional Aggregation
- Use `countIf()`, `sumIf()`, `avgIf()` to filter events
- Filter within the aggregation rather than creating a separate view per filter
- Example: `countIf(event_type = 'purchase') AS purchase_count`
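Putting these principles together, a sketch (the `product_daily` table and its columns are illustrative):

```sql
-- Destination designed for date-first dashboard filters, moderate cardinality
CREATE TABLE product_daily (
    date Date,
    product_id UInt32,
    view_count UInt64,
    purchase_count UInt64,
    total_revenue Decimal(18, 2)
) ENGINE = SummingMergeTree()
ORDER BY (date, product_id);

-- One view covers several metrics via conditional aggregation
CREATE MATERIALIZED VIEW product_daily_mv TO product_daily AS
SELECT
    toDate(timestamp) AS date,
    product_id,
    countIf(event_type = 'view') AS view_count,
    countIf(event_type = 'purchase') AS purchase_count,
    sumIf(revenue, event_type = 'purchase') AS total_revenue
FROM events
GROUP BY date, product_id;
```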
See examples/aggregation-patterns.md for SQL patterns.
Step 4: Implement State/Merge Functions for Complex Aggregates
When using AggregatingMergeTree, use special State/Merge function pairs:
In Materialized View (State Functions)
```sql
CREATE MATERIALIZED VIEW user_stats_mv TO user_stats AS
SELECT
    toDate(timestamp) AS date,
    uniqState(user_id) AS unique_users,
    quantileState(0.95)(response_time) AS p95_latency
FROM events
GROUP BY date;
```
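For this view to work, the destination table must declare matching AggregateFunction columns. A sketch, assuming UInt32 user IDs and Float64 response times:

```sql
CREATE TABLE user_stats (
    date Date,
    unique_users AggregateFunction(uniq, UInt32),
    p95_latency AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY date;
```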
In Queries (Merge Functions)
```sql
SELECT
    date,
    uniqMerge(unique_users) AS total_unique_users,
    quantileMerge(0.95)(p95_latency) AS latency_p95
FROM user_stats
GROUP BY date;
```
Common function pairs:
- `uniq()` ↔ `uniqState()` / `uniqMerge()`
- `sum()` ↔ `sumState()` / `sumMerge()`
- `avg()` ↔ `avgState()` / `avgMerge()`
- `quantile(0.95)` ↔ `quantileState(0.95)` / `quantileMerge(0.95)`
- `topK(10)` ↔ `topKState(10)` / `topKMerge(10)`
See references/state-merge-reference.md for all function pairs.
Step 5: Chain Views for Multi-Level Aggregation
Build hierarchical aggregations to reduce redundant computation:
```sql
-- Level 1: Raw events (source)
CREATE TABLE events (...) ENGINE = MergeTree() ORDER BY (user_id, timestamp);

-- Level 2: Hourly aggregation
CREATE TABLE hourly_stats (...) ENGINE = SummingMergeTree() ORDER BY (hour, product_id);

CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS event_count,
    SUM(revenue) AS total_revenue
FROM events
GROUP BY hour, product_id;

-- Level 3: Daily aggregation (chain from hourly, not raw events)
CREATE TABLE daily_stats (...) ENGINE = SummingMergeTree() ORDER BY (date, product_id);

CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT
    toDate(hour) AS date,
    product_id,
    SUM(event_count) AS event_count,
    SUM(total_revenue) AS total_revenue
FROM hourly_stats
GROUP BY date, product_id;
```
Benefits:
- Reduces redundant computation (daily stats read hourly rollups, not raw events)
- Each level is optimized for different query patterns
- Easier to debug and modify intermediate aggregations
Step 6: Handle Schema Evolution and Versioning
When modifying aggregation logic:
```sql
-- Create new versioned view with updated schema
CREATE MATERIALIZED VIEW hourly_stats_v2_mv TO hourly_stats_v2 AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT user_id) AS unique_customers -- New column
FROM orders
GROUP BY hour, product_id;

-- Migrate queries gradually to v2, then drop the old view:
DROP VIEW hourly_stats_v1_mv;
DROP TABLE hourly_stats_v1;
```
Avoid ALTER TABLE on materialized view destinations when possible: if the destination schema no longer matches the view's SELECT, inserts can fail and data can be lost.
Step 7: Monitor and Optimize View Performance
Check View Execution Performance
```sql
-- Per-view execution stats live in system.query_views_log
-- (populated when query logging and log_query_views are enabled)
SELECT
    view_name,
    view_duration_ms,
    read_rows,
    peak_memory_usage
FROM system.query_views_log
WHERE view_duration_ms > 1000
ORDER BY view_duration_ms DESC;
```
Common Performance Issues and Solutions
- High memory usage: Reduce GROUP BY cardinality or add WHERE filters
- Slow execution: Simplify aggregation logic, or chain views instead of one complex view
- Data lag: Increase `kafka_max_block_size` if consuming from Kafka
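Another signal worth watching is the number of unmerged parts on destination tables, since high part counts slow queries and indicate merge pressure. A sketch:

```sql
-- Destination tables with the most active (unmerged) parts
SELECT
    table,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = currentDatabase()
GROUP BY table
ORDER BY active_parts DESC;
```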
See references/troubleshooting.md for debugging guide.
Step 8: Backfill Historical Data
When adding new views to existing data:
Option 1: Direct Insert (Safest)
```sql
-- Backfill historical data directly into the destination
INSERT INTO hourly_stats
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE date < '2024-01-01'
GROUP BY hour, product_id;
```
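A common way to combine this with a live view, without double counting, is to create the view with a cutoff filter and backfill only rows before the cutoff (the cutoff timestamp here is illustrative):

```sql
-- New inserts at or after the cutoff flow through the view;
-- the direct INSERT above covers everything strictly before it
CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats AS
SELECT
    toStartOfHour(timestamp) AS hour,
    product_id,
    COUNT() AS sales_count,
    SUM(revenue) AS total_revenue
FROM orders
WHERE timestamp >= '2024-01-01 00:00:00'
GROUP BY hour, product_id;
```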
Option 2: Use POPULATE (for empty tables)
```sql
-- Only use POPULATE when the destination is empty. Note that POPULATE
-- cannot be combined with TO, so the view's storage is defined inline:
CREATE MATERIALIZED VIEW hourly_stats_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, product_id)
POPULATE AS
SELECT ... FROM orders;
```
⚠️ Avoid POPULATE on large source tables (it can be very slow), and be aware that rows inserted while POPULATE runs may be missed.
Examples & Patterns
The following patterns are common use cases for materialized views:
Real-Time Dashboard Metrics
Pre-aggregate metrics to enable instant dashboard queries:
```sql
CREATE MATERIALIZED VIEW dashboard_metrics_mv TO dashboard_metrics AS
SELECT
    toStartOfHour(timestamp) AS hour,
    COUNT() AS total_events,
    uniq(user_id) AS unique_users,
    countIf(event_type = 'purchase') AS purchases
FROM events
GROUP BY hour;
```
User Segmentation
Automatically segment users based on behavior:
```sql
CREATE MATERIALIZED VIEW user_segments_mv TO user_segments AS
SELECT
    user_id,
    COUNT() AS event_count,
    SUM(revenue) AS lifetime_value,
    CASE
        WHEN lifetime_value > 1000 THEN 'vip'
        WHEN lifetime_value > 100 THEN 'regular'
        ELSE 'casual'
    END AS segment
FROM events
GROUP BY user_id;
```
Note that a materialized view aggregates each insert block independently, so `lifetime_value` and `segment` reflect only the rows in that block; for true lifetime values, sum the partials and derive the segment at query time.
Multi-Destination Routing
Route data to multiple aggregations:
```sql
-- High-value orders
CREATE MATERIALIZED VIEW high_value_orders_mv TO high_value_orders AS
SELECT * FROM orders WHERE total_amount > 1000;

-- All orders
CREATE MATERIALIZED VIEW all_orders_mv TO all_orders AS
SELECT * FROM orders;
```
Complete Examples
For comprehensive, production-ready examples covering real-world scenarios, see the examples/ directory:
- Real-time dashboards and metrics aggregation
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Kafka streaming pipelines with end-to-end examples
Requirements
- ClickHouse (any modern release): Incremental materialized views and State/Merge combinators
- ClickHouse 23.12+: Refreshable materialized views (experimental at introduction)
- Python client (for querying views): `pip install clickhouse-driver`
- Kafka/Redpanda: For streaming pipelines (optional)
References
Engine Comparison
references/engine-comparison.md - Detailed comparison of ClickHouse table engines for materialized views:
- SummingMergeTree vs AggregatingMergeTree: When to use each for aggregation
- ReplacingMergeTree: For deduplication and entity updates
- Engine characteristics: Memory usage, merge behavior, query performance
- Decision matrix: Choosing the right engine for your use case
State/Merge Functions Reference
references/state-merge-reference.md - Complete reference for all State/Merge function pairs:
- Function pair mapping (State ↔ Merge)
- Supported aggregation functions
- Parameter specifications and usage
- Performance considerations
- Examples for each function
Kafka Streaming Pipeline
examples/kafka-streaming-pipeline.md - Real-world Kafka integration patterns:
- Consuming from Kafka with the Kafka table engine
- Chaining materialized views with Kafka sources
- Error handling and data validation
- Complete end-to-end pipeline example
- Monitoring Kafka consumption in ClickHouse
Aggregation Patterns
examples/aggregation-patterns.md - SQL pattern examples and common scenarios:
- Real-time dashboards and metrics
- User segmentation and cohort analysis
- Funnel analysis and conversion tracking
- Time-series downsampling and rollups
- Copy-paste ready SQL examples
Complex Aggregates with State/Merge
examples/state-merge-aggregates.md - Advanced aggregation techniques:
- Using uniq, quantile, and topK in views
- Multi-stage aggregation with State/Merge
- Accuracy vs performance tradeoffs
- Real-world performance tuning examples
- Working with approximate functions
Troubleshooting Guide
references/troubleshooting.md - Comprehensive debugging guide for common issues:
- Performance problems and optimization strategies
- Data correctness and consistency issues
- Schema evolution and migration challenges
- Monitoring and health checks
- Common error messages and solutions