AgentSkillsCN

clickhouse-antipatterns

在Gen200–Gen600屏障框架中,发现ClickHouse SQL的反模式与性能限制。在编写ClickHouse SQL、创建新的屏障/模式生成、修改数组函数、窗口函数、参数扫描、前向数组,或遇到慢查询、内存溢出、NULL值价格、屏障检测错误,或arrayFirstIndex返回0等问题时使用此功能。触发条件——ClickHouse SQL、屏障SQL、数组函数、窗口函数、尾随止损SQL、参数扫描、慢查询、内存溢出、arrayFirstIndex、leadInFrame、groupArray、arrayFold、arrayScan、阈值相对、反模式、性能限制、前向数组、自连接、O(N×M)。

SKILL.md
--- frontmatter
name: clickhouse-antipatterns
description: ClickHouse SQL anti-patterns and performance constraints discovered during Gen200-Gen600 barrier framework. Use when writing ClickHouse SQL, creating new barrier/pattern generations, modifying array functions, window functions, parameter sweeps, forward arrays, or encountering slow queries, OOM, NULL entry prices, wrong barrier detection, or arrayFirstIndex returning 0. TRIGGERS - ClickHouse SQL, barrier SQL, array function, window function, trailing stop SQL, parameter sweep, slow query, OOM, arrayFirstIndex, leadInFrame, groupArray, arrayFold, arrayScan, threshold-relative, anti-pattern, performance constraint, forward arrays, self-join, O(N×M).

ClickHouse Anti-Patterns for Range Bar Pattern SQL

Discovered during Gen200-Gen600 Triple Barrier + Hybrid Feature Sweep framework implementation. Each anti-pattern has been validated through production failures and resolved with tested workarounds.

GitHub Issue: #8 - Anti-Pattern Registry

Quick Lookup

IDAnti-PatternSeveritySection
AP-01groupArray memory explosion (2.36 GB)CRITICALArray Functions
AP-02Lambda closure over outer columns (CH #45028)HIGHArray Functions
AP-03arrayFirstIndex returns 0 for not-foundHIGHArray Functions
AP-04arrayMap + arrayReduce O(n^2) complexityMEDIUMArray Functions
AP-05arrayScan does not exist in ClickHouseLOWArray Functions
AP-06arrayFold returns only final valueLOWArray Functions
AP-07leadInFrame default frame excludes next rowHIGHWindow Functions
AP-08arraySlice before arrayFirstIndexMEDIUMSearch Efficiency
AP-09Absolute % params across thresholdsHIGHParameter Grid
AP-10NEVER expanding window, always rolling 1000CRITICALSignal Detection
AP-11TP/SL from signal close, not entry priceMEDIUMBarrier Alignment
AP-12Same-bar TP+SL ambiguity (SL wins)MEDIUMBarrier Alignment
AP-13Gap-down SL execution priceMEDIUMBarrier Alignment
AP-14Self-join forward arrays O(N×M) bottleneckCRITICALForward Arrays

For detailed descriptions with code examples, see references/anti-patterns.md. For infrastructure-specific issues, see references/infrastructure.md.

Critical Rules (Never Violate)

1. Window-Based Forward Arrays (NOT Self-Join)

sql
-- CORRECT: Pre-compute forward arrays as window functions in base_bars (11s, 1.5 GB)
base_bars AS (
    SELECT *,
        arraySlice(groupArray(high) OVER (
            ORDER BY timestamp_ms ROWS BETWEEN CURRENT ROW AND 101 FOLLOWING
        ), 2, 101) AS fwd_highs,
        arraySlice(groupArray(low) OVER (
            ORDER BY timestamp_ms ROWS BETWEEN CURRENT ROW AND 101 FOLLOWING
        ), 2, 101) AS fwd_lows,
        -- same for fwd_opens, fwd_closes
    FROM range_bars WHERE ...
)
-- Then carry fwd_* arrays through CTEs, no self-join needed

-- WRONG: Self-join for forward arrays (133s, 165 MB but 11x slower)
forward_arrays AS (
    SELECT s.*, groupArray(b.high) AS fwd_highs ...
    FROM signals s INNER JOIN base_bars b ON b.rn BETWEEN s.rn + 1 AND s.rn + 101
    GROUP BY ...
)
-- The range join ON b.rn BETWEEN s.rn + 1 AND s.rn + 101 is O(N×M) — ClickHouse
-- cannot index into a CTE and does a nested loop scan.

Memory tradeoff: Window approach uses ~10x more memory (1.5 GB vs 165 MB) because it computes arrays for ALL bars, not just signals. At 16 parallel queries: 1.5 GB × 16 = 24 GB — safe on 61 GB hosts. For memory-constrained hosts, the self-join is acceptable for sparse patterns (<2% signal coverage).

Gen600 Production Confirmation (2026-02-11): AP-14 window approach confirmed at scale — 284K+ results collected at 3.2 queries/sec (xargs -P16), 3-5s per query regardless of pattern density (sparse 1.2% to dense 49%), zero errors, memory stable at ~24 GB peak (1.5 GB/query × 16 parallel).

Historical note: AP-01 originally recommended self-join over window approach because Gen200 had 1.4M bars × 51 elements = 2.36 GB with the WRONG window frame (ROWS BETWEEN 1 FOLLOWING AND 51 FOLLOWING on ALL bars). The CORRECT window approach uses arraySlice(..., 2, 101) on ROWS BETWEEN CURRENT ROW AND 101 FOLLOWING — slicing off the current row. Gen600 benchmarking proved the window approach is 11x faster for dense patterns (36K+ signals) where the self-join becomes the dominant bottleneck.

2. Pre-Compute Barrier Prices as Columns

sql
-- CORRECT: Pre-compute in separate CTE (avoids CH bug #45028)
param_with_prices AS (
    SELECT *, entry_price * (1.0 + tp_mult * 0.025) AS tp_price FROM param_expanded
),
barrier_scan AS (
    SELECT arrayFirstIndex(x -> x >= tp_price, ...) AS raw_tp_bar FROM param_with_prices
)

-- WRONG: Lambda closure over outer column
SELECT arrayFirstIndex(x -> x >= entry_price * (1.0 + tp_mult * 0.025), fwd_highs)

3. Always Guard arrayFirstIndex with > 0

sql
-- CORRECT: Explicit 0-not-found guards
CASE
    WHEN raw_sl_bar > 0 AND raw_tp_bar > 0 AND raw_sl_bar <= raw_tp_bar THEN 'SL'
    WHEN raw_sl_bar > 0 AND raw_tp_bar > 0 AND raw_tp_bar < raw_sl_bar THEN 'TP'
    WHEN raw_sl_bar > 0 AND raw_tp_bar = 0 THEN 'SL'
    WHEN raw_tp_bar > 0 AND raw_sl_bar = 0 THEN 'TP'
    WHEN window_bars >= max_bars THEN 'TIME'
    ELSE 'INCOMPLETE'
END

-- WRONG: No guard (0 < any positive = always true)
CASE WHEN raw_tp_bar <= raw_sl_bar THEN 'TP' ELSE 'SL' END

4. leadInFrame Requires UNBOUNDED FOLLOWING

sql
-- CORRECT: Explicit frame includes next row
leadInFrame(open, 1) OVER (
    ORDER BY timestamp_ms
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS entry_price

-- WRONG: Default frame excludes next row, returns NULL
leadInFrame(open, 1) OVER (ORDER BY timestamp_ms) AS entry_price

5. Threshold-Relative Parameters

sql
-- CORRECT: Multipliers scale with threshold
entry_price * (1.0 + tp_mult * 0.025) AS tp_price  -- @250dbps
entry_price * (1.0 + tp_mult * 0.05)  AS tp_price  -- @500dbps

-- WRONG: Absolute percentages (don't scale)
entry_price * (1.0 + 0.01) AS tp_price  -- Same 1% regardless of threshold

6. NEVER Expanding Window — Always Rolling 1000-Bar

sql
-- CORRECT: Rolling 1000-bar window
quantileExactExclusive(0.95)(trade_intensity) OVER (
    ORDER BY timestamp_ms
    ROWS BETWEEN 999 PRECEDING AND 1 PRECEDING
) AS ti_p95_rolling

-- WRONG: Expanding window (inflates early-data quality, produces false-positive Kelly)
quantileExactExclusive(0.95)(trade_intensity) OVER (
    ORDER BY timestamp_ms
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS ti_p95_expanding

Post-Change Checklist

After modifying ANY Gen200+ SQL file:

  • Forward arrays use window-based approach in base_bars (NOT self-join) — see AP-14 / Critical Rule #1
  • tp_price/sl_price pre-computed as columns (not in lambda)
  • All arrayFirstIndex comparisons have > 0 guards
  • leadInFrame uses ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  • Parameters use threshold-relative multipliers
  • ALL quantiles use rolling 1000-bar window (ROWS BETWEEN 999 PRECEDING AND 1 PRECEDING), NEVER expanding (UNBOUNDED PRECEDING)
  • Warmup guard rn > 1000 present for rolling window stability
  • SL exit price uses least(open, sl_price) for gap-down
  • TP exit price is exactly tp_price (limit fill)
  • Same-bar TP+SL: SL wins (raw_sl_bar <= raw_tp_bar)
  • arraySlice applied before arrayFirstIndex search
  • Query completes < 10s on any pattern density (AP-14 window approach)