Disclaimer: This blog post is automatically generated from project documentation and technical proposals using AI assistance. The content represents our development journey and architectural decisions. Code examples are simplified illustrations and may not reflect the exact production implementation.
Table of Contents
- The Query Complexity Problem
- Three-Tier Improvement Strategy
- Tier 1: Zero-Risk Quick Wins
- Tier 1.5: Match Timing Lookup Table
- Tier 2: Shot Sequence Linking
- Tier 3: Score Denormalization
- Performance Impact
- Key Learnings
The Query Complexity Problem
When Caroline and I first built the ClickHouse analytics for Scores, we followed a pure event sourcing approach: store raw events, derive everything in queries. This kept writes simple but made reads complex.
A typical momentum query looked like this:
WITH match_start AS (
  SELECT match_id, min(event_timestamp) as start_time
  FROM shock.shots
  WHERE match_id = {matchId}
  GROUP BY match_id
),
team_shots AS (
  SELECT
    if(player IN (1, 2), 0, 1) as team,
    floor(date_diff('second', ms.start_time, s.event_timestamp) / 60) as time_window,
    shot_outcome
  FROM shock.shots s
  JOIN match_start ms ON s.match_id = ms.match_id
  WHERE s.match_id = {matchId}
)
SELECT team, time_window, count() as shot_count
FROM team_shots
GROUP BY team, time_window;
Every query recalculated:
- Match start time from scratch
- Team derivation from player numbers
- Time windows from timestamps
This was slow, hard to maintain, and repeated across dozens of queries.
Three-Tier Improvement Strategy
Rather than a big-bang rewrite, Caroline suggested we tackle improvements in tiers based on risk and impact:
graph TB
T1[Tier 1: Quick Wins<br/>Add Derived Columns] --> T2[Tier 2: Shot Linking<br/>Previous/Next References]
T2 --> T3[Tier 3: Denormalization<br/>Score State in Shots]
T1 -.->|"30-40% complexity reduction<br/>Zero risk"| B1[Benefits]
T2 -.->|"Enable forced errors<br/>Rally analysis"| B1
T3 -.->|"3-5x performance boost<br/>Single-table queries"| B1
style T1 fill:#c8e6c9
style T2 fill:#fff9c4
style T3 fill:#ffcdd2
Tier 1: Zero-Risk Quick Wins
The first tier added columns we could calculate during event projection: purely additive schema changes (ALTER TABLE ADD COLUMN), no backfilling, no risk to existing queries.
Team Column
Instead of calculating team from player on every query:
if(player IN (1, 2), 0, 1) as team
We calculate it once during insertion:
const team = player <= 2 ? 0 : 1; // players 1-2 → team 0, players 3-4 → team 1

await clickhouse.insert({
  table: 'shots',
  values: [{ ...event, team }],
});
Sequence Number
ClickHouse doesn’t support auto-increment columns, so we maintain sequence numbers in the projection handler:
class ShotProjection {
  #sequenceCounters = new Map<string, number>();

  async handleShotPlayed(event: ShotPlayedEvent) {
    const currentSeq = this.#sequenceCounters.get(event.matchId) || 0;
    const nextSeq = currentSeq + 1;
    this.#sequenceCounters.set(event.matchId, nextSeq);

    await clickhouse.insert({
      table: 'shots',
      values: [
        {
          ...event,
          sequence_number: nextSeq,
        },
      ],
    });
  }
}
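One caveat with this sketch: the counters live in memory, so they reset if the projection process restarts. A minimal recovery step, assuming the @clickhouse/client query API and a hypothetical rehydrateSequence helper, lazily reloads the last projected sequence number:

async function rehydrateSequence(matchId: string): Promise<number> {
  // Recover the highest sequence number already written for this match
  const result = await clickhouse.query({
    query: 'SELECT max(sequence_number) AS seq FROM shots WHERE match_id = {matchId: String}',
    query_params: { matchId },
    format: 'JSONEachRow',
  });
  const rows = await result.json<{ seq: number }>();
  return rows[0]?.seq ?? 0;
}

Called on a cache miss before assigning the next number, this keeps sequences monotonic across restarts.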
Match Start Time & Time Window
Instead of subquerying for match start:
WITH match_start AS (
  SELECT match_id, min(event_timestamp) as start_time
  FROM shock.shots
  GROUP BY match_id
)
Cache it on first shot and calculate windows:
// Cache the match start on the first shot we see
if (!this.#matchStarts.has(matchId)) {
  this.#matchStarts.set(matchId, event.timestamp);
}
const matchStart = this.#matchStarts.get(matchId)!;

const secondsElapsed = (event.timestamp - matchStart) / 1000;
const timeWindow = Math.floor(secondsElapsed / 60); // Minute bins

await clickhouse.insert({
  table: 'shots',
  values: [
    {
      ...event,
      match_start_time: matchStart,
      time_window: timeWindow,
    },
  ],
});
Query Simplification
Before:
WITH match_start AS (...), team_calc AS (...)
SELECT
  if(player IN (1, 2), 0, 1) as team,
  floor(date_diff('second', ms.start_time, event_timestamp) / 60) as time_window
FROM shock.shots s
JOIN match_start ms ON s.match_id = ms.match_id
After:
SELECT team, time_window
FROM shock.shots
WHERE match_id = {matchId}
Result: 30-40% complexity reduction with 30 minutes of work.
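For reference, every Tier 1 column can be added without rewriting existing data. A hedged sketch of the additive migration (column names come from the sections above; the types are assumptions, not the production schema):

ALTER TABLE shock.shots
  ADD COLUMN IF NOT EXISTS team UInt8,
  ADD COLUMN IF NOT EXISTS sequence_number UInt32,
  ADD COLUMN IF NOT EXISTS match_start_time DateTime64(3),
  ADD COLUMN IF NOT EXISTS time_window UInt32;

Existing rows simply receive column defaults; only new inserts carry real values, which is part of why this tier was so low-risk.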
Tier 1.5: Match Timing Lookup Table
After implementing Tier 1, Caroline noticed we still had a problem: every momentum and pressure query recalculated match start time using MIN(event_timestamp). This was appearing in CTEs across dozens of queries.
The initial solution in Tier 1 cached the first shot timestamp in the projection handler’s memory. But this had issues:
- State management complexity: The projection handler needed to maintain a cache
- Data duplication: We could store match_start_time in every shot row, but that’s wasteful
- Derived data: We were inferring match start from the first shot, not from explicit domain events
The Domain-Driven Solution
Instead of caching or denormalizing, we added explicit timing events to the domain layer:
// New domain events
interface MatchStartedEvent {
  matchId: string;
  occurredAt: string; // ISO8601 timestamp
}

interface MatchFinishedEvent {
  matchId: string;
  occurredAt: string;
}
The WithTimingMixin emits these events at the right moments:
export const WithTiming = <T extends Constructor<Match>>(Base: T) =>
  class extends Base {
    playShot(shot: Shot): void {
      // Emit MatchStarted BEFORE the first shot
      if (this.startedAt === null) {
        this.emit('MatchStarted', {
          matchId: this.id,
          occurredAt: shot.playedAt,
        });
      }

      super.playShot(shot); // Call the WithShots mixin

      // Emit MatchFinished when the match completes
      if (this.isFinished && this.finishedAt === null) {
        this.emit('MatchFinished', {
          matchId: this.id,
          occurredAt: shot.playedAt,
        });
      }
    }
  };
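For context, the mixin composes like any other. A hypothetical assembly (the WithShots mixin and the Match base class are assumed from elsewhere in the codebase):

// Hypothetical composition; constructor arguments elided
class TimedMatch extends WithTiming(WithShots(Match)) {}

const match = new TimedMatch();
match.playShot(shot); // first call emits MatchStarted, then records the shot
// ...and the winning shot additionally emits MatchFinished

Because WithTiming wraps playShot and delegates via super, the timing behavior layers on without the shot-tracking code knowing about it.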
ClickHouse Lookup Table
These timing events project to a dedicated lookup table:
CREATE TABLE shock.match_timing (
  match_id String,
  started_at DateTime64(3),
  finished_at Nullable(DateTime64(3))
) ENGINE = ReplacingMergeTree()
ORDER BY match_id;
Why ReplacingMergeTree? During background merges, this engine collapses rows that share the same sort key (match_id). One caveat: it keeps the most recently inserted row rather than merging individual columns, so reads should tolerate not-yet-merged duplicates (see the query sketch after the mappers below). The payoff is that our mappers never need to query before writing; they just insert:
// match-started.mts mapper (stateless!)
function mapMatchStartedEvent(event: MatchStartedEvent) {
  return {
    match_id: event.matchId,
    started_at: new Date(event.occurredAt).getTime(),
    finished_at: null,
  };
}

// match-finished.mts mapper (stateless!)
function mapMatchFinishedEvent(event: MatchFinishedEvent) {
  return {
    match_id: event.matchId,
    started_at: 0, // Placeholder; the real started_at comes from the MatchStarted row
    finished_at: new Date(event.occurredAt).getTime(),
  };
}
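Because merges run asynchronously in the background, a read can briefly see both the MatchStarted row and the MatchFinished placeholder before they collapse. A defensive read, sketched here rather than taken from production, aggregates across duplicates; taking max(started_at) also sidesteps the zero placeholder:

SELECT
  match_id,
  max(started_at) AS started_at,
  max(finished_at) AS finished_at
FROM shock.match_timing
WHERE match_id = {matchId}
GROUP BY match_id;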
Materialized View for Duration
We added a materialized view to pre-compute match duration:
CREATE MATERIALIZED VIEW shock.match_duration
ENGINE = ReplacingMergeTree()
ORDER BY match_id
AS SELECT
  match_id,
  started_at,
  finished_at,
  dateDiff('second', started_at, finished_at) as duration_seconds
FROM shock.match_timing
WHERE finished_at IS NOT NULL;
ClickHouse populates this view automatically on every insert into match_timing; no application code needed.
Query Transformation
Before (with CTE):
WITH match_start AS (
  SELECT match_id, min(event_timestamp) as start_time
  FROM shock.shots
  WHERE match_id = {matchId}
  GROUP BY match_id
)
SELECT
  floor(dateDiff('second', ms.start_time, s.event_timestamp) / 60) as time_window,
  count() as shots
FROM shock.shots s
JOIN match_start ms ON s.match_id = ms.match_id
WHERE s.match_id = {matchId}
GROUP BY time_window;
After (simple JOIN):
SELECT
  floor(dateDiff('second', mt.started_at, s.event_timestamp) / 60) as time_window,
  count() as shots
FROM shock.shots s
JOIN shock.match_timing mt ON s.match_id = mt.match_id
WHERE s.match_id = {matchId}
GROUP BY time_window;
For duration, just query the materialized view:
SELECT duration_seconds
FROM shock.match_duration
WHERE match_id = {matchId};
Benefits
- No more MIN() CTEs: Eliminated from all momentum/pressure queries
- Event-driven accuracy: Timing comes from explicit domain events, not derived from shots
- Stateless projections: Mappers don’t need to maintain caches or query for state
- Single row per match: Not duplicated to every shot (efficient storage)
- Automatic duration: Materialized view handles calculation
- Clean separation: Timing data separate from shot data
This approach demonstrates a key event sourcing principle: when you need derived data, sometimes the answer is to make it explicit in your domain model rather than calculating it in infrastructure.
Tier 2: Shot Sequence Linking
The second tier solved “forced error” calculations. A forced error is when your shot causes the opponent to miss. This requires linking consecutive shots:
sequenceDiagram
participant S1 as Shot 1
participant S2 as Shot 2
Note over S1: Player 1 hits hard shot
Note over S2: Player 2 makes forced error
S1->>S2: next_shot_event_id
S2->>S1: prev_shot_event_id
Note over S1: S1 gets credit<br/>next_shot_outcome = FORCED_ERROR
We added three columns:
- prev_shot_event_id: Reference to the previous shot
- next_shot_event_id: Reference to the next shot
- next_shot_outcome: Outcome of the next shot (updated asynchronously)
The projection handler maintains shot chains:
class ShotProjection {
  #lastShots = new Map<string, string>(); // matchId -> eventId

  async handleShotPlayed(event: ShotPlayedEvent) {
    const prevEventId = this.#lastShots.get(event.matchId);

    // Insert the current shot with a reference to the previous shot
    await clickhouse.insert({
      table: 'shots',
      values: [
        {
          ...event,
          prev_shot_event_id: prevEventId,
        },
      ],
    });

    // Update the previous shot's next references. ClickHouse has no row-level
    // UPDATE, so this issues an asynchronous ALTER TABLE ... UPDATE mutation
    // (shown here via the @clickhouse/client command API)
    if (prevEventId) {
      await clickhouse.command({
        query: `
          ALTER TABLE shots
          UPDATE next_shot_event_id = {nextId: String},
                 next_shot_outcome = {outcome: String}
          WHERE event_id = {prevId: String}
        `,
        query_params: {
          nextId: event.id,
          outcome: event.outcome,
          prevId: prevEventId,
        },
      });
    }

    this.#lastShots.set(event.matchId, event.id);
  }
}
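One trade-off worth naming: ALTER TABLE ... UPDATE is an asynchronous mutation and comparatively heavyweight in ClickHouse. A mutation-free alternative (a sketch, not what we shipped) derives the same link at read time with a window function over the Tier 1 sequence_number column:

SELECT
  event_id,
  leadInFrame(shot_outcome, 1) OVER (
    PARTITION BY match_id
    ORDER BY sequence_number
    ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
  ) AS next_shot_outcome
FROM shock.shots
WHERE match_id = {matchId};

We accepted the write-time mutation so that read queries stay free of window functions.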
Now forced error queries are simple:
SELECT player, count() as forced_errors_created
FROM shock.shots
WHERE match_id = {matchId}
  AND next_shot_outcome = 'FORCED_ERROR'
GROUP BY player;
Tier 3: Score Denormalization
The biggest performance win came from denormalizing score state into the shots table. Previously, scores lived in a separate score_changes table requiring complex joins:
SELECT s.*, sc.current_game_point_0, sc.current_game_point_1,
  lagInFrame(sc.current_game_point_0, 1) OVER (...) as prev_game_point_0
FROM shock.shots s
LEFT JOIN shock.score_changes sc ON s.match_id = sc.match_id
  AND sc.event_timestamp <= s.event_timestamp
We moved score state directly into shots:
async handleShotPlayed(event: ShotPlayedEvent) {
  const currentScore = await this.getLatestScore(event.matchId);

  await clickhouse.insert({
    table: 'shots',
    values: [
      {
        ...event,
        game_point_0: currentScore.game0,
        game_point_1: currentScore.game1,
        prev_game_point_0: currentScore.prevGame0,
        prev_game_point_1: currentScore.prevGame1,
        is_game_point: this.isGamePoint(currentScore),
        is_deuce: currentScore.game0 >= 40 && currentScore.game1 >= 40,
      },
    ],
  });
}
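The getLatestScore and isGamePoint helpers aren’t shown here. As one illustration, a hypothetical isGamePoint using the same tennis-style numeric encoding as the is_deuce check above (not the production rule set):

// Hypothetical: points encoded as 0/15/30/40, with advantage stored as > 40
function isGamePoint(score: { game0: number; game1: number }): boolean {
  // Someone is at 40 or better and the scores differ (40-40 deuce is not game point)
  return (score.game0 >= 40 || score.game1 >= 40) && score.game0 !== score.game1;
}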
Query transformation:
Before (with JOIN):
SELECT s.*, sc.game_point_0, sc.game_point_1
FROM shock.shots s
LEFT JOIN shock.score_changes sc ...
WHERE s.match_id = {matchId}
After (single table):
SELECT event_id, game_point_0, game_point_1, is_game_point
FROM shock.shots
WHERE match_id = {matchId}
Performance Impact
The multi-tier approach delivered measurable improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Query Complexity | 100% | 20-40% | 60-80% reduction |
| Avg Query Time | 250ms | 50-80ms | 3-5x faster |
| Lines of SQL | 45-60 | 10-20 | 70% less code |
| MIN() CTEs Eliminated | ~30 | 0 | 100% |
| Maintainability | Low | High | Easier debugging |
Specific query examples:
Momentum Analysis:
- Before: 52 lines SQL, 3 CTEs, 2 JOINs, 180ms
- After: 15 lines SQL, 0 CTEs, 0 JOINs, 45ms
Pressure Analysis:
- Before: 68 lines SQL, 4 CTEs, 3 JOINs, 280ms
- After: 22 lines SQL, 1 CTE, 0 JOINs, 75ms
Key Learnings
Working through query optimization with Caroline taught us several lessons:
Denormalization Has Costs: We’re writing more data (score state in every shot record), but reads dominate our workload. The trade-off makes sense.
Incremental Improvement Works: The three-tier approach let us see results quickly and adjust strategy. A big-bang rewrite would have been riskier.
Projection Complexity Increases: Adding derived columns made the projection handler more complex, but this complexity is centralized and testable.
ClickHouse Loves Denormalization: Column-oriented storage makes wide tables cheap. In PostgreSQL, we’d worry about row bloat. In ClickHouse, it’s optimal.
Event Sourcing Insight: We have all the data we need during event processing. Pre-computing at write time is essentially free since we’re already handling the events.
Measure Everything: We instrumented queries before and after to validate improvements. Without metrics, it’s just guessing.
Domain Events vs Infrastructure: Tier 1.5 taught us that sometimes the best place to solve infrastructure problems is in the domain layer. By making match timing explicit in our domain events, we got cleaner projections and better query performance.
ReplacingMergeTree is Magic: ClickHouse’s ability to automatically merge records based on sort key eliminated the need for stateful projection handlers. This pattern is incredibly powerful for event sourcing.
Materialized Views Save Time: Instead of calculating duration in application code or queries, we let ClickHouse handle it automatically. This is a perfect example of pushing computation to the database when it makes sense.
This optimization demonstrates how understanding your database’s strengths can dramatically simplify application code. ClickHouse excels at wide, denormalized tables with pre-computed values, and specialized engines like ReplacingMergeTree enable stateless event processing patterns that would be complex in traditional databases.
The simplified queries are easier for new developers to understand and modify. When we add new analytics features, we can focus on business logic instead of fighting with CTEs and window functions. Caroline was right: sometimes the best optimization is just storing the data you need instead of calculating it every time—and sometimes it’s making that data explicit in your domain model.