feat(clickhouse): support LowCardinality, FixedString, CODEC, and SAMPLE BY #8

Open
lohanidamodar wants to merge 4 commits into main from feat/clickhouse-schema-extras

@lohanidamodar
Contributor

Summary

Adds support for four ClickHouse schema features that are common in
production OLAP workloads but currently can't be expressed via the
schema builder, forcing users to drop down to raw DDL — exactly what
a typed schema builder is meant to prevent.

Each addition lives in src/Query/Schema/ alongside the existing
ClickHouse modifiers (ttl, engine, orderBy, settings,
skip-index algorithms). Other dialects throw UnsupportedException
at compile time so misuse is caught early.

What's new

LowCardinality(T) column modifier

$schema->table('events')
    ->bigInteger('id')->primary()
    ->string('status')->lowCardinality()
    ->string('country')->lowCardinality()->nullable()
    ->create();
// ... `status` LowCardinality(String), `country` Nullable(LowCardinality(String)) ...

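LowCardinality is a standard ClickHouse storage modifier for string
columns with a bounded number of distinct values: status enums, type
discriminators, country/category codes. Dictionary encoding cuts storage
and accelerates reads, and omitting it on such columns in production
OLAP schemas is generally considered an anti-pattern. The builder applies
Nullable outside LowCardinality, emitting Nullable(LowCardinality(String)).
ClickHouse's documented nesting is the reverse,
LowCardinality(Nullable(String)), so this wrapping order should be
verified against a running server.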

FixedString(N) column type

$schema->table('locations')
    ->fixedString('country_code', 2)   // ISO 3166-1 alpha-2
    ->fixedString('currency_code', 3)  // ISO 4217
    ->fixedString('digest', 32)        // hex-encoded MD5 (raw MD5 is 16 bytes)
    ->create();

Fixed-length strings are generally more compact than String when the
byte length is known and constant (ISO codes, hash digests, fixed-width
identifiers), because ClickHouse does not store a per-value length.
This PR adds Table::fixedString($name, $length) plus a matching
forwarder on Column. Length must be at least 1.
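
The length rule and the emitted type can be sketched as follows. This is a minimal illustration rather than the PR's code: the function name `compileFixedString` and the use of `InvalidArgumentException` are assumptions, since the description only states that the length must be at least 1.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: validate the length and build the ClickHouse type string.
function compileFixedString(int $length): string
{
    if ($length < 1) {
        // The exception class the PR actually throws is not shown; this one is assumed.
        throw new InvalidArgumentException('FixedString length must be at least 1.');
    }

    return "FixedString({$length})";
}

echo compileFixedString(2), PHP_EOL; // FixedString(2)
```

Rejecting the length when the column is configured, rather than when the DDL is compiled, keeps the error close to the call that caused it.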

Column-level CODEC(...) clauses

$schema->table('metrics')
    ->bigInteger('id')->primary()
    ->datetime('ts', 3)->codec('Delta(4)')->codec('LZ4')   // monotonic timestamps
    ->bigInteger('value')->codec('T64')->codec('LZ4')      // integer column
    ->string('payload')->codec('ZSTD(3)')                  // text column
    ->create();

Multiple codec() calls accumulate and emit
CODEC(c1, c2, ...). Each codec string is emitted verbatim, so
arguments live inline ('Delta(4)', 'ZSTD(3)') and the modifier
stays a thin wrapper around the underlying DDL. Empty strings and
semicolons are rejected at configure time. Column-level codecs are a
core ClickHouse feature for tuning storage size and read throughput;
the schema builder couldn't express them before this PR.
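
The accumulate-then-emit behaviour described above can be sketched in isolation. The class name `CodecList` and its `compile()` method are hypothetical stand-ins for the state the PR keeps on Column; only the validation rules and the `CODEC(c1, c2, ...)` shape come from the description.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the codec state kept on a column.
final class CodecList
{
    /** @var list<string> */
    private array $codecs = [];

    public function codec(string $spec): self
    {
        // Reject empty strings and semicolons at configure time.
        if ($spec === '' || str_contains($spec, ';')) {
            throw new InvalidArgumentException('Invalid codec spec.');
        }
        $this->codecs[] = $spec;

        return $this;
    }

    public function compile(): string
    {
        // No codecs configured: emit nothing.
        if ($this->codecs === []) {
            return '';
        }

        // Specs are emitted verbatim, in call order.
        return 'CODEC(' . implode(', ', $this->codecs) . ')';
    }
}

echo (new CodecList())->codec('Delta(4)')->codec('LZ4')->compile(), PHP_EOL;
// CODEC(Delta(4), LZ4)
```

Because each spec is passed through untouched, the semicolon check is the only barrier against statement injection in this sketch; callers remain responsible for supplying valid codec names.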

SAMPLE BY table option

$schema->table('events')
    ->bigInteger('id')->primary()
    ->bigInteger('user_id')->unsigned()
    ->sampleBy('user_id')
    ->create();
// ... ENGINE = MergeTree() ORDER BY (`id`) SAMPLE BY user_id

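SAMPLE BY enables approximate-query support
(SELECT ... SAMPLE k) and must be declared at table creation time.
It is emitted after ORDER BY and before TTL / SETTINGS, and is
rejected on engines that don't take an ORDER BY clause (Memory, Log,
TinyLog, StripeLog). ClickHouse also requires the sampling expression
to be part of the primary key, so the example above would additionally
need user_id in its ORDER BY to be accepted by a server.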
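
The clause ordering and the engine check can be sketched as a standalone function. The name `compileTableOptions` and its parameters are hypothetical; the engine list, the validation rules, and the ORDER BY / SAMPLE BY / TTL ordering are taken from the description. The usage line puts `user_id` in ORDER BY because ClickHouse requires the sampling expression to be part of the primary key.

```php
<?php
declare(strict_types=1);

// Engines listed in the description as not taking an ORDER BY clause.
const ENGINES_WITHOUT_ORDER_BY = ['Memory', 'Log', 'TinyLog', 'StripeLog'];

// Hypothetical helper: assemble the table-option tail of a CREATE TABLE statement.
function compileTableOptions(string $engine, array $orderBy, ?string $sampleBy, ?string $ttl = null): string
{
    if ($sampleBy !== null) {
        if ($sampleBy === '' || str_contains($sampleBy, ';')) {
            throw new InvalidArgumentException('Invalid SAMPLE BY expression.');
        }
        if (in_array($engine, ENGINES_WITHOUT_ORDER_BY, true)) {
            throw new InvalidArgumentException("SAMPLE BY is not supported on {$engine}.");
        }
    }

    $parts = ["ENGINE = {$engine}()"];
    if ($orderBy !== []) {
        $quoted = array_map(fn (string $c): string => "`{$c}`", $orderBy);
        $parts[] = 'ORDER BY (' . implode(', ', $quoted) . ')';
    }
    if ($sampleBy !== null) {
        $parts[] = "SAMPLE BY {$sampleBy}"; // expression is emitted verbatim
    }
    if ($ttl !== null) {
        $parts[] = "TTL {$ttl}";
    }

    return implode(' ', $parts);
}

echo compileTableOptions('MergeTree', ['id', 'user_id'], 'user_id'), PHP_EOL;
// ENGINE = MergeTree() ORDER BY (`id`, `user_id`) SAMPLE BY user_id
```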

Why these specifically

The schema builder can already model the standard MergeTree shape, but
production ClickHouse schemas almost always reach for one or more of
these modifiers. Without them, users have to fall back to raw DDL,
which defeats the purpose of a typed builder.

The patches follow the same dialect pattern as the existing ttl,
engine, orderBy, settings, and skip-index features added in #6:
state lives on Column / Table, ClickHouse compiles it, and base
Schema / PostgreSQL / SQLite overrides throw
UnsupportedException so misuse on the wrong dialect is caught at
compile time.
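
A reduced sketch of that pattern, using LowCardinality as the example. The class names `ColumnState`, `BaseDialect`, and `ClickHouseDialect` are hypothetical and far smaller than the real Column / Schema / ClickHouse classes; `UnsupportedException` is redefined locally only so the snippet runs on its own.

```php
<?php
declare(strict_types=1);

// Local stand-in for the library's exception class.
final class UnsupportedException extends RuntimeException {}

// Hypothetical, heavily reduced column state.
final class ColumnState
{
    public function __construct(
        public string $name,
        public string $type,
        public bool $isLowCardinality = false,
    ) {}
}

class BaseDialect
{
    public function compileColumnDefinition(ColumnState $column): string
    {
        // Base guard: dialects that inherit this method reject the modifier.
        if ($column->isLowCardinality) {
            throw new UnsupportedException('LowCardinality is only supported by ClickHouse.');
        }

        return "`{$column->name}` {$column->type}";
    }
}

final class ClickHouseDialect extends BaseDialect
{
    public function compileColumnDefinition(ColumnState $column): string
    {
        // ClickHouse override: compile the modifier instead of rejecting it.
        $type = $column->isLowCardinality
            ? "LowCardinality({$column->type})"
            : $column->type;

        return "`{$column->name}` {$type}";
    }
}

$status = new ColumnState('status', 'String', isLowCardinality: true);
echo (new ClickHouseDialect())->compileColumnDefinition($status), PHP_EOL;
// `status` LowCardinality(String)
```

The guard only protects dialects that actually route through the guarded method; a dialect that builds its output another way needs its own check.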

Out of scope (planned follow-ups)

  • ClickHouse aggregate selectors (uniqExact, uniq, uniqCombined,
    uniqHLL12) on Builder — would let users express
    ClickHouse-native exact and approximate distinct-count aggregates
    without dropping to raw expressions.
  • Time-bucket helpers (toStartOfHour, toStartOfDay, toStartOfWeek,
    toStartOfMonth, toStartOfMinute) on Builder — for time-series
    rollups in SELECT and GROUP BY.

These are query-builder features rather than schema features, so a
separate PR keeps this one focused.

Tests

  • New ClickHouse schema tests in tests/Query/Schema/ClickHouseTest.php
    asserting exact DDL output for each feature, plus validation-error
    coverage (zero length, empty/semicolon codec, empty/semicolon SAMPLE
    BY, SAMPLE BY on a non-ORDER BY engine).
  • Cross-dialect throw tests in
    tests/Query/Schema/FluentBuilderTest.php covering LowCardinality,
    FixedString, column CODEC, and SAMPLE BY on MySQL / PostgreSQL /
    SQLite.

Test plan

  • composer test (5197 tests pass)
  • composer lint (Pint passes)
  • composer check (PHPStan max passes)

Adds a `FixedString(N)` column type for ClickHouse, used for fixed-length
string columns whose byte length is known and constant — ISO 3166 country
codes, ISO 4217 currency codes, hash digests, and similar values that
benefit from ClickHouse's columnar storage of fixed-width data.

Exposed via `Table::fixedString($name, $length)` and the matching
`Column` forwarder. Length must be at least 1. Other dialects throw
`UnsupportedException` at compile time via the dialect `compileColumnType`
match; the type has no portable mapping.

Adds a `lowCardinality()` modifier on `Column` that wraps the column
type in `LowCardinality(...)` when compiled for ClickHouse. Useful for
string columns with a small number of distinct values — status enums,
type discriminators, country codes, category labels — where dictionary
encoding cuts storage and accelerates reads.

`Nullable` is applied outside `LowCardinality` to match ClickHouse's
required wrapping order: `Nullable(LowCardinality(String))`.

Other dialects throw `UnsupportedException` at compile time via the
base `Schema::compileColumnDefinition` (and the matching guards in
`PostgreSQL` and `SQLite` overrides).

Adds a `codec()` modifier on `Column` that accumulates codec specs and
emits `CODEC(c1, c2, ...)` after the column type when compiled for
ClickHouse. Multiple calls chain — `->codec('Delta(4)')->codec('LZ4')`
emits `CODEC(Delta(4), LZ4)`.

Each codec string is emitted verbatim; arguments are passed inline
(`'Delta(4)'`, `'ZSTD(3)'`) so the modifier stays a thin wrapper around
the underlying DDL. Empty strings and semicolons are rejected at
configure time.

Other dialects throw `UnsupportedException` at compile time via the
base `Schema::compileColumnDefinition` and the matching guards in the
`PostgreSQL` and `SQLite` overrides.

Adds a `Table::sampleBy($expression)` method (with a forwarder on
`Column`) that registers a ClickHouse `SAMPLE BY` clause. Emitted
after `ORDER BY` and before `TTL` / `SETTINGS` at table creation
time, this enables the approximate-query path
(`SELECT ... SAMPLE k`) on MergeTree-family engines.

The expression is emitted verbatim and must not be empty or contain
a semicolon. SAMPLE BY is rejected on engines that don't take an
`ORDER BY` clause (`Memory`, `Log`, `TinyLog`, `StripeLog`) since
sampling has no meaning there.

Other dialects throw `UnsupportedException` at compile time via the
base `Schema::compileCreate`.
@github-actions
github-actions Bot commented May 7, 2026

📊 Coverage

| Metric  | PR                 | Baseline | Δ      |
|---------|--------------------|----------|--------|
| Lines   | 91.89% (7094/7720) | 91.89%   | +0.00% |
| Methods | 84.78% (1047/1235) | 84.70%   | +0.07% |
| Classes | 62.50% (105/168)   | 62.50%   | +0.00% |

Full per-file breakdown in the job summary.

@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR adds four ClickHouse-specific schema features to the fluent builder: LowCardinality(T) column modifier, FixedString(N) column type, column-level CODEC(...) clauses, and a SAMPLE BY table option. The implementation follows the established pattern from PR #6 — state on Column/Table, compiled in ClickHouse, guards in every other dialect.

  • ClickHouse.php correctly compiles all four features, including wrapping order (LowCardinality inside Nullable) and clause ordering (CODEC after DEFAULT, SAMPLE BY after ORDER BY).
  • Schema.php, PostgreSQL.php, and SQLite.php add UnsupportedException guards; MySQL/SQL inherit the base guards automatically.
  • MongoDB.php fully overrides compileCreate()/compileAlter() and never calls compileColumnDefinition(), so the guards for lowCardinality, codec, and sampleBy are bypassed and those modifiers are silently ignored rather than rejected.

Confidence Score: 3/5

The ClickHouse compilation path and all SQL-family dialects are correct, but MongoDB silently accepts and drops the three new ClickHouse-only modifiers instead of rejecting them.

The MongoDB adapter overrides both compileCreate and compileAlter without calling the parent, and calls only compileColumnType per column — skipping every guard added in this PR for lowCardinality, codecs, and sampleBy. Callers using a MongoDB schema with these modifiers get no error and silently incorrect output, which contradicts the documented compile-time rejection guarantee.

src/Query/Schema/MongoDB.php needs explicit guards for isLowCardinality, codecs, and sampleBy; tests/Query/Schema/FluentBuilderTest.php should gain matching MongoDB throw-tests.

Important Files Changed

| Filename | Overview |
|----------|----------|
| src/Query/Schema/MongoDB.php | Only adds FixedString rejection in compileColumnType; silently ignores lowCardinality, codec, and sampleBy because compileCreate/compileAlter are fully overridden and never call compileColumnDefinition or the base compileCreate guards |
| src/Query/Schema/ClickHouse.php | Correctly adds FixedString type, LowCardinality wrapping (before Nullable), CODEC clause (after DEFAULT, before TTL), and SAMPLE BY (after ORDER BY, before TTL/SETTINGS); engine guard for SAMPLE BY on non-ORDER BY engines is correct |
| src/Query/Schema/Column.php | Adds isLowCardinality flag, codecs accumulating list, lowCardinality()/codec() methods with proper validation, and fixedString()/sampleBy() forwarders |
| src/Query/Schema/Table.php | Adds sampleBy property, fixedString() column builder (length >= 1 guard), and sampleBy() table option with empty/semicolon validation |
| src/Query/Schema.php | Adds sampleBy and column-level lowCardinality/codecs guards in base compileCreate and compileColumnDefinition; covers MySQL/SQL dialects that don't override these methods |
| tests/Query/Schema/FluentBuilderTest.php | Cross-dialect throw tests cover MySQL, PostgreSQL, and SQLite but not MongoDB; MongoDB gap aligns with the P1 finding |

Comments Outside Diff (1)

  1. src/Query/Schema/MongoDB.php, lines 48-104

    P1 MongoDB silently ignores lowCardinality(), codec(), and sampleBy()

    MongoDB overrides compileCreate() and compileAlter() from scratch — neither method calls parent::compileCreate() — and it only calls compileColumnType() per column, never compileColumnDefinition(). The base Schema::compileCreate() sampleBy guard and the Schema::compileColumnDefinition() guards for isLowCardinality and codecs are therefore never reached. All three modifiers are silently accepted and dropped from the output, contrary to the PR's stated contract that "other dialects throw UnsupportedException at compile time."
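
    One way to close the gap is an explicit guard that a self-contained dialect calls before building its output. The sketch below is an assumption-heavy illustration, not the MongoDB adapter's code: the function name `rejectClickHouseOnlyModifiers` and the array shape used for columns are hypothetical, and `UnsupportedException` is redefined locally so the snippet runs on its own.

```php
<?php
declare(strict_types=1);

// Local stand-in for the library's exception class.
final class UnsupportedException extends RuntimeException {}

/**
 * Hypothetical guard for a dialect that builds its output without calling
 * the base compileCreate() / compileColumnDefinition() methods.
 *
 * @param list<array{name: string, isLowCardinality?: bool, codecs?: list<string>}> $columns
 */
function rejectClickHouseOnlyModifiers(array $columns, ?string $sampleBy): void
{
    if ($sampleBy !== null) {
        throw new UnsupportedException('SAMPLE BY is only supported by ClickHouse.');
    }

    foreach ($columns as $column) {
        if ($column['isLowCardinality'] ?? false) {
            throw new UnsupportedException("LowCardinality on `{$column['name']}` is only supported by ClickHouse.");
        }
        if (($column['codecs'] ?? []) !== []) {
            throw new UnsupportedException("CODEC on `{$column['name']}` is only supported by ClickHouse.");
        }
    }
}

// Plain columns pass through without an exception.
rejectClickHouseOnlyModifiers([['name' => 'id']], null);
```

    Calling such a guard at the top of both compileCreate() and compileAlter() would restore the documented behaviour of throwing instead of silently dropping the modifiers.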

Reviews (1): Last reviewed commit: "feat(clickhouse): support SAMPLE BY tabl..."
