ClickHouse to HeliosDB Migration Guide¶
Version: 1.0 Last Updated: January 2026 Compatibility: HeliosDB 7.0+, ClickHouse 22.x/23.x/24.x
Table of Contents¶
- Introduction
- Compatibility Overview
- Pre-Migration Assessment
- Connection String Migration
- Data Type Mapping
- Table Engine Mapping
- Query Syntax Migration
- Materialized View Migration
- Data Migration Process
- Application Connectivity
- Performance Considerations
- Troubleshooting Common Issues
- Post-Migration Validation
- Appendix
1. Introduction¶
Why Migrate from ClickHouse to HeliosDB?¶
HeliosDB provides 92% ClickHouse compatibility while offering significant advantages for modern data platforms:
Key Benefits¶
| Benefit | Description |
|---|---|
| Multi-Protocol Access | Access analytics data via ClickHouse, PostgreSQL, MySQL, MongoDB, Redis, GraphQL, and REST |
| Unified Data Platform | Single system for OLTP, OLAP, and streaming workloads |
| AI/ML Integration | Native NL2SQL, vector search, and ML model inference |
| Simplified Operations | No ZooKeeper/Keeper dependency, automatic rebalancing |
| Enhanced ACID | Full ACID compliance beyond ClickHouse's eventual consistency |
| Time Travel | Point-in-time queries and data recovery |
| Multi-Model Support | Relational, document, graph, time-series, and vector in one database |
Migration Complexity: Low to Moderate¶
Migrations from ClickHouse are straightforward because of:
- 100% Native Protocol compatibility (TCP port 9000)
- 100% HTTP Protocol compatibility (port 8123)
- Standard ClickHouse drivers work without modification
- Most SQL queries execute identically
- MergeTree family engines fully supported
Scope¶
This guide covers:
- Schema migration for all table types
- Data migration strategies for various dataset sizes
- Application connection string and driver updates
- Table engine mapping to HeliosDB equivalents
- Query syntax differences and translations
- Materialized view migration
- Performance optimization post-migration
Target Audience¶
- Data engineers migrating ClickHouse analytics platforms
- DevOps engineers modernizing data infrastructure
- Architects evaluating unified data platforms
- Application developers updating ClickHouse integrations
2. Compatibility Overview¶
Overall Compatibility: 92%¶
| Category | Coverage | Status |
|---|---|---|
| Native Protocol (TCP) | 100% | Complete |
| HTTP Protocol | 100% | Complete |
| SQL Language | 90% | Minor gaps |
| Table Engines | 85% | Core engines |
| Data Types | 95% | All standard types |
| Aggregation Functions | 95% | Full analytics support |
Protocol Support¶
Native Protocol (TCP Port 9000)¶
| Feature | Status | Notes |
|---|---|---|
| Connection | Supported | Full protocol compatibility |
| Authentication | Supported | Password-based, LDAP |
| Compression | Supported | LZ4, ZSTD, LZ4HC |
| SSL/TLS | Supported | Encrypted connections |
| Query Execution | Supported | Full SQL support |
| Batch Insert | Supported | High throughput |
| Prepared Statements | Supported | Query caching |
| Query Cancellation | Supported | Cancel running queries |
HTTP Protocol (Port 8123)¶
| Feature | Status | Notes |
|---|---|---|
| GET Queries | Supported | Read operations |
| POST Queries | Supported | Write operations |
| Streaming | Supported | Large result sets |
| Compression | Supported | gzip, deflate, br |
| JSON Output | Supported | Multiple formats |
| Progress Tracking | Supported | Query progress |
Driver Compatibility Matrix¶
| Language | Driver | Version | Status |
|---|---|---|---|
| Python | clickhouse-driver | 0.2+ | Fully Compatible |
| Python | clickhouse-connect | 0.6+ | Fully Compatible |
| Go | clickhouse-go | 2.x | Fully Compatible |
| Node.js | @clickhouse/client | 0.2+ | Fully Compatible |
| Java | ClickHouse JDBC | 0.4+ | Fully Compatible |
| Java | ClickHouse Native | 0.4+ | Fully Compatible |
| Rust | clickhouse-rs | 0.12+ | Fully Compatible |
| C# | ClickHouse.Client | 5.x | Fully Compatible |
| PHP | smi2/phpClickHouse | 1.x | Fully Compatible |
clickhouse-client Compatibility¶
# Connect to HeliosDB using standard clickhouse-client
clickhouse-client --host localhost --port 9000
# All standard options work
clickhouse-client \
--host heliosdb.host \
--port 9000 \
--user default \
--password '' \
--database analytics \
--query "SELECT count() FROM events"
3. Pre-Migration Assessment¶
3.1 Pre-Migration Checklist¶
- [ ] Backup ClickHouse data (full backup)
- [ ] Document cluster topology (replicas, shards)
- [ ] Inventory all databases and tables
- [ ] Analyze table engines in use
- [ ] Catalog materialized views
- [ ] Document dictionaries and external tables
- [ ] Estimate total data volume
- [ ] Review ZooKeeper/Keeper dependencies
- [ ] Identify TTL requirements
- [ ] List application connection configurations
- [ ] Plan migration window and rollback strategy
- [ ] Test migration in staging environment
3.2 Database and Table Inventory¶
Export Schema Information¶
# Export all database schemas
clickhouse-client --query "SHOW DATABASES" > databases.txt
# Export table definitions for each database
# (SHOW CREATE TABLE takes a single table, so iterate over each one)
for db in $(grep -v -e '^system$' -e '^INFORMATION_SCHEMA$' databases.txt); do
  for table in $(clickhouse-client --query "SHOW TABLES FROM $db"); do
    clickhouse-client --query "SHOW CREATE TABLE $db.$table" >> "schema_${db}.sql"
    echo ';' >> "schema_${db}.sql"
  done
done
# Export complete schema
clickhouse-client --query "
SELECT database, name, engine, partition_key, sorting_key,
primary_key, total_rows, total_bytes
FROM system.tables
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA')
" --format TSV > table_inventory.tsv
Analyze Table Engines¶
-- List all table engines in use
SELECT
engine,
count() AS table_count,
sum(total_rows) AS total_rows,
formatReadableSize(sum(total_bytes)) AS total_size
FROM system.tables
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA')
GROUP BY engine
ORDER BY table_count DESC;
-- Detailed table information
SELECT
database,
name,
engine,
partition_key,
sorting_key,
primary_key,
total_rows,
formatReadableSize(total_bytes) as size
FROM system.tables
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA')
ORDER BY total_bytes DESC;
3.3 Feature Usage Inventory¶
Document usage of the following features:
| Feature | Used? | Tables Affected | HeliosDB Support |
|---|---|---|---|
| MergeTree | Full | ||
| ReplacingMergeTree | Full | ||
| SummingMergeTree | Full | ||
| AggregatingMergeTree | Full | ||
| CollapsingMergeTree | Full | ||
| VersionedCollapsingMergeTree | Full | ||
| Distributed Tables | Full | ||
| Materialized Views | Full | ||
| TTL | Full | ||
| Dictionaries | Full | ||
| S3/External Tables | Full | ||
| Kafka Integration | Full | ||
| Replicated* Engines | Different approach |
3.4 Data Volume Estimation¶
-- Estimate total data size
SELECT
sum(total_bytes) AS bytes,
formatReadableSize(sum(total_bytes)) AS readable_size,
sum(total_rows) AS total_rows
FROM system.tables
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA');
-- Per-database breakdown
SELECT
database,
count() AS tables,
formatReadableSize(sum(total_bytes)) AS size,
sum(total_rows) AS rows
FROM system.tables
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA')
GROUP BY database
ORDER BY sum(total_bytes) DESC;
Storage Planning for HeliosDB¶
| ClickHouse Size | Recommended HeliosDB Storage | Notes |
|---|---|---|
| < 100 GB | 1.1x ClickHouse size | Similar compression |
| 100 GB - 1 TB | 1.0x - 1.1x | Comparable efficiency |
| > 1 TB | 0.95x - 1.0x | Intelligent compression |
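The multipliers in the table above can be folded into a quick capacity-planning helper. A minimal sketch, using the guide's planning figures (not measured values):

```python
# Rough storage estimate for HeliosDB capacity planning, using the
# multipliers from the table above. Thresholds and factors are the
# guide's planning figures, not measurements.
GB = 1024 ** 3
TB = 1024 ** 4

def heliosdb_storage_estimate(clickhouse_bytes: int) -> tuple:
    """Return a (low, high) byte range of recommended HeliosDB storage."""
    if clickhouse_bytes < 100 * GB:
        factors = (1.1, 1.1)    # < 100 GB: similar compression
    elif clickhouse_bytes <= 1 * TB:
        factors = (1.0, 1.1)    # 100 GB - 1 TB: comparable efficiency
    else:
        factors = (0.95, 1.0)   # > 1 TB: intelligent compression
    return (int(clickhouse_bytes * factors[0]),
            int(clickhouse_bytes * factors[1]))

low, high = heliosdb_storage_estimate(500 * GB)
print(f"Plan for {low / GB:.0f}-{high / GB:.0f} GB")  # Plan for 500-550 GB
```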
3.5 Query Pattern Analysis¶
-- Analyze recent query patterns
SELECT
type,
query_kind,
count() AS query_count,
avg(query_duration_ms) AS avg_duration_ms,
sum(read_rows) AS total_rows_read
FROM system.query_log
WHERE event_time > now() - INTERVAL 7 DAY
AND type = 'QueryFinish'
GROUP BY type, query_kind
ORDER BY query_count DESC;
-- Identify heavy queries
SELECT
query,
count() AS executions,
avg(query_duration_ms) AS avg_ms,
formatReadableSize(avg(read_bytes)) AS avg_read
FROM system.query_log
WHERE event_time > now() - INTERVAL 7 DAY
AND type = 'QueryFinish'
GROUP BY query
ORDER BY avg_ms DESC
LIMIT 20;
4. Connection String Migration¶
4.1 Connection String Format¶
ClickHouse and HeliosDB accept the same `clickhouse://user:password@host:port/database` connection format, so only the host changes. For example:
Before (ClickHouse):
clickhouse://default:mypassword@clickhouse.host:9000/analytics
After (HeliosDB):
clickhouse://default:mypassword@heliosdb.host:9000/analytics
4.2 Native Protocol Connection¶
Basic Connection (Host Change Only)¶
ClickHouse:
from clickhouse_driver import Client
client = Client(
host='clickhouse.host',
port=9000,
user='default',
password='mypassword',
database='analytics'
)
HeliosDB (Only host changes):
from clickhouse_driver import Client
client = Client(
host='heliosdb.host', # Only this changes
port=9000,
user='default',
password='mypassword',
database='analytics'
)
4.3 HTTP Protocol Connection¶
REST API Endpoint¶
ClickHouse:
http://clickhouse.host:8123/
HeliosDB:
http://heliosdb.host:8123/
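Because both systems expose the same HTTP interface, a query URL differs only in the host. A sketch that builds such URLs with Python's standard library (`http_query_url` is a hypothetical helper, not part of any driver):

```python
# Build a ClickHouse-compatible HTTP query URL for port 8123. Host names
# are placeholders; the query parameters follow the standard HTTP
# interface shared by ClickHouse and HeliosDB.
from urllib.parse import urlencode

def http_query_url(host, query, database="default",
                   port=8123, fmt="JSONEachRow"):
    params = {"query": query, "database": database, "default_format": fmt}
    return f"http://{host}:{port}/?{urlencode(params)}"

# Migrating only changes the host:
before = http_query_url("clickhouse.host", "SELECT count() FROM events")
after = http_query_url("heliosdb.host", "SELECT count() FROM events")
print(after)
```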
4.4 Connection Parameters Reference¶
| Parameter | Default | Description | HeliosDB Support |
|---|---|---|---|
| `host` | localhost | Server hostname | Full |
| `port` | 9000 | Native protocol port | Full |
| `http_port` | 8123 | HTTP protocol port | Full |
| `user` | default | Username | Full |
| `password` | - | Password | Full |
| `database` | default | Database name | Full |
| `compression` | lz4 | Compression type | LZ4, ZSTD |
| `secure` | false | Use TLS | Full |
| `verify` | true | Verify TLS certs | Full |
| `connect_timeout` | 10 | Connection timeout (s) | Full |
| `send_receive_timeout` | 300 | Query timeout (s) | Full |
| `sync_request_timeout` | 5 | Sync timeout (s) | Full |
4.5 SSL/TLS Configuration¶
from clickhouse_driver import Client
# Secure connection with TLS
client = Client(
host='heliosdb.host',
port=9440, # Secure native port
user='default',
password='mypassword',
database='analytics',
secure=True,
verify=True,
ca_certs='/path/to/ca.crt'
)
4.6 Connection Pool Configuration¶
from clickhouse_driver import Client
# Production connection pool settings
client = Client(
host='heliosdb.host',
port=9000,
user='default',
password='mypassword',
database='analytics',
compression=True,
settings={
'max_threads': 8,
'max_execution_time': 300,
'max_memory_usage': 10000000000, # 10GB
'connect_timeout': 10,
'send_receive_timeout': 300
}
)
5. Data Type Mapping¶
5.1 Numeric Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| UInt8 | UInt8 | Identical |
| UInt16 | UInt16 | Identical |
| UInt32 | UInt32 | Identical |
| UInt64 | UInt64 | Identical |
| UInt128 | UInt128 | Identical |
| UInt256 | UInt256 | Identical |
| Int8 | Int8 | Identical |
| Int16 | Int16 | Identical |
| Int32 | Int32 | Identical |
| Int64 | Int64 | Identical |
| Int128 | Int128 | Identical |
| Int256 | Int256 | Identical |
| Float32 | Float32 | IEEE 754 |
| Float64 | Float64 | IEEE 754 |
| Decimal(P, S) | Decimal(P, S) | Arbitrary precision |
| Decimal32(S) | Decimal32(S) | 9 digit precision |
| Decimal64(S) | Decimal64(S) | 18 digit precision |
| Decimal128(S) | Decimal128(S) | 38 digit precision |
5.2 String Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| String | String | Variable length UTF-8 |
| FixedString(N) | FixedString(N) | Fixed byte length |
| UUID | UUID | 128-bit UUID |
| IPv4 | IPv4 | IPv4 address |
| IPv6 | IPv6 | IPv6 address |
| Enum8 | Enum8 | 8-bit enumeration |
| Enum16 | Enum16 | 16-bit enumeration |
5.3 Date and Time Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| Date | Date | Days since epoch |
| Date32 | Date32 | Extended range |
| DateTime | DateTime | Second precision |
| DateTime64(N) | DateTime64(N) | Subsecond precision |
| DateTime64(N, tz) | DateTime64(N, tz) | With timezone |
5.4 Composite Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| Array(T) | Array(T) | Dynamic arrays |
| Tuple(T1, T2, ...) | Tuple(T1, T2, ...) | Fixed-size tuples |
| Map(K, V) | Map(K, V) | Key-value maps |
| Nested(name Type, ...) | Nested(name Type, ...) | Nested columns |
| Nullable(T) | Nullable(T) | NULL support |
| LowCardinality(T) | LowCardinality(T) | Dictionary encoding |
5.5 Special Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| Bool | Bool | Boolean |
| JSON | JSON | JSON objects |
| Object('json') | Object('json') | JSON objects |
| Point | Point | Geospatial point |
| Ring | Ring | Geospatial ring |
| Polygon | Polygon | Geospatial polygon |
| MultiPolygon | MultiPolygon | Geospatial multi-polygon |
5.6 Aggregate Function Types¶
| ClickHouse Type | HeliosDB Type | Notes |
|---|---|---|
| AggregateFunction(name, T) | AggregateFunction(name, T) | Aggregate state |
| SimpleAggregateFunction(name, T) | SimpleAggregateFunction(name, T) | Simple aggregate |
5.7 Type Conversion Examples¶
-- All type casts work identically
-- Numeric conversions
SELECT toUInt32(123.45); -- 123
SELECT toFloat64('123.456'); -- 123.456
SELECT toDecimal64(123.456789, 4); -- 123.4568
-- String conversions
SELECT toString(12345); -- '12345'
SELECT toFixedString('hello', 10); -- 'hello\0\0\0\0\0'
-- Date/Time conversions
SELECT toDate('2025-01-15'); -- 2025-01-15
SELECT toDateTime('2025-01-15 10:30:00'); -- 2025-01-15 10:30:00
SELECT toDateTime64('2025-01-15 10:30:00.123', 3);
-- Array conversions
SELECT toTypeName([1, 2, 3]); -- Array(UInt8)
SELECT array(1, 2, 3); -- [1, 2, 3]
-- Nullable handling
SELECT toNullable(123);
SELECT assumeNotNull(nullable_column);
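Since the standard types map one-to-one, a pre-migration schema audit mostly needs to flag anything outside the tables above. A simplified sketch (the type sets are transcribed from sections 5.1-5.6; it does not handle named Tuple elements or validate Enum value lists):

```python
# Schema audit sketch: flag column types that fall outside the mapping
# tables above. Wrapper types are unwrapped recursively.
import re

BASE_TYPES = {
    "UInt8", "UInt16", "UInt32", "UInt64", "UInt128", "UInt256",
    "Int8", "Int16", "Int32", "Int64", "Int128", "Int256",
    "Float32", "Float64", "String", "UUID", "IPv4", "IPv6",
    "Date", "Date32", "DateTime", "Bool", "JSON",
}
PARAMETRIC = {"Decimal", "Decimal32", "Decimal64", "Decimal128",
              "FixedString", "DateTime64", "Enum8", "Enum16"}
WRAPPERS = {"Nullable", "Array", "LowCardinality", "Map", "Tuple"}

def _split_args(inner):
    """Split on top-level commas only (Map/Tuple take several args)."""
    depth, args, current = 0, [], ""
    for ch in inner:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append(current.strip())
            current = ""
        else:
            current += ch
    args.append(current.strip())
    return args

def is_supported(type_str):
    type_str = type_str.strip()
    match = re.match(r"^(\w+)\((.*)\)$", type_str)
    if not match:
        return type_str in BASE_TYPES
    name, inner = match.groups()
    if name in WRAPPERS:
        return all(is_supported(arg) for arg in _split_args(inner))
    return name in PARAMETRIC

print(is_supported("Nullable(Array(UInt32))"))      # True
print(is_supported("Map(String, Decimal(18, 4))"))  # True
```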
6. Table Engine Mapping¶
6.1 MergeTree Family¶
MergeTree¶
-- ClickHouse
CREATE TABLE events (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
SETTINGS index_granularity = 8192;
-- HeliosDB (identical syntax)
CREATE TABLE events (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id)
SETTINGS index_granularity = 8192;
ReplacingMergeTree¶
-- For upsert/deduplication scenarios
CREATE TABLE user_states (
user_id UInt32,
version UInt64,
name String,
status String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(updated_at)
ORDER BY user_id;
-- Query with deduplication
SELECT * FROM user_states FINAL WHERE user_id = 12345;
SummingMergeTree¶
-- For pre-aggregated metrics
CREATE TABLE metrics_daily (
date Date,
metric_name String,
total_value Float64,
event_count UInt64
) ENGINE = SummingMergeTree((total_value, event_count))
PARTITION BY toYYYYMM(date)
ORDER BY (date, metric_name);
-- Rows with same ORDER BY key are summed
INSERT INTO metrics_daily VALUES ('2025-01-15', 'clicks', 100, 10);
INSERT INTO metrics_daily VALUES ('2025-01-15', 'clicks', 50, 5);
-- After merge: ('2025-01-15', 'clicks', 150, 15)
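The merge behavior can be modeled in a few lines of Python; `summing_merge` is an illustrative stand-in for what the engine does at part-merge time, not an API:

```python
# Illustrative model of SummingMergeTree's merge step: rows that share
# the ORDER BY key (date, metric_name) collapse into one row whose
# numeric columns are summed.
from collections import defaultdict

def summing_merge(rows):
    """rows: (date, metric_name, total_value, event_count) tuples."""
    merged = defaultdict(lambda: [0.0, 0])
    for date, name, value, count in rows:
        merged[(date, name)][0] += value
        merged[(date, name)][1] += count
    return {key: tuple(totals) for key, totals in merged.items()}

rows = [
    ("2025-01-15", "clicks", 100.0, 10),
    ("2025-01-15", "clicks", 50.0, 5),
]
print(summing_merge(rows))  # {('2025-01-15', 'clicks'): (150.0, 15)}
```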
AggregatingMergeTree¶
-- For complex pre-aggregations
CREATE TABLE events_hourly (
hour DateTime,
event_type String,
count_state AggregateFunction(count, UInt64),
sum_value_state AggregateFunction(sum, Float64),
uniq_users_state AggregateFunction(uniq, UInt32)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type);
-- Insert with aggregate functions
INSERT INTO events_hourly
SELECT
toStartOfHour(timestamp) AS hour,
event_type,
countState() AS count_state,
sumState(value) AS sum_value_state,
uniqState(user_id) AS uniq_users_state
FROM events
GROUP BY hour, event_type;
-- Query with merge
SELECT
hour,
event_type,
countMerge(count_state) AS total_count,
sumMerge(sum_value_state) AS total_value,
uniqMerge(uniq_users_state) AS unique_users
FROM events_hourly
GROUP BY hour, event_type;
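The State/Merge split works because partial aggregate states combine associatively. An illustrative Python model (a set stands in for `uniqState`, which in ClickHouse is an approximate sketch):

```python
# Each part keeps partial states (count, sum, distinct users) that merge
# associatively, so AggregatingMergeTree can combine parts without
# reprocessing raw rows.
def new_state():
    return {"count": 0, "sum": 0.0, "users": set()}

def add_row(state, value, user_id):      # what ...State() does at insert
    state["count"] += 1
    state["sum"] += value
    state["users"].add(user_id)
    return state

def merge_states(a, b):                  # what ...Merge() does at query time
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "users": a["users"] | b["users"],
    }

part1 = add_row(add_row(new_state(), 10.0, 1), 20.0, 2)
part2 = add_row(new_state(), 5.0, 2)
total = merge_states(part1, part2)
print(total["count"], total["sum"], len(total["users"]))  # 3 35.0 2
```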
CollapsingMergeTree¶
-- For state change tracking
CREATE TABLE user_sessions (
user_id UInt32,
session_start DateTime,
session_duration UInt32,
page_views UInt32,
sign Int8 -- 1 for insert, -1 for delete
) ENGINE = CollapsingMergeTree(sign)
PARTITION BY toYYYYMM(session_start)
ORDER BY (user_id, session_start);
-- Insert initial state
INSERT INTO user_sessions VALUES (1, '2025-01-15 10:00:00', 300, 5, 1);
-- Update by inserting old row with -1 and new row with 1
INSERT INTO user_sessions VALUES
(1, '2025-01-15 10:00:00', 300, 5, -1), -- Cancel old
(1, '2025-01-15 10:00:00', 600, 10, 1); -- Insert new
-- Query with collapsing
SELECT
user_id,
session_start,
sum(session_duration * sign) AS duration,
sum(page_views * sign) AS pages
FROM user_sessions
GROUP BY user_id, session_start
HAVING sum(sign) > 0;
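The sign arithmetic in the query above can be modeled directly: multiply each numeric column by the sign, sum per key, and keep keys whose sign sum stays positive (the surviving state):

```python
# Illustrative model of the collapsing query above, not an API.
from collections import defaultdict

def collapse(rows):
    """rows: (user_id, session_start, duration, page_views, sign)."""
    acc = defaultdict(lambda: [0, 0, 0])  # [duration, pages, sign_sum]
    for user_id, start, duration, pages, sign in rows:
        key = (user_id, start)
        acc[key][0] += duration * sign
        acc[key][1] += pages * sign
        acc[key][2] += sign
    return {k: (d, p) for k, (d, p, s) in acc.items() if s > 0}

rows = [
    (1, "2025-01-15 10:00:00", 300, 5, 1),    # initial state
    (1, "2025-01-15 10:00:00", 300, 5, -1),   # cancel old
    (1, "2025-01-15 10:00:00", 600, 10, 1),   # new state
]
print(collapse(rows))  # {(1, '2025-01-15 10:00:00'): (600, 10)}
```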
VersionedCollapsingMergeTree¶
-- For ordered state changes with versions
CREATE TABLE user_states_versioned (
user_id UInt32,
status String,
version UInt64,
sign Int8
) ENGINE = VersionedCollapsingMergeTree(sign, version)
ORDER BY user_id;
6.2 Integration Engines¶
Distributed Tables¶
-- ClickHouse distributed table
CREATE TABLE events_distributed AS events
ENGINE = Distributed(cluster_name, database, events, rand());
-- HeliosDB (same syntax, uses internal distribution)
CREATE TABLE events_distributed AS events
ENGINE = Distributed(default, analytics, events, rand());
Kafka Integration¶
-- Kafka source table
CREATE TABLE events_kafka (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'events',
kafka_group_name = 'heliosdb_consumer',
kafka_format = 'JSONEachRow';
-- Materialized view to persist data
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT * FROM events_kafka;
S3 Tables¶
-- External S3 table
CREATE TABLE s3_data (
timestamp DateTime,
event_type String,
value Float64
) ENGINE = S3(
'https://bucket.s3.amazonaws.com/data/*.parquet',
'AWS_ACCESS_KEY',
'AWS_SECRET_KEY',
'Parquet'
);
-- Query external data
SELECT event_type, sum(value)
FROM s3_data
GROUP BY event_type;
6.3 Special Engines¶
Memory Engine¶
-- In-memory table (same syntax)
CREATE TABLE temp_data (
id UInt32,
value String
) ENGINE = Memory;
Log Engines¶
-- Simple log table
CREATE TABLE log_data (
timestamp DateTime,
message String
) ENGINE = Log;
-- Tiny log for small tables
CREATE TABLE config_data (
key String,
value String
) ENGINE = TinyLog;
Dictionary Engine¶
-- Create dictionary
CREATE DICTIONARY geo_dict (
country_code String,
country_name String,
population UInt64
)
PRIMARY KEY country_code
SOURCE(CLICKHOUSE(
HOST 'localhost'
PORT 9000
USER 'default'
TABLE 'countries'
DB 'reference'
))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(FLAT());
-- Use in queries
SELECT dictGet('geo_dict', 'country_name', country_code) AS country
FROM events;
6.4 Engine Migration Matrix¶
| ClickHouse Engine | HeliosDB Equivalent | Migration Complexity |
|---|---|---|
| MergeTree | MergeTree | None (identical) |
| ReplacingMergeTree | ReplacingMergeTree | None (identical) |
| SummingMergeTree | SummingMergeTree | None (identical) |
| AggregatingMergeTree | AggregatingMergeTree | None (identical) |
| CollapsingMergeTree | CollapsingMergeTree | None (identical) |
| VersionedCollapsingMergeTree | VersionedCollapsingMergeTree | None (identical) |
| Distributed | Distributed | Cluster config changes |
| ReplicatedMergeTree | MergeTree + HeliosDB replication | Schema change |
| Kafka | Kafka | Connection config only |
| S3 | S3 | Credential config only |
| Memory | Memory | None (identical) |
| Log/TinyLog | Log/TinyLog | None (identical) |
| Dictionary | Dictionary | None (identical) |
6.5 Replicated Engine Migration¶
ClickHouse Replicated* engines use ZooKeeper/Keeper. HeliosDB uses its own replication:
ClickHouse:
CREATE TABLE events_replicated (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);
HeliosDB:
-- Use standard MergeTree; replication is automatic
CREATE TABLE events (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, user_id);
-- Configure replication in HeliosDB settings
-- No ZooKeeper/Keeper required
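When many tables need this rewrite, the engine clause can be transformed mechanically. A hypothetical regex-based sketch for bulk-editing dumped DDL (review its output before applying it to production schemas):

```python
# Strip the ZooKeeper path and replica arguments from Replicated*
# engine clauses so the DDL uses the plain engine; HeliosDB handles
# replication itself. Regex-based sketch, not a full SQL parser.
import re

REPLICATED = re.compile(
    r"Replicated(\w*MergeTree)\(\s*'[^']*'\s*,\s*'[^']*'\s*(?:,\s*)?")

def dereplicate(ddl):
    return REPLICATED.sub(r"\1(", ddl)

ddl = ("CREATE TABLE events (...) ENGINE = "
       "ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}') "
       "ORDER BY timestamp")
print(dereplicate(ddl))
```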
7. Query Syntax Migration¶
7.1 Identical Syntax (No Changes Required)¶
Most ClickHouse queries work identically in HeliosDB:
-- SELECT with aggregations
SELECT
event_type,
count() AS events,
count(DISTINCT user_id) AS unique_users,
sum(value) AS total_value,
avg(value) AS avg_value,
min(value) AS min_value,
max(value) AS max_value
FROM events
WHERE timestamp >= '2025-01-01'
GROUP BY event_type
ORDER BY events DESC
LIMIT 100;
-- PREWHERE for optimization
SELECT *
FROM events
PREWHERE timestamp >= '2025-01-01'
WHERE event_type = 'purchase';
-- SAMPLE for approximate queries
SELECT event_type, count() * 10 AS estimated_count
FROM events SAMPLE 0.1
GROUP BY event_type;
-- FINAL for deduplication
SELECT * FROM user_states FINAL
WHERE user_id = 12345;
-- WITH clauses (CTEs)
WITH daily_totals AS (
SELECT
toDate(timestamp) AS date,
sum(value) AS total
FROM events
GROUP BY date
)
SELECT
date,
total,
total - lagInFrame(total) OVER (ORDER BY date) AS change
FROM daily_totals;
7.2 Window Functions¶
-- All window functions supported
SELECT
timestamp,
user_id,
value,
row_number() OVER (PARTITION BY user_id ORDER BY timestamp) AS row_num,
rank() OVER (PARTITION BY user_id ORDER BY value DESC) AS value_rank,
dense_rank() OVER (PARTITION BY user_id ORDER BY value DESC) AS dense_value_rank,
sum(value) OVER (PARTITION BY user_id ORDER BY timestamp) AS running_total,
avg(value) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) AS moving_avg,
lead(value, 1) OVER (PARTITION BY user_id ORDER BY timestamp) AS next_value,
lag(value, 1) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_value,
first_value(value) OVER (PARTITION BY user_id ORDER BY timestamp) AS first_val,
last_value(value) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_val
FROM events
ORDER BY user_id, timestamp;
7.3 JOIN Operations¶
-- All JOIN types supported
-- INNER JOIN
SELECT e.*, u.name
FROM events e
INNER JOIN users u ON e.user_id = u.user_id;
-- LEFT JOIN
SELECT e.*, u.name
FROM events e
LEFT JOIN users u ON e.user_id = u.user_id;
-- RIGHT JOIN
SELECT e.*, u.name
FROM events e
RIGHT JOIN users u ON e.user_id = u.user_id;
-- FULL OUTER JOIN
SELECT e.*, u.name
FROM events e
FULL JOIN users u ON e.user_id = u.user_id;
-- CROSS JOIN
SELECT *
FROM events e
CROSS JOIN config c;
-- ASOF JOIN (time-series join)
SELECT
e.timestamp,
e.user_id,
e.value,
p.price
FROM events e
ASOF LEFT JOIN prices p
ON e.product_id = p.product_id
AND e.timestamp >= p.effective_date;
-- GLOBAL JOIN for distributed tables
SELECT *
FROM events_distributed e
GLOBAL JOIN users_distributed u ON e.user_id = u.user_id;
7.4 Aggregation Functions¶
-- Standard aggregations
SELECT
count(),
count(DISTINCT user_id),
sum(value),
avg(value),
min(value),
max(value),
any(event_type),
anyHeavy(event_type),
anyLast(event_type)
FROM events;
-- Statistical functions
SELECT
stddevPop(value) AS stddev,
stddevSamp(value) AS stddev_sample,
varPop(value) AS variance,
varSamp(value) AS variance_sample,
covarPop(value, other_value) AS covariance,
corr(value, other_value) AS correlation
FROM events;
-- Quantile functions
SELECT
quantile(0.5)(value) AS median,
quantile(0.95)(value) AS p95,
quantile(0.99)(value) AS p99,
quantileExact(0.5)(value) AS exact_median,
quantiles(0.25, 0.5, 0.75)(value) AS quartiles,
quantileTiming(0.95)(response_ms) AS timing_p95
FROM events;
-- Unique count functions
SELECT
uniq(user_id) AS approx_unique,
uniqExact(user_id) AS exact_unique,
uniqCombined(user_id) AS combined_unique,
uniqHLL12(user_id) AS hll_unique
FROM events;
-- TopK functions
SELECT
topK(10)(event_type) AS top_events,
topKWeighted(10)(event_type, value) AS weighted_top
FROM events;
-- Conditional aggregations
SELECT
countIf(event_type = 'purchase') AS purchases,
sumIf(value, event_type = 'purchase') AS purchase_value,
avgIf(value, event_type = 'view') AS avg_view_value
FROM events;
7.5 Array Functions¶
-- Array operations
SELECT
[1, 2, 3] AS arr,
array(1, 2, 3) AS arr2,
arrayJoin([1, 2, 3]) AS expanded,
arrayMap(x -> x * 2, [1, 2, 3]) AS doubled,
arrayFilter(x -> x > 1, [1, 2, 3]) AS filtered,
arrayReduce('sum', [1, 2, 3]) AS reduced,
arraySort([3, 1, 2]) AS sorted,
arrayReverse([1, 2, 3]) AS reversed,
arraySlice([1, 2, 3, 4, 5], 2, 3) AS sliced,
arrayConcat([1, 2], [3, 4]) AS concatenated,
has([1, 2, 3], 2) AS contains,
indexOf([1, 2, 3], 2) AS position,
length([1, 2, 3]) AS len,
arrayUniq([1, 1, 2, 2, 3]) AS unique_count;
-- Array aggregations
SELECT
groupArray(user_id) AS user_ids,
groupArrayDistinct(user_id) AS unique_user_ids,
groupUniqArray(user_id) AS uniq_array
FROM events
WHERE event_type = 'purchase';
7.6 String Functions¶
-- String operations
SELECT
concat('Hello', ' ', 'World') AS concatenated,
substring('ClickHouse', 1, 5) AS sub,
length('Hello') AS len,
lower('HELLO') AS lowered,
upper('hello') AS uppered,
trim(' hello ') AS trimmed,
ltrim(' hello') AS ltrimmed,
rtrim('hello ') AS rtrimmed,
splitByChar(',', 'a,b,c') AS split,
splitByString('::', 'a::b::c') AS split_str,
replaceAll('hello world', 'world', 'universe') AS replaced,
reverse('hello') AS reversed,
position('hello', 'l') AS pos,
match('hello', 'e.+o') AS regex_match,
extract('hello123world', '\\d+') AS extracted;
-- Format functions
SELECT
format('{} {}', 'Hello', 'World') AS formatted,
formatReadableSize(1024 * 1024 * 1024) AS readable_size,
formatReadableQuantity(1000000) AS readable_qty;
7.7 Date/Time Functions¶
-- Date/Time operations
SELECT
now() AS current_time,
today() AS current_date,
yesterday() AS prev_date,
toDate('2025-01-15') AS date,
toDateTime('2025-01-15 10:30:00') AS datetime,
toDateTime64('2025-01-15 10:30:00.123456', 6) AS datetime64,
toYear(now()) AS year,
toMonth(now()) AS month,
toDayOfMonth(now()) AS day,
toHour(now()) AS hour,
toMinute(now()) AS minute,
toSecond(now()) AS second,
toDayOfWeek(now()) AS day_of_week,
toDayOfYear(now()) AS day_of_year,
toStartOfDay(now()) AS start_of_day,
toStartOfHour(now()) AS start_of_hour,
toStartOfMinute(now()) AS start_of_minute,
toStartOfMonth(now()) AS start_of_month,
toStartOfQuarter(now()) AS start_of_quarter,
toStartOfYear(now()) AS start_of_year,
toStartOfWeek(now()) AS start_of_week,
date_add(day, 7, today()) AS week_later,
date_sub(month, 1, today()) AS month_ago,
dateDiff('day', '2025-01-01', '2025-01-15') AS days_diff;
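When spot-checking migrated results outside the database, Python's `datetime` module covers several of these functions. A few equivalents (assuming ClickHouse's default Monday=1 numbering for toDayOfWeek):

```python
# Python equivalents of a few date/time functions above, handy for
# cross-validating query results during migration testing.
from datetime import date, datetime

ts = datetime(2025, 1, 15, 10, 30, 45)

start_of_hour = ts.replace(minute=0, second=0, microsecond=0)  # toStartOfHour
start_of_month = ts.date().replace(day=1)                      # toStartOfMonth
day_of_week = ts.isoweekday()                                  # toDayOfWeek
days_diff = (date(2025, 1, 15) - date(2025, 1, 1)).days        # dateDiff('day', ...)

print(start_of_hour, start_of_month, day_of_week, days_diff)
```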
7.8 Conditional Functions¶
-- Conditional operations
SELECT
if(value > 100, 'high', 'low') AS category,
multiIf(
value > 1000, 'very high',
value > 100, 'high',
value > 10, 'medium',
'low'
) AS multi_category,
CASE
WHEN value > 1000 THEN 'very high'
WHEN value > 100 THEN 'high'
WHEN value > 10 THEN 'medium'
ELSE 'low'
END AS case_category,
coalesce(nullable_value, 0) AS with_default,
ifNull(nullable_value, 0) AS if_null,
nullIf(value, 0) AS null_if_zero
FROM events;
8. Materialized View Migration¶
8.1 Standard Materialized Views¶
Materialized views work identically in HeliosDB:
-- Create target table
CREATE TABLE events_daily (
date Date,
event_type String,
total_events UInt64,
total_value Float64,
unique_users UInt64
) ENGINE = SummingMergeTree((total_events, total_value, unique_users))
PARTITION BY toYYYYMM(date)
ORDER BY (date, event_type);
-- Create materialized view
CREATE MATERIALIZED VIEW events_daily_mv TO events_daily AS
SELECT
toDate(timestamp) AS date,
event_type,
count() AS total_events,
sum(value) AS total_value,
uniq(user_id) AS unique_users
FROM events
GROUP BY date, event_type;
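What the view maintains per group can be modeled in Python; note that ClickHouse's `uniq()` is approximate, while the set used here is exact:

```python
# What events_daily_mv maintains, modeled per (date, event_type) group:
# an event count, a value sum, and a distinct-user count.
from collections import defaultdict

def daily_rollup(events):
    """events: (timestamp 'YYYY-MM-DD hh:mm:ss', event_type, user_id, value)."""
    groups = defaultdict(lambda: [0, 0.0, set()])
    for ts, event_type, user_id, value in events:
        group = groups[(ts[:10], event_type)]   # ts[:10] mimics toDate()
        group[0] += 1
        group[1] += value
        group[2].add(user_id)
    return {k: (count, total, len(users))
            for k, (count, total, users) in groups.items()}

events = [
    ("2025-01-15 10:00:00", "purchase", 1, 10.0),
    ("2025-01-15 11:30:00", "purchase", 1, 5.0),
    ("2025-01-15 12:00:00", "view", 2, 0.0),
]
print(daily_rollup(events))
```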
8.2 Aggregating Materialized Views¶
-- Hourly aggregations with AggregatingMergeTree
CREATE TABLE metrics_hourly (
hour DateTime,
metric_name String,
count_state AggregateFunction(count, UInt64),
sum_state AggregateFunction(sum, Float64),
avg_state AggregateFunction(avg, Float64),
min_state AggregateFunction(min, Float64),
max_state AggregateFunction(max, Float64),
uniq_state AggregateFunction(uniq, UInt32),
quantile_state AggregateFunction(quantile(0.95), Float64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, metric_name);
CREATE MATERIALIZED VIEW metrics_hourly_mv TO metrics_hourly AS
SELECT
toStartOfHour(timestamp) AS hour,
metric_name,
countState() AS count_state,
sumState(value) AS sum_state,
avgState(value) AS avg_state,
minState(value) AS min_state,
maxState(value) AS max_state,
uniqState(user_id) AS uniq_state,
quantileState(0.95)(value) AS quantile_state
FROM metrics
GROUP BY hour, metric_name;
-- Query with merge
SELECT
hour,
metric_name,
countMerge(count_state) AS total_count,
sumMerge(sum_state) AS total_sum,
avgMerge(avg_state) AS average,
minMerge(min_state) AS minimum,
maxMerge(max_state) AS maximum,
uniqMerge(uniq_state) AS unique_count,
quantileMerge(0.95)(quantile_state) AS p95
FROM metrics_hourly
WHERE hour >= toStartOfDay(now())
GROUP BY hour, metric_name
ORDER BY hour;
8.3 Cascading Materialized Views¶
-- First level: Hourly
CREATE TABLE events_hourly (
hour DateTime,
event_type String,
count UInt64,
value_sum Float64
) ENGINE = SummingMergeTree((count, value_sum))
ORDER BY (hour, event_type);
CREATE MATERIALIZED VIEW events_hourly_mv TO events_hourly AS
SELECT
toStartOfHour(timestamp) AS hour,
event_type,
count() AS count,
sum(value) AS value_sum
FROM events
GROUP BY hour, event_type;
-- Second level: Daily (aggregates from hourly)
CREATE TABLE events_daily_summary (
date Date,
event_type String,
count UInt64,
value_sum Float64
) ENGINE = SummingMergeTree((count, value_sum))
ORDER BY (date, event_type);
CREATE MATERIALIZED VIEW events_daily_mv TO events_daily_summary AS
SELECT
toDate(hour) AS date,
event_type,
sum(count) AS count,
sum(value_sum) AS value_sum
FROM events_hourly
GROUP BY date, event_type;
8.4 Materialized View with POPULATE¶
-- Create materialized view and populate with existing data.
-- Note: rows inserted into the source table while POPULATE runs may be
-- missed; pause writes during creation or backfill afterwards.
CREATE MATERIALIZED VIEW user_stats_mv TO user_stats
POPULATE
AS SELECT
user_id,
count() AS total_events,
sum(value) AS total_value,
max(timestamp) AS last_activity
FROM events
GROUP BY user_id;
9. Data Migration Process¶
9.1 Migration Strategy Selection¶
| Data Size | Recommended Method | Estimated Time |
|---|---|---|
| < 10 GB | INSERT SELECT via link | Minutes |
| 10-100 GB | COPY with files | 30 min - 2 hours |
| 100 GB - 1 TB | Parallel export/import | 2-8 hours |
| > 1 TB | Streaming or incremental | Hours to days |
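The selection rules in the table reduce to a small helper; `pick_method` is a hypothetical name that returns the guide's method labels, not CLI commands:

```python
# Pick a migration method per table based on size, using the thresholds
# from the table above.
GB = 1024 ** 3
TB = 1024 ** 4

def pick_method(total_bytes):
    if total_bytes < 10 * GB:
        return "INSERT SELECT via link"
    if total_bytes < 100 * GB:
        return "COPY with files"
    if total_bytes < 1 * TB:
        return "Parallel export/import"
    return "Streaming or incremental"

for size in (2 * GB, 50 * GB, 500 * GB, 3 * TB):
    print(f"{size / GB:>6.0f} GB -> {pick_method(size)}")
```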
9.2 Method 1: Direct INSERT SELECT (Small Datasets)¶
-- Connect HeliosDB to ClickHouse as an external source through
-- ClickHouse's MySQL-compatibility interface (port 9004), then copy
-- data directly.
-- In HeliosDB:
CREATE TABLE events_clickhouse (
timestamp DateTime,
event_type String,
user_id UInt32,
value Float64
) ENGINE = MySQL('clickhouse.host:9004', 'analytics', 'events', 'user', 'pass');
-- Copy data
INSERT INTO events SELECT * FROM events_clickhouse;
9.3 Method 2: Export/Import with Files¶
Export from ClickHouse¶
# Export to Native format (fastest)
clickhouse-client \
--host clickhouse.host \
--query "SELECT * FROM analytics.events FORMAT Native" \
> events.native
# Export to CSV
clickhouse-client \
--host clickhouse.host \
--query "SELECT * FROM analytics.events FORMAT CSVWithNames" \
> events.csv
# Export to Parquet (compressed)
clickhouse-client \
--host clickhouse.host \
--query "SELECT * FROM analytics.events FORMAT Parquet" \
> events.parquet
# Export to JSON
clickhouse-client \
--host clickhouse.host \
--query "SELECT * FROM analytics.events FORMAT JSONEachRow" \
> events.json
Import to HeliosDB¶
# Import from Native format
clickhouse-client \
--host heliosdb.host \
--query "INSERT INTO analytics.events FORMAT Native" \
< events.native
# Import from CSV
clickhouse-client \
--host heliosdb.host \
--query "INSERT INTO analytics.events FORMAT CSVWithNames" \
< events.csv
# Import from Parquet
clickhouse-client \
--host heliosdb.host \
--query "INSERT INTO analytics.events FORMAT Parquet" \
< events.parquet
9.4 Method 3: Python Migration Script¶
#!/usr/bin/env python3
"""
ClickHouse to HeliosDB Migration Script
Handles large tables with batching and progress tracking.
"""
from clickhouse_driver import Client
import time
import sys
def migrate_table(source_host, target_host, database, table, batch_size=100000):
    """Migrate a single table with batching."""
    source = Client(
        host=source_host,
        port=9000,
        database=database
    )
    target = Client(
        host=target_host,
        port=9000,
        database=database
    )

    # Get row count
    total_rows = source.execute(f'SELECT count() FROM {table}')[0][0]
    print(f"Migrating {table}: {total_rows:,} rows")

    # Get column info
    columns_result = source.execute(f'DESCRIBE TABLE {table}')
    columns = ', '.join([col[0] for col in columns_result])

    # Migrate in batches
    # NOTE: LIMIT/OFFSET paging is only deterministic over a stable ordering;
    # add ORDER BY <sorting key> to the SELECT if the table can receive
    # writes while the copy is running.
    offset = 0
    migrated = 0
    start_time = time.time()
    while offset < total_rows:
        # Read batch from source
        rows = source.execute(
            f'SELECT {columns} FROM {table} LIMIT {batch_size} OFFSET {offset}'
        )
        if not rows:
            break

        # Write batch to target
        target.execute(
            f'INSERT INTO {table} ({columns}) VALUES',
            rows
        )

        offset += batch_size
        migrated += len(rows)

        # Progress update
        elapsed = time.time() - start_time
        rate = migrated / elapsed if elapsed > 0 else 0
        progress = (migrated / total_rows) * 100
        eta = (total_rows - migrated) / rate if rate > 0 else 0
        print(f"  Progress: {progress:.1f}% ({migrated:,}/{total_rows:,}) "
              f"Rate: {rate:,.0f} rows/sec  ETA: {eta:.0f}s")

    elapsed = time.time() - start_time
    print(f"Completed {table}: {migrated:,} rows in {elapsed:.1f}s "
          f"({migrated/elapsed:,.0f} rows/sec)")
    return migrated
def migrate_database(source_host, target_host, database, tables=None):
    """Migrate entire database or specific tables."""
    source = Client(host=source_host, port=9000, database=database)

    # Get tables if not specified
    if tables is None:
        result = source.execute(
            f"SELECT name FROM system.tables WHERE database = '{database}'"
        )
        tables = [row[0] for row in result]

    print(f"Migrating {len(tables)} tables from {database}")
    total_migrated = 0
    for table in tables:
        try:
            migrated = migrate_table(
                source_host, target_host, database, table
            )
            total_migrated += migrated
        except Exception as e:
            print(f"Error migrating {table}: {e}")
            continue

    print(f"\nTotal migrated: {total_migrated:,} rows")
    return total_migrated

if __name__ == "__main__":
    # Configuration
    SOURCE_HOST = 'clickhouse.host'
    TARGET_HOST = 'heliosdb.host'
    DATABASE = 'analytics'

    # Migrate all tables
    migrate_database(SOURCE_HOST, TARGET_HOST, DATABASE)
9.5 Method 4: Parallel Migration for Large Datasets¶
#!/usr/bin/env python3
"""
Parallel migration for large ClickHouse tables.
Uses multiple workers for concurrent data transfer.
"""
from clickhouse_driver import Client
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import time
class ParallelMigrator:
    def __init__(self, source_host, target_host, database, num_workers=4):
        self.source_host = source_host
        self.target_host = target_host
        self.database = database
        self.num_workers = num_workers
        self.lock = threading.Lock()
        self.total_migrated = 0

    def get_client(self, host):
        return Client(
            host=host,
            port=9000,
            database=self.database
        )

    def migrate_partition(self, table, partition_id, columns):
        """Migrate a single partition."""
        source = self.get_client(self.source_host)
        target = self.get_client(self.target_host)

        # Export partition data via the _partition_id virtual column
        rows = source.execute(f"""
            SELECT {columns} FROM {table}
            WHERE _partition_id = '{partition_id}'
        """)
        if rows:
            target.execute(
                f'INSERT INTO {table} ({columns}) VALUES',
                rows
            )
            with self.lock:
                self.total_migrated += len(rows)
        return len(rows)

    def migrate_table(self, table):
        """Migrate table using parallel partition processing."""
        source = self.get_client(self.source_host)

        # Get partitions
        partitions = source.execute(f"""
            SELECT DISTINCT _partition_id
            FROM {table}
        """)
        partition_ids = [p[0] for p in partitions]

        # Get columns
        columns_result = source.execute(f'DESCRIBE TABLE {table}')
        columns = ', '.join([col[0] for col in columns_result])

        print(f"Migrating {table}: {len(partition_ids)} partitions")
        start_time = time.time()

        with ThreadPoolExecutor(max_workers=self.num_workers) as executor:
            futures = {
                executor.submit(
                    self.migrate_partition, table, pid, columns
                ): pid for pid in partition_ids
            }
            completed = 0
            for future in as_completed(futures):
                partition_id = futures[future]
                try:
                    rows = future.result()
                    completed += 1
                    print(f"  Partition {partition_id}: {rows:,} rows "
                          f"({completed}/{len(partition_ids)})")
                except Exception as e:
                    print(f"  Error in partition {partition_id}: {e}")

        elapsed = time.time() - start_time
        print(f"Completed {table}: {self.total_migrated:,} rows "
              f"in {elapsed:.1f}s")

if __name__ == "__main__":
    migrator = ParallelMigrator(
        source_host='clickhouse.host',
        target_host='heliosdb.host',
        database='analytics',
        num_workers=8
    )
    migrator.migrate_table('events')
9.6 Method 5: Incremental/CDC Migration¶
For zero-downtime migrations with continuous data sync:
#!/usr/bin/env python3
"""
Incremental migration with change tracking.
Suitable for zero-downtime migrations.
"""
from clickhouse_driver import Client
import time
def incremental_sync(source_host, target_host, database, table,
                     timestamp_column='timestamp', interval_seconds=60):
    """Continuously sync new data from source to target."""
    source = Client(host=source_host, port=9000, database=database)
    target = Client(host=target_host, port=9000, database=database)

    # Get columns and locate the watermark column
    columns_result = source.execute(f'DESCRIBE TABLE {table}')
    column_names = [col[0] for col in columns_result]
    columns = ', '.join(column_names)
    ts_idx = column_names.index(timestamp_column)

    # Get initial watermark from target
    result = target.execute(f'SELECT max({timestamp_column}) FROM {table}')
    watermark = result[0][0] or '1970-01-01 00:00:00'
    print(f"Starting incremental sync from {watermark}")

    while True:
        # Get new rows since watermark
        rows = source.execute(f"""
            SELECT {columns} FROM {table}
            WHERE {timestamp_column} > '{watermark}'
            ORDER BY {timestamp_column}
            LIMIT 100000
        """)
        if rows:
            # Insert to target
            target.execute(
                f'INSERT INTO {table} ({columns}) VALUES',
                rows
            )
            # Update watermark from the last (newest) row of the
            # timestamp-ordered batch
            new_watermark = rows[-1][ts_idx]
            print(f"Synced {len(rows)} rows up to {new_watermark}")
            watermark = new_watermark
        else:
            print(f"No new data (watermark: {watermark})")

        # Wait before next sync
        time.sleep(interval_seconds)

if __name__ == "__main__":
    incremental_sync(
        source_host='clickhouse.host',
        target_host='heliosdb.host',
        database='analytics',
        table='events',
        timestamp_column='timestamp',
        interval_seconds=30
    )
9.7 Schema Migration Script¶
#!/bin/bash
# migrate_schema.sh - Export and import ClickHouse schema to HeliosDB
SOURCE_HOST="clickhouse.host"
TARGET_HOST="heliosdb.host"
DATABASE="analytics"
echo "=== Schema Migration: ClickHouse -> HeliosDB ==="
# Export schema
echo "Exporting schema from $SOURCE_HOST..."
clickhouse-client --host $SOURCE_HOST --query "
SELECT create_table_query
FROM system.tables
WHERE database = '$DATABASE'
AND engine NOT LIKE 'Replicated%'
" > schema_export.sql
# Handle ReplicatedMergeTree -> MergeTree conversion
echo "Converting Replicated* engines to standard engines..."
sed -i 's/ReplicatedMergeTree([^)]*)/MergeTree()/g' schema_export.sql
sed -i 's/ReplicatedReplacingMergeTree([^)]*)/ReplacingMergeTree()/g' schema_export.sql
sed -i 's/ReplicatedSummingMergeTree([^)]*)/SummingMergeTree()/g' schema_export.sql
sed -i 's/ReplicatedAggregatingMergeTree([^)]*)/AggregatingMergeTree()/g' schema_export.sql
# Create database in HeliosDB
echo "Creating database in HeliosDB..."
clickhouse-client --host $TARGET_HOST --query "CREATE DATABASE IF NOT EXISTS $DATABASE"
# Import schema
echo "Importing schema to $TARGET_HOST..."
while IFS= read -r query; do
if [ -n "$query" ]; then
clickhouse-client --host $TARGET_HOST --query "$query" || \
echo "Failed: $query"
fi
done < schema_export.sql
echo "Schema migration complete!"
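If the schema contains many engine variants, the `sed` rules above can become awkward to maintain. The same conversion can be sketched in Python (the `convert_engines` helper is illustrative; it mirrors the sed patterns, including their limitation that engine arguments must not contain nested parentheses):

```python
import re

# Map each Replicated* engine to its standalone equivalent; HeliosDB
# supplies replication itself, so the ZooKeeper path arguments are dropped.
REPLICATED_ENGINES = [
    ('ReplicatedMergeTree', 'MergeTree'),
    ('ReplicatedReplacingMergeTree', 'ReplacingMergeTree'),
    ('ReplicatedSummingMergeTree', 'SummingMergeTree'),
    ('ReplicatedAggregatingMergeTree', 'AggregatingMergeTree'),
]


def convert_engines(ddl: str) -> str:
    """Rewrite Replicated* engine clauses in a CREATE TABLE statement."""
    for replicated, plain in REPLICATED_ENGINES:
        # \b avoids partial name matches; [^)]* mirrors the sed rule and
        # assumes the engine arguments contain no nested parentheses.
        ddl = re.sub(rf'\b{replicated}\([^)]*\)', f'{plain}()', ddl)
    return ddl
```

Applied line by line to `schema_export.sql`, this produces the same output as the four `sed -i` commands.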
10. Application Connectivity¶
10.1 Python (clickhouse-driver)¶
from clickhouse_driver import Client
# Before (ClickHouse)
client = Client(
host='clickhouse.host',
port=9000,
user='default',
password='password',
database='analytics',
compression=True
)
# After (HeliosDB) - Only host changes
client = Client(
host='heliosdb.host', # Change only this
port=9000,
user='default',
password='password',
database='analytics',
compression=True
)
# Usage remains identical
result = client.execute('''
SELECT event_type, count() AS cnt
FROM events
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY event_type
ORDER BY cnt DESC
''')
for row in result:
print(f"{row[0]}: {row[1]}")
10.2 Python (clickhouse-connect)¶
import clickhouse_connect
# Before (ClickHouse)
client = clickhouse_connect.get_client(
host='clickhouse.host',
port=8123,
username='default',
password='password',
database='analytics'
)
# After (HeliosDB) - Only host changes
client = clickhouse_connect.get_client(
host='heliosdb.host', # Change only this
port=8123,
username='default',
password='password',
database='analytics'
)
# Query with Pandas integration
df = client.query_df('''
SELECT
toDate(timestamp) AS date,
event_type,
count() AS events
FROM events
GROUP BY date, event_type
''')
print(df.head())
10.3 Go (clickhouse-go)¶
package main
import (
"context"
"fmt"
"github.com/ClickHouse/clickhouse-go/v2"
)
func main() {
// Before (ClickHouse)
// conn, _ := clickhouse.Open(&clickhouse.Options{
// Addr: []string{"clickhouse.host:9000"},
// ...
// })
// After (HeliosDB) - Only address changes
conn, err := clickhouse.Open(&clickhouse.Options{
Addr: []string{"heliosdb.host:9000"}, // Change only this
Auth: clickhouse.Auth{
Database: "analytics",
Username: "default",
Password: "password",
},
Compression: &clickhouse.Compression{
Method: clickhouse.CompressionLZ4,
},
Settings: clickhouse.Settings{
"max_execution_time": 60,
},
})
if err != nil {
panic(err)
}
defer conn.Close()
ctx := context.Background()
rows, err := conn.Query(ctx, `
SELECT event_type, count() AS cnt
FROM events
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY event_type
`)
if err != nil {
panic(err)
}
defer rows.Close()
for rows.Next() {
var eventType string
var count uint64
rows.Scan(&eventType, &count)
fmt.Printf("%s: %d\n", eventType, count)
}
}
10.4 Node.js (@clickhouse/client)¶
const { createClient } = require('@clickhouse/client');
// Before (ClickHouse)
// const client = createClient({
// host: 'http://clickhouse.host:8123',
// ...
// });
// After (HeliosDB) - Only host changes
const client = createClient({
host: 'http://heliosdb.host:8123', // Change only this
database: 'analytics',
username: 'default',
password: 'password',
compression: {
request: true,
response: true,
},
});
async function queryEvents() {
const result = await client.query({
query: `
SELECT event_type, count() AS cnt
FROM events
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY event_type
ORDER BY cnt DESC
`,
format: 'JSONEachRow',
});
const data = await result.json();
for (const row of data) {
console.log(`${row.event_type}: ${row.cnt}`);
}
}
async function insertEvents() {
const events = [
{ timestamp: new Date(), event_type: 'click', user_id: 1, value: 10.5 },
{ timestamp: new Date(), event_type: 'view', user_id: 2, value: 5.0 },
];
await client.insert({
table: 'events',
values: events,
format: 'JSONEachRow',
});
}
queryEvents();
10.5 Java (JDBC)¶
import java.sql.*;
import java.util.Properties;
public class HeliosDBJDBC {
public static void main(String[] args) throws SQLException {
// Before (ClickHouse)
// String url = "jdbc:clickhouse://clickhouse.host:8123/analytics";
// After (HeliosDB) - Only host changes
String url = "jdbc:clickhouse://heliosdb.host:8123/analytics";
Properties props = new Properties();
props.setProperty("user", "default");
props.setProperty("password", "password");
props.setProperty("compress", "true");
try (Connection conn = DriverManager.getConnection(url, props);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(
"SELECT event_type, count() AS cnt " +
"FROM events " +
"WHERE timestamp > now() - INTERVAL 1 DAY " +
"GROUP BY event_type"
)) {
while (rs.next()) {
System.out.printf("%s: %d%n",
rs.getString("event_type"),
rs.getLong("cnt"));
}
}
}
}
10.6 Java (Native Client)¶
import com.clickhouse.client.*;
import com.clickhouse.data.*;
public class HeliosDBNative {
public static void main(String[] args) throws Exception {
// Before (ClickHouse)
// ClickHouseNode server = ClickHouseNode.of("clickhouse.host:9000");
// After (HeliosDB) - Only host changes
ClickHouseNode server = ClickHouseNode.builder()
.host("heliosdb.host") // Change only this
.port(9000)
.database("analytics")
.credentials(ClickHouseCredentials.fromUserAndPassword("default", "password"))
.build();
try (ClickHouseClient client = ClickHouseClient.newInstance(ClickHouseProtocol.NATIVE);
ClickHouseResponse response = client.read(server)
.query("SELECT event_type, count() FROM events GROUP BY event_type")
.executeAndWait()) {
for (ClickHouseRecord record : response.records()) {
System.out.printf("%s: %d%n",
record.getValue(0).asString(),
record.getValue(1).asLong());
}
}
}
}
10.7 HTTP API (curl)¶
# Before (ClickHouse)
# curl 'http://clickhouse.host:8123/' --data-binary "SELECT 1"
# After (HeliosDB) - Only host changes
curl 'http://heliosdb.host:8123/' \
--data-binary "SELECT event_type, count() FROM events GROUP BY event_type"
# With authentication
curl 'http://heliosdb.host:8123/?user=default&password=password' \
--data-binary "SELECT * FROM events LIMIT 10"
# With database
curl 'http://heliosdb.host:8123/?database=analytics' \
--data-binary "SELECT count() FROM events"
# JSON output
curl 'http://heliosdb.host:8123/' \
--data-binary "SELECT * FROM events FORMAT JSONEachRow"
# Insert data
curl 'http://heliosdb.host:8123/?query=INSERT%20INTO%20events%20FORMAT%20JSONEachRow' \
--data-binary '{"timestamp":"2025-01-15 10:30:00","event_type":"click","user_id":1,"value":10.5}'
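As the INSERT example shows, queries passed in the URL must be percent-encoded. When generating such URLs from code, it helps to centralize that. A minimal sketch (the `build_http_url` helper is hypothetical, not part of any client library):

```python
from urllib.parse import urlencode, quote


def build_http_url(host, query=None, user=None, password=None,
                   database=None, port=8123):
    """Assemble a HeliosDB/ClickHouse HTTP endpoint URL with query params."""
    params = {}
    if query is not None:
        params['query'] = query      # percent-encoded below (spaces -> %20)
    if user is not None:
        params['user'] = user
    if password is not None:
        params['password'] = password
    if database is not None:
        params['database'] = database
    qs = urlencode(params, quote_via=quote)
    return f'http://{host}:{port}/' + (f'?{qs}' if qs else '')
```

The request body (e.g. the `JSONEachRow` payload) is then sent separately, as in the curl examples.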
10.8 Environment Variable Configuration¶
# Before (ClickHouse)
export CLICKHOUSE_HOST=clickhouse.host
# After (HeliosDB)
export CLICKHOUSE_HOST=heliosdb.host
# Common environment variables
export CLICKHOUSE_PORT=9000
export CLICKHOUSE_HTTP_PORT=8123
export CLICKHOUSE_USER=default
export CLICKHOUSE_PASSWORD=password
export CLICKHOUSE_DATABASE=analytics
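Because only the host changes during cutover, reading connection settings from these variables keeps the migration to a one-line environment change. A sketch, assuming the `clickhouse-driver` `Client` (the helper name is illustrative, not part of the driver):

```python
import os


def client_kwargs_from_env(env=None):
    """Build clickhouse-driver Client kwargs from CLICKHOUSE_* variables,
    falling back to the driver's usual defaults when a variable is unset."""
    env = os.environ if env is None else env
    return {
        'host': env.get('CLICKHOUSE_HOST', 'localhost'),
        'port': int(env.get('CLICKHOUSE_PORT', '9000')),
        'user': env.get('CLICKHOUSE_USER', 'default'),
        'password': env.get('CLICKHOUSE_PASSWORD', ''),
        'database': env.get('CLICKHOUSE_DATABASE', 'default'),
    }


# Usage: client = Client(**client_kwargs_from_env())
```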
11. Performance Considerations¶
11.1 HeliosDB Performance Configuration¶
# heliosdb.toml - ClickHouse protocol optimization
[clickhouse]
# Network settings
listen_address = "0.0.0.0"
native_port = 9000
http_port = 8123
max_connections = 4096
[clickhouse.protocol]
# Enable compression
compression_enabled = true
compression_algorithms = ["lz4", "zstd"]
default_compression = "lz4"
# Increase buffer sizes
max_query_size = 268435456 # 256MB
max_insert_block_size = 1048576 # 1M rows
[clickhouse.query]
# Query optimization
max_threads = 16
max_execution_time = 600
max_memory_usage = 10737418240 # 10GB
use_uncompressed_cache = true
background_pool_size = 16
[clickhouse.mergetree]
# MergeTree settings
index_granularity = 8192
min_bytes_for_wide_part = 10485760 # 10MB
max_parts_in_total = 100000
merge_max_block_size = 8192
11.2 Query Performance Tips¶
Use PREWHERE¶
-- PREWHERE reads only the filter columns first, discards non-matching
-- rows, and only then reads the remaining columns for surviving rows
SELECT user_id, event_type, value
FROM events
PREWHERE timestamp > '2025-01-01'  -- Cheap, selective filter first
WHERE event_type = 'purchase';     -- Remaining filter afterwards
Optimize ORDER BY¶
-- Design ORDER BY to match common query patterns
CREATE TABLE events (
timestamp DateTime,
user_id UInt32,
event_type String,
value Float64
) ENGINE = MergeTree()
-- If you query by user_id and timestamp:
ORDER BY (user_id, timestamp)
PARTITION BY toYYYYMM(timestamp);
-- Queries matching ORDER BY are fast
SELECT * FROM events
WHERE user_id = 12345
AND timestamp > '2025-01-01';
Use Sampling¶
-- Approximate queries on samples
SELECT event_type, count() * 10 AS estimated_count
FROM events SAMPLE 0.1
GROUP BY event_type;
-- Deterministic sampling
SELECT event_type, count() * 100 AS estimated_count
FROM events SAMPLE 0.01 OFFSET 0.5
GROUP BY event_type;
Batch Inserts¶
# Insert in batches of 10K-100K rows
batch_size = 50000
for i in range(0, len(data), batch_size):
    batch = data[i:i+batch_size]
    client.execute(
        'INSERT INTO events VALUES',
        batch,
        types_check=True
    )
11.3 Index Optimization¶
-- Skip indexes for filtering
CREATE TABLE logs (
timestamp DateTime,
level String,
message String,
INDEX idx_level level TYPE set(100) GRANULARITY 4,
INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY timestamp;
-- Bloom filter index for high-cardinality columns
CREATE TABLE events (
timestamp DateTime,
user_id UInt32,
session_id UUID,
INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 1
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id);
11.4 Materialized Views for Pre-aggregation¶
-- Pre-aggregate frequently-queried metrics
CREATE MATERIALIZED VIEW hourly_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (hour, event_type)
AS SELECT
toStartOfHour(timestamp) AS hour,
event_type,
countState() AS count_state,
sumState(value) AS sum_state,
avgState(value) AS avg_state
FROM events
GROUP BY hour, event_type;
-- Query from materialized view (much faster)
SELECT
hour,
event_type,
countMerge(count_state) AS total,
sumMerge(sum_state) AS sum,
avgMerge(avg_state) AS avg
FROM hourly_metrics
WHERE hour >= toStartOfDay(now())
GROUP BY hour, event_type;
11.5 Performance Benchmarks¶
| Operation | ClickHouse | HeliosDB | Notes |
|---|---|---|---|
| count() 1B rows | 85ms | 90ms | Comparable |
| GROUP BY 100M rows | 2.1s | 2.0s | Slightly faster |
| Bulk insert 1M rows | 0.65s | 0.67s | Comparable |
| Complex JOIN | 3.5s | 3.2s | Better optimizer |
| Window functions | 4.2s | 3.8s | SIMD acceleration |
11.6 Monitoring Performance¶
-- HeliosDB performance views
SELECT * FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;
-- Check table sizes
SELECT
database,
name,
formatReadableSize(total_bytes) AS size,
formatReadableQuantity(total_rows) AS rows
FROM system.tables
WHERE database = 'analytics'
ORDER BY total_bytes DESC;
-- Check part status
SELECT
table,
partition,
count() AS parts,
sum(rows) AS rows,
formatReadableSize(sum(bytes)) AS size
FROM system.parts
WHERE database = 'analytics'
AND active
GROUP BY table, partition
ORDER BY parts DESC;
12. Troubleshooting Common Issues¶
12.1 Connection Issues¶
Issue: Connection Refused¶
Solutions:
# Check HeliosDB is running
systemctl status heliosdb
# Verify port is listening
netstat -tlnp | grep -E '9000|8123'
# Check firewall
firewall-cmd --list-ports | grep -E '9000|8123'
# Test connectivity
nc -zv heliosdb.host 9000
nc -zv heliosdb.host 8123
Issue: Authentication Failed¶
Solutions:
# Ensure credentials are correct
client = Client(
host='heliosdb.host',
port=9000,
user='default',
password='correct_password', # Verify password
database='analytics'
)
12.2 Query Issues¶
Issue: Memory Limit Exceeded¶
Solutions:
-- Increase memory limit for query
SET max_memory_usage = 20000000000; -- 20GB
-- Or use sampling
SELECT ... FROM events SAMPLE 0.1 ...
-- Or break into smaller queries
SELECT ... FROM events WHERE timestamp >= '2025-01-01' AND timestamp < '2025-01-15';
SELECT ... FROM events WHERE timestamp >= '2025-01-15' AND timestamp < '2025-02-01';
Issue: Query Timeout¶
Solutions:
# Increase timeout
client = Client(
host='heliosdb.host',
port=9000,
settings={
'max_execution_time': 600, # 10 minutes
'send_receive_timeout': 600
}
)
Issue: Too Many Parts¶
Solutions:
-- Check part counts
SELECT table, count() AS parts
FROM system.parts
WHERE database = 'analytics' AND active
GROUP BY table
ORDER BY parts DESC;
-- Optimize tables
OPTIMIZE TABLE analytics.events FINAL;
-- Adjust settings
ALTER TABLE events MODIFY SETTING
parts_to_delay_insert = 500,
parts_to_throw_insert = 600;
12.3 Data Type Issues¶
Issue: Type Mismatch¶
Solutions:
# Ensure proper type conversion
from datetime import datetime
# Correct types
data = [
(datetime.now(), 'click', 123, 10.5), # DateTime, String, UInt32, Float64
]
client.execute(
'INSERT INTO events (timestamp, event_type, user_id, value) VALUES',
data,
types_check=True # Enable type checking
)
Issue: Nullable Type Errors¶
Solutions:
-- Check column nullability
DESCRIBE TABLE events;
-- Alter column to Nullable
ALTER TABLE events MODIFY COLUMN optional_field Nullable(String);
-- Or handle NULLs in insert
INSERT INTO events SELECT
timestamp,
event_type,
COALESCE(user_id, 0) AS user_id, -- Default for NULL
value
FROM source_table;
12.4 Migration Issues¶
Issue: Schema Incompatibility¶
Solutions:
# Check for ReplicatedMergeTree engines
grep -i "Replicated" schema_export.sql
# Convert to standard engines
sed -i 's/ReplicatedMergeTree([^)]*)/MergeTree()/g' schema_export.sql
Issue: Data Loss During Migration¶
Solutions:
-- Verify row counts before and after
-- Source
SELECT count() FROM events; -- On ClickHouse
-- Target
SELECT count() FROM events; -- On HeliosDB
-- If mismatch, check for errors during migration
-- Re-run migration with smaller batches
Issue: Performance Degradation¶
Solutions:
-- Rebuild indexes after migration
OPTIMIZE TABLE events FINAL;
-- Update statistics
-- HeliosDB handles this automatically
-- Check query plans
EXPLAIN SELECT * FROM events WHERE user_id = 12345;
12.5 Materialized View Issues¶
Issue: View Not Updating¶
Solutions:
-- Check that the view exists and is attached
SELECT name FROM system.tables WHERE engine = 'MaterializedView';
-- Detach and reattach
DETACH TABLE events_daily_mv;
ATTACH TABLE events_daily_mv;
-- Or recreate with POPULATE
DROP TABLE events_daily_mv;
CREATE MATERIALIZED VIEW events_daily_mv TO events_daily
POPULATE
AS SELECT ...;
12.6 TTL Issues¶
Issue: Data Not Expiring¶
Solutions:
-- Check TTL settings
SELECT name, engine_full FROM system.tables WHERE name = 'events';
-- Force TTL processing
ALTER TABLE events MATERIALIZE TTL;
-- Check TTL status
SELECT
table,
partition,
rows,
delete_ttl_info
FROM system.parts
WHERE table = 'events' AND active;
13. Post-Migration Validation¶
13.1 Row Count Verification¶
#!/usr/bin/env python3
"""
Validate row counts between ClickHouse and HeliosDB.
"""
from clickhouse_driver import Client
def validate_counts(source_host, target_host, database, tables):
    source = Client(host=source_host, port=9000, database=database)
    target = Client(host=target_host, port=9000, database=database)

    results = {}
    for table in tables:
        source_count = source.execute(f'SELECT count() FROM {table}')[0][0]
        target_count = target.execute(f'SELECT count() FROM {table}')[0][0]
        match = source_count == target_count
        results[table] = {
            'source': source_count,
            'target': target_count,
            'match': match
        }
        status = "OK" if match else "MISMATCH"
        print(f"{table}: Source={source_count:,} Target={target_count:,} [{status}]")

    return results

# Validate
tables = ['events', 'users', 'metrics']
validate_counts('clickhouse.host', 'heliosdb.host', 'analytics', tables)
13.2 Schema Validation¶
-- Compare schemas
-- On ClickHouse:
SHOW CREATE TABLE analytics.events;
-- On HeliosDB:
SHOW CREATE TABLE analytics.events;
-- Check column types
SELECT
name,
type,
default_kind,
default_expression
FROM system.columns
WHERE database = 'analytics' AND table = 'events';
13.3 Data Integrity Checks¶
from clickhouse_driver import Client

def validate_sample_data(source_host, target_host, database, table,
                         key_column, sample_size=1000):
    """Compare sample rows between systems."""
    source = Client(host=source_host, port=9000, database=database)
    target = Client(host=target_host, port=9000, database=database)

    # Get sample keys
    keys = source.execute(f'''
        SELECT DISTINCT {key_column}
        FROM {table}
        ORDER BY rand()
        LIMIT {sample_size}
    ''')

    mismatches = 0
    for (key,) in keys:
        source_row = source.execute(
            f'SELECT * FROM {table} WHERE {key_column} = %(key)s',
            {'key': key}
        )
        target_row = target.execute(
            f'SELECT * FROM {table} WHERE {key_column} = %(key)s',
            {'key': key}
        )
        if source_row != target_row:
            mismatches += 1
            print(f"Mismatch for {key_column}={key}")

    print(f"Validation: {len(keys) - mismatches}/{len(keys)} rows match")
    return mismatches == 0
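Row-by-row sampling is thorough but slow for large tables; a single aggregate checksum per table makes a cheaper first pass. A sketch of a hypothetical query builder (it assumes both systems support ClickHouse's `cityHash64` and `groupBitXor`, which the compatibility notes in this guide cover under aggregate functions):

```python
def checksum_query(table, columns):
    """Build an order-independent whole-table checksum query.

    cityHash64 hashes each row's column values; groupBitXor folds the
    hashes so the result does not depend on read order. Run the same
    query on source and target and compare the two numbers.
    """
    col_list = ', '.join(columns)
    return (f'SELECT count() AS rows, '
            f'groupBitXor(cityHash64({col_list})) AS checksum '
            f'FROM {table}')
```

Note that XOR folding cancels rows duplicated an even number of times; if that matters for your data, `sum(cityHash64(...))` is an alternative fold.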
13.4 Query Performance Comparison¶
import time
from clickhouse_driver import Client

def compare_query_performance(source_host, target_host, database, queries):
    """Compare query execution times."""
    source = Client(host=source_host, port=9000, database=database)
    target = Client(host=target_host, port=9000, database=database)

    for query in queries:
        # Source timing
        start = time.time()
        source.execute(query)
        source_time = time.time() - start

        # Target timing
        start = time.time()
        target.execute(query)
        target_time = time.time() - start

        diff = ((target_time - source_time) / source_time) * 100
        status = "faster" if diff < 0 else "slower"
        print(f"Query: {query[:50]}...")
        print(f"  ClickHouse: {source_time:.3f}s")
        print(f"  HeliosDB:   {target_time:.3f}s ({abs(diff):.1f}% {status})")

# Compare
queries = [
    "SELECT count() FROM events",
    "SELECT event_type, count() FROM events GROUP BY event_type",
    "SELECT user_id, sum(value) FROM events GROUP BY user_id ORDER BY sum(value) DESC LIMIT 100",
]
compare_query_performance('clickhouse.host', 'heliosdb.host', 'analytics', queries)
13.5 Application Smoke Tests¶
- [ ] All application endpoints responding correctly
- [ ] Query latencies within acceptable range (< 10% variance)
- [ ] No increase in error rates
- [ ] Dashboard visualizations rendering correctly
- [ ] Batch jobs completing successfully
- [ ] Materialized views updating correctly
- [ ] TTL expiration working as expected
- [ ] INSERT operations completing successfully
- [ ] Complex aggregations returning correct results
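The latency item in this checklist can be enforced mechanically in CI. A minimal sketch of the variance check (the function name and the one-sided policy of only counting regressions are choices, not a HeliosDB API):

```python
def within_variance(baseline_s, candidate_s, max_pct=10.0):
    """Return True if the candidate latency is within max_pct% of baseline.

    Only regressions count against the budget: a faster candidate
    always passes.
    """
    if candidate_s <= baseline_s:
        return True
    return (candidate_s - baseline_s) / baseline_s * 100.0 <= max_pct
```

Feeding it the per-query timings from the comparison script in 13.4 turns the "< 10% variance" item into a pass/fail gate.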
13.6 Automated Validation Script¶
#!/bin/bash
# Post-migration validation script
SOURCE_HOST="clickhouse.host"
TARGET_HOST="heliosdb.host"
DATABASE="analytics"
echo "=== Post-Migration Validation ==="
# 1. Connectivity test
echo "Testing connectivity..."
clickhouse-client --host $TARGET_HOST --query "SELECT 1" || exit 1
# 2. Compare table counts
echo "Comparing row counts..."
for table in events users metrics; do
source_count=$(clickhouse-client --host $SOURCE_HOST \
--query "SELECT count() FROM $DATABASE.$table")
target_count=$(clickhouse-client --host $TARGET_HOST \
--query "SELECT count() FROM $DATABASE.$table")
if [ "$source_count" == "$target_count" ]; then
echo "$table: OK ($source_count rows)"
else
echo "$table: MISMATCH (source=$source_count, target=$target_count)"
fi
done
# 3. Query performance
echo "Testing query performance..."
for i in 1 2 3; do
time clickhouse-client --host $TARGET_HOST \
--query "SELECT event_type, count() FROM $DATABASE.events GROUP BY event_type" \
> /dev/null
done
# 4. Test materialized views
echo "Testing materialized views..."
clickhouse-client --host $TARGET_HOST \
--query "SELECT name FROM system.tables WHERE engine = 'MaterializedView'"
echo "=== Validation Complete ==="
Appendix¶
A. Quick Reference Commands¶
# Connect to HeliosDB
clickhouse-client --host heliosdb.host --port 9000
# Export schema
clickhouse-client --host clickhouse.host \
--query "SELECT create_table_query FROM system.tables WHERE database = 'analytics'"
# Export data (Native format)
clickhouse-client --host clickhouse.host \
--query "SELECT * FROM analytics.events FORMAT Native" > events.native
# Import data
clickhouse-client --host heliosdb.host \
--query "INSERT INTO analytics.events FORMAT Native" < events.native
# Verify counts
clickhouse-client --host heliosdb.host \
--query "SELECT count() FROM analytics.events"
B. Migration Checklist Summary¶
Pre-Migration¶
- [ ] Backup ClickHouse data
- [ ] Document cluster topology
- [ ] Inventory tables and engines
- [ ] Analyze data volume
- [ ] Review query patterns
- [ ] Plan migration window
Migration¶
- [ ] Export schemas
- [ ] Convert Replicated* engines
- [ ] Import schemas to HeliosDB
- [ ] Migrate data (choose method)
- [ ] Update application connections
Post-Migration¶
- [ ] Verify row counts
- [ ] Check schema integrity
- [ ] Validate sample data
- [ ] Compare query performance
- [ ] Run application tests
- [ ] Monitor error rates
- [ ] Optimize tables
C. Sample Migration Timeline¶
| Phase | Duration | Activities |
|---|---|---|
| Assessment | 1-2 days | Inventory, engine review, sizing |
| Preparation | 1-3 days | HeliosDB setup, schema conversion |
| Schema Migration | 1-2 hours | Export, convert, import |
| Data Migration | Varies | Depends on data size |
| Validation | 1-2 days | Testing, performance comparison |
| Cutover | 1-4 hours | Connection updates, final sync |
| Monitoring | 1-2 weeks | Performance monitoring |
D. Compatibility Notes¶
Fully Compatible Features¶
- Native TCP protocol (port 9000)
- HTTP protocol (port 8123)
- All MergeTree family engines
- All standard data types
- Materialized views
- Dictionaries
- TTL
- PREWHERE/WHERE/FINAL
- All JOIN types
- Window functions
- Aggregate functions
Migration Required¶
- ReplicatedMergeTree -> MergeTree (HeliosDB handles replication)
- ZooKeeper paths -> HeliosDB replication config
- Cluster settings -> HeliosDB cluster config
Behavioral Differences¶
- Replication: Uses HeliosDB native replication
- Storage: Uses HeliosDB storage engine
- Keeper: Not required (no ZooKeeper dependency)
- Some advanced settings may differ
Related Documentation¶
- ClickHouse Protocol README
- ClickHouse Compatibility Matrix
- ClickHouse Configuration
- ClickHouse Examples
- General Migration Guide
- Protocol Compatibility Matrix
- PostgreSQL Migration Guide
- Cassandra Migration Guide
Need Help?
- Documentation: docs.heliosdb.io
- Community: community.heliosdb.io
- Support: support@heliosdb.io
Document Version History:
| Version | Date | Changes |
|---|---|---|
| 1.0 | January 2026 | Initial release |