Cassandra to HeliosDB Migration Guide¶
Version: 1.0 Last Updated: January 2026
Table of Contents¶
- Introduction
- Compatibility Overview
- Pre-Migration Assessment
- Conceptual Mapping
- Step-by-Step Migration
- CQL Query Compatibility
- Feature Mapping
- Consistency and Replication
- Post-Migration Validation
- Performance Tuning
- Common Issues and Troubleshooting
1. Introduction¶
Why Migrate from Cassandra to HeliosDB?¶
HeliosDB provides full CQL wire protocol compatibility while offering significant advantages over Apache Cassandra:
Performance Benefits¶
- Lower Latency: HeliosDB's unified storage engine provides sub-millisecond query response times
- Higher Throughput: Optimized query execution delivers 100K+ operations per second
- Efficient Compression: Advanced compression algorithms (LZ4, Snappy) with better compression ratios
- Intelligent Caching: Automatic query and data caching without manual tuning
Operational Benefits¶
- Simplified Operations: No need for nodetool, repair operations, or complex tuning
- Zero-Downtime Scaling: Automatic rebalancing without manual intervention
- Unified Management: Single interface for all protocols (SQL, CQL, MongoDB, etc.)
- Built-in Monitoring: Comprehensive metrics without JMX configuration
Feature Enhancements¶
- Multi-Protocol Access: Access same data via SQL, CQL, REST, or GraphQL
- Advanced Analytics: Full SQL support including window functions and CTEs
- ACID Transactions: True ACID compliance beyond Cassandra's lightweight transactions
- AI Integration: Built-in ML/AI capabilities and natural language queries
Cost Reduction¶
- Lower Hardware Requirements: More efficient resource utilization
- Reduced Operational Overhead: Fewer administrators needed
- No Licensing Fees: Open-source with enterprise features included
Migration Compatibility¶
HeliosDB supports Apache Cassandra Native Protocol v4/v5 with 100% CQL language compatibility. This means:
- Existing Cassandra drivers work without modification
- CQL queries execute without changes
- Application code requires minimal to no changes
- Schema can be migrated directly
2. Compatibility Overview¶
CQL Wire Protocol Support¶
| Feature | Compatibility | Notes |
|---|---|---|
| Native Protocol v4 | 100% | Full support |
| Native Protocol v5 | 100% | Full support |
| Frame Compression (LZ4) | 100% | Recommended |
| Frame Compression (Snappy) | 100% | Supported |
| SASL Authentication | 100% | Password, LDAP |
| Prepared Statements | 100% | Full caching |
| Batch Operations | 100% | All batch types |
| Paging | 100% | Cursor-based |
| Event Notifications | 100% | SCHEMA_CHANGE, etc. |
Driver Compatibility Matrix¶
| Language | Driver | Version | Status |
|---|---|---|---|
| Java | DataStax Java Driver | 4.x | Fully Compatible |
| Python | cassandra-driver | 3.x | Fully Compatible |
| Node.js | cassandra-driver | 4.x | Fully Compatible |
| Go | gocql | 1.x | Fully Compatible |
| C# | CassandraCSharpDriver | 3.x | Fully Compatible |
| C++ | cpp-driver | 2.x | Fully Compatible |
| Rust | scylla-rust-driver | 0.x | Fully Compatible |
cqlsh Compatibility¶
# Connect to HeliosDB using standard cqlsh
cqlsh localhost 9042
# All cqlsh commands work identically
DESCRIBE KEYSPACES;
DESCRIBE TABLES;
TRACING ON;
3. Pre-Migration Assessment¶
3.1 Pre-Migration Checklist¶
- [ ] Complete backup of Cassandra cluster
- [ ] Document current cluster topology
- [ ] Inventory all keyspaces and tables
- [ ] Analyze partition key distribution
- [ ] Estimate total data volume
- [ ] Identify data with TTL requirements
- [ ] Review materialized views
- [ ] Catalog user-defined types (UDTs)
- [ ] List secondary indexes
- [ ] Document consistency level requirements
- [ ] Inventory application connection configurations
- [ ] Plan migration window and rollback strategy
- [ ] Test migration in development environment
3.2 Keyspace and Table Analysis¶
Export Keyspace Schema¶
# Export all keyspace schemas
cqlsh -e "DESCRIBE KEYSPACES" > keyspace_list.txt
# Export detailed schema for each keyspace
for ks in $(cqlsh -e "DESCRIBE KEYSPACES" | tr ' ' '\n' | grep -v '^$'); do
    cqlsh -e "DESCRIBE KEYSPACE $ks" > "schema_${ks}.cql"
done
# Export full cluster schema
cqlsh -e "DESCRIBE CLUSTER" > cluster_info.txt
cqlsh -e "DESCRIBE SCHEMA" > full_schema.cql
Analyze Table Structure¶
-- List all tables in a keyspace
SELECT table_name, id
FROM system_schema.tables
WHERE keyspace_name = 'your_keyspace';
-- Get table details
SELECT *
FROM system_schema.columns
WHERE keyspace_name = 'your_keyspace'
AND table_name = 'your_table';
-- Check for clustering columns
SELECT column_name, clustering_order, kind
FROM system_schema.columns
WHERE keyspace_name = 'your_keyspace'
AND table_name = 'your_table'
AND kind IN ('partition_key', 'clustering');
3.3 Data Volume Estimation¶
Cassandra Data Size Analysis¶
# Check data directory size
du -sh /var/lib/cassandra/data/*
# Per-table statistics using nodetool
nodetool tablestats your_keyspace.your_table
Query Table Statistics¶
-- Count rows (exact, but expensive; may time out on large tables)
SELECT count(*) FROM your_keyspace.your_table;
-- For large tables, use token ranges
SELECT count(*) FROM your_keyspace.your_table
WHERE token(partition_key) > token(0)
AND token(partition_key) <= token(1000000);
Storage Planning for HeliosDB¶
| Cassandra Size | Recommended HeliosDB Storage | Notes |
|---|---|---|
| < 100 GB | 1.2x Cassandra size | Compression efficiency |
| 100 GB - 1 TB | 1.0x - 1.1x | Index optimization |
| > 1 TB | 0.9x - 1.0x | Better compression ratio |
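The sizing table above can be turned into a quick planning helper. This is a sketch: the multipliers come directly from the table, and the function name is illustrative.

```python
def estimate_heliosdb_storage(cassandra_gb: float) -> tuple[float, float]:
    """Estimate HeliosDB storage needs as (low, high) bounds in GB,
    using the multipliers from the sizing table above."""
    if cassandra_gb < 100:
        low, high = 1.2, 1.2
    elif cassandra_gb <= 1024:
        low, high = 1.0, 1.1
    else:
        low, high = 0.9, 1.0
    return (cassandra_gb * low, cassandra_gb * high)
```

For example, a 500 GB Cassandra cluster plans out to roughly 500-550 GB on HeliosDB.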
3.4 Partition Key Analysis¶
Understanding partition key distribution is critical for migration planning:
-- Analyze partition key cardinality
-- (GROUP BY on the partition key requires Cassandra 3.10+)
SELECT partition_key, count(*) AS row_count
FROM your_keyspace.your_table
GROUP BY partition_key
LIMIT 1000;
-- Identify hot partitions
-- Note: CQL has no HAVING clause; fetch per-partition counts and
-- filter those above your threshold (e.g. 10,000 rows) client-side.
SELECT partition_key, count(*) AS row_count
FROM your_keyspace.your_table
GROUP BY partition_key;
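Because CQL lacks a HAVING clause, hot-partition detection happens client-side. A minimal sketch, with a fabricated sample standing in for rows a driver would yield:

```python
from collections import Counter

def find_hot_partitions(rows, threshold=10_000):
    """Tally rows per partition key client-side and return the keys
    whose row count exceeds the threshold."""
    counts = Counter(pk for pk, _ in rows)
    return {pk: n for pk, n in counts.items() if n > threshold}

# Hypothetical sample: (partition_key, row) pairs
sample = [("user-1", None)] * 15_000 + [("user-2", None)] * 42
hot = find_hot_partitions(sample)  # only "user-1" crosses the threshold
```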
Identify Large Partitions¶
# Use nodetool to find large partitions
nodetool tablehistograms your_keyspace your_table
# Check for partition tombstones
nodetool cfstats your_keyspace.your_table | grep -E "Tombstones|Partition"
3.5 Feature Usage Inventory¶
Document usage of the following features:
| Feature | Used? | Tables Affected | Notes |
|---|---|---|---|
| Materialized Views | | | |
| Secondary Indexes | | | |
| User-Defined Types | | | |
| Counter Columns | | | |
| TTL (Time-To-Live) | | | |
| Lightweight Transactions | | | |
| Collections (List/Set/Map) | | | |
| Custom Types | | | |
| UDFs/UDAs | | | |
4. Conceptual Mapping¶
4.1 Keyspaces to Schemas¶
Cassandra keyspaces map directly to HeliosDB schemas:
| Cassandra | HeliosDB | Notes |
|---|---|---|
| Keyspace | Schema | 1:1 mapping |
| Replication Strategy | Replication Config | Simplified management |
| Durable Writes | Transaction Mode | Always durable |
Keyspace Migration Example¶
-- Cassandra keyspace definition
CREATE KEYSPACE production WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 2
} AND DURABLE_WRITES = true;
-- HeliosDB equivalent (same syntax works)
CREATE KEYSPACE production WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 2
} AND DURABLE_WRITES = true;
-- Note: HeliosDB manages replication automatically
-- Replication strategy is honored for compatibility
4.2 Tables with Clustering Columns¶
Cassandra's wide-column model maps to HeliosDB's relational storage:
| Cassandra Concept | HeliosDB Mapping | Description |
|---|---|---|
| Partition Key | Primary Key (first column) | Hash distribution |
| Clustering Columns | Composite Primary Key | Sort order within partition |
| Regular Columns | Standard Columns | Data columns |
| Static Columns | Indexed Columns | Shared within partition |
Table Migration Example¶
-- Cassandra table with clustering columns
CREATE TABLE user_events (
user_id UUID,
event_date DATE,
event_time TIMESTAMP,
event_type TEXT,
event_data MAP<TEXT, TEXT>,
PRIMARY KEY ((user_id, event_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 86400
AND gc_grace_seconds = 864000;
-- HeliosDB: Same DDL works identically
CREATE TABLE user_events (
user_id UUID,
event_date DATE,
event_time TIMESTAMP,
event_type TEXT,
event_data MAP<TEXT, TEXT>,
PRIMARY KEY ((user_id, event_date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
AND default_time_to_live = 86400
AND gc_grace_seconds = 864000;
4.3 Materialized Views Mapping¶
Cassandra materialized views are fully supported:
-- Cassandra materialized view
CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL
AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);
-- HeliosDB: Identical syntax, automatic maintenance
CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL
AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);
-- HeliosDB advantage: Views can also be created as SQL views
-- for more complex transformations
CREATE VIEW users_with_orders AS
SELECT u.*, COUNT(o.order_id) as order_count
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id;
4.4 Data Type Mapping¶
| Cassandra Type | HeliosDB Type | Notes |
|---|---|---|
| ASCII | TEXT | UTF-8 normalized |
| BIGINT | BIGINT | Identical |
| BLOB | BYTEA | Binary storage |
| BOOLEAN | BOOLEAN | Identical |
| COUNTER | BIGINT + trigger | Counter semantics |
| DATE | DATE | Identical |
| DECIMAL | DECIMAL/NUMERIC | Arbitrary precision |
| DOUBLE | DOUBLE PRECISION | 64-bit IEEE |
| DURATION | INTERVAL | Time intervals |
| FLOAT | REAL | 32-bit IEEE |
| INET | INET | IPv4/IPv6 |
| INT | INTEGER | 32-bit signed |
| SMALLINT | SMALLINT | 16-bit signed |
| TEXT | TEXT | UTF-8 string |
| TIME | TIME | Time of day |
| TIMESTAMP | TIMESTAMPTZ | With timezone |
| TIMEUUID | UUID | Time-based UUID |
| TINYINT | SMALLINT | Mapped to 16-bit |
| UUID | UUID | 128-bit UUID |
| VARCHAR | TEXT | UTF-8 string |
| VARINT | NUMERIC | Arbitrary integer |
| LIST<T> | JSONB array | Ordered collection |
| SET<T> | JSONB array | Unique values |
| MAP<K,V> | JSONB object | Key-value pairs |
| TUPLE<...> | JSONB array | Fixed-length |
| FROZEN<T> | JSONB | Immutable |
| UDT | JSONB | User-defined type |
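For scripting schema translation, the scalar portion of this mapping can be captured as a small lookup table. This is an illustrative condensation of the table above; collection types and UDTs fall through to JSONB:

```python
# Scalar CQL types -> HeliosDB types, per the mapping table above
CQL_TO_HELIOS = {
    "ascii": "TEXT", "bigint": "BIGINT", "blob": "BYTEA",
    "boolean": "BOOLEAN", "counter": "BIGINT",  # plus trigger semantics
    "date": "DATE", "decimal": "NUMERIC", "double": "DOUBLE PRECISION",
    "duration": "INTERVAL", "float": "REAL", "inet": "INET",
    "int": "INTEGER", "smallint": "SMALLINT", "text": "TEXT",
    "time": "TIME", "timestamp": "TIMESTAMPTZ", "timeuuid": "UUID",
    "tinyint": "SMALLINT", "uuid": "UUID", "varchar": "TEXT",
    "varint": "NUMERIC",
}

def map_type(cql_type: str) -> str:
    """Translate a scalar CQL type name; anything else maps to JSONB."""
    return CQL_TO_HELIOS.get(cql_type.lower(), "JSONB")
```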
5. Step-by-Step Migration¶
5.1 Phase 1: Schema Translation¶
Export Cassandra Schema¶
# Export complete schema
cqlsh cassandra-host -e "DESCRIBE SCHEMA" > cassandra_schema.cql
# Export specific keyspace
cqlsh cassandra-host -e "DESCRIBE KEYSPACE production" > production_schema.cql
Schema Validation Script¶
#!/usr/bin/env python3
"""
Cassandra to HeliosDB Schema Validator
Validates CQL schema compatibility before migration.
"""
import sys

def validate_schema(schema_file):
    """Validate CQL schema for HeliosDB compatibility."""
    with open(schema_file, 'r') as f:
        schema = f.read()
    warnings = []
    # Check for unsupported features
    if 'CUSTOM INDEX' in schema.upper():
        warnings.append("CUSTOM INDEX: SASI indexes have partial support")
    if 'TINYINT' in schema.upper():
        warnings.append("TINYINT: Mapped to SMALLINT in HeliosDB")
    if 'COUNTER' in schema.upper():
        warnings.append("COUNTER: Uses trigger-based implementation")
    # Report findings
    if warnings:
        print("Schema Compatibility Warnings:")
        for w in warnings:
            print(f"  - {w}")
    else:
        print("Schema is fully compatible with HeliosDB")
    return len(warnings) == 0

if __name__ == "__main__":
    validate_schema(sys.argv[1])
Import Schema to HeliosDB¶
# Connect to HeliosDB and import schema
cqlsh heliosdb-host 9042 -f cassandra_schema.cql
# Verify schema import
cqlsh heliosdb-host 9042 -e "DESCRIBE KEYSPACES"
cqlsh heliosdb-host 9042 -e "DESCRIBE KEYSPACE production"
5.2 Phase 2: Data Export from Cassandra¶
Method 1: COPY TO (Recommended for < 100GB)¶
-- Export table data to CSV
COPY production.users TO '/data/export/users.csv'
WITH HEADER = true
AND DELIMITER = '|'
AND NULL = ''
AND MAXREQUESTS = 6
AND PAGESIZE = 5000;
-- Export with specific columns
COPY production.events (user_id, event_time, event_type, event_data)
TO '/data/export/events.csv'
WITH HEADER = true;
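The COPY options above (pipe delimiter, empty string for NULL, header row) define the CSV contract any downstream tooling must honor. A round-trip sketch with Python's csv module, under exactly those assumptions:

```python
import csv
import io

# Rows as a driver might yield them; None represents a CQL null
rows = [("id-1", "alice", None), ("id-2", None, "2026-01-01")]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")
writer.writerow(["user_id", "name", "created"])           # HEADER = true
for r in rows:
    writer.writerow(["" if v is None else v for v in r])  # NULL = ''

# Reading it back restores the original values
buf.seek(0)
reader = csv.reader(buf, delimiter="|")
header = next(reader)
restored = [tuple(None if v == "" else v for v in row) for row in reader]
```

Note the asymmetry this convention creates: a genuinely empty string and a null both serialize to `''`, so avoid it for text columns where the distinction matters.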
Method 2: Spark for Large Datasets¶
// Spark DataFrame export for large tables
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

val spark = SparkSession.builder()
  .appName("CassandraExport")
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()

// Read from Cassandra
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "production", "table" -> "users"))
  .load()

// Export to Parquet for efficient transfer
df.write
  .mode("overwrite")
  .parquet("/data/export/users_parquet")
Method 3: sstableloader (Advanced)¶
# Export SSTables (for very large datasets)
# Note: This requires direct access to Cassandra data directory
# First, flush memtables
nodetool flush production users
# Copy SSTable files
cp -r /var/lib/cassandra/data/production/users-*/* /data/export/sstables/
# For HeliosDB import, convert to CQL or CSV format
# HeliosDB provides conversion utility
heliosdb-tools convert-sstable \
--input /data/export/sstables \
--output /data/export/users.cql \
--keyspace production \
--table users
5.3 Phase 3: Data Import into HeliosDB¶
Method 1: COPY FROM¶
-- Import data from CSV
USE production;
COPY users FROM '/data/export/users.csv'
WITH HEADER = true
AND DELIMITER = '|'
AND NULL = ''
AND CHUNKSIZE = 1000;
-- Verify import
SELECT COUNT(*) FROM users;
Method 2: Bulk Insert with Python¶
#!/usr/bin/env python3
"""
Bulk data migration from Cassandra to HeliosDB.
"""
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

# Source: Cassandra
source_cluster = Cluster(['cassandra-host'])
source_session = source_cluster.connect('production')

# Target: HeliosDB
target_cluster = Cluster(['heliosdb-host'])
target_session = target_cluster.connect('production')

def migrate_table(table_name, batch_size=100):
    """Migrate a single table with batching."""
    # Get all rows from source
    rows = source_session.execute(f"SELECT * FROM {table_name}")
    # Prepare insert statement
    columns = rows.column_names
    placeholders = ', '.join(['?' for _ in columns])
    insert_cql = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"
    prepared = target_session.prepare(insert_cql)
    # Batch insert
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    count = 0
    total = 0
    for row in rows:
        batch.add(prepared, list(row))
        count += 1
        if count >= batch_size:
            target_session.execute(batch)
            total += count
            print(f"Migrated {total} rows...")
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            count = 0
    # Final batch
    if count > 0:
        target_session.execute(batch)
        total += count
    print(f"Migration complete: {total} rows migrated to {table_name}")
    return total

# Migrate tables
tables = ['users', 'events', 'orders']
for table in tables:
    migrate_table(table)

source_cluster.shutdown()
target_cluster.shutdown()
Method 3: Parallel Migration for Large Datasets¶
#!/usr/bin/env python3
"""
Parallel token-range migration for large datasets.
"""
from cassandra.cluster import Cluster
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_token_ranges(num_ranges=16):
    """Generate Murmur3 token ranges for parallel processing."""
    min_token = -2**63
    max_token = 2**63 - 1
    range_size = (max_token - min_token) // num_ranges
    ranges = []
    current = min_token
    for i in range(num_ranges):
        end = current + range_size if i < num_ranges - 1 else max_token
        ranges.append((current, end))
        current = end
    return ranges

def migrate_token_range(source_session, target_session, table, pk_column,
                        token_range, prepared):
    """Migrate a single token range."""
    start_token, end_token = token_range
    query = f"""
        SELECT * FROM {table}
        WHERE token({pk_column}) >= {start_token}
        AND token({pk_column}) < {end_token}
    """
    rows = source_session.execute(query)
    count = 0
    for row in rows:
        target_session.execute(prepared, list(row))
        count += 1
    return count

def parallel_migrate(table_name, partition_key_column, num_workers=8):
    """Migrate table using parallel token ranges."""
    source = Cluster(['cassandra-host']).connect('production')
    target = Cluster(['heliosdb-host']).connect('production')
    # Prepare the insert from the source table's column list
    columns = source.execute(f"SELECT * FROM {table_name} LIMIT 1").column_names
    placeholders = ', '.join(['?'] * len(columns))
    prepared = target.prepare(
        f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"
    )
    token_ranges = get_token_ranges(num_workers * 2)
    total_migrated = 0
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(migrate_token_range, source, target, table_name,
                            partition_key_column, tr, prepared)
            for tr in token_ranges
        ]
        for future in as_completed(futures):
            total_migrated += future.result()
            print(f"Progress: {total_migrated} rows migrated")
    return total_migrated
5.4 Phase 4: Application CQL Driver Configuration¶
Java Application Configuration¶
// Before (Cassandra)
CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("cassandra-host", 9042))
    .withLocalDatacenter("dc1")
    .withKeyspace("production")
    .build();

// After (HeliosDB) - Minimal changes
CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("heliosdb-host", 9042)) // Change host
    .withLocalDatacenter("datacenter1") // HeliosDB default datacenter
    .withKeyspace("production")
    .build();
Python Application Configuration¶
# Before (Cassandra)
from cassandra.cluster import Cluster
cluster = Cluster(['cassandra-node1', 'cassandra-node2', 'cassandra-node3'])
session = cluster.connect('production')
# After (HeliosDB) - Change contact points only
cluster = Cluster(['heliosdb-host']) # Single endpoint, HeliosDB handles routing
session = cluster.connect('production')
Node.js Application Configuration¶
// Before (Cassandra)
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
    contactPoints: ['cassandra-node1', 'cassandra-node2'],
    localDataCenter: 'dc1',
    keyspace: 'production'
});

// After (HeliosDB)
const client = new cassandra.Client({
    contactPoints: ['heliosdb-host'],
    localDataCenter: 'datacenter1',
    keyspace: 'production'
});
Go Application Configuration¶
// Before (Cassandra)
cluster := gocql.NewCluster("cassandra-node1", "cassandra-node2")
cluster.Keyspace = "production"
cluster.Consistency = gocql.Quorum
// After (HeliosDB)
cluster := gocql.NewCluster("heliosdb-host")
cluster.Keyspace = "production"
cluster.Consistency = gocql.Quorum // Consistency levels work identically
6. CQL Query Compatibility¶
6.1 Supported CQL Statements¶
Data Definition Language (DDL)¶
| Statement | Support | Example |
|---|---|---|
| CREATE KEYSPACE | Full | CREATE KEYSPACE ks WITH REPLICATION = {...} |
| ALTER KEYSPACE | Full | ALTER KEYSPACE ks WITH REPLICATION = {...} |
| DROP KEYSPACE | Full | DROP KEYSPACE IF EXISTS ks |
| CREATE TABLE | Full | CREATE TABLE t (id UUID PRIMARY KEY, ...) |
| ALTER TABLE | Full | ALTER TABLE t ADD column TEXT |
| DROP TABLE | Full | DROP TABLE IF EXISTS t |
| TRUNCATE | Full | TRUNCATE TABLE t |
| CREATE INDEX | Full | CREATE INDEX ON t (column) |
| DROP INDEX | Full | DROP INDEX idx_name |
| CREATE TYPE | Full | CREATE TYPE address (street TEXT, ...) |
| DROP TYPE | Full | DROP TYPE address |
| CREATE FUNCTION | Full | CREATE FUNCTION fn(...) ... |
| DROP FUNCTION | Full | DROP FUNCTION fn |
| CREATE MATERIALIZED VIEW | Full | CREATE MATERIALIZED VIEW mv AS ... |
| DROP MATERIALIZED VIEW | Full | DROP MATERIALIZED VIEW mv |
Data Manipulation Language (DML)¶
-- All standard DML operations are supported
-- SELECT with all clauses
SELECT * FROM users
WHERE user_id = ?
AND event_time >= ?
AND event_time < ?
ORDER BY event_time DESC
LIMIT 100;
-- SELECT with IN clause
SELECT * FROM users WHERE user_id IN (?, ?, ?);
-- SELECT with token range
SELECT * FROM users WHERE token(user_id) > ? LIMIT 1000;
-- SELECT with ALLOW FILTERING
SELECT * FROM users WHERE email = ? ALLOW FILTERING;
-- SELECT as JSON
SELECT JSON * FROM users WHERE user_id = ?;
-- INSERT with options
INSERT INTO users (user_id, name, email)
VALUES (?, ?, ?)
USING TTL 86400 AND TIMESTAMP 1234567890;
-- INSERT IF NOT EXISTS
INSERT INTO users (user_id, name) VALUES (?, ?)
IF NOT EXISTS;
-- UPDATE with conditions
UPDATE users SET email = ?
WHERE user_id = ?
IF name = ?;
-- DELETE with conditions
DELETE FROM users WHERE user_id = ? IF EXISTS;
-- BATCH operations
BEGIN BATCH
INSERT INTO users (user_id, name) VALUES (?, ?);
UPDATE user_stats SET login_count = login_count + 1 WHERE user_id = ?;
APPLY BATCH;
6.2 Lightweight Transactions (LWT)¶
HeliosDB fully supports Cassandra's lightweight transactions:
-- Insert if not exists
INSERT INTO users (user_id, username, email)
VALUES (uuid(), 'newuser', 'new@example.com')
IF NOT EXISTS;
-- Update with condition
UPDATE users SET email = 'verified@example.com'
WHERE user_id = 550e8400-e29b-41d4-a716-446655440000
IF email = 'unverified@example.com';
-- Multiple conditions
UPDATE users SET status = 'active'
WHERE user_id = ?
IF status = 'pending' AND created_at < ?;
-- Delete with existence check
DELETE FROM users
WHERE user_id = ?
IF EXISTS;
-- LWT with IN operator
UPDATE users SET verified = true
WHERE user_id = ?
IF status IN ('pending', 'unverified');
-- LWT returns applied status
-- [applied] | user_id | ...
-- True | ... | ...
-- False | ... | ... (shows current values if not applied)
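The IF condition is a compare-and-set: the write applies only if the current value matches, and a rejected write returns `[applied] = False` along with the row's current values. A pure-Python sketch of those semantics, with a dict standing in for a table:

```python
def lwt_update(table: dict, key, column: str, expected, new):
    """Simulate IF <column> = <expected>: apply only on a match, and
    report [applied] plus current values when the condition fails."""
    row = table.get(key)
    if row is not None and row.get(column) == expected:
        row[column] = new
        return {"[applied]": True}
    return {"[applied]": False, **(row or {})}

users = {"u1": {"email": "unverified@example.com"}}
ok = lwt_update(users, "u1", "email",
                "unverified@example.com", "verified@example.com")
# Second attempt fails: the stored value no longer matches the condition
fail = lwt_update(users, "u1", "email",
                  "unverified@example.com", "other@example.com")
```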
6.3 Time-Series Data Handling¶
HeliosDB excels at time-series workloads common in Cassandra:
-- Time-series table design (identical to Cassandra best practices)
CREATE TABLE sensor_readings (
sensor_id UUID,
bucket_date DATE,
reading_time TIMESTAMP,
temperature DOUBLE,
humidity DOUBLE,
pressure DOUBLE,
PRIMARY KEY ((sensor_id, bucket_date), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC)
AND default_time_to_live = 2592000; -- 30 days
-- Efficient time-range queries
SELECT * FROM sensor_readings
WHERE sensor_id = ?
AND bucket_date = ?
AND reading_time >= ?
AND reading_time < ?
ORDER BY reading_time DESC;
-- Aggregations (HeliosDB enhancement: more aggregate functions)
SELECT
sensor_id,
bucket_date,
AVG(temperature) as avg_temp,
MAX(temperature) as max_temp,
MIN(temperature) as min_temp,
COUNT(*) as reading_count
FROM sensor_readings
WHERE sensor_id = ? AND bucket_date = ?;
-- HeliosDB bonus: Window functions for time-series analysis
-- (Via SQL interface, same data accessible)
SELECT
reading_time,
temperature,
AVG(temperature) OVER (
PARTITION BY sensor_id
ORDER BY reading_time
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) as moving_avg
FROM sensor_readings
WHERE sensor_id = ? AND bucket_date = ?;
7. Feature Mapping¶
7.1 TTL Support¶
TTL (Time-To-Live) is fully supported at both row and column level:
-- Table-level default TTL
CREATE TABLE sessions (
session_id UUID PRIMARY KEY,
user_id UUID,
data BLOB
) WITH default_time_to_live = 3600; -- 1 hour
-- Row-level TTL on INSERT
INSERT INTO sessions (session_id, user_id, data)
VALUES (uuid(), ?, ?)
USING TTL 7200; -- 2 hours
-- Row-level TTL on UPDATE
UPDATE sessions USING TTL 1800
SET data = ?
WHERE session_id = ?;
-- Query TTL value
SELECT session_id, TTL(data) FROM sessions WHERE session_id = ?;
-- HeliosDB extension: Modify TTL without changing data
UPDATE sessions USING TTL 3600
SET data = data -- No-op assignment
WHERE session_id = ?;
7.2 Collections (Sets, Lists, Maps)¶
All collection types are supported with full operation semantics:
SET Operations¶
-- Create table with SET
CREATE TABLE user_tags (
user_id UUID PRIMARY KEY,
tags SET<TEXT>
);
-- Add to set
UPDATE user_tags SET tags = tags + {'premium', 'verified'}
WHERE user_id = ?;
-- Remove from set
UPDATE user_tags SET tags = tags - {'trial'}
WHERE user_id = ?;
-- Replace entire set
UPDATE user_tags SET tags = {'member'}
WHERE user_id = ?;
-- Query set contents
SELECT tags FROM user_tags WHERE user_id = ?;
LIST Operations¶
-- Create table with LIST
CREATE TABLE user_history (
user_id UUID PRIMARY KEY,
recent_pages LIST<TEXT>
);
-- Append to list
UPDATE user_history SET recent_pages = recent_pages + ['/dashboard']
WHERE user_id = ?;
-- Prepend to list
UPDATE user_history SET recent_pages = ['/home'] + recent_pages
WHERE user_id = ?;
-- Remove by value
UPDATE user_history SET recent_pages = recent_pages - ['/old-page']
WHERE user_id = ?;
-- Update by index
UPDATE user_history SET recent_pages[0] = '/new-home'
WHERE user_id = ?;
-- Delete by index
DELETE recent_pages[0] FROM user_history WHERE user_id = ?;
MAP Operations¶
-- Create table with MAP
CREATE TABLE user_preferences (
user_id UUID PRIMARY KEY,
settings MAP<TEXT, TEXT>
);
-- Add/update map entries
UPDATE user_preferences SET settings['theme'] = 'dark'
WHERE user_id = ?;
UPDATE user_preferences SET settings = settings + {'lang': 'en', 'tz': 'UTC'}
WHERE user_id = ?;
-- Remove map entries
DELETE settings['old_setting'] FROM user_preferences WHERE user_id = ?;
UPDATE user_preferences SET settings = settings - {'deprecated'}
WHERE user_id = ?;
-- Query map
SELECT settings['theme'] FROM user_preferences WHERE user_id = ?;
7.3 User-Defined Types (UDT)¶
UDTs are fully supported and stored efficiently as JSONB:
-- Create UDT
CREATE TYPE address (
street TEXT,
city TEXT,
state TEXT,
zip_code TEXT,
country TEXT
);
CREATE TYPE contact_info (
email TEXT,
phone TEXT,
address FROZEN<address>
);
-- Use UDT in table
CREATE TABLE customers (
customer_id UUID PRIMARY KEY,
name TEXT,
primary_contact FROZEN<contact_info>,
shipping_addresses LIST<FROZEN<address>>
);
-- Insert with UDT
INSERT INTO customers (customer_id, name, primary_contact, shipping_addresses)
VALUES (
uuid(),
'John Doe',
{
email: 'john@example.com',
phone: '+1-555-1234',
address: {
street: '123 Main St',
city: 'Springfield',
state: 'IL',
zip_code: '62701',
country: 'USA'
}
},
[
{street: '456 Oak Ave', city: 'Chicago', state: 'IL', zip_code: '60601', country: 'USA'}
]
);
-- Query UDT fields
SELECT
name,
primary_contact.email,
primary_contact.address.city
FROM customers
WHERE customer_id = ?;
-- Update UDT field
UPDATE customers
SET primary_contact.phone = '+1-555-5678'
WHERE customer_id = ?;
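Because UDTs are stored as JSONB, dotted field access such as primary_contact.address.city becomes a nested key lookup in the stored document. A sketch of that storage model, with plain dicts standing in for UDT instances:

```python
import json

# A nested UDT value as a driver might surface it (dicts are stand-ins
# for the contact_info and address UDTs from the DDL above)
contact = {
    "email": "john@example.com",
    "phone": "+1-555-1234",
    "address": {"street": "123 Main St", "city": "Springfield",
                "state": "IL", "zip_code": "62701", "country": "USA"},
}

# Serialized once as JSONB; field access is then nested key lookup
blob = json.dumps(contact)
city = json.loads(blob)["address"]["city"]
```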
7.4 Counter Columns¶
Counter columns are supported with atomic increment/decrement:
-- Counter table
CREATE TABLE page_views (
page_id TEXT PRIMARY KEY,
view_count COUNTER,
unique_visitors COUNTER
);
-- Increment counters
UPDATE page_views SET view_count = view_count + 1 WHERE page_id = '/home';
UPDATE page_views SET unique_visitors = unique_visitors + 1 WHERE page_id = '/home';
-- Decrement counters
UPDATE page_views SET view_count = view_count - 1 WHERE page_id = '/home';
-- Counter batches
BEGIN COUNTER BATCH
UPDATE page_views SET view_count = view_count + 1 WHERE page_id = '/home';
UPDATE page_views SET view_count = view_count + 1 WHERE page_id = '/about';
APPLY BATCH;
-- Note: Counter limitations (same as Cassandra)
-- - Counter columns cannot be indexed
-- - Counter tables can only contain counter columns plus the primary key
-- - Cannot mix counter and non-counter columns
-- - Cannot use TTL with counters
7.5 Secondary Indexes¶
-- Create secondary index
CREATE INDEX users_by_email ON users (email);
-- Create index with custom name
CREATE INDEX idx_users_country ON users (country);
-- Index on collection elements
CREATE INDEX ON users (KEYS(preferences)); -- Index map keys
CREATE INDEX ON users (VALUES(tags)); -- Index set/list values
CREATE INDEX ON users (ENTRIES(preferences)); -- Index map entries
-- Query using index
SELECT * FROM users WHERE email = 'user@example.com';
-- Drop index
DROP INDEX IF EXISTS users_by_email;
8. Consistency and Replication¶
8.1 Consistency Level Mapping¶
All Cassandra consistency levels are supported:
| Consistency Level | HeliosDB Behavior | Recommendation |
|---|---|---|
| ANY | Write acknowledged | Development only |
| ONE | Single replica | Low-latency reads |
| TWO | Two replicas | Moderate durability |
| THREE | Three replicas | High durability |
| QUORUM | Majority replicas | Recommended default |
| ALL | All replicas | Maximum consistency |
| LOCAL_QUORUM | Local DC quorum | Multi-DC deployments |
| EACH_QUORUM | Quorum in each DC | Strong multi-DC |
| LOCAL_ONE | Single local replica | DC-aware reads |
| SERIAL | Linearizable (LWT) | LWT operations |
| LOCAL_SERIAL | Local linearizable | Local LWT |
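QUORUM requires floor(RF/2) + 1 acknowledgements, and EACH_QUORUM applies that rule per datacenter. A quick calculator for capacity planning:

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge at QUORUM: floor(RF/2) + 1."""
    return replication_factor // 2 + 1

def each_quorum(dc_replication: dict[str, int]) -> dict[str, int]:
    """Per-datacenter quorum sizes, as required by EACH_QUORUM."""
    return {dc: quorum(rf) for dc, rf in dc_replication.items()}
```

For example, with RF=3 a quorum is 2 replicas, so one node per replica set can be down without losing QUORUM availability.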
Setting Consistency Levels¶
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Per-query consistency
# Note: non-prepared statements use %s placeholders in the Python
# driver; '?' is only valid in prepared statements.
query = SimpleStatement(
    "SELECT * FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM
)
session.execute(query, [user_id])

# Session default consistency
from cassandra.cluster import ExecutionProfile
from cassandra.policies import RoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=RoundRobinPolicy(),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL
)
cluster = Cluster(
    ['heliosdb-host'],
    execution_profiles={'default': profile}
)
8.2 Replication Strategy Configuration¶
-- SimpleStrategy (single datacenter)
CREATE KEYSPACE development WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 1
};
-- NetworkTopologyStrategy (multi-datacenter)
CREATE KEYSPACE production WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3,
'dc3': 2
};
-- Modify replication
ALTER KEYSPACE production WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3,
'dc3': 3
};
8.3 HeliosDB Replication Advantages¶
Unlike Cassandra, HeliosDB provides:
- Automatic Rebalancing: No manual repair operations needed
- Faster Replica Sync: Streaming replication with parallel recovery
- Cross-Protocol Replication: Data replicated across protocol interfaces
- Integrated Backup: Point-in-time recovery built-in
9. Post-Migration Validation¶
9.1 Data Integrity Validation¶
Row Count Verification¶
#!/usr/bin/env python3
"""
Validate data integrity after migration.
"""
from cassandra.cluster import Cluster

def validate_row_counts(keyspace, tables):
    """Compare row counts between Cassandra and HeliosDB."""
    source = Cluster(['cassandra-host']).connect(keyspace)
    target = Cluster(['heliosdb-host']).connect(keyspace)
    results = {}
    for table in tables:
        source_count = source.execute(f"SELECT COUNT(*) FROM {table}").one()[0]
        target_count = target.execute(f"SELECT COUNT(*) FROM {table}").one()[0]
        results[table] = {
            'source': source_count,
            'target': target_count,
            'match': source_count == target_count
        }
        status = "OK" if results[table]['match'] else "MISMATCH"
        print(f"{table}: Source={source_count}, Target={target_count} [{status}]")
    source.cluster.shutdown()
    target.cluster.shutdown()
    return results

# Validate all tables
tables = ['users', 'events', 'orders', 'products']
validate_row_counts('production', tables)
Data Sampling Validation¶
def validate_sample_data(keyspace, table, primary_key_column, sample_size=100):
    """Compare sample rows between systems."""
    source = Cluster(['cassandra-host']).connect(keyspace)
    target = Cluster(['heliosdb-host']).connect(keyspace)
    # Get sample primary keys from source
    sample_query = f"SELECT {primary_key_column} FROM {table} LIMIT {sample_size}"
    sample_keys = [row[0] for row in source.execute(sample_query)]
    mismatches = []
    # Non-prepared statements use %s placeholders in the Python driver
    for key in sample_keys:
        source_row = source.execute(
            f"SELECT * FROM {table} WHERE {primary_key_column} = %s", [key]
        ).one()
        target_row = target.execute(
            f"SELECT * FROM {table} WHERE {primary_key_column} = %s", [key]
        ).one()
        if source_row != target_row:
            mismatches.append({
                'key': key,
                'source': dict(source_row._asdict()) if source_row else None,
                'target': dict(target_row._asdict()) if target_row else None
            })
    if mismatches:
        print(f"Found {len(mismatches)} mismatches in {table}")
        for m in mismatches[:5]:  # Show first 5
            print(f"  Key: {m['key']}")
    else:
        print(f"{table}: All {sample_size} samples match")
    return mismatches
9.2 Schema Validation¶
-- Compare schema definitions
-- On Cassandra:
DESCRIBE KEYSPACE production;
-- On HeliosDB:
DESCRIBE KEYSPACE production;
-- Verify specific table
DESCRIBE TABLE production.users;
-- Check indexes
SELECT * FROM system_schema.indexes WHERE keyspace_name = 'production';
-- Check materialized views
SELECT * FROM system_schema.views WHERE keyspace_name = 'production';
-- Check UDTs
SELECT * FROM system_schema.types WHERE keyspace_name = 'production';
9.3 Application Validation Checklist¶
- [ ] All application endpoints responding correctly
- [ ] Query latencies within acceptable range
- [ ] No increase in error rates
- [ ] Authentication working for all users
- [ ] Batch operations completing successfully
- [ ] TTL expiration working correctly
- [ ] Counter increments accurate
- [ ] Collection operations (add/remove) working
- [ ] LWT operations returning expected results
- [ ] Prepared statements caching effectively
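Each checklist item can be expressed as a small callable and re-run on every cutover rehearsal. A minimal harness sketch (`run_checks` is a hypothetical helper; the commented check functions are placeholders you would implement against a live session):

```python
def run_checks(checks):
    """Run (name, callable) pairs; a check passes if it raises no exception."""
    results = {}
    for name, fn in checks:
        try:
            fn()
            results[name] = 'PASS'
        except Exception as exc:
            results[name] = f'FAIL: {exc}'
    return results

# Example wiring (placeholder checks; assert inside each against real data):
#   checks = [
#       ('ttl_expiry', lambda: check_ttl(session)),
#       ('counter_accuracy', lambda: check_counters(session)),
#       ('lwt_results', lambda: check_lwt(session)),
#   ]
#   print(run_checks(checks))
```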
9.4 Automated Validation Script¶
#!/bin/bash
# Post-migration validation script
CASSANDRA_HOST="cassandra-host"
HELIOSDB_HOST="heliosdb-host"
KEYSPACE="production"
echo "=== Post-Migration Validation ==="
# 1. Connectivity test
echo "Testing connectivity..."
cqlsh $HELIOSDB_HOST -e "SELECT now() FROM system.local" || exit 1
# 2. Schema comparison
echo "Comparing schemas..."
cqlsh $CASSANDRA_HOST -e "DESCRIBE KEYSPACE $KEYSPACE" > /tmp/cassandra_schema.cql
cqlsh $HELIOSDB_HOST -e "DESCRIBE KEYSPACE $KEYSPACE" > /tmp/heliosdb_schema.cql
diff /tmp/cassandra_schema.cql /tmp/heliosdb_schema.cql
# 3. Row count comparison
echo "Comparing row counts..."
for table in users events orders; do
  cassandra_count=$(cqlsh $CASSANDRA_HOST -e "SELECT COUNT(*) FROM $KEYSPACE.$table" | grep -oE '[0-9]+' | head -1)
  heliosdb_count=$(cqlsh $HELIOSDB_HOST -e "SELECT COUNT(*) FROM $KEYSPACE.$table" | grep -oE '[0-9]+' | head -1)
echo "$table: Cassandra=$cassandra_count, HeliosDB=$heliosdb_count"
done
# 4. Query performance test
echo "Testing query performance..."
time cqlsh $HELIOSDB_HOST -e "SELECT * FROM $KEYSPACE.users LIMIT 1000"
echo "=== Validation Complete ==="
10. Performance Tuning¶
10.1 HeliosDB Configuration Optimization¶
# heliosdb.toml - Cassandra protocol optimization
[cassandra]
# Network settings
listen_address = "0.0.0.0"
port = 9042
max_connections = 2000
[cassandra.protocol]
# Enable compression for better throughput
compression_enabled = true
compression_algorithms = ["lz4"]
# Increase frame size for large batches
max_frame_size = 536870912 # 512MB
[cassandra.query]
# Optimize prepared statement cache
prepared_cache_enabled = true
prepared_cache_size = 5000
# Adjust page size for your workload
default_page_size = 10000
max_page_size = 100000
[cassandra.batch]
# Increase batch limits
max_batch_size = 100000
batch_timeout = 30000
[cassandra.performance]
# Enable query cache for repeated queries
query_cache_enabled = true
query_cache_size = 512 # MB
# Log slow queries for optimization
log_slow_queries = true
slow_query_threshold_ms = 100
10.2 Client-Side Optimization¶
Connection Pool Tuning (Java)¶
import java.net.InetSocketAddress;
import java.time.Duration;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

CqlSession session = CqlSession.builder()
    .addContactPoint(new InetSocketAddress("heliosdb-host", 9042))
    .withLocalDatacenter("datacenter1")
    .withConfigLoader(
        DriverConfigLoader.programmaticBuilder()
            .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 4)
            .withInt(DefaultDriverOption.CONNECTION_POOL_REMOTE_SIZE, 2)
            .withInt(DefaultDriverOption.CONNECTION_MAX_REQUESTS, 32768)
            .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(10))
            .build())
    .build();
Connection Pool Tuning (Python)¶
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import tuple_factory

# Optimize execution profile
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc='datacenter1')
    ),
    request_timeout=10.0,
    row_factory=tuple_factory  # More efficient than named tuples
)
cluster = Cluster(
    ['heliosdb-host'],
    execution_profiles={'default': profile},
    protocol_version=5,
    compression=True,
    idle_heartbeat_interval=30
)
10.3 Query Optimization Tips¶
Use Prepared Statements¶
# Always use prepared statements for repeated queries
prepared = session.prepare(
    "SELECT * FROM users WHERE user_id = ? AND event_time >= ? AND event_time < ?"
)

# Execute with parameters
for user_id in user_ids:
    result = session.execute(prepared, [user_id, start_time, end_time])
Optimize Batch Operations¶
from cassandra.query import BatchStatement, BatchType

# Use UNLOGGED batches for better performance when atomicity is not required
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
pending = 0

# Keep batches a reasonable size (100-500 statements)
for row in rows:
    batch.add(prepared, row)
    pending += 1
    if pending == 100:
        session.execute(batch)
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        pending = 0

# Flush the final partial batch
if pending:
    session.execute(batch)
Avoid Anti-Patterns¶
-- AVOID: Full table scans
SELECT * FROM users; -- Bad
SELECT * FROM users WHERE status = 'active' ALLOW FILTERING; -- Bad
-- BETTER: Use partition keys
SELECT * FROM users WHERE user_id = ?; -- Good
SELECT * FROM users WHERE user_id IN (?, ?, ?); -- Good
-- AVOID: Large IN clauses
SELECT * FROM users WHERE user_id IN (/* 1000 values */); -- Bad
-- BETTER: Multiple queries or token range
SELECT * FROM users WHERE token(user_id) >= ? AND token(user_id) < ?;
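The token-range pattern in the last query can be driven from the client by splitting the token space into contiguous slices and issuing one bounded query per slice. A sketch assuming `Murmur3Partitioner` (Cassandra's default, whose tokens span the full signed 64-bit range); `token_subranges` is a hypothetical helper:

```python
MIN_TOKEN = -2**63        # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1     # Murmur3Partitioner maximum token

def token_subranges(n):
    """Split the token space into n contiguous inclusive (start, end) slices."""
    span = (MAX_TOKEN - MIN_TOKEN) // n
    starts = [MIN_TOKEN + i * span for i in range(n)]
    return [(start, starts[i + 1] - 1 if i + 1 < n else MAX_TOKEN)
            for i, start in enumerate(starts)]

# Each slice becomes one bounded query, e.g. with the Python driver:
#   stmt = session.prepare(
#       "SELECT * FROM users WHERE token(user_id) >= ? AND token(user_id) <= ?")
#   for start, end in token_subranges(16):
#       for row in session.execute(stmt, (start, end)):
#           process(row)
```

Inclusive bounds are used so the last slice's end stays within the bigint range; the slices can also be handed to a thread pool for a parallel full scan.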
10.4 Monitoring Performance¶
-- HeliosDB performance views (accessible via SQL interface)
SELECT * FROM heliosdb_stats.cql_query_latency ORDER BY avg_latency_ms DESC LIMIT 10;
SELECT * FROM heliosdb_stats.cql_throughput WHERE timestamp > now() - interval '1 hour';
-- Via CQL
SELECT * FROM system.local; -- Connection info
SELECT * FROM system.peers; -- Cluster topology
11. Common Issues and Troubleshooting¶
11.1 Connection Issues¶
Issue: Connection Refused¶
Solutions:
# Check HeliosDB is running
systemctl status heliosdb
# Verify port is listening
netstat -tlnp | grep 9042
# Check firewall
firewall-cmd --list-ports | grep 9042
# Test connectivity
nc -zv heliosdb-host 9042
Issue: Authentication Failed¶
Solutions:
# Ensure credentials are correct
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth = PlainTextAuthProvider(
    username='your_username',
    password='your_password'
)
cluster = Cluster(['heliosdb-host'], auth_provider=auth)
11.2 Schema Issues¶
Issue: Keyspace Not Found¶
Solutions:
-- Create the keyspace if it is missing
CREATE KEYSPACE IF NOT EXISTS production WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
-- Or import the full schema first (from a shell):
cqlsh heliosdb-host -f schema.cql
Issue: Type Mismatch¶
Solutions:
# Ensure proper type conversion
import uuid
from datetime import datetime, timezone

# For UUID columns
user_id = uuid.UUID('550e8400-e29b-41d4-a716-446655440000')

# For TIMESTAMP columns (timezone-aware UTC avoids ambiguity)
event_time = datetime.now(timezone.utc)

# For collections
tags = {'tag1', 'tag2'}       # Set
prefs = {'key': 'value'}      # Map
items = ['item1', 'item2']    # List
11.3 Query Issues¶
Issue: Timeout on Large Queries¶
Solutions:
# Increase the request timeout
from cassandra.cluster import Cluster, ExecutionProfile
from cassandra.query import SimpleStatement

profile = ExecutionProfile(request_timeout=30.0)  # 30 seconds
cluster = Cluster(
    ['heliosdb-host'],
    execution_profiles={'default': profile}
)

# Or use paging for large results
query = SimpleStatement(
    "SELECT * FROM large_table",
    fetch_size=1000  # Page size
)
for row in session.execute(query):
    process(row)
Issue: ALLOW FILTERING Warning¶
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance
Solutions:
-- Add index for frequently filtered columns
CREATE INDEX ON users (email);
-- Or redesign schema with proper partition keys
-- Instead of:
SELECT * FROM users WHERE status = 'active' ALLOW FILTERING;
-- Create a table partitioned by status:
CREATE TABLE users_by_status (
    status TEXT,
    user_id UUID,
    name TEXT,
    PRIMARY KEY (status, user_id)
);
11.4 Data Migration Issues¶
Issue: Migration Performance Degradation¶
Solutions:
# Use parallel migration
from concurrent.futures import ThreadPoolExecutor

def migrate_partition(partition_key):
    # Migrate single partition
    pass

with ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(migrate_partition, partition_keys)
Issue: Out of Memory During Migration¶
Solutions:
# Stream results with server-side paging instead of loading all data
from cassandra.query import SimpleStatement

rows = session.execute(SimpleStatement("SELECT * FROM large_table", fetch_size=1000))

# Process in chunks
batch = []
for row in rows:
    batch.append(row)
    if len(batch) >= 1000:
        process_batch(batch)
        batch = []

# Process the final partial chunk
if batch:
    process_batch(batch)
11.5 Performance Issues¶
Issue: High Latency After Migration¶
Diagnosis: Review the slow query log (enabled via log_slow_queries in heliosdb.toml) to identify which statements regressed.
Solutions:
1. Check that indexes were created correctly
2. Verify partition key access patterns
3. Enable query caching
4. Increase the connection pool size
Issue: Inconsistent Query Results¶
Solutions:
# Use appropriate consistency level
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
query = SimpleStatement(
    "SELECT * FROM users WHERE user_id = %s",  # simple statements take %s placeholders in the Python driver
    consistency_level=ConsistencyLevel.QUORUM
)
result = session.execute(query, [user_id])
11.6 Collection Operation Issues¶
Issue: Collection Update Failures¶
Solutions:
-- Frozen collections and UDTs are immutable; replace the whole value
-- Instead of:
UPDATE users SET address.street = 'New St' WHERE user_id = ?; -- Fails if frozen
-- Replace the entire frozen field:
UPDATE users SET address = {street: 'New St', city: 'Chicago', ...}
WHERE user_id = ?;
-- Or keep mutable fields as regular columns outside the frozen type;
-- note that ALTER TYPE can ADD or RENAME fields but cannot DROP them
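Replacing a whole frozen value means read-modify-write from the client. A small sketch of the merge step (`merged_udt` is a hypothetical helper; the Python driver accepts a plain dict when binding a UDT value):

```python
def merged_udt(current, changes):
    """Build the full replacement value for a frozen UDT column.

    Frozen values must be written whole, so merge the requested field
    changes into the currently stored value.
    """
    value = dict(current) if current else {}
    value.update(changes)
    return value

# Usage sketch against a live session (read, merge, write back whole):
#   row = session.execute(select_address, [user_id]).one()
#   new_address = merged_udt(row.address._asdict() if row.address else None,
#                            {'street': 'New St'})
#   session.execute(update_address, [new_address, user_id])
```

Note the read-then-write is not atomic; under concurrent writers, guard it with an LWT (`IF address = ?`) or accept last-writer-wins.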
11.7 TTL Issues¶
Issue: Data Expiring Unexpectedly¶
Diagnosis:
-- Check TTL values
SELECT user_id, TTL(name), TTL(email) FROM users WHERE user_id = ?;
-- Check table default TTL
DESCRIBE TABLE users;
Solutions:
-- Reset TTL by rewriting the data; CQL cannot reference a column in its
-- own assignment, so read the current values first and re-supply them
UPDATE users USING TTL 86400
SET name = ?, email = ?
WHERE user_id = ?;
-- Remove TTL (TTL 0 writes the data without an expiration)
UPDATE users USING TTL 0
SET name = ?
WHERE user_id = ?;
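Because a TTL refresh rewrites the row with values read from the client, the UPDATE can be generated per table. A minimal sketch (`ttl_refresh_statement` is a hypothetical helper):

```python
def ttl_refresh_statement(table, key_column, columns, ttl_seconds):
    """Build an UPDATE that rewrites columns under a new TTL.

    The caller reads the current values first and supplies them at
    execute time; TTL 0 writes the data without an expiration.
    """
    assignments = ', '.join(f"{c} = ?" for c in columns)
    return (f"UPDATE {table} USING TTL {int(ttl_seconds)} "
            f"SET {assignments} WHERE {key_column} = ?")

# Usage sketch:
#   stmt = session.prepare(ttl_refresh_statement('users', 'user_id',
#                                                ['name', 'email'], 86400))
#   row = session.execute(select_stmt, [user_id]).one()
#   session.execute(stmt, [row.name, row.email, user_id])
```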
Quick Reference Card¶
Essential Commands¶
| Task | Command |
|---|---|
| Connect | cqlsh heliosdb-host 9042 |
| List keyspaces | DESCRIBE KEYSPACES; |
| Use keyspace | USE keyspace_name; |
| List tables | DESCRIBE TABLES; |
| Table schema | DESCRIBE TABLE table_name; |
| Enable tracing | TRACING ON; |
| Import schema | cqlsh -f schema.cql |
Migration Checklist Summary¶
- [ ] Pre-migration assessment complete
- [ ] Schema exported and validated
- [ ] Schema imported to HeliosDB
- [ ] Data exported from Cassandra
- [ ] Data imported to HeliosDB
- [ ] Row counts validated
- [ ] Sample data verified
- [ ] Application configuration updated
- [ ] Application tested end-to-end
- [ ] Performance benchmarked
- [ ] Monitoring configured
- [ ] Rollback plan documented
Related Documentation¶
- Cassandra Protocol README - Protocol overview
- Cassandra Compatibility Matrix - Feature compatibility
- Cassandra Configuration - Configuration reference
- Cassandra CQL Examples - Usage examples
- General Migration Guide - Migration best practices
- Protocol Compatibility Matrix - Cross-protocol comparison
Document Version: 1.0 Last Updated: January 2026 Compatibility: HeliosDB 7.0+, Apache Cassandra 3.x/4.x