HeliosDB Query Optimizer¶
Version: 7.0 | Status: Production Ready | Last Updated: 2026-01-04
Overview¶
The HeliosDB Query Optimizer is a sophisticated cost-based optimizer that transforms SQL queries into efficient execution plans. It combines traditional optimization techniques with advanced features like neural network-based planning, quantum-inspired optimization, and machine learning-driven cardinality estimation.
Key Capabilities¶
- Cost-Based Optimization: Evaluates multiple execution strategies and selects the lowest-cost plan
- Statistics-Driven Planning: Uses table and column statistics for accurate cost estimation
- Adaptive Query Processing: Adjusts plans at runtime based on actual data characteristics
- Multi-Model Support: Optimizes across SQL, document, graph, and time-series workloads
- Distributed Query Planning: Optimizes queries across sharded and replicated data
Architecture¶
                 +------------------+
                 |    SQL Parser    |
                 +--------+---------+
                          |
                          v
                 +------------------+
                 | Logical Planner  |
                 +--------+---------+
                          |
                          v
+------------+   +------------------+   +---------------+
| Statistics |-->| Query Optimizer  |<--| Configuration |
|  Catalog   |   +--------+---------+   |  Parameters   |
+------------+            |             +---------------+
                          v
                 +------------------+
                 | Physical Planner |
                 +--------+---------+
                          |
                          v
                 +------------------+
                 | Execution Engine |
                 +------------------+
Optimization Phases¶
- Parsing: SQL text converted to Abstract Syntax Tree (AST)
- Semantic Analysis: Validates references, resolves types
- Logical Planning: Creates initial logical plan tree
- Logical Optimization: Applies rule-based transformations
- Physical Planning: Generates physical execution strategies
- Cost-Based Selection: Evaluates costs, selects optimal plan
- Execution: Runs the chosen plan
Cost-Based Optimizer¶
The cost-based optimizer evaluates candidate query plans with a configurable cost model and selects the cheapest.
Cost Model Components¶
-- View optimizer cost parameters
SHOW ALL LIKE '%cost%';
/*
Parameter | Value | Description
---------------------|-------|----------------------------------
seq_page_cost | 1.0 | Cost to read one page sequentially
random_page_cost | 4.0 | Cost to read one page randomly
cpu_tuple_cost | 0.01 | Cost to process one tuple
cpu_operator_cost | 0.0025| Cost to execute one operator
cpu_index_tuple_cost | 0.005 | Cost to process one index entry
parallel_setup_cost | 1000 | Cost to start parallel workers
parallel_tuple_cost | 0.1 | Cost to transfer tuple to leader
*/
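The parameters above combine linearly. The following sketch shows how a sequential scan and an index scan would be costed from them; the formulas and function names are illustrative, not HeliosDB's exact model:

```python
# Illustrative cost arithmetic using the parameters listed above.
# These formulas are a sketch, not HeliosDB's exact cost model.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005

def seq_scan_cost(pages: int, rows: int) -> float:
    """Read every page sequentially, then charge CPU per tuple."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(index_pages: int, heap_pages: int, rows: int) -> float:
    """Charge random I/O for index and heap pages, plus per-entry
    and per-tuple CPU."""
    return ((index_pages + heap_pages) * RANDOM_PAGE_COST
            + rows * (CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST))

# A 10,000-page, 1,000,000-row table costs 20000.0 to scan sequentially:
print(seq_scan_cost(10_000, 1_000_000))  # 20000.0
```

Because the random-page multiplier is 4x the sequential one, an index scan only pays off when it avoids touching most of the table's pages.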
Cost Calculation Example¶
EXPLAIN (VERBOSE, COSTS)
SELECT * FROM orders WHERE customer_id = 123;
/*
Index Scan using idx_orders_customer on orders
(cost=0.43..8.45 rows=1 width=120)
Cost Breakdown:
- Index lookup: 0.43 (startup cost)
- Index page read: 1 x 4.0 = 4.0
- Heap page read: 1 x 4.0 = 4.0
- Index entries processed: 1 x 0.005 = 0.005
- Tuple processing: 1 x 0.01 = 0.01
- Total: 0.43 + 4.0 + 4.0 + 0.005 + 0.01 ≈ 8.45
*/
Plan Selection Process¶
The optimizer considers multiple strategies and selects the lowest-cost option:
EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM users WHERE age > 25;
/*
Optimizer Decision: Scan Strategy
CHOSEN: Sequential Scan (cost: 1000.00, rows: 50000)
Reasoning: Low selectivity (50% of rows match).
Sequential scan more efficient than index.
REJECTED: Index Scan on idx_users_age (cost: 2500.00)
Reason: High number of matching rows (50%) causes
excessive random I/O. Random page cost (4.0)
makes index scan 2.5x more expensive.
*/
Statistics Collection¶
Accurate statistics are essential for optimal query plans.
Automatic Statistics¶
-- HeliosDB automatically collects statistics
-- Configure auto-analyze threshold
SET autovacuum_analyze_threshold = 50;
SET autovacuum_analyze_scale_factor = 0.1;
-- Statistics collected:
-- - Row count per table
-- - Column value distributions (histograms)
-- - Most common values (MCV)
-- - Null fraction
-- - Average column width
-- - Correlation (physical vs logical ordering)
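The column-level statistics listed above can be modelled in a few lines of Python. This is an illustrative sketch; `column_stats` and its output shape are hypothetical, not a HeliosDB API:

```python
from collections import Counter

def column_stats(values, mcv_count=3, buckets=4):
    """Compute simple per-column statistics: null fraction, distinct
    count, most-common values (MCV), and equi-height histogram bounds."""
    non_null = [v for v in values if v is not None]
    null_frac = 1 - len(non_null) / len(values)
    mcv = Counter(non_null).most_common(mcv_count)
    ordered = sorted(non_null)
    # Equi-height histogram: boundaries at equal row counts.
    step = max(len(ordered) // buckets, 1)
    bounds = ordered[::step] + [ordered[-1]]
    return {"null_frac": null_frac,
            "n_distinct": len(set(non_null)),
            "most_common_vals": mcv,
            "histogram_bounds": bounds}
```

The MCV list prices equality predicates on frequent values exactly, while the histogram prices range predicates on everything else.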
Manual Statistics Collection¶
-- Analyze single table
ANALYZE orders;
-- Analyze specific columns
ANALYZE orders (customer_id, order_date);
-- Analyze with sampling
ANALYZE (SAMPLE 10 PERCENT) large_table;
-- Verbose output
ANALYZE VERBOSE orders;
/*
INFO: analyzing "public.orders"
INFO: "orders": 1000000 rows sampled, 1000000 estimated total rows
INFO: "orders": 5 columns analyzed
*/
Extended Statistics¶
For correlated columns, create extended statistics:
-- Create dependency statistics
CREATE STATISTICS orders_stats (dependencies)
ON customer_id, order_date FROM orders;
-- Create ndistinct statistics
CREATE STATISTICS product_region_stats (ndistinct)
ON product_id, region FROM sales;
-- Analyze to populate
ANALYZE orders;
-- View statistics
SELECT d.* FROM pg_statistic_ext e
JOIN pg_statistic_ext_data d ON d.stxoid = e.oid
WHERE e.stxname = 'orders_stats';
Viewing Statistics¶
-- Table-level statistics
SELECT
relname,
n_live_tup as rows,
n_dead_tup as dead_rows,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- Column-level statistics
SELECT
attname,
null_frac,
avg_width,
n_distinct,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'orders';
Join Strategies¶
The optimizer selects from multiple join algorithms based on data characteristics.
Hash Join¶
Best for: Large tables with equality conditions
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Hash Join (cost=250.00..1500.00 rows=100000 width=150)
(actual time=15.2..125.8 ms rows=100000 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o (cost=0.00..1000.00 rows=100000)
-> Hash (cost=200.00..200.00 rows=10000)
-> Seq Scan on customers c (cost=0.00..200.00 rows=10000)
Characteristics:
- Builds hash table from smaller table (customers)
- Probes hash table for each row from larger table
- O(n + m) time complexity
- Memory usage: work_mem for hash table
*/
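The build/probe mechanics described in the plan above can be sketched in Python; this is an illustrative model of the algorithm, not HeliosDB's executor:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Hash join sketch: build a hash table on the smaller input,
    then probe it once per row of the larger input -- O(n + m)."""
    table = {}
    for row in build_rows:                      # build phase (customers)
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase (orders)
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"order": 10, "customer_id": 1}, {"order": 11, "customer_id": 1}]
print(hash_join(customers, orders, "id", "customer_id"))
```

Note the build side is the smaller table, mirroring the plan above, so the hash table is the one that must fit in work_mem.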
Merge Join¶
Best for: Pre-sorted inputs or when results need ordering
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
ORDER BY o.customer_id;
/*
Merge Join (cost=500.00..800.00 rows=100000 width=150)
(actual time=45.2..95.8 ms rows=100000 loops=1)
Merge Cond: (o.customer_id = c.id)
-> Index Scan using idx_orders_customer on orders o
-> Index Scan using customers_pkey on customers c
Characteristics:
- Requires sorted inputs
- Uses indexes or explicit sorts
- O(n log n + m log m) with sorting
- Excellent for ordered results
*/
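The merge step itself is a two-pointer walk over sorted inputs. A minimal sketch (illustrative algorithm, not HeliosDB internals):

```python
def merge_join(left, right, key):
    """Merge join sketch: both inputs must be sorted on the join key;
    advance whichever side currently holds the smaller key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit matches for the current equal-key group on the right,
            # leaving j in place so the next left row can rescan it.
            j2 = j
            while j2 < len(right) and right[j2][key] == lk:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1
    return out
```

The merge itself is linear; the O(n log n) terms in the plan come from the sorts (or index scans) that produce the ordered inputs.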
Nested Loop Join¶
Best for: Small outer tables or indexed inner lookups
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.id = 12345;
/*
Nested Loop (cost=0.57..16.60 rows=1 width=150)
(actual time=0.05..0.06 ms rows=1 loops=1)
-> Index Scan using orders_pkey on orders o
Index Cond: (id = 12345)
-> Index Scan using customers_pkey on customers c
Index Cond: (id = o.customer_id)
Characteristics:
- Iterates outer, looks up inner for each row
- O(n * m) worst case
- Excellent when outer is small or inner has index
- No memory requirements for join itself
*/
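An indexed nested loop reduces to a lookup per outer row. In this sketch the inner index is modelled as a plain dict (illustrative only):

```python
def nested_loop_index_join(outer, inner_index, outer_key):
    """Nested loop sketch: iterate the (small) outer input and use an
    index -- modelled here as a dict -- for each inner lookup."""
    out = []
    for row in outer:
        match = inner_index.get(row[outer_key])
        if match is not None:
            out.append({**row, **match})
    return out

# One outer row (the primary-key lookup above) joined via the inner index:
customers_by_id = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
order = [{"id": 12345, "customer_id": 1}]
print(nested_loop_index_join(order, customers_by_id, "customer_id"))
```

With one outer row the join does exactly one index lookup, which is why the plan above costs a fraction of a millisecond.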
Join Strategy Selection¶
-- Force specific join strategy (for testing)
SET enable_hashjoin = false;
SET enable_mergejoin = false;
SET enable_nestloop = true;
-- View optimizer decision
EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM a JOIN b ON a.id = b.a_id;
/*
Optimizer Decision: Join Strategy
CHOSEN: Hash Join (cost: 500.00)
Reasoning: Medium-sized tables with equality condition.
Hash table fits in work_mem (256MB).
REJECTED: Nested Loop (cost: 5000.00, 10.0x more expensive)
Reason: Outer table too large (10000 rows). Would require
10000 index lookups.
REJECTED: Merge Join (cost: 800.00, 1.6x more expensive)
Reason: Neither input is sorted on join key. Sort overhead
exceeds hash join cost.
*/
Index Selection¶
The optimizer automatically selects the most efficient index.
Index Types Supported¶
-- B-Tree (default): Range and equality queries
CREATE INDEX idx_orders_date ON orders(order_date);
-- Hash: Equality queries only
CREATE INDEX idx_users_email ON users USING HASH(email);
-- GiST: Spatial and full-text
CREATE INDEX idx_locations_geo ON locations USING GIST(coordinates);
-- GIN: Arrays and JSON
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- BRIN: Large sequential tables
CREATE INDEX idx_logs_time ON logs USING BRIN(timestamp);
Index Selection Criteria¶
EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM orders
WHERE customer_id = 123
AND order_date > '2025-01-01';
/*
Index Selection Analysis:
CHOSEN: idx_orders_customer_date (composite B-Tree)
- Selectivity: 0.001 (0.1% of rows)
- Estimated rows: 10
- Cost: 8.45
CONSIDERED: idx_orders_customer
- Selectivity: 0.01 (1% of rows)
- Would need filter on order_date
- Cost: 150.00
CONSIDERED: idx_orders_date
- Selectivity: 0.3 (30% of rows)
- Low selectivity makes index scan expensive
- Cost: 3000.00
NOT CONSIDERED: Sequential Scan
- Would scan all 1M rows
- Cost: 10000.00
*/
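By default the optimizer treats predicates as statistically independent and multiplies their selectivities; extended statistics correct this for correlated columns. A hypothetical sketch of the default estimate:

```python
def combined_selectivity(*selectivities):
    """Under the default independence assumption, the selectivity of
    an AND of predicates is the product of the individual selectivities.
    Correlated columns break this assumption, which is what extended
    (dependency) statistics exist to fix."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

# customer_id = 123 matches 1% of rows and order_date > '2025-01-01'
# matches 30%, so independence predicts 0.3% for the conjunction:
print(combined_selectivity(0.01, 0.3))
```

With the example numbers above the independence product is 0.01 x 0.3 = 0.003; the plan's lower 0.001 estimate presumably reflects richer statistics than the naive product.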
Covering Indexes¶
-- Create covering index
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (amount, status);
EXPLAIN ANALYZE
SELECT customer_id, order_date, amount, status
FROM orders
WHERE customer_id = 123;
/*
Index Only Scan using idx_orders_covering
(cost=0.43..4.45 rows=10 width=50)
(actual time=0.02..0.03 ms rows=10 loops=1)
Heap Fetches: 0
Note: All columns satisfied from index (no table access)
*/
Parallel Query Execution¶
HeliosDB automatically parallelizes queries across CPU cores.
Parallel Configuration¶
-- View parallel settings
SHOW max_parallel_workers; -- 8 (cluster-wide)
SHOW max_parallel_workers_per_gather; -- 4 (per query)
SHOW min_parallel_table_scan_size; -- 8MB
SHOW min_parallel_index_scan_size; -- 512KB
SHOW parallel_setup_cost; -- 1000
SHOW parallel_tuple_cost; -- 0.1
Parallel Operations¶
EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
/*
Finalize GroupAggregate (cost=15000.00..15100.00 rows=10)
(actual time=150.2..150.5 ms rows=10 loops=1)
Group Key: region
-> Gather Merge (cost=15000.00..15050.00 rows=40)
Workers Planned: 4
Workers Launched: 4
-> Partial GroupAggregate (cost=14000.00..14010.00 rows=10)
(actual time=120.5..120.8 ms rows=10 loops=5)
-> Parallel Seq Scan on sales
(actual time=0.01..80.5 ms rows=200000 loops=5)
Performance: 4x speedup with 4 workers
Total rows: 1,000,000
Each worker processed: ~200,000 rows
*/
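The partial/finalize split in the plan above can be sketched as two phases: each worker aggregates its slice of the input, then a gather step merges the partial results. Function names and the thread-based worker model are illustrative:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_agg(chunk):
    """Each worker computes a partial SUM per group over its chunk."""
    partial = defaultdict(float)
    for region, amount in chunk:
        partial[region] += amount
    return partial

def parallel_group_sum(rows, workers=4):
    """Mirror of the plan above: parallel partial aggregation,
    then a finalize step that merges the per-worker results."""
    chunks = [rows[i::workers] for i in range(workers)]
    final = defaultdict(float)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(partial_agg, chunks):
            for region, total in partial.items():
                final[region] += total      # finalize (gather) step
    return dict(final)
```

The pattern works because SUM is decomposable: partial sums can be summed again without changing the result.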
Parallel Join¶
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2025-01-01';
/*
Gather (cost=5000.00..15000.00 rows=500000 width=150)
Workers Planned: 4
-> Parallel Hash Join (cost=4000.00..12000.00 rows=125000)
Hash Cond: (o.customer_id = c.id)
-> Parallel Seq Scan on orders o
Filter: (order_date > '2025-01-01')
-> Parallel Hash
-> Parallel Seq Scan on customers c
*/
Distributed Query Optimization¶
For sharded deployments, the optimizer minimizes data movement.
Shard-Aware Planning¶
-- Table sharded by customer_id
CREATE TABLE orders (
id BIGINT,
customer_id INTEGER,
amount DECIMAL(10,2)
) SHARD BY HASH(customer_id) SHARDS 8;
-- Single-shard query (no data movement)
EXPLAIN DISTRIBUTED
SELECT * FROM orders WHERE customer_id = 123;
/*
Distributed Query Plan:
-> Route to Shard 3 (hash(123) mod 8 = 3)
-> Local Index Scan on orders
Index Cond: (customer_id = 123)
Network: 0 bytes shuffled (single shard)
*/
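The routing rule in the plan above is plain modular hashing. A sketch (for an integer key the value itself stands in for the hash; a real system would use a stable hash function):

```python
def route_to_shard(shard_key: int, num_shards: int = 8) -> int:
    """Hash-based routing matching the plan's hash(key) mod 8 rule.
    For illustration the integer key is used directly as its hash."""
    return shard_key % num_shards

print(route_to_shard(123))  # 3, matching the plan above
```

Because the predicate pins `customer_id` to a single value, the coordinator can compute the shard up front and skip the other seven entirely.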
Distributed Join Optimization¶
-- Both tables sharded by customer_id (co-located)
EXPLAIN DISTRIBUTED
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Distributed Query Plan:
-> Parallel execution on 8 shards
Each shard executes locally:
-> Hash Join
-> Local Scan on orders partition
-> Local Scan on customers partition
-> Gather results at coordinator
Network: Minimal (only results transferred)
Optimization: Co-located join (both tables sharded same way)
*/
Cross-Shard Aggregation¶
EXPLAIN DISTRIBUTED
SELECT region, SUM(amount) as total
FROM sales
GROUP BY region;
/*
Distributed Query Plan:
Phase 1: Partial Aggregation (on each shard)
-> Each shard computes partial SUM per region
-> Data shuffled: 8 shards x 10 regions x 16 bytes = 1.3KB
Phase 2: Final Aggregation (at coordinator)
-> Merge partial results
-> Compute final SUM per region
Network: 1.3KB shuffled
Optimization: Pushdown partial aggregation to shards
*/
Advanced Optimization Features¶
Adaptive Query Execution¶
-- Enable adaptive execution
SET adaptive_execution = ON;
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000;
/*
Adaptive Execution Report:
Initial Plan: Hash Join (estimated 1000 rows)
Runtime Adjustment: Switched to Nested Loop at row 50
Reason: Actual selectivity (0.05%) much lower than estimated (1%)
Performance Impact:
- Original plan would have: 150ms
- Adaptive plan achieved: 5ms
- Improvement: 30x
*/
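The kind of mid-flight decision in the report above can be sketched as a checkpoint that extrapolates the observed cardinality and re-evaluates the join choice. The thresholds and function are hypothetical, not HeliosDB's actual policy:

```python
def adaptive_strategy(observed_rows, observed_fraction, estimated_rows):
    """After scanning observed_fraction of the input, project the final
    row count and reconsider the join strategy (illustrative policy).
    A nested loop wins when the driving side stays far smaller than
    the planner's estimate."""
    projected = observed_rows / observed_fraction
    return "nested_loop" if projected < 0.1 * estimated_rows else "hash_join"

# Planner estimated 1000 matching rows; halfway through the scan only
# 25 have matched, projecting ~50 total, so the executor switches:
print(adaptive_strategy(25, 0.5, 1000))  # nested_loop
```

The real system would also have to discard or migrate any hash-table state built so far; this sketch only shows the decision itself.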
Query Plan Caching¶
-- View cached plans
SELECT * FROM pg_prepared_statements;
-- Prepare statement (caches plan)
PREPARE get_orders AS
SELECT * FROM orders WHERE customer_id = $1;
-- Execute with cached plan
EXECUTE get_orders(123);
-- View plan cache statistics
SELECT * FROM pg_stat_statements
WHERE query LIKE '%get_orders%';
Optimizer Hints¶
-- Force index usage
SELECT /*+ IndexScan(orders idx_orders_date) */
* FROM orders WHERE order_date > '2025-01-01';
-- Force join order
SELECT /*+ Leading(c o) */
o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Force parallel execution
SELECT /*+ Parallel(orders 8) */
SUM(amount) FROM orders;
-- Disable specific optimization
SELECT /*+ NoHashJoin */
* FROM a JOIN b ON a.id = b.a_id;
Related Documentation¶
- Quick Start Guide - Getting started with query optimization
- User Guide - Comprehensive optimization techniques
- Troubleshooting Guide - Debugging slow queries
- EXPLAIN Plan Guide - Understanding execution plans
- Index Strategy Guide - Index design patterns
HeliosDB Query Optimizer - Enterprise-grade query optimization for any workload.