
HeliosDB Query Optimizer

Version: 7.0 | Status: Production Ready | Last Updated: 2026-01-04


Overview

The HeliosDB Query Optimizer is a sophisticated cost-based optimizer that transforms SQL queries into efficient execution plans. It combines traditional optimization techniques with advanced features like neural network-based planning, quantum-inspired optimization, and machine learning-driven cardinality estimation.

Key Capabilities

  • Cost-Based Optimization: Evaluates multiple execution strategies and selects the lowest-cost plan
  • Statistics-Driven Planning: Uses table and column statistics for accurate cost estimation
  • Adaptive Query Processing: Adjusts plans at runtime based on actual data characteristics
  • Multi-Model Support: Optimizes across SQL, document, graph, and time-series workloads
  • Distributed Query Planning: Optimizes queries across sharded and replicated data

Architecture

                    +------------------+
                    |    SQL Parser    |
                    +--------+---------+
                             |
                             v
                    +------------------+
                    |  Logical Planner |
                    +--------+---------+
                             |
                             v
+------------+      +------------------+      +---------------+
| Statistics |----->| Query Optimizer  |<-----| Configuration |
| Catalog    |      +--------+---------+      | Parameters    |
+------------+               |                +---------------+
                             v
                    +------------------+
                    | Physical Planner |
                    +--------+---------+
                             |
                             v
                    +------------------+
                    | Execution Engine |
                    +------------------+

Optimization Phases

  1. Parsing: SQL text converted to Abstract Syntax Tree (AST)
  2. Semantic Analysis: Validates references, resolves types
  3. Logical Planning: Creates initial logical plan tree
  4. Logical Optimization: Applies rule-based transformations
  5. Physical Planning: Generates physical execution strategies
  6. Cost-Based Selection: Evaluates costs, selects optimal plan
  7. Execution: Runs the chosen plan

Cost-Based Optimizer

The cost-based optimizer estimates the cost of each candidate plan from a set of tunable I/O and CPU parameters, then selects the cheapest.

Cost Model Components

-- View optimizer cost parameters
SHOW ALL LIKE '%cost%';

/*
Parameter            | Value | Description
---------------------|-------|----------------------------------
seq_page_cost        | 1.0   | Cost to read one page sequentially
random_page_cost     | 4.0   | Cost to read one page randomly
cpu_tuple_cost       | 0.01  | Cost to process one tuple
cpu_operator_cost    | 0.0025| Cost to execute one operator
cpu_index_tuple_cost | 0.005 | Cost to process one index entry
parallel_setup_cost  | 1000  | Cost to start parallel workers
parallel_tuple_cost  | 0.1   | Cost to transfer tuple to leader
*/
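The parameters above combine into per-plan cost estimates. Here is a minimal Python sketch of the arithmetic (the constants mirror the table, but the formulas are simplified assumptions; real plan costing accounts for many more factors):

```python
# Simplified sketch of the cost model described above.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005

def seq_scan_cost(pages, rows):
    """Read every page sequentially, process every tuple."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(index_entries, heap_pages, rows, startup=0.43):
    """Descend the index, then fetch matching pages with random I/O."""
    return (startup
            + index_entries * CPU_INDEX_TUPLE_COST
            + heap_pages * RANDOM_PAGE_COST
            + rows * CPU_TUPLE_COST)

print(seq_scan_cost(pages=1000, rows=100000))   # 1000*1.0 + 100000*0.01 = 2000.0
print(index_scan_cost(index_entries=1, heap_pages=2, rows=1))  # ~8.45
```

With these constants, a highly selective index scan (one matching row, two random page reads) costs about 8.45, while scanning the same 1000-page table sequentially costs 2000.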

Cost Calculation Example

EXPLAIN (VERBOSE, COSTS)
SELECT * FROM orders WHERE customer_id = 123;

/*
Index Scan using idx_orders_customer on orders
  (cost=0.43..8.45 rows=1 width=120)

Cost Breakdown:
  - Index descent (startup): 0.43
  - Index entries processed: 1 x 0.005 = 0.005
  - Random page reads: 2 x 4.0 = 8.0 (one index leaf page, one heap page)
  - Tuple processing: 1 x 0.01 = 0.01
  - Total: 0.43 + 0.005 + 8.0 + 0.01 ≈ 8.45
*/

Plan Selection Process

The optimizer considers multiple strategies and selects the lowest-cost option:

EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM users WHERE age > 25;

/*
Optimizer Decision: Scan Strategy
  CHOSEN: Sequential Scan (cost: 1000.00, rows: 50000)
  Reasoning: Low selectivity (50% of rows match).
             Sequential scan more efficient than index.

  REJECTED: Index Scan on idx_users_age (cost: 2500.00)
  Reason: High number of matching rows (50%) causes
          excessive random I/O. Random page cost (4.0)
          makes index scan 2.5x more expensive.
*/
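The crossover in the decision above can be sketched in Python. The constants and the index-cost formula here are deliberately crude illustrations, not HeliosDB's actual model; the point is that index-scan cost grows with selectivity while sequential-scan cost is flat:

```python
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01

def choose_scan(pages, rows, selectivity):
    """Compare a full sequential read against random fetches of matching rows."""
    seq_cost = pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST
    # pessimistic assumption: one random page fetch per matching row
    index_cost = rows * selectivity * RANDOM_PAGE_COST
    return "Seq Scan" if seq_cost <= index_cost else "Index Scan"

print(choose_scan(pages=1000, rows=100000, selectivity=0.5))    # Seq Scan
print(choose_scan(pages=1000, rows=100000, selectivity=0.001))  # Index Scan
```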

Statistics Collection

Accurate statistics are essential for optimal query plans.

Automatic Statistics

-- HeliosDB automatically collects statistics
-- Configure auto-analyze threshold
SET autovacuum_analyze_threshold = 50;
SET autovacuum_analyze_scale_factor = 0.1;

-- Statistics collected:
-- - Row count per table
-- - Column value distributions (histograms)
-- - Most common values (MCV)
-- - Null fraction
-- - Average column width
-- - Correlation (physical vs logical ordering)

Manual Statistics Collection

-- Analyze single table
ANALYZE orders;

-- Analyze specific columns
ANALYZE orders (customer_id, order_date);

-- Analyze with sampling
ANALYZE (SAMPLE 10 PERCENT) large_table;

-- Verbose output
ANALYZE VERBOSE orders;
/*
INFO:  analyzing "public.orders"
INFO:  "orders": 1000000 rows sampled, 1000000 estimated total rows
INFO:  "orders": 5 columns analyzed
*/

Extended Statistics

For correlated columns, create extended statistics:

-- Create dependency statistics
CREATE STATISTICS orders_stats (dependencies)
ON customer_id, order_date FROM orders;

-- Create ndistinct statistics
CREATE STATISTICS product_region_stats (ndistinct)
ON product_id, region FROM sales;

-- Analyze to populate
ANALYZE orders;

-- View statistics
SELECT * FROM pg_statistic_ext
WHERE stxname = 'orders_stats';

Viewing Statistics

-- Table-level statistics
SELECT
  relname,
  n_live_tup as rows,
  n_dead_tup as dead_rows,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Column-level statistics
SELECT
  attname,
  null_frac,
  avg_width,
  n_distinct,
  most_common_vals,
  most_common_freqs
FROM pg_stats
WHERE tablename = 'orders';

Join Strategies

The optimizer selects from multiple join algorithms based on data characteristics.

Hash Join

Best for: Large tables with equality conditions

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

/*
Hash Join (cost=250.00..1500.00 rows=100000 width=150)
  (actual time=15.2..125.8 ms rows=100000 loops=1)
  Hash Cond: (o.customer_id = c.id)
  -> Seq Scan on orders o (cost=0.00..1000.00 rows=100000)
  -> Hash (cost=200.00..200.00 rows=10000)
       -> Seq Scan on customers c (cost=0.00..200.00 rows=10000)

Characteristics:
- Builds hash table from smaller table (customers)
- Probes hash table for each row from larger table
- O(n + m) time complexity
- Memory usage: work_mem for hash table
*/
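The build/probe structure described above can be sketched in Python, with a dictionary standing in for the work_mem hash table:

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on the smaller input, probe with the larger: O(n + m)."""
    table = {}
    for row in build_rows:                        # build phase (held in memory)
        table.setdefault(row[build_key], []).append(row)
    joined = []
    for row in probe_rows:                        # probe phase
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 2},
          {"order_id": 12, "customer_id": 1}]
rows = hash_join(customers, orders, "id", "customer_id")  # 3 joined rows
```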

Merge Join

Best for: Pre-sorted inputs or when results need ordering

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
ORDER BY o.customer_id;

/*
Merge Join (cost=500.00..800.00 rows=100000 width=150)
  (actual time=45.2..95.8 ms rows=100000 loops=1)
  Merge Cond: (o.customer_id = c.id)
  -> Index Scan using idx_orders_customer on orders o
  -> Index Scan using customers_pkey on customers c

Characteristics:
- Requires sorted inputs
- Uses indexes or explicit sorts
- O(n log n + m log m) with sorting
- Excellent for ordered results
*/
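A Python sketch of the merge step itself, assuming both inputs arrive already sorted on their join keys (in the plan above, the index scans provide that ordering):

```python
def merge_join(left, right, lkey, rkey):
    """Advance two cursors over sorted inputs, emitting rows on key matches."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][lkey], right[j][rkey]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit all right-side rows sharing this key value
            j2 = j
            while j2 < len(right) and right[j2][rkey] == lk:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1
    return out

orders = [{"order_id": 10, "customer_id": 1},
          {"order_id": 11, "customer_id": 2},
          {"order_id": 12, "customer_id": 3}]          # sorted on customer_id
customers = [{"id": 1, "name": "Ada"}, {"id": 3, "name": "Cy"}]  # sorted on id
result = merge_join(orders, customers, "customer_id", "id")      # 2 joined rows
```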

Nested Loop Join

Best for: Small outer tables or indexed inner lookups

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.id = 12345;

/*
Nested Loop (cost=0.57..16.60 rows=1 width=150)
  (actual time=0.05..0.06 ms rows=1 loops=1)
  -> Index Scan using orders_pkey on orders o
       Index Cond: (id = 12345)
  -> Index Scan using customers_pkey on customers c
       Index Cond: (id = o.customer_id)

Characteristics:
- Iterates outer, looks up inner for each row
- O(n * m) worst case
- Excellent when outer is small or inner has index
- No memory requirements for join itself
*/
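A Python sketch of the indexed variant, using a dictionary as a stand-in for the inner side's primary-key index (one lookup per outer row):

```python
def indexed_nested_loop(outer_rows, inner_index, join_key):
    """For each outer row, one index lookup on the inner side."""
    joined = []
    for row in outer_rows:
        match = inner_index.get(row[join_key])    # B-tree lookup stand-in
        if match is not None:
            joined.append({**row, **match})
    return joined

customers_by_id = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
result = indexed_nested_loop([{"order_id": 10, "customer_id": 2}],
                             customers_by_id, "customer_id")
print(result)  # one row, joined with "Bob"
```

With a single outer row, the whole join is one dictionary probe; this is why the strategy dominates when the outer side is tiny.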

Join Strategy Selection

-- Force specific join strategy (for testing)
SET enable_hashjoin = false;
SET enable_mergejoin = false;
SET enable_nestloop = true;

-- View optimizer decision
EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM a JOIN b ON a.id = b.a_id;

/*
Optimizer Decision: Join Strategy
  CHOSEN: Hash Join (cost: 500.00)
  Reasoning: Medium-sized tables with equality condition.
             Hash table fits in work_mem (256MB).

  REJECTED: Nested Loop (cost: 5000.00, 10.0x more expensive)
  Reason: Outer table too large (10000 rows). Would require
          10000 index lookups.

  REJECTED: Merge Join (cost: 800.00, 1.6x more expensive)
  Reason: Neither input is sorted on join key. Sort overhead
          exceeds hash join cost.
*/

Index Selection

The optimizer automatically selects the most efficient index.

Index Types Supported

-- B-Tree (default): Range and equality queries
CREATE INDEX idx_orders_date ON orders(order_date);

-- Hash: Equality queries only
CREATE INDEX idx_users_email ON users USING HASH(email);

-- GiST: Spatial and full-text
CREATE INDEX idx_locations_geo ON locations USING GIST(coordinates);

-- GIN: Arrays and JSON
CREATE INDEX idx_products_tags ON products USING GIN(tags);

-- BRIN: Large sequential tables
CREATE INDEX idx_logs_time ON logs USING BRIN(timestamp);

Index Selection Criteria

EXPLAIN (ANALYZE, WHY_NOT)
SELECT * FROM orders
WHERE customer_id = 123
  AND order_date > '2025-01-01';

/*
Index Selection Analysis:

CHOSEN: idx_orders_customer_date (composite B-Tree)
  - Selectivity: 0.001 (0.1% of rows)
  - Estimated rows: 10
  - Cost: 8.45

CONSIDERED: idx_orders_customer
  - Selectivity: 0.01 (1% of rows)
  - Would need filter on order_date
  - Cost: 150.00

CONSIDERED: idx_orders_date
  - Selectivity: 0.3 (30% of rows)
  - Low selectivity makes index scan expensive
  - Cost: 3000.00

NOT CONSIDERED: Sequential Scan
  - Would scan all 1M rows
  - Cost: 10000.00
*/

Covering Indexes

-- Create covering index
CREATE INDEX idx_orders_covering
ON orders (customer_id, order_date)
INCLUDE (amount, status);

EXPLAIN ANALYZE
SELECT customer_id, order_date, amount, status
FROM orders
WHERE customer_id = 123;

/*
Index Only Scan using idx_orders_covering
  (cost=0.43..4.45 rows=10 width=50)
  (actual time=0.02..0.03 ms rows=10 loops=1)
  Heap Fetches: 0

Note: All columns satisfied from index (no table access)
*/

Parallel Query Execution

HeliosDB automatically parallelizes queries across CPU cores.

Parallel Configuration

-- View parallel settings
SHOW max_parallel_workers;              -- 8 (cluster-wide)
SHOW max_parallel_workers_per_gather;   -- 4 (per query)
SHOW min_parallel_table_scan_size;      -- 8MB
SHOW min_parallel_index_scan_size;      -- 512KB
SHOW parallel_setup_cost;               -- 1000
SHOW parallel_tuple_cost;               -- 0.1

Parallel Operations

EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM sales
GROUP BY region;

/*
Finalize GroupAggregate (cost=15000.00..15100.00 rows=10)
  (actual time=150.2..150.5 ms rows=10 loops=1)
  Group Key: region
  -> Gather Merge (cost=15000.00..15050.00 rows=40)
       Workers Planned: 4
       Workers Launched: 4
       -> Partial GroupAggregate (cost=14000.00..14010.00 rows=10)
            (actual time=120.5..120.8 ms rows=10 loops=5)
            -> Parallel Seq Scan on sales
                 (actual time=0.01..80.5 ms rows=200000 loops=5)

Performance: 4x speedup with 4 workers
Total rows: 1,000,000
Each worker processed: ~200,000 rows
*/
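The partial/final aggregation split in the plan above can be sketched in Python. Sequential chunks stand in for parallel workers here; the merge logic at the gather node is the same either way:

```python
from collections import defaultdict

def partial_aggregate(chunk):
    """Each worker computes a partial SUM per group over its chunk."""
    partial = defaultdict(float)
    for region, amount in chunk:
        partial[region] += amount
    return partial

def final_aggregate(partials):
    """The gather node merges the workers' partial results."""
    total = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)

rows = [("EU", 10.0), ("US", 5.0), ("EU", 2.5), ("US", 7.5)]
chunks = [rows[:2], rows[2:]]                    # two "workers"
result = final_aggregate(partial_aggregate(c) for c in chunks)
print(result)  # {'EU': 12.5, 'US': 12.5}
```

SUM decomposes cleanly into partials; aggregates like AVG need extra state (sum and count) carried through the partial phase.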

Parallel Join

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2025-01-01';

/*
Gather (cost=5000.00..15000.00 rows=500000 width=150)
  Workers Planned: 4
  -> Parallel Hash Join (cost=4000.00..12000.00 rows=125000)
       Hash Cond: (o.customer_id = c.id)
       -> Parallel Seq Scan on orders o
            Filter: (order_date > '2025-01-01')
       -> Parallel Hash
            -> Parallel Seq Scan on customers c
*/

Distributed Query Optimization

For sharded deployments, the optimizer minimizes data movement.

Shard-Aware Planning

-- Table sharded by customer_id
CREATE TABLE orders (
    id BIGINT,
    customer_id INTEGER,
    amount DECIMAL(10,2)
) SHARD BY HASH(customer_id) SHARDS 8;

-- Single-shard query (no data movement)
EXPLAIN DISTRIBUTED
SELECT * FROM orders WHERE customer_id = 123;

/*
Distributed Query Plan:
  -> Route to Shard 3 (hash(123) mod 8 = 3)
     -> Local Index Scan on orders
        Index Cond: (customer_id = 123)

Network: 0 bytes shuffled (single shard)
*/
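A sketch of the routing arithmetic, using plain modulo as a stand-in for HeliosDB's (unspecified) shard hash; for an integer key, 123 mod 8 = 3, matching the plan above:

```python
def shard_for(key, num_shards=8):
    """Route a query to the single shard that owns this key."""
    return key % num_shards   # stand-in for the real shard hash function

print(shard_for(123))  # 3
```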

Distributed Join Optimization

-- Both tables sharded by customer_id (co-located)
EXPLAIN DISTRIBUTED
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

/*
Distributed Query Plan:
  -> Parallel execution on 8 shards
     Each shard executes locally:
     -> Hash Join
          -> Local Scan on orders partition
          -> Local Scan on customers partition
  -> Gather results at coordinator

Network: Minimal (only results transferred)
Optimization: Co-located join (both tables sharded same way)
*/

Cross-Shard Aggregation

EXPLAIN DISTRIBUTED
SELECT region, SUM(amount) as total
FROM sales
GROUP BY region;

/*
Distributed Query Plan:
  Phase 1: Partial Aggregation (on each shard)
    -> Each shard computes partial SUM per region
    -> Data shuffled: 8 shards x 10 regions x 16 bytes = 1.3KB

  Phase 2: Final Aggregation (at coordinator)
    -> Merge partial results
    -> Compute final SUM per region

Network: 1.3KB shuffled
Optimization: Pushdown partial aggregation to shards
*/

Advanced Optimization Features

Adaptive Query Execution

-- Enable adaptive execution
SET adaptive_execution = ON;

EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000;

/*
Adaptive Execution Report:
  Initial Plan: Hash Join (estimated 1000 rows)
  Runtime Adjustment: Switched to Nested Loop at row 50
  Reason: Actual selectivity (0.05%) much lower than estimated (1%)

  Performance Impact:
    - Original plan would have: 150ms
    - Adaptive plan achieved: 5ms
    - Improvement: 30x
*/
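One way to picture the runtime switch is a toy heuristic (not HeliosDB's actual adaptive logic): buffer the outer input and only commit to the hash plan once enough rows have been seen to justify building the hash table:

```python
def adaptive_strategy(rows, threshold=100):
    """Buffer the outer input; if it ends up small, per-row index
    lookups (nested loop) beat building a hash table."""
    n = 0
    for _ in rows:
        n += 1
        if n > threshold:
            return "Hash Join"      # many rows: stick with the hash plan
    return "Nested Loop"            # tiny input: switch to index lookups

print(adaptive_strategy(range(50)))    # Nested Loop
print(adaptive_strategy(range(1000)))  # Hash Join
```

In the report above, the estimate (1000 rows) favored a hash join, but the actual input ran out at 50 rows, so the executor switched to the nested-loop strategy.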

Query Plan Caching

-- View cached plans
SELECT * FROM pg_prepared_statements;

-- Prepare statement (caches plan)
PREPARE get_orders AS
SELECT * FROM orders WHERE customer_id = $1;

-- Execute with cached plan
EXECUTE get_orders(123);

-- View plan cache statistics
SELECT * FROM pg_stat_statements
WHERE query LIKE '%get_orders%';

Optimizer Hints

-- Force index usage
SELECT /*+ IndexScan(orders idx_orders_date) */
  * FROM orders WHERE order_date > '2025-01-01';

-- Force join order
SELECT /*+ Leading(c o) */
  o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Force parallel execution
SELECT /*+ Parallel(orders 8) */
  SUM(amount) FROM orders;

-- Disable specific optimization
SELECT /*+ NoHashJoin */
  * FROM a JOIN b ON a.id = b.a_id;


HeliosDB Query Optimizer - Enterprise-grade query optimization for any workload.