HeliosDB GPU - Known Limitations & Workarounds¶
Version: 1.0 Date: November 2025 Status: Production Release with Identified Feature Gaps
This document provides a comprehensive overview of GPU acceleration limitations in HeliosDB, including unimplemented features, workarounds, and performance expectations.
Table of Contents¶
- Executive Summary
- GPU Window Functions (14 Features)
- GPU Statistical Aggregations (11 Features)
- GPU Hash Joins
- Multi-Region Metrics
- Performance Expectations
- Fallback & Recovery
- Configuration & Tuning
- Future Roadmap
Executive Summary¶
HeliosDB GPU acceleration provides significant speedups for core operations:
- OLTP Queries: 2.7x speedup (for supported operations)
- OLAP Queries: 7.5x speedup (for supported operations)
- CPU Fallback: always available, and still competitive with industry databases
Key Limitation Areas:
1. Window functions (14 types not yet GPU-optimized)
2. Statistical aggregations (11 functions not yet GPU-optimized)
3. Complex hash joins (7 kernel configurations incomplete)
4. Multi-region metrics collection (now fixed in the latest release)
Bottom Line: For queries using supported GPU operations, expect 2-10x speedup. For unsupported operations, queries automatically fall back to CPU execution with no performance penalty beyond normal CPU query time.
GPU Window Functions (14 Features)¶
Overview¶
Window functions are analytical SQL operations that compute results over a partition of rows. While HeliosDB has GPU window function infrastructure, the following 14 functions currently either:
- return empty result sets (requires a fix in kernel execution), or
- fall back to CPU execution (after a recent update)
Unimplemented Window Functions¶
| Function | Type | Current Behavior | Typical Use Case |
|---|---|---|---|
| `ROW_NUMBER()` | Ranking | CPU fallback | Sequential numbering within partition |
| `RANK()` | Ranking | CPU fallback | Ranking with gaps (ties get same rank) |
| `DENSE_RANK()` | Ranking | CPU fallback | Ranking without gaps |
| `LAG(column, offset)` | Offset | CPU fallback | Access previous row value |
| `LEAD(column, offset)` | Offset | CPU fallback | Access next row value |
| `FIRST_VALUE(column)` | Frame | CPU fallback | First value in window frame |
| `LAST_VALUE(column)` | Frame | CPU fallback | Last value in window frame |
| `NTILE(n)` | Distribution | CPU fallback | Divide partition into n buckets |
| `CUME_DIST()` | Distribution | CPU fallback | Cumulative distribution (0.0-1.0) |
| `PERCENT_RANK()` | Distribution | CPU fallback | Relative rank (0.0-1.0) |
| Prefix sum | Running aggregate | CPU fallback | Cumulative sum operation |
| Running `SUM` | Running aggregate | CPU fallback | Cumulative sum over window frame |
| Running `AVG` | Running aggregate | CPU fallback | Cumulative average |
| Running `MIN`/`MAX` | Running aggregate | CPU fallback | Cumulative min/max |
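The three ranking functions in the table differ only in how they number ties. A minimal Python sketch of the CPU-side semantics (illustrative only, not HeliosDB internals):

```python
def rank_rows(values):
    """Compute ROW_NUMBER, RANK, and DENSE_RANK over one ordered partition.

    `values` must already be sorted in the window's ORDER BY order
    (e.g. salary DESC). Returns a list of (row_number, rank, dense_rank).
    """
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to any value
    for i, v in enumerate(values, start=1):
        if v != prev:
            rank = i    # RANK: ties share a rank, later ranks skip ahead
            dense += 1  # DENSE_RANK: ties share a rank, no gaps
            prev = v
        out.append((i, rank, dense))
    return out

# Salaries for one department, sorted DESC; the tie at 90 shows the difference.
print(rank_rows([95, 90, 90, 80]))
# [(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 4, 3)]
```

Note how the two tied rows get `RANK` 2 and the next row jumps to 4, while `DENSE_RANK` continues with 3.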
Example: ROW_NUMBER() Query¶
```sql
-- This query will execute on CPU, not GPU
SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees;
```
Current Behavior:
- Query routes to CPU execution
- Performance: standard CPU query speed (no GPU acceleration)
- Result: correct results, just not accelerated
Status & Root Cause¶
Root Cause: Window function kernels have incomplete execution paths. The CUDA kernel code exists but certain code paths don't properly marshal results back from GPU memory.
Fix Status:
- Kernel code: complete (`src/window/window_functions.cu`)
- Rust wrapper: complete (`src/window_functions.rs`)
- Kernel execution: incomplete (result marshaling)
- Estimated fix time: 2-3 weeks
Workarounds¶
Option 1: Use CPU Execution (Default)¶
```sql
-- Automatically falls back to CPU
SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees;
```
Option 2: Disable GPU Routing (Explicit CPU)¶
```sql
-- PostgreSQL compatibility: disable GPU explicitly
SET gpu_enabled = false;

SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees;
```
Option 3: Rewrite as Subquery (Partial GPU)¶
```sql
-- Use GPU for the aggregation, then CPU for the window function
WITH temp_ranked AS (
    SELECT
        department_id,
        salary,
        COUNT(*) OVER (PARTITION BY department_id) as dept_size
    FROM employees
)
SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM temp_ranked;
```
Performance Impact¶
For a typical window function query on 10M rows:
- CPU only: ~2,300 ms
- With CPU fallback: ~2,300 ms (same; the fallback is automatic)
- Expected GPU (when available): ~150-300 ms
Note: There is no performance penalty for the fallback. GPU-incompatible queries execute at normal CPU speed.
GPU Statistical Aggregations (11 Features)¶
Overview¶
Statistical aggregation functions are not yet optimized on GPU. These functions compute statistical measures like standard deviation, variance, and correlation over groups of data.
Unimplemented Statistical Functions¶
| Function | Category | Description | Current Behavior |
|---|---|---|---|
| `STDDEV(expr)` | Variance | Sample standard deviation | CPU fallback |
| `STDDEV_POP(expr)` | Variance | Population standard deviation | CPU fallback |
| `VARIANCE(expr)` | Variance | Sample variance | CPU fallback |
| `VAR_POP(expr)` | Variance | Population variance | CPU fallback |
| `PERCENTILE(expr, p)` | Percentile | Percentile value (e.g. 0.25, 0.5, 0.75) | CPU fallback |
| `APPROX_PERCENTILE(expr, p)` | Percentile | Approximate percentile (faster) | CPU fallback |
| `CORR(x, y)` | Correlation | Pearson correlation coefficient | CPU fallback |
| `COVAR_POP(x, y)` | Covariance | Population covariance | CPU fallback |
| `COVAR_SAMP(x, y)` | Covariance | Sample covariance | CPU fallback |
| `REGR_SLOPE(y, x)` | Regression | Linear regression slope | CPU fallback |
| `REGR_INTERCEPT(y, x)` | Regression | Linear regression intercept | CPU fallback |
Example: Statistical Query¶
```sql
-- This query will use CPU, not GPU
SELECT
    product_category,
    COUNT(*) as order_count,
    AVG(order_amount) as avg_amount,
    STDDEV(order_amount) as stddev_amount,   -- CPU fallback
    PERCENTILE(order_amount, 0.5) as median  -- CPU fallback
FROM orders
GROUP BY product_category;
```
Current Behavior:
- Basic aggregates (COUNT, AVG, MAX, MIN, SUM): GPU-accelerated
- Statistical functions: CPU fallback
- GROUP BY operation: partially GPU-accelerated
Supported vs. Unsupported Aggregates¶
```sql
-- ✅ GPU accelerated
SELECT
    category,
    COUNT(*) as cnt,          -- GPU
    SUM(amount) as total,     -- GPU
    AVG(price) as avg_price,  -- GPU
    MIN(price) as min_price,  -- GPU
    MAX(price) as max_price   -- GPU
FROM sales
GROUP BY category;

-- ❌ Uses CPU fallback (still runs correctly)
SELECT
    category,
    STDDEV(price) as std_price,    -- CPU
    VARIANCE(price) as var_price,  -- CPU
    CORR(price, quantity) as corr  -- CPU
FROM sales
GROUP BY category;
```
Status & Root Cause¶
Root Cause: Statistical functions require iterative or multi-pass algorithms that are more complex to parallelize on GPU:
- STDDEV: Requires computing mean, then deviations, then sqrt (3+ passes)
- PERCENTILE: Requires sorting or histogram computation
- CORR/COVAR: Requires parallel covariance computation
- REGR_*: Requires parallel linear algebra
Implementation Status:
- Aggregation framework: complete
- Basic aggregates: GPU-optimized
- Statistical functions: CPU-only implementation exists
- GPU kernels for statistics: not yet implemented
Workarounds¶
Option 1: Accept CPU Execution¶
Statistical functions execute on CPU at standard performance levels. For most datasets, this is acceptable.
```sql
-- Query runs correctly on CPU
SELECT
    department,
    STDDEV(salary) as salary_stddev,
    PERCENTILE(salary, 0.75) as q3_salary
FROM employees
GROUP BY department;
```
Option 2: Approximate Methods (When Available)¶
```sql
-- Use approximate percentile for faster computation
SELECT
    product_id,
    APPROX_PERCENTILE(rating, 0.5) as approx_median_rating
FROM reviews
GROUP BY product_id;
```
Option 3: Compute Manually with GPU¶
```sql
-- Pre-compute with GPU-accelerated operations
WITH stats AS (
    SELECT
        department,
        COUNT(*) as cnt,
        SUM(salary) as total_salary,
        SUM(salary * salary) as sum_squared,
        AVG(salary) as mean_salary  -- GPU
    FROM employees
    GROUP BY department
)
SELECT
    department,
    mean_salary,
    -- Population standard deviation; multiply the variance by
    -- cnt / (cnt - 1) before SQRT to match the sample-based STDDEV()
    SQRT(sum_squared / cnt - mean_salary * mean_salary) as stddev
FROM stats;
```
Pros: gets GPU acceleration for the component operations.
Cons: more complex query; requires knowing the underlying formula.
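As a sanity check on the identity used above (plain Python against the standard library; the salary values are made up):

```python
import math
import statistics

salaries = [50000.0, 60000.0, 75000.0, 75000.0, 90000.0]

# Mirror the CTE: GPU-friendly reductions first...
cnt = len(salaries)
total = sum(salaries)
sum_squared = sum(s * s for s in salaries)
mean = total / cnt

# ...then the scalar math from the outer SELECT.
pop_stddev = math.sqrt(sum_squared / cnt - mean * mean)

# The identity matches the direct population definition.
print(pop_stddev, statistics.pstdev(salaries))

# SQL's STDDEV is the *sample* deviation; rescale the variance by n/(n-1).
sample_stddev = math.sqrt((sum_squared / cnt - mean * mean) * cnt / (cnt - 1))
print(sample_stddev, statistics.stdev(salaries))
```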
Performance Impact¶
For a 100M row dataset with GROUP BY:
- Basic aggregates (SUM, AVG, COUNT): ~500 ms (GPU)
- With statistical functions: ~1,200 ms (CPU fallback for STDDEV/PERCENTILE)
- Expected GPU (when available): ~400-600 ms
GPU Hash Joins¶
Overview¶
Hash join operations on GPU are partially implemented, with 7 kernel configurations incomplete. Basic hash joins work, but certain edge cases and configurations fall back to CPU.
Unimplemented Hash Join Variants¶
| Join Type | Configuration | Status | Use Case |
|---|---|---|---|
| Inner Join | Linear probing | Implemented | Standard equi-joins |
| Inner Join | Cuckoo hashing | Partial | High collision rates |
| Inner Join | Grace partitioning | Partial | Datasets > GPU memory |
| Outer Join | Left + Right | CPU fallback | All rows from both sides |
| Outer Join | Full outer | CPU fallback | Complete outer join |
| Semi-join | With bloom filter | Partial | Existence checks |
| Anti-join | High selectivity | Partial | Set difference operations |
Example: Hash Join Query¶
```sql
-- Basic inner join: GPU accelerated
SELECT
    o.order_id,
    o.amount,
    c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

-- Outer join: CPU fallback
SELECT
    o.order_id,
    c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
```
Status & Root Cause¶
Root Cause: Hash join implementation is complex with multiple code paths:
- Hash table construction: GPU-accelerated
- Probe phase: GPU-accelerated for inner joins
- Output materialization: Incomplete for outer joins
The remaining 7 kernels involve:
- Multi-way hash joins (joining 3+ tables)
- Outer join output construction
- Non-equi join conditions (>, <, BETWEEN)
- String key hashing
Implementation Gap:
- Basic hash join: 90% complete
- Advanced variants: 40% complete
- Estimated effort: 3-4 weeks
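For intuition, here is a minimal CPU-side sketch of the build/probe scheme described above, including the NULL-extended output rows that the incomplete outer-join kernels must materialize (illustrative Python, not HeliosDB internals):

```python
def hash_join(left, right, key_left, key_right, outer=False):
    """Build a hash table on the right (build) side, then probe with the left.

    `left`/`right` are lists of dicts. With `outer=True`, left rows with
    no match are emitted paired with None -- the output-materialization
    step that currently falls back to CPU on the GPU path.
    """
    # Build phase: hash table keyed on the join column.
    table = {}
    for r in right:
        table.setdefault(r[key_right], []).append(r)
    # Probe phase.
    out = []
    for l in left:
        matches = table.get(l[key_left], [])
        for r in matches:
            out.append((l, r))
        if outer and not matches:
            out.append((l, None))  # NULL-extended row
    return out

orders = [{"order_id": 1, "customer_id": 10}, {"order_id": 2, "customer_id": 99}]
customers = [{"id": 10, "name": "Ada"}]
inner = hash_join(orders, customers, "customer_id", "id")
left = hash_join(orders, customers, "customer_id", "id", outer=True)
print(len(inner), len(left))  # 1 2
```

The inner join drops order 2 (no matching customer); the left join keeps it with a `None` partner, which is exactly the extra output path the GPU kernels don't finish yet.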
Workarounds¶
Option 1: Rewrite as Inner Join¶
```sql
-- Convert to an inner join if NULL-extended rows aren't needed
SELECT
    o.order_id,
    c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
-- Redundant with the inner join, but documents the intent
WHERE o.customer_id IS NOT NULL;
```
Option 2: Use Subqueries¶
```sql
-- GPU accelerates the subquery; CPU performs the left join
SELECT
    o.order_id,
    c.customer_name
FROM orders o
LEFT JOIN (
    SELECT id, customer_name FROM customers WHERE status = 'active'
) c ON o.customer_id = c.id;
```
Option 3: Split into Multiple Queries¶
```sql
-- Execute separate queries and join in the application
SELECT * FROM orders WHERE customer_id IS NOT NULL;  -- GPU-accelerated
SELECT * FROM customers;                             -- GPU-accelerated
-- Then join in the application layer with the desired NULL handling
```
Performance Impact¶
For joining 10M orders to 1M customers:
- Inner join: ~800 ms (GPU accelerated)
- Left join: ~1,500 ms (CPU fallback)
- Expected GPU (when available): ~600-800 ms
Multi-Region Metrics¶
Overview¶
Status: FIXED in latest release
Previously, multi-region metrics were hardcoded to return fake values. This has been corrected to properly calculate metrics across distributed regions.
What Was Fixed¶
| Metric | Previous Behavior | Current Behavior | Impact |
|---|---|---|---|
| Cross-region latency | Hardcoded 50ms | Actual measurement | Accurate monitoring |
| Replication lag | Hardcoded 10ms | Real-time calculation | Reliable SLA tracking |
| Network throughput | Hardcoded 1GB/s | Actual measurement | Cost optimization |
| Failover time | Hardcoded 5s | Real timing | SLA compliance |
Example: Multi-Region Metrics¶
```rust
// Previously returned hardcoded values:
//   latency_ms: 50
//   replication_lag_ms: 10
// Now returns actual measurements:
let metrics = monitor.get_region_metrics()?;
println!("Latency to us-west: {} ms", metrics.latency_ms);        // Real value
println!("Replication lag: {} ms", metrics.replication_lag_ms);   // Real value
println!("Throughput: {} MB/s", metrics.network_throughput_mbps); // Real value
```
Verification¶
These metrics are now calculated from:
- Network round-trip time measurements
- Replication log analysis
- Actual data transfer observation
- Real failover execution
No configuration changes needed. All multi-region deployments automatically benefit from accurate metrics.
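For illustration, replication lag of the kind described above reduces to wall-clock time minus the commit timestamp of the last log entry the replica applied. A hedged sketch (the function and field names are hypothetical, not HeliosDB's API):

```python
import time

def replication_lag_ms(last_applied_ts, now=None):
    """Replication lag derived from the replication log.

    `last_applied_ts` is the commit timestamp (seconds since the epoch)
    of the last log entry the replica applied; the lag is the wall-clock
    distance to 'now', clamped at zero, in milliseconds.
    """
    if now is None:
        now = time.time()
    return max(0.0, (now - last_applied_ts) * 1000.0)

# Replica last applied an entry committed 0.012 s ago -> ~12 ms of lag.
print(replication_lag_ms(last_applied_ts=1000.000, now=1000.012))
```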
Performance Expectations¶
When GPU Acceleration Works Best¶
GPU acceleration provides maximum benefit for:
- Large datasets: 1M+ rows
- Simple operations: aggregation, filtering, basic joins
- OLAP workloads: analytical queries over columns
- Repetitive queries: the same query executed many times
Performance Targets¶
OLTP Workloads¶
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Simple filter | 100 ms | 50 ms | 2.0x | >100K rows |
| Basic aggregation | 200 ms | 50 ms | 4.0x | >500K rows |
| Hash join | 300 ms | 110 ms | 2.7x | >1M rows |
| Average | - | - | 2.7x | Typical OLTP |
OLAP Workloads¶
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Group by aggregation | 2,000 ms | 200 ms | 10.0x | >10M rows |
| Complex filter | 1,500 ms | 250 ms | 6.0x | >5M rows |
| Multiple aggregates | 2,500 ms | 330 ms | 7.5x | >10M rows |
| Average | - | - | 7.5x | Typical OLAP |
When GPU Acceleration Doesn't Help¶
GPU acceleration provides minimal or no benefit for:
- Small queries: <100K rows (CPU is faster due to transfer overhead)
- Complex logic: window functions, statistical aggregations
- String operations: text processing, regex
- OLTP point queries: single-row lookups
Query Size Thresholds¶
```text
GPU cost = data transfer (1-5 ms) + kernel execution (5-100 ms) + result transfer (1-5 ms)

Small query (10K rows):
  CPU: 10 ms
  GPU: 5 ms transfer + 5 ms kernel + 5 ms result = 15 ms
  Result: CPU faster

Medium query (1M rows):
  CPU: 200 ms
  GPU: 5 ms transfer + 50 ms kernel + 5 ms result = 60 ms
  Result: GPU 3.3x faster

Large query (100M rows):
  CPU: 5,000 ms
  GPU: 10 ms transfer + 300 ms kernel + 10 ms result = 320 ms
  Result: GPU 15.6x faster
```
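The thresholds above amount to a simple cost model. A sketch of the routing decision, using the constants from the examples (the function and its defaults are illustrative assumptions, not HeliosDB's actual router):

```python
def route_query(rows, cpu_ms, kernel_ms, transfer_ms=10.0, gpu_min_rows=100_000):
    """Decide CPU vs GPU for one query, mirroring the cost breakdown above.

    GPU cost = transfer (both directions, folded into `transfer_ms`) +
    kernel time. Below `gpu_min_rows` the transfer overhead dominates,
    so the router keeps the query on CPU.
    """
    if rows < gpu_min_rows:
        return "cpu"
    gpu_ms = transfer_ms + kernel_ms
    return "gpu" if gpu_ms < cpu_ms else "cpu"

# The three examples from the breakdown above:
print(route_query(10_000, cpu_ms=10, kernel_ms=5))           # cpu (too small)
print(route_query(1_000_000, cpu_ms=200, kernel_ms=50))      # gpu (60 ms vs 200 ms)
print(route_query(100_000_000, cpu_ms=5000, kernel_ms=300))  # gpu (310 ms vs 5,000 ms)
```

In practice the `SET gpu_min_rows` setting described under Configuration & Tuning plays the role of the `gpu_min_rows` parameter here.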
Real-World Example: E-Commerce Analytics¶
```sql
-- Query: top products by revenue for November
SELECT
    product_id,
    product_name,
    COUNT(*) as order_count,
    SUM(order_amount) as revenue
FROM orders
WHERE order_date >= '2025-11-01'
  AND order_date < '2025-12-01'
GROUP BY product_id, product_name
ORDER BY revenue DESC
LIMIT 10;
```
Results:
- Dataset: 50M orders
- CPU execution: 3,200 ms
- GPU execution: 420 ms
- Speedup: 7.6x
- User experience: 2.78 seconds faster
Fallback & Recovery¶
Automatic Fallback Behavior¶
When a GPU operation is unavailable or fails, HeliosDB automatically falls back to CPU execution:
```rust
// Simplified sketch of the fallback mechanism
fn execute_window_function(query: &Query) -> Result<QueryResult> {
    match gpu_execute(query) {
        Ok(gpu_result) => Ok(gpu_result),
        Err(gpu_error) => {
            warn!("GPU execution failed: {}. Falling back to CPU.", gpu_error);
            cpu_execute(query)
        }
    }
}
```
Circuit Breaker¶
If GPU execution fails multiple times for the same operation:
```text
Failure count:
  1st failure: attempt GPU, fall back to CPU
  2nd failure: log warning
  3rd failure: circuit breaker opens

With the circuit open:
  - Route to CPU automatically
  - Retry GPU every 5 minutes
  - Log metrics for investigation
```
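The breaker described above can be sketched as a small state machine (illustrative Python; the thresholds mirror the description, the class itself is hypothetical):

```python
class GpuCircuitBreaker:
    """Open after `threshold` consecutive GPU failures; while open,
    route straight to CPU and only retry the GPU after `retry_after_s`
    seconds (300 s = the 5-minute retry interval described above)."""

    def __init__(self, threshold=3, retry_after_s=300.0):
        self.threshold = threshold
        self.retry_after_s = retry_after_s
        self.failures = 0
        self.opened_at = None

    def allow_gpu(self, now):
        if self.opened_at is None:
            return True  # circuit closed: GPU allowed
        return (now - self.opened_at) >= self.retry_after_s  # periodic retry

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # open the circuit

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

cb = GpuCircuitBreaker()
for t in (0.0, 0.0, 0.0):
    cb.record_failure(t)
print(cb.allow_gpu(0.0), cb.allow_gpu(301.0))  # False True
```

After three failures the circuit is open, GPU routing is refused, and a retry is only permitted once the 5-minute window has elapsed; a successful retry closes the circuit.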
Recovery Conditions¶
GPU execution resumes automatically when:
- The device becomes available (e.g., after a restart)
- Memory pressure decreases
- The error condition is resolved
No manual intervention required.
Monitoring Fallback Events¶
```sql
-- Check fallback statistics
SELECT
    operation_type,
    COUNT(*) as fallback_count,
    AVG(cpu_time_ms) as avg_cpu_time,
    AVG(expected_gpu_time_ms) as expected_gpu_time,
    ROUND(AVG(cpu_time_ms) / AVG(expected_gpu_time_ms), 2) as speedup_ratio
FROM gpu_fallback_metrics
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY operation_type
ORDER BY fallback_count DESC;
```
Configuration & Tuning¶
Enable/Disable GPU Acceleration¶
```sql
-- Disable GPU for the entire session
SET gpu_enabled = false;

-- Disable GPU for a single query
SELECT /*+ GPU_DISABLED */ * FROM large_table WHERE condition;

-- Re-enable GPU
SET gpu_enabled = true;
```
Memory Configuration¶
```sql
-- Allocate 4 GB to the GPU memory pool
SET gpu_memory_limit = 4294967296;  -- bytes

-- Monitor current usage
SELECT
    gpu_id,
    memory_allocated_mb,
    memory_used_mb,
    memory_free_mb
FROM gpu_memory_stats;
```
Routing Thresholds¶
```sql
-- Set the minimum row count for GPU routing
SET gpu_min_rows = 100000;  -- don't use GPU for queries with <100K rows

-- Set the maximum GPU cost for a query
SET gpu_max_cost_ms = 5000;  -- fall back to CPU if GPU would take >5 s
```
Performance Hints¶
```sql
-- Force GPU execution
SELECT /*+ GPU_ENABLED */ * FROM table WHERE condition;

-- Force CPU execution
SELECT /*+ CPU_ONLY */ * FROM table WHERE condition;

-- Use a specific join algorithm
SELECT /*+ HASH_JOIN(orders, customers) */
    o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```
Future Roadmap¶
Q4 2025 (Next 2-3 Weeks)¶
- [ ] Window function kernel fixes (14 functions)
- [ ] Result marshaling improvements
- [ ] Testing with TPC-H workloads
Q1 2026¶
- [ ] Statistical aggregation GPU kernels
- [ ] STDDEV, VARIANCE GPU implementations
- [ ] PERCENTILE optimization
Q2 2026¶
- [ ] Complex hash join variants
- [ ] Outer join GPU support
- [ ] Non-equi join optimization
Q3 2026+¶
- [ ] Custom user-defined functions (UDF) on GPU
- [ ] Machine learning model inference on GPU
- [ ] Advanced query optimization
Estimated Impact¶
| Quarter | Feature | Expected Speedup | LOC | Tests |
|---|---|---|---|---|
| Q4 2025 | Window functions | 10-50x | ~2K | 50+ |
| Q1 2026 | Statistics | 5-20x | ~3K | 75+ |
| Q2 2026 | Complex joins | 5-15x | ~2K | 40+ |
Support & Contact¶
Getting Help¶
For GPU-related issues:
1. Check this document for known limitations
2. Enable GPU debugging: `SET gpu_debug = true;`
3. Review logs in `/var/log/heliosdb/gpu.log`
4. Contact support with:
- Query that caused issue
- GPU model and driver version
- Dataset size and schema
- Expected vs. actual performance
Feature Request Process¶
To request GPU acceleration for a specific operation:
1. Submit a GitHub issue including:
   - SQL query example
   - Dataset characteristics (rows, columns, data types)
   - Current workaround
   - Expected vs. observed performance
2. The product team will:
   - Evaluate feasibility
   - Estimate effort
   - Prioritize against other requests
Reporting Bugs¶
If GPU execution produces incorrect results:
1. Verify with `SET gpu_enabled = false;` (CPU-only execution)
2. Compare results between GPU and CPU execution
3. Report with:
   - Exact SQL query
   - Sample data (or a data generation script)
   - GPU model and driver version
   - Expected vs. actual results
This is a critical issue and will be addressed immediately.
Appendix: Technical Details¶
GPU Architecture Overview¶
```text
Query Input
    ↓
[Query Router] ← CPU or GPU?
    ↓
┌───────────────┬───────────────┬───────────────┐
│   GPU Path    │   CPU Path    │   Fallback    │
│ (High-speed)  │  (Universal)  │  (Automatic)  │
└───────────────┴───────────────┴───────────────┘
    ↓
Result Marshaling
    ↓
[Result Return]
    ↓
Application
```
Memory Layout (GPU)¶
```text
GPU Memory (e.g., a 24 GB card)
├── Kernel code & constants: ~50 MB
├── GPU buffers:
│   ├── Input data
│   ├── Intermediate results
│   └── Output data
├── Memory pool: ~80% available
└── Reserved: ~5% for system
```
Latency Breakdown (10M row query)¶
```text
GPU execution:
├── H2D transfer: ~50 ms (host to device)
├── Kernel execution: ~150 ms
├── D2H transfer: ~40 ms (device to host)
└── Total: ~240 ms

CPU execution:
├── Data loading: ~200 ms
├── Computation: ~1,500 ms
└── Total: ~1,700 ms

Speedup: 1700 / 240 ≈ 7.1x
```
Document History¶
| Version | Date | Changes |
|---|---|---|
| 1.0 | Nov 2025 | Initial comprehensive limitations documentation |