
Performance Tracing & Bottleneck Analysis

HeliosDB Full ships with a built-in, zero-cost-when-disabled performance tracing system that instruments every layer of the database — from SQL parsing through storage I/O. Use it to pinpoint bottlenecks during performance tests, diagnose slow queries, and generate production-grade reports without external tooling.

Key Differences from Nano/Lite

Capability              Nano                    Lite                      Full
--------------------------------------------------------------------------------------------------------------
Granularity             Per-query phases        Per-query phases          Per-operation across 8 components
Span count              ~8 phase types          ~6 phase types            57 instrumented operations
Overhead when disabled  Branch check per query  Option<QueryTrace> check  Single atomic load (AtomicBool)
Overhead when enabled   ~1-2us per query        ~1-2us per query          ~100ns per span (lock-free)
Percentiles             None                    avg/p50/p95/p99           min/avg/p50/p95/p99 per operation
Hierarchical spans      No                      No                        Yes (parent-child relationships)
Component breakdown     No                      No                        Yes (8 components, % of session)
Output formats          Structured log events   Text report               Text report + JSON + programmatic API
Thread safety           Single-threaded         Single-threaded           Lock-free DashMap + thread-local stacks
SQL interface           RUST_LOG env var        SET helios.trace_queries  Programmatic API (see below)

Architecture

                    PerfTracer (global singleton)
                    ├── enabled: AtomicBool          ← zero-cost gate
                    ├── next_span_id: AtomicU64      ← lock-free ID generator
                    ├── completed: DashMap            ← all finished spans
                    └── aggregated: DashMap           ← running per-op statistics

 Thread 1                Thread 2                Thread 3
 ┌─────────────┐        ┌─────────────┐        ┌─────────────┐
 │ SPAN_STACK  │        │ SPAN_STACK  │        │ SPAN_STACK  │
 │ (thread-    │        │ (thread-    │        │ (thread-    │
 │  local)     │        │  local)     │        │  local)     │
 └──────┬──────┘        └──────┬──────┘        └──────┬──────┘
        │ drop guard           │ drop guard           │ drop guard
        └──────────────────────┴──────────────────────┘
                    ┌──────────▼──────────┐
                    │   DashMap shard N   │  ← lock-free aggregation
                    └─────────────────────┘

Zero-cost when disabled: Every perf_span! macro call starts with self.enabled.load(Ordering::Acquire). When tracing is off, this is a single CPU instruction (no branch misprediction after warmup) and the RAII guard stores span_id = 0, making Drop a no-op.

Lock-free when enabled: Each thread maintains its own span stack (thread_local!). On span completion, timing data is inserted into a DashMap (sharded concurrent hash map) — no global mutex, no contention between threads.
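
In sketch form, the pattern looks roughly like this. This is a simplified illustration rather than the actual HeliosDB source: it assumes the dashmap crate, and it omits span hierarchy recording, attributes, and percentile tracking.

use dashmap::DashMap;
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::{Duration, Instant};

pub struct Tracer {
    enabled: AtomicBool,
    next_span_id: AtomicU64,
    aggregated: DashMap<&'static str, (u64, Duration)>, // operation -> (count, total time)
}

thread_local! {
    // Per-thread stack of active span IDs, used to track nesting
    // (parent/child relationships are not recorded further in this sketch).
    static SPAN_STACK: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

pub struct SpanGuard<'a> {
    tracer: &'a Tracer,
    op: &'static str,
    span_id: u64,           // 0 means tracing was disabled; Drop is a no-op
    start: Option<Instant>, // None on the disabled path (no clock read at all)
}

impl Tracer {
    pub fn new() -> Self {
        Tracer {
            enabled: AtomicBool::new(false),
            next_span_id: AtomicU64::new(1),
            aggregated: DashMap::new(),
        }
    }

    pub fn enable(&self) {
        self.enabled.store(true, Ordering::Release);
    }

    pub fn begin_span(&self, op: &'static str) -> SpanGuard<'_> {
        // Zero-cost gate: a single atomic load when tracing is off.
        if !self.enabled.load(Ordering::Acquire) {
            return SpanGuard { tracer: self, op, span_id: 0, start: None };
        }
        let id = self.next_span_id.fetch_add(1, Ordering::Relaxed);
        SPAN_STACK.with(|s| s.borrow_mut().push(id)); // nesting bookkeeping
        SpanGuard { tracer: self, op, span_id: id, start: Some(Instant::now()) }
    }
}

impl Drop for SpanGuard<'_> {
    fn drop(&mut self) {
        if self.span_id == 0 {
            return; // disabled path: nothing was recorded
        }
        let elapsed = self.start.expect("enabled span has a start time").elapsed();
        SPAN_STACK.with(|s| { s.borrow_mut().pop(); });
        // Lock-free aggregation into a sharded map: no global mutex.
        self.tracer
            .aggregated
            .entry(self.op)
            .and_modify(|(count, total)| {
                *count += 1;
                *total += elapsed;
            })
            .or_insert((1, elapsed));
    }
}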


Quick Start

use heliosdb_common::perf_tracer::{PerfTracer, Component};
use heliosdb_common::perf_span;

// 1. Enable tracing (resets previous data, starts session clock)
PerfTracer::global().enable();

// 2. Run your workload
//    All instrumented operations automatically record timing spans.
//    (execute queries, run benchmarks, etc.)

// 3. Generate the bottleneck report
let report = PerfTracer::global().generate_report();

// Human-readable text report
println!("{}", report.to_text());

// Machine-readable JSON (for dashboards, CI pipelines)
let json = report.to_json();
println!("{}", serde_json::to_string_pretty(&json).unwrap());

// 4. Disable tracing (returns to zero overhead)
PerfTracer::global().disable();

Minimal Wrapper for Quick Profiling

// Wrap any code block with enable/disable
PerfTracer::global().enable();
run_benchmark();
let report = PerfTracer::global().generate_report();
println!("{}", report.to_text());
PerfTracer::global().disable();
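
If you repeat this pattern often, a small wrapper of your own can scope it around a closure. The helper below is hypothetical (not part of the HeliosDB API) and assumes PerfReport is exported from heliosdb_common::perf_tracer:

use heliosdb_common::perf_tracer::{PerfReport, PerfTracer};

// Hypothetical convenience wrapper: enable tracing, run the workload,
// capture the report, then return tracing to zero-overhead mode.
fn trace_scope<T>(workload: impl FnOnce() -> T) -> (T, PerfReport) {
    PerfTracer::global().enable();
    let result = workload();
    let report = PerfTracer::global().generate_report();
    PerfTracer::global().disable();
    (result, report)
}

let (_, report) = trace_scope(run_benchmark);
println!("{}", report.to_text());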

Instrumented Operations

Every operation below is automatically traced when the PerfTracer is enabled. No code changes needed — the spans are already wired into the hot paths.

Protocol Layer (10 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
sql_parse                SQL text to AST                        heliosdb-protocols/src/postgres/sql_parser.rs
handle_query             Full query lifecycle (top-level)       heliosdb-protocols/src/postgres/handler.rs
execute_select           SELECT execution                       heliosdb-protocols/src/postgres/handler.rs
execute_insert           INSERT execution                       heliosdb-protocols/src/postgres/handler.rs
execute_update           UPDATE execution                       heliosdb-protocols/src/postgres/handler.rs
execute_delete           DELETE execution                       heliosdb-protocols/src/postgres/handler.rs
execute_create_table     DDL CREATE TABLE                       heliosdb-protocols/src/postgres/handler.rs
execute_create_function  DDL CREATE FUNCTION                    heliosdb-protocols/src/postgres/handler.rs
execute_call_procedure   Stored procedure call                  heliosdb-protocols/src/postgres/handler.rs
execute_create_trigger   DDL CREATE TRIGGER                     heliosdb-protocols/src/postgres/handler.rs

Compute Layer (9 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
execute_plan             Full query plan execution              heliosdb-compute/src/executor/query_executor.rs
table_scan               Sequential table scan                  heliosdb-compute/src/executor/query_executor.rs
index_scan               Index-based lookup                     heliosdb-compute/src/executor/query_executor.rs
filter                   Predicate evaluation                   heliosdb-compute/src/executor/query_executor.rs
project                  Column projection                      heliosdb-compute/src/executor/query_executor.rs
hash_join                Hash join build + probe                heliosdb-compute/src/executor/query_executor.rs
aggregate                GROUP BY / aggregation                 heliosdb-compute/src/executor/query_executor.rs
sort                     ORDER BY sort                          heliosdb-compute/src/executor/query_executor.rs
window_function          Window function evaluation             heliosdb-compute/src/executor/query_executor.rs

Optimizer Layer (2 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
optimize_query           Full optimization pass                 heliosdb-compute/src/optimizer_legacy.rs
cost_estimation          Cost model evaluation                  heliosdb-compute/src/optimizer_legacy.rs

Parallel Execution (2 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
parallel_execute         Work-stealing parallel execution       heliosdb-compute/src/parallel_query.rs
partition_data           Data partitioning for parallelism      heliosdb-compute/src/parallel_query.rs

Storage Layer (10 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
lsm_put                  LSM tree write (memtable insert)       heliosdb-storage/src/lsm.rs
lsm_get                  LSM tree point read                    heliosdb-storage/src/lsm.rs
lsm_delete               LSM tree delete (tombstone write)      heliosdb-storage/src/lsm.rs
lsm_scan                 LSM tree range scan (k-way merge)      heliosdb-storage/src/lsm.rs
wal_write                Write-ahead log entry                  heliosdb-storage/src/lsm.rs
memtable_flush           Memtable flush to SSTable              heliosdb-storage/src/lsm.rs
compaction               SSTable level compaction               heliosdb-storage/src/lsm.rs
sstable_write            SSTable file creation                  heliosdb-storage/src/sstable.rs
sstable_get              SSTable point lookup (bloom + index)   heliosdb-storage/src/sstable.rs
sstable_scan             SSTable range scan                     heliosdb-storage/src/sstable.rs

Cluster Layer (7 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
raft_append              Raft log append                        heliosdb-cluster/src/consensus/mod.rs, log.rs
raft_election            Raft leader election                   heliosdb-cluster/src/consensus/mod.rs, raft_node.rs
raft_commit              Raft entry commit                      heliosdb-cluster/src/consensus/mod.rs
cache_fusion_lookup      Cache fusion remote lookup             heliosdb-cluster/src/cache_fusion.rs
cache_fusion_transfer    Cache fusion block transfer            heliosdb-cluster/src/cache_fusion.rs
distributed_scan         Distributed smart scan coordination    heliosdb-storage/src/distributed_smart_scan.rs

Metadata Layer (2 operation types, 8 spans)

Operation                Description                            File
---------------------------------------------------------------------------------------------------------
schema_lookup            Schema/catalog resolution              heliosdb-metadata/src/schema_service.rs, raft_service.rs
ddl_execute              DDL operation execution                heliosdb-metadata/src/schema_service.rs, raft_service.rs

Understanding the Report

Sample Output

+======================================================================+
|                HeliosDB Performance Bottleneck Report                |
+======================================================================+

Session Duration: 12.45s  |  Total Spans: 45,230

-- Component Breakdown ----------------------------------------------------
Component      Total Time      Spans   % Sess
--------------------------------------------------------------
Storage           8.92s       22,100    71.6%
Compute           2.34s       11,500    18.8%
Protocol          0.89s        5,600     7.1%
Optimizer         0.21s        5,600     1.7%
Metadata          0.09s          430     0.7%

-- Top Bottlenecks (by total time) ----------------------------------------
Component    Operation                       Total     Count        Avg        P99  % Sess
------------------------------------------------------------------------------------------
Storage      lsm_scan                        5.12s      2800      1.8ms     12.3ms   41.1%
Storage      sstable_scan                    2.45s      8400      291us      1.2ms   19.7%
Compute      hash_join                       1.23s       450      2.7ms     15.1ms    9.9%
Compute      sort                            0.67s       320      2.1ms      8.4ms    5.4%
Protocol     execute_select                  0.56s      2800      200us      1.1ms    4.5%
Storage      lsm_put                         0.45s      5000       90us      450us    3.6%
Storage      memtable_flush                  0.38s        12     31.7ms     45.2ms    3.1%
Compute      aggregate                       0.34s       280      1.2ms      4.5ms    2.7%
Optimizer    optimize_query                  0.21s      5600       38us      120us    1.7%
Protocol     sql_parse                       0.18s      5600       32us       95us    1.4%

-- All Operations ---------------------------------------------------------
Component    Operation                      Count      Total        Min        Avg        P95        P99
---------------------------------------------------------------------------------------------------------
Storage      lsm_scan                        2800      5.12s      120us      1.8ms      8.5ms     12.3ms
Storage      sstable_scan                    8400      2.45s       15us      291us      850us      1.2ms
...

How to Read It

Component Breakdown shows where time is spent at the macro level. In the example above, 71.6% of traced time is in Storage — typical for I/O-bound workloads. If Optimizer or Protocol dominate, investigate complex queries or parsing overhead.

Top Bottlenecks ranks individual operations by total cumulative time. The most impactful optimization target is the #1 bottleneck. Key metrics:

  • Total: Cumulative wall time across all invocations. The primary ranking metric.
  • Count: Number of times this operation was called. High count with low avg = many small operations. Low count with high avg = few expensive operations.
  • Avg: Average duration per invocation.
  • P99: 99th percentile — the "worst case" that 1% of invocations exceed. If P99 >> Avg, you have tail latency outliers.
  • % Sess: Percentage of the total session duration. This bounds the speedup you could get by eliminating the bottleneck entirely (Amdahl's law): removing lsm_scan (41.1% of the session) would speed up the session by at most 1 / (1 - 0.411) ≈ 1.7x.

All Operations provides the full statistical breakdown for every traced operation.


Performance Testing Workflow

1. Baseline Measurement

use heliosdb_common::perf_tracer::PerfTracer;

// Reset and enable
PerfTracer::global().enable();

// Run baseline workload
run_oltp_benchmark(1000); // e.g., 1000 transactions

// Capture baseline report
let baseline = PerfTracer::global().generate_report();
let baseline_json = baseline.to_json();
println!("=== BASELINE ===\n{}", baseline.to_text());
PerfTracer::global().disable();

2. Make Changes

// Add an index, tune a config parameter, change a query, etc.

3. Compare

PerfTracer::global().enable();
run_oltp_benchmark(1000); // same workload
let after = PerfTracer::global().generate_report();
println!("=== AFTER CHANGE ===\n{}", after.to_text());
PerfTracer::global().disable();

// Programmatic comparison
let after_json = after.to_json();
let b = &baseline_json["bottlenecks"][0];
let a = &after_json["bottlenecks"][0];
println!(
    "Top bottleneck: {} -> {} (was {}us, now {}us)",
    b["operation"], a["operation"],
    b["avg_duration_us"], a["avg_duration_us"],
);

4. Drill Into Specific Operations

let report = PerfTracer::global().generate_report();

// Find all storage operations slower than 1ms average
for op in &report.operations {
    if op.component == heliosdb_common::perf_tracer::Component::Storage
        && op.avg_duration > std::time::Duration::from_millis(1)
    {
        println!(
            "SLOW: {} — avg {:?}, p99 {:?}, count {}",
            op.operation, op.avg_duration, op.p99_duration, op.count
        );
    }
}

Adding Custom Spans

Instrument your own code with the perf_span! macro. Spans nest automatically via the thread-local stack.

use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;

fn process_batch(batch: &[Row]) {
    // Simple span (operation name only)
    let _span = perf_span!(Component::Compute, "process_batch");

    for row in batch {
        // Span with key-value attributes (for debugging, not aggregation)
        let _row_span = perf_span!(
            Component::Compute, "process_row",
            "batch_size" => batch.len(),
            "row_key" => row.key
        );
        // ... processing ...
        // _row_span records timing on drop
    }
    // _span records timing on drop (includes all child spans)
}

Span Nesting

Spans automatically form parent-child relationships within each thread:

{
    let _outer = perf_span!(Component::Protocol, "handle_query");
    {
        let _parse = perf_span!(Component::Protocol, "sql_parse");
        // parse_span.parent_id == outer_span.span_id
    }
    {
        let _optimize = perf_span!(Component::Optimizer, "optimize_query");
        // optimize_span.parent_id == outer_span.span_id
    }
    {
        let _exec = perf_span!(Component::Compute, "execute_plan");
        {
            let _scan = perf_span!(Component::Storage, "lsm_scan");
            // scan_span.parent_id == exec_span.span_id
            // scan_span.depth == 2
        }
    }
}

Component Selection

Choose the component that owns the operation:

Component             Use For
--------------------------------------------------------------------------------
Component::Storage    Anything touching disk, memtable, WAL, SSTables, compaction
Component::Compute    Query execution, joins, sorts, aggregation, expression evaluation
Component::Optimizer  Query planning, cost estimation, rule application
Component::Protocol   SQL parsing, wire protocol handling, result serialization
Component::Network    Connection management, transport, RPC
Component::Cluster    Raft consensus, replication, distributed coordination
Component::Cache      Block cache, query cache, tiered cache lookups
Component::Metadata   Schema catalog, DDL operations
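
For example, a custom block-cache probe would be tagged with Component::Cache, while a miss that falls through to disk is covered by its own Storage-tagged span. The BlockCache type below is illustrative only (not a HeliosDB type):

use std::collections::HashMap;
use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;

// Illustrative in-process block cache, defined only for this example.
struct BlockCache {
    blocks: HashMap<u64, Vec<u8>>,
}

impl BlockCache {
    fn lookup(&self, block_id: u64) -> Option<&Vec<u8>> {
        // The cache owns this operation, so tag it Component::Cache.
        let _span = perf_span!(Component::Cache, "block_cache_lookup", "block_id" => block_id);
        self.blocks.get(&block_id)
    }
}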

JSON Output for CI/Dashboards

The JSON report format is designed for automated analysis:

{
  "session_duration_us": 12450000,
  "total_spans": 45230,
  "components": [
    {
      "component": "Storage",
      "total_duration_us": 8920000,
      "span_count": 22100,
      "pct_of_session": 71.6
    }
  ],
  "bottlenecks": [
    {
      "component": "Storage",
      "operation": "lsm_scan",
      "total_duration_us": 5120000,
      "count": 2800,
      "avg_duration_us": 1828,
      "p99_duration_us": 12300,
      "pct_of_session": 41.1
    }
  ],
  "operations": [
    {
      "component": "Storage",
      "operation": "lsm_scan",
      "count": 2800,
      "total_duration_us": 5120000,
      "min_duration_us": 120,
      "avg_duration_us": 1828,
      "p50_duration_us": 1450,
      "p95_duration_us": 8500,
      "p99_duration_us": 12300
    }
  ]
}

CI Integration Example

let report = PerfTracer::global().generate_report();
let json = report.to_json();

// Fail CI if any operation's p99 exceeds threshold
for op in json["operations"].as_array().unwrap() {
    let p99 = op["p99_duration_us"].as_u64().unwrap();
    let name = op["operation"].as_str().unwrap();
    if p99 > 50_000 { // 50ms threshold
        panic!("PERF REGRESSION: {} p99 = {}us (threshold: 50000us)", name, p99);
    }
}

Interpreting Common Patterns

Pattern                              Likely Cause                                    Action
----------------------------------------------------------------------------------------------------------------------------------
Storage > 70% of session             I/O-bound workload                              Check lsm_scan and sstable_scan counts; add indexes; tune compaction
lsm_scan P99 >> Avg                  Tail latency from scanning many SSTables        Run compaction; check level count
memtable_flush high avg              Large memtable or slow disk                     Increase flush parallelism; check I/O bandwidth
compaction high total                Background compaction stealing I/O              Tune compaction rate limiter; schedule off-peak
hash_join dominates Compute          Large hash tables for joins                     Check join order; add join indexes; increase memory
sort high total                      Many ORDER BY without index                     Add covering indexes for sorted access patterns
optimize_query > 10% session         Complex queries trigger expensive planning      Simplify queries; use prepared statements
sql_parse high total                 Very long SQL strings or many distinct queries  Use parameterized queries; batch operations
wal_write P99 high                   WAL fsync latency spikes                        Check disk IOPS; enable group commit (DirectIoWal)
raft_append high in cluster          Consensus latency between nodes                 Check network latency; reduce replication factor for reads
cache_fusion_transfer high           Remote cache misses                             Co-locate hot data; increase local cache size
Optimizer cost low but Compute high  Good plans but slow execution                   Focus on storage I/O and compute optimization
Protocol > 20% session               Parsing or serialization overhead               Use prepared statements; reduce result set size
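
Several of these patterns can be checked mechanically against the JSON report shown earlier. A rough sketch for the first row of the table above (the 70% threshold is an arbitrary example value):

use heliosdb_common::perf_tracer::PerfTracer;

let json = PerfTracer::global().generate_report().to_json();

// Flag an I/O-bound session: Storage consuming more than 70% of traced time.
for comp in json["components"].as_array().unwrap() {
    if comp["component"] == "Storage" && comp["pct_of_session"].as_f64().unwrap() > 70.0 {
        println!("Storage-dominated session: inspect lsm_scan / sstable_scan counts, indexes, compaction");
    }
}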

Overhead Characteristics

Disabled (Default) — Zero Cost

perf_span!() call:
  1. AtomicBool::load(Acquire)     ~1ns   (single CPU instruction)
  2. Branch: span_id == 0 → skip   ~0ns   (predicted after warmup)
  3. Drop: span_id == 0 → no-op    ~0ns

Total overhead per span: ~1ns (unmeasurable in practice)

The compiler inlines the #[inline(always)] is_enabled() check. After branch predictor warmup, the disabled path adds zero measurable overhead — verified by benchmarking with and without instrumentation.

Enabled — Low Overhead

perf_span!() call:
  1. AtomicU64::fetch_add(Relaxed)  ~5ns   (span ID)
  2. Instant::now()                 ~25ns  (clock read)
  3. thread_local push              ~10ns  (stack push)

Drop (span completion):
  4. Instant::elapsed()             ~25ns  (clock read)
  5. thread_local pop               ~10ns  (stack pop)
  6. DashMap::entry().record()      ~30ns  (lock-free insert)

Total overhead per span: ~100ns when enabled

For a workload producing 10,000 spans/second, that's ~1ms of tracing overhead per second (0.1%). For 100,000 spans/second, ~10ms/second (1%). This is acceptable for staging and targeted production debugging.

Memory Usage

  • Per completed span: ~200 bytes (component, operation string, duration, attributes, IDs)
  • Per aggregated operation: ~500 bytes + 8 bytes per recorded duration
  • 10,000 spans session: ~2MB memory
  • 100,000 spans session: ~20MB memory

Call PerfTracer::global().reset() between test runs to free accumulated data.
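
A minimal sketch of that pattern, reusing the hypothetical run_benchmark() workload from the earlier examples:

use heliosdb_common::perf_tracer::PerfTracer;

PerfTracer::global().enable();
for run in 0..3 {
    run_benchmark(); // hypothetical workload from the earlier examples
    let report = PerfTracer::global().generate_report();
    println!("=== run {} ===\n{}", run, report.to_text());
    // Free the accumulated spans so the next run starts from a clean slate
    // (tracing stays enabled).
    PerfTracer::global().reset();
}
PerfTracer::global().disable();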


API Reference

PerfTracer

// Get the global singleton
PerfTracer::global() -> &'static PerfTracer

// Lifecycle
tracer.enable()              // Enable + reset + start session clock
tracer.disable()             // Disable (spans become no-ops)
tracer.is_enabled() -> bool  // Fast check (atomic load)
tracer.reset()               // Clear all data without changing enabled state

// Span creation
tracer.begin_span(component, operation, attributes) -> PerfSpanGuard

// Reporting
tracer.generate_report() -> PerfReport
tracer.span_count() -> usize

PerfReport

report.to_text() -> String              // Human-readable table
report.to_json() -> serde_json::Value   // Machine-readable JSON

// Fields
report.session_duration: Duration
report.total_spans: u64
report.operations: Vec<OperationStats>      // All ops, sorted by total time desc
report.component_summary: Vec<ComponentSummary>  // Per-component aggregation
report.bottlenecks: Vec<Bottleneck>         // Top 10 by total time
report.spans: Vec<CompletedSpan>            // All individual spans (for analysis)

perf_span! Macro

// Simple span (no attributes)
let _span = perf_span!(Component::Storage, "operation_name");

// Span with key-value attributes
let _span = perf_span!(Component::Storage, "operation_name",
    "key1" => value1,
    "key2" => value2
);

// The guard (_span) MUST be bound to a variable.
// If not bound, it drops immediately and records zero duration.
// WRONG: perf_span!(Component::Storage, "op");  // drops immediately!
// RIGHT: let _span = perf_span!(Component::Storage, "op");

Comparison with External Tracing Tools

Feature               PerfTracer (built-in)   tracing crate            Jaeger/OpenTelemetry
-----------------------------------------------------------------------------------------------
Setup                 Zero config             Subscriber setup         Agent + collector infrastructure
Overhead disabled     ~1ns (atomic load)      ~5ns (subscriber check)  N/A (not embedded)
Overhead enabled      ~100ns/span             ~200ns/span              ~1-5us/span (network export)
Percentile stats      Built-in (p50/p95/p99)  Requires aggregator      Requires backend query
Bottleneck ranking    Automatic               Manual analysis          Manual query
Thread safety         Lock-free (DashMap)     Layer-dependent          Async export
Distributed tracing   Span hierarchy only     Full distributed         Full distributed
Production use        Targeted debugging      Always-on possible       Always-on recommended

Use PerfTracer for: Performance testing, benchmarking, targeted bottleneck analysis, CI regression checks.

Use tracing crate for: Always-on structured logging, distributed request tracing, log aggregation.

Use OpenTelemetry for: Multi-service distributed tracing, production observability, SLO monitoring.

PerfTracer is complementary to these tools — it provides instant, zero-dependency bottleneck analysis without any infrastructure setup.