Performance Tracing & Bottleneck Analysis¶
HeliosDB Full ships with a built-in, zero-cost-when-disabled performance tracing system that instruments every layer of the database — from SQL parsing through storage I/O. Use it to pinpoint bottlenecks during performance tests, diagnose slow queries, and generate production-grade reports without external tooling.
Key Differences from Nano/Lite¶
| Capability | Nano | Lite | Full |
|---|---|---|---|
| Granularity | Per-query phases | Per-query phases | Per-operation across 8 components |
| Span count | ~8 phase types | ~6 phase types | 57 instrumented operations |
| Overhead when disabled | Branch check per query | Option<QueryTrace> check | Single atomic load (AtomicBool) |
| Overhead when enabled | ~1-2us per query | ~1-2us per query | ~100ns per span (lock-free) |
| Percentiles | None | avg/p50/p95/p99 | min/avg/p50/p95/p99 per operation |
| Hierarchical spans | No | No | Yes (parent-child relationships) |
| Component breakdown | No | No | Yes (8 components, % of session) |
| Output formats | Structured log events | Text report | Text report + JSON + programmatic API |
| Thread safety | Single-threaded | Single-threaded | Lock-free DashMap + thread-local stacks |
| SQL interface | RUST_LOG env var | SET helios.trace_queries | Programmatic API (see below) |
Architecture¶
PerfTracer (global singleton)
├── enabled: AtomicBool        ← zero-cost gate
├── next_span_id: AtomicU64    ← lock-free ID generator
├── completed: DashMap         ← all finished spans
└── aggregated: DashMap        ← running per-op statistics
Thread 1               Thread 2               Thread 3
┌─────────────┐        ┌─────────────┐        ┌─────────────┐
│ SPAN_STACK  │        │ SPAN_STACK  │        │ SPAN_STACK  │
│ (thread-    │        │ (thread-    │        │ (thread-    │
│  local)     │        │  local)     │        │  local)     │
└──────┬──────┘        └──────┬──────┘        └──────┬──────┘
       │ drop guard           │ drop guard           │ drop guard
       └──────────────────────┴──────────────────────┘
                              │
                   ┌──────────▼──────────┐
                   │   DashMap shard N   │ ← lock-free aggregation
                   └─────────────────────┘
Zero-cost when disabled: Every perf_span! macro call starts with self.enabled.load(Ordering::Acquire). When tracing is off, this is a single CPU instruction (no branch misprediction after warmup) and the RAII guard stores span_id = 0, making Drop a no-op.
Lock-free when enabled: Each thread maintains its own span stack (thread_local!). On span completion, timing data is inserted into a DashMap (sharded concurrent hash map) — no global mutex, no contention between threads.
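In code, the pattern looks roughly like this (an illustrative sketch of the gate-and-guard idea, not the actual HeliosDB source; the names follow the architecture diagram above):
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Instant;
pub struct PerfSpanGuard {
    span_id: u64,           // 0 = tracing was disabled; Drop becomes a no-op
    start: Option<Instant>,
}
impl PerfSpanGuard {
    fn begin(enabled: &AtomicBool, next_id: &AtomicU64) -> PerfSpanGuard {
        // The single atomic load: the entire cost when tracing is disabled.
        if !enabled.load(Ordering::Acquire) {
            return PerfSpanGuard { span_id: 0, start: None };
        }
        PerfSpanGuard {
            span_id: next_id.fetch_add(1, Ordering::Relaxed),
            start: Some(Instant::now()),
        }
    }
}
impl Drop for PerfSpanGuard {
    fn drop(&mut self) {
        if self.span_id == 0 {
            return; // disabled path: nothing was recorded, nothing to do
        }
        let _elapsed = self.start.expect("set when enabled").elapsed();
        // Real implementation: pop the thread-local stack and record
        // (operation, elapsed) into the DashMap aggregation.
    }
}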
Quick Start¶
Rust API (Recommended)¶
use heliosdb_common::perf_tracer::{PerfTracer, Component};
use heliosdb_common::perf_span;
// 1. Enable tracing (resets previous data, starts session clock)
PerfTracer::global().enable();
// 2. Run your workload
// All instrumented operations automatically record timing spans.
// (execute queries, run benchmarks, etc.)
// 3. Generate the bottleneck report
let report = PerfTracer::global().generate_report();
// Human-readable text report
println!("{}", report.to_text());
// Machine-readable JSON (for dashboards, CI pipelines)
let json = report.to_json();
println!("{}", serde_json::to_string_pretty(&json).unwrap());
// 4. Disable tracing (returns to zero overhead)
PerfTracer::global().disable();
One-Liner for Quick Profiling¶
// Wrap any code block with enable/disable
PerfTracer::global().enable();
run_benchmark();
let report = PerfTracer::global().generate_report();
println!("{}", report.to_text());
PerfTracer::global().disable();
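If you profile frequently, this pattern is easy to wrap in a small helper (a hypothetical convenience function, not part of the PerfTracer API; it assumes PerfReport is exported alongside PerfTracer):
use heliosdb_common::perf_tracer::{PerfReport, PerfTracer};
// Trace a closure and hand back both its result and the perf report.
fn with_tracing<R>(workload: impl FnOnce() -> R) -> (R, PerfReport) {
    PerfTracer::global().enable();
    let result = workload();
    let report = PerfTracer::global().generate_report();
    PerfTracer::global().disable();
    (result, report)
}
// Usage: let (_, report) = with_tracing(run_benchmark);
//        println!("{}", report.to_text());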
Instrumented Operations¶
Every operation below is automatically traced when the PerfTracer is enabled. No code changes needed — the spans are already wired into the hot paths.
Protocol Layer (10 spans)¶
| Operation | Description | File |
|---|---|---|
| sql_parse | SQL text to AST | heliosdb-protocols/src/postgres/sql_parser.rs |
| handle_query | Full query lifecycle (top-level) | heliosdb-protocols/src/postgres/handler.rs |
| execute_select | SELECT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_insert | INSERT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_update | UPDATE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_delete | DELETE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_table | DDL CREATE TABLE | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_function | DDL CREATE FUNCTION | heliosdb-protocols/src/postgres/handler.rs |
| execute_call_procedure | Stored procedure call | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_trigger | DDL CREATE TRIGGER | heliosdb-protocols/src/postgres/handler.rs |
Compute Layer (9 spans)¶
| Operation | Description | File |
|---|---|---|
| execute_plan | Full query plan execution | heliosdb-compute/src/executor/query_executor.rs |
| table_scan | Sequential table scan | heliosdb-compute/src/executor/query_executor.rs |
| index_scan | Index-based lookup | heliosdb-compute/src/executor/query_executor.rs |
| filter | Predicate evaluation | heliosdb-compute/src/executor/query_executor.rs |
| project | Column projection | heliosdb-compute/src/executor/query_executor.rs |
| hash_join | Hash join build + probe | heliosdb-compute/src/executor/query_executor.rs |
| aggregate | GROUP BY / aggregation | heliosdb-compute/src/executor/query_executor.rs |
| sort | ORDER BY sort | heliosdb-compute/src/executor/query_executor.rs |
| window_function | Window function evaluation | heliosdb-compute/src/executor/query_executor.rs |
Optimizer Layer (2 spans)¶
| Operation | Description | File |
|---|---|---|
| optimize_query | Full optimization pass | heliosdb-compute/src/optimizer_legacy.rs |
| cost_estimation | Cost model evaluation | heliosdb-compute/src/optimizer_legacy.rs |
Parallel Execution (2 spans)¶
| Operation | Description | File |
|---|---|---|
| parallel_execute | Work-stealing parallel execution | heliosdb-compute/src/parallel_query.rs |
| partition_data | Data partitioning for parallelism | heliosdb-compute/src/parallel_query.rs |
Storage Layer (10 spans)¶
| Operation | Description | File |
|---|---|---|
| lsm_put | LSM tree write (memtable insert) | heliosdb-storage/src/lsm.rs |
| lsm_get | LSM tree point read | heliosdb-storage/src/lsm.rs |
| lsm_delete | LSM tree delete (tombstone write) | heliosdb-storage/src/lsm.rs |
| lsm_scan | LSM tree range scan (k-way merge) | heliosdb-storage/src/lsm.rs |
| wal_write | Write-ahead log entry | heliosdb-storage/src/lsm.rs |
| memtable_flush | Memtable flush to SSTable | heliosdb-storage/src/lsm.rs |
| compaction | SSTable level compaction | heliosdb-storage/src/lsm.rs |
| sstable_write | SSTable file creation | heliosdb-storage/src/sstable.rs |
| sstable_get | SSTable point lookup (bloom + index) | heliosdb-storage/src/sstable.rs |
| sstable_scan | SSTable range scan | heliosdb-storage/src/sstable.rs |
Cluster Layer (6 spans)¶
| Operation | Description | File |
|---|---|---|
| raft_append | Raft log append | heliosdb-cluster/src/consensus/mod.rs, log.rs |
| raft_election | Raft leader election | heliosdb-cluster/src/consensus/mod.rs, raft_node.rs |
| raft_commit | Raft entry commit | heliosdb-cluster/src/consensus/mod.rs |
| cache_fusion_lookup | Cache fusion remote lookup | heliosdb-cluster/src/cache_fusion.rs |
| cache_fusion_transfer | Cache fusion block transfer | heliosdb-cluster/src/cache_fusion.rs |
| distributed_scan | Distributed smart scan coordination | heliosdb-storage/src/distributed_smart_scan.rs |
Metadata Layer (2 operation types, 8 spans)¶
| Operation | Description | File |
|---|---|---|
| schema_lookup | Schema/catalog resolution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |
| ddl_execute | DDL operation execution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |
Understanding the Report¶
Sample Output¶
+======================================================================+
|                HeliosDB Performance Bottleneck Report                |
+======================================================================+
Session Duration: 12.45s | Total Spans: 45,230
-- Component Breakdown ----------------------------------------------------
Component        Total Time       Spans    % Sess
--------------------------------------------------------------
Storage               8.92s      22,100     71.6%
Compute               2.34s      11,500     18.8%
Protocol              0.89s       5,600      7.1%
Optimizer             0.21s       5,600      1.7%
Metadata              0.09s         430      0.7%
-- Top Bottlenecks (by total time) ----------------------------------------
Component   Operation        Total   Count      Avg      P99   % Sess
------------------------------------------------------------------------------------------
Storage     lsm_scan         5.12s    2800    1.8ms   12.3ms    41.1%
Storage     sstable_scan     2.45s    8400    291us    1.2ms    19.7%
Compute     hash_join        1.23s     450    2.7ms   15.1ms     9.9%
Compute     sort             0.67s     320    2.1ms    8.4ms     5.4%
Protocol    execute_select   0.56s    2800    200us    1.1ms     4.5%
Storage     lsm_put          0.45s    5000     90us    450us     3.6%
Storage     memtable_flush   0.38s      12   31.7ms   45.2ms     3.1%
Compute     aggregate        0.34s     280    1.2ms    4.5ms     2.7%
Optimizer   optimize_query   0.21s    5600     38us    120us     1.7%
Protocol    sql_parse        0.18s    5600     32us     95us     1.4%
-- All Operations ---------------------------------------------------------
Component   Operation        Count    Total      Min      Avg      P95      P99
---------------------------------------------------------------------------------------------------------
Storage     lsm_scan          2800    5.12s    120us    1.8ms    8.5ms   12.3ms
Storage     sstable_scan      8400    2.45s     15us    291us    850us    1.2ms
...
How to Read It¶
Component Breakdown shows where time is spent at the macro level. In the example above, 71.6% of traced time is in Storage — typical for I/O-bound workloads. If Optimizer or Protocol dominate, investigate complex queries or parsing overhead.
Top Bottlenecks ranks individual operations by total cumulative time. The most impactful optimization target is the #1 bottleneck. Key metrics:
- Total: Cumulative wall time across all invocations. The primary ranking metric.
- Count: Number of times this operation was called. High count with low avg = many small operations. Low count with high avg = few expensive operations.
- Avg: Average duration per invocation.
- P99: 99th percentile — the "worst case" that 1% of invocations exceed. If P99 >> Avg, you have tail latency outliers.
- % Sess: Percentage of the total session duration. This bounds the payoff of optimizing the operation (Amdahl's law): eliminating a 41.1% bottleneck entirely speeds up the session by at most 1 / (1 - 0.411) ≈ 1.7x.
All Operations provides the full statistical breakdown for every traced operation.
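For example, you can scan those statistics programmatically to flag the tail-latency outliers described above (a sketch using the same OperationStats fields as the drill-down example later on this page):
let report = PerfTracer::global().generate_report();
// Flag operations whose p99 is more than 10x their average duration.
for op in &report.operations {
    if op.count >= 100 && op.p99_duration > op.avg_duration * 10 {
        println!(
            "TAIL LATENCY: {} (avg {:?}, p99 {:?}, count {})",
            op.operation, op.avg_duration, op.p99_duration, op.count
        );
    }
}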
Performance Testing Workflow¶
1. Baseline Measurement¶
use heliosdb_common::perf_tracer::PerfTracer;
// Reset and enable
PerfTracer::global().enable();
// Run baseline workload
run_oltp_benchmark(1000); // e.g., 1000 transactions
// Capture baseline report
let baseline = PerfTracer::global().generate_report();
let baseline_json = baseline.to_json();
println!("=== BASELINE ===\n{}", baseline.to_text());
PerfTracer::global().disable();
2. Make Changes¶
3. Compare¶
PerfTracer::global().enable();
run_oltp_benchmark(1000); // same workload
let after = PerfTracer::global().generate_report();
println!("=== AFTER CHANGE ===\n{}", after.to_text());
PerfTracer::global().disable();
// Programmatic comparison (bind the JSON first so the borrow outlives the indexing)
let after_json = after.to_json();
let b = &baseline_json["bottlenecks"][0];
let a = &after_json["bottlenecks"][0];
println!(
    "Top bottleneck: {} -> {} (was {}us, now {}us)",
    b["operation"], a["operation"],
    b["avg_duration_us"], a["avg_duration_us"],
);
4. Drill Into Specific Operations¶
let report = PerfTracer::global().generate_report();
// Find all storage operations slower than 1ms average
for op in &report.operations {
if op.component == heliosdb_common::perf_tracer::Component::Storage
&& op.avg_duration > std::time::Duration::from_millis(1)
{
println!(
"SLOW: {} — avg {:?}, p99 {:?}, count {}",
op.operation, op.avg_duration, op.p99_duration, op.count
);
}
}
Adding Custom Spans¶
Instrument your own code with the perf_span! macro. Spans nest automatically via the thread-local stack.
use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;
fn process_batch(batch: &[Row]) {
// Simple span (operation name only)
let _span = perf_span!(Component::Compute, "process_batch");
for row in batch {
// Span with key-value attributes (for debugging, not aggregation)
let _row_span = perf_span!(
Component::Compute, "process_row",
"batch_size" => batch.len(),
"row_key" => row.key
);
// ... processing ...
// _row_span records timing on drop
}
// _span records timing on drop (includes all child spans)
}
Span Nesting¶
Spans automatically form parent-child relationships within each thread:
{
let _outer = perf_span!(Component::Protocol, "handle_query");
{
let _parse = perf_span!(Component::Protocol, "sql_parse");
// parse_span.parent_id == outer_span.span_id
}
{
let _optimize = perf_span!(Component::Optimizer, "optimize_query");
// optimize_span.parent_id == outer_span.span_id
}
{
let _exec = perf_span!(Component::Compute, "execute_plan");
{
let _scan = perf_span!(Component::Storage, "lsm_scan");
// scan_span.parent_id == exec_span.span_id
// scan_span.depth == 2
}
}
}
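Because each completed span carries its parent's ID, you can rebuild the call tree from report.spans after a run. A minimal sketch, assuming CompletedSpan exposes the span_id, parent_id (0 for root spans), and operation fields that the comments above suggest; adjust to the actual struct:
use std::collections::HashMap;
let report = PerfTracer::global().generate_report();
// Bucket spans by their parent's span ID.
let mut children: HashMap<u64, Vec<_>> = HashMap::new();
for span in &report.spans {
    children.entry(span.parent_id).or_default().push(span);
}
// Print each root span with its direct children.
for root in &report.spans {
    if root.parent_id == 0 {
        println!("{}", root.operation);
        for child in children.get(&root.span_id).into_iter().flatten() {
            println!("  └─ {}", child.operation);
        }
    }
}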
Component Selection¶
Choose the component that owns the operation:
| Component | Use For |
|---|---|
| Component::Storage | Anything touching disk, memtable, WAL, SSTables, compaction |
| Component::Compute | Query execution, joins, sorts, aggregation, expression evaluation |
| Component::Optimizer | Query planning, cost estimation, rule application |
| Component::Protocol | SQL parsing, wire protocol handling, result serialization |
| Component::Network | Connection management, transport, RPC |
| Component::Cluster | Raft consensus, replication, distributed coordination |
| Component::Cache | Block cache, query cache, tiered cache lookups |
| Component::Metadata | Schema catalog, DDL operations |
JSON Output for CI/Dashboards¶
The JSON report format is designed for automated analysis:
{
"session_duration_us": 12450000,
"total_spans": 45230,
"components": [
{
"component": "Storage",
"total_duration_us": 8920000,
"span_count": 22100,
"pct_of_session": 71.6
}
],
"bottlenecks": [
{
"component": "Storage",
"operation": "lsm_scan",
"total_duration_us": 5120000,
"count": 2800,
"avg_duration_us": 1828,
"p99_duration_us": 12300,
"pct_of_session": 41.1
}
],
"operations": [
{
"component": "Storage",
"operation": "lsm_scan",
"count": 2800,
"total_duration_us": 5120000,
"min_duration_us": 120,
"avg_duration_us": 1828,
"p50_duration_us": 1450,
"p95_duration_us": 8500,
"p99_duration_us": 12300
}
]
}
CI Integration Example¶
let report = PerfTracer::global().generate_report();
let json = report.to_json();
// Fail CI if any operation's p99 exceeds threshold
for op in json["operations"].as_array().unwrap() {
let p99 = op["p99_duration_us"].as_u64().unwrap();
let name = op["operation"].as_str().unwrap();
if p99 > 50_000 { // 50ms threshold
panic!("PERF REGRESSION: {} p99 = {}us (threshold: 50000us)", name, p99);
}
}
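To feed dashboards, the same JSON can be persisted as a build artifact for the CI job to archive (a minimal sketch; the output path is arbitrary):
use std::fs;
let report = PerfTracer::global().generate_report();
let json = report.to_json();
// Write the report somewhere your CI system collects artifacts from.
fs::write(
    "target/perf_report.json",
    serde_json::to_string_pretty(&json).unwrap(),
).expect("failed to write perf report");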
Interpreting Common Patterns¶
| Pattern | Likely Cause | Action |
|---|---|---|
| Storage > 70% of session | I/O-bound workload | Check lsm_scan and sstable_scan counts; add indexes; tune compaction |
| lsm_scan P99 >> Avg | Tail latency from scanning many SSTables | Run compaction; check level count |
| memtable_flush high avg | Large memtable or slow disk | Increase flush parallelism; check I/O bandwidth |
| compaction high total | Background compaction stealing I/O | Tune compaction rate limiter; schedule off-peak |
| hash_join dominates Compute | Large hash tables for joins | Check join order; add join indexes; increase memory |
| sort high total | Many ORDER BY without index | Add covering indexes for sorted access patterns |
| optimize_query > 10% session | Complex queries trigger expensive planning | Simplify queries; use prepared statements |
| sql_parse high total | Very long SQL strings or many distinct queries | Use parameterized queries; batch operations |
| wal_write P99 high | WAL fsync latency spikes | Check disk IOPS; enable group commit (DirectIoWal) |
| raft_append high in cluster | Consensus latency between nodes | Check network latency; reduce replication factor for reads |
| cache_fusion_transfer high | Remote cache misses | Co-locate hot data; increase local cache size |
| Optimizer cost low but Compute high | Good plans but slow execution | Focus on storage I/O and compute optimization |
| Protocol > 20% session | Parsing or serialization overhead | Use prepared statements; reduce result set size |
Overhead Characteristics¶
Disabled (Default) — Zero Cost¶
perf_span!() call:
1. AtomicBool::load(Acquire)        ~1ns   (single CPU instruction)
2. Branch: span_id == 0 → skip      ~0ns   (predicted after warmup)
3. Drop: span_id == 0 → no-op       ~0ns
Total overhead per span: ~1ns (unmeasurable in practice)
The compiler inlines the #[inline(always)] is_enabled() check. After branch predictor warmup, the disabled path adds zero measurable overhead — verified by benchmarking with and without instrumentation.
Enabled — Low Overhead¶
perf_span!() call:
1. AtomicU64::fetch_add(Relaxed)    ~5ns   (span ID)
2. Instant::now()                   ~25ns  (clock read)
3. thread_local push                ~10ns  (stack push)
Drop (span completion):
4. Instant::elapsed()               ~25ns  (clock read)
5. thread_local pop                 ~10ns  (stack pop)
6. DashMap::entry().record()        ~30ns  (lock-free insert)
Total overhead per span: ~100ns when enabled
For a workload producing 10,000 spans/second, that's ~1ms of tracing overhead per second (0.1%). For 100,000 spans/second, ~10ms/second (1%). This is acceptable for staging and targeted production debugging.
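You can sanity-check these numbers on your own hardware with a crude micro-benchmark (a sketch only; it ignores loop overhead, so treat the result as an upper bound and use a proper benchmarking harness for real measurements):
use std::time::Instant;
use heliosdb_common::perf_tracer::{Component, PerfTracer};
use heliosdb_common::perf_span;
const N: u32 = 1_000_000;
PerfTracer::global().enable();
let start = Instant::now();
for _ in 0..N {
    let _span = perf_span!(Component::Compute, "overhead_probe");
}
PerfTracer::global().disable();
println!("~{:?} per span while enabled", start.elapsed() / N);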
Memory Usage¶
- Per completed span: ~200 bytes (component, operation string, duration, attributes, IDs)
- Per aggregated operation: ~500 bytes + 8 bytes per recorded duration
- 10,000 spans session: ~2MB memory
- 100,000 spans session: ~20MB memory
Call PerfTracer::global().reset() between test runs to free accumulated data.
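For example, when running several workloads in one process (run_workload is a hypothetical stand-in for your benchmark entry point):
PerfTracer::global().enable();
for workload in ["oltp", "olap", "mixed"] {
    run_workload(workload);
    println!("=== {} ===\n{}", workload, PerfTracer::global().generate_report().to_text());
    PerfTracer::global().reset(); // free accumulated span data before the next run
}
PerfTracer::global().disable();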
API Reference¶
PerfTracer¶
// Get the global singleton
PerfTracer::global() -> &'static PerfTracer
// Lifecycle
tracer.enable() // Enable + reset + start session clock
tracer.disable() // Disable (spans become no-ops)
tracer.is_enabled() -> bool // Fast check (atomic load)
tracer.reset() // Clear all data without changing enabled state
// Span creation
tracer.begin_span(component, operation, attributes) -> PerfSpanGuard
// Reporting
tracer.generate_report() -> PerfReport
tracer.span_count() -> usize
PerfReport¶
report.to_text() -> String // Human-readable table
report.to_json() -> serde_json::Value // Machine-readable JSON
// Fields
report.session_duration: Duration
report.total_spans: u64
report.operations: Vec<OperationStats> // All ops, sorted by total time desc
report.component_summary: Vec<ComponentSummary> // Per-component aggregation
report.bottlenecks: Vec<Bottleneck> // Top 10 by total time
report.spans: Vec<CompletedSpan> // All individual spans (for analysis)
perf_span! Macro¶
// Simple span (no attributes)
let _span = perf_span!(Component::Storage, "operation_name");
// Span with key-value attributes
let _span = perf_span!(Component::Storage, "operation_name",
"key1" => value1,
"key2" => value2
);
// The guard (_span) MUST be bound to a variable.
// If not bound, it drops immediately and records zero duration.
// WRONG: perf_span!(Component::Storage, "op"); // drops immediately!
// RIGHT: let _span = perf_span!(Component::Storage, "op");
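Because the guard records on drop, you can also end a span before the enclosing scope does by dropping it explicitly (do_phase_one and do_untraced_work are placeholders):
let span = perf_span!(Component::Storage, "phase_one");
do_phase_one();
drop(span); // phase_one's duration is recorded here
do_untraced_work(); // runs outside the span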
Comparison with External Tracing Tools¶
| Feature | PerfTracer (built-in) | tracing crate | Jaeger/OpenTelemetry |
|---|---|---|---|
| Setup | Zero config | Subscriber setup | Agent + collector infrastructure |
| Overhead disabled | ~1ns (atomic load) | ~5ns (subscriber check) | N/A (not embedded) |
| Overhead enabled | ~100ns/span | ~200ns/span | ~1-5us/span (network export) |
| Percentile stats | Built-in (p50/p95/p99) | Requires aggregator | Requires backend query |
| Bottleneck ranking | Automatic | Manual analysis | Manual query |
| Thread safety | Lock-free (DashMap) | Layer-dependent | Async export |
| Distributed tracing | Span hierarchy only | Full distributed | Full distributed |
| Production use | Targeted debugging | Always-on possible | Always-on recommended |
Use PerfTracer for: Performance testing, benchmarking, targeted bottleneck analysis, CI regression checks.
Use tracing crate for: Always-on structured logging, distributed request tracing, log aggregation.
Use OpenTelemetry for: Multi-service distributed tracing, production observability, SLO monitoring.
PerfTracer is complementary to these tools — it provides instant, zero-dependency bottleneck analysis without any infrastructure setup.