Performance Tracing & Bottleneck Analysis¶
HeliosDB Full ships with a built-in, zero-cost-when-disabled performance tracing system that instruments every layer of the database — from SQL parsing through storage I/O. Use it to pinpoint bottlenecks during performance tests, diagnose slow queries, and generate production-grade reports without external tooling.
Key Differences from Nano/Lite¶
| Capability | Nano | Lite | Full |
|---|---|---|---|
| Granularity | Per-query phases | Per-query phases | Per-operation across 8 components |
| Span count | ~8 phase types | ~6 phase types | 57 instrumented operations |
| Overhead when disabled | Branch check per query | Option<QueryTrace> check | Single atomic load (AtomicBool) |
| Overhead when enabled | ~1-2us per query | ~1-2us per query | ~100ns per span (lock-free) |
| Percentiles | None | avg/p50/p95/p99 | min/avg/p50/p95/p99 per operation |
| Hierarchical spans | No | No | Yes (parent-child relationships) |
| Component breakdown | No | No | Yes (8 components, % of session) |
| Output formats | Structured log events | Text report | Text report + JSON + programmatic API |
| Thread safety | Single-threaded | Single-threaded | Lock-free DashMap + thread-local stacks |
| SQL interface | RUST_LOG env var | SET helios.trace_queries | Programmatic API (see below) |
Architecture¶
PerfTracer (global singleton)
├── enabled: AtomicBool        ← zero-cost gate
├── next_span_id: AtomicU64    ← lock-free ID generator
├── completed: DashMap         ← all finished spans
└── aggregated: DashMap        ← running per-op statistics
Thread 1               Thread 2               Thread 3
┌─────────────┐        ┌─────────────┐        ┌─────────────┐
│ SPAN_STACK  │        │ SPAN_STACK  │        │ SPAN_STACK  │
│ (thread-    │        │ (thread-    │        │ (thread-    │
│  local)     │        │  local)     │        │  local)     │
└──────┬──────┘        └──────┬──────┘        └──────┬──────┘
       │ drop guard           │ drop guard           │ drop guard
       └──────────────────────┴──────────────────────┘
                              │
                   ┌──────────▼──────────┐
                   │   DashMap shard N   │ ← lock-free aggregation
                   └─────────────────────┘
Zero-cost when disabled: Every perf_span! macro call starts with self.enabled.load(Ordering::Acquire). When tracing is off, this is a single CPU instruction (no branch misprediction after warmup) and the RAII guard stores span_id = 0, making Drop a no-op.
Lock-free when enabled: Each thread maintains its own span stack (thread_local!). On span completion, timing data is inserted into a DashMap (sharded concurrent hash map) — no global mutex, no contention between threads.
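In code, the pattern looks roughly like this (an illustrative sketch of the gate-and-guard idea, not the actual HeliosDB source; the names follow the architecture diagram above):
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Instant;
pub struct PerfSpanGuard {
    span_id: u64,           // 0 = tracing was disabled; Drop becomes a no-op
    start: Option<Instant>,
}
impl PerfSpanGuard {
    fn begin(enabled: &AtomicBool, next_id: &AtomicU64) -> PerfSpanGuard {
        // The single atomic load: the entire cost when tracing is disabled.
        if !enabled.load(Ordering::Acquire) {
            return PerfSpanGuard { span_id: 0, start: None };
        }
        PerfSpanGuard {
            span_id: next_id.fetch_add(1, Ordering::Relaxed),
            start: Some(Instant::now()),
        }
    }
}
impl Drop for PerfSpanGuard {
    fn drop(&mut self) {
        if self.span_id == 0 {
            return; // disabled path: nothing was recorded, nothing to do
        }
        let _elapsed = self.start.expect("set when enabled").elapsed();
        // Real implementation: pop the thread-local stack and record
        // (operation, elapsed) into the DashMap aggregation.
    }
}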
Quick Start¶
Rust API (Recommended)¶
use heliosdb_common::perf_tracer::{PerfTracer, Component};
use heliosdb_common::perf_span;
// 1. Enable tracing (resets previous data, starts session clock)
PerfTracer::global().enable();
// 2. Run your workload
// All instrumented operations automatically record timing spans.
// (execute queries, run benchmarks, etc.)
// 3. Generate the bottleneck report
let report = PerfTracer::global().generate_report();
// Human-readable text report
println!("{}", report.to_text());
// Machine-readable JSON (for dashboards, CI pipelines)
let json = report.to_json();
println!("{}", serde_json::to_string_pretty(&json).unwrap());
// 4. Disable tracing (returns to zero overhead)
PerfTracer::global().disable();
One-Liner for Quick Profiling¶
// Wrap any code block with enable/disable
PerfTracer::global().enable();
run_benchmark();
let report = PerfTracer::global().generate_report();
println!("{}", report.to_text());
PerfTracer::global().disable();
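If you profile frequently, this pattern is easy to wrap in a small helper (a hypothetical convenience function, not part of the PerfTracer API; it assumes PerfReport is exported alongside PerfTracer):
use heliosdb_common::perf_tracer::{PerfReport, PerfTracer};
// Trace a closure and hand back both its result and the perf report.
fn with_tracing<R>(workload: impl FnOnce() -> R) -> (R, PerfReport) {
    PerfTracer::global().enable();
    let result = workload();
    let report = PerfTracer::global().generate_report();
    PerfTracer::global().disable();
    (result, report)
}
// Usage: let (_, report) = with_tracing(run_benchmark);
//        println!("{}", report.to_text());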
Instrumented Operations¶
Every operation below is automatically traced when the PerfTracer is enabled. No code changes needed — the spans are already wired into the hot paths.
Protocol Layer (10 spans)¶
| Operation | Description | File |
|---|---|---|
| sql_parse | SQL text to AST | heliosdb-protocols/src/postgres/sql_parser.rs |
| handle_query | Full query lifecycle (top-level) | heliosdb-protocols/src/postgres/handler.rs |
| execute_select | SELECT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_insert | INSERT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_update | UPDATE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_delete | DELETE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_table | DDL CREATE TABLE | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_function | DDL CREATE FUNCTION | heliosdb-protocols/src/postgres/handler.rs |
| execute_call_procedure | Stored procedure call | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_trigger | DDL CREATE TRIGGER | heliosdb-protocols/src/postgres/handler.rs |
Compute Layer (9 spans)¶
| Operation | Description | File |
|---|---|---|
| execute_plan | Full query plan execution | heliosdb-compute/src/executor/query_executor.rs |
| table_scan | Sequential table scan | heliosdb-compute/src/executor/query_executor.rs |
| index_scan | Index-based lookup | heliosdb-compute/src/executor/query_executor.rs |
| filter | Predicate evaluation | heliosdb-compute/src/executor/query_executor.rs |
| project | Column projection | heliosdb-compute/src/executor/query_executor.rs |
| hash_join | Hash join build + probe | heliosdb-compute/src/executor/query_executor.rs |
| aggregate | GROUP BY / aggregation | heliosdb-compute/src/executor/query_executor.rs |
| sort | ORDER BY sort | heliosdb-compute/src/executor/query_executor.rs |
| window_function | Window function evaluation | heliosdb-compute/src/executor/query_executor.rs |
Optimizer Layer (2 spans)¶
| Operation | Description | File |
|---|---|---|
| optimize_query | Full optimization pass | heliosdb-compute/src/optimizer_legacy.rs |
| cost_estimation | Cost model evaluation | heliosdb-compute/src/optimizer_legacy.rs |
Parallel Execution (2 spans)¶
| Operation | Description | File |
|---|---|---|
| parallel_execute | Work-stealing parallel execution | heliosdb-compute/src/parallel_query.rs |
| partition_data | Data partitioning for parallelism | heliosdb-compute/src/parallel_query.rs |
Storage Layer (10 spans)¶
| Operation | Description | File |
|---|---|---|
| lsm_put | LSM tree write (memtable insert) | heliosdb-storage/src/lsm.rs |
| lsm_get | LSM tree point read | heliosdb-storage/src/lsm.rs |
| lsm_delete | LSM tree delete (tombstone write) | heliosdb-storage/src/lsm.rs |
| lsm_scan | LSM tree range scan (k-way merge) | heliosdb-storage/src/lsm.rs |
| wal_write | Write-ahead log entry | heliosdb-storage/src/lsm.rs |
| memtable_flush | Memtable flush to SSTable | heliosdb-storage/src/lsm.rs |
| compaction | SSTable level compaction | heliosdb-storage/src/lsm.rs |
| sstable_write | SSTable file creation | heliosdb-storage/src/sstable.rs |
| sstable_get | SSTable point lookup (bloom + index) | heliosdb-storage/src/sstable.rs |
| sstable_scan | SSTable range scan | heliosdb-storage/src/sstable.rs |
Cluster Layer (6 spans)¶
| Operation | Description | File |
|---|---|---|
| raft_append | Raft log append | heliosdb-cluster/src/consensus/mod.rs, log.rs |
| raft_election | Raft leader election | heliosdb-cluster/src/consensus/mod.rs, raft_node.rs |
| raft_commit | Raft entry commit | heliosdb-cluster/src/consensus/mod.rs |
| cache_fusion_lookup | Cache fusion remote lookup | heliosdb-cluster/src/cache_fusion.rs |
| cache_fusion_transfer | Cache fusion block transfer | heliosdb-cluster/src/cache_fusion.rs |
| distributed_scan | Distributed smart scan coordination | heliosdb-storage/src/distributed_smart_scan.rs |
Metadata Layer (2 operation types, 8 spans)¶
| Operation | Description | File |
|---|---|---|
| schema_lookup | Schema/catalog resolution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |
| ddl_execute | DDL operation execution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |
Understanding the Report¶
Sample Output¶
+======================================================================+
|                HeliosDB Performance Bottleneck Report                |
+======================================================================+
Session Duration: 12.45s | Total Spans: 45,230
-- Component Breakdown ----------------------------------------------------
Component        Total Time       Spans    % Sess
--------------------------------------------------------------
Storage               8.92s      22,100     71.6%
Compute               2.34s      11,500     18.8%
Protocol              0.89s       5,600      7.1%
Optimizer             0.21s       5,600      1.7%
Metadata              0.09s         430      0.7%
-- Top Bottlenecks (by total time) ----------------------------------------
Component   Operation        Total   Count      Avg      P99   % Sess
------------------------------------------------------------------------------------------
Storage     lsm_scan         5.12s    2800    1.8ms   12.3ms    41.1%
Storage     sstable_scan     2.45s    8400    291us    1.2ms    19.7%
Compute     hash_join        1.23s     450    2.7ms   15.1ms     9.9%
Compute     sort             0.67s     320    2.1ms    8.4ms     5.4%
Protocol    execute_select   0.56s    2800    200us    1.1ms     4.5%
Storage     lsm_put          0.45s    5000     90us    450us     3.6%
Storage     memtable_flush   0.38s      12   31.7ms   45.2ms     3.1%
Compute     aggregate        0.34s     280    1.2ms    4.5ms     2.7%
Optimizer   optimize_query   0.21s    5600     38us    120us     1.7%
Protocol    sql_parse        0.18s    5600     32us     95us     1.4%
-- All Operations ---------------------------------------------------------
Component   Operation        Count    Total      Min      Avg      P95      P99
---------------------------------------------------------------------------------------------------------
Storage     lsm_scan          2800    5.12s    120us    1.8ms    8.5ms   12.3ms
Storage     sstable_scan      8400    2.45s     15us    291us    850us    1.2ms
...
How to Read It¶
Component Breakdown shows where time is spent at the macro level. In the example above, 71.6% of traced time is in Storage — typical for I/O-bound workloads. If Optimizer or Protocol dominate, investigate complex queries or parsing overhead.
Top Bottlenecks ranks individual operations by total cumulative time. The most impactful optimization target is the #1 bottleneck. Key metrics:
- Total: Cumulative wall time across all invocations. The primary ranking metric.
- Count: Number of times this operation was called. High count with low avg = many small operations. Low count with high avg = few expensive operations.
- Avg: Average duration per invocation.
- P99: 99th percentile — the "worst case" that 1% of invocations exceed. If P99 >> Avg, you have tail latency outliers.
- % Sess: Percentage of the total session duration. This bounds the payoff of optimizing the operation (Amdahl's law): eliminating a 41.1% bottleneck entirely speeds up the session by at most 1 / (1 - 0.411) ≈ 1.7x.
All Operations provides the full statistical breakdown for every traced operation.
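For example, you can scan those statistics programmatically to flag the tail-latency outliers described above (a sketch using the same OperationStats fields as the drill-down example later on this page):
let report = PerfTracer::global().generate_report();
// Flag operations whose p99 is more than 10x their average duration.
for op in &report.operations {
    if op.count >= 100 && op.p99_duration > op.avg_duration * 10 {
        println!(
            "TAIL LATENCY: {} (avg {:?}, p99 {:?}, count {})",
            op.operation, op.avg_duration, op.p99_duration, op.count
        );
    }
}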
Performance Testing Workflow¶
1. Baseline Measurement¶
use heliosdb_common::perf_tracer::PerfTracer;
// Reset and enable
PerfTracer::global().enable();
// Run baseline workload
run_oltp_benchmark(1000); // e.g., 1000 transactions
// Capture baseline report
let baseline = PerfTracer::global().generate_report();
let baseline_json = baseline.to_json();
println!("=== BASELINE ===\n{}", baseline.to_text());
PerfTracer::global().disable();
2. Make Changes¶
3. Compare¶
PerfTracer::global().enable();
run_oltp_benchmark(1000); // same workload
let after = PerfTracer::global().generate_report();
println!("=== AFTER CHANGE ===\n{}", after.to_text());
PerfTracer::global().disable();
// Programmatic comparison (bind the JSON first so the borrow outlives the indexing)
let after_json = after.to_json();
let b = &baseline_json["bottlenecks"][0];
let a = &after_json["bottlenecks"][0];
println!(
    "Top bottleneck: {} -> {} (was {}us, now {}us)",
    b["operation"], a["operation"],
    b["avg_duration_us"], a["avg_duration_us"],
);
4. Drill Into Specific Operations¶
let report = PerfTracer::global().generate_report();
// Find all storage operations slower than 1ms average
for op in &report.operations {
if op.component == heliosdb_common::perf_tracer::Component::Storage
&& op.avg_duration > std::time::Duration::from_millis(1)
{
println!(
"SLOW: {} — avg {:?}, p99 {:?}, count {}",
op.operation, op.avg_duration, op.p99_duration, op.count
);
}
}
Adding Custom Spans¶
Instrument your own code with the perf_span! macro. Spans nest automatically via the thread-local stack.
use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;
fn process_batch(batch: &[Row]) {
// Simple span (operation name only)
let _span = perf_span!(Component::Compute, "process_batch");
for row in batch {
// Span with key-value attributes (for debugging, not aggregation)
let _row_span = perf_span!(
Component::Compute, "process_row",
"batch_size" => batch.len(),
"row_key" => row.key
);
// ... processing ...
// _row_span records timing on drop
}
// _span records timing on drop (includes all child spans)
}
Span Nesting¶
Spans automatically form parent-child relationships within each thread:
{
let _outer = perf_span!(Component::Protocol, "handle_query");
{
let _parse = perf_span!(Component::Protocol, "sql_parse");
// parse_span.parent_id == outer_span.span_id
}
{
let _optimize = perf_span!(Component::Optimizer, "optimize_query");
// optimize_span.parent_id == outer_span.span_id
}
{
let _exec = perf_span!(Component::Compute, "execute_plan");
{
let _scan = perf_span!(Component::Storage, "lsm_scan");
// scan_span.parent_id == exec_span.span_id
// scan_span.depth == 2
}
}
}
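Because each completed span carries its parent's ID, you can rebuild the call tree from report.spans after a run. A minimal sketch, assuming CompletedSpan exposes the span_id, parent_id (0 for root spans), and operation fields that the comments above suggest; adjust to the actual struct:
use std::collections::HashMap;
let report = PerfTracer::global().generate_report();
// Bucket spans by their parent's span ID.
let mut children: HashMap<u64, Vec<_>> = HashMap::new();
for span in &report.spans {
    children.entry(span.parent_id).or_default().push(span);
}
// Print each root span with its direct children.
for root in &report.spans {
    if root.parent_id == 0 {
        println!("{}", root.operation);
        for child in children.get(&root.span_id).into_iter().flatten() {
            println!("  └─ {}", child.operation);
        }
    }
}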
Component Selection¶
Choose the component that owns the operation:
| Component | Use For |
|---|---|
| Component::Storage | Anything touching disk, memtable, WAL, SSTables, compaction |
| Component::Compute | Query execution, joins, sorts, aggregation, expression evaluation |
| Component::Optimizer | Query planning, cost estimation, rule application |
| Component::Protocol | SQL parsing, wire protocol handling, result serialization |
| Component::Network | Connection management, transport, RPC |
| Component::Cluster | Raft consensus, replication, distributed coordination |
| Component::Cache | Block cache, query cache, tiered cache lookups |
| Component::Metadata | Schema catalog, DDL operations |
JSON Output for CI/Dashboards¶
The JSON report format is designed for automated analysis:
{
"session_duration_us": 12450000,
"total_spans": 45230,
"components": [
{
"component": "Storage",
"total_duration_us": 8920000,
"span_count": 22100,
"pct_of_session": 71.6
}
],
"bottlenecks": [
{
"component": "Storage",
"operation": "lsm_scan",
"total_duration_us": 5120000,
"count": 2800,
"avg_duration_us": 1828,
"p99_duration_us": 12300,
"pct_of_session": 41.1
}
],
"operations": [
{
"component": "Storage",
"operation": "lsm_scan",
"count": 2800,
"total_duration_us": 5120000,
"min_duration_us": 120,
"avg_duration_us": 1828,
"p50_duration_us": 1450,
"p95_duration_us": 8500,
"p99_duration_us": 12300
}
]
}
CI Integration Example¶
let report = PerfTracer::global().generate_report();
let json = report.to_json();
// Fail CI if any operation's p99 exceeds threshold
for op in json["operations"].as_array().unwrap() {
let p99 = op["p99_duration_us"].as_u64().unwrap();
let name = op["operation"].as_str().unwrap();
if p99 > 50_000 { // 50ms threshold
panic!("PERF REGRESSION: {} p99 = {}us (threshold: 50000us)", name, p99);
}
}
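To feed dashboards, the same JSON can be persisted as a build artifact for the CI job to archive (a minimal sketch; the output path is arbitrary):
use std::fs;
let report = PerfTracer::global().generate_report();
let json = report.to_json();
// Write the report somewhere your CI system collects artifacts from.
fs::write(
    "target/perf_report.json",
    serde_json::to_string_pretty(&json).unwrap(),
).expect("failed to write perf report");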
Interpreting Common Patterns¶
| Pattern | Likely Cause | Action |
|---|---|---|
| Storage > 70% of session | I/O-bound workload | Check lsm_scan and sstable_scan counts; add indexes; tune compaction |
| lsm_scan P99 >> Avg | Tail latency from scanning many SSTables | Run compaction; check level count |
| memtable_flush high avg | Large memtable or slow disk | Increase flush parallelism; check I/O bandwidth |
| compaction high total | Background compaction stealing I/O | Tune compaction rate limiter; schedule off-peak |
| hash_join dominates Compute | Large hash tables for joins | Check join order; add join indexes; increase memory |
| sort high total | Many ORDER BY without index | Add covering indexes for sorted access patterns |
| optimize_query > 10% session | Complex queries trigger expensive planning | Simplify queries; use prepared statements |
| sql_parse high total | Very long SQL strings or many distinct queries | Use parameterized queries; batch operations |
| wal_write P99 high | WAL fsync latency spikes | Check disk IOPS; enable group commit (DirectIoWal) |
| raft_append high in cluster | Consensus latency between nodes | Check network latency; reduce replication factor for reads |
| cache_fusion_transfer high | Remote cache misses | Co-locate hot data; increase local cache size |
| Optimizer cost low but Compute high | Good plans but slow execution | Focus on storage I/O and compute optimization |
| Protocol > 20% session | Parsing or serialization overhead | Use prepared statements; reduce result set size |
Overhead Characteristics¶
Disabled (Default) — Zero Cost¶
perf_span!() call:
1. AtomicBool::load(Acquire)        ~1ns   (single CPU instruction)
2. Branch: span_id == 0 → skip      ~0ns   (predicted after warmup)
3. Drop: span_id == 0 → no-op       ~0ns
Total overhead per span: ~1ns (unmeasurable in practice)
The compiler inlines the #[inline(always)] is_enabled() check. After branch predictor warmup, the disabled path adds zero measurable overhead — verified by benchmarking with and without instrumentation.
Enabled — Low Overhead¶
perf_span!() call:
1. AtomicU64::fetch_add(Relaxed)    ~5ns   (span ID)
2. Instant::now()                   ~25ns  (clock read)
3. thread_local push                ~10ns  (stack push)
Drop (span completion):
4. Instant::elapsed()               ~25ns  (clock read)
5. thread_local pop                 ~10ns  (stack pop)
6. DashMap::entry().record()        ~30ns  (lock-free insert)
Total overhead per span: ~100ns when enabled
For a workload producing 10,000 spans/second, that's ~1ms of tracing overhead per second (0.1%). For 100,000 spans/second, ~10ms/second (1%). This is acceptable for staging and targeted production debugging.
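You can sanity-check these numbers on your own hardware with a crude micro-benchmark (a sketch only; it ignores loop overhead, so treat the result as an upper bound and use a proper benchmarking harness for real measurements):
use std::time::Instant;
use heliosdb_common::perf_tracer::{Component, PerfTracer};
use heliosdb_common::perf_span;
const N: u32 = 1_000_000;
PerfTracer::global().enable();
let start = Instant::now();
for _ in 0..N {
    let _span = perf_span!(Component::Compute, "overhead_probe");
}
PerfTracer::global().disable();
println!("~{:?} per span while enabled", start.elapsed() / N);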
Memory Usage¶
- Per completed span: ~200 bytes (component, operation string, duration, attributes, IDs)
- Per aggregated operation: ~500 bytes + 8 bytes per recorded duration
- 10,000 spans session: ~2MB memory
- 100,000 spans session: ~20MB memory
Call PerfTracer::global().reset() between test runs to free accumulated data.
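For example, when running several workloads in one process (run_workload is a hypothetical stand-in for your benchmark entry point):
PerfTracer::global().enable();
for workload in ["oltp", "olap", "mixed"] {
    run_workload(workload);
    println!("=== {} ===\n{}", workload, PerfTracer::global().generate_report().to_text());
    PerfTracer::global().reset(); // free accumulated span data before the next run
}
PerfTracer::global().disable();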
API Reference¶
PerfTracer¶
// Get the global singleton
PerfTracer::global() -> &'static PerfTracer
// Lifecycle
tracer.enable() // Enable + reset + start session clock
tracer.disable() // Disable (spans become no-ops)
tracer.is_enabled() -> bool // Fast check (atomic load)
tracer.reset() // Clear all data without changing enabled state
// Span creation
tracer.begin_span(component, operation, attributes) -> PerfSpanGuard
// Reporting
tracer.generate_report() -> PerfReport
tracer.span_count() -> usize
PerfReport¶
report.to_text() -> String // Human-readable table
report.to_json() -> serde_json::Value // Machine-readable JSON
// Fields
report.session_duration: Duration
report.total_spans: u64
report.operations: Vec<OperationStats> // All ops, sorted by total time desc
report.component_summary: Vec<ComponentSummary> // Per-component aggregation
report.bottlenecks: Vec<Bottleneck> // Top 10 by total time
report.spans: Vec<CompletedSpan> // All individual spans (for analysis)
perf_span! Macro¶
// Simple span (no attributes)
let _span = perf_span!(Component::Storage, "operation_name");
// Span with key-value attributes
let _span = perf_span!(Component::Storage, "operation_name",
"key1" => value1,
"key2" => value2
);
// The guard (_span) MUST be bound to a variable.
// If not bound, it drops immediately and records zero duration.
// WRONG: perf_span!(Component::Storage, "op"); // drops immediately!
// RIGHT: let _span = perf_span!(Component::Storage, "op");
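Because the guard records on drop, you can also end a span before the enclosing scope does by dropping it explicitly (do_phase_one and do_untraced_work are placeholders):
let span = perf_span!(Component::Storage, "phase_one");
do_phase_one();
drop(span); // phase_one's duration is recorded here
do_untraced_work(); // runs outside the span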
Comparison with External Tracing Tools¶
| Feature | PerfTracer (built-in) | tracing crate | Jaeger/OpenTelemetry |
|---|---|---|---|
| Setup | Zero config | Subscriber setup | Agent + collector infrastructure |
| Overhead disabled | ~1ns (atomic load) | ~5ns (subscriber check) | N/A (not embedded) |
| Overhead enabled | ~100ns/span | ~200ns/span | ~1-5us/span (network export) |
| Percentile stats | Built-in (p50/p95/p99) | Requires aggregator | Requires backend query |
| Bottleneck ranking | Automatic | Manual analysis | Manual query |
| Thread safety | Lock-free (DashMap) | Layer-dependent | Async export |
| Distributed tracing | Span hierarchy only | Full distributed | Full distributed |
| Production use | Targeted debugging | Always-on possible | Always-on recommended |
Use PerfTracer for: Performance testing, benchmarking, targeted bottleneck analysis, CI regression checks.
Use tracing crate for: Always-on structured logging, distributed request tracing, log aggregation.
Use OpenTelemetry for: Multi-service distributed tracing, production observability, SLO monitoring.
PerfTracer is complementary to these tools — it provides instant, zero-dependency bottleneck analysis without any infrastructure setup.