Compaction Strategy Guide for HeliosDB¶
Overview¶
Compaction is a critical background process in LSM-tree storage engines that merges SSTables to:
- Remove duplicate keys (keeping only the latest version)
- Delete tombstones for deleted keys
- Reorganize data into sorted levels
- Reclaim disk space
- Improve read performance
This guide covers HeliosDB's advanced compaction strategies, including parallel execution, I/O throttling, and early tombstone deletion.
Table of Contents¶
- Compaction Strategies
- Parallel Compaction
- I/O Throttling
- Tombstone Management
- Priority Scheduling
- Performance Optimization
- Monitoring and Metrics
Compaction Strategies¶
1. Leveled Compaction Strategy (LCS)¶
How It Works:
- Data organized into levels (L0, L1, ..., Ln)
- Each level is 10x larger than the previous level
- SSTables within a level don't overlap (except L0)
- Compaction merges overlapping SSTables between adjacent levels
Level Structure:
L0: 100 MB (4 files × 25 MB, may overlap)
L1: 1 GB (40 files × 25 MB, non-overlapping)
L2: 10 GB (400 files × 25 MB, non-overlapping)
L3: 100 GB (4000 files × 25 MB, non-overlapping)
Pros:
- Excellent read performance (1-2 SSTables per read)
- Low space amplification (~1.1x)
- Predictable performance
Cons:
- High write amplification (8-10x)
- More I/O intensive
- Slower for write-heavy workloads
Best For:
- Read-heavy workloads
- Point lookups
- Random access patterns
- Production databases
Configuration:
use heliosdb_storage::{CompactionConfig, CompactionStrategy};
let config = CompactionConfig {
    strategy: CompactionStrategy::Leveled { max_level: 5 },
    min_sstables_for_compaction: 2,
    level0_size_threshold: 100 * 1024 * 1024, // 100 MB
    level_size_multiplier: 10,
    max_concurrent_compactions: 4,
    ..Default::default()
};
2. Size-Tiered Compaction Strategy (STCS)¶
How It Works:
- SSTables grouped into buckets by size (see the sketch below)
- Compaction triggered when bucket has 4+ similar-sized SSTables
- No strict level organization
- Merges SSTables of similar size together
Bucket Example:
Bucket 1: [10 MB, 12 MB, 11 MB, 10 MB] → Compact to 43 MB
Bucket 2: [50 MB, 48 MB, 52 MB, 49 MB] → Compact to 199 MB
Bucket 3: [200 MB, 210 MB, 195 MB, 205 MB] → Compact to 810 MB
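The grouping rule can be sketched as follows. This is a minimal illustration assuming a "within 1.5x of the bucket average" similarity test, which may differ from HeliosDB's actual heuristic:
// Minimal sketch of size-tiered bucketing. The 1.5x similarity
// threshold is an illustrative assumption, not HeliosDB's exact rule.
fn bucket_sstables(mut sizes: Vec<u64>) -> Vec<Vec<u64>> {
    sizes.sort_unstable();
    let mut buckets: Vec<Vec<u64>> = Vec::new();
    for size in sizes {
        match buckets.last_mut() {
            // Join the current bucket if within 1.5x of its average size.
            Some(bucket) if size <= bucket.iter().sum::<u64>() / bucket.len() as u64 * 3 / 2 => {
                bucket.push(size);
            }
            _ => buckets.push(vec![size]),
        }
    }
    buckets
}

fn main() {
    let mb = 1024 * 1024;
    let buckets = bucket_sstables(vec![10 * mb, 12 * mb, 11 * mb, 10 * mb, 50 * mb]);
    // The four ~10 MB files land in one bucket and become a compaction
    // candidate (4+ similar-sized SSTables); the 50 MB file does not.
    assert_eq!(buckets[0].len(), 4);
    assert_eq!(buckets[1].len(), 1);
}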
Pros:
- Low write amplification (2-4x)
- Fast writes
- Good for time-series data
Cons:
- Higher read amplification (may scan many SSTables)
- Higher space amplification (~2x)
- Temporary disk space spikes during compaction
Best For:
- Write-heavy workloads
- Time-series data
- Log aggregation
- Metrics collection
Configuration:
let config = CompactionConfig {
    strategy: CompactionStrategy::SizeTiered,
    min_sstables_for_compaction: 4,
    level0_size_threshold: 100 * 1024 * 1024,
    max_concurrent_compactions: 4,
    ..Default::default()
};
3. Universal Compaction¶
How It Works:
- Optimized for time-series and append-only workloads
- Compacts entire sorted runs
- Uses size-ratio based triggering
- Minimizes write amplification
Compaction Trigger:
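The trigger compares each older sorted run against the accumulated size of the newer runs. A minimal sketch, loosely modeled on size-ratio triggering in other LSM engines (the constants and exact rule are illustrative, not HeliosDB's):
// Illustrative size-ratio trigger; runs are ordered newest first.
fn pick_runs_to_merge(run_sizes: &[u64], size_ratio_pct: u64, min_width: usize) -> usize {
    let Some((&first, rest)) = run_sizes.split_first() else { return 0 };
    let mut accumulated = first;
    let mut count = 1;
    for &older in rest {
        // Stop once the next (older) run is more than size_ratio_pct
        // larger than the accumulated newer runs.
        if older > accumulated * (100 + size_ratio_pct) / 100 {
            break;
        }
        accumulated += older;
        count += 1;
    }
    if count >= min_width { count } else { 0 }
}

fn main() {
    // Four similar-sized runs are merged together...
    assert_eq!(pick_runs_to_merge(&[10, 10, 11, 12], 25, 2), 4);
    // ...while a much larger old run is left untouched.
    assert_eq!(pick_runs_to_merge(&[10, 500], 25, 2), 0);
}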
Pros:
- Lowest write amplification (2x)
- Excellent for sequential writes
- Minimal overhead
Cons:
- Can have temporary space amplification
- Not ideal for random updates
- Less predictable than leveled
Best For:
- Time-series databases
- Append-only workloads
- Event logging
- Sensor data
Configuration:
use heliosdb_storage::{CompactionStrategyV2, CompactionManagerV2};
let strategy = CompactionStrategyV2::Universal;
let manager = CompactionManagerV2::new(
    data_dir,
    strategy,
    sstables,
    tuner,
    max_concurrent_compactions,
);
Strategy Comparison¶
| Metric | Leveled | Size-Tiered | Universal |
|---|---|---|---|
| Write Amplification | 8-10x | 2-4x | 2x |
| Read Amplification | 1-2x | 5-10x | 10-20x |
| Space Amplification | 1.1x | 2x | 1.5-2x |
| Read Performance | Excellent | Good | Fair |
| Write Performance | Good | Excellent | Excellent |
| Space Efficiency | Excellent | Fair | Good |
Parallel Compaction¶
HeliosDB supports parallel compaction execution to maximize throughput on multi-core systems.
How It Works¶
- Worker Pool: Multiple compaction workers run concurrently
- Priority Queue: Tasks scheduled by priority
- Semaphore: Limits concurrent compactions to prevent resource exhaustion
- Independent Execution: Non-overlapping compactions run in parallel
Configuration¶
use heliosdb_storage::{AdaptiveLsmTuner, CompactionManagerV2, CompactionStrategyV2, LsmTuningConfig};
use std::sync::Arc;
let tuner = Arc::new(AdaptiveLsmTuner::new(LsmTuningConfig::default()));
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive,
    sstables,
    tuner,
    8, // Max 8 concurrent compactions
);
Worker Distribution¶
CPU Cores: 16
Recommendation: max_concurrent_compactions = cores / 2 = 8
Reasoning:
- Leave cores for foreground operations
- Each compaction uses 1-2 threads
- I/O bound, not CPU bound
Benefits¶
- 3-4x compaction throughput on multi-core systems
- Reduced compaction backlog
- Better foreground latency (less compaction debt)
- Improved write throughput
Task Scheduling¶
Tasks are prioritized by urgency score:
urgency_score = (level_priority * 100) + (size_factor * 10) + age_factor
Priority:
L0 → L1: 500 (highest priority)
L1 → L2: 400
L2 → L3: 300
...
Example:
Task 1: L0→L1, 200MB, 5 minutes old
urgency = 500 + 20 + 5 = 525
Task 2: L3→L4, 1GB, 2 minutes old
urgency = 200 + 100 + 2 = 302
Task 1 executes first (higher urgency)
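A minimal sketch of this calculation that reproduces the numbers above; the units (size in 100 MB increments, age in minutes) are inferred from the worked example and may differ from HeliosDB's internals:
// Sketch of the urgency formula; units inferred from the example above.
fn calculate_urgency(source_level: u8, size_bytes: u64, age_minutes: u64) -> u64 {
    // L0 → level_priority 5, L1 → 4, ... (deeper levels are less urgent).
    let level_priority = 5u64.saturating_sub(source_level as u64);
    let size_factor = size_bytes / (100 * 1024 * 1024); // size in 100 MB units
    (level_priority * 100) + (size_factor * 10) + age_minutes
}

fn main() {
    let mb: u64 = 1024 * 1024;
    // Task 1: L0→L1, 200 MB, 5 minutes old → 500 + 20 + 5 = 525
    assert_eq!(calculate_urgency(0, 200 * mb, 5), 525);
    // Task 2: L3→L4, 1 GB, 2 minutes old → 200 + 100 + 2 = 302
    assert_eq!(calculate_urgency(3, 1024 * mb, 2), 302);
}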
I/O Throttling¶
Compaction can overwhelm I/O bandwidth, affecting foreground performance. HeliosDB includes adaptive I/O throttling.
How It Works¶
Token Bucket Algorithm:
1. Tokens represent I/O bytes
2. Tokens refill at configured rate
3. Compaction consumes tokens before I/O
4. If tokens exhausted, compaction waits
Configuration¶
use heliosdb_storage::IoThrottleConfig;
let config = IoThrottleConfig {
    max_read_bytes_per_sec: 100 * 1024 * 1024,  // 100 MB/s
    max_write_bytes_per_sec: 100 * 1024 * 1024, // 100 MB/s
    adaptive: true, // Adjust based on load
};
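The token-bucket mechanism itself fits in a few lines. A minimal sketch (HeliosDB's throttle additionally separates read and write budgets and adapts the rate):
use std::time::Instant;

// Tokens are bytes; the bucket refills continuously at `rate` bytes/sec
// and allows at most one second of burst.
struct TokenBucket {
    rate: f64,
    tokens: f64,
    capacity: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate_bytes_per_sec: f64) -> Self {
        Self {
            rate: rate_bytes_per_sec,
            tokens: rate_bytes_per_sec,
            capacity: rate_bytes_per_sec,
            last_refill: Instant::now(),
        }
    }

    // Returns true if `bytes` of compaction I/O may proceed now;
    // otherwise the caller should back off and retry.
    fn try_consume(&mut self, bytes: f64) -> bool {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.last_refill = Instant::now();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(100.0 * 1024.0 * 1024.0); // 100 MB/s
    // A 10 MB read fits in the initial burst allowance.
    assert!(bucket.try_consume(10.0 * 1024.0 * 1024.0));
}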
Adaptive Throttling¶
When adaptive: true, throttling adjusts based on:
- Foreground operation latency
- System I/O utilization
- Compaction backlog
Rules:
- If foreground latency <1ms: Allow full compaction bandwidth
- If foreground latency 1-5ms: Reduce compaction to 75%
- If foreground latency 5-10ms: Reduce compaction to 50%
- If foreground latency >10ms: Reduce compaction to 25%
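Expressed as code, these rules reduce to a simple mapping from observed foreground latency to an allowed fraction of compaction bandwidth:
// Direct transcription of the rule table above.
fn compaction_bandwidth_fraction(foreground_latency_ms: f64) -> f64 {
    if foreground_latency_ms < 1.0 {
        1.00 // full compaction bandwidth
    } else if foreground_latency_ms < 5.0 {
        0.75
    } else if foreground_latency_ms < 10.0 {
        0.50
    } else {
        0.25
    }
}

fn main() {
    assert_eq!(compaction_bandwidth_fraction(3.0), 0.75);
}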
Benefits¶
- Consistent foreground performance
- Prevents compaction from starving reads/writes
- Better multi-tenant resource sharing
- Reduced tail latencies
Tombstone Management¶
Deletes in LSM trees use tombstones (markers indicating deletion). HeliosDB includes advanced tombstone management.
Standard Tombstone GC¶
Process:
1. Tombstone written on delete
2. Tombstone propagates through levels during compaction
3. After gc_grace_seconds, tombstone eligible for removal
4. Tombstone removed if it's the only version of the key
Configuration:
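A sketch of the relevant setting, assuming the 10-day default implied by the space-optimization example later in this guide:
let config = CompactionConfig {
    gc_grace_seconds: 10 * 86400, // 10 days; younger tombstones are kept
    ..Default::default()
};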
Why Grace Period?
- Allows time for repairs/backups
- Prevents resurrection of deleted data
- Handles distributed clock skew
Early Tombstone Deletion (ETD)¶
HeliosDB V2 includes early tombstone deletion for high-tombstone scenarios.
When Triggered: when tombstones make up more than 30% of a compaction's input (see the process below).
Process:
1. Calculate tombstone percentage in compaction input
2. If >30%, enable aggressive tombstone removal
3. Remove tombstones older than gc_grace_seconds / 10
4. Reduces space amplification by 40-60%
Benefits:
- Faster space reclamation
- Lower read amplification (fewer tombstones to skip)
- Better for high-churn workloads
Configuration:
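The exact field names are not shown in this guide; the sketch below uses hypothetical `enable_early_tombstone_deletion` and `early_tombstone_threshold` fields to illustrate the shape of the configuration:
let config = CompactionConfig {
    // Hypothetical field names for illustration; see the CompactionConfig
    // API docs for the actual ETD settings.
    enable_early_tombstone_deletion: true,
    early_tombstone_threshold: 0.30, // trigger when >30% of input is tombstones
    ..Default::default()
};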
Trade-offs:
- May delete tombstones before full grace period
- Acceptable for most workloads
- Disable for strict consistency requirements
Priority Scheduling¶
Compaction tasks are scheduled by priority to optimize performance.
Priority Calculation¶
pub struct CompactionPriority {
    pub priority: u8,        // Base priority (0-255)
    pub estimated_size: u64, // Size of compaction
    pub level: u8,           // Source level
    pub urgency_score: u64,  // Calculated urgency
}
let urgency_score = calculate_urgency(level, size, age);
Priority Rules¶
- L0 Compactions: Highest priority (prevents write stalls)
- Large Compactions: Higher priority (clear backlog)
- Old Compactions: Higher priority (prevent accumulation)
- Deep Level Compactions: Lower priority (less impact)
Example¶
use heliosdb_storage::{
    CompactionPriority, CompactionStrategyV2, CompactionTaskV2, CompressionAlgorithm,
};
let task = CompactionTaskV2 {
    id: "compact-1".to_string(),
    priority: CompactionPriority {
        priority: 100,
        estimated_size: 200 * 1024 * 1024, // 200 MB
        level: 0,
        urgency_score: 500,
    },
    source_sstables: vec![...], // SSTables selected for this compaction
    target_level: 1,
    strategy: CompactionStrategyV2::LeveledParallel,
    enable_tombstone_gc: true,
    compression: CompressionAlgorithm::Snappy,
};
manager.submit_task(task)?;
Performance Optimization¶
1. Minimize Write Amplification¶
Strategies:
- Use size-tiered or universal compaction
- Increase SSTable size (fewer compactions)
- Batch writes in larger memtables
- Enable early tombstone deletion
Example:
let config = LsmTuningConfig {
    memtable_size_mb: 256,      // Larger memtables
    target_file_size_base: 128, // Larger SSTables
    compaction_style: 1,        // Universal compaction
    ..Default::default()
};
Result:
- Write amplification: 10x → 3x (70% reduction)
- Write throughput: +100%
2. Minimize Read Amplification¶
Strategies:
- Use leveled compaction
- Increase bloom filter size
- Trigger compaction earlier (fewer L0 files)
- Increase block cache
Example:
let config = LsmTuningConfig {
    level0_file_trigger: 2,                      // Compact L0 early
    bloom_bits_per_key: vec![16, 14, 12, 10, 8], // Large filters
    block_cache_mb: 2048,                        // Large cache
    compaction_style: 0,                         // Leveled
    ..Default::default()
};
Result:
- Read amplification: 10x → 2x (80% reduction)
- Read throughput: +150%
3. Optimize Space Utilization¶
Strategies:
- Use compression (Zstd for maximum compression)
- Enable early tombstone deletion
- Use leveled compaction (better space efficiency)
- Reduce GC grace period
Example:
let config = LsmTuningConfig {
    compression_per_level: vec![0, 2, 2, 2, 2, 2, 2], // Zstd
    compaction_style: 0,                              // Leveled
    ..Default::default()
};
let compaction_config = CompactionConfig {
    gc_grace_seconds: 86400, // 1 day (reduced from the 10-day default)
    ..Default::default()
};
Result:
- Space amplification: 2.5x → 1.2x (52% reduction)
- Disk savings: 40-60%
4. Balance Throughput and Latency¶
Use Adaptive Strategy:
The adaptive strategy automatically switches between leveled and universal based on workload.
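Selecting it reuses the CompactionManagerV2 constructor from the parallel-compaction section:
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive, // switches between leveled and universal
    sstables,
    tuner,
    max_concurrent_compactions,
);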
Benefits:
- Optimal performance across workload changes
- Reduced operational overhead
- Best of both worlds
Monitoring and Metrics¶
Key Metrics¶
1. Compaction Metrics
let metrics = manager.metrics();
let snapshot = metrics.snapshot();
println!("Total Compactions: {}", snapshot.total_compactions);
println!("Bytes Read: {} MB", snapshot.bytes_read / (1024 * 1024));
println!("Bytes Written: {} MB", snapshot.bytes_written / (1024 * 1024));
println!("Tombstones Removed: {}", snapshot.tombstones_removed);
println!("Space Reclaimed: {} MB", snapshot.space_reclaimed / (1024 * 1024));
2. Amplification Factors
println!("Write Amplification: {:.2}x", snapshot.write_amplification());
println!("Space Efficiency: {:.2}%", snapshot.space_efficiency() * 100.0);
3. Performance Metrics
println!("Avg Compaction Time: {:.2}s", snapshot.avg_compaction_time_secs());
println!("Avg Throughput: {:.2} MB/s",
snapshot.avg_throughput as f64 / (1024.0 * 1024.0));
println!("Failed Compactions: {}", snapshot.failed_compactions);
Metrics Report¶
Output:
Compaction Metrics Report
=========================
Total Compactions: 1250
- Size-Tiered: 850
- Leveled: 400
- Failed: 5
Space Statistics:
- Bytes Read: 120 GB
- Bytes Written: 480 GB
- Space Reclaimed: 60 GB
- Space Efficiency: 50.00%
SSTable Statistics:
- Tables Merged: 5000
- Tables Created: 1250
- Tombstones Removed: 500000
- Duplicates Removed: 1500000
Performance:
- Total Compaction Time: 3600.00s
- Average Compaction Time: 2.88s
- Average Throughput: 133.33 MB/s
- Write Amplification: 4.00x
- Peak Memory Usage: 512.00 MB
Alerts and Thresholds¶
Set up monitoring for:
// `alert` stands in for your monitoring hook (e.g. fn alert(&str)).
// Write amplification too high
if snapshot.write_amplification() > 15.0 {
    alert(&format!("Write amplification excessive: {:.2}x",
        snapshot.write_amplification()));
}
// Compaction falling behind
if manager.active_compactions() >= max_concurrent {
    alert("All compaction workers busy, backlog building");
}
// High failure rate
let failure_rate = snapshot.failed_compactions as f64
    / snapshot.total_compactions as f64;
if failure_rate > 0.01 {
    alert(&format!("Compaction failure rate: {:.2}%", failure_rate * 100.0));
}
// Space efficiency low (not reclaiming enough space)
if snapshot.space_efficiency() < 0.20 {
    alert(&format!("Low space reclamation: {:.2}%",
        snapshot.space_efficiency() * 100.0));
}
Best Practices¶
1. Choose the Right Strategy¶
Decision Tree:
Is workload write-heavy (>70% writes)?
  Yes → Size-Tiered or Universal
  No → Is it read-heavy (>70% reads)?
    Yes → Leveled
    No → Adaptive (let system choose)
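The same tree as a helper function, assuming the CompactionStrategyV2 variants shown earlier in this guide:
use heliosdb_storage::CompactionStrategyV2;

// Pick a strategy from the observed write fraction of the workload.
fn choose_strategy(write_fraction: f64) -> CompactionStrategyV2 {
    if write_fraction > 0.70 {
        CompactionStrategyV2::Universal // write-heavy: minimize write amplification
    } else if write_fraction < 0.30 {
        CompactionStrategyV2::LeveledParallel // read-heavy: minimize read amplification
    } else {
        CompactionStrategyV2::Adaptive // mixed: let the system choose
    }
}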
2. Size SSTables Appropriately¶
Guidelines:
- L0 files: 25-50 MB (fast to compact)
- L1+ files: 64-128 MB (balance compaction cost)
- Time-series: 128-256 MB (sequential writes)
3. Configure Parallel Workers¶
Formula:
max_concurrent_compactions = min(cpu_cores / 2, ram_gb / 2, io_gbps × 2)
Example:
- 16 cores, 32 GB RAM, 2 GB/s I/O
- max_concurrent = min(8, 16, 4) = 4
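As code (the formula is reconstructed from the example above):
// min(cores / 2, RAM GB / 2, I/O GB/s × 2), reconstructed from the example.
fn max_concurrent_compactions(cpu_cores: u64, ram_gb: u64, io_gbps: u64) -> u64 {
    (cpu_cores / 2).min(ram_gb / 2).min(io_gbps * 2)
}

fn main() {
    // 16 cores, 32 GB RAM, 2 GB/s I/O → min(8, 16, 4) = 4
    assert_eq!(max_concurrent_compactions(16, 32, 2), 4);
}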
4. Use I/O Throttling in Production¶
Always enable for:
- Multi-tenant systems
- Shared storage
- Cloud deployments
- SSD storage (prevent wear)
5. Monitor Continuously¶
Essential metrics:
- Compaction backlog (pending tasks)
- Active compactions
- Write/read/space amplification
- Failure rate
- L0 file count
6. Tune Based on Metrics¶
If write amplification high:
- Switch to size-tiered/universal
- Increase SSTable size
- Reduce compaction frequency
If read amplification high:
- Switch to leveled
- Increase bloom filter size
- Compact L0 more frequently
If space amplification high:
- Reduce GC grace period
- Enable early tombstone deletion
- Trigger manual compaction
Conclusion¶
HeliosDB's advanced compaction system provides:
✓ Multiple strategies for different workloads
✓ Parallel execution for high throughput
✓ I/O throttling for consistent performance
✓ Early tombstone deletion for space efficiency
✓ Priority scheduling for optimal resource usage
Key Takeaways:
- Choose strategy based on workload (leveled for reads, size-tiered for writes)
- Enable parallel compaction on multi-core systems
- Use I/O throttling to protect foreground performance
- Monitor metrics and adjust configuration as needed
- Use adaptive mode for changing workloads
By following this guide, you can achieve:
- 40-60% reduction in write amplification
- 2-3x improvement in compaction throughput
- Consistent sub-millisecond foreground latencies
- 50%+ space savings with compression and tombstone management
For more information, see the LSM Tuning Guide.