Compaction Strategy Guide for HeliosDB¶
Overview¶
Compaction is a critical background process in LSM-tree storage engines that merges SSTables to:
- Remove duplicate keys (keeping only the latest version)
- Delete tombstones for deleted keys
- Reorganize data into sorted levels
- Reclaim disk space
- Improve read performance
This guide covers HeliosDB's advanced compaction strategies, including parallel execution, I/O throttling, and early tombstone deletion.
Table of Contents¶
- Compaction Strategies
- Parallel Compaction
- I/O Throttling
- Tombstone Management
- Priority Scheduling
- Performance Optimization
- Monitoring and Metrics
Compaction Strategies¶
1. Leveled Compaction Strategy (LCS)¶
How It Works:
- Data organized into levels (L0, L1, ..., Ln)
- Each level is 10x larger than the previous level
- SSTables within a level don't overlap (except L0)
- Compaction merges overlapping SSTables between adjacent levels
Level Structure:
L0: 100 MB (4 files × 25 MB, may overlap)
L1: 1 GB (40 files × 25 MB, non-overlapping)
L2: 10 GB (400 files × 25 MB, non-overlapping)
L3: 100 GB (4000 files × 25 MB, non-overlapping)
Pros:
- Excellent read performance (1-2 SSTables per read)
- Low space amplification (~1.1x)
- Predictable performance
Cons:
- High write amplification (8-10x)
- More I/O intensive
- Slower for write-heavy workloads
Best For:
- Read-heavy workloads
- Point lookups
- Random access patterns
- Production databases
Configuration:
use heliosdb_storage::{CompactionConfig, CompactionStrategy};
let config = CompactionConfig {
    strategy: CompactionStrategy::Leveled { max_level: 5 },
    min_sstables_for_compaction: 2,
    level0_size_threshold: 100 * 1024 * 1024, // 100 MB
    level_size_multiplier: 10,
    max_concurrent_compactions: 4,
    ..Default::default()
};
2. Size-Tiered Compaction Strategy (STCS)¶
How It Works:
- SSTables grouped into buckets by size (see the sketch below)
- Compaction triggered when bucket has 4+ similar-sized SSTables
- No strict level organization
- Merges SSTables of similar size together
Bucket Example:
Bucket 1: [10 MB, 12 MB, 11 MB, 10 MB] → Compact to 43 MB
Bucket 2: [50 MB, 48 MB, 52 MB, 49 MB] → Compact to 199 MB
Bucket 3: [200 MB, 210 MB, 195 MB, 205 MB] → Compact to 810 MB
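The grouping rule can be sketched as follows. This is a minimal illustration assuming a "within 1.5x of the bucket average" similarity test, which may differ from HeliosDB's actual heuristic:
// Minimal sketch of size-tiered bucketing. The 1.5x similarity
// threshold is an illustrative assumption, not HeliosDB's exact rule.
fn bucket_sstables(mut sizes: Vec<u64>) -> Vec<Vec<u64>> {
    sizes.sort_unstable();
    let mut buckets: Vec<Vec<u64>> = Vec::new();
    for size in sizes {
        match buckets.last_mut() {
            // Join the current bucket if within 1.5x of its average size.
            Some(bucket) if size <= bucket.iter().sum::<u64>() / bucket.len() as u64 * 3 / 2 => {
                bucket.push(size);
            }
            _ => buckets.push(vec![size]),
        }
    }
    buckets
}

fn main() {
    let mb = 1024 * 1024;
    let buckets = bucket_sstables(vec![10 * mb, 12 * mb, 11 * mb, 10 * mb, 50 * mb]);
    // The four ~10 MB files land in one bucket and become a compaction
    // candidate (4+ similar-sized SSTables); the 50 MB file does not.
    assert_eq!(buckets[0].len(), 4);
    assert_eq!(buckets[1].len(), 1);
}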
Pros:
- Low write amplification (2-4x)
- Fast writes
- Good for time-series data
Cons:
- Higher read amplification (may scan many SSTables)
- Higher space amplification (~2x)
- Temporary disk space spikes during compaction
Best For:
- Write-heavy workloads
- Time-series data
- Log aggregation
- Metrics collection
Configuration:
let config = CompactionConfig {
    strategy: CompactionStrategy::SizeTiered,
    min_sstables_for_compaction: 4,
    level0_size_threshold: 100 * 1024 * 1024,
    max_concurrent_compactions: 4,
    ..Default::default()
};
3. Universal Compaction¶
How It Works:
- Optimized for time-series and append-only workloads
- Compacts entire sorted runs
- Uses size-ratio based triggering
- Minimizes write amplification
Compaction Trigger:
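The trigger compares each older sorted run against the accumulated size of the newer runs. A minimal sketch, loosely modeled on size-ratio triggering in other LSM engines (the constants and exact rule are illustrative, not HeliosDB's):
// Illustrative size-ratio trigger; runs are ordered newest first.
fn pick_runs_to_merge(run_sizes: &[u64], size_ratio_pct: u64, min_width: usize) -> usize {
    let Some((&first, rest)) = run_sizes.split_first() else { return 0 };
    let mut accumulated = first;
    let mut count = 1;
    for &older in rest {
        // Stop once the next (older) run is more than size_ratio_pct
        // larger than the accumulated newer runs.
        if older > accumulated * (100 + size_ratio_pct) / 100 {
            break;
        }
        accumulated += older;
        count += 1;
    }
    if count >= min_width { count } else { 0 }
}

fn main() {
    // Four similar-sized runs are merged together...
    assert_eq!(pick_runs_to_merge(&[10, 10, 11, 12], 25, 2), 4);
    // ...while a much larger old run is left untouched.
    assert_eq!(pick_runs_to_merge(&[10, 500], 25, 2), 0);
}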
Pros:
- Lowest write amplification (2x)
- Excellent for sequential writes
- Minimal overhead
Cons:
- Can have temporary space amplification
- Not ideal for random updates
- Less predictable than leveled
Best For:
- Time-series databases
- Append-only workloads
- Event logging
- Sensor data
Configuration:
use heliosdb_storage::{CompactionStrategyV2, CompactionManagerV2};
let strategy = CompactionStrategyV2::Universal;
let manager = CompactionManagerV2::new(
    data_dir,
    strategy,
    sstables,
    tuner,
    max_concurrent_compactions,
);
Strategy Comparison¶
| Metric | Leveled | Size-Tiered | Universal |
|---|---|---|---|
| Write Amplification | 8-10x | 2-4x | 2x |
| Read Amplification | 1-2x | 5-10x | 10-20x |
| Space Amplification | 1.1x | 2x | 1.5-2x |
| Read Performance | Excellent | Good | Fair |
| Write Performance | Good | Excellent | Excellent |
| Space Efficiency | Excellent | Fair | Good |
Parallel Compaction¶
HeliosDB supports parallel compaction execution to maximize throughput on multi-core systems.
How It Works¶
- Worker Pool: Multiple compaction workers run concurrently
- Priority Queue: Tasks scheduled by priority
- Semaphore: Limits concurrent compactions to prevent resource exhaustion
- Independent Execution: Non-overlapping compactions run in parallel
Configuration¶
use heliosdb_storage::{AdaptiveLsmTuner, CompactionManagerV2, CompactionStrategyV2, LsmTuningConfig};
use std::sync::Arc;
let tuner = Arc::new(AdaptiveLsmTuner::new(LsmTuningConfig::default()));
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive,
    sstables,
    tuner,
    8, // Max 8 concurrent compactions
);
Worker Distribution¶
CPU Cores: 16
Recommendation: max_concurrent_compactions = cores / 2 = 8
Reasoning:
- Leave cores for foreground operations
- Each compaction uses 1-2 threads
- I/O bound, not CPU bound
Benefits¶
- 3-4x compaction throughput on multi-core systems
- Reduced compaction backlog
- Better foreground latency (less compaction debt)
- Improved write throughput
Task Scheduling¶
Tasks are prioritized by urgency score:
urgency_score = (level_priority * 100) + (size_factor * 10) + age_factor
Priority:
L0 → L1: 500 (highest priority)
L1 → L2: 400
L2 → L3: 300
...
Example:
Task 1: L0→L1, 200MB, 5 minutes old
urgency = 500 + 20 + 5 = 525
Task 2: L3→L4, 1GB, 2 minutes old
urgency = 200 + 100 + 2 = 302
Task 1 executes first (higher urgency)
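A minimal sketch of this calculation that reproduces the numbers above; the units (size in 100 MB increments, age in minutes) are inferred from the worked example and may differ from HeliosDB's internals:
// Sketch of the urgency formula; units inferred from the example above.
fn calculate_urgency(source_level: u8, size_bytes: u64, age_minutes: u64) -> u64 {
    // L0 → level_priority 5, L1 → 4, ... (deeper levels are less urgent).
    let level_priority = 5u64.saturating_sub(source_level as u64);
    let size_factor = size_bytes / (100 * 1024 * 1024); // size in 100 MB units
    (level_priority * 100) + (size_factor * 10) + age_minutes
}

fn main() {
    let mb: u64 = 1024 * 1024;
    // Task 1: L0→L1, 200 MB, 5 minutes old → 500 + 20 + 5 = 525
    assert_eq!(calculate_urgency(0, 200 * mb, 5), 525);
    // Task 2: L3→L4, 1 GB, 2 minutes old → 200 + 100 + 2 = 302
    assert_eq!(calculate_urgency(3, 1024 * mb, 2), 302);
}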
I/O Throttling¶
Compaction can overwhelm I/O bandwidth, affecting foreground performance. HeliosDB includes adaptive I/O throttling.
How It Works¶
Token Bucket Algorithm:
1. Tokens represent I/O bytes
2. Tokens refill at configured rate
3. Compaction consumes tokens before I/O
4. If tokens exhausted, compaction waits
Configuration¶
use heliosdb_storage::IoThrottleConfig;
let config = IoThrottleConfig {
    max_read_bytes_per_sec: 100 * 1024 * 1024,  // 100 MB/s
    max_write_bytes_per_sec: 100 * 1024 * 1024, // 100 MB/s
    adaptive: true, // Adjust based on load
};
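The token-bucket mechanism itself fits in a few lines. A minimal sketch (HeliosDB's throttle additionally separates read and write budgets and adapts the rate):
use std::time::Instant;

// Tokens are bytes; the bucket refills continuously at `rate` bytes/sec
// and allows at most one second of burst.
struct TokenBucket {
    rate: f64,
    tokens: f64,
    capacity: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate_bytes_per_sec: f64) -> Self {
        Self {
            rate: rate_bytes_per_sec,
            tokens: rate_bytes_per_sec,
            capacity: rate_bytes_per_sec,
            last_refill: Instant::now(),
        }
    }

    // Returns true if `bytes` of compaction I/O may proceed now;
    // otherwise the caller should back off and retry.
    fn try_consume(&mut self, bytes: f64) -> bool {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.last_refill = Instant::now();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= bytes {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(100.0 * 1024.0 * 1024.0); // 100 MB/s
    // A 10 MB read fits in the initial burst allowance.
    assert!(bucket.try_consume(10.0 * 1024.0 * 1024.0));
}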
Adaptive Throttling¶
When adaptive: true, throttling adjusts based on:
- Foreground operation latency
- System I/O utilization
- Compaction backlog
Rules:
- If foreground latency <1ms: Allow full compaction bandwidth
- If foreground latency 1-5ms: Reduce compaction to 75%
- If foreground latency 5-10ms: Reduce compaction to 50%
- If foreground latency >10ms: Reduce compaction to 25%
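Expressed as code, these rules reduce to a simple mapping from observed foreground latency to an allowed fraction of compaction bandwidth:
// Direct transcription of the rule table above.
fn compaction_bandwidth_fraction(foreground_latency_ms: f64) -> f64 {
    if foreground_latency_ms < 1.0 {
        1.00 // full compaction bandwidth
    } else if foreground_latency_ms < 5.0 {
        0.75
    } else if foreground_latency_ms < 10.0 {
        0.50
    } else {
        0.25
    }
}

fn main() {
    assert_eq!(compaction_bandwidth_fraction(3.0), 0.75);
}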
Benefits¶
- Consistent foreground performance
- Prevents compaction from starving reads/writes
- Better multi-tenant resource sharing
- Reduced tail latencies
Tombstone Management¶
Deletes in LSM trees use tombstones (markers indicating deletion). HeliosDB includes advanced tombstone management.
Standard Tombstone GC¶
Process:
1. Tombstone written on delete
2. Tombstone propagates through levels during compaction
3. After gc_grace_seconds, tombstone eligible for removal
4. Tombstone removed if it's the only version of the key
Configuration:
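A sketch of the relevant setting, assuming the 10-day default implied by the space-optimization example later in this guide:
let config = CompactionConfig {
    gc_grace_seconds: 10 * 86400, // 10 days; younger tombstones are kept
    ..Default::default()
};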
Why Grace Period?
- Allows time for repairs/backups
- Prevents resurrection of deleted data
- Handles distributed clock skew
Early Tombstone Deletion (ETD)¶
HeliosDB V2 includes early tombstone deletion for high-tombstone scenarios.
When Triggered: when tombstones make up more than 30% of a compaction's input (see the process below).
Process:
1. Calculate tombstone percentage in compaction input
2. If >30%, enable aggressive tombstone removal
3. Remove tombstones older than gc_grace_seconds / 10
4. Reduces space amplification by 40-60%
Benefits:
- Faster space reclamation
- Lower read amplification (fewer tombstones to skip)
- Better for high-churn workloads
Configuration:
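The exact field names are not shown in this guide; the sketch below uses hypothetical `enable_early_tombstone_deletion` and `early_tombstone_threshold` fields to illustrate the shape of the configuration:
let config = CompactionConfig {
    // Hypothetical field names for illustration; see the CompactionConfig
    // API docs for the actual ETD settings.
    enable_early_tombstone_deletion: true,
    early_tombstone_threshold: 0.30, // trigger when >30% of input is tombstones
    ..Default::default()
};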
Trade-offs:
- May delete tombstones before full grace period
- Acceptable for most workloads
- Disable for strict consistency requirements
Priority Scheduling¶
Compaction tasks are scheduled by priority to optimize performance.
Priority Calculation¶
pub struct CompactionPriority {
    pub priority: u8,        // Base priority (0-255)
    pub estimated_size: u64, // Size of compaction
    pub level: u8,           // Source level
    pub urgency_score: u64,  // Calculated urgency
}
let urgency_score = calculate_urgency(level, size, age);
Priority Rules¶
- L0 Compactions: Highest priority (prevents write stalls)
- Large Compactions: Higher priority (clear backlog)
- Old Compactions: Higher priority (prevent accumulation)
- Deep Level Compactions: Lower priority (less impact)
Example¶
use heliosdb_storage::{
    CompactionPriority, CompactionStrategyV2, CompactionTaskV2, CompressionAlgorithm,
};
let task = CompactionTaskV2 {
    id: "compact-1".to_string(),
    priority: CompactionPriority {
        priority: 100,
        estimated_size: 200 * 1024 * 1024, // 200 MB
        level: 0,
        urgency_score: 500,
    },
    source_sstables: vec![...], // SSTables selected for this compaction
    target_level: 1,
    strategy: CompactionStrategyV2::LeveledParallel,
    enable_tombstone_gc: true,
    compression: CompressionAlgorithm::Snappy,
};
manager.submit_task(task)?;
Performance Optimization¶
1. Minimize Write Amplification¶
Strategies:
- Use size-tiered or universal compaction
- Increase SSTable size (fewer compactions)
- Batch writes in larger memtables
- Enable early tombstone deletion
Example:
let config = LsmTuningConfig {
    memtable_size_mb: 256,      // Larger memtables
    target_file_size_base: 128, // Larger SSTables
    compaction_style: 1,        // Universal compaction
    ..Default::default()
};
Result:
- Write amplification: 10x → 3x (70% reduction)
- Write throughput: +100%
2. Minimize Read Amplification¶
Strategies:
- Use leveled compaction
- Increase bloom filter size
- Trigger compaction earlier (fewer L0 files)
- Increase block cache
Example:
let config = LsmTuningConfig {
    level0_file_trigger: 2,                      // Compact L0 early
    bloom_bits_per_key: vec![16, 14, 12, 10, 8], // Large filters
    block_cache_mb: 2048,                        // Large cache
    compaction_style: 0,                         // Leveled
    ..Default::default()
};
Result:
- Read amplification: 10x → 2x (80% reduction)
- Read throughput: +150%
3. Optimize Space Utilization¶
Strategies:
- Use compression (Zstd for maximum compression)
- Enable early tombstone deletion
- Use leveled compaction (better space efficiency)
- Reduce GC grace period
Example:
let config = LsmTuningConfig {
    compression_per_level: vec![0, 2, 2, 2, 2, 2, 2], // Zstd
    compaction_style: 0,                              // Leveled
    ..Default::default()
};
let compaction_config = CompactionConfig {
    gc_grace_seconds: 86400, // 1 day (reduced from the 10-day default)
    ..Default::default()
};
Result:
- Space amplification: 2.5x → 1.2x (52% reduction)
- Disk savings: 40-60%
4. Balance Throughput and Latency¶
Use Adaptive Strategy:
The adaptive strategy automatically switches between leveled and universal based on workload.
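Selecting it reuses the CompactionManagerV2 constructor from the parallel-compaction section:
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive, // switches between leveled and universal
    sstables,
    tuner,
    max_concurrent_compactions,
);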
Benefits:
- Optimal performance across workload changes
- Reduced operational overhead
- Best of both worlds
Monitoring and Metrics¶
Key Metrics¶
1. Compaction Metrics
let metrics = manager.metrics();
let snapshot = metrics.snapshot();
println!("Total Compactions: {}", snapshot.total_compactions);
println!("Bytes Read: {} MB", snapshot.bytes_read / (1024 * 1024));
println!("Bytes Written: {} MB", snapshot.bytes_written / (1024 * 1024));
println!("Tombstones Removed: {}", snapshot.tombstones_removed);
println!("Space Reclaimed: {} MB", snapshot.space_reclaimed / (1024 * 1024));
2. Amplification Factors
println!("Write Amplification: {:.2}x", snapshot.write_amplification());
println!("Space Efficiency: {:.2}%", snapshot.space_efficiency() * 100.0);
3. Performance Metrics
println!("Avg Compaction Time: {:.2}s", snapshot.avg_compaction_time_secs());
println!("Avg Throughput: {:.2} MB/s",
snapshot.avg_throughput as f64 / (1024.0 * 1024.0));
println!("Failed Compactions: {}", snapshot.failed_compactions);
Metrics Report¶
Output:
Compaction Metrics Report
=========================
Total Compactions: 1250
- Size-Tiered: 850
- Leveled: 400
- Failed: 5
Space Statistics:
- Bytes Read: 120 GB
- Bytes Written: 480 GB
- Space Reclaimed: 60 GB
- Space Efficiency: 50.00%
SSTable Statistics:
- Tables Merged: 5000
- Tables Created: 1250
- Tombstones Removed: 500000
- Duplicates Removed: 1500000
Performance:
- Total Compaction Time: 3600.00s
- Average Compaction Time: 2.88s
- Average Throughput: 133.33 MB/s
- Write Amplification: 4.00x
- Peak Memory Usage: 512.00 MB
Alerts and Thresholds¶
Set up monitoring for:
// `alert` stands in for your monitoring hook (e.g. fn alert(&str)).
// Write amplification too high
if snapshot.write_amplification() > 15.0 {
    alert(&format!("Write amplification excessive: {:.2}x",
        snapshot.write_amplification()));
}
// Compaction falling behind
if manager.active_compactions() >= max_concurrent {
    alert("All compaction workers busy, backlog building");
}
// High failure rate
let failure_rate = snapshot.failed_compactions as f64
    / snapshot.total_compactions as f64;
if failure_rate > 0.01 {
    alert(&format!("Compaction failure rate: {:.2}%", failure_rate * 100.0));
}
// Space efficiency low (not reclaiming enough space)
if snapshot.space_efficiency() < 0.20 {
    alert(&format!("Low space reclamation: {:.2}%",
        snapshot.space_efficiency() * 100.0));
}
Best Practices¶
1. Choose the Right Strategy¶
Decision Tree:
Is workload write-heavy (>70% writes)?
  Yes → Size-Tiered or Universal
  No → Is it read-heavy (>70% reads)?
    Yes → Leveled
    No → Adaptive (let system choose)
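The same tree as a helper function, assuming the CompactionStrategyV2 variants shown earlier in this guide:
use heliosdb_storage::CompactionStrategyV2;

// Pick a strategy from the observed write fraction of the workload.
fn choose_strategy(write_fraction: f64) -> CompactionStrategyV2 {
    if write_fraction > 0.70 {
        CompactionStrategyV2::Universal // write-heavy: minimize write amplification
    } else if write_fraction < 0.30 {
        CompactionStrategyV2::LeveledParallel // read-heavy: minimize read amplification
    } else {
        CompactionStrategyV2::Adaptive // mixed: let the system choose
    }
}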
2. Size SSTables Appropriately¶
Guidelines:
- L0 files: 25-50 MB (fast to compact)
- L1+ files: 64-128 MB (balance compaction cost)
- Time-series: 128-256 MB (sequential writes)
3. Configure Parallel Workers¶
Formula:
max_concurrent_compactions = min(cpu_cores / 2, ram_gb / 2, io_gbps × 2)
Example:
- 16 cores, 32 GB RAM, 2 GB/s I/O
- max_concurrent = min(8, 16, 4) = 4
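As code (the formula is reconstructed from the example above):
// min(cores / 2, RAM GB / 2, I/O GB/s × 2), reconstructed from the example.
fn max_concurrent_compactions(cpu_cores: u64, ram_gb: u64, io_gbps: u64) -> u64 {
    (cpu_cores / 2).min(ram_gb / 2).min(io_gbps * 2)
}

fn main() {
    // 16 cores, 32 GB RAM, 2 GB/s I/O → min(8, 16, 4) = 4
    assert_eq!(max_concurrent_compactions(16, 32, 2), 4);
}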
4. Use I/O Throttling in Production¶
Always enable for:
- Multi-tenant systems
- Shared storage
- Cloud deployments
- SSD storage (prevent wear)
5. Monitor Continuously¶
Essential metrics:
- Compaction backlog (pending tasks)
- Active compactions
- Write/read/space amplification
- Failure rate
- L0 file count
6. Tune Based on Metrics¶
If write amplification high:
- Switch to size-tiered/universal
- Increase SSTable size
- Reduce compaction frequency
If read amplification high:
- Switch to leveled
- Increase bloom filter size
- Compact L0 more frequently
If space amplification high:
- Reduce GC grace period
- Enable early tombstone deletion
- Trigger manual compaction
Conclusion¶
HeliosDB's advanced compaction system provides:
✓ Multiple strategies for different workloads
✓ Parallel execution for high throughput
✓ I/O throttling for consistent performance
✓ Early tombstone deletion for space efficiency
✓ Priority scheduling for optimal resource usage
Key Takeaways:
- Choose strategy based on workload (leveled for reads, size-tiered for writes)
- Enable parallel compaction on multi-core systems
- Use I/O throttling to protect foreground performance
- Monitor metrics and adjust configuration as needed
- Use adaptive mode for changing workloads
By following this guide, you can achieve:
- 40-60% reduction in write amplification
- 2-3x improvement in compaction throughput
- Consistent sub-millisecond foreground latencies
- 50%+ space savings with compression and tombstone management
For more information, see the LSM Tuning Guide.