# Automated ETL Quick Start Guide
**Time to Complete:** 5 minutes · **Prerequisites:** Rust 1.75+, HeliosDB installed · **Last Updated:** January 4, 2026
## Overview
This guide helps you get started with HeliosDB's Automated ETL feature in just 5 minutes. You'll learn how to infer schemas, build pipelines, and transform data automatically.
## Step 1: Add Dependency

Add the ETL crate to your `Cargo.toml`:
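The snippet below is a sketch: the crate name is inferred from the `use heliosdb_etl::...` imports in the examples, and the version is illustrative. `tokio` is also needed because the examples use `#[tokio::main]`.

```toml
[dependencies]
heliosdb-etl = "0.1"  # version is illustrative; check the current release
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```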
## Step 2: Basic Schema Inference
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create engine with default configuration
    let config = SchemaInferenceConfig::default();
    let engine = AutomatedETLEngine::new(config).await?;

    // Sample data (simulating CSV rows)
    let data = vec![
        HashMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "Alice".to_string()),
            ("email".to_string(), "alice@example.com".to_string()),
        ]),
        HashMap::from([
            ("id".to_string(), "2".to_string()),
            ("name".to_string(), "Bob".to_string()),
            ("email".to_string(), "bob@example.com".to_string()),
        ]),
    ];

    // Infer schema automatically
    let schema = engine.infer_schema("users", &data).await?;

    println!("Inferred {} columns:", schema.columns.len());
    for col in &schema.columns {
        println!("  {} : {:?}", col.name, col.data_type);
    }

    Ok(())
}
```
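If inference succeeds, the loop prints one line per detected column. Exact results depend on HeliosDB's type system, but with the rows above you would typically expect `id` to be detected as a numeric type and `name` and `email` as strings.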
## Step 3: Build and Execute Pipeline
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};
use std::collections::HashMap;

// Placeholder loader so the example compiles; replace with your own
// source (CSV parser, database query, API client, etc.).
fn load_your_data() -> Vec<HashMap<String, String>> {
    vec![HashMap::from([
        ("id".to_string(), "1".to_string()),
        ("name".to_string(), "Alice".to_string()),
    ])]
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = AutomatedETLEngine::new(SchemaInferenceConfig::default()).await?;

    // Your source data
    let source_data = load_your_data();

    // 1. Infer source schema
    let source_schema = engine.infer_schema("source", &source_data).await?;

    // 2. Build transformation pipeline
    let pipeline = engine.build_pipeline(
        source_schema.clone(),
        source_schema, // Same schema for a passthrough pipeline
    ).await?;

    // 3. Execute pipeline
    let result = pipeline.execute(source_data).await?;

    println!("Processed {} rows in {:?}", result.rows_processed, result.duration);
    println!("Throughput: {:.0} rows/second", result.throughput);

    Ok(())
}
```
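The passthrough pipeline above (identical source and target schemas) is useful as a first smoke test. For real transformations, pass a different target schema as the second argument to `build_pipeline` and let the engine derive the conversions between the two.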
## Step 4: Enable Data Quality Checks
```rust
use heliosdb_etl::{PipelineConfig, DataQualityValidator};

// `source_schema` and `target_schema` are the schemas inferred in Step 3;
// `result` is the output of `pipeline.execute(...)`.

// Enable quality validation in the pipeline
let pipeline_config = PipelineConfig {
    name: "my_pipeline".to_string(),
    source_schema,
    target_schema: target_schema.clone(), // cloned so it can be reused below
    quality_checks: true,     // Enable quality validation
    anomaly_detection: true,  // Enable anomaly detection
    batch_size: 10_000,
    parallelism: 8,
    ..Default::default()
};

// After pipeline execution, check quality against the target schema
let validator = DataQualityValidator::new();
let metrics = validator.validate(&result.data, &target_schema).await?;

println!("Quality Metrics:");
println!("  Completeness:  {:.1}%", metrics.completeness * 100.0);
println!("  Accuracy:      {:.1}%", metrics.accuracy * 100.0);
println!("  Overall Score: {:.1}%", metrics.overall_score * 100.0);
```
## Common Configuration Options
```rust
use heliosdb_etl::SchemaInferenceConfig;

let config = SchemaInferenceConfig {
    sample_size: 10_000,        // Rows to sample for inference
    confidence_threshold: 0.8,  // Min confidence for type detection
    infer_relationships: true,  // Detect foreign keys
    infer_constraints: true,    // Detect primary keys / unique columns
    max_rows: 1_000_000,        // Max rows to process
    ..Default::default()
};
```
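As a rule of thumb, a larger `sample_size` improves type-detection accuracy at the cost of inference time, and a higher `confidence_threshold` makes the engine more conservative before committing to a specific type. The Troubleshooting section below applies both.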
## What's Next?
| Topic | Guide |
|---|---|
| Real-world examples | EXAMPLES.md |
| Full API reference | F5.2.4_AUTOMATED_ETL_USER_GUIDE.md |
| CDC integration | CDC Webhooks |
## Troubleshooting
**Issue:** Schema inference returns wrong types

```rust
// Solution: sample more rows and require higher confidence
// before committing to a type
let config = SchemaInferenceConfig {
    sample_size: 50_000,
    confidence_threshold: 0.9,
    ..Default::default()
};
```
**Issue:** Slow transformation performance

```rust
// Solution: increase parallelism and batch size
// (num_cpus is an external crate; add `num_cpus = "1"` to Cargo.toml)
let config = PipelineConfig {
    batch_size: 50_000,
    parallelism: num_cpus::get(),
    ..Default::default()
};
```
**See Also:** README.md | EXAMPLES.md