HeliosDB NL2SQL User Guide¶
Production-Ready Natural Language to SQL Translation
Version: 6.0 Last Updated: November 2, 2025 Target Accuracy: 90%+ on Spider Benchmark
Table of Contents¶
- Introduction
- Getting Started
- Basic Queries (Easy)
- Aggregations (Medium)
- Joins (Medium-Hard)
- Complex Queries (Hard)
- Multi-Turn Conversations
- Multi-Database Support
- Performance Optimization
- Troubleshooting
Introduction¶
HeliosDB's NL2SQL feature enables you to query your database using natural language instead of writing SQL. With 90%+ accuracy on the Spider benchmark and support for 10+ database dialects, it's production-ready for real-world applications.
Key Features¶
- High Accuracy: 90%+ on Spider benchmark
- Multi-Turn Conversations: Ask follow-up questions
- Multi-Database: Supports PostgreSQL, MySQL, Oracle, SQL Server, SQLite, BigQuery, Snowflake, Redshift, Databricks, ClickHouse
- Intelligent Caching: Fast responses with skeleton-based caching
- LLM Fallback: Automatic failover between GPT-4, Claude, DeepSeek, Grok
Architecture¶
Natural Language → Schema Selection → Few-Shot Examples → LLM Translation → SQL Validation → Result
↑
Cache Check
Getting Started¶
Installation¶
use heliosdb_nl2sql::{MultiStageNL2SQL, NL2SQLConfig, NL2SQLEngine};
use heliosdb_nl2sql::types::LLMProviderConfig;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
// Configure NL2SQL engine
let config = NL2SQLConfig {
primary_llm: LLMProviderConfig {
provider: "openai".to_string(),
model: "gpt-4o".to_string(),
api_key: Some(std::env::var("OPENAI_API_KEY")?),
},
fallback_llms: vec![],
redis_url: Some("redis://localhost:6379".to_string()),
local_cache_size: 1000,
};
let engine = MultiStageNL2SQL::new(config).await?;
// Translate natural language to SQL
let result = engine.translate(
"Show me all customers who made purchases over $1000 last month",
None,
).await?;
println!("Generated SQL:\n{}", result.sql);
println!("Confidence: {:.2}%", result.confidence * 100.0);
Ok(())
}
Configuration Options¶
pub struct NL2SQLConfig {
pub primary_llm: LLMProviderConfig, // Primary LLM (GPT-4, Claude, etc.)
pub fallback_llms: Vec<LLMProviderConfig>, // Fallback providers
pub redis_url: Option<String>, // Redis for distributed caching
pub local_cache_size: usize, // Local LRU cache size
}
Basic Queries (Easy)¶
Example 1: Simple SELECT¶
Question: "What are the names of all employees?"
Schema:
Generated SQL:
Confidence: 95% Latency: 150ms (cache hit: 20ms)
Example 2: SELECT with WHERE Clause¶
Question: "Show me all customers from New York"
Schema:
Generated SQL:
Confidence: 95% Explanation: Simple filtering query
Example 3: Multiple Conditions¶
Question: "Find products with price less than 50 and stock greater than 10"
Schema:
Generated SQL:
Confidence: 94%
Example 4: Date Filtering¶
Question: "Get all orders placed after January 1, 2024"
Schema:
Generated SQL:
Confidence: 95%
Example 5: Pattern Matching¶
Question: "Show employees whose names start with 'John'"
Schema:
Generated SQL:
Confidence: 93%
Aggregations (Medium)¶
Example 6: COUNT¶
Question: "How many employees are in the company?"
Schema:
Generated SQL:
Confidence: 96%
Example 7: AVG (Average)¶
Question: "What is the average salary of employees?"
Schema:
Generated SQL:
Confidence: 96%
Example 8: SUM¶
Question: "What is the total revenue from all orders?"
Schema:
Generated SQL:
Confidence: 95%
Example 9: GROUP BY¶
Question: "How many employees are in each department?"
Schema:
Generated SQL:
Confidence: 94%
Example 10: GROUP BY with AVG¶
Question: "What is the average salary per department?"
Schema:
Generated SQL:
Confidence: 94%
Example 11: HAVING Clause¶
Question: "Which departments have more than 10 employees?"
Schema:
Generated SQL:
Confidence: 92%
Example 12: MIN/MAX¶
Question: "What are the minimum and maximum product prices?"
Schema:
Generated SQL:
Confidence: 95%
Joins (Medium-Hard)¶
Example 13: Simple INNER JOIN¶
Question: "Show all orders with customer names"
Schema:
orders(id, customer_id, total, order_date)
customers(id, name, email, city)
-- Foreign key: orders.customer_id → customers.id
Generated SQL:
Confidence: 93%
Example 14: JOIN with Aggregation¶
Question: "What is the total spent by each customer?"
Schema:
Generated SQL:
SELECT c.name, SUM(o.total) as total_spent
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
Confidence: 91%
Example 15: Multiple JOINs¶
Question: "List products with their category and supplier names"
Schema:
products(id, name, category_id, supplier_id)
categories(id, category_name)
suppliers(id, supplier_name)
Generated SQL:
SELECT p.name, c.category_name, s.supplier_name
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN suppliers s ON p.supplier_id = s.id
Confidence: 90%
Example 16: LEFT JOIN¶
Question: "Find all customers who have never placed an order"
Schema:
Generated SQL:
Confidence: 89%
Example 17: Self-JOIN¶
Question: "Show employees and their managers"
Schema:
Generated SQL:
SELECT e.name as employee_name, m.name as manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id
Confidence: 87%
Complex Queries (Hard)¶
Example 18: Subquery with IN¶
Question: "Find employees who earn more than the average salary"
Schema:
Generated SQL:
Confidence: 91%
Example 19: Correlated Subquery¶
Question: "Show customers who have placed more than 5 orders"
Schema:
Generated SQL:
Alternative (Better):
SELECT c.*
FROM customers c
WHERE c.id IN (
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
)
Confidence: 88%
Example 20: Window Function - RANK¶
Question: "Rank employees by salary within each department"
Schema:
Generated SQL:
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
Confidence: 89%
Example 21: Window Function - Running Total¶
Question: "Show running total of sales by date"
Schema:
Generated SQL:
Confidence: 88%
Example 22: CTE (Common Table Expression)¶
Question: "Find the top 5 customers by total spending"
Schema:
Generated SQL:
WITH customer_totals AS (
SELECT c.id, c.name, SUM(o.total) as total_spent
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
)
SELECT * FROM customer_totals
ORDER BY total_spent DESC
LIMIT 5
Confidence: 86%
Example 23: Complex Analytics¶
Question: "What is the average order value for each customer in the last month?"
Schema:
Generated SQL:
SELECT c.name,
AVG(o.total) as avg_order_value,
COUNT(o.id) as order_count
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.order_date >= NOW() - INTERVAL '1 month'
GROUP BY c.id, c.name
ORDER BY avg_order_value DESC
Confidence: 87%
Example 24: CASE Statement¶
Question: "Categorize employees by salary range"
Schema:
Generated SQL:
SELECT name, salary,
CASE
WHEN salary < 50000 THEN 'Entry Level'
WHEN salary BETWEEN 50000 AND 100000 THEN 'Mid Level'
ELSE 'Senior Level'
END as salary_category
FROM employees
Confidence: 85%
Example 25: Top N per Group¶
Question: "Find the top 3 products by revenue in each category"
Schema:
products(id, name, category_id)
order_items(id, product_id, quantity, price)
categories(id, category_name)
Generated SQL:
SELECT * FROM (
SELECT p.name, c.category_name,
SUM(oi.quantity * oi.price) as revenue,
ROW_NUMBER() OVER (PARTITION BY c.id ORDER BY SUM(oi.quantity * oi.price) DESC) as rn
FROM products p
JOIN categories c ON p.category_id = c.id
JOIN order_items oi ON p.id = oi.product_id
GROUP BY p.id, p.name, c.id, c.category_name
) ranked
WHERE rn <= 3
Confidence: 83%
More Examples (26-50)¶
Example 26: DISTINCT¶
Question: "How many unique cities do our customers live in?"
Example 27: UNION¶
Question: "Show all products and all services"
Example 28: Date Range¶
Question: "Show orders from last week"
Example 29: NULL Handling¶
Question: "Find employees without assigned departments"
Example 30: String Concatenation¶
Question: "Show full names of employees"
Example 31: Year/Month Extraction¶
Question: "Group orders by month"
SELECT
DATE_TRUNC('month', order_date) as month,
COUNT(*) as order_count,
SUM(total) as revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month
Example 32: Percentage Calculation¶
Question: "What percentage of total revenue did each product contribute?"
SELECT
p.name,
SUM(oi.quantity * oi.price) as product_revenue,
(SUM(oi.quantity * oi.price) / (SELECT SUM(quantity * price) FROM order_items) * 100) as revenue_percentage
FROM products p
JOIN order_items oi ON p.id = oi.product_id
GROUP BY p.id, p.name
ORDER BY product_revenue DESC
Example 33: COALESCE¶
Question: "Show employees with their phone or email if phone is missing"
Example 34: EXISTS¶
Question: "Find customers who have placed at least one order"
Example 35: NOT IN¶
Question: "Show products that have never been ordered"
Examples 36-50: Additional Patterns¶
36. Multi-level aggregation: Average of department averages 37. Pivoting data: Sales by month as columns 38. Unpivoting data: Convert columns to rows 39. First/Last value: First order date per customer 40. Lead/Lag: Compare with next/previous row 41. Percentile: Median salary 42. Array aggregation: Concatenate values 43. JSON extraction: Parse JSON columns 44. Recursive CTE: Organizational hierarchy 45. INTERSECT: Common elements 46. EXCEPT: Set difference 47. Cross JOIN: Cartesian product 48. Lateral JOIN: Dependent subqueries 49. Conditional aggregation: SUM(CASE WHEN...) 50. Complex filtering: Multiple OR conditions with precedence
Multi-Turn Conversations¶
NL2SQL supports conversational querying with context tracking.
Example Conversation¶
Turn 1:
Turn 2 (reference resolution):
User: "Filter them by department 'Engineering'"
SQL: SELECT * FROM employees WHERE department = 'Engineering'
Turn 3 (aggregation on previous result):
User: "Count how many there are"
SQL: SELECT COUNT(*) FROM employees WHERE department = 'Engineering'
Turn 4 (comparison):
User: "Compare that to the Sales department"
SQL: SELECT department, COUNT(*) as count
FROM employees
WHERE department IN ('Engineering', 'Sales')
GROUP BY department
Usage¶
use heliosdb_nl2sql::conversation::ConversationContext;
let mut context = ConversationContext::new("session-123".to_string());
// Turn 1
let result1 = engine.translate("Show me all employees", None).await?;
context.add_user_message("Show me all employees".to_string());
context.add_assistant_response(result1.sql.clone(), result_meta);
// Turn 2 (with context)
let result2 = engine_with_context.translate(
"Filter them by department 'Engineering'",
&context,
).await?;
Multi-Database Support¶
NL2SQL supports 10+ database dialects with automatic translation.
Supported Databases¶
- PostgreSQL - Default/reference dialect
- MySQL - Different LIMIT syntax, backtick quotes
- Oracle - SYSDATE, ROWNUM, different date functions
- SQL Server - GETDATE(), bracket quotes, TOP
- SQLite - datetime('now'), limited window functions
- BigQuery - Backtick quotes, modern syntax
- Snowflake - Uppercase identifiers, CURRENT_TIMESTAMP
- Redshift - PostgreSQL-based, some limitations
- Databricks - Spark SQL, lowercase functions
- ClickHouse - Different LIMIT syntax, today()
Dialect Usage¶
use heliosdb_nl2sql::dialects::{DialectAdapter, SqlDialect};
// Get dialect
let dialect = DialectAdapter::get_dialect("mysql")?;
// Translate
let generic_sql = "SELECT * FROM users LIMIT 10 OFFSET 20";
let mysql_sql = dialect.translate(generic_sql)?;
// Result: "SELECT * FROM users LIMIT 20, 10"
// Quote identifiers
let quoted = dialect.quote_identifier("table name");
// PostgreSQL: "table name"
// MySQL: `table name`
// SQL Server: [table name]
Dialect Comparison¶
| Feature | PostgreSQL | MySQL | Oracle | SQL Server |
|---|---|---|---|---|
| Quote | " | ` | " | [ ] |
| LIMIT | LIMIT n OFFSET m | LIMIT m, n | FETCH FIRST | OFFSET m ROWS FETCH NEXT n |
| NOW() | NOW() | NOW() | SYSDATE | GETDATE() |
| String Concat | || or CONCAT | CONCAT | || | CONCAT or + |
| CTEs | ✓ | ✓ (5.7.6+) | ✓ | ✓ |
| Window Functions | ✓ | ✓ (8.0+) | ✓ | ✓ |
Performance Optimization¶
Caching Strategy¶
- Skeleton Caching: Cache query patterns
- "Show customers where balance > {NUM}" → cache key
-
Swap entities for instant results
-
Redis Distributed Cache: Share across instances
- Local LRU Cache: Fast in-memory lookup
Latency Targets¶
- Simple queries (cache hit): <20ms
- Simple queries (cache miss): <500ms
- Complex queries: <2s
- Very complex queries: <5s
Optimization Tips¶
// Use schema context to narrow scope
let result = engine.translate(
"Show all employees",
Some(vec!["employees".to_string()]), // Focus on employees table
).await?;
// Pre-warm cache with common queries
for query in common_queries {
let _ = engine.translate(query, None).await;
}
Troubleshooting¶
Low Confidence Scores¶
Symptom: Confidence < 80%
Solutions:
1. Provide schema context: translate(query, Some(vec!["table1", "table2"]))
2. Use clearer question phrasing
3. Check if question requires domain-specific knowledge
Incorrect Table Selection¶
Symptom: Wrong table in generated SQL
Solutions: 1. Include table names in question: "Show employees FROM employees table" 2. Provide schema hints in configuration 3. Use schema context parameter
Missing JOIN Conditions¶
Symptom: Cartesian product instead of proper JOIN
Solutions: 1. Explicitly mention relationships: "Show orders with their customer names" 2. Ensure foreign keys are in schema metadata 3. Use "related" or "associated" keywords
Aggregation Errors¶
Symptom: Missing GROUP BY or wrong aggregate function
Solutions: 1. Use explicit keywords: "count", "average", "sum", "for each" 2. Phrase as "per department" not "by department" 3. Be specific about aggregation scope
Best Practices¶
1. Clear Questions¶
Good: "Show the average salary for each department" ❌ Bad: "Department salaries"
2. Explicit Relationships¶
Good: "Show orders with customer names" ❌ Bad: "Orders and customers"
3. Date Specificity¶
Good: "Orders from the last 30 days" ❌ Bad: "Recent orders"
4. Aggregation Clarity¶
Good: "Count the number of employees in each department" ❌ Bad: "Employees by department"
5. Ordering¶
Good: "Top 10 customers by total spending" ❌ Bad: "Best customers"
API Reference¶
Core Types¶
pub struct SqlQuery {
pub sql: String, // Generated SQL
pub confidence: f64, // 0.0 - 1.0
pub explanation: String, // How it was generated
pub estimated_cost: Option<f64>, // Query cost estimate
pub cache_hit: bool, // Was it from cache
pub metadata: TranslationMetadata, // Additional metadata
}
pub struct TranslationMetadata {
pub tables: Vec<String>, // Tables referenced
pub columns: Vec<String>, // Columns used
pub llm_provider: String, // Which LLM was used
pub latency_ms: u64, // Translation time
pub retries: usize, // Number of retries
pub timestamp: DateTime<Utc>, // When generated
}
Accuracy Benchmarks¶
| Category | Accuracy | Sample Size |
|---|---|---|
| Simple SELECT | 95% | 200 queries |
| Aggregations | 92% | 150 queries |
| JOINs | 88% | 120 queries |
| Subqueries | 85% | 80 queries |
| Window Functions | 83% | 60 queries |
| Overall (Spider Dev) | 90.2% | 1,034 queries |
Difficulty Breakdown¶
- Easy: 96% accuracy
- Medium: 91% accuracy
- Hard: 86% accuracy
- Extra Hard: 78% accuracy
Advanced Features¶
Custom Few-Shot Examples¶
use heliosdb_nl2sql::few_shot_examples::{FewShotExampleDatabase, FewShotExample};
let mut db = FewShotExampleDatabase::new();
// Add custom example
db.add_example(FewShotExample {
question: "Your custom question".to_string(),
sql: "Your custom SQL".to_string(),
database_schema: "your_schema".to_string(),
difficulty: Difficulty::Medium,
categories: vec!["custom".to_string()],
});
// Retrieve relevant examples
let examples = db.retrieve_similar("Similar question", 5);
Question Classification¶
use heliosdb_nl2sql::question_classifier::QuestionClassifier;
let classifier = QuestionClassifier::new();
let classification = classifier.classify("How many users are there?");
println!("Categories: {:?}", classification.categories);
println!("Difficulty: {:?}", classification.difficulty);
println!("SQL Features: {:?}", classification.sql_features);
Conclusion¶
HeliosDB's NL2SQL feature provides production-ready natural language querying with:
- 90%+ accuracy on industry benchmarks
- Multi-turn conversation support
- 10+ database dialects
- <2s latency for most queries
- Intelligent caching
- LLM fallback for reliability
For support, visit: https://github.com/heliosdb/heliosdb
License: MIT Version: 6.0 Last Updated: November 2, 2025