HeliosDB Health Check Guide¶
Version: 1.0
Last Updated: 2025-11-30
Health Check Endpoints¶
HTTP Endpoint¶
# Simple health check
curl http://localhost:5432/health
# Returns: 200 OK if healthy
# Detailed health check
curl http://localhost:5432/health/detailed
# Returns: JSON with full status
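The simple check above can also be scripted. The following is a minimal Python sketch, assuming the endpoint answers `200` when healthy as described; the function name `is_healthy`, the 2-second timeout, and the choice to treat every other outcome (non-200 status, timeout, refused connection) as unhealthy are illustrative assumptions, not documented HeliosDB behavior.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:5432", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200.

    Hypothetical helper: the name and the timeout default are assumptions.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connections, timeouts, and HTTP error statuses
        # (HTTPError is a subclass of URLError) all count as unhealthy.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
```

Returning a plain boolean keeps the helper easy to reuse from deployment scripts or a polling loop.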
Response Format¶
{
  "status": "healthy",
  "timestamp": "2025-11-30T10:00:00Z",
  "version": "7.0.0",
  "components": {
    "database": "healthy",
    "replication": "healthy",
    "cache": "healthy",
    "storage": "healthy"
  },
  "metrics": {
    "uptime_seconds": 86400,
    "connections": 42,
    "memory_usage_pct": 45.2,
    "disk_usage_pct": 62.5,
    "cpu_usage_pct": 12.3
  }
}
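A consumer of the detailed response usually wants to know which components are not healthy. The sketch below assumes the `components` object shown above, where each value is a status string; the helper name `unhealthy_components` and the non-healthy values `"degraded"` and `"down"` in the sample are hypothetical, since the source only shows `"healthy"`.

```python
import json


def unhealthy_components(payload: str) -> list[str]:
    """Return the sorted names of components whose status is not "healthy"."""
    doc = json.loads(payload)
    components = doc.get("components", {})
    return sorted(name for name, state in components.items() if state != "healthy")


# Sample payload; "degraded" and "down" are assumed status values.
sample = """
{
  "status": "degraded",
  "components": {
    "database": "healthy",
    "replication": "down",
    "cache": "degraded",
    "storage": "healthy"
  }
}
"""
print(unhealthy_components(sample))  # ['cache', 'replication']
```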
Health Check SQL Commands¶
-- Database role: true on a replica (standby), false on the primary
SELECT pg_is_in_recovery() AS is_replica;
-- Replication health
SELECT COUNT(*) as replica_count FROM pg_stat_replication;
-- Cache health
SELECT cache_hit_rate FROM cache_statistics;
-- Storage health: database size in GiB (dividing by 1024.0 avoids integer truncation)
SELECT round(pg_database_size(current_database()) / 1024.0 / 1024 / 1024, 2) AS size_gb;
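-- VACUUM/ANALYZE status (NULL means the table has never been manually vacuumed or analyzed)
-- In PostgreSQL, pg_stat_user_tables exposes the table name as relname
SELECT
    schemaname,
    relname AS tablename,
    last_vacuum,
    last_analyze
FROM pg_stat_user_tables
ORDER BY last_vacuum DESC;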
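The rows returned by the vacuum status query can be screened for tables that have gone too long without a vacuum. This is a sketch under stated assumptions: rows arrive as `(schemaname, tablename, last_vacuum, last_analyze)` tuples with timezone-aware timestamps or `None`, and the 7-day limit and the name `stale_tables` are illustrative, not HeliosDB recommendations.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(rows, now, max_age=timedelta(days=7)):
    """Return "schema.table" names with no vacuum within max_age.

    rows: iterable of (schemaname, tablename, last_vacuum, last_analyze).
    A last_vacuum of None (never vacuumed) is treated as stale.
    """
    stale = []
    for schema, table, last_vacuum, _last_analyze in rows:
        if last_vacuum is None or now - last_vacuum > max_age:
            stale.append(f"{schema}.{table}")
    return stale


now = datetime(2025, 11, 30, 10, 0, tzinfo=timezone.utc)
rows = [
    ("public", "orders", now - timedelta(days=1), now - timedelta(days=1)),
    ("public", "events", now - timedelta(days=30), None),
    ("public", "audit", None, None),
]
print(stale_tables(rows, now))  # ['public.events', 'public.audit']
```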
Monitoring Integration¶
Prometheus Metrics¶
# HELP heliosdb_up Database is up
# TYPE heliosdb_up gauge
heliosdb_up 1
# HELP heliosdb_connections Active connections
# TYPE heliosdb_connections gauge
heliosdb_connections 42
# HELP heliosdb_memory_usage_bytes Memory usage
# TYPE heliosdb_memory_usage_bytes gauge
heliosdb_memory_usage_bytes 1073741824
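For ad hoc checks outside Prometheus, the text output above can be read directly. The sketch below handles only the simple label-free `name value` lines shown in the sample; it is not a full parser for the Prometheus exposition format, and the function name `parse_metrics` is an assumption.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse label-free "name value" sample lines, skipping # HELP / # TYPE comments."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, rest = line.split(None, 1)
        # The first field after the name is the value; an optional timestamp may follow.
        metrics[name] = float(rest.split()[0])
    return metrics


sample = """\
# HELP heliosdb_up Database is up
# TYPE heliosdb_up gauge
heliosdb_up 1
# HELP heliosdb_connections Active connections
# TYPE heliosdb_connections gauge
heliosdb_connections 42
# HELP heliosdb_memory_usage_bytes Memory usage
# TYPE heliosdb_memory_usage_bytes gauge
heliosdb_memory_usage_bytes 1073741824
"""
metrics = parse_metrics(sample)
print(metrics["heliosdb_memory_usage_bytes"] / 1024**3)  # 1.0 (GiB)
```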
Kubernetes Probes¶
livenessProbe:
  httpGet:
    path: /health
    port: 5432
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 5
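The probe settings determine roughly how long a failure goes unnoticed. Kubernetes acts only after `failureThreshold` consecutive failed probes, and that field defaults to 3 when omitted, as it is above. The sketch below is a rough estimate that ignores probe timeouts and scheduling jitter; the function name is hypothetical.

```python
def detection_window_seconds(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate time for a probe to register failure.

    Assumes failure_threshold consecutive failures spaced period_seconds apart;
    probe timeouts and scheduling jitter are ignored.
    """
    return period_seconds * failure_threshold


print(detection_window_seconds(10))  # liveness probe above: 30
print(detection_window_seconds(5))   # readiness probe above: 15
```

With these values a hung instance is restarted after about 30 seconds, and an unready one is removed from service after about 15 seconds.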
Alerting Rules¶
# Alert if database is down
- alert: HeliosDBDown
  expr: heliosdb_up == 0
  for: 1m
  annotations:
    summary: "HeliosDB is down"
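# Alert if memory usage stays above 80% for 5 minutes
# (requires heliosdb_memory_usage_pct to be exported; the sample metrics list only heliosdb_memory_usage_bytes)
- alert: HighMemoryUsage
  expr: heliosdb_memory_usage_pct > 80
  for: 5m
  annotations:
    summary: "Memory usage above 80%"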
# Alert if replication lag exceeds 1 GiB (1073741824 bytes) for 1 minute
- alert: ReplicationLag
  expr: heliosdb_replication_lag_bytes > 1073741824
  for: 1m
  annotations:
    summary: "Replication lag exceeds 1 GiB"
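The `for:` clause in these rules means a threshold breach must persist before the alert fires, so a single spike does not page anyone. The sketch below is a simplified model of that behavior, not Prometheus's actual evaluation loop; the function name and the sample data are illustrative.

```python
def alert_fires(samples, threshold, for_seconds):
    """Return True if value > threshold held continuously for at least for_seconds.

    samples: list of (timestamp_seconds, value) pairs in time order.
    Any sample at or below the threshold resets the pending timer.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                return True
        else:
            pending_since = None
    return False


# Memory usage sampled once a minute, checked against "> 80 for 5m".
sustained = [(t, 85.0) for t in range(0, 301, 60)]
with_dip = [(0, 85.0), (60, 85.0), (120, 70.0), (180, 85.0), (240, 85.0), (300, 85.0)]
print(alert_fires(sustained, 80, 300))  # True
print(alert_fires(with_dip, 80, 300))   # False
```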
Troubleshooting¶
Issue: Health check returns unhealthy¶
-- Check what's unhealthy
SELECT * FROM health_check_details;
-- Check specific components
SELECT * FROM component_health_status;
-- Review logs
SELECT * FROM system_logs WHERE severity = 'ERROR'
ORDER BY timestamp DESC LIMIT 20;
Best Practices¶
- Poll the health endpoint every 30 seconds
- Set up alerts for health check failures
- Monitor health metrics for trends over time
- Include health checks in deployment verification
- Test failover and confirm that health checks detect it
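The first two practices can be combined in a small polling loop. This is a sketch only: the 30-second interval comes from the list above, while the rule of alerting after 3 consecutive failures, the function name `monitor`, and its callback parameters are assumptions chosen for illustration.

```python
import time


def monitor(check, on_alert, interval_seconds=30.0, failures_before_alert=3,
            max_checks=None, sleep=time.sleep):
    """Call check() every interval_seconds; alert once per run of failures.

    check: callable returning True when healthy.
    on_alert: called with the failure count when it reaches failures_before_alert.
    max_checks: stop after this many checks (None runs forever).
    sleep: injectable for testing.
    """
    consecutive = 0
    done = 0
    while max_checks is None or done < max_checks:
        if check():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive == failures_before_alert:
                on_alert(consecutive)
        done += 1
        sleep(interval_seconds)


# Simulated run: one healthy check, four failures, then recovery.
results = iter([True, False, False, False, False, True])
alerts = []
monitor(lambda: next(results), alerts.append, max_checks=6, sleep=lambda s: None)
print(alerts)  # [3]
```

Requiring several consecutive failures before alerting avoids paging on a single dropped request, at the cost of a slower reaction.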
Related Documentation:
- Monitoring Guide
- High Availability Guide