HeliosDB Health Check Guide

Version: 1.0
Last Updated: 2025-11-30

Health Check Endpoints

HTTP Endpoint

# Simple health check
curl http://localhost:5432/health
# Returns: 200 OK if healthy

# Detailed health check
curl http://localhost:5432/health/detailed
# Returns: JSON with full status
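
The same check can be scripted. The following is a minimal sketch, assuming the endpoint behaves as described above (HTTP 200 means healthy). The names `http_status` and `is_healthy`, and the injectable `fetch_status` parameter, are hypothetical and exist only so the logic can be exercised without a running server.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and so on.
        return 0


def is_healthy(base_url: str, fetch_status=http_status) -> bool:
    """Treat only HTTP 200 from /health as healthy (assumption from the text above)."""
    return fetch_status(base_url.rstrip("/") + "/health") == 200
```

Calling `is_healthy("http://localhost:5432")` performs a real request; passing a stub as `fetch_status` keeps the function usable in unit tests.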

Response Format

{
  "status": "healthy",
  "timestamp": "2025-11-30T10:00:00Z",
  "version": "7.0.0",
  "components": {
    "database": "healthy",
    "replication": "healthy",
    "cache": "healthy",
    "storage": "healthy"
  },
  "metrics": {
    "uptime_seconds": 86400,
    "connections": 42,
    "memory_usage_pct": 45.2,
    "disk_usage_pct": 62.5,
    "cpu_usage_pct": 12.3
  }
}
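
A consumer of the detailed response usually wants to know which components are not healthy. The sketch below is illustrative only: the function name `unhealthy_components` is hypothetical, and because the example above shows only the value "healthy", any other component value is simply treated as not healthy.

```python
import json


def unhealthy_components(payload: str) -> list[str]:
    """Return the sorted names of components whose status is not "healthy"."""
    report = json.loads(payload)
    # A missing "components" object is treated as "nothing to report".
    components = report.get("components", {})
    return sorted(name for name, state in components.items() if state != "healthy")
```

An empty result means every listed component reported "healthy"; it does not by itself confirm that the top-level `status` field is "healthy".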

Health Check SQL Commands

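-- Database role: true on a replica (standby), false on the primary
SELECT pg_is_in_recovery() AS is_replica;

-- Replication health: number of replicas currently connected (run on the primary)
SELECT COUNT(*) AS replica_count FROM pg_stat_replication;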

-- Cache health
SELECT cache_hit_rate FROM cache_statistics;

-- Storage health (size in GB; dividing by 1024.0 avoids integer truncation to 0)
SELECT round(pg_database_size(current_database()) / 1024.0 / 1024 / 1024, 2) AS size_gb;

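-- VACUUM/ANALYZE status
-- In PostgreSQL, pg_stat_user_tables names the table column relname.
-- last_vacuum and last_analyze record manual runs only; autovacuum
-- activity is in last_autovacuum and last_autoanalyze.
SELECT
  schemaname,
  relname AS tablename,
  last_vacuum,
  last_analyze
FROM pg_stat_user_tables
ORDER BY last_vacuum DESC;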
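
The vacuum query returns raw timestamps, so something still has to decide what counts as stale. The sketch below assumes rows with the four columns selected above (schema, table, last vacuum, last analyze); the function name `stale_tables` and the 7-day default are hypothetical choices, not HeliosDB recommendations.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(rows, now, max_age=timedelta(days=7)):
    """Return (schema, table) pairs never vacuumed or last vacuumed more than max_age ago."""
    stale = []
    for schemaname, tablename, last_vacuum, _last_analyze in rows:
        # A NULL last_vacuum arrives as None and means "never vacuumed manually".
        if last_vacuum is None or now - last_vacuum > max_age:
            stale.append((schemaname, tablename))
    return stale


# Hand-written rows standing in for a query result.
now = datetime(2025, 11, 30, 10, 0, tzinfo=timezone.utc)
rows = [
    ("public", "orders", now - timedelta(days=1), now),
    ("public", "events", now - timedelta(days=30), None),
    ("public", "audit", None, None),
]
print(stale_tables(rows, now))  # [('public', 'events'), ('public', 'audit')]
```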

Monitoring Integration

Prometheus Metrics

# HELP heliosdb_up Database is up
# TYPE heliosdb_up gauge
heliosdb_up 1

# HELP heliosdb_connections Active connections
# TYPE heliosdb_connections gauge
heliosdb_connections 42

# HELP heliosdb_memory_usage_bytes Memory usage
# TYPE heliosdb_memory_usage_bytes gauge
heliosdb_memory_usage_bytes 1073741824
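
For ad hoc scripts it can help to turn this text output into a dictionary. The sketch below handles only un-labelled samples like the ones shown above; the function name `parse_metrics` is hypothetical, and a real deployment would normally let Prometheus itself scrape and parse the output.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse un-labelled samples from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Each remaining line is "<metric_name> <value>"; labels are not handled.
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples
```

With the sample output above, `parse_metrics(text)["heliosdb_memory_usage_bytes"] / 2**30` evaluates to 1.0, that is, 1 GiB.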

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health
    port: 5432
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 5
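
The probe settings determine how quickly Kubernetes reacts to a failing instance. Neither probe above sets `failureThreshold`, so the Kubernetes default of 3 consecutive failures applies. The sketch below gives a rough upper bound on detection time that ignores probe timeouts; the function name `seconds_to_detect` is hypothetical.

```python
def seconds_to_detect(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate upper bound, in seconds, for consecutive failures to reach the threshold."""
    # Up to one period passes before the first failing probe, then one
    # period between each further failure, so period * threshold bounds the total.
    return period_seconds * failure_threshold


print(seconds_to_detect(10))  # liveness probe: 30
print(seconds_to_detect(5))   # readiness probe: 15
```

Under these assumptions, a hung instance is restarted after roughly 30 seconds and an unready one stops receiving traffic after roughly 15 seconds.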

Alerting Rules

# Alert if database is down
- alert: HeliosDBDown
  expr: heliosdb_up == 0
  for: 1m
  annotations:
    summary: "HeliosDB is down"

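# Alert if memory usage stays above 80% for 5 minutes
# Requires a heliosdb_memory_usage_pct gauge (the sample metrics in this
# guide list only heliosdb_memory_usage_bytes)
- alert: HighMemoryUsage
  expr: heliosdb_memory_usage_pct > 80
  for: 5m
  annotations:
    summary: "Memory usage above 80% for 5 minutes"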

# Alert if replication lag stays above 1 GiB (1073741824 bytes) for 1 minute
- alert: ReplicationLag
  expr: heliosdb_replication_lag_bytes > 1073741824
  for: 1m
  annotations:
    summary: "Replication lag exceeds 1 GiB"
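
The `for:` clause means a rule fires only after its expression has been true at every evaluation for the whole duration; a single false evaluation resets the timer. The sketch below simulates that behaviour for a "greater than" rule. The function name `first_firing_time` is hypothetical, and the 60-second evaluation interval used in the example is an assumption.

```python
def first_firing_time(values, threshold, for_seconds, interval_seconds):
    """Return the time in seconds at which the alert first fires, or None if it never does."""
    pending_since = None
    for i, value in enumerate(values):
        t = i * interval_seconds
        if value > threshold:
            if pending_since is None:
                pending_since = t  # condition just became true: alert is pending
            if t - pending_since >= for_seconds:
                return t  # condition held for the full duration: alert fires
        else:
            pending_since = None  # any false evaluation resets the timer
    return None


# HighMemoryUsage: expr > 80, for: 5m, evaluated every 60 s (assumed).
print(first_firing_time([85, 85, 85, 85, 85, 85], 80, 300, 60))  # 300
print(first_firing_time([85, 85, 85, 70, 85, 85], 80, 300, 60))  # None
```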

Troubleshooting

Issue: Health check returns unhealthy

-- Check what's unhealthy
SELECT * FROM health_check_details;

-- Check specific components
SELECT * FROM component_health_status;

-- Review logs
SELECT * FROM system_logs WHERE severity = 'ERROR'
ORDER BY timestamp DESC LIMIT 20;

Best Practices

- Check health every 30 seconds
- Set up alerts for failures
- Monitor trends over time
- Include health checks in deployment checks
- Test failover with health checks
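
As one way to include health checks in a deployment, a rollout script can wait for several consecutive healthy results before proceeding. The sketch below is an illustration under stated assumptions: `wait_until_healthy` and its parameters are hypothetical, `check` is any callable returning True when the instance is healthy, and the 30-second default follows the interval recommended above.

```python
import time


def wait_until_healthy(check, required_successes=3, max_attempts=20,
                       interval_seconds=30.0, sleep=time.sleep):
    """Return True once check() succeeds required_successes times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a failure breaks the run of successes
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # no need to wait after the final attempt
    return False
```

Requiring a streak rather than a single success avoids promoting an instance that is flapping. The `sleep` parameter is injectable so the gate can be tested without real delays.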

Related Documentation:
- Monitoring Guide
- High Availability Guide