HeliosDB Health Check Guide

Version: 1.0
Last Updated: 2025-11-30


Health Check Endpoints

HTTP Endpoint

# Simple health check
curl http://localhost:5432/health
# Returns: 200 OK if healthy

# Detailed health check
curl http://localhost:5432/health/detailed
# Returns: JSON with full status
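
For scripted checks, the simple endpoint can also be polled from code. The following is a minimal sketch using only the Python standard library; the function name `is_healthy`, the 2-second timeout, and the injectable `opener` parameter are illustrative assumptions, not part of HeliosDB.

```python
import urllib.error
import urllib.request

# Same URL as the curl example; adjust host and port for your deployment.
HEALTH_URL = "http://localhost:5432/health"


def is_healthy(url=HEALTH_URL, timeout=2.0, opener=urllib.request.urlopen):
    """Return True only if the endpoint answers HTTP 200 within `timeout` seconds.

    `opener` is a hypothetical hook that makes the function easy to test
    without a running server; it defaults to urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and error statuses
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False
```

Calling `is_healthy()` with no arguments checks the local instance and treats any network error as unhealthy.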

Response Format

{
  "status": "healthy",
  "timestamp": "2025-11-30T10:00:00Z",
  "version": "7.0.0",
  "components": {
    "database": "healthy",
    "replication": "healthy",
    "cache": "healthy",
    "storage": "healthy"
  },
  "metrics": {
    "uptime_seconds": 86400,
    "connections": 42,
    "memory_usage_pct": 45.2,
    "disk_usage_pct": 62.5,
    "cpu_usage_pct": 12.3
  }
}
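
A consumer of the detailed response usually wants to know which components are not healthy. The sketch below parses a payload in the shape shown above and lists them. The helper name `unhealthy_components` is hypothetical, and the sample assumes a failing component reports the string "unhealthy"; the guide itself only shows "healthy" values.

```python
import json

# Sample payload in the documented shape. The "unhealthy" component value
# is an assumption made for illustration.
SAMPLE = """
{
  "status": "unhealthy",
  "components": {
    "database": "healthy",
    "replication": "unhealthy",
    "cache": "healthy",
    "storage": "healthy"
  }
}
"""


def unhealthy_components(payload):
    """Return the sorted names of components whose state is not "healthy"."""
    report = json.loads(payload)
    components = report.get("components", {})
    return sorted(name for name, state in components.items() if state != "healthy")


failing = unhealthy_components(SAMPLE)
```

Treating every value other than "healthy" as a failure is a deliberately conservative choice: an unknown state is reported instead of silently ignored.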

Health Check SQL Commands

-- Database health
SELECT pg_is_in_recovery() as is_replica;

-- Replication health
SELECT COUNT(*) as replica_count FROM pg_stat_replication;

-- Cache health
SELECT cache_hit_rate FROM cache_statistics;

-- Storage health (size in GiB; the decimal divisor avoids integer truncation to 0 for small databases)
SELECT ROUND(pg_database_size(current_database()) / 1024.0 / 1024 / 1024, 2) as size_gb;

-- VACUUM/ANALYZE status
-- (in PostgreSQL, pg_stat_user_tables exposes the table name as relname)
SELECT
  schemaname,
  relname AS tablename,
  last_vacuum,
  last_analyze
FROM pg_stat_user_tables
ORDER BY last_vacuum DESC;
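
The single-value queries above can be collected into one report from application code. This is a rough sketch that assumes HeliosDB is reachable through a Python DB-API 2.0 driver (for example a PostgreSQL-compatible one), which this guide does not confirm; `run_checks` and the `CHECKS` mapping are hypothetical names.

```python
# Single-value health queries taken from the SQL commands above.
CHECKS = {
    "is_replica": "SELECT pg_is_in_recovery()",
    "replica_count": "SELECT COUNT(*) FROM pg_stat_replication",
}


def run_checks(cursor, checks=CHECKS):
    """Run each query on a DB-API cursor and return {check name: first column}.

    A query that returns no row is recorded as None.
    """
    results = {}
    for name, sql in checks.items():
        cursor.execute(sql)
        row = cursor.fetchone()
        results[name] = row[0] if row else None
    return results
```

With a live connection, `run_checks(connection.cursor())` would return something like a dictionary with `is_replica` and `replica_count` keys; the actual values depend on the cluster.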

Monitoring Integration

Prometheus Metrics

# HELP heliosdb_up Database is up
# TYPE heliosdb_up gauge
heliosdb_up 1

# HELP heliosdb_connections Active connections
# TYPE heliosdb_connections gauge
heliosdb_connections 42

# HELP heliosdb_memory_usage_bytes Memory usage
# TYPE heliosdb_memory_usage_bytes gauge
heliosdb_memory_usage_bytes 1073741824
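
When a full Prometheus server is not available, the same text output can be read directly. The sketch below parses label-free samples like the ones above into a dictionary. It is a simplified reader, not a complete parser for the exposition format: samples with labels are not handled, and `parse_metrics` is a hypothetical helper name.

```python
# The sample output shown above, verbatim.
SAMPLE = """\
# HELP heliosdb_up Database is up
# TYPE heliosdb_up gauge
heliosdb_up 1

# HELP heliosdb_connections Active connections
# TYPE heliosdb_connections gauge
heliosdb_connections 42

# HELP heliosdb_memory_usage_bytes Memory usage
# TYPE heliosdb_memory_usage_bytes gauge
heliosdb_memory_usage_bytes 1073741824
"""


def parse_metrics(text):
    """Parse label-free samples into {metric name: float value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.split()[:2]  # ignore an optional trailing timestamp
        samples[name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
```

Here `metrics["heliosdb_memory_usage_bytes"] / 2**30` gives the memory usage in GiB.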

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health
    port: 5432
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 5432
  initialDelaySeconds: 5
  periodSeconds: 5

Alerting Rules

# Alert if database is down
- alert: HeliosDBDown
  expr: heliosdb_up == 0
  for: 1m
  annotations:
    summary: "HeliosDB is down"

# Alert if memory usage stays above 80% for 5 minutes
- alert: HighMemoryUsage
  expr: heliosdb_memory_usage_pct > 80
  for: 5m
  annotations:
    summary: "High memory usage"

# Alert if replication lag stays above 1 GiB (1073741824 bytes) for 1 minute
- alert: ReplicationLag
  expr: heliosdb_replication_lag_bytes > 1073741824
  for: 1m
  annotations:
    summary: "Replication lag detected"
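
The `for` clause in these rules means the condition must hold continuously for the stated duration before the alert fires; a single sample back under the threshold resets the wait. The sketch below imitates that behaviour on a list of timestamped samples. It is a simplified model for reasoning about thresholds, not how Prometheus evaluates rules internally, and `fires` is a hypothetical name.

```python
def fires(samples, threshold, for_seconds):
    """Return True if value > threshold held continuously for `for_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs ordered by time.
    """
    pending_since = None  # timestamp at which the condition last became true
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # condition broke, so the wait starts over
    return False


# Memory usage (percent) sampled once a minute, checked against the
# HighMemoryUsage rule: threshold 80, for 5m (300 seconds).
steady = [(0, 85), (60, 86), (120, 88), (180, 87), (240, 90), (300, 91)]
dipped = [(0, 85), (60, 86), (120, 70), (180, 87), (240, 90), (300, 91)]

steady_fires = fires(steady, threshold=80, for_seconds=300)
dipped_fires = fires(dipped, threshold=80, for_seconds=300)
```

The second series never fires because the dip at 120 seconds restarts the 5-minute window.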

Troubleshooting

Issue: Health check returns unhealthy

-- Check what's unhealthy
SELECT * FROM health_check_details;

-- Check specific components
SELECT * FROM component_health_status;

-- Review logs
SELECT * FROM system_logs WHERE severity = 'ERROR'
ORDER BY timestamp DESC LIMIT 20;

Best Practices

  1. Check health every 30 seconds
  2. Set up alerts for health check failures
  3. Monitor health metrics for trends over time
  4. Include health checks in deployment verification
  5. Use health checks to verify failover tests
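
One way to apply the 30-second interval in a deployment check is a gate that polls until the service reports healthy or a deadline passes. This is a minimal sketch: `wait_until_healthy`, the 120-second default timeout, and the injectable `sleep` and `clock` parameters are assumptions for illustration, and `check` can be any callable that returns True when the health check passes.

```python
import time


def wait_until_healthy(check, timeout=120.0, interval=30.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `check` every `interval` seconds until it returns True.

    Returns True on the first passing check, or False once another
    attempt would start after the `timeout` deadline.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

A deployment script could call `wait_until_healthy(my_check)` after a rollout and abort or roll back when it returns False. The monotonic clock is used so that wall-clock adjustments do not shorten or extend the deadline.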

Related Documentation:

  - Monitoring Guide
  - High Availability Guide