HeliosDB Operational Runbooks

Version: 1.0
Last Updated: 2025-11-24
Target Release: Limited GA (v7.0)


Overview

This directory contains comprehensive operational runbooks for managing HeliosDB in production environments during the Limited GA phase. Each runbook provides step-by-step procedures, troubleshooting guidance, and best practices for specific operational scenarios.


Runbook Index

1. Deployment Runbook

Purpose: Procedures for deploying HeliosDB updates and new versions

Key Topics:

  • Pre-deployment checklist
  • Rolling update procedure
  • Blue-green deployment steps
  • Rollback procedures
  • Post-deployment validation
  • Common deployment issues

When to Use:

  • Deploying version updates
  • Applying patches
  • Rolling back deployments
  • Validating deployments


2. Incident Response Runbook

Purpose: Structured approach to handling production incidents

Key Topics:

  • Incident classification (P0-P4)
  • Initial response steps
  • Escalation procedures
  • Communication templates
  • Postmortem process
  • Incident examples

When to Use:

  • Service outages
  • Performance degradation
  • Data integrity issues
  • Any production incident


3. Scaling Operations Runbook

Purpose: Manual and automated scaling procedures

Key Topics:

  • Manual scale up/down procedures
  • Auto-scaling configuration
  • Resource monitoring
  • Capacity planning
  • Cost optimization

When to Use:

  • Resource constraints (CPU, memory, disk)
  • Performance optimization
  • Capacity planning
  • Cost reduction


4. Backup and Restore Runbook

Purpose: Comprehensive backup and disaster recovery procedures

Key Topics:

  • Backup verification
  • Point-in-time recovery (PITR) steps
  • Full restore procedure
  • Cross-region restore
  • Recovery time estimation
  • Backup troubleshooting

When to Use:

  • Disaster recovery
  • Data corruption
  • Accidental data deletion
  • Migration scenarios
  • DR testing


5. Database Maintenance Runbook

Purpose: Regular maintenance tasks for database health

Key Topics:

  • VACUUM procedure
  • ANALYZE statistics update
  • Index rebuilding (REINDEX)
  • Table reorganization
  • Query performance analysis
  • Storage management

When to Use:

  • Scheduled maintenance windows
  • Performance degradation
  • Storage bloat
  • Index optimization
  • Query tuning


6. Performance Troubleshooting Runbook

Purpose: Diagnosing and resolving performance issues

Key Topics:

  • Slow query identification
  • High CPU investigation
  • Memory pressure analysis
  • Disk I/O bottlenecks
  • Network latency debugging
  • Performance tuning checklist

When to Use:

  • Slow queries
  • High resource usage
  • System bottlenecks
  • Latency issues
  • Performance optimization


7. GPU Operations Runbook

Purpose: Managing GPU acceleration features

Key Topics:

  • Enable/disable GPU acceleration
  • GPU health monitoring
  • GPU memory management
  • Fallback to CPU procedure
  • GPU troubleshooting
  • CUDA/ROCm diagnostics

When to Use:

  • GPU configuration
  • GPU performance issues
  • GPU memory errors
  • GPU hardware failures
  • CUDA/ROCm updates


8. Multi-Region Operations Runbook

Purpose: Managing multi-region deployments

Key Topics:

  • Region health monitoring
  • Manual failover procedure
  • Consistency verification
  • Cross-region replication checks
  • Region addition/removal
  • Multi-region troubleshooting

When to Use:

  • Regional failovers
  • Adding/removing regions
  • Replication issues
  • Split-brain scenarios
  • Cross-region performance


Quick Start Guide

For New Operators

  1. Familiarize yourself with the core runbooks first:
     • Start with Incident Response
     • Review Deployment
     • Understand Backup and Restore

  2. Set up monitoring and alerts:
     • Configure Prometheus alerts from runbooks
     • Set up Grafana dashboards
     • Test alert routing

  3. Practice procedures in staging:
     • Test deployments
     • Practice failovers
     • Validate backup/restore

  4. Review incident examples:
     • Study P0-P4 incident scenarios
     • Review postmortem templates
     • Understand escalation paths

For Experienced Operators

  • Quick Reference Sections: Each runbook has a "Quick Reference" section at the end with essential commands
  • Decision Trees: Look for decision flowcharts in troubleshooting sections
  • Automation Scripts: Many procedures include automation scripts ready for use

Runbook Usage Guidelines

Before Using a Runbook

  1. Assess the situation:
     • Severity (P0-P4)
     • Impact scope
     • Time sensitivity

  2. Gather diagnostics:
     • System metrics
     • Recent logs
     • Error messages
     • Timeline

  3. Notify stakeholders:
     • On-call team
     • Manager (if P0/P1)
     • Customers (if customer-impacting)
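
The "gather diagnostics" step can be scripted so that every capture lands in one place with a timestamp. The following is a minimal sketch, not an official HeliosDB tool: the hosts, port, path, and unit name are taken from the cheat sheet in Appendix A, while the function names, the output layout, and the use of journald for logs are assumptions made for illustration.

```shell
# capture OUTDIR NAME CMD...: run CMD and save its stdout and stderr to
# OUTDIR/NAME.txt, prefixed with a UTC timestamp and the command line.
# It always returns 0 so one missing tool does not abort the collection.
capture() {
    capture_dir=$1
    capture_name=$2
    shift 2
    {
        echo "# $(date -u +%Y-%m-%dT%H:%M:%SZ) \$ $*"
        "$@" || echo "# command exited with status $?"
    } > "$capture_dir/$capture_name.txt" 2>&1
    return 0
}

# collect_diagnostics [OUTDIR]: hypothetical wrapper that captures the
# cheat-sheet checks; prints the directory it wrote to.
collect_diagnostics() {
    diag_dir=${1:-"diagnostics-$(date -u +%Y%m%dT%H%M%SZ)"}
    mkdir -p "$diag_dir" || return 1
    capture "$diag_dir" health  curl -sS --max-time 5 http://heliosdb-lb:7000/health
    capture "$diag_dir" metrics curl -sS --max-time 5 http://heliosdb-lb:7000/metrics
    capture "$diag_dir" disk    df -h /var/lib/heliosdb
    capture "$diag_dir" service systemctl status heliosdb --no-pager
    # Assumes the service logs to journald under the unit name "heliosdb".
    capture "$diag_dir" logs    journalctl -u heliosdb --since "1 hour ago" --no-pager
    echo "$diag_dir"
}
```

Running `collect_diagnostics` at the start of an incident leaves a directory that can be attached to the incident record and used to reconstruct the timeline.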

During Procedure Execution

  1. Follow steps sequentially (unless explicitly stated otherwise)
  2. Document actions (timestamps, commands, results)
  3. Validate after each step (don't skip verification)
  4. Communicate progress (war room updates every 15-30 minutes)
  5. Know when to escalate (if stuck for more than 15 minutes or the procedure fails)
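
One lightweight way to document actions as you go is to wrap each command so that its timestamp and exit status are appended to a log. This is a sketch under stated assumptions: `run_logged` and the `ACTION_LOG` variable are hypothetical names, and command output is not captured, only the command line and its result code.

```shell
# File that receives one line per executed command (assumed location).
ACTION_LOG=${ACTION_LOG:-./incident-actions.log}

# run_logged CMD...: run CMD, then append "<UTC time> status=<code> <command>"
# to $ACTION_LOG. The exit status of CMD is passed through unchanged.
run_logged() {
    run_started=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    run_status=0
    "$@" || run_status=$?
    printf '%s\tstatus=%s\t%s\n' "$run_started" "$run_status" "$*" >> "$ACTION_LOG"
    return "$run_status"
}
```

For example, `run_logged systemctl status heliosdb` runs the command as usual and leaves a timestamped entry that can be pasted into the postmortem timeline.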

After Procedure Completion

  1. Validate success:
     • Run health checks
     • Monitor metrics (30-60 minutes)
     • Verify customer impact resolved

  2. Document the incident:
     • Create postmortem (P0/P1)
     • Update runbook if needed
     • Share learnings with team

  3. Follow up:
     • Complete action items
     • Update monitoring/alerts
     • Schedule preventive maintenance
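
The 30-60 minute observation window can be approximated with a loop that repeats a health check and stops at the first failure. The sketch below assumes a helper named `hold_healthy` (hypothetical) and assumes the `/health` endpoint from Appendix A returns a non-2xx status when the service is unhealthy, which this document does not confirm.

```shell
# hold_healthy CHECKS INTERVAL_SECONDS CMD...
# Runs CMD up to CHECKS times, sleeping INTERVAL_SECONDS between runs.
# Returns 0 only if every run succeeds; returns 1 at the first failure.
hold_healthy() {
    hold_checks=$1
    hold_interval=$2
    shift 2
    hold_i=1
    while [ "$hold_i" -le "$hold_checks" ]; do
        if ! "$@" > /dev/null 2>&1; then
            echo "check $hold_i of $hold_checks failed: $*" >&2
            return 1
        fi
        if [ "$hold_i" -lt "$hold_checks" ]; then
            sleep "$hold_interval"
        fi
        hold_i=$((hold_i + 1))
    done
    return 0
}
```

For instance, `hold_healthy 30 60 curl -fsS --max-time 5 http://heliosdb-lb:7000/health` performs 30 checks one minute apart, roughly the lower bound of the monitoring window above.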

Common Scenarios and Runbook Selection

Scenario: Service is Down

Incident Response Runbook - Section 6.1: Complete Service Outage

Scenario: Deploying a New Version

Deployment Runbook

  • Section 2: Rolling Update Procedure (backward compatible)
  • Section 3: Blue-Green Deployment (major version)

Scenario: Slow Queries

Performance Troubleshooting Runbook - Section 1: Slow Query Identification

Database Maintenance Runbook - Section 5: Query Performance Analysis

Scenario: Running Out of Disk Space

Incident Response Runbook - Section 6.3: Disk Space Exhaustion

Database Maintenance Runbook - Section 6: Storage Management

Scenario: Need to Restore Data

Backup and Restore Runbook

  • Section 2: Point-in-Time Recovery (specific time)
  • Section 3: Full Restore (complete disaster)

Scenario: High CPU Usage

Performance Troubleshooting Runbook - Section 2: High CPU Investigation

Scaling Operations Runbook - Section 1: Manual Scaling Procedures

Scenario: Primary Region Failure

Multi-Region Operations Runbook - Section 2.4: Emergency Failover Procedure

Scenario: GPU Not Working

GPU Operations Runbook - Section 5: GPU Troubleshooting

Scenario: Replication Lag High

Multi-Region Operations Runbook - Section 4.3: Replication Troubleshooting

Incident Response Runbook - Section 6.2: Replication Lag Example

Scenario: Scheduled Maintenance

Database Maintenance Runbook

  • Section 1: VACUUM Procedure
  • Section 3: Index Rebuilding
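
For on-call use, the scenario list above could be turned into a small lookup helper. This is only an illustrative sketch: the function name `runbook_for` and the scenario keys are invented here, while the runbook and section names are copied from the list.

```shell
# runbook_for SCENARIO: print the runbook section(s) listed for a scenario.
# Scenario keys are hypothetical shorthand; returns 1 for an unknown key.
runbook_for() {
    case $1 in
        service-down)
            echo "Incident Response Runbook - Section 6.1: Complete Service Outage" ;;
        deploy)
            echo "Deployment Runbook - Section 2: Rolling Update Procedure (backward compatible)"
            echo "Deployment Runbook - Section 3: Blue-Green Deployment (major version)" ;;
        slow-queries)
            echo "Performance Troubleshooting Runbook - Section 1: Slow Query Identification"
            echo "Database Maintenance Runbook - Section 5: Query Performance Analysis" ;;
        disk-full)
            echo "Incident Response Runbook - Section 6.3: Disk Space Exhaustion"
            echo "Database Maintenance Runbook - Section 6: Storage Management" ;;
        restore)
            echo "Backup and Restore Runbook - Section 2: Point-in-Time Recovery (specific time)"
            echo "Backup and Restore Runbook - Section 3: Full Restore (complete disaster)" ;;
        high-cpu)
            echo "Performance Troubleshooting Runbook - Section 2: High CPU Investigation"
            echo "Scaling Operations Runbook - Section 1: Manual Scaling Procedures" ;;
        region-failure)
            echo "Multi-Region Operations Runbook - Section 2.4: Emergency Failover Procedure" ;;
        gpu)
            echo "GPU Operations Runbook - Section 5: GPU Troubleshooting" ;;
        replication-lag)
            echo "Multi-Region Operations Runbook - Section 4.3: Replication Troubleshooting"
            echo "Incident Response Runbook - Section 6.2: Replication Lag Example" ;;
        maintenance)
            echo "Database Maintenance Runbook - Section 1: VACUUM Procedure"
            echo "Database Maintenance Runbook - Section 3: Index Rebuilding" ;;
        *)
            echo "unknown scenario: $1" >&2
            return 1 ;;
    esac
}
```

Calling `runbook_for disk-full` prints the two sections listed for disk space exhaustion; an unrecognized key prints an error and returns a non-zero status.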


Support and Escalation

Level 1: On-Call Engineer

  • Responsibility: Execute runbooks, gather diagnostics
  • Contact: PagerDuty rotation
  • Response SLA: 5 minutes

Level 2: Senior SRE

  • Responsibility: Non-standard procedures, cross-team coordination
  • Contact: PagerDuty + #heliosdb-oncall
  • Response SLA: 15 minutes

Level 3: Engineering Manager

  • Responsibility: Service degradation decisions, customer communication
  • Contact: Direct phone + Slack
  • Response SLA: 30 minutes

Level 4: CTO

  • Responsibility: Major incidents, executive decisions
  • Contact: Emergency phone
  • Response SLA: Best effort

Escalation Criteria: See Incident Response Runbook - Section 3


Runbook Maintenance

Updating Runbooks

When to Update:

  • After incident postmortems (lessons learned)
  • System changes (new features, configuration changes)
  • Process improvements
  • Feedback from operators

How to Update:

  1. Create branch: git checkout -b update-runbook-xyz
  2. Edit runbook(s)
  3. Update "Last Updated" date
  4. Add entry to "Revision History"
  5. Create PR and get review
  6. Merge and notify team
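
The "Last Updated" step is easy to forget, so it may be worth scripting. The sketch below assumes each runbook carries a line of the form `Last Updated: YYYY-MM-DD`, as this index does; the helper name `bump_last_updated` and the file name in the usage comment are hypothetical.

```shell
# bump_last_updated FILE [DATE]: rewrite the date that follows "Last Updated"
# in FILE. DATE defaults to today (YYYY-MM-DD). Other lines are left untouched.
bump_last_updated() {
    bump_file=$1
    bump_date=${2:-$(date +%Y-%m-%d)}
    bump_tmp=$(mktemp) || return 1
    bump_status=0
    # Write to a temp file first, then copy back so permissions are preserved.
    sed "s/\(Last Updated[^0-9]*\)[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/\1$bump_date/" \
        "$bump_file" > "$bump_tmp" && cat "$bump_tmp" > "$bump_file" || bump_status=$?
    rm -f "$bump_tmp"
    return "$bump_status"
}

# Possible use inside the update workflow (file name is a placeholder):
#   git checkout -b update-runbook-xyz
#   bump_last_updated DEPLOYMENT_RUNBOOK.md
#   git add DEPLOYMENT_RUNBOOK.md
#   git commit -m "Update deployment runbook"
#   git push -u origin update-runbook-xyz
```

The "Revision History" entry and the pull request itself are still manual steps in this sketch.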

Feedback

Send feedback or suggestions to:

  • Slack: #heliosdb-ops
  • Email: heliosdb-ops@company.com
  • GitHub Issues: Tag with runbook label


Additional Resources

Internal Documentation

External Resources

Training

  • HeliosDB Operations Bootcamp (internal)
  • PostgreSQL DBA Certification
  • Kubernetes Administrator Certification
  • AWS Solutions Architect

Appendix

A. Common Commands Cheat Sheet

# Health checks
curl http://heliosdb-lb:7000/health
psql -h heliosdb-lb -U admin -c "SELECT version();"

# Metrics
# -s suppresses curl's progress meter, which otherwise clutters piped output
curl -s http://heliosdb-lb:7000/metrics | grep query_duration
curl -s http://heliosdb-lb:7000/metrics | grep error_rate

# Replication status
psql -h heliosdb-primary -U admin -c "SELECT * FROM pg_stat_replication;"

# Active queries
psql -h heliosdb-lb -U admin -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"

# Slow queries
psql -h heliosdb-lb -U admin -c "SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Disk space
df -h /var/lib/heliosdb

# Service control
systemctl status heliosdb
systemctl restart heliosdb
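
The disk space check above can be wrapped into a pass/fail test for use in scripts or cron jobs. This is a sketch only: the helper names are hypothetical and the 80% default threshold is an assumption, not a documented HeliosDB limit.

```shell
# disk_used_percent PATH: print the used-space percentage (digits only) of the
# filesystem that holds PATH, using the POSIX output format of df.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH [LIMIT]: return 0 if usage is below LIMIT percent (default
# 80, an assumed value), 1 if at or above it, 2 if usage cannot be read.
check_disk() {
    check_path=$1
    check_limit=${2:-80}
    check_used=$(disk_used_percent "$check_path")
    if [ -z "$check_used" ]; then
        echo "ERROR: could not read disk usage for $check_path" >&2
        return 2
    fi
    if [ "$check_used" -ge "$check_limit" ]; then
        echo "WARN: $check_path is ${check_used}% full (limit ${check_limit}%)"
        return 1
    fi
    echo "OK: $check_path is ${check_used}% full"
}
```

For example, `check_disk /var/lib/heliosdb 85` warns once the data volume reaches 85% usage.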

B. Monitoring Dashboards

  • Overview Dashboard: http://grafana.company.com/d/heliosdb-overview
  • Performance Dashboard: http://grafana.company.com/d/heliosdb-performance
  • Multi-Region Dashboard: http://grafana.company.com/d/heliosdb-multi-region
  • GPU Dashboard: http://grafana.company.com/d/heliosdb-gpu

C. Emergency Contacts

Role                 | Contact                | Backup
On-Call Primary      | PagerDuty              | @oncall-primary
On-Call Senior       | PagerDuty + Slack      | @oncall-senior
Engineering Manager  | Slack DM               | @eng-manager-heliosdb
Database Team Lead   | Slack DM               | @dba-lead
Cloud Operations     | cloud-ops@company.com  | #cloud-ops

D. War Room Procedures

When to Create War Room:

  • P0/P1 incidents
  • Major deployments
  • Planned failovers

How to Create:

  1. Start Zoom meeting: /zoom start-meeting --incident <ID>
  2. Post in Slack: #incident-war-room
  3. Update status page: https://status.company.com
  4. Notify stakeholders


License

Copyright 2025 HeliosDB Team. Internal use only.


For urgent assistance during incidents, refer to the Incident Response Runbook first.