
HeliosDB Rollback Procedures

Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17


Table of Contents

  1. Overview
  2. When to Roll Back
  3. Rollback Decision Matrix
  4. Pre-Rollback Checklist
  5. Rollback Procedures
  6. Post-Rollback Validation
  7. Incident Documentation
  8. Best Practices
  9. Emergency Contacts
  10. Rollback Automation

Overview

This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical bugs, outages, or performance regressions.

Rollback Strategy

HeliosDB supports multiple rollback strategies:

  1. Service-Level Rollback: Roll back an individual service (recommended)
  2. Full Stack Rollback: Roll back the entire deployment
  3. Data Rollback: Restore the database from backup (last resort)

Recovery Time Objectives (RTO)

  • Service-Level Rollback: < 5 minutes
  • Full Stack Rollback: < 15 minutes
  • Data Rollback: < 30 minutes (depending on backup size)

When to Roll Back

Rollback Triggers

Execute rollback immediately when:

  1. Service Unavailability
     • Service down for > 5 minutes
     • Cannot be resolved by restart
     • Affecting critical functionality

  2. Data Corruption
     • Inconsistent data detected
     • Data loss occurring
     • Database integrity compromised

  3. Security Vulnerability
     • Critical security flaw discovered
     • Exploit actively being used
     • Compliance breach

  4. Performance Degradation
     • Latency increase > 500%
     • Error rate > 25%
     • Resource exhaustion

  5. Compliance Violations
     • Audit log failures
     • Compliance framework violations
     • Regulatory breach risk

Don't Roll Back When

  • Minor bugs that don't affect core functionality
  • Cosmetic issues
  • Performance degradation below the rollback thresholds (e.g., latency increase < 100%)
  • Issues that can be hotfixed quickly (< 30 minutes)

Rollback Decision Matrix

Severity      | Impact                     | Response          | Rollback?
--------------|----------------------------|-------------------|----------------------------------
P1 (Critical) | Service down, data loss    | Immediate         | Yes, immediate
P1 (Critical) | Security breach            | Immediate         | Yes, immediate
P2 (High)     | Major degradation          | Within 15 min     | Yes, if no quick fix
P2 (High)     | Partial functionality loss | Within 30 min     | Consider, after attempting a fix
P3 (Medium)   | Minor degradation          | Within 1 hour     | No, fix forward
P4 (Low)      | Cosmetic issues            | Next business day | No

Pre-Rollback Checklist

Before executing a rollback:

1. Incident Assessment

  • [ ] Confirm severity level (P1-P4)
  • [ ] Identify affected services
  • [ ] Document symptoms and error messages
  • [ ] Check if issue is deployment-related
  • [ ] Verify rollback is appropriate response

2. Communication

  • [ ] Notify team via Slack/email
  • [ ] Identify incident commander
  • [ ] Create incident ticket/doc
  • [ ] Prepare status update for stakeholders

3. Backup Verification

  • [ ] Verify backup availability
  • [ ] Check backup timestamp
  • [ ] Confirm backup integrity
  • [ ] Test backup accessibility
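
A quick way to cover the integrity items above for custom-format dumps: pg_restore can read a dump's table of contents without restoring anything. A minimal sketch, assuming dumps named backup_YYYYMMDD_HHMMSS.dump under /backups/heliosdb/ as in the procedures below:

# Locate the newest dump and verify it is a readable archive
newest=$(ls -t /backups/heliosdb/*.dump | head -1)
echo "Latest backup: $newest"

# pg_restore --list parses the archive's table of contents;
# a non-zero exit indicates a corrupt or unreadable dump
pg_restore --list "$newest" > /dev/null && echo "Backup integrity OK"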

4. Rollback Plan

  • [ ] Determine rollback scope (service vs full stack)
  • [ ] Identify target version/commit
  • [ ] Review dependencies
  • [ ] Prepare rollback commands

Rollback Procedures

Procedure 1: Service-Level Rollback (Docker Compose)

Use Case: Roll back an individual service to a previous version

Duration: 3-5 minutes

Step 1: Identify Previous Version

# Check deployment history
docker images | grep heliosdb

# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1

Step 2: Update Service to Previous Version

cd /home/claude/HeliosDB/deployment/staging

# Edit docker-compose.yml
# Change image tag for affected service
# Example:
# conversational-bi:
#   image: heliosdb-conversational-bi:v7.0.0-rc1  # Rollback to this

# Or pull previous version
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest

Step 3: Restart Service

# Restart specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi

# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi

Step 4: Verify Rollback

# Check service health
curl http://localhost:8081/health

# Check service version
curl http://localhost:8081/version

# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'

Procedure 2: Full Stack Rollback (Docker Compose)

Use Case: Roll back the entire deployment to a previous stable state

Duration: 10-15 minutes

Step 1: Stop Current Deployment

cd /home/claude/HeliosDB/deployment/staging

# Stop all services
docker compose -f docker-compose.yml down

Step 2: Checkout Previous Version

cd /home/claude/HeliosDB

# List available tags
git tag -l

# Checkout previous stable version
git checkout v7.0.0-rc1  # Replace with target version

# Verify checkout
git log -1

Step 3: Rebuild Images (if needed)

# Rebuild all images from previous version
docker compose -f deployment/staging/docker-compose.yml build

# Verify images
docker images | grep heliosdb

Step 4: Restore Configuration

# Restore previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
   /home/claude/HeliosDB/deployment/staging/.env

# Verify configuration
cat deployment/staging/.env

Step 5: Start Services

# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis

# Wait for ready (30 seconds)
sleep 30

# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
  conversational-bi \
  compliance \
  embedded-cloud-sync

# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana

# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx

Step 6: Verify Full Stack

# Check all services
docker compose -f deployment/staging/docker-compose.yml ps

# Test all health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

Procedure 3: Service-Level Rollback (Kubernetes)

Use Case: Roll back an individual Kubernetes deployment

Duration: 5-10 minutes

Step 1: Check Deployment History

# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Update to v7.0.0
# 3         Current deployment

Step 2: Rollback Deployment

# Rollback to previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging

# Or rollback to specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging

Step 3: Monitor Rollback

# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging

# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi

Step 4: Verify Rollback

# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging

# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health

# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi

Procedure 4: Full Stack Rollback (Kubernetes)

Use Case: Roll back all Kubernetes deployments

Duration: 15-20 minutes

Step 1: Rollback All Deployments

# Rollback all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging

Step 2: Monitor Rollbacks

# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging

# Check all pods
watch kubectl get pods -n heliosdb-staging

Step 3: Verify All Services

# Check all deployments
kubectl get deployments -n heliosdb-staging

# Check all service endpoints
kubectl get endpoints -n heliosdb-staging

# Test health checks
# All containers are assumed to serve /health on container port 8081;
# adjust the port per service if the container ports differ
for service in conversational-bi compliance embedded-cloud-sync; do
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/$service -- curl -s http://localhost:8081/health
done

Procedure 5: Database Rollback (LAST RESORT)

Use Case: Data corruption, need to restore from backup

Duration: 30-60 minutes (depending on backup size)

WARNING: This will result in data loss for all transactions committed after the backup timestamp!

Step 1: Assess Data Loss Window

# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5

# Calculate data loss window
# Example: Backup from 6 hours ago = 6 hours of data loss
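
# A sketch for computing the window automatically (assumes GNU coreutils
# and the timestamped dump files used elsewhere in this document)
newest=$(ls -t /backups/heliosdb/*.dump | head -1)
age_hours=$(( ($(date +%s) - $(stat -c %Y "$newest")) / 3600 ))
echo "Approximate data loss window: $age_hours hours"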

Step 2: Stop All Services

# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down

# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging

Step 3: Backup Current Database (Even if Corrupted)

# Create emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres

docker exec heliosdb-postgres \
  pg_dump -U heliosdb_admin -Fc heliosdb > \
  /backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump

docker compose -f deployment/staging/docker-compose.yml stop postgres

Step 4: Restore from Backup

# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres

# Wait for ready
sleep 30

# Drop current database (fails if sessions are still connected,
# so ensure all services are stopped first)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"

# Create fresh database
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"

# Restore from backup
docker exec -i heliosdb-postgres \
  pg_restore -U heliosdb_admin -d heliosdb -v \
  < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump

Step 5: Verify Database Integrity

# Check database size
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "\l+"

# Check table counts
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "
    SELECT schemaname, tablename,
           n_live_tup as row_count
    FROM pg_stat_user_tables
    ORDER BY n_live_tup DESC;"

# Run a basic sanity pass (VACUUM ANALYZE reads every table and refreshes statistics)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"

Step 6: Restart Services

# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d

# Or for Kubernetes (restore the original replica counts if they were not 1)
kubectl scale deployment --all --replicas=1 -n heliosdb-staging

Post-Rollback Validation

1. Health Checks

# Check all service health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# All should return: {"status": "healthy"}
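
Rather than eyeballing three responses, a short loop can assert health across all services; a minimal sketch using the ports listed above:

# Fail fast if any service does not report healthy
for port in 8081 8082 8083; do
  body=$(curl -s http://localhost:$port/health)
  echo "$body" | grep -q '"status": *"healthy"' \
    || { echo "Service on port $port unhealthy: $body"; exit 1; }
done
echo "All services healthy"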

2. Smoke Tests

Conversational BI

curl -X POST http://localhost:8081/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me users", "database": "heliosdb"}'

Compliance

curl http://localhost:8082/api/v1/compliance/status

Embedded+Cloud

curl http://localhost:8083/api/v1/sync/status

3. Metrics Validation

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# All should show: "health": "up"

4. Log Analysis

# Check for errors in last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error

# Should see no critical errors

5. Load Testing (Optional)

# Run light load test to verify stability
# See VALIDATION_TEST_SUITE.md for test scripts
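
If the scripted suite is unavailable, a crude stand-in can be improvised with curl. This is a sketch only (the request count and target endpoint are arbitrary), not a substitute for the validation suite:

# Send 100 sequential requests and count non-200 responses
errors=0
for i in $(seq 1 100); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8081/health)
  [ "$code" = "200" ] || errors=$((errors + 1))
done
echo "Non-200 responses: $errors/100"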

Incident Documentation

Rollback Checklist

After the rollback completes:

  • [ ] Document root cause of issue
  • [ ] Record rollback timeline
  • [ ] Update incident ticket
  • [ ] Notify stakeholders of resolution
  • [ ] Schedule post-mortem meeting
  • [ ] Create action items for prevention
  • [ ] Update runbooks if needed

Post-Mortem Template

# Incident Post-Mortem: [Service Name] Rollback

**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored

## Root Cause
Detailed analysis of what caused the issue.

## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]

## Resolution
How the issue was resolved (rollback details).

## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]

## Lessons Learned
What we learned and how to prevent similar incidents in the future.

Best Practices

  1. Always Backup First: Before any rollback, ensure backups exist
  2. Test Rollbacks: Regularly test rollback procedures in staging
  3. Version Tagging: Use semantic versioning and git tags
  4. Keep Previous Images: Retain the last 5 Docker images (see the sketch after this list)
  5. Document Changes: Maintain deployment changelog
  6. Gradual Rollout: Use canary or blue-green deployments when possible
  7. Quick Decision: Don't delay rollback if criteria are met
  8. Communicate: Keep team informed throughout process
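
For practice 4, image retention can be scripted. A hedged sketch that keeps the five most recently created heliosdb images overall (refine per repository as needed; image naming follows the examples in this document, and images still in use by containers will simply be refused by docker rmi):

# docker images lists newest first; keep the first 5, remove the rest
docker images --format '{{.Repository}}:{{.Tag}}' \
  --filter 'reference=heliosdb-*' \
  | tail -n +6 \
  | xargs -r docker rmi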

Emergency Contacts

  • On-Call Engineer: [PagerDuty/Phone]
  • Team Lead: [Contact]
  • DevOps: [Contact]
  • Database Admin: [Contact]

Rollback Automation

For faster rollbacks, consider implementing:

  1. Automated Rollback Scripts: Pre-built scripts for common scenarios (see the sketch below)
  2. Feature Flags: Toggle features without deployment
  3. Blue-Green Deployments: Instant switchback capability
  4. Canary Releases: Gradual rollout with automatic rollback
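
As a starting point for item 1, a minimal service-level script that wraps Procedure 1 above; the compose path, image naming, and health port are the ones used in this document and are assumptions to adjust:

#!/usr/bin/env bash
# rollback-service.sh <service> <image:tag> -- sketch of an automated
# service-level rollback following Procedure 1
set -euo pipefail

SERVICE=$1   # e.g. conversational-bi
IMAGE=$2     # e.g. heliosdb-conversational-bi:v7.0.0-rc1
COMPOSE=/home/claude/HeliosDB/deployment/staging/docker-compose.yml

docker pull "$IMAGE"
docker tag "$IMAGE" "${IMAGE%%:*}:latest"
docker compose -f "$COMPOSE" up -d --force-recreate "$SERVICE"

# Give the service a moment to start, then verify health
# (port 8081 assumed; parameterize for other services)
sleep 15
curl -fsS "http://localhost:8081/health"
echo "Rollback of $SERVICE complete"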

Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.