HeliosDB Rollback Procedures¶
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents¶
- Overview
- When to Roll Back
- Rollback Decision Matrix
- Pre-Rollback Checklist
- Rollback Procedures
- Post-Rollback Validation
- Incident Documentation
Overview¶
This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical issues, bugs, or performance problems.
Rollback Strategy¶
HeliosDB supports multiple rollback strategies:
- Service-Level Rollback: Roll back an individual service (recommended)
- Full Stack Rollback: Roll back the entire deployment
- Data Rollback: Restore the database from backup (last resort)
Recovery Time Objectives (RTO)¶
- Service-Level Rollback: < 5 minutes
- Full Stack Rollback: < 15 minutes
- Data Rollback: < 30 minutes (depending on backup size)
When to Roll Back¶
Rollback Triggers¶
Execute rollback immediately when:
- Service Unavailability
  - Service down for > 5 minutes
  - Cannot be resolved by restart
  - Affecting critical functionality
- Data Corruption
  - Inconsistent data detected
  - Data loss occurring
  - Database integrity compromised
- Security Vulnerability
  - Critical security flaw discovered
  - Exploit actively being used
  - Compliance breach
- Performance Degradation
  - Latency increase > 500%
  - Error rate > 25%
  - Resource exhaustion
- Compliance Violations
  - Audit log failures
  - Compliance framework violations
  - Regulatory breach risk
Don't Roll Back When¶
- Minor bugs that don't affect core functionality
- Cosmetic issues
- Performance degradation < 100% (latency less than doubled)
- Issues that can be hotfixed quickly (< 30 minutes)
Rollback Decision Matrix¶
| Severity | Impact | Response | Rollback? |
|---|---|---|---|
| P1 Critical | Service down, data loss | Immediate | Yes, immediate |
| P1 Critical | Security breach | Immediate | Yes, immediate |
| P2 High | Major degradation | Within 15 min | Yes, if no quick fix |
| P2 High | Partial functionality loss | Within 30 min | Consider, after attempting a fix |
| P3 Medium | Minor degradation | Within 1 hour | No, fix forward |
| P4 Low | Cosmetic issues | Next business day | No |
Pre-Rollback Checklist¶
Before executing rollback:
1. Incident Assessment¶
- [ ] Confirm severity level (P1-P4)
- [ ] Identify affected services
- [ ] Document symptoms and error messages
- [ ] Check if issue is deployment-related
- [ ] Verify rollback is appropriate response
2. Communication¶
- [ ] Notify team via Slack/email
- [ ] Identify incident commander
- [ ] Create incident ticket/doc
- [ ] Prepare status update for stakeholders
3. Backup Verification¶
- [ ] Verify backup availability
- [ ] Check backup timestamp
- [ ] Confirm backup integrity
- [ ] Test backup accessibility
4. Rollback Plan¶
- [ ] Determine rollback scope (service vs full stack)
- [ ] Identify target version/commit
- [ ] Review dependencies
- [ ] Prepare rollback commands
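To make the rollback plan concrete, here is a minimal sketch of a state-capture helper to run before any rollback; it assumes the staging paths and Compose file used elsewhere in this runbook.
# pre_rollback_snapshot.sh -- record current state so the rollback target is unambiguous (sketch)
cd /home/claude/HeliosDB
# Record the currently deployed commit
git rev-parse HEAD > /tmp/pre_rollback_commit.txt
# Record which image each service is running
docker compose -f deployment/staging/docker-compose.yml images > /tmp/pre_rollback_images.txt
# Preserve the current environment file (matches the .env.backup.YYYYMMDD pattern used in Procedure 2)
cp deployment/staging/.env deployment/staging/.env.backup.$(date +%Y%m%d)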
Rollback Procedures¶
Procedure 1: Service-Level Rollback (Docker Compose)¶
Use Case: Roll back an individual service to its previous version
Duration: 3-5 minutes
Step 1: Identify Previous Version¶
# Check deployment history
docker images | grep heliosdb
# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1
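If several candidate tags exist, a formatted listing (standard Docker CLI templating) makes creation times easier to compare:
# Optional: show heliosdb images with creation times side by side
docker images --format '{{.Repository}}:{{.Tag}}\t{{.CreatedAt}}' | grep heliosdb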
Step 2: Update Service to Previous Version¶
cd /home/claude/HeliosDB/deployment/staging
# Edit docker-compose.yml
# Change image tag for affected service
# Example:
# conversational-bi:
# image: heliosdb-conversational-bi:v7.0.0-rc1 # Rollback to this
# Or pull previous version
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest
Step 3: Restart Service¶
# Restart specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi
# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi
Step 4: Verify Rollback¶
# Check service health
curl http://localhost:8081/health
# Check service version
curl http://localhost:8081/version
# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'
Procedure 2: Full Stack Rollback (Docker Compose)¶
Use Case: Roll back the entire deployment to the previous stable state
Duration: 10-15 minutes
Step 1: Stop Current Deployment¶
cd /home/claude/HeliosDB/deployment/staging
# Stop all services
docker compose -f docker-compose.yml down
Step 2: Checkout Previous Version¶
cd /home/claude/HeliosDB
# List available tags
git tag -l
# Checkout previous stable version
git checkout v7.0.0-rc1 # Replace with target version
# Verify checkout
git log -1
Step 3: Rebuild Images (if needed)¶
# Rebuild all images from previous version
docker compose -f deployment/staging/docker-compose.yml build
# Verify images
docker images | grep heliosdb
Step 4: Restore Configuration¶
# Restore previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
/home/claude/HeliosDB/deployment/staging/.env
# Verify configuration
cat deployment/staging/.env
Step 5: Start Services¶
# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis
# Wait for ready (30 seconds)
sleep 30
# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
conversational-bi \
compliance \
embedded-cloud-sync
# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana
# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx
Step 6: Verify Full Stack¶
# Check all services
docker compose -f deployment/staging/docker-compose.yml ps
# Test all health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
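For a scripted gate, the same checks can be looped with an explicit failure exit (ports per the Compose mapping above):
# Fail fast if any service is unhealthy
for port in 8081 8082 8083; do
  if ! curl -sf http://localhost:$port/health > /dev/null; then
    echo "Service on port $port is unhealthy" >&2
    exit 1
  fi
done
echo "All services healthy"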
Procedure 3: Service-Level Rollback (Kubernetes)¶
Use Case: Roll back an individual Kubernetes deployment
Duration: 5-10 minutes
Step 1: Check Deployment History¶
# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging
# Example output:
# REVISION CHANGE-CAUSE
# 1 Initial deployment
# 2 Update to v7.0.0
# 3 Current deployment
Step 2: Rollback Deployment¶
# Rollback to previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
# Or rollback to specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging
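Optionally record why the rollback happened; kubectl fills the CHANGE-CAUSE column in rollout history from the kubernetes.io/change-cause annotation, so setting it keeps the history in Step 1 meaningful.
# Optional: record the rollback reason for future 'rollout history' output
kubectl annotate deployment/conversational-bi -n heliosdb-staging \
  kubernetes.io/change-cause="Rollback to revision 2, see incident ticket" --overwrite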
Step 3: Monitor Rollback¶
# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi
Step 4: Verify Rollback¶
# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging
# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health
# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi
Procedure 4: Full Stack Rollback (Kubernetes)¶
Use Case: Roll back all Kubernetes deployments
Duration: 15-20 minutes
Step 1: Rollback All Deployments¶
# Rollback all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging
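The same three commands can be expressed as a loop, which also scales if more services are added later:
# Equivalent loop form
for d in conversational-bi compliance embedded-cloud-sync; do
  kubectl rollout undo deployment/$d -n heliosdb-staging
done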
Step 2: Monitor Rollbacks¶
# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging
# Check all pods
watch kubectl get pods -n heliosdb-staging
Step 3: Verify All Services¶
# Check all deployments
kubectl get deployments -n heliosdb-staging
# Check all service endpoints
kubectl get endpoints -n heliosdb-staging
# Test health checks (container ports assumed to match the Compose mapping: 8081/8082/8083)
for svc_port in conversational-bi:8081 compliance:8082 embedded-cloud-sync:8083; do
  service=${svc_port%%:*}; port=${svc_port##*:}
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/$service -- curl -s http://localhost:$port/health
done
Procedure 5: Database Rollback (LAST RESORT)¶
Use Case: Data corruption, need to restore from backup
Duration: 30-60 minutes (depending on backup size)
WARNING: This will result in data loss for all transactions committed after the backup timestamp!
Step 1: Assess Data Loss Window¶
# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5
# Calculate data loss window
# Example: Backup from 6 hours ago = 6 hours of data loss
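A sketch for computing the window automatically, assuming backups are .dump files under /backups/heliosdb/ with modification times intact (GNU coreutils):
# Print the newest backup and its approximate age in hours
latest=$(ls -t /backups/heliosdb/*.dump | head -1)
age_hours=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
echo "Latest backup: $latest (~${age_hours}h of potential data loss)"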
Step 2: Stop All Services¶
# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down
# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging
Step 3: Backup Current Database (Even if Corrupted)¶
# Create emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres
docker exec heliosdb-postgres \
pg_dump -U heliosdb_admin -Fc heliosdb > \
/backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump
docker compose -f deployment/staging/docker-compose.yml stop postgres
Step 4: Restore from Backup¶
# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres
# Wait for ready
sleep 30
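# Before dropping anything, sanity-check that the target backup is a readable
# archive; pg_restore --list prints the table of contents without restoring
# (sketch; assumes a custom-format dump as produced in Step 3)
docker exec -i heliosdb-postgres \
    pg_restore --list < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump | head -20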
# Drop current database
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"
# Create fresh database
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"
# Restore from backup
docker exec -i heliosdb-postgres \
pg_restore -U heliosdb_admin -d heliosdb -v \
< /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump
Step 5: Verify Database Integrity¶
# Check database size
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "\l+"
# Check table counts
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "
SELECT schemaname, tablename,
n_live_tup as row_count
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;"
# Rebuild statistics (scanning all tables will also surface corruption errors)
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"
Step 6: Restart Services¶
# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d
# Or for Kubernetes (adjust --replicas to each deployment's original count)
kubectl scale deployment --all --replicas=1 -n heliosdb-staging
Post-Rollback Validation¶
1. Health Checks¶
# Check all service health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# All should return: {"status": "healthy"}
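If jq is available, the expected payload can be asserted rather than eyeballed; jq -e exits non-zero when the expression is false, so this works in scripts too.
# Assert the expected payload programmatically (jq assumed installed)
for port in 8081 8082 8083; do
  curl -s http://localhost:$port/health | jq -e '.status == "healthy"' > /dev/null \
    && echo "port $port: healthy" || echo "port $port: UNHEALTHY"
done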
2. Smoke Tests¶
Conversational BI¶
curl -X POST http://localhost:8081/api/v1/query \
-H "Content-Type: application/json" \
-d '{"question": "Show me users", "database": "heliosdb"}'
Compliance¶
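No compliance-specific smoke test is defined yet; as a placeholder, probe the service's health endpoint and substitute a real compliance API call once one is documented.
# Placeholder check (replace with a documented compliance API call)
curl http://localhost:8082/health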
Embedded+Cloud¶
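Likewise for Embedded+Cloud, a liveness probe stands in until a sync-specific call is documented.
# Placeholder check (replace with a documented sync API call)
curl http://localhost:8083/health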
3. Metrics Validation¶
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# All should show: "health": "up"
4. Log Analysis¶
# Check for errors in last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error
# Should see no critical errors
5. Load Testing (Optional)¶
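No load profile is mandated; as a minimal sketch, Apache Bench can apply light load to a rolled-back service (the ab tool and the target endpoint here are assumptions, not part of the stack above).
# Light load against the health endpoint (ab assumed installed; tune -n/-c as needed)
ab -n 500 -c 10 http://localhost:8081/health
# Watch error rate and latency in Grafana while the test runs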
Incident Documentation¶
Rollback Checklist¶
After rollback completes:
- [ ] Document root cause of issue
- [ ] Record rollback timeline
- [ ] Update incident ticket
- [ ] Notify stakeholders of resolution
- [ ] Schedule post-mortem meeting
- [ ] Create action items for prevention
- [ ] Update runbooks if needed
Post-Mortem Template¶
# Incident Post-Mortem: [Service Name] Rollback
**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored
## Root Cause
Detailed analysis of what caused the issue.
## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]
## Resolution
How the issue was resolved (rollback details).
## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]
## Lessons Learned
What we learned and how to prevent similar incidents in the future.
Best Practices¶
- Always Backup First: Before any rollback, ensure backups exist
- Test Rollbacks: Regularly test rollback procedures in staging
- Version Tagging: Use semantic versioning and git tags
- Keep Previous Images: Retain last 5 Docker images
- Document Changes: Maintain deployment changelog
- Gradual Rollout: Use canary or blue-green deployments when possible
- Quick Decision: Don't delay rollback if criteria are met
- Communicate: Keep team informed throughout process
Emergency Contacts¶
- On-Call Engineer: [PagerDuty/Phone]
- Team Lead: [Contact]
- DevOps: [Contact]
- Database Admin: [Contact]
Rollback Automation¶
For faster rollbacks, consider implementing:
- Automated Rollback Scripts: Pre-built scripts for common scenarios
- Feature Flags: Toggle features without deployment
- Blue-Green Deployments: Instant switchback capability
- Canary Releases: Gradual rollout with automatic rollback
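As a starting point, the manual steps from Procedure 1 can be folded into one script. Everything below (argument handling, paths, the hardcoded health port) is a sketch to adapt, not a shipped tool.
#!/usr/bin/env bash
# rollback_service.sh SERVICE TAG -- sketch of an automated service-level rollback
# Usage: ./rollback_service.sh conversational-bi v7.0.0-rc1
set -euo pipefail
service=$1; tag=$2
cd /home/claude/HeliosDB/deployment/staging
# Re-point the service at the previous image and recreate the container
docker pull "heliosdb-${service}:${tag}"
docker tag "heliosdb-${service}:${tag}" "heliosdb-${service}:latest"
docker compose -f docker-compose.yml up -d --force-recreate "$service"
# Wait briefly, then verify health (adjust the port per service: 8081/8082/8083)
sleep 15
curl -sf "http://localhost:8081/health" && echo "Rollback of $service complete"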
Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.