HeliosDB Rollback Procedures¶
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents¶
- Overview
- When to Roll Back
- Rollback Decision Matrix
- Pre-Rollback Checklist
- Rollback Procedures
- Post-Rollback Validation
- Incident Documentation
Overview¶
This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical issues, bugs, or performance problems.
Rollback Strategy¶
HeliosDB supports multiple rollback strategies:
- Service-Level Rollback: Roll back an individual service (recommended)
- Full Stack Rollback: Roll back the entire deployment
- Data Rollback: Restore the database from backup (last resort)
Recovery Time Objectives (RTO)¶
- Service-Level Rollback: < 5 minutes
- Full Stack Rollback: < 15 minutes
- Data Rollback: < 30 minutes (depending on backup size)
When to Roll Back¶
Rollback Triggers¶
Execute rollback immediately when:
- Service Unavailability
  - Service down for > 5 minutes
  - Cannot be resolved by restart
  - Affecting critical functionality
- Data Corruption
  - Inconsistent data detected
  - Data loss occurring
  - Database integrity compromised
- Security Vulnerability
  - Critical security flaw discovered
  - Exploit actively being used
  - Compliance breach
- Performance Degradation
  - Latency increase > 500%
  - Error rate > 25%
  - Resource exhaustion
- Compliance Violations
  - Audit log failures
  - Compliance framework violations
  - Regulatory breach risk
Don't Roll Back When¶
- Minor bugs that don't affect core functionality
- Cosmetic issues
- Performance degradation < 100% (latency less than doubled)
- Issues that can be hotfixed quickly (< 30 minutes)
Rollback Decision Matrix¶
| Severity | Impact | Response | Rollback? |
|---|---|---|---|
| P1 Critical | Service down, data loss | Immediate | Yes, immediate |
| P1 Critical | Security breach | Immediate | Yes, immediate |
| P2 High | Major degradation | Within 15 min | Yes, if no quick fix |
| P2 High | Partial functionality loss | Within 30 min | Consider, after attempting a fix |
| P3 Medium | Minor degradation | Within 1 hour | No, fix forward |
| P4 Low | Cosmetic issues | Next business day | No |
Pre-Rollback Checklist¶
Before executing rollback:
1. Incident Assessment¶
- [ ] Confirm severity level (P1-P4)
- [ ] Identify affected services
- [ ] Document symptoms and error messages
- [ ] Check if issue is deployment-related
- [ ] Verify rollback is appropriate response
2. Communication¶
- [ ] Notify team via Slack/email
- [ ] Identify incident commander
- [ ] Create incident ticket/doc
- [ ] Prepare status update for stakeholders
3. Backup Verification¶
- [ ] Verify backup availability
- [ ] Check backup timestamp
- [ ] Confirm backup integrity
- [ ] Test backup accessibility
4. Rollback Plan¶
- [ ] Determine rollback scope (service vs full stack)
- [ ] Identify target version/commit
- [ ] Review dependencies
- [ ] Prepare rollback commands
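To make the rollback plan concrete, here is a minimal sketch of a state-capture helper to run before any rollback; it assumes the staging paths and Compose file used elsewhere in this runbook.
# pre_rollback_snapshot.sh -- record current state so the rollback target is unambiguous (sketch)
cd /home/claude/HeliosDB
# Record the currently deployed commit
git rev-parse HEAD > /tmp/pre_rollback_commit.txt
# Record which image each service is running
docker compose -f deployment/staging/docker-compose.yml images > /tmp/pre_rollback_images.txt
# Preserve the current environment file (matches the .env.backup.YYYYMMDD pattern used in Procedure 2)
cp deployment/staging/.env deployment/staging/.env.backup.$(date +%Y%m%d)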
Rollback Procedures¶
Procedure 1: Service-Level Rollback (Docker Compose)¶
Use Case: Roll back an individual service to its previous version
Duration: 3-5 minutes
Step 1: Identify Previous Version¶
# Check deployment history
docker images | grep heliosdb
# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1
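If several candidate tags exist, a formatted listing (standard Docker CLI templating) makes creation times easier to compare:
# Optional: show heliosdb images with creation times side by side
docker images --format '{{.Repository}}:{{.Tag}}\t{{.CreatedAt}}' | grep heliosdb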
Step 2: Update Service to Previous Version¶
cd /home/claude/HeliosDB/deployment/staging
# Edit docker-compose.yml
# Change image tag for affected service
# Example:
# conversational-bi:
# image: heliosdb-conversational-bi:v7.0.0-rc1 # Rollback to this
# Or pull previous version
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest
Step 3: Restart Service¶
# Restart specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi
# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi
Step 4: Verify Rollback¶
# Check service health
curl http://localhost:8081/health
# Check service version
curl http://localhost:8081/version
# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'
Procedure 2: Full Stack Rollback (Docker Compose)¶
Use Case: Roll back the entire deployment to the previous stable state
Duration: 10-15 minutes
Step 1: Stop Current Deployment¶
cd /home/claude/HeliosDB/deployment/staging
# Stop all services
docker compose -f docker-compose.yml down
Step 2: Checkout Previous Version¶
cd /home/claude/HeliosDB
# List available tags
git tag -l
# Checkout previous stable version
git checkout v7.0.0-rc1 # Replace with target version
# Verify checkout
git log -1
Step 3: Rebuild Images (if needed)¶
# Rebuild all images from previous version
docker compose -f deployment/staging/docker-compose.yml build
# Verify images
docker images | grep heliosdb
Step 4: Restore Configuration¶
# Restore previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
/home/claude/HeliosDB/deployment/staging/.env
# Verify configuration
cat deployment/staging/.env
Step 5: Start Services¶
# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis
# Wait for ready (30 seconds)
sleep 30
# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
conversational-bi \
compliance \
embedded-cloud-sync
# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana
# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx
Step 6: Verify Full Stack¶
# Check all services
docker compose -f deployment/staging/docker-compose.yml ps
# Test all health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
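For a scripted gate, the same checks can be looped with an explicit failure exit (ports per the Compose mapping above):
# Fail fast if any service is unhealthy
for port in 8081 8082 8083; do
  if ! curl -sf http://localhost:$port/health > /dev/null; then
    echo "Service on port $port is unhealthy" >&2
    exit 1
  fi
done
echo "All services healthy"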
Procedure 3: Service-Level Rollback (Kubernetes)¶
Use Case: Roll back an individual Kubernetes deployment
Duration: 5-10 minutes
Step 1: Check Deployment History¶
# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging
# Example output:
# REVISION CHANGE-CAUSE
# 1 Initial deployment
# 2 Update to v7.0.0
# 3 Current deployment
Step 2: Rollback Deployment¶
# Rollback to previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
# Or rollback to specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging
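Optionally record why the rollback happened; kubectl fills the CHANGE-CAUSE column in rollout history from the kubernetes.io/change-cause annotation, so setting it keeps the history in Step 1 meaningful.
# Optional: record the rollback reason for future 'rollout history' output
kubectl annotate deployment/conversational-bi -n heliosdb-staging \
  kubernetes.io/change-cause="Rollback to revision 2, see incident ticket" --overwrite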
Step 3: Monitor Rollback¶
# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi
Step 4: Verify Rollback¶
# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging
# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health
# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi
Procedure 4: Full Stack Rollback (Kubernetes)¶
Use Case: Roll back all Kubernetes deployments
Duration: 15-20 minutes
Step 1: Rollback All Deployments¶
# Rollback all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging
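The same three commands can be expressed as a loop, which also scales if more services are added later:
# Equivalent loop form
for d in conversational-bi compliance embedded-cloud-sync; do
  kubectl rollout undo deployment/$d -n heliosdb-staging
done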
Step 2: Monitor Rollbacks¶
# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging
# Check all pods
watch kubectl get pods -n heliosdb-staging
Step 3: Verify All Services¶
# Check all deployments
kubectl get deployments -n heliosdb-staging
# Check all service endpoints
kubectl get endpoints -n heliosdb-staging
# Test health checks (container ports assumed to match the Compose mapping: 8081/8082/8083)
for svc_port in conversational-bi:8081 compliance:8082 embedded-cloud-sync:8083; do
  service=${svc_port%%:*}; port=${svc_port##*:}
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/$service -- curl -s http://localhost:$port/health
done
Procedure 5: Database Rollback (LAST RESORT)¶
Use Case: Data corruption, need to restore from backup
Duration: 30-60 minutes (depending on backup size)
WARNING: This will result in data loss for all transactions committed after the backup timestamp!
Step 1: Assess Data Loss Window¶
# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5
# Calculate data loss window
# Example: Backup from 6 hours ago = 6 hours of data loss
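A sketch for computing the window automatically, assuming backups are .dump files under /backups/heliosdb/ with modification times intact (GNU coreutils):
# Print the newest backup and its approximate age in hours
latest=$(ls -t /backups/heliosdb/*.dump | head -1)
age_hours=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
echo "Latest backup: $latest (~${age_hours}h of potential data loss)"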
Step 2: Stop All Services¶
# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down
# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging
Step 3: Backup Current Database (Even if Corrupted)¶
# Create emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres
docker exec heliosdb-postgres \
pg_dump -U heliosdb_admin -Fc heliosdb > \
/backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump
docker compose -f deployment/staging/docker-compose.yml stop postgres
Step 4: Restore from Backup¶
# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres
# Wait for ready
sleep 30
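# Before dropping anything, sanity-check that the target backup is a readable
# archive; pg_restore --list prints the table of contents without restoring
# (sketch; assumes a custom-format dump as produced in Step 3)
docker exec -i heliosdb-postgres \
    pg_restore --list < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump | head -20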
# Drop current database
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"
# Create fresh database
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"
# Restore from backup
docker exec -i heliosdb-postgres \
pg_restore -U heliosdb_admin -d heliosdb -v \
< /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump
Step 5: Verify Database Integrity¶
# Check database size
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "\l+"
# Check table counts
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "
SELECT schemaname, tablename,
n_live_tup as row_count
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;"
# Rebuild statistics (scanning all tables will also surface corruption errors)
docker exec heliosdb-postgres \
psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"
Step 6: Restart Services¶
# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d
# Or for Kubernetes (adjust --replicas to each deployment's original count)
kubectl scale deployment --all --replicas=1 -n heliosdb-staging
Post-Rollback Validation¶
1. Health Checks¶
# Check all service health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# All should return: {"status": "healthy"}
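If jq is available, the expected payload can be asserted rather than eyeballed; jq -e exits non-zero when the expression is false, so this works in scripts too.
# Assert the expected payload programmatically (jq assumed installed)
for port in 8081 8082 8083; do
  curl -s http://localhost:$port/health | jq -e '.status == "healthy"' > /dev/null \
    && echo "port $port: healthy" || echo "port $port: UNHEALTHY"
done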
2. Smoke Tests¶
Conversational BI¶
curl -X POST http://localhost:8081/api/v1/query \
-H "Content-Type: application/json" \
-d '{"question": "Show me users", "database": "heliosdb"}'
Compliance¶
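No compliance-specific smoke test is defined yet; as a placeholder, probe the service's health endpoint and substitute a real compliance API call once one is documented.
# Placeholder check (replace with a documented compliance API call)
curl http://localhost:8082/health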
Embedded+Cloud¶
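Likewise for Embedded+Cloud, a liveness probe stands in until a sync-specific call is documented.
# Placeholder check (replace with a documented sync API call)
curl http://localhost:8083/health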
3. Metrics Validation¶
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# All should show: "health": "up"
4. Log Analysis¶
# Check for errors in last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error
# Should see no critical errors
5. Load Testing (Optional)¶
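No load profile is mandated; as a minimal sketch, Apache Bench can apply light load to a rolled-back service (the ab tool and the target endpoint here are assumptions, not part of the stack above).
# Light load against the health endpoint (ab assumed installed; tune -n/-c as needed)
ab -n 500 -c 10 http://localhost:8081/health
# Watch error rate and latency in Grafana while the test runs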
Incident Documentation¶
Rollback Checklist¶
After rollback completes:
- [ ] Document root cause of issue
- [ ] Record rollback timeline
- [ ] Update incident ticket
- [ ] Notify stakeholders of resolution
- [ ] Schedule post-mortem meeting
- [ ] Create action items for prevention
- [ ] Update runbooks if needed
Post-Mortem Template¶
# Incident Post-Mortem: [Service Name] Rollback
**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored
## Root Cause
Detailed analysis of what caused the issue.
## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]
## Resolution
How the issue was resolved (rollback details).
## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]
## Lessons Learned
What we learned and how to prevent similar incidents in the future.
Best Practices¶
- Always Backup First: Before any rollback, ensure backups exist
- Test Rollbacks: Regularly test rollback procedures in staging
- Version Tagging: Use semantic versioning and git tags
- Keep Previous Images: Retain last 5 Docker images
- Document Changes: Maintain deployment changelog
- Gradual Rollout: Use canary or blue-green deployments when possible
- Quick Decision: Don't delay rollback if criteria are met
- Communicate: Keep team informed throughout process
Emergency Contacts¶
- On-Call Engineer: [PagerDuty/Phone]
- Team Lead: [Contact]
- DevOps: [Contact]
- Database Admin: [Contact]
Rollback Automation¶
For faster rollbacks, consider implementing:
- Automated Rollback Scripts: Pre-built scripts for common scenarios
- Feature Flags: Toggle features without deployment
- Blue-Green Deployments: Instant switchback capability
- Canary Releases: Gradual rollout with automatic rollback
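As a starting point, the manual steps from Procedure 1 can be folded into one script. Everything below (argument handling, paths, the hardcoded health port) is a sketch to adapt, not a shipped tool.
#!/usr/bin/env bash
# rollback_service.sh SERVICE TAG -- sketch of an automated service-level rollback
# Usage: ./rollback_service.sh conversational-bi v7.0.0-rc1
set -euo pipefail
service=$1; tag=$2
cd /home/claude/HeliosDB/deployment/staging
# Re-point the service at the previous image and recreate the container
docker pull "heliosdb-${service}:${tag}"
docker tag "heliosdb-${service}:${tag}" "heliosdb-${service}:latest"
docker compose -f docker-compose.yml up -d --force-recreate "$service"
# Wait briefly, then verify health (adjust the port per service: 8081/8082/8083)
sleep 15
curl -sf "http://localhost:8081/health" && echo "Rollback of $service complete"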
Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.