Disaster Recovery Guide
Overview
This guide provides procedures for recovering the Smart Smoker V2 infrastructure in case of failures, data loss, or catastrophic events.
Table of Contents
- Backup Procedures
- Recovery Scenarios
- Terraform State Recovery
- Container Recovery
- Data Restoration
- Testing Recovery Procedures
Backup Procedures
Terraform State Backups
Automated Backup (Recommended):
# Add to cron job (daily at 2 AM)
0 2 * * * cd /home/benjr70/Dev/Smart-Smoker-V2/infra/proxmox/terraform && \
cp state/terraform.tfstate state/terraform.tfstate.backup.$(date +\%Y\%m\%d) && \
find state/ -name "terraform.tfstate.backup.*" -mtime +30 -delete
Manual Backup:
cd infra/proxmox/terraform
cp state/terraform.tfstate state/terraform.tfstate.backup.$(date +%Y%m%d-%H%M%S)
Remote Backup (For production):
# Sync to remote storage
aws s3 cp state/terraform.tfstate s3://smart-smoker-terraform/state/terraform.tfstate.$(date +%Y%m%d)
# Or use rsync to remote server
rsync -avz state/terraform.tfstate backup-server:/backups/terraform/
Proxmox Backups
LXC Container Backups:
# On Proxmox host - backup all containers
vzdump 104 105 106 --mode snapshot --compress zstd --storage backup-storage
# Automated backup (add to Proxmox Datacenter > Backup)
# Schedule: Daily at 1 AM
# Retention: 7 daily, 4 weekly, 3 monthly
VM Backups:
# Backup virtual smoker VM
vzdump 9001 --mode snapshot --compress zstd --storage backup-storage
Configuration Backups
Backup terraform.tfvars (Encrypted):
# Encrypt and backup sensitive configuration
gpg --symmetric --cipher-algo AES256 terraform.tfvars
cp terraform.tfvars.gpg ~/secure-backups/terraform.tfvars.$(date +%Y%m%d).gpg
# Decrypt when needed
gpg --decrypt terraform.tfvars.gpg > terraform.tfvars
Git Repository Backup:
# Clone mirror to backup location
git clone --mirror https://github.com/benjr70/Smart-Smoker-V2.git
cd Smart-Smoker-V2.git
git remote update
Recovery Scenarios
Scenario 1: Single Container Failure
Symptoms: One container is not responding or corrupted
Recovery Steps: 1. Verify container status:
pct status 104 # Replace with affected VMID
-
Attempt to restore from backup:
# List available backups ls /var/lib/vz/dump/ | grep 104 # Restore from backup pct restore 104 /var/lib/vz/dump/vzdump-lxc-104-YYYY_MM_DD-HH_MM_SS.tar.zst -
If backup restoration fails, rebuild with Terraform:
cd infra/proxmox/terraform # Remove failed container from state terraform state rm module.dev_cloud[0].module.container.proxmox_virtual_environment_container.this # Re-create container terraform apply -target=module.dev_cloud
Scenario 2: Complete Proxmox Host Failure
Symptoms: Proxmox server is completely offline/unrecoverable
Recovery Steps: 1. Install Fresh Proxmox VE: - Install Proxmox VE on new hardware - Configure network to match previous setup - Set up storage pools matching original configuration
-
Restore Configuration:
# Restore Terraform configuration from git git clone https://github.com/benjr70/Smart-Smoker-V2.git cd Smart-Smoker-V2/infra/proxmox/terraform # Restore terraform.tfvars from encrypted backup gpg --decrypt ~/secure-backups/terraform.tfvars.YYYYMMDD.gpg > terraform.tfvars -
Rebuild Infrastructure:
# Initialize Terraform terraform init # Review plan terraform plan # Apply infrastructure terraform apply -
Restore Application Data (see Data Restoration section)
Scenario 3: Terraform State File Corruption
Symptoms: terraform plan shows unexpected changes or errors
Recovery Steps: 1. Restore from backup:
cd infra/proxmox/terraform/state
# List available backups
ls -lh terraform.tfstate.backup.*
# Restore most recent valid backup
cp terraform.tfstate.backup.YYYYMMDD terraform.tfstate
# Verify
terraform plan
- If no backup available, rebuild state:
# Import existing resources terraform import 'module.github_runner[0].module.container.proxmox_virtual_environment_container.this' 105 terraform import 'module.dev_cloud[0].module.container.proxmox_virtual_environment_container.this' 104 terraform import 'module.prod_cloud[0].module.container.proxmox_virtual_environment_container.this' 106 # Verify state is correct terraform plan
Scenario 4: Accidental Resource Deletion
Symptoms: Critical resources were destroyed (prevented by lifecycle policy in production)
Recovery Steps:
1. If resources have prevent_destroy = true:
- Terraform will prevent deletion
- Error message: "Error: Instance cannot be destroyed"
-
If deletion occurred before lifecycle protection:
# Check if resources still exist in Proxmox pct list qm list # If resources exist, re-import to Terraform terraform import 'module.prod_cloud[0].module.container.proxmox_virtual_environment_container.this' 106 # If resources don't exist, recreate terraform apply -target=module.prod_cloud -
Restore data from backup
Terraform State Recovery
Remote Backend Migration (Recommended for Production)
Setup S3 Backend:
# infra/proxmox/terraform/backend.tf
terraform {
backend "s3" {
bucket = "smart-smoker-terraform-state"
key = "proxmox/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
Migrate from Local to Remote:
# Backup current state
cp state/terraform.tfstate state/terraform.tfstate.backup.pre-migration
# Update backend.tf with remote configuration
# Run migration
terraform init -migrate-state
# Verify migration
terraform plan
State Locking
PostgreSQL Backend with Locking:
terraform {
backend "pg" {
conn_str = "postgres://terraform:password@postgres.example.com/terraform_state?sslmode=require"
}
}
Container Recovery
GitHub Runner Recovery
Priority: HIGH - Required for CI/CD
- Restore container from backup or recreate with Terraform
-
Reinstall GitHub Actions runner:
pct enter 105 cd /opt/actions-runner ./config.sh --url https://github.com/benjr70/Smart-Smoker-V2 --token <NEW_TOKEN> ./svc.sh install ./svc.sh start -
Verify runner connectivity in GitHub repository settings
Development Cloud Recovery
Priority: MEDIUM
- Restore container from backup or recreate
- Redeploy application:
# Trigger deployment workflow or manual deploy ssh root@10.20.0.20 cd /opt/smart-smoker docker-compose pull docker-compose up -d
Production Cloud Recovery
Priority: CRITICAL
-
Restore from most recent backup:
pct restore 106 /var/lib/vz/dump/vzdump-lxc-106-latest.tar.zst pct start 106 -
Verify application data integrity:
pct enter 106 docker ps docker-compose logs -
Restore database if needed (see Data Restoration)
-
Update DNS/traffic routing after verification
Data Restoration
Database Restoration
PostgreSQL/TimescaleDB (if used):
# Restore from backup
pct enter 106
cd /var/lib/postgresql/backups
pg_restore -U postgres -d smart_smoker latest.dump
# Or from continuous archive
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
Docker Volume Restoration
# Backup volumes
docker run --rm -v smart-smoker_data:/data -v $(pwd):/backup ubuntu \
tar czf /backup/data-backup.tar.gz /data
# Restore volumes
docker run --rm -v smart-smoker_data:/data -v $(pwd):/backup ubuntu \
tar xzf /backup/data-backup.tar.gz -C /
Application Configuration
# Restore environment files
pct push 106 /secure-backups/.env /opt/smart-smoker/.env
# Restore application configs
pct push 106 /secure-backups/config /opt/smart-smoker/config
Testing Recovery Procedures
Quarterly DR Tests
Schedule: First weekend of each quarter
Test Procedures:
-
Test Terraform State Backup/Restore:
# Backup current state cp state/terraform.tfstate state/terraform.tfstate.CURRENT # Restore from old backup cp state/terraform.tfstate.backup.LAST_WEEK state/terraform.tfstate # Verify terraform plan # Restore current state mv state/terraform.tfstate.CURRENT state/terraform.tfstate -
Test Container Recovery:
# Create test container terraform apply -target=module.test_container # Destroy terraform destroy -target=module.test_container # Recover terraform apply -target=module.test_container -
Test Backup Restoration:
# Restore dev-cloud to test VMID pct restore 199 /var/lib/vz/dump/vzdump-lxc-104-latest.tar.zst # Verify application starts pct start 199 pct enter 199 docker-compose up -d # Cleanup pct stop 199 pct destroy 199
DR Drill Checklist
- [ ] Terraform state backup exists and is recent (< 24 hours)
- [ ] Can successfully restore Terraform state from backup
- [ ] Proxmox backups are configured and running
- [ ] Can restore LXC container from backup
- [ ] Can recreate container from Terraform
- [ ] terraform.tfvars is backed up securely
- [ ] Database backups are automated and tested
- [ ] Docker volumes are backed up
- [ ] Recovery procedures documentation is up-to-date
- [ ] All team members know where to find this guide
- [ ] Emergency contact information is current
Emergency Contacts
Infrastructure Team: - Primary: [Name] - [Email] - [Phone] - Secondary: [Name] - [Email] - [Phone]
Escalation Path: 1. Infrastructure Team 2. Development Lead 3. Technical Director
External Vendors: - Proxmox Support: [Contact Info] - Cloud Provider: [Contact Info]
Recovery Time Objectives (RTO) & Recovery Point Objectives (RPO)
| Component | RTO | RPO | Priority |
|---|---|---|---|
| Terraform State | 15 minutes | 24 hours | HIGH |
| GitHub Runner | 30 minutes | N/A | HIGH |
| Production Cloud | 1 hour | 1 hour | CRITICAL |
| Development Cloud | 4 hours | 24 hours | MEDIUM |
| Virtual Smoker VM | 4 hours | 24 hours | LOW |
| Application Data | 1 hour | 1 hour | CRITICAL |
Post-Recovery Checklist
After any recovery operation:
- [ ] Document what happened (incident report)
- [ ] Verify all services are operational
- [ ] Check monitoring and alerting
- [ ] Run
terraform planto verify infrastructure matches desired state - [ ] Test application functionality
- [ ] Verify backups are working
- [ ] Update this document if procedures changed
- [ ] Conduct post-mortem meeting
- [ ] Identify preventative measures
Related Documentation
- Terraform Configuration
- Proxmox Infrastructure
- Proxmox Infrastructure README
- Secrets Management Guide
Last Updated: 2025-10-05 Next Review Date: 2026-01-05 Document Owner: Infrastructure Team