
Ansible Operations Guide

This guide provides operational procedures for managing Smart Smoker infrastructure using Ansible.

Overview

Ansible configures and maintains all Proxmox LXC containers following Infrastructure as Code principles. Make all infrastructure changes through Ansible playbooks rather than manual SSH configuration.

Infrastructure Components

Ansible Roles

The infrastructure is managed through 7 specialized Ansible roles:

  1. common - Base system configuration
     • SSH hardening (key-only authentication)
     • UFW firewall configuration
     • fail2ban for brute force protection
     • Base package installation

  2. docker - Container runtime
     • Docker Engine installation
     • Docker Compose plugin
     • User permissions and daemon configuration

  3. terraform - Infrastructure tool (GitHub runner only)
     • Terraform CLI from HashiCorp repository
     • Latest stable version

  4. nodejs - Application runtime
     • Node.js 20 LTS from NodeSource
     • npm package manager

  5. github-runner - CI/CD runner
     • GitHub Actions runner download & setup
     • Service configuration and registration

  6. cloud-app - Cloud application environment
     • Application directories (/opt/smart-smoker-{dev,prod})
     • MongoDB data directories
     • User/group setup

  7. virtual-device - Virtual smoker device
     • Device directories
     • Python tools for simulation
     • Hardware mocking tools

Inventory Structure

Servers are organized into logical groups:

  • runners: GitHub Actions self-hosted runners
  • cloud_servers: Dev and production cloud servers
  • devices: Virtual smoker device for testing

See infra/proxmox/ansible/inventory/hosts.yml for current inventory.

Running Ansible Playbooks

Prerequisites

# Install Ansible
pip3 install ansible

# Install required collections
ansible-galaxy collection install community.general
ansible-galaxy collection install ansible.posix

Available Playbooks

Master Playbook (Configure Everything)

cd infra/proxmox/ansible

# Configure all infrastructure
ansible-playbook playbooks/site.yml --extra-vars "github_runner_token=YOUR_TOKEN"

Individual Server Playbooks

# GitHub runner only
ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_token=YOUR_TOKEN"

# Development cloud server
ansible-playbook playbooks/setup-dev-cloud.yml

# Production cloud server
ansible-playbook playbooks/setup-prod-cloud.yml

# Virtual smoker device
ansible-playbook playbooks/setup-virtual-smoker.yml

Verification Playbook

# Verify all infrastructure is correctly configured
ansible-playbook playbooks/verify-all.yml

Testing Connectivity

# Test SSH connectivity to all servers
ansible all -m ping

# Test connectivity to specific group
ansible cloud_servers -m ping
ansible runners -m ping

Common Operations

Update System Packages

# Update all servers
ansible all -m apt -a "update_cache=yes upgrade=dist" --become

Restart Docker Service

# Restart Docker on all servers
ansible all -m systemd -a "name=docker state=restarted" --become

Check Service Status

# Check Docker status on all servers
ansible all -m systemd -a "name=docker" --become

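# Check GitHub runner status
# (the systemd module does not accept glob patterns, so call systemctl directly)
ansible runners -m shell -a "systemctl status 'actions.runner.*' --no-pager" --become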

Run Ad-hoc Commands

# Check disk space
ansible all -m shell -a "df -h /"

# Check memory usage
ansible all -m shell -a "free -h"

GitHub Runner Management

Registering a Runner (Automatic)

Runner registration is fully automated via a GitHub PAT (Personal Access Token). The Ansible role generates short-lived registration tokens from the PAT, so no manual token generation is needed.

In CI (recommended): The ansible-provision.yml workflow automatically passes the RUNNER_PAT GitHub Secret to the role.

Local runs: Pass the PAT via --extra-vars:

ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_pat=github_pat_YOUR_TOKEN"

Fallback (manual token): You can still pass a manually generated token if needed:

ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_token=YOUR_SHORT_LIVED_TOKEN"

Runner Self-Healing

The runner has a self-healing systemd timer (runner-health-check.timer) that runs every 5 minutes and automatically detects and fixes stale registrations without requiring Ansible or GitHub Actions.

What it checks:

  • Runner .runner config file exists
  • Runner systemd service is active
  • No error loops in recent service logs (>3 errors in 5 min = unhealthy)
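The three checks above can be sketched as a single shell function. This is an illustrative sketch, not the role's actual health-check script: the function name `runner_is_healthy` and its argument order are assumptions, and the caller is expected to gather the inputs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the health checks; not the role's actual script.

# Returns 0 when the runner looks healthy, 1 otherwise.
#   $1: path to the runner's .runner config file
#   $2: service state as reported by `systemctl is-active` (e.g. "active")
#   $3: number of error lines found in the last 5 minutes of service logs
runner_is_healthy() {
  local config_file="$1" service_state="$2" error_count="$3"

  [ -f "$config_file" ] || return 1            # config file must exist
  [ "$service_state" = "active" ] || return 1  # service must be active
  [ "$error_count" -le 3 ] || return 1         # more than 3 errors = unhealthy
  return 0
}
```

On a real host, the inputs could come from `systemctl is-active "$unit"` and from counting matches in `journalctl -u "$unit" --since "5 minutes ago"`. How the actual script defines an "error" line is not stated in this guide.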

What it does when unhealthy:

  • Checks DNS resolution for api.github.com (falls back to 8.8.8.8 if needed)
  • Auto-generates a registration token from a stored PAT (/etc/github-runner/pat)
  • Stops and uninstalls the stale runner service
  • Re-registers with --replace --unattended
  • Installs and starts the new service
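The recovery sequence above corresponds roughly to the commands below. This sketch only prints the steps in order and runs nothing. The function name `recovery_plan` is hypothetical, `<registration-token>` stands for the token returned by the API call, and the DNS fallback to 8.8.8.8 is not shown because this guide does not describe how it is implemented. The `config.sh` and `svc.sh` scripts ship with the GitHub Actions runner, and the registration-token endpoint is part of the GitHub REST API.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: prints the recovery steps in order instead of running them.

#   $1: repository slug (owner/name)
#   $2: path to the stored PAT file
recovery_plan() {
  local repo="$1" pat_file="$2"
  cat <<EOF
getent hosts api.github.com
curl -fsS -X POST -H "Authorization: Bearer \$(cat ${pat_file})" https://api.github.com/repos/${repo}/actions/runners/registration-token
./svc.sh stop
./svc.sh uninstall
./config.sh --url https://github.com/${repo} --token <registration-token> --replace --unattended
./svc.sh install
./svc.sh start
EOF
}

# Print the plan for the repository and PAT path named in this guide.
recovery_plan "benjr70/Smart-Smoker-V2" "/etc/github-runner/pat"
```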

How to monitor:

# Check timer status
systemctl status runner-health-check.timer

# View recent health check logs
journalctl -u runner-health-check --since "1 hour ago"

# Manually trigger a health check
systemctl start runner-health-check.service

Checking Runner Status

# Check runner service status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'systemctl status actions.runner.* --no-pager'

# Check runner logs
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'journalctl -u actions.runner.* -n 50'

# Check self-healing timer status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'systemctl status runner-health-check.timer'

# Check runner status via GitHub API
gh api repos/benjr70/Smart-Smoker-V2/actions/runners \
  --jq '.runners[] | select(.name=="smart-smoker-runner-1")'

Removing a Runner

Note: The self-healing timer will automatically re-register the runner unless you also remove the PAT file at /etc/github-runner/pat.

# Stop the runner service
# (the systemd module does not accept glob patterns, so call systemctl directly)
ansible runners -m shell -a "systemctl stop 'actions.runner.*'" --become

# Remove runner from GitHub (via web UI or API)
gh api -X DELETE repos/benjr70/Smart-Smoker-V2/actions/runners/RUNNER_ID

# To prevent auto-re-registration, also remove the PAT file:
ssh -J root@192.168.1.151 root@10.20.0.10 'rm -f /etc/github-runner/pat'

Security Best Practices

SSH Key Management

Current Status: SSH public keys are configured in inventory/group_vars/all.yml

Recommendations:

  • Keep personal SSH keys out of the repository
  • Use environment variables or external files for team keys
  • Rotate SSH keys regularly

Secrets Management

  • Never commit sensitive values to the repository
  • Use --extra-vars for sensitive data like GitHub tokens
  • Consider using Ansible Vault for encrypted variables
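One way to pass a token through --extra-vars without typing the literal value on the command line is to read it from an environment variable. The sketch below assumes the variable is named RUNNER_PAT, matching the CI secret name; using that name locally is an assumption, and `pat_extra_vars` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: build the --extra-vars value from an environment variable so the
# literal token stays out of shell history. RUNNER_PAT locally is an assumption.

pat_extra_vars() {
  if [ -z "${RUNNER_PAT:-}" ]; then
    echo "RUNNER_PAT is not set" >&2
    return 1
  fi
  printf 'github_runner_pat=%s' "$RUNNER_PAT"
}

# Usage (not executed here):
#   ansible-playbook playbooks/setup-github-runner.yml \
#     --extra-vars "$(pat_extra_vars)"
```

The expanded value is still visible in the process list while the playbook runs, so Ansible Vault remains the stronger option for values that must be stored.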

Firewall Rules

Default UFW configuration:

  • Default incoming: DENY
  • Default outgoing: ALLOW
  • Allowed ports: 22 (SSH), 80 (HTTP), 443 (HTTPS)
  • MongoDB port: Restricted to internal network only
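For reference, the policy above is roughly equivalent to the following ufw commands. This sketch only prints them; the role itself may apply the rules differently (for example through an Ansible module). The function name is hypothetical, the internal network CIDR is an assumed argument, and 27017 is MongoDB's default port, which this deployment may or may not use.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: prints ufw commands matching the documented policy.

#   $1: internal network CIDR allowed to reach MongoDB (assumption)
ufw_baseline_commands() {
  local internal_cidr="$1"
  cat <<EOF
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp
ufw allow 80/tcp
ufw allow 443/tcp
ufw allow from ${internal_cidr} to any port 27017 proto tcp
EOF
}

# Example with an illustrative CIDR; replace with the real internal network.
ufw_baseline_commands "10.20.0.0/24"
```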

fail2ban Configuration

  • Enabled on: All servers
  • Protected services: SSH
  • Default ban time: Based on Debian defaults
  • Recommendation: Consider stricter settings for production

Troubleshooting

SSH Connection Issues

# Test SSH connectivity
ansible all -m ping -vvv

# Manually test SSH
ssh -J root@192.168.1.151 root@10.20.0.10

# Check SSH service status
ansible all -m systemd -a "name=sshd" --become

Ansible Playbook Failures

# Run playbook in check mode (dry run)
ansible-playbook playbooks/site.yml --check

# Run with verbose output
ansible-playbook playbooks/site.yml -vvv

# Run specific tasks with tags
ansible-playbook playbooks/site.yml --tags "docker"

GitHub Runner Issues

# Restart the runner service, then check its status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'systemctl restart actions.runner.* && systemctl status actions.runner.*'

# View runner logs
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'journalctl -u actions.runner.* -n 100'

Docker Issues

# Restart Docker on all servers
ansible all -m systemd -a "name=docker state=restarted" --become

# Check Docker status
ansible all -m shell -a "docker ps"

CI/CD Integration

Automated Ansible Validation

All Ansible code is validated in CI/CD via .github/workflows/ansible-lint.yml:

  • ansible-lint on all playbooks and roles
  • Syntax validation
  • Inventory verification

Future: Automated Ansible Execution

After bootstrap, Ansible can be executed automatically via GitHub Actions on infrastructure changes. This requires:

  1. Dedicated automation SSH key
  2. GitHub Actions workflow for Ansible execution
  3. Secure secrets management in GitHub

Best Practices

  1. Idempotency: All playbooks are designed to be run multiple times safely
  2. Check Mode: Test changes with --check before applying
  3. Verification: Always run verify-all.yml after infrastructure changes
  4. Version Control: Commit all Ansible changes to git
  5. Documentation: Update this guide when adding new roles or playbooks
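Practices 2 and 3 can be combined into a small wrapper that runs a dry run, applies the playbook, and then verifies. This is a minimal sketch: `safe_apply` and the PLAYBOOK_CMD override are hypothetical names, and it assumes the commands are run from infra/proxmox/ansible.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: check mode first, then apply, then verify.
# PLAYBOOK_CMD is an illustrative override (e.g. PLAYBOOK_CMD=echo to preview).

#   $1: playbook path; remaining arguments are passed through to each run
safe_apply() {
  local playbook="$1"; shift
  local cmd="${PLAYBOOK_CMD:-ansible-playbook}"

  "$cmd" "$playbook" --check "$@" || return 1  # dry run; stop on failure
  "$cmd" "$playbook" "$@" || return 1          # apply the changes
  "$cmd" playbooks/verify-all.yml              # verify the result
}

# Usage (not executed here):
#   safe_apply playbooks/site.yml --tags docker
```

Stopping after a failed check run keeps a broken change from being applied. Tasks that depend on results of earlier tasks may report failures in check mode even when a real run would succeed.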

Directory Structure

infra/proxmox/ansible/
├── ansible.cfg                    # Ansible configuration
├── inventory/
│   ├── hosts.yml                  # Server inventory
│   ├── group_vars/                # Group variables
│   │   ├── all.yml                # Common variables
│   │   ├── runners.yml            # Runner-specific vars
│   │   ├── cloud_servers.yml      # Cloud server vars
│   │   └── devices.yml            # Device vars
│   └── host_vars/                 # Host-specific variables
├── roles/                         # Ansible roles (7 total)
│   ├── common/
│   ├── docker/
│   ├── terraform/
│   ├── nodejs/
│   ├── github-runner/
│   ├── cloud-app/
│   └── virtual-device/
├── playbooks/                     # Ansible playbooks
│   ├── site.yml                   # Master playbook
│   ├── setup-github-runner.yml
│   ├── setup-dev-cloud.yml
│   ├── setup-prod-cloud.yml
│   ├── setup-virtual-smoker.yml
│   └── verify-all.yml             # Verification playbook
└── README.md                      # Quick reference

References