Ansible Operations Guide

This guide provides operational procedures for managing Smart Smoker infrastructure using Ansible.

Overview

Ansible configures and maintains all Proxmox LXC containers following Infrastructure as Code principles. Make all infrastructure changes through Ansible playbooks rather than manual SSH configuration.

Infrastructure Components

Ansible Roles

The infrastructure is managed through 7 specialized Ansible roles:

common - Base system configuration
  SSH hardening (key-only authentication)
  UFW firewall configuration
  fail2ban for brute force protection
  Base package installation
docker - Container runtime
  Docker Engine installation
  Docker Compose plugin
  User permissions and daemon configuration
terraform - Infrastructure tool (GitHub runner only)
  Terraform CLI from HashiCorp repository
  Latest stable version
nodejs - Application runtime
  Node.js 20 LTS from NodeSource
  npm package manager
github-runner - CI/CD runner
  GitHub Actions runner download & setup
  Service configuration and registration
cloud-app - Cloud application environment
  Application directories (/opt/smart-smoker-{dev,prod})
  MongoDB data directories
  User/group setup
virtual-device - Virtual smoker device
  Device directories
  Python tools for simulation
  Hardware mocking tools

Inventory Structure

Servers are organized into logical groups:

runners: GitHub Actions self-hosted runners
cloud_servers: Dev and production cloud servers
devices: Virtual smoker device for testing

See infra/proxmox/ansible/inventory/hosts.yml for current inventory.

Running Ansible Playbooks

Prerequisites

# Install Ansible
pip3 install ansible

# Install required collections
ansible-galaxy collection install community.general
ansible-galaxy collection install ansible.posix
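
Before running any playbook, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch of such a preflight check; the helper names require_cmd and check_prereqs are illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper; not part of the repository.

# require_cmd NAME: succeed if NAME is an executable on the PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# check_prereqs: report every missing tool, return non-zero if any is missing.
check_prereqs() {
  local missing=0 cmd
  for cmd in ansible ansible-playbook ansible-galaxy ssh; do
    if ! require_cmd "$cmd"; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_prereqs && ansible-galaxy collection list
```

Reporting every missing tool in one pass, rather than stopping at the first, avoids repeated fix-and-retry cycles on a fresh workstation.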

Available Playbooks

Master Playbook (Configure Everything)

cd infra/proxmox/ansible

# Configure all infrastructure
ansible-playbook playbooks/site.yml --extra-vars "github_runner_token=YOUR_TOKEN"

Individual Server Playbooks

# GitHub runner only
ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_token=YOUR_TOKEN"

# Development cloud server
ansible-playbook playbooks/setup-dev-cloud.yml

# Production cloud server
ansible-playbook playbooks/setup-prod-cloud.yml

# Virtual smoker device
ansible-playbook playbooks/setup-virtual-smoker.yml

Verification Playbook

# Verify all infrastructure is correctly configured
ansible-playbook playbooks/verify-all.yml

Testing Connectivity

# Test SSH connectivity to all servers
ansible all -m ping

# Test connectivity to specific group
ansible cloud_servers -m ping
ansible runners -m ping

Common Operations

Update System Packages

# Update all servers
ansible all -m apt -a "update_cache=yes upgrade=dist" --become

Restart Docker Service

# Restart Docker on all servers
ansible all -m systemd -a "name=docker state=restarted" --become

Check Service Status

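# Check Docker status on all servers
# (the systemd module requires state/enabled/etc., so query systemctl directly)
ansible all -m command -a "systemctl is-active docker" --become

# Check GitHub runner status
# (the systemd module rejects glob patterns; systemctl expands them itself)
ansible runners -m command -a "systemctl status actions.runner.* --no-pager" --become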

Run Ad-hoc Commands

# Check disk space
ansible all -m shell -a "df -h /"

# Check memory usage
ansible all -m shell -a "free -h"

GitHub Runner Management

Registering a Runner (Automatic)

Runner registration is fully automated via a GitHub PAT (Personal Access Token). The Ansible role generates short-lived registration tokens from the PAT, so no manual token generation is needed.

In CI (recommended): The ansible-provision.yml workflow automatically passes the RUNNER_PAT GitHub Secret to the role.

Local runs: Pass the PAT via --extra-vars:

ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_pat=github_pat_YOUR_TOKEN"

Fallback (manual token): You can still pass a manually generated token if needed:

ansible-playbook playbooks/setup-github-runner.yml \
  --extra-vars "github_runner_token=YOUR_SHORT_LIVED_TOKEN"
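
For context, a short-lived registration token can be obtained from a PAT through GitHub's REST endpoint POST /repos/{owner}/{repo}/actions/runners/registration-token. The sketch below shows one way to call it with curl. How the Ansible role actually performs this step is not shown in this guide, and the function names here are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the role's real implementation may differ.

# registration_token_url OWNER REPO: print the REST endpoint for registration tokens.
registration_token_url() {
  printf 'https://api.github.com/repos/%s/%s/actions/runners/registration-token\n' "$1" "$2"
}

# request_registration_token PAT OWNER REPO: POST to the endpoint and print the raw
# JSON response, which contains "token" and "expires_at" fields.
request_registration_token() {
  local pat=$1 owner=$2 repo=$3
  curl -fsS -X POST \
    -H "Authorization: Bearer ${pat}" \
    -H "Accept: application/vnd.github+json" \
    "$(registration_token_url "$owner" "$repo")"
}

# Usage (assumes jq is installed and RUNNER_PAT is set in the environment):
#   request_registration_token "$RUNNER_PAT" benjr70 Smart-Smoker-V2 | jq -r .token
```

The resulting value is what the fallback command above expects as github_runner_token.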

Runner Self-Healing

The runner has a self-healing systemd timer (runner-health-check.timer) that runs every 5 minutes and automatically detects and fixes stale registrations without requiring Ansible or GitHub Actions.

What it checks:

Runner .runner config file exists
Runner systemd service is active
No error loops in recent service logs (more than 3 errors in 5 minutes counts as unhealthy)

What it does when unhealthy:

Checks DNS resolution for api.github.com (falls back to 8.8.8.8 if needed)
Auto-generates a registration token from a stored PAT (/etc/github-runner/pat)
Stops and uninstalls the stale runner service
Re-registers with --replace --unattended
Installs and starts the new service
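
The three health checks listed above can be sketched as a single decision function. This is an illustration only: the real health-check script is not reproduced in this guide, and the function name, the MAX_ERRORS variable, and the runner path in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the health decision; the real script may differ.

MAX_ERRORS=3   # more than this many errors in the window counts as an error loop

# runner_is_healthy CONFIG_FILE SERVICE_STATE ERROR_COUNT
# Returns 0 (healthy) only if all three checks pass.
runner_is_healthy() {
  local config_file=$1 service_state=$2 error_count=$3
  [ -f "$config_file" ] || return 1                  # .runner config file must exist
  [ "$service_state" = "active" ] || return 1        # systemd service must be active
  [ "$error_count" -le "$MAX_ERRORS" ] || return 1   # more than 3 errors is unhealthy
  return 0
}

# Example wiring (assumed install path and log priority, not taken from the role):
#   state=$(systemctl is-active 'actions.runner.*' | head -n 1)
#   errors=$(journalctl -u 'actions.runner.*' --since "5 minutes ago" -p err -q --no-pager | wc -l)
#   runner_is_healthy /opt/actions-runner/.runner "$state" "$errors" || echo "unhealthy"
```

Keeping the decision separate from the data gathering makes the threshold easy to exercise without a live runner.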

How to monitor:

# Check timer status
systemctl status runner-health-check.timer

# View recent health check logs
journalctl -u runner-health-check --since "1 hour ago"

# Manually trigger a health check
systemctl start runner-health-check.service

Checking Runner Status

# Check runner service status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  "systemctl status 'actions.runner.*' --no-pager"

# Check runner logs
ssh -J root@192.168.1.151 root@10.20.0.10 \
  "journalctl -u 'actions.runner.*' -n 50 --no-pager"

# Check self-healing timer status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  'systemctl status runner-health-check.timer'

# Check runner status via GitHub API
gh api repos/benjr70/Smart-Smoker-V2/actions/runners \
  --jq '.runners[] | select(.name=="smart-smoker-runner-1")'

Removing a Runner

Note: The self-healing timer will automatically re-register the runner unless you also remove the PAT file at /etc/github-runner/pat.

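# Stop the runner service
# (the systemd module rejects glob patterns; systemctl expands them itself)
ansible runners -m command -a "systemctl stop actions.runner.*" --become

# Look up the numeric runner ID
gh api repos/benjr70/Smart-Smoker-V2/actions/runners \
  --jq '.runners[] | select(.name=="smart-smoker-runner-1") | .id'

# Remove runner from GitHub (via web UI or API)
gh api -X DELETE repos/benjr70/Smart-Smoker-V2/actions/runners/RUNNER_ID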

# To prevent auto-re-registration, also remove the PAT file:
ssh -J root@192.168.1.151 root@10.20.0.10 'rm -f /etc/github-runner/pat'

Security Best Practices

SSH Key Management

Current Status: SSH public keys are configured in inventory/group_vars/all.yml

Recommendations:

Keep personal SSH keys out of the repository
Use environment variables or external files for team keys
Rotate SSH keys regularly

Secrets Management

Never commit sensitive values to the repository
Use --extra-vars for sensitive data like GitHub tokens
Consider using Ansible Vault for encrypted variables
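
Values passed inline with --extra-vars end up in shell history and in the process list. One alternative, sketched below, is to write the value to an owner-only temporary file and pass it with the --extra-vars "@file" form that ansible-playbook supports. The helper name write_secret_vars is hypothetical, and the simple quoting assumes the secret contains no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the repository.

# write_secret_vars NAME VALUE: write a one-line YAML vars file and print its path.
write_secret_vars() {
  local file
  file=$(mktemp) || return 1        # mktemp creates the file readable by the owner only
  printf '%s: "%s"\n' "$1" "$2" > "$file"
  printf '%s\n' "$file"
}

# Usage (assumes RUNNER_PAT is already set in the environment):
#   vars_file=$(write_secret_vars github_runner_pat "$RUNNER_PAT")
#   ansible-playbook playbooks/setup-github-runner.yml --extra-vars "@${vars_file}"
#   rm -f "$vars_file"
```

Deleting the file right after the run keeps the exposure window short. Ansible Vault remains the better fit for secrets that need to live alongside the inventory.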

Firewall Rules

Default UFW configuration:

Default incoming: DENY
Default outgoing: ALLOW
Allowed ports: 22 (SSH), 80 (HTTP), 443 (HTTPS)
MongoDB port: Restricted to internal network only
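
To audit a server against this policy, it can be useful to see the policy written out as plain ufw commands. The sketch below only prints them and changes nothing on the host. The internal subnet (10.20.0.0/24) and the MongoDB port (27017, MongoDB's default) are assumptions, since this guide does not state either value, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: prints rules only, never applies them.

# Assumed values (not stated in this guide): internal subnet and MongoDB's default port.
INTERNAL_NET=${INTERNAL_NET:-10.20.0.0/24}
MONGO_PORT=${MONGO_PORT:-27017}

# print_ufw_baseline: print the ufw commands matching the policy described above.
print_ufw_baseline() {
  local port
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  for port in 22 80 443; do
    echo "ufw allow ${port}/tcp"
  done
  echo "ufw allow from ${INTERNAL_NET} to any port ${MONGO_PORT} proto tcp"
}

# Compare against the live rules with:
#   ansible all -m command -a "ufw status verbose" --become
```

Any real change to the firewall should still go through the common role rather than these commands.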

fail2ban Configuration

Enabled on: All servers
Protected services: SSH
Default ban time: Based on Debian defaults
Recommendation: Consider stricter settings for production

Troubleshooting

SSH Connection Issues

# Test SSH connectivity
ansible all -m ping -vvv

# Manually test SSH
ssh -J root@192.168.1.151 root@10.20.0.10

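# Check SSH service status
# (the systemd module requires state/enabled/etc., so query systemctl directly)
ansible all -m command -a "systemctl status sshd --no-pager" --become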

Ansible Playbook Failures

# Run playbook in check mode (dry run)
ansible-playbook playbooks/site.yml --check

# Run with verbose output
ansible-playbook playbooks/site.yml -vvv

# Run specific tasks with tags
ansible-playbook playbooks/site.yml --tags "docker"

GitHub Runner Issues

# Restart the runner service, then check its status
ssh -J root@192.168.1.151 root@10.20.0.10 \
  "systemctl restart 'actions.runner.*' && systemctl status 'actions.runner.*' --no-pager"

# View runner logs
ssh -J root@192.168.1.151 root@10.20.0.10 \
  "journalctl -u 'actions.runner.*' -n 100 --no-pager"

Docker Issues

# Restart Docker on all servers
ansible all -m systemd -a "name=docker state=restarted" --become

# Check Docker status
ansible all -m shell -a "docker ps"

CI/CD Integration

Automated Ansible Validation

All Ansible code is validated in CI/CD via .github/workflows/ansible-lint.yml:

ansible-lint on all playbooks and roles
Syntax validation
Inventory verification

Future: Automated Ansible Execution

After bootstrap, Ansible can be executed automatically via GitHub Actions on infrastructure changes. This requires:

1. Dedicated automation SSH key
2. GitHub Actions workflow for Ansible execution
3. Secure secrets management in GitHub

Best Practices

Idempotency: All playbooks are designed to be run multiple times safely
Check Mode: Test changes with --check before applying
Verification: Always run verify-all.yml after infrastructure changes
Version Control: Commit all Ansible changes to git
Documentation: Update this guide when adding new roles or playbooks
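
The check, apply, and verify practices above can be chained into one wrapper. This is a minimal sketch, not a script from the repository: the function name and the ANSIBLE_PLAYBOOK_BIN override are hypothetical. Some tasks may legitimately fail in check mode when they depend on changes an earlier task would have made, so treat a failed dry run as a prompt to investigate rather than a definitive error.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; not part of the repository.

# Command to run; overridable so the sequence can be previewed (ANSIBLE_PLAYBOOK_BIN=echo).
ANSIBLE_PLAYBOOK_BIN=${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}

# apply_with_verification PLAYBOOK [extra ansible-playbook args...]
apply_with_verification() {
  local playbook=$1
  shift
  "$ANSIBLE_PLAYBOOK_BIN" "$playbook" --check --diff "$@" || return 1   # dry run first
  "$ANSIBLE_PLAYBOOK_BIN" "$playbook" "$@" || return 1                 # apply for real
  "$ANSIBLE_PLAYBOOK_BIN" playbooks/verify-all.yml                      # confirm end state
}

# Usage (from infra/proxmox/ansible):
#   apply_with_verification playbooks/site.yml --tags docker
```

Setting ANSIBLE_PLAYBOOK_BIN=echo prints the three commands without running them, which is a quick way to review what the wrapper would do.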

Directory Structure

infra/proxmox/ansible/
├── ansible.cfg                    # Ansible configuration
├── inventory/
│   ├── hosts.yml                  # Server inventory
│   ├── group_vars/                # Group variables
│   │   ├── all.yml                # Common variables
│   │   ├── runners.yml            # Runner-specific vars
│   │   ├── cloud_servers.yml      # Cloud server vars
│   │   └── devices.yml            # Device vars
│   └── host_vars/                 # Host-specific variables
├── roles/                         # Ansible roles (7 total)
│   ├── common/
│   ├── docker/
│   ├── terraform/
│   ├── nodejs/
│   ├── github-runner/
│   ├── cloud-app/
│   └── virtual-device/
├── playbooks/                     # Ansible playbooks
│   ├── site.yml                   # Master playbook
│   ├── setup-github-runner.yml
│   ├── setup-dev-cloud.yml
│   ├── setup-prod-cloud.yml
│   ├── setup-virtual-smoker.yml
│   └── verify-all.yml             # Verification playbook
└── README.md                      # Quick reference