Disaster Recovery – Wave One

Overview

If you saw my previous article on Disaster Recovery, you know what Wave One is: mainly, the steps needed before I can start running the Ansible playbooks. This blog post lists the steps I took to get going, then the additional work I needed to do to be ready to start configurations.

Build Templates

As this is a Proxmox environment, I need to set up templates for use when building systems. I downloaded the following Operating System Installation Media (ISOs):

  • Rocky Linux 9.3
  • OpenSUSE Linux 16.0
  • Ubuntu Linux 24.04.2
  • OpenBSD 6.8 (install68)
  • FreeBSD 12.1
  • Solaris 10

Of these, I created two templates to start. The other ISOs will be modified and made available for testing. I will note that these are all systems I’ve used in the past in one workplace or another or for personal projects.

I generally set up the file systems the same way out of habit. The default hard disk is 50 Gigabytes, and I set 2 CPU cores and 2 Gigabytes of RAM. Most of what I’m poking at doesn’t require much more, and I plan on having any additional installations go onto their own drive rather than extending the original boot drive.
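
For anyone following along, creating a template from the Proxmox command line looks roughly like this. It’s only a sketch: the VM ID, storage name, and ISO filename below are placeholders, not necessarily what I used.

  # create the VM shell with 2 cores, 2 GB RAM, and a NIC on vmbr0
  qm create 9000 --name rocky93-template --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
  # attach a 50 GB disk and the installation ISO, and set the boot order
  qm set 9000 --scsi0 local-lvm:50 --ide2 local:iso/Rocky-9.3-x86_64-minimal.iso,media=cdrom --boot 'order=scsi0;ide2'
  # after installing and cleaning up the OS, convert the VM into a template
  qm template 9000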

File System Layout

  • Boot – Location of the kernel and associated files. 2 Gigabytes
  • Root – Root file system. 4 Gigabytes
  • Usr – Utilities. 4 Gigabytes
  • Home – Home Directories. 8 Gigabytes
  • Opt – Optional Applications. 4 Gigabytes
  • Var – Logs and application settings. 8 Gigabytes
  • Swap – 4 Gigabytes
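
If that layout were carved up with LVM (with /boot on its own 2 Gigabyte partition outside LVM), the logical volumes would look something like the following. This is just a sketch; the volume group name is made up.

  lvcreate -L 4G -n root vg_system
  lvcreate -L 4G -n usr  vg_system
  lvcreate -L 8G -n home vg_system
  lvcreate -L 4G -n opt  vg_system
  lvcreate -L 8G -n var  vg_system
  lvcreate -L 4G -n swap vg_system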

Service Account

For my systems, the service account is ‘unixsvc’. The scripts all expect it to have a UID of 5000 and to be a member of the sysadmin group.
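
Creating that account on a fresh system looks something like the following. The comment field is my own filler; the important bits are the UID of 5000 and the sysadmin group.

  groupadd sysadmin
  useradd -u 5000 -g sysadmin -m -c "Unix service account" unixsvc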

Initialization

I have a backup of the old VMware-based VMs. The bad part is I didn’t back up the home directories and didn’t have the cron jobs used to manage the systems documented. I was able to pull the old VMs off of the old VMware ESXi servers, so I recovered most of what I needed.

Once I got pfSense running and properly configured, I set up all the VMs. I have multiple environments, VLAN101 through VLAN105. Since I’m using the pfSense firewall and system-level firewalls, I disabled the firewall setting on every network interface for all the systems. In addition, I needed to add the VLAN tag for each network to each network interface. Once that was done and all systems were accessible, it was on to the next step.
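
The per-interface change can be made in the Proxmox GUI or from the host shell with something like this, where the VM ID, bridge, and VLAN tag are placeholders:

  # turn off the Proxmox firewall on the NIC and tag it for the VM's VLAN
  qm set 101 --net0 virtio,bridge=vmbr0,tag=101,firewall=0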

Setup

As the unixsvc account, I logged into the console for every VM and ran the following commands:

  • hostnamectl set-hostname [hostname].[domain]
  • nmcli con mod ens18 ipv4.method manual ipv4.addresses 192.168.[net].[ip]/24 ipv4.gateway 192.168.[net].254 ipv4.dns 8.8.8.8,192.168.[net].154 ipv4.dns-search '[domain],schelin.org'
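
As an illustration with made-up values (VLAN 101 and a hypothetical tool1 host in an example.org domain), that works out to:

  hostnamectl set-hostname tool1.example.org
  nmcli con mod ens18 ipv4.method manual ipv4.addresses 192.168.101.21/24 ipv4.gateway 192.168.101.254 ipv4.dns 8.8.8.8,192.168.101.154 ipv4.dns-search 'example.org,schelin.org'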

Then I rebooted to apply all the settings. Since everything was on Rocky 9.3, I ran a dnf upgrade -y on every system.

Restoration

For the unixsvc account, I restored its backups to every system where I felt they were needed. Then, from the tool servers, I added the IPs and hostnames (with domain) to the /etc/hosts file. With that, I logged into every system, created the .ssh directory, set it to 0700, then copied the new id_rsa.pub file into the .ssh directory as authorized_keys. This gave the unixsvc account access to all systems in its domain.
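
In shorthand, the per-system key setup was roughly the following, run as unixsvc; the tool server hostname here is just a placeholder:

  mkdir ~/.ssh
  chmod 0700 ~/.ssh
  # copy the new public key from the tool server over as the authorized_keys file
  scp toolserver:.ssh/id_rsa.pub ~/.ssh/authorized_keys
  chmod 0600 ~/.ssh/authorized_keys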

As I gradually restored directories, I copied the backed-up files over and started getting things going again. I still needed to run the initialize Ansible playbook and the unixsuite Ansible playbook to get every system under control and reporting. After that, we’re just setting up systems.

Ansible Playbooks

On the tool servers, before we can run any Ansible playbooks, Ansible has to be installed.

Before that can happen, the epel-release package needs to be installed. Once that’s in place, enable the CodeReady Builder (CRB) repository, then install Ansible. You might also install Python 3.

  • dnf install epel-release -y
  • crb enable
  • dnf install ansible -y
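
A quick sanity check afterwards, assuming an inventory file named hosts, is something like:

  ansible --version
  ansible all -i hosts -m ping -u unixsvc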

Initialization

The first playbook to run is initialization.yaml. I did have a few errors in my script since my servers are now a lot newer. Plus, the root systems are all on 192.168.5.0/24 now, so I had to update a few files where the configuration was still 192.168.1.0/24.
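
Running it is nothing special; the inventory filename here is an assumption on my part:

  ansible-playbook -i hosts initialization.yaml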

It took a bit of searching the error output for something to click. The installation had bailed because I hadn’t used nmcli on every system; I had just manually updated the /etc/resolv.conf file since the DNS servers weren’t up yet. The second error was that mailx has been replaced by a different package, s-nail, which provides mailx. Unfortunately, I then started getting the following error:

fatal: [bldr0cuomknode2]: FAILED! => {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "Shared connection to bldr0cuomknode2 closed.\r\n", "module_stdout": "\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}

It took some searching, but one of the installation steps puts an updated sudoers file in place on the Red Hat side. I hadn’t created the group or added the users to the group first, so the error was that my service account couldn’t become root since it was no longer in the sudoers file. I had to go fix the servers before I could continue.
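
The fix, roughly, was to put the group and the membership in place before the new sudoers file takes over. Something along these lines, where the sudoers rule shown in the comment is my guess at what gets installed:

  # create the group and add the service account to it first
  groupadd -f sysadmin
  usermod -aG sysadmin unixsvc
  # the replacement sudoers then needs a rule along the lines of:
  #   %sysadmin ALL=(ALL) ALL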

One of the more amusing aspects is that the run replaces the sudoers file and, in the meantime, adds the unixsvc account to the new sysadmin group. Unless I log out and back in as unixsvc on the tool server, subsequent runs fail because, of course, group changes only take effect when you log out and back in. Once done, rerunning the playbook works fine.

Conclusion

Once the necessary files are restored (mainly the website on the tool servers for now), we can start actually configuring the various systems. Mainly that means reviewing what I already have configured, followed by creating new playbooks for the other systems.
