Disaster Recovery – Wave One

Overview

If you saw my previous article on Disaster Recovery, you know what Wave One is: the steps needed before I can start running the Ansible playbooks. This blog post lists everything I had to do to get going, then the additional work needed before I was ready to start configurations.

Build Templates

As this is a Proxmox environment, I need to set up templates for use when building systems. I downloaded the following Operating System Installation Media (ISOs):

  • Rocky Linux 9.3
  • OpenSUSE Linux 16.0
  • Ubuntu Linux 24.04.2
  • OpenBSD 6.8 (install68)
  • FreeBSD 12.1
  • Solaris 10

Of these, I created two templates to start. The other ISOs will be modified and made available for testing later. I will note that these are all systems I’ve used in the past, in one workplace or another, or for personal projects.

I generally set up the file systems the same way out of habit. The default hard disk is 50 Gigabytes, with 2 CPU cores and 2 Gigabytes of RAM. Most of what I’m poking at doesn’t require much more, and I plan on putting any additional installations onto their own drive rather than extending the original boot drive.

File System Layout

  • Boot – Location of the kernel and associated files. 2 Gigabytes
  • Root – Root file system. 4 Gigabytes
  • Usr – Utilities. 4 Gigabytes
  • Home – Home Directories. 8 Gigabytes
  • Opt – Optional Applications. 4 Gigabytes
  • Var – Logs and application settings. 8 Gigabytes
  • Swap – 4 Gigabytes
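
The layout above could be expressed in a kickstart file roughly like this. This is a sketch, not my actual template: the disk name (sda), volume group name, and filesystem type are assumptions; only the mount points and sizes come from the list.

```
# Hypothetical kickstart fragment matching the layout above (sizes in MiB)
zerombr
clearpart --all --initlabel --drives=sda
part /boot --fstype=xfs --size=2048
part pv.01 --size=1 --grow
volgroup vg_root pv.01
logvol /     --vgname=vg_root --name=lv_root --fstype=xfs --size=4096
logvol /usr  --vgname=vg_root --name=lv_usr  --fstype=xfs --size=4096
logvol /home --vgname=vg_root --name=lv_home --fstype=xfs --size=8192
logvol /opt  --vgname=vg_root --name=lv_opt  --fstype=xfs --size=4096
logvol /var  --vgname=vg_root --name=lv_var  --fstype=xfs --size=8192
logvol swap  --vgname=vg_root --name=lv_swap --size=4096
```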

Service Account

For my systems, the service account is 'unixsvc'. The scripts all expect it to have UID 5000 and be a member of the sysadmin group.
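
Creating an account that matches those expectations is only a couple of commands. A sketch (must be run as root): the UID of 5000 and the sysadmin group come from above, but the GID of 5000 is my assumption.

```shell
# Create the sysadmin group and the unixsvc service account (UID 5000)
# if they don't already exist; the getent guards make this idempotent.
# GID 5000 for the group is an assumption, not from the article.
if [ "$(id -u)" -eq 0 ]; then
  getent group sysadmin >/dev/null || groupadd -g 5000 sysadmin
  getent passwd unixsvc >/dev/null || useradd -u 5000 -g sysadmin -m unixsvc
fi
```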

Initialization

I have a backup of the old VMware based VMs. The bad part is I didn’t back up the home directories and hadn’t documented the cron jobs used to manage the systems. I was able to pull the old VMs off of the old VMware ESXi servers, so I recovered most of what I needed.

Once I got pfSense running and properly configured, I set up all the VMs. I have multiple environments, VLAN101 through VLAN105. Since I’m using the pfSense firewall plus system level firewalls, I disabled the firewall setting on every network interface for all the systems. In addition, I needed to add the appropriate VLAN tag to each network interface. Once that was done and all systems were accessible, it was on to the next step.

Setup

As the unixsvc account, I logged into the console for every VM and ran the following commands:

  • hostnamectl set-hostname [hostname].[domain]
  • nmcli con mod ens18 ipv4.method manual ipv4.addresses 192.168.[net].[ip]/24 ipv4.gateway 192.168.[net].254 ipv4.dns 8.8.8.8,192.168.[net].154 ipv4.dns-search '[domain],schelin.org'
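
Since the same two commands run on every VM with only the hostname and octets changing, the console step can be scripted. A sketch; the host:net:ip triplets here are made-up examples, and in practice you'd run the generated commands rather than echo them:

```shell
# Generate the hostnamectl/nmcli commands for each host.
# host:net:ip entries below are hypothetical examples.
domain="internal.pri"
for entry in bldr0cuomgit1:101:10 bldr0cuomtool1:101:11; do
  host=${entry%%:*}; rest=${entry#*:}
  net=${rest%%:*};   ip=${rest#*:}
  echo "hostnamectl set-hostname ${host}.${domain}"
  echo "nmcli con mod ens18 ipv4.method manual" \
       "ipv4.addresses 192.168.${net}.${ip}/24" \
       "ipv4.gateway 192.168.${net}.254" \
       "ipv4.dns 8.8.8.8,192.168.${net}.154" \
       "ipv4.dns-search ${domain},schelin.org"
done
```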

Then I rebooted to apply all the information. Since everything was on Rocky 9.3, I then ran dnf upgrade -y on every system.

Restoration

For the unixsvc account, I restored the backups of that account to every system where I felt a backup was needed. Then, from the tool servers, I added the IPs and hostnames (with domain) to the /etc/hosts file. With that, I logged into every system, created the .ssh directory, set it to 0700, then copied the new id_rsa.pub file into the .ssh directory as authorized_keys. This gave the unixsvc account access to all systems in its domain.
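
The per-system key setup boils down to a few commands. A sketch, demonstrated here against a scratch directory: in practice HOME_DIR is /home/unixsvc on each remote host, and id_rsa.pub is the tool server's new public key rather than the stand-in below.

```shell
# Set up .ssh with the right permissions and install the public key
# as authorized_keys. HOME_DIR and the key contents are stand-ins.
HOME_DIR=$(mktemp -d)                                         # stand-in for /home/unixsvc
echo "ssh-rsa AAAA... unixsvc@tool" > "$HOME_DIR/id_rsa.pub"  # stand-in key

mkdir -p "$HOME_DIR/.ssh"
chmod 0700 "$HOME_DIR/.ssh"
cp "$HOME_DIR/id_rsa.pub" "$HOME_DIR/.ssh/authorized_keys"
chmod 0600 "$HOME_DIR/.ssh/authorized_keys"
```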

As I gradually restored directories, I copied the backed-up files over and started getting things going again. I still needed to run the initialize and unixsuite Ansible playbooks to get every system under control and reporting. After that, it’s just setting up systems.

Ansible Playbooks

On the tool servers, Ansible has to be installed before any playbooks can run.

Installing it requires the epel-release package. Once that’s installed, enable the CodeReady Builder (CRB) repository, then install Ansible. You may also want to install Python 3 explicitly.

  • dnf install epel-release -y
  • crb enable
  • dnf install ansible -y

Initialization

The first playbook to run is initialization.yaml. I did have a few errors in it, as my servers are now a lot newer. Plus the root network is 192.168.5.0/24 now, so I had to update a few files where the configuration was still 192.168.1.0/24.
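
Hunting down stale 192.168.1.0/24 references is a good job for grep and sed. A sketch, demonstrated against a scratch directory; in practice WORKDIR would point at the playbook and template tree:

```shell
# Replace the old subnet prefix with the new one across a tree.
# WORKDIR and the sample file are stand-ins for the real repo.
WORKDIR=$(mktemp -d)
echo "ipv4.gateway 192.168.1.254" > "$WORKDIR/old.conf"   # sample stale file

grep -rl '192\.168\.1\.' "$WORKDIR" | while read -r f; do
  sed -i 's/192\.168\.1\./192.168.5./g' "$f"
done
```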

It took a bit of searching the error before something clicked. The run had bailed because I hadn’t used nmcli on every system; I’d just manually updated the /etc/resolv.conf file since the DNS servers weren’t up yet. The second error: the mailx package has been replaced by a different package, s-nail, which provides the mailx command. Unfortunately I then started getting the following error:

fatal: [bldr0cuomknode2]: FAILED! => {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "Shared connection to bldr0cuomknode2 closed.\r\n", "module_stdout": "\r\n", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}

It took some searching, but one of the steps installs an updated sudoers file on the Red Hat side. I hadn’t created the group or added the users to it first, so the error was that my service account couldn’t become root; it was no longer in the sudoers file. I had to go fix the servers before I could continue.
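
The fix, ordering aside, amounts to a drop-in that grants the sysadmin group root. A hypothetical fragment; the exact policy in my sudoers file isn't shown in this post, and you'd validate it with visudo -cf before installing:

```
# /etc/sudoers.d/sysadmin -- hypothetical fragment
%sysadmin ALL=(ALL) NOPASSWD: ALL
```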

One of the more amusing aspects: the playbook replaces the sudoers file and, along the way, adds the unixsvc account to the new sysadmin group. Unless I log out and back in as unixsvc on the tool server (or start a new shell with newgrp sysadmin), subsequent runs fail because, of course, group changes only take effect on a new login. Once done, rerunning the playbook works fine.

DNS

One of the things I needed to do was get my git server up and running. Since I’m having to make changes to the configurations (the 192.168.1.0/24 to 192.168.5.0/24 change, for example), I wanted to make sure my repos were up to date.

First, though, I had to create a git ansible playbook 🙂 There are four servers that need git installed, at least initially. I created the playbook on the tool server and then copied it over to my installation repository. I ran it, and the bldr0cuomgit1 server was ready for updates. I added the git playbooks, and on to DNS.

The main DNS issue was that my production environment was originally prod.internal.pri. I’d changed it but left the prod designation in the template. Since I’m trying to rebuild from scratch, I had to go in and add a check so the named configurations and zone files are created properly: just internal.pri for production, but dev.internal.pri and so on for the other zones.
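
The check in the template can be a simple conditional. A sketch, where the env variable name is an assumption about my inventory rather than the actual variable, and the zone block is a minimal stand-in for the real named configuration:

```
{# Hypothetical jinja2 snippet: prod collapses to the bare domain #}
{% if env == 'prod' %}
{% set zone = 'internal.pri' %}
{% else %}
{% set zone = env ~ '.internal.pri' %}
{% endif %}
zone "{{ zone }}" IN {
    type master;
    file "{{ zone }}.db";
};
```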

And as always, errors creep in. The annoying thing about jinja2 is there isn’t an easy way to determine what the errors are. After a bunch of troubleshooting, I finally just backed up the file, copied over the one from the git server, and gradually updated the template. I reran the ansible-playbook command each time until it matched what I was trying to do, then did a diff against the backed-up file and found the errors, in both places! After sorting them out, I was able to get the production DNS servers up, along with the dev and qa DNS servers.

Stage wasn’t working though: no hosts found. It turned out my automatically generated inventory didn’t have the stage DNS servers tagged. A quick update and stage was working.

And then home wasn’t working: missing name. It took a similar delete-blocks-of-code-until-I-found-it approach. I’d created the CNAME entries as a dictionary rather than designated entries, so I changed dv.name to just dv and it built the master zone file; however, the server failed to start. A quick grep of the messages file found the error: I have CNAMEs for both my development and tool servers. Dev has the dev websites and Tool has the finished builds for testing. Unfortunately it’s the same name duplicated, so I blocked out the Tool side for now in order to get DNS running, and I’ll dig into it later.

There were a few other minor issues with the named.conf files. I had a forwarder block for internal.pri, and changing prod.internal.pri to internal.pri caused a conflict. Since I have a default forwarder, the internal.pri forwarder block was unnecessary, so I removed it and reinstalled.

Finally, I copied all the changed files to my git server and added and committed the changes.

Conclusion

Once the necessary files are restored (mainly the website on the tool servers for now), we can start actually configuring the various systems: mainly reviewing what I already have configured, followed by creating new playbooks for other systems.
