Act II, Take Two

Jun 3, 2021 · 527 words · 3 minute read

In February of this year, I drafted an article called “Act II: Redesign,” as a follow-up to “Another Year Another Crisis,” criticizing the obsolescence of my network and calling for a redesign around OpenStack. I had every intention of publishing said article in February; however, a quick glance at a calendar will reveal it is now June. What happened?

Reality sank in.

On February 13th, I ordered two new Precision workstations I could use to spearhead the effort of transitioning my network to OpenStack. A week later, the machines arrived, and in that week, I built my plans. The two machines would work as redundant nodes so I could reboot one for updates without the VMs going down. I would have an Ansible playbook ready to provision the new machines right away. I’d move the InfiniBand cards I’d previously been using for backups to the new machines so they could be used as a trunk for Neutron traffic between the nodes. Later on, when I’d moved everything over, I’d get up the money to buy more SSDs and reprovision my old machines as redundant GlusterFS nodes, ready to act as a backend to Cinder. Everything would be redundant and resilient, and I’d never have an outage again. It was going to be wonderful.

But then I ran packstack, waited my 15-20 minutes, and watched that idyllic pipe dream fly out the window. It took me two entire days to realize I was simply out of my league. A week later, however, I had a functioning dark cloud of mystery.

OpenStack isn’t a simple project one can dump in /var/www; install Apache, PHP, and MariaDB; and simply call it a day. It’s plenty complex sitting on just one node. Multiple different services need to talk to, and offer themselves to, one another even in a one-node “cluster”. Deploying those services on a second node then becomes even more complex, as one learns of OVS bridges and how they relate to Neutron. To make this system redundant? Well, the OpenStack team is working on that. I built a system that worked, but I had little idea of how it worked or, more importantly, how to back it up.

Then, one day, I decided to run dnf update and learned very quickly why TripleO is a project. The update completed, leaving me unable to access OpenStack’s Horizon web portal. A reboot may fix it. If not, rerunning packstack might. But as life marches on, I find myself with paradoxically little time to do even simple things like these. My OpenStack cluster has been down for at least a couple of weeks now.

So that’s it, then. Give up and try a different solution? Of course not, but it’s clear that I have more homework to do before I can trust myself to run “production” VMs on OpenStack. I need to learn better how the pieces fit together. I need to read through configuration files. I need to learn how to achieve high availability with only two nodes, if that is possible. And above all, I need to learn to back up and maintain my cluster, and to automate that process, since I don’t have the time to do it by hand.