The backbone of my network is a Xeon E3-based “server” I built a few years ago for hosting virtual machines. The server runs six virtual machines, providing services like DHCP, PXE, authentication, DNS, update caching, and a VPN for me to connect to and manage the network from school. While I did have the server running in RAID1, for a long time none of these VMs got properly backed up.
Eventually, as I replaced my main desktop, I converted my old one into a file server, fitting it with five 4 TB Western Digital Red drives, run in RAID-Z2 by ZFS on Linux. I suddenly had a place to back up to, and a place to back up to means a new script to write. This script tells QEMU to create an “overlay” file, which accepts changes while the VM is running. Then, it uses rsync to copy the disk image to my file server. Finally, it merges the overlay file back into the disk image.
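For anyone curious what those three steps might look like, here is a rough sketch rather than my actual script. It assumes the VMs are managed through libvirt, so QEMU is driven with virsh; the domain name, disk target, paths, and the "backup-overlay" snapshot name are all made up for illustration. It only builds the commands and prints them, so nothing is run against a real VM.

```python
import shlex


def backup_commands(domain, disk_target, image_path, dest):
    """Return the three backup steps as argument lists (nothing is executed here)."""
    return [
        # 1. External, disk-only snapshot: new writes now land in an overlay file,
        #    so the base image is no longer written to while the VM keeps running.
        ["virsh", "snapshot-create-as", domain, "backup-overlay",
         "--disk-only", "--atomic", "--no-metadata"],
        # 2. Copy the base disk image to the file server.
        ["rsync", "--archive", "--sparse", image_path, dest],
        # 3. Merge the overlay back into the base image and pivot the VM onto it.
        ["virsh", "blockcommit", domain, disk_target,
         "--active", "--pivot", "--wait"],
    ]


if __name__ == "__main__":
    # Hypothetical VM name, disk target, image path, and destination.
    for cmd in backup_commands("dns01", "vda",
                               "/var/lib/libvirt/images/dns01.qcow2",
                               "fileserver:/tank/backups/vms/"):
        print(shlex.join(cmd))
```

A real script would run each command with subprocess.run(cmd, check=True) and make sure the merge step still happens if the copy fails, since otherwise the VM is left running on its overlay.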
When I started doing this, I noticed it would cause the entire server and all of its VMs to become mostly unresponsive until the backup had completed. At about this time, I learned from Linus Tech Tips that 40 Gbps InfiniBand equipment is absurdly cheap on eBay these days. InfiniBand’s use of RDMA means the CPU isn’t kept so busy crafting packets that it can’t do much of anything else. So, I picked up a couple of 40 Gbps cards and started doing backups that way. This was last December.
Fast forward to March, and I start getting some strange issues. Every once in a while, the server will kernel panic during the file transfer stage. And as if that wasn’t strange enough, when it does this, it knocks the entire house offline. Even resetting the modem fails: as indicated by its status LEDs, it will not connect to the ISP until that server is restarted or unplugged from the network.
Everything I know about networking tells me this makes no sense. The only thing I can think of is that the system spews bogus data through its network interface, which somehow interferes with the operation of the router. Though my network has much to gain from some VLANs in terms of security, these servers are behind a router, which should block any such spurious transmissions.
To troubleshoot, I intend to attach an old hub I have in storage between the server and the router. Since hubs repeat every frame to all ports, I’ll then be able to catch any traffic with Wireshark. If this reveals some packets, even when the server is panicked, I’ll have to do some thinking about what might cause such an aberration. Otherwise, I’ll look more into the router. It is several years old; perhaps it’s simply starting to show its age.
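Once there is a capture to look at, something like the sketch below could tally frames by source MAC address, which would show at a glance whether the panicked server is still transmitting. It assumes the capture was saved in the classic pcap format with microsecond timestamps and an Ethernet link type (recent Wireshark versions default to pcapng, so it would need to be saved as pcap explicitly). The function name is my own invention.

```python
import struct
from collections import Counter


def count_frames_by_source(pcap_bytes):
    """Count Ethernet frames per source MAC in a classic (non-pcapng) capture."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    # The last field of the 24-byte global header is the link type; 1 is Ethernet.
    (linktype,) = struct.unpack_from(endian + "I", pcap_bytes, 20)
    if linktype != 1:
        raise ValueError("capture is not Ethernet")

    counts = Counter()
    offset = 24  # first record starts right after the global header
    while offset + 16 <= len(pcap_bytes):
        # Per-record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset)
        offset += 16
        frame = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        if len(frame) >= 14:
            # Bytes 6..11 of an Ethernet frame hold the source MAC address.
            counts[frame[6:12].hex(":")] += 1
    return counts
```

Reading the file with open(path, "rb").read() and passing the bytes in is enough for a capture of modest size. If the server’s MAC keeps showing up after the panic, the bogus-data theory gains some weight.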
In the meantime, my projects GitLab is unavailable. I’ll put up a temporary projects page until I can get things back up and running. It’s not easy to troubleshoot an offline network from 90 miles away.
Update: I got someone to reset the network for me and it’s back under my control. I’ve come to suspect bad memory may be the cause of the crashes, so next time I’m near my network, I’ll run a memtest.
Update 2: Given the system’s behavior, I am quite certain at this point that my virtualization server has bad memory. It runs for a day or two, then has a kernel panic. I’m getting pretty tired of fixing corrupted VMs, as well. If it were any other week than dead week, I’d run home and check.