Pete's Log: Moderatelier Available Pi Cluster

Entry #2034, Wed, November 17, 2021, 22:49 CST (Home Automation)
(posted when I was 43 years old.)

I'm a little embarrassed that one of the pis in my cluster went down a few days ago and I didn't notice until today.

I'm also a little proud that one of the pis in my cluster went down a few days ago and I didn't notice until today.

Recently we noticed some little footprints in the sandbox in the garage. We're sort of lax about keeping the garage door closed during the day when we're at home, and we do have a lot of critters in the yard. I have a spare camera that I haven't used in a while, so we figured maybe we'd plug it in in the garage and see what we see.

Of course, I only have two outlets by the garage data center and they're both in use. So instead of getting a power strip, I decided this was the excuse I needed for putting a UPS in the garage. Today Jamie and I went on a lunch date and among other things stopped at Micro Center, where the UPS was acquired. (Still no Raspberry Pis in stock at Micro Center, but luckily the cluster still has plenty of capacity.)

Before unplugging things to plug them into the UPS, I wanted to gently shut down the two cluster nodes in the garage. That's when I discovered that one of my nodes was down. So I guess I have now successfully (and involuntarily) tested an involuntary disruption, because all my services were still up and running. I guess I should work on configuring some alerting.
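As a starting point for that alerting, here is a minimal sketch of a check that could run from cron. It assumes kubectl is installed and configured on whatever machine runs it. The function names are made up for illustration, and nothing here is taken from the actual cluster setup.

```python
import json
import subprocess


def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True".

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    down = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A node with no Ready condition at all is treated as not ready.
        if ready is None or ready.get("status") != "True":
            down.append(node["metadata"]["name"])
    return down


def fetch_node_list():
    """Ask kubectl for all nodes as JSON (needs kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling not_ready_nodes(fetch_node_list()) every few minutes and sending any non-empty result to email or a push notification would have caught a node that stayed down for days. A node that has stopped reporting shows up with a Ready status of "Unknown", which this check also counts as down.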

I'm glad I gave all my nodes zone labels, since I forgot where the node that was down was actually located (laundry room). The zone label was definitely quicker than tracking it down on my switches via its MAC address. A hard reboot got the node back online. The only issue was that syslog was being flooded with variants of

orphaned pod found but error not a directory occurred when trying to remove the volumes dir

This issue on GitHub indicated I could just manually delete the stale directories, and that did in fact clear things up for me.
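For anyone facing the same flood, here is a rough sketch of how the stale directories could be listed before deleting anything. It rests on two assumptions that the post does not state: that each log line contains the pod UID in the usual UUID form, and that the directories live under /var/lib/kubelet/pods, the kubelet's default location. The function name is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Pod UIDs are assumed to appear in the log lines as lowercase UUIDs.
UID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

# Assumed kubelet default; adjust if the kubelet uses a different root dir.
KUBELET_PODS_DIR = PurePosixPath("/var/lib/kubelet/pods")


def orphaned_pod_dirs(log_lines):
    """Return the sorted, de-duplicated pod directories named in
    "orphaned pod" log lines. Nothing is deleted here."""
    uids = set()
    for line in log_lines:
        if "orphaned pod" not in line.lower():
            continue
        uids.update(UID_RE.findall(line))
    return [str(KUBELET_PODS_DIR / uid) for uid in sorted(uids)]
```

Feeding it the lines of the syslog file gives a short list to review by hand. The sketch only prints candidates on purpose, since removing the wrong directory under the kubelet's data path could break a running pod.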

So now all was well with the world again and I could start my voluntary disruptions. Drained the two garage nodes, ran a quick apt upgrade on them while I was at it, and then shut them down. Installed the UPS, brought them back up, and now my moderately available cluster is slightly more moderately available, with four out of five nodes now on backup power.
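The voluntary disruption for one node can be sketched roughly as below. The drain flags, the use of ssh, and the idea that a node's hostname matches its Kubernetes node name are all assumptions, not details from the steps described above. The function names are made up for illustration.

```python
import subprocess


def maintenance_commands(node):
    """Return the ordered commands to run before powering a node off."""
    return [
        # Evict pods and mark the node unschedulable. DaemonSet pods stay,
        # and pods using emptyDir volumes lose that scratch data.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Assumes the node is reachable over ssh under its node name.
        ["ssh", node, "sudo apt update && sudo apt upgrade -y"],
    ]


def uncordon_command(node):
    """Return the command that lets pods schedule onto the node again."""
    return ["kubectl", "uncordon", node]


def run_all(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

With dry_run left at True this only prints the commands, which is a cheap way to double-check the node name first. A drained node stays cordoned across a reboot, so the uncordon step is needed once it is back up before anything schedules onto it again.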

Then for good measure I brought the other nodes down one at a time for apt updates. Most services only take a minute or two to come back up when they get evicted from a node. Home Assistant takes about five. But I feel pretty content. Other than my lack of alerting.

Now I'm admiring how good the night vision on this camera is in our dark garage and trying to identify all the reflecty bits bouncing the IR light back at the camera. And also entertained that I can watch a light blinking in the car. No critters so far, though. Guess I should probably go to sleep.