Pete's Log: Moderately Available Pi Cluster

Entry #2017, (Coding, Hacking, & CS stuff, Home Automation)
(posted when I was 43 years old.)

I performed my first availability test on the pi cluster last night. It went fairly well. The test involved a voluntary disruption, i.e. the node went down in a clean fashion. The node in question was hosting mosquitto at the time, so I was able to monitor MQTT status on my music display. It brought me great joy to watch the MQTT status on the display flip to down shortly after I instructed Kubernetes to drain the node, and then flip back to up a little while later without any further intervention on my part. The pod had automatically started up again on a different node, and all the layers in between did their thing to make everything work again. A genuine little victory.

I don't think I'm ready yet to test an involuntary disruption, although I'm more confident now in coping with one of those. The goal isn't even really high availability. My current goals for the cluster are:

  1. Be absurd
  2. Learn things
  3. Minimize MTTR if one of the pis dies

If I get moderate availability as a side effect, so much the better. There's still a decent chance I'm building a house of cards that will come crashing down spectacularly at some point, but even then I'll have accomplished goals 1 and 2.


The availability test wasn't primarily planned as a test; it was a bonus from some reshuffling I did to the cluster. I bought a new pi yesterday and decided this one would live on my desk. Since the cluster was now spread across three locations and already consisted of three masters, I decided the new pi should become a master so that each location would host exactly one. That meant one of the existing masters had to be demoted.

It was also time to repurpose the pi that had been hosting the dedicated prometheus and grafana servers, since those services are now running well enough on the cluster. So step 1 was to bring that pi up as a worker node in the cluster. Once that was done, it was time to delete the demoted master with these steps (rough commands follow the list):

  1. Disable scheduling for that node in Longhorn and request eviction of all replicas hosted on it
  2. kubectl drain the node
  3. kubectl delete the node
  4. uninstall k3s from the node
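
For the record, steps 2 through 4 boiled down to something like the following. The node name is a placeholder, and the drain flags are the ones I'd expect to need for a node running DaemonSets and pods with local scratch volumes:

# step 1 happens in the Longhorn UI (disable scheduling on the node, then evict its replicas)
kubectl drain old-master --ignore-daemonsets --delete-emptydir-data
kubectl delete node old-master
# then, on the node itself, k3s ships its own uninstall script
sudo /usr/local/bin/k3s-uninstall.sh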

Other than brief outages for the services running on that node at the time, everything went great. The only problem I encountered came when I tried to join my new master to the cluster: I got an error that k3s could not start. In my syslog I found that etcd wouldn't allow a new node to join because I had an unhealthy cluster:

Oct 9 22:27:30 server-03 k3s[7497]: time="2021-10-09T22:27:30.952148133-05:00" level=fatal msg="starting kubernetes: preparing server: start managed database: joining etcd cluster: etcdserver: unhealthy cluster"

OK, yes, I agree the cluster is unhealthy since it only has two masters, but that's exactly why I'm trying to add a third master node. After some digging I learned that kubectl delete deletes the node from the Kubernetes cluster but not from the etcd cluster. And k3s doesn't yet include etcd management tools such as etcdctl. Luckily I found instructions to download etcdctl and run it against a k3s cluster. So with that in hand, it was just a question of getting the etcd id of the node I had deleted and then deleting it in etcd.

sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
  ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
  ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
  ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
  ETCDCTL_API=3 etcdctl member list

sudo ETCDCTL_ENDPOINTS='https://127.0.0.1:2379' \
  ETCDCTL_CACERT='/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt' \
  ETCDCTL_CERT='/var/lib/rancher/k3s/server/tls/etcd/server-client.crt' \
  ETCDCTL_KEY='/var/lib/rancher/k3s/server/tls/etcd/server-client.key' \
  ETCDCTL_API=3 etcdctl member remove nodeid-from-above
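
(In case you're curious where etcdctl itself came from: I just pulled a standalone binary out of an etcd release tarball. The version and arm64 architecture below are assumptions you'd want to match to your own cluster.)

ETCD_VER=v3.5.0
curl -LO https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-arm64.tar.gz
tar xzf etcd-${ETCD_VER}-linux-arm64.tar.gz
sudo cp etcd-${ETCD_VER}-linux-arm64/etcdctl /usr/local/bin/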

Phew. Now my new pi master was able to join the cluster, and the cluster returned to a healthy state. Then I joined the demoted pi back as a worker, bringing the cluster up to three masters and two workers. k3s doesn't taint its master nodes, so all five nodes actually run workloads.
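
For reference, joining nodes to an existing k3s cluster with embedded etcd looks roughly like this; the hostname is a placeholder, and the token comes from any existing master:

# on an existing master, grab the cluster join token
sudo cat /var/lib/rancher/k3s/server/node-token

# join a new node as a master (a server in k3s terms)
curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://existing-master:6443

# join a new node as a worker (an agent in k3s terms)
curl -sfL https://get.k3s.io | K3S_URL=https://existing-master:6443 K3S_TOKEN=<token> sh -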

I should really try to understand etcd quorum better, because something in the back of my head tells me I'm probably doing something wrong. But time is precious and my todo list is long. This really is a baffling hobby.

I also updated my private docker registry to run two replicas because when the registry goes down, it creates a bit of a bottleneck on the cluster if other things are also reshuffling. k3s decided to schedule both registry pods on the same node, which was not entirely my intention.
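
The replica bump itself was the easy part; something like the following, where the deployment name is a stand-in for whatever yours is actually called:

kubectl scale deployment/registry --replicas=2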

So I ended up tagging all my nodes with a topology.kubernetes.io/zone label - one of laundry-room, garage, or basement-office as appropriate. And I think I was giggling to myself while doing so, because this felt like achieving the absurdity goal for sure. Once the zone labels were in place, I added a topologySpreadConstraints stanza to my registry deployment configuration that instructs Kubernetes to spread the pods across zones. So now my pods aren't just spread across two nodes; those nodes are also in separate zones.
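
The labeling is plain kubectl, and the spread constraint is a small addition to the deployment's pod spec. A rough sketch, with made-up node names and an assumed app: registry pod label:

kubectl label node pi-laundry topology.kubernetes.io/zone=laundry-room
kubectl label node pi-garage topology.kubernetes.io/zone=garage
kubectl label node pi-desk topology.kubernetes.io/zone=basement-office

# added to the registry deployment's pod template spec
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: registry

With DoNotSchedule, an extra replica stays pending rather than landing in the same zone as an existing one; ScheduleAnyway would be the softer option.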

Not that the zones matter a whole lot, since there are plenty of single points of failure. But it was a fun way of achieving the goal of just having the pods on different nodes.