Over the past year or so, I’ve been using kubeadm to build and upgrade my clusters, starting with Kubernetes 1.14. I switched to kubeadm from the home-grown scripts I’d initially created for the 1.2 installation and maintained through 1.12, in large part for the automatic certificate renewals performed during upgrades, but also because of all the version-to-version changes that had to be tracked. The upgrade from 1.9 to 1.10 required a total rebuild of the clusters due to changes in the underlying networking tools, and at 1.12 the certificates had expired, causing no end of problems.
Every quarter, I’d research the upgrades, write up a page on what was changing, and create a doc on the upgrade process.
Recently, when the first Master Node was rebooted, it failed to start up. Researching the problem showed that the second and third Master Nodes started up without any issue. Comparing the nodes revealed that /etc/kubernetes on the healthy nodes contained both a kubelet.conf and a bootstrap-kubelet.conf file, and apparently the bootstrap-kubelet.conf file was refreshing the kubelet certificate; the first Master Node, however, had no bootstrap-kubelet.conf file.
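A quick way to see why only the first node was affected is to check when the client certificate embedded in kubelet.conf actually expires. Something along these lines should work, assuming the kubeadm default paths and a kubelet.conf that still embeds the certificate inline:

```bash
# Extract the base64-encoded client certificate embedded in kubelet.conf
# and ask openssl when it expires (kubeadm default paths assumed; if the
# file references a certificate on disk instead, grep will find nothing).
sudo grep 'client-certificate-data' /etc/kubernetes/kubelet.conf \
  | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -subject -enddate
```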
While certificate management is handled automatically by the kubeadm upgrade process, the kubelet is not part of that process. It’s a separate binary that needs its own client certificate, and that certificate isn’t renewed by kubeadm.
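kubeadm’s own expiration report makes this visible. On the 1.18 series the command still lives under the alpha subcommand, and the kubelet client certificate is absent from the list it prints:

```bash
# List the expiry of every certificate kubeadm manages; the kubelet's
# client certificate is not among them, since kubeadm does not renew it.
sudo kubeadm alpha certs check-expiration
```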
Further review found that this had been a known bug, fixed as of Kubernetes 1.17. In older versions, kubeadm embedded the client certificate directly in the kubelet.conf file. The kubelet was actually rotating that certificate, but it wrote the rotated certificate to a separate file, /var/lib/kubelet/pki/kubelet-client-current.pem, and kubelet.conf was never updated to point to it. The config file still contained the old, expired certificate.
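You can confirm that rotation has been happening all along by looking at the rotated certificate the kubelet maintains (again, kubeadm default paths):

```bash
# The kubelet writes rotated client certificates here; "current" is a
# symlink to the newest one.
ls -l /var/lib/kubelet/pki/
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem \
  -noout -subject -enddate
```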
Modifying kubelet.conf to point at the rotated certificate took care of the problem, and resolved it for future upgrades as well, since the kubelet keeps that file current.
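The change itself is small: in the user section of /etc/kubernetes/kubelet.conf, replace the embedded client-certificate-data and client-key-data entries with file references, then restart the kubelet. The result looks something like this (the user name varies by setup; default-auth is just an example):

```yaml
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
```

Both entries point at the same file because the kubelet stores the rotated certificate and its private key together in kubelet-client-current.pem.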
The bootstrap-kubelet.conf file itself had been identified as a security issue, and at least as of 1.18.8 (what I’m currently running), it is deleted after it has been used to bootstrap a new Master Node into the cluster.