The dark side of careful planning
It's probably because of my lack of experience as a system admin, but when I have to organize a server maintenance that involves many distinct steps, I write a list.
A list? Yes, a list of all the steps I have to take. The main reason I do this is that it keeps my mind at ease: it's a way to be sure I won't forget something important (e.g. changing the RAM when downgrading a server's resources, resuming the monitoring on a prod server, etc.). To be honest, as far as I know, I'm the only one in my team who does this. I expect my colleagues just go and perform the maintenance just like that, winging it.
Most of the time, I can't… or I don't want to do that. I just don't feel sure enough of myself yet to have that level of, well, confidence.
To be fair, even for a seasoned admin, I see lists like these as a way to minimize mistakes and avoid forgetting something important.
That said, last Friday I learnt the hard way that while lists can be helpful, you shouldn't mindlessly follow them when you actually perform the maintenance. Why? Because you run the risk of making a stupid mistake. What kind of mistake? Well, I'm glad you asked!
Last Friday, the root partition of an important prod machine was getting really close to being completely full. I think there were only double digits of KB left on it. It was that bad. I figured that with the weekend approaching, I didn't want to have to fix this during my time off, so I just went ahead and scheduled a maintenance.
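For the record, a quick check like this is all it takes to see how bad things are (just look at the Avail column for the root filesystem):
df -h /    # human-readable size, used and available space for /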
What should have been a 10-minute operation caused nearly one hour of downtime. Why? Because instead of extending the VM's disk by 16GB, I… I… extended it by 16TB. When I fuck up, I don't like to half-ass it.
What I realized after the mistake was done was that what I had prepared for was changing a RAM amount… in MiB. So I calculated 16*1024 to get a whopping 16384, thinking in MiB. Except the field I was filling in was the disk size, in GiB, so that was 16384 GiB of hard disk added. This was stupid. Very stupid. The only reason it happened in the first place was that I was so confident in the steps I had written down to prepare for this maintenance that I completely unplugged my brain while following them.
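Just to spell out the unit mix-up, here is the same arithmetic written out in shell:
echo $((16 * 1024))      # 16384 — correct if the field expects MiB of RAM
echo $((16384 / 1024))   # 16 — but in a GiB disk field, 16384 means 16 TiB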
You DO NOT want to unplug your brain for something like this. So on top of feeling extremely dumb and ashamed of my mistake, it still needed to be fixed. We, by the way, didn't and still don't have 16TB of free disk space on that hypervisor, or on all the hypervisors combined, for that matter.
So because it was urgent, someone else helped me fix it, well… kind of as an emergency. In the end, everything turned out OK: the machine came back up, I extended the partition as I had originally intended, and that was pretty much it.
Just to add: the easy fix was to SSH to a node that has access to the disk file (qcow2 format) and use these commands (while the VM is, obviously, turned off):
qemu-img resize filename.qcow2 -16000GB
qm rescan
After the second command, you should see that the disk size has been reduced correctly. Just start the VM and confirm that everything is still working as expected (a good idea, if you have the space for it, is to make a backup of the qcow2 file before you run the resize command; you know, just in case). I tested all of this in a test environment of the hypervisor we use.
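And for anyone who wants a slightly more defensive version of the above, here is roughly what the full sequence can look like (the VM ID 100 and the disk file name are placeholders I made up, and note that recent qemu-img versions refuse to shrink an image unless you pass an explicit --shrink flag):
qm stop 100                                            # make sure the VM is powered off
cp vm-100-disk-0.qcow2 vm-100-disk-0.qcow2.bak         # backup first, if you have the space
qemu-img resize --shrink vm-100-disk-0.qcow2 -16000G   # give back the space that was over-added
qm rescan                                              # let the hypervisor pick up the new size
qm start 100                                           # boot the VM and check everything still works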
Anyway, is this what people talk about when they mention horror stories? Because that didn't feel good. Not at all. I know that I'm far from being the first person to make a stupid mistake, but still. It taught me a valuable lesson: writing a list to prepare for a maintenance is good, but it should not be a substitute for using your brain.