One of my past roles included managing a networking lab and a handful of virtualized web apps and their ESXi hosts. If something went wrong, I was the one to receive a phone call at 4am.
Now, things in the lab generally ran smoothly and we had decent processes in place. App upgrades always included taking a VM snapshot and, after successful deployment and verification, the snapshot was deleted.
As per usual, a new release went live and all was well. Weeks passed and the release became history. That is, until I got a phone call from the Netherlands at 4am.
Our brief conversation went something like this:
“Drew, I’m sorry to wake you, but we can’t access [$system] and [$otherSystem] seems to be misbehaving as well. We waited as long as we could to call.”
“Let me get my computer and check it out. Thanks for letting me know. I’ll message you when I know more.”
My laptop screen felt like it was on torch mode 🔦 as I rubbed my eyes and connected to the VPN. I logged into vCenter and saw that the host was up and responding. As I checked things over, I noticed the little alarm icon ❗ on the host. “Datastore usage on disk,” said the alarm tab. At this point, I was confused. The few systems running on this host were provisioned ~500 GB of storage, and they were on a 3 TB datastore.
“How the $#%& is it out of space?” I exclaimed into the empty room, illuminated only by that obnoxiously bright screen.
Those of you reading may have already guessed what happened: I neglected to delete the snapshot after the last deployment 😑. The snapshot for this 80 GB VM had slowly consumed nearly 2.5 TB, clogging up the datastore and choking out normal operations. Deleting the snapshot and rebooting the affected systems was a simple fix to a problem that could have easily been avoided.
Our workflow would have been sufficient if I hadn't allowed myself to get distracted. Snapshots are not a backup. Those alarm notifications could have been sent to my email too. Oh well, I learned about storage from that!
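In hindsight, even a tiny scheduled check for forgotten snapshots would have flagged this weeks earlier. Here is a minimal sketch of the idea in Python, not what we actually ran. The `Snapshot` record, the `find_stale_snapshots` helper, the VM names, and the three-day threshold are all hypothetical, and fetching the real snapshot inventory from vCenter is left out on purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    """Hypothetical record for one VM snapshot (not a vSphere API type)."""
    vm: str
    name: str
    created: datetime


def find_stale_snapshots(snapshots, now, max_age=timedelta(days=3)):
    """Return snapshots older than max_age, oldest first.

    The three-day default is an assumed threshold; pick whatever
    matches your own deploy-and-verify window.
    """
    stale = [s for s in snapshots if now - s.created > max_age]
    return sorted(stale, key=lambda s: s.created)


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    now = datetime(2024, 1, 31, 4, 0)
    inventory = [
        Snapshot("web-app-01", "pre-upgrade", datetime(2024, 1, 3, 22, 0)),
        Snapshot("web-app-02", "pre-upgrade", datetime(2024, 1, 30, 22, 0)),
    ]
    for snap in find_stale_snapshots(inventory, now):
        age_days = (now - snap.created).days
        print(f"{snap.vm}: snapshot '{snap.name}' is {age_days} days old")
```

Run something like this on a schedule and send the output to the same inbox as the datastore alarms, so a forgotten snapshot gets noticed before it fills the disk.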
Now it’s your turn: what things have you learned the hard way?
Community Manager \\ NetApp