One of my past roles included managing a networking lab and a handful of virtualized web apps and their ESXi hosts. If something went wrong, I was the one to receive a phone call at 4am.
Now, things in the lab generally ran smoothly and we had decent processes in place. App upgrades always included taking a VM snapshot and, after successful deployment and verification, the snapshot was deleted.
As per usual, a new release went live and all was well. Weeks passed and the release became history. That is, until I got a phone call from the Netherlands at 4am.
Our brief conversation went something like this:
“Drew, I’m sorry to wake you, but we can’t access [$system] and [$otherSystem] seems to be misbehaving as well. We waited as long as we could to call.”
“Let me get my computer and check it out. Thanks for letting me know. I’ll message you when I know more.”
My laptop screen felt like it was on torch mode 🔦 as I rubbed my eyes and connected to the VPN. I logged into vCenter and saw that the host was up and responding. As I was checking things over, I noticed the little alarm icon ❗ on the host. “Datastore usage on disk,” said the alarm tab. At this point, I was confused. The few systems running on this host were provisioned ~500 GB of storage and they were on a 3 TB datastore.
“How the $#%& is it out of space?” I exclaimed into the empty room, illuminated only by this obnoxiously bright screen.
Those of you reading may have already guessed what happened: I neglected to delete the snapshot after the last deployment 😑. The snapshot for this 80 GB VM had slowly consumed nearly 2.5 TB, clogging up the datastore and choking out normal operations. Deleting the snapshot and rebooting the affected systems was a simple fix to a problem that could have easily been avoided.
Our workflow would have been sufficient if I hadn't allowed myself to get distracted. Snapshots are not a backup. Those alarm notifications could have been sent to my email too. Oh well, I learned about storage from that!
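A small scheduled check can catch a forgotten snapshot long before it fills a datastore. What follows is only a minimal sketch in Python, not what we actually ran: the `Snapshot` record, the function names, the three-day age limit, and the 80% usage threshold are all hypothetical, and the inventory is assumed to have already been pulled from the hypervisor by some other means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    """Hypothetical record for one VM snapshot (not a vSphere API type)."""
    vm_name: str
    name: str
    created: datetime
    size_gb: float


def stale_snapshots(snapshots, now, max_age=timedelta(days=3)):
    """Return the snapshots that are older than max_age."""
    return [s for s in snapshots if now - s.created > max_age]


def datastore_over_threshold(used_gb, capacity_gb, threshold_pct=80.0):
    """Return True when datastore usage is at or above threshold_pct."""
    return 100.0 * used_gb / capacity_gb >= threshold_pct


if __name__ == "__main__":
    # Illustrative numbers loosely based on the story above.
    now = datetime(2024, 1, 30)
    inventory = [
        Snapshot("web-app", "pre-upgrade", datetime(2024, 1, 2), 2500.0),
        Snapshot("lab-tools", "pre-patch", datetime(2024, 1, 29), 4.0),
    ]
    for snap in stale_snapshots(inventory, now):
        age_days = (now - snap.created).days
        print(f"STALE: {snap.vm_name}/{snap.name} is {age_days} days old, "
              f"{snap.size_gb:.0f} GB")
    if datastore_over_threshold(used_gb=2950.0, capacity_gb=3000.0):
        print("WARNING: datastore usage is above the threshold")
```

Run from a daily job that emails its output, something like this turns "I forgot" into a nag the next morning instead of a 4am phone call.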
Now it’s your turn: what things have you learned the hard way?
There are baby photos of me in front of an Apple II (I come from geek stock). As soon as I could read the labels on the disks (shout out to Frogger, Ghostbusters, California Games), I started to wonder what the rest of the stuff on them meant (Nashua, Memorex, as well as "DSDD", etc.).
I finally found out in 1993. I was doing high school computing, and my teacher started drawing a stack of big ovals on the board and drilling into us what cylinders, heads and sectors were. At about the same time, I found out about the "undelete" command on DOS, and started understanding filesystems. In 1994 we were gifted some HP Apollo systems from a local university, a friend's dad had a Sun IPC, another friend had a Slackware system, and I got into Unix.
I went to university, did a year of student exchange in Canada, got 92% for a unit on Unix and C, and when I graduated, started working as a sysadmin at a university here in Australia. I learnt about FCAL attached drives, RAID systems, and RAID system rebuilds (thanks, Deathstars), relearnt about SCSI, and after 8 years of variations of that, moved back to Canada and got a job at another university with about 6PB of NetApp systems on the floor (R200, FAS3050, FAS3170, FAS3270). After three years there, finding out the ins and outs of aggregates, raid groups, volumes and LUNs, I went on to a job at a Partner as a Solutions Architect/PS Engineer, doing NetApp (and other vendors') implementations.
And that's the story of how I met your mother(board).
I think I've learned the most about storage from situations where people who should have known better didn't do the right thing, AKA storage failures.
Like a lot of people, I came to learn enterprise storage later in my career - up until that point, I had mostly heard about storage when something went down. I remember thinking "If there are so many disks, how can just one failing make all the data go away?" As the operation I was responsible for grew, we started to absorb storage infrastructure along with compute, so I had to take a crash course in storage resiliency. I got to learn about spares: "Makes sense for a car - but dang these drives are expensive. I can't afford to just have drives sitting around doing nothing, okay, fine..." I got to learn about reconciling "unreasonable" customer requests with the storage we had available: "No, we're not going to host your backup file share on a RAID 10 array."
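The cost behind both the spares grumble and the RAID 10 refusal is easy to put numbers on. Here is a rough sketch in Python using the standard capacity rules (RAID 10 mirrors everything, RAID 5 gives up one drive to parity, RAID 6 gives up two). The function name and the 12-drive shelf are made up for illustration, and real arrays lose a bit more to formatting and vendor overhead.

```python
def usable_tb(level, drives, drive_tb, spares=0):
    """Approximate usable capacity in TB for a single RAID group.

    Hot spares are subtracted first because they hold no data
    until a drive fails.
    """
    active = drives - spares
    if level == "raid10":
        if active < 4 or active % 2:
            raise ValueError("RAID 10 needs an even number of active drives, at least 4")
        return (active // 2) * drive_tb  # half the drives are mirror copies
    if level == "raid5":
        if active < 3:
            raise ValueError("RAID 5 needs at least 3 active drives")
        return (active - 1) * drive_tb  # one drive's worth of parity
    if level == "raid6":
        if active < 4:
            raise ValueError("RAID 6 needs at least 4 active drives")
        return (active - 2) * drive_tb  # two drives' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical shelf: 12 x 4 TB drives, 2 of them kept as hot spares.
    for level in ("raid10", "raid6", "raid5"):
        print(level, usable_tb(level, drives=12, drive_tb=4, spares=2), "TB usable")
```

On that made-up shelf, RAID 10 leaves 20 TB usable out of 48 TB raw, while RAID 6 leaves 32 TB, which is why a capacity-hungry backup share is a hard sell for mirrored storage.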
Mostly I learned that you need to view storage as the underpinning of your enterprise. Much like networking, if you don't get the basics right then all the fancy virtualization, GPU processors and blade servers don't get much work done - or worse, you disrupt your customers' business and lose their trust. It was a steep learning curve, but one that was critical for my career growth and the success of my endeavors.