Flying Through Clouds 6: Story Time: Calming the Storm! Part II

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers


Just as Sunil is taking the first sip of his Mochachino-lite, Don storms into his office.


“Sunil, we are having a major problem this week! The access to our HR database is horrendous! Multiple users are reporting that it is taking 3 minutes to run queries that were only taking 30 seconds last week. This is slowing down our ability to complete the annual review process!!! HR is telling me everyone’s raises could be delayed this year!”


Sunil coughs as coffee goes down the wrong hole.


“Don, I don’t understand! Did something change in your environment?  This has been running great for 6 months now.  Give me an hour and I will look into it and get back to you.”


Without knowing anything about the new R&D application, Sunil sighs and begins to log in to OnCommand Insight (OCI).


Sunil first looks at VM2. He quickly realizes that VM2’s IOs are much lower than the average over the past several weeks and that latencies are up. He uses the OCI correlation techniques and quickly determines that the decrease in IOs per second (IOPs) and the increase in latency that VM2 is experiencing are caused by additional work occurring on the same node and data aggregate as VM2. He looks at the top correlated work on the node and sees that VM30 is generating a large amount of IOPs on the same storage pool (aggregate) as VM2. He checks the documented SLO agreements and finds that VM30 is a new experimental app whose owners purchased 300 IOPs at 20 ms latency.


“What in the heck is VM30 up to?  They bought 300 IOPs and they are consuming almost 1200 IOPs.” 


He checks a few other stats and confirms that VM30’s latencies look great, but the application is pounding the storage.





He calls the application owner, Steve. 


“Hey buddy, what’s the deal? You guys requested and paid for only 300 IOPs, and your new application is running hot at 1200 IOPs.”


“I’m sorry, Sunil; we didn’t know,” says Steve. “We are still in our testing phase and won’t really be able to alter its IO pattern until we get to phase 5 of our process, 6 months out or so.”


Sunil replies, “Well, do you want to purchase 1200 IOPs and lower latencies? I can move you up a service platform, but it will cost you.”


“I am sorry, bro, but we can’t afford that, and we can’t tune the application for several more months. Is there anything you can do on your end?”


Calming the Storm

Sunil hangs his head as he says his goodbyes to Steve. As he is pondering, he remembers that clustered Data ONTAP Quality of Service (QoS) allows you to apply a policy that limits the maximum IOPs for a particular workload. He jumps up from his seat and wastes no time in taking action. He creates and applies a QoS policy to the R&D workload. Within the hour, the QoS policy restricts the workload to a maximum of 300 IOPs. Once the policy is applied, the QoS limit works quickly to calm the “storm workload” on VM30, and VM2 returns to good performance!
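For readers who want to see the idea behind a maximum-IOPs limit in miniature, here is a rough Python sketch. It is not how Data ONTAP implements QoS, and the function name `apply_iops_ceiling` is made up for this illustration. It assumes that IOs above the ceiling are held back in a queue rather than dropped, and it uses the numbers from Sunil's story: VM30 asking for about 1200 IOPs against a limit of 300.

```python
def apply_iops_ceiling(demand_per_second, max_iops):
    """Cap the IOs served in each one-second interval at max_iops.

    Hypothetical helper for illustration only; it is not an ONTAP API.
    Assumption: IOs above the ceiling are not dropped, they wait in a backlog.
    """
    served = []
    backlog = 0
    for demand in demand_per_second:
        pending = backlog + demand          # new IOs plus anything still waiting
        completed = min(pending, max_iops)  # never serve more than the ceiling
        served.append(completed)
        backlog = pending - completed       # the rest waits for the next second
    return served, backlog


# VM30 asks for roughly 1200 IOPs but purchased only 300.
served, backlog = apply_iops_ceiling([1200, 1200, 1200], max_iops=300)
print(served)   # [300, 300, 300]
print(backlog)  # 2700
```

In this toy model the throttled workload pays for its excess with a growing backlog of its own, while the storage pool it shares with neighbors such as VM2 never sees more than 300 IOs per second from it.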


VM2 Happy Again










Relieved, he wastes no time in dialing Don’s office number.


“Hello Don,” says Sunil.


“Yes, Sunil, do you have good news for me?” replies Don.


“How is your HR access now?”


The phone goes silent, with the faint sound of keystrokes in the background. Then Don gasps and says, “We’re back! You are a wizard! What did you do?”


“Only my job, Don, but lunch is on you next time?” asks Sunil.


“My treat for sure,” says Don.