Flying Through Clouds 2: Storage Performance Philosophy

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers

Welcome back to Flying Through Clouds, Bhavik here. Thanks, Dan, for the introduction in our first article of the series. Whenever there is a need to purchase or build out new infrastructure, performance is one of the most important criteria folks weigh when making those decisions. In fact, if you ask admins about their systems, more often than not the happy ones will say, “I’m happy with the way it is performing.”

Performance can be described in different ways depending on your requirements. Taking the analogy of flying: in the early 2000s, if my goal was to travel from New York to London in the least amount of time, I would have picked the Concorde, which covered that distance in about 3 hours and 30 minutes. But if my goal was to move the maximum number of people, a Boeing 777-300 would carry three to four times as many passengers as a Concorde. The Concorde was faster at transporting a single person, but the 777-300 was more efficient at transporting many.

As you think about storage performance in your cloud environment, the main criterion that determines whether the system is performing well is the latency that the application and end users experience. Latency is the time it takes for an end user to retrieve or process data, and it is one common basis for defining performance service level agreements (SLAs). The typical latency curve looks like a hockey stick: good latencies sit on the blade, and bad latencies climb the handle. What pushes latency from blade to handle? Exhausting resources such as disk, CPU, and memory. When those resources run out, operations queue up waiting for them, and latency shoots up very quickly, just like the handle of the hockey stick.
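
To put rough numbers on that hockey stick, here is a quick sketch using a simple M/M/1 queueing model. This is my simplification for illustration, not an ONTAP formula, and the 1 ms service time is a made-up value:

    # Latency hockey stick under a simple M/M/1 queueing model:
    # response_time = service_time / (1 - utilization).
    SERVICE_TIME_MS = 1.0  # arbitrary illustrative service time, not measured

    for utilization in (0.10, 0.30, 0.50, 0.70, 0.90, 0.95, 0.99):
        response_ms = SERVICE_TIME_MS / (1.0 - utilization)
        print(f"{utilization:4.0%} busy -> {response_ms:6.1f} ms")

    # At 50% busy, latency is only 2x the unloaded service time (the blade);
    # at 90% busy it is 10x, and at 99% busy it is 100x (the handle).

A real controller’s curve is more complicated than this, but the shape is the same, which is why the water levels below sit near 50%.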


You can keep your cloud environment happy by keeping the latencies from the storage system on the blade. I’d like to share an analogy that Tony Gaddis, one of the leading performance experts at NetApp, uses to explain this: think of the system in terms of buckets and water levels. The key to keeping latencies low is to avoid letting the water rise past each bucket’s level. According to Tony, what are the main buckets and water levels in clustered Data ONTAP?

  • Disk/aggregate busy: keep disk utilization below 50%.
  • CPU: keep average processor utilization below 50% (the average across all processors, not the CPU_BUSY counter).
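
To make these water levels concrete, here is a minimal sketch of a threshold check. The metric names are hypothetical placeholders rather than real ONTAP or OCI counter names; in practice you would feed in values collected from OCI or from the ONTAP counters we’ll cover in the next entry:

    # Hypothetical water-level check. The metric names below are
    # illustrative placeholders, not real ONTAP or OCI counter names.
    WATER_LEVELS = {
        "disk_busy_pct": 50.0,  # disk/aggregate busy water level
        "avg_cpu_pct": 50.0,    # average processor utilization water level
    }

    def check_water_levels(metrics):
        """Return a warning for each bucket above its water level."""
        warnings = []
        for name, limit in WATER_LEVELS.items():
            value = metrics.get(name)
            if value is not None and value >= limit:
                warnings.append(
                    f"{name} at {value:.0f}% exceeds the {limit:.0f}% water level"
                )
        return warnings

    # Example: disks are comfortably on the blade, but CPU is up the handle.
    print(check_water_levels({"disk_busy_pct": 35.0, "avg_cpu_pct": 72.0}))
    # ['avg_cpu_pct at 72% exceeds the 50% water level']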

 

 

These water levels will help you keep your systems happy, keep latencies low, and maintain predictable failover performance: if each node of an HA pair stays below 50% busy, the surviving node can absorb its partner’s load after a failover without climbing the handle. Depending on your application performance service level agreements (SLAs), the levels can be adjusted accordingly. Tools like NetApp OnCommand Insight (OCI) can help you track these metrics and trend them over time.

In the next entry, we will map the ONTAP buckets and water levels to the corresponding metrics in OCI. And for you old-school CLI geeks, we’ll show you which counters to look at in ONTAP. Stay tuned!