
Flying Through Clouds 8: Story Time: Guilty Until Proven Innocent

By Dan Chilton and Bhavik Desai, NetApp Solutions Performance Engineers

 

Our storage performance series continues.

 

How does the accused plead?  Guilty or Not Guilty?

 

Sunil was worried.  He kept getting calls from Steve and from his manager complaining that the data warehouse was too slow.  The end-of-month full scan of the data warehouse, which was essential for inventory, was not going to finish in time for the new orders to be placed.  The virtualization team was pointing the finger at the storage . . . pointing the finger at him!  He knew this database was stored on storm-03, a NetApp storage controller with a mix of 15K RPM SAS drives and Flash Cache.  He pulled up OnCommand Insight (OCI) and glanced at the controller and aggregate utilization.  With storage controller system utilization at 12% and disk utilization at 9%, this did not look like a storage bottleneck.

 


Why did it always feel like the storage was guilty until proven innocent?

 

He checked the two data warehouse servers, VM37 and VM38, and overlaid the ESX host server utilization with the VMs.

The ESX server had CPU utilization of only 5.8%.  VM37 had throughput of 67 MB/sec and VM38 had 96 MB/sec, for a total of 163 MB/sec.

 

Sunil called Steve and asked, “How much throughput are the data warehouse guys expecting?”

 

Steve replied, “At least 300 MB/sec at peak performance.”

 

Sunil said, “Well, I checked the storage controller and it does not look like the problem.  I think this might be a network problem between the VMware server and the storage.  Did you check to make sure that the ESX server has 10 Gb Ethernet connections to the NFS datastore?  I don’t have access to log in.”

 

Steve said, “I think it does, but I can’t log in either, and Joe the VMware guy is on vacation.”

 

Sunil said, “I’ll check a few more things then call you back.”

 

Sunil checked to make sure that the logical interface on the storage controller was on a 10 Gb Ethernet port and that it was not saturated.  The data warehouse was one of the only apps deployed so far on this cluster node, so there was plenty of network bandwidth to spare.  He got another angry email from his manager telling him that he had better solve this, and quickly.

He called Steve back, “Steve, I’m pulling my hair out over here.  Are you sure the ESX server network interfaces are 10 Gig? I really think this is a network problem.”

 

Steve replied, “I am trying to reach Megan, on the VMware team, but she is not calling me back.”

 

Sunil’s Vice President called and said that this problem had executive-level exposure.  He had never spoken with his Vice President before, and this was not how he had hoped to meet her.  Think, think, think!  He had to get out of this situation quickly, but how?  Then Sunil had an idea.  What if he used vol move to move the VM volumes from the current storage controller to a NetApp All-Flash FAS?  Sunil wasn’t a DBA, but he knew that disk throughput was often the biggest bottleneck for a data warehouse.  If he took disk bottlenecks out of the picture by moving the workload to a powerful All-Flash FAS, it would either improve performance or prove that storage was not guilty.  He used OCI to check that storm-05, the All-Flash FAS, had resources to spare and started the vol move.

 

“Proving beyond a shadow of a doubt . . .  that storage is not guilty.”

 

It didn’t take long to finish.  He checked the throughput of VM37 and VM38 on their new All-Flash FAS storage.  VM37 and VM38 each had average throughput of 85 MB/sec, for a total throughput of 170 MB/sec.

Court is adjourned!

 

Sunil called Steve back.  “Steve, I moved the data warehouse workloads to an All-Flash FAS and the total throughput only increased from 163 MB/sec to 170 MB/sec.  Storage is obviously not guilty here.  You need to escalate this problem with the virtualization team if you want to get this fixed.”

 

“OK, Sunil, I’ll get on it right away, thanks,” said Steve.

 

Steve called back a few hours later.  “Sunil, you were right, they didn’t have a 10 Gigabit connection from the ESX server to the storage controller for the NFS datastore.  It was only Gigabit Ethernet.  We installed a 10 Gigabit card and we are now seeing throughput of 300 MB/sec.  By the way, that was a slick way to prove that storage was not guilty.  How did you think of it?”

 

Sunil smiled and said, “That’s why they pay me the big bucks, Steve.”  Sunil got off the phone and smiled again.  That NetApp All-Flash FAS was a great investment.  Now if he could only get a second one for his new cluster.
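
For readers who want to check Sunil's arithmetic, here is a minimal Python sketch of the comparison he made. The throughput figures and the 300 MB/sec target come straight from the story; the function name and the dictionary layout are only illustrative assumptions, not part of OCI or any NetApp tool.

```python
# Per-VM throughput in MB/sec, as reported in the story.
before = {"VM37": 67, "VM38": 96}   # on storm-03 (15K SAS + Flash Cache)
after = {"VM37": 85, "VM38": 85}    # after the vol move to the All-Flash FAS
target_mb_s = 300                   # what the data warehouse team expected


def compare_throughput(before, after, target):
    """Hypothetical helper: total throughput before and after a move,
    the percentage change, and the remaining gap to the target."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    pct_change = (total_after - total_before) / total_before * 100
    shortfall = target - total_after
    return total_before, total_after, pct_change, shortfall


total_before, total_after, pct_change, shortfall = compare_throughput(
    before, after, target_mb_s
)

print(f"Before move: {total_before} MB/sec")
print(f"After move:  {total_after} MB/sec")
print(f"Change:      {pct_change:.1f}%")
print(f"Still short of target by {shortfall} MB/sec")
```

Moving to much faster storage changed total throughput by only a few percent and left the workload well short of the target, which is what pointed the investigation away from the storage and toward the path between the ESX server and the datastore.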

Comments
on ‎2014-08-27 01:06 PM

Does it provide stats for NAS protocols like CIFS?

dchilton Former NetApp Employee on ‎2014-08-27 01:15 PM

Not at this time, but which ones would you like to see?