I have a few V3270 controllers using external storage on an HP StorageWorks XP24000. I've read the best practices guide and implementation guide, and I've had this setup running for quite some time. Somewhat recently, I've started to face some performance problems. There are of course many possible sources, and I wouldn't be surprised if I'm hitting the usable limits of the hardware at hand.
Searches for other posts with this configuration point to lowering the option disk.target_port.cmd_queue_depth, so I don't overload the XP24000 host ports with more IO than they can really handle. This option is currently set to 256, which I assume is the default. Unfortunately I don't have a test environment, so my simple question is whether I can change this option on the fly on a production system.
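For reference, this is how the option looks on my console. The value 128 in the last line is just an example of a lower setting, not a recommendation, and whether changing it live is disruptive is exactly what I'm unsure about:

```
filer> options disk.target_port.cmd_queue_depth
disk.target_port.cmd_queue_depth 256
filer> options disk.target_port.cmd_queue_depth 128
```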
Also, at the end of the output of "storage load summary" I get "I/O was last balanced at: Thu Jan 1 01:02:46 CET 1970", indicating it has in fact never been done. I saw the "storage load balance" command, and again: is this safe to run in production?
The XP24K is more than capable of handling our requests at the port level (assuming the ports are dedicated to a single initiator). The queueing problems are primarily on the mid-range Hitachi AMS arrays. Changing the disk.target_port.cmd_queue_depth option is not needed here.
You may certainly run the "storage load balance" command. But instead of throwing darts at the wall hoping we hit something, I would suggest gathering some more data about the symptoms first. The answer may be obvious once we can see exactly where the bottleneck is. You should open a case with NetApp Support, asking for help figuring out why your performance has dropped off.
One of the things they should do is help you collect a Perfstat (Perfstat is a host-side tool that collects performance data about a NetApp controller). The Perfstat will show us where, if at all, the NetApp controller is having problems.
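For reference, Perfstat is run from an admin host with access to the controller. A typical 7-Mode invocation looks roughly like the sketch below; the filer name, iteration count, and interval are placeholders, and Support will tell you the exact version and flags they want for your case:

```
# run from an admin host that can reach the controller;
# -f names the filer, -t is the capture time per iteration (minutes),
# -i is the number of iterations (placeholders, confirm with Support)
perfstat7 -f filer01 -t 4 -i 5 > perfstat_filer01.out
```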
It will also help if you can quantify what you mean by a performance problem. What exactly have you noticed? Is it a particular application? Is it tied to a particular aggregate? The more information you can provide, the quicker we'll be able to help identify the source.
I have another HA pair connected to the same XP ports, a V3170 model though, so I'm sharing the ports with two more filers. There are no other filers or hosts connected, however.
I'm getting reports of various systems performing a lot worse than before: Oracle batch jobs taking much longer, sluggish systems, and in some cases SnapDrive timing out while creating disks in previously created qtrees.
A support case has been logged and a perfstat has been run. The Earth has rotated; nothing new here. NetApp Support has not seen anything out of the ordinary. We recently upgraded the filers from Data ONTAP 8.0.2 to 8.1.1 and saw a sharp increase in CPU activity, which was the reason for raising the support case. Support came back and explained that the CPU usage was due to background dedupe processes, and that this was normal.
So here I am, still trying to find out what's going on. I have access to detailed XP performance data. I can see high microprocessor usage on the host ports, but the latency per port is reasonable. Browsing through a wall of numbers, nothing is over 10 ms, so I should be good there.
Other people are looking at individual volumes on the filer side, and it seems we have some heavy-hitting Oracle apps. It could very well be that the filer is overwhelmed with IO bursts, but the investigation continues.
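On the filer side, I've also been watching the load in real time with sysstat; a one-second extended view like the one below makes IO bursts fairly obvious (the interval is just my choice, pick whatever suits you):

```
filer> sysstat -x 1
```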