2009-03-31 08:22 AM - edited 2015-12-18 01:41 AM
We are running a FAS3070 cluster. On one node we have iSCSI LUNs that our SQL servers point to. Our SQL server monitoring tools are showing high disk queue lengths (>5, sometimes above 10 and even above 100). The interesting part is that when we check the filer with Performance Monitor, it's not having any problems: CPU isn't above 50% (sometimes only around 25%), protocol latencies are below 1 ms, network throughput looks fine, and protocol ops aren't maxed out.
Does anyone know of an issue with external monitoring tools interfacing with NetApp filers and reporting disk queue lengths? The question is whether the monitor can get a correct number, given that multiple disks make up the aggregate (and therefore the LUN). Or does the filer head "translate" the multiple per-disk queue lengths into one combined queue length?
Trying to prove to the SQL DBAs that it's not the filers having a problem. lol
As I was typing this, the SQL monitoring tool raised an alarm of an average disk queue length of 5.3. Filer is showing 7% CPU usage, ~20 mb_per_sec network throughput, ~400 iSCSI_ops per sec.
Ideas? And yes it may be the network but trying to prove it's not the filers.
Another way to look at it: what average disk queue length would indicate that the filers are slowing the process down?
2009-03-31 02:37 PM
You want to check a few extra stats on the Filer. Look at the LUN read/write latencies, and also the disk utilisation.
I've seen Microsoft report different stats than the NetApp side does. The disk queues are in terms of Windows disks, so it's the queue length on the LUN, not of the physical disks; the physical disks are completely masked from the host. I'd also recommend troubleshooting the network - run some standard network connectivity and performance tests to make sure that is not the culprit here.
The real problem is that the NetApp stats do a very good job of proving that the NetApp system is not at fault: all the stats show you how things look from the controller down to the disk level. If there is an issue with connectivity to the host, it probably won't be apparent from the storage side. Meanwhile from the Windows layer, the application and OS may be configured fine, and all checks will prove that, but they will point at the storage. Inevitably it's actually something in the network config. Tricky to troubleshoot sometimes if it's not obvious!
If the filer is reporting low latency all round and doing very little in general, then I'd look to investigate the host side of things and the connectivity between the host and the storage. On the switches check the VLAN config, check for any packet errors, check if there's any multi-pathing or link aggregation conflicts. Also make sure you have the NetApp host utils installed and you are using all supported versions of hardware / software pieces.
2009-03-31 05:45 PM
To follow up a bit on Chris's post (although not much - he covered a lot): to see overall disk utilization, run "sysstat -x -s 1" from the filer console (you'll need to make the console window wider to handle the extra columns without nasty line-wrap).
For deeper info than that (down to the aggregate, volume, lun or disk level), you can use Performance Advisor (part of Ops Mgr if you have it).
2010-03-30 05:00 PM
I thought the guidelines were, in general, 2-3 per spindle. If you had 28 spindles, then you wouldn't want to see the OS disk queue go above ~70.
If I have that right, then seeing spikes of 5 or 10 is nothing to worry about.
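That rule of thumb is easy to encode. A minimal sketch - the 2-3-per-spindle figure is the guideline quoted above, not a published NetApp limit, and the 2.5 midpoint is my choice:

```python
def max_healthy_queue(spindles, per_spindle=2.5):
    """Rule-of-thumb ceiling for the OS-level disk queue length:
    roughly 2-3 outstanding I/Os per spindle (midpoint 2.5 here)."""
    return spindles * per_spindle
```

With 28 spindles that gives `max_healthy_queue(28)` = 70, matching the ~70 figure above, so spikes of 5 or 10 are well inside the comfort zone.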
You can check the SQL Server 2005 DMVs (e.g. sys.dm_os_wait_stats) for wait statistics; that'll tell you whether SQL Server thinks it's I/O-bound or not.
2010-03-31 01:37 AM
Although the general disk-queue-length guidelines would indicate a problem, I would focus on other stats as well - particularly latency. You could check latency from the Windows or even the SQL side of things too; a high disk queue spike in and of itself is nothing to worry about, especially since you probably have more disks in the aggregate than you would have in a regular RAID set.
Another thing could be the queue depth; SQL likes it somewhat higher (depending on the SQL instance) than regular apps. MPIO with iSCSI is a nice subject as well - can you confirm from the Windows side that you are actually using more than one interface simultaneously for the SQL data? Keywords to look for when investigating are MCS, or multiple connections per session.
But first thing is to establish whether you really have a problem in the first place.
2010-03-31 04:14 AM
On a SQL Server implementation, what latency number would be good vs. bad? From Windows, it's generally < 25 ms, but we're occasionally seeing 40ms or greater, with spikes even higher.
2010-03-31 12:48 PM
Great question Keith. In Windows, make sure you use the "LogicalDisk" counters; these should be fairly accurate when reporting queue length to the array. You can throw the "PhysicalDisk" counters out the window. Additionally, in SAN environments the biggest indicator of performance IMHO is "Avg. Disk sec/Transfer", which measures response time to your storage. I would say you want these times to be under 20 ms, which equates to 0.020 on the counter display. Queue length will spike from time to time, but what you want to watch out for is sustained queued I/O. Spikes are normal and expected.
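The "sustained, not spikes" rule can be sketched like this - assuming "Avg. Disk sec/Transfer" samples exported from Perfmon as plain seconds (0.020 = 20 ms); the function name and the 5-sample window are illustrative, not from any tool:

```python
def sustained_high_latency(samples_sec, threshold_ms=20.0, sustain=5):
    """Return True only if 'Avg. Disk sec/Transfer' exceeds the
    threshold for `sustain` consecutive samples; isolated spikes
    reset the counter and are ignored, per the guidance above."""
    run = 0
    for s in samples_sec:
        run = run + 1 if s * 1000.0 > threshold_ms else 0
        if run >= sustain:
            return True
    return False
```

A single 40 ms spike in a sea of 5 ms samples returns False; five back-to-back 30 ms samples return True.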
From the NetApp check your LUN stats.
lun stats -o -i 1 /vol/volumename/lunname
This will show latency and queue length to the LUN. Additionally you might want to run a reallocate measure on the LUN. Let us know how it goes!
2010-06-01 06:10 AM
Your suggestions have been most helpful to me. Much appreciated!
But could you give some guidelines/hints on which values are OK and when my system should be considered overloaded when I run the 'lun stats -o -i 1' command?
Read Ops | Write Ops | Other Ops | QFull | Read kB | Write kB | Average Latency | Queue Length | Partner Ops | Partner kB | Lun
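For reference, an interval row under those columns can be split into named fields with a short script. This is a sketch - the column order is assumed from the header above, so verify it against your Data ONTAP version, and the field names are mine:

```python
def parse_lun_stats(line):
    """Split one `lun stats -o -i 1` interval line into named fields.
    Column order assumed from the pasted header; a trailing LUN path,
    if present, is kept as a string."""
    parts = line.split()
    names = ("read_ops", "write_ops", "other_ops", "qfull",
             "read_kb", "write_kb", "avg_latency_ms",
             "queue_length", "partner_ops", "partner_kb")
    row = dict(zip(names, (float(v) for v in parts[:10])))
    if len(parts) > 10:
        row["lun"] = parts[10]
    return row
```

For example, `parse_lun_stats("120 80 0 0 960 640 4.2 1.5 0 0 /vol/sqlvol/lun0")` yields a dict with `queue_length` 1.5 and `avg_latency_ms` 4.2 (hypothetical numbers).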
2010-06-01 07:08 AM
Well, in terms of performance you will want to look at latency and queue length. For queue length, anything sustained over 2-3 would be an area of concern. Remember that with a sample rate of 1 second you will see moments where the queue length is over that number but the very next second is under; that is why you must measure over a long period to determine your average queue length. Latency is a better performance number IMHO. You will want it under 20 ms across the board, though again you will see spikes. For the most part we average about 5-7 ms latency across all of our LUNs.
2010-06-06 11:55 PM
Thanks again for your great answers. I have now run the lun stats with a 15-second sample interval for about 30 minutes, then inserted all the values into Excel and had it compute an average of each column.
This is what I got, what do you think?
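For what it's worth, the Excel averaging step can also be done with a few lines of script - a sketch, assuming each sampled interval has been turned into a list of numbers (one list per 15-second sample):

```python
def column_averages(rows):
    """Average each column across all sampled rows - the same thing
    the Excel step above does. rows: list of equal-length numeric
    lists, one per sampling interval."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]
```

For example, `column_averages([[1.0, 4.0], [3.0, 6.0]])` returns `[2.0, 5.0]`, the per-column means over both intervals.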