Data Protection
Data Protection
We are running a FAS3070 cluster. On 1 node we have iSCSI luns that our SQL servers point to. On our SQL server monitoring tools we are seeing high disk queues lengths (>5, sometimes above 10 and even above 100). The interesting part is we check the filer, using Performance Monitor, and it's not having any problems. CPU isn't above 50%, sometime only around 25%, Protocol latencies are below 1 millisec, Network Throughput seems fine and Protocol Ops isn't max.
Does anyone know of an issue with external monitoring tools trying to interface with NetApp filers and reporting disk queue lengths? At issue here is can the monitor get a correct number due to the use of multiple disks being used in the aggregate and therefore the lun. Or does the filer head "translate" the multiple disk queue lengths into 1 combined queue length?
Trying to prove to the SQL DBAs that it's not the filers having a problem. lol
As I was typing this, the SQL monitoring tool raised an alarm of an average disk queue length of 5.3. Filer is showing 7% CPU usage, ~20 mb_per_sec network throughput, ~400 iSCSI_ops per sec.
Ideas? And yes it may be the network but trying to prove it's not the filers.
Other way to look at it is what average disk queue length shows the filers are slowing down the process?
Thanks,
Keith
You want to check a few extra stats on the Filer. Look at the LUN read/write latencies, and also the disk utilisation.
I've seen Microsoft reporting different stats than what the NetApp side of things do. The disk queues are in terms of Windows disks, so it's the queue length on the LUN, not of the physical disks. The physical disks are completely masked from the host. I'd also recommend troubleshooting the network, run some standard network connectivity tests and performance tests to make sure that is not the culprit here.
The real problem is that the NetApp stats do a very good job of proving the that NetApp system is not at fault. So all the stats show you how things are from the controller down to the disk level. If there is an issue with the connectivity to the host, it probably won't be apparent from the storage. However from the Windows layer, the Application and OS may be configured fine, and all checks will prove that, but they will point to the storage. Innevitably it's actually something on the network config. Tricky to troubleshoot sometimes if it's not obvious!
If the filer is reporting low latency all round and doing very little in general, then I'd look to investigate the host side of things and the connectivity between the host and the storage. On the switches check the VLAN config, check for any packet errors, check if there's any multi-pathing or link aggregation conflicts. Also make sure you have the NetApp host utils installed and you are using all supported versions of hardware / software pieces.
To follow up a bit on Chris's post (althoug not much - he covered a lot), to see overall disk utilization run "sysstat -x -s 1" from the filer console (you'll need to make the console window wider to handle the extra columns without nasty line-wrap).
For deeper info than that (down to the aggregate, volume, lun or disk level), you can use Performance Advisor (part of Ops Mgr if you have it).
I thought the guidelines were, in general, 2-3 per spindle. If you had 28 spindles, then you wouldn't want to see the OS disk queue go above ~70.
If I have that right, then seeing spikes of 5 or 10 is nothing to worry about.
You can check the SQL 2005 DMVs for Wait statistics; that'll tell you if SQL Server thinks it's IO-bound or not.
Although general guidelines for disk queue length would indicate a problem, I would focus on other stats as well - latency. You could check the latency from the windows or even SQL side of things as well, some high disk queue spike in and of itself is nothing to worry about. Especially since you probably have a greater number of disks in the aggregate than you would have with a regular RAID set.
Another thing could be the queue depth, SQL likes it to be somewhat higher (depending on the SQL instance) that regular apps. Also MPIO and iSCSI is a nice subject as well, can you confirm from the Windows side of things you are actually using more than 1 interface simultaneously for the SQL data? Keywords to look for when investigating is MCS or multiple connections per session.
But first thing is to establish whether you really have a problem in the first place.
On a SQL Server implementation, what latency number would be good vs. bad? From Windows, it's generally < 25 ms, but we're occasionally seeing 40ms or greater, with spikes even higher.
Great question Keith. In Windows make sure you use "LogicalDisk" counters. This should be fairly accurate when reporting queue length to the array. You can throw the "PhysicalDisk" counters out the window. Additionally in SAN environments the biggest indicator of performance IMHO is "Avg. Disk sec/Transfer" This will measure response time to your storage. I would say you want these times to be under 20ms which would equate to .020 on the counter display. Queue length will spike from time to time, but you want to watch out for sustained queued I/O. Spikes are normal, and expected.
From the NetApp check your LUN stats.
lun stats -o -i 1 /vol/volumename/lunname
This will show latency and queue length to the LUN. Additionally you might want to run a reallocate measure on the LUN. Let us know how it goes!
Regards,
Erick
Hi Erick,
Your suggestions has been most helpful to me. Much appreciated!
But could you give some guidelines/hints on which values are OK and when my system should be considered overloaded when I run the 'lun stats -o -i 1' command?
Read      Write      Other      QFull      Read       Write           Average       Queue          Partner  Lun
Ops        Ops        Ops       kB          kB           Latency       Length         Ops             kB
Thanks!
/ Stefan
Well in terms of performance you will want to look at latency and queue length. For queue length anything sustained over 2-3 would be an area of concern. Remember that with a sample rate of 1 second you will see times where the queue length is over that number, but the very next second will be under. That is why you must measure for a long period to determine your average queue length. Latency is a better number for performance IMHO. This number you will want under 20 across the board, but again you will see spikes. For the most part we average about 5-7ms latency for all of our LUN's.
Hi,
Thanks again for your great answers. I have now run the lun stats with a 15 second sample for about 30 minutes. Then I inserted all the values into Excel and had Excel to do an average on all values.
This is what I got, what do you think?
| Read | Write | Other | QFull | Read | Write | Average | Queue | Partner | Lun | 
| Ops | Ops | Ops | kB | kB | Latency | Length | Ops | kB | |
| 0,00 | 24,87 | 0,04 | 0,00 | 0,00 | 1006,79 | 2,44 | 0,45 | 0,00 | 0,00 | 
Rgs
/ Stefan
I don't think your chart formated properly. You are missing the Write kB column and some of those numbers don't look right. I can assume your Average Latency should be the 2.44 number, and if that is the case everything is golden.
Yeap, your right the headers where wrong.... Here is the correct table
| Read | Write | Other | QFull | Read | Write | Average | Queue | Partner | Lun | 
| Ops | Ops | Ops | kB | kB | Latency | Length | Ops | kB | |
| 0,00 | 24,87 | 0,04 | 0,00 | 0,00 | 1006,79 | 2,44 | 0,45 | 0,00 | 0,00 | 
So a second opinion whould be highly appreciated
RGS
/ Stefan
Those numbers look great. Queue length under 1 means you don't have I/O waiting for access, and 2.44ms access time to the LUN is fantastic. If you start having performance issues with a LUN the first thing you will see is the latency increase. You don't have to worry about that until you start approaching >20ms latency.
Thanks for your quick relies!
We are having performance issues, that's why I started posting here.... But the sample above was not taken at a time when we had problems.
Today I have done some more monitoring and this is a 5sec intervall collected during aprox 5 minutes. (Averaged in Excel)
| Read | Write | Other | QFull | Read | Write | Average | Queue | Partner | Lun | 
| Ops | Ops | Ops | kB | kB | Latency | Length | Ops | kB | |
| 33,74 | 20,65 | 0,00 | 0,00 | 8635,17 | 116,67 | 43,34 | 3,32 | 0,00 | 0,00 | 
As you can see, neither Latency nor Queue length looks good any more...
What would be your recommendation? Add more disks or what?
RGS
/ Stefan
Well it certainly looks like you are having some performance issues now. There are so many factors that come into play now that I wuold just be guessing. How many disk are in the aggregate this LUN resides in? How many other volumes are a part of this aggregate? What does their I/O look like? I would open up a support call with NetApp and they will be able to tell you exactlly where the bottleneck is occuring. They will probably have you run a statit to gather information. It certainly seems you are having disk contention issues.
Fascinating discussion. Last year we purchased our own Netapp as the disk backend for our production sql server database. I've been using tools from Quest Software ( Spotlight) to monitor performance but after reading this discussion I see that average disk queue length is not the best measurement to use. This discussions led me to trying to create custom counters that would access the WMI classes, but the "formatted" query below brings back values so low as to be unusable ( shows as zero in Spotlight) and the Raw returns values around 3 billion.
In general, direct access to our Netapp is restricted to our Systems team, but I was wondering if there are tools that would give a read-only view of it, particularly throughput from sql to Netapp?
select AvgDiskSecPerWrite from Win32_PerfFormattedData_PerfDisk_LogicalDisk where Name ="L:"
select AvgDiskSecPerWrite  from Win32_PerfRawData_PerfDisk_LogicalDisk where Name  ="L:"
