I've got an odd one and I'm hoping some of the SQL gurus can help me.
Windows Server 2008 R2
Microsoft Failover Clustering
Microsoft SQL 2008 R2
FAS 3160 w/ 7.3.6 (igroup ALUA enabled, MS DSM)
32x 600GB SAS disks in 2x 16-disk RGs: 79% utilized/provisioned.
2x Cisco DS-X9148 4Gb/s FC blades
HP BL460c G7 w/ FC expansion card and two passthrough switches.
The DBA contacted me, concerned because she's seeing disk latency >10s (not ms, sec!) in perfmon for the last 3m. I was a little surprised, as dfm sends me an e-mail any time there's >50ms of latency for 60s. I started digging into it and found a momentary spike up to 100ms when the performance problems started. Then things settled down to ~15ms. For the next hour perfmon was consistently showing latency >10s, but ONTAP was showing 15ms or lower (average of 9ms). This is a huge disconnect.
My first thought was she had an SFP flaking out, but there's no sign of that in the HP, Cisco, or NetApp error logs.
I did have a number of "B" (back-to-back) CPs at the time but no "b" (deferred) ones. NetApp CPU was around 60% utilized (that's the average across the four cores, not the level of the highest core, which for some reason sysstat still shows by default).
So, if anyone has any ideas on why MS perfmon shows >10s latency but ONTAP is recording <15ms, I'd appreciate it.
Which ONTAP counters are you using to monitor latency, and is it read or write latency that spiked? I am guessing it was write latency?
I don't like seeing CP types of either B or b in sysstat. Out of interest, are there other aggregates on this filer, or only the 32x 600GB SAS disks? Do you monitor latency from a host perspective on other hosts that are using this filer? Maybe check the historic performance stats in ESXi (if you're using it) for the time this occurred.
I do have a ticket open with NetApp support; they're digging through some perfstats I ran. I thought I'd ping the community to see if anyone has seen this before.
Some more information: SQL is reporting "SQL Server has encountered X occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [%path%] in database [%name%]. The OS file handle is %hex%. The offset of the latest long I/O is %hex%."
I'm still really weirded out about why perfmon would show >10s of latency while ONTAP is showing <15ms of latency.
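One way to line the SQL side up against the filer's timeline is to tally the long-I/O messages quoted above per database file. The following is only a rough Python sketch: the regular expression is built from the message template in this thread, the function name `summarize_long_io` is made up for illustration, and the sample lines (timestamps, paths, database name) are hypothetical, not taken from any real log. The exact layout of real log lines may differ, so the match is deliberately loose.

```python
import re
from collections import Counter

# Loose pattern based on the message template quoted in this thread.
# Real log lines may carry extra fields around it, so only the parts
# named in the template are matched.
LONG_IO = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking "
    r"longer than (?P<secs>\d+) seconds to complete on file "
    r"\[(?P<path>[^\]]+)\] in database \[(?P<db>[^\]]+)\]"
)


def summarize_long_io(lines):
    """Return total long-I/O occurrences keyed by (database, file path)."""
    totals = Counter()
    for line in lines:
        match = LONG_IO.search(line)
        if match:
            key = (match.group("db"), match.group("path"))
            totals[key] += int(match.group("count"))
    return totals


# Hypothetical sample lines, invented purely to exercise the parser.
SAMPLE = [
    "2012-01-01 10:00:00 SQL Server has encountered 3 occurrence(s) of I/O "
    "requests taking longer than 15 seconds to complete on file "
    "[E:\\data\\example.mdf] in database [example]. The OS file handle is "
    "0x0000. The offset of the latest long I/O is 0x0000.",
    "2012-01-01 10:01:00 Some unrelated log entry.",
    "2012-01-01 10:05:00 SQL Server has encountered 2 occurrence(s) of I/O "
    "requests taking longer than 15 seconds to complete on file "
    "[E:\\data\\example.mdf] in database [example]. The OS file handle is "
    "0x0000. The offset of the latest long I/O is 0x0000.",
    "2012-01-01 10:06:00 SQL Server has encountered 1 occurrence(s) of I/O "
    "requests taking longer than 15 seconds to complete on file "
    "[E:\\data\\example_log.ldf] in database [example]. The OS file handle "
    "is 0x0000. The offset of the latest long I/O is 0x0000.",
]

if __name__ == "__main__":
    for (db, path), count in sorted(summarize_long_io(SAMPLE).items()):
        print(f"{db}\t{path}\t{count}")
```

If the counts cluster on one file or LUN, that narrows where to look; if they are spread evenly across every file, the host-side path (HBA, MPIO/DSM, fabric) seems a more likely suspect than any single volume. Check the encoding of the real log file before feeding it to something like this.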