Re: FAS3240 Performance - Latency Spike during Oracle DB Backup

ASUNDSTROM · ‎2012-11-30

Our primary storage for the datacenter is a FAS3240 HA pair with 512GB PAM per controller. FAS1 serves DB and VMware storage from our 600GB SAS disks: one SAS aggregate of 69 disks divided into 3 RAID groups of 23 disks each. FAS2 uses our SATA drives, in 4 aggregates with RAID groups of varied sizes and a mix of 2TB and 1TB disks. We now know that we should probably have balanced the performance load across the 2 filers by breaking the SAS aggregate into smaller aggregates and distributing them across the filers based on performance requirements. Multiple aggregates balanced across the filers would have confined any disk utilization problem to a single aggregate instead of letting it affect every workload.

We are now seeing latency alarms from VMware vCenter. We have narrowed the cause down to about 5 minutes of Oracle DB dump activity on a particular LUN, which is hosted on the same aggregate that provides the LUNs for our VMware infrastructure.

I have attached just a 1 minute sample during the time in question. Files were posted with the support discussion at https://forums.netapp.com/message/154420

We started running sysstat (1 sec interval for 1 min) and statit (1 min intervals) on the filer and noticed the following:

sysstat - Disk Util runs from 46-85% during the latency spikes
sysstat - CP Time runs at a consistent 100% during the 5 minutes
sysstat - CP Types = Bx in most samples during the 5 minutes. In some instances this is solid back-to-back CPs; at least it's not the little "b"s
statit - WAFL Statistics (per second) - 0.66 back-to-back CP - 0.40 dirty_blk_cnt generated CP

With all the CPs going on (and keep in mind I have seen the entire 5 minutes be nothing but B2B CPs), what is the bottleneck?

To me, disk utilization stays below 100% while CP time is at 100% and most CPs are B2B, so I would think the load is too much for the NVRAM/main memory rather than the disks. I have been trying to read up on the write caching process, but I am confused as to why there are no counters in sysstat and statit for main memory. I would like to understand how write caching actually works; if someone has more information, it would be greatly appreciated.

Optimizing Storage Performance and Cost with Intelligent Caching | WP-7107

"NetApp reduces this penalty by buffering NVRAM-protected writes in memory and then writing full RAID stripes plus parity whenever possible."

NetApp Flash Pool Design and Implementation Guide | TR-4070

"The commitment of the data to NVRAM immediately, before a disk write, allows the NetApp system to maintain low latency in its response to writes. In place of a large battery-backed, RAID-mapped memory for write caches, the NetApp system uses smaller NVRAM sizes. The NVRAM is accessed only in the event of a failure; it is strictly a backup of recent write operations. All of the actual write work occurs in regular system memory."

thomas_glodde · ‎2012-11-30

hi there,

are you reading the data & writing the backup on the same aggregate?

what version of ontap are you using?

ASUNDSTROM · ‎2012-12-04

Yes, it is. We are in the process of moving the backup LUN to a separate filer, but in this case I was more interested in pinpointing the bottleneck.

thomas_glodde · ‎2012-12-04

Backup is a single-threaded sequential I/O operation, so it will constantly fill the NVRAM and lead to back-to-back CPs. Depending on the number of disks you can go up to 400mb/sec and that's it.

You should at least be at ONTAP 8.1.2 btw; it seems most known performance problems have been fixed in that release.

ASUNDSTROM · ‎2012-12-04

Is that 400MB/sec or 400Mb/sec?

thomas_glodde · ‎2012-12-04

400 megabytes per second, a 700mb xvid movie in under 2 seconds 😉
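The difference between the two readings of "400mb/sec" can be checked with a quick calculation. This is only a sketch: it assumes decimal megabytes (1 MB = 1,000,000 bytes) and the helper name `transfer_seconds` is made up for illustration.

```python
MB = 1_000_000  # assumption: decimal megabyte, in bytes


def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Time to move size_bytes at a sustained rate, ignoring any overhead."""
    return size_bytes / rate_bytes_per_s


movie = 700 * MB                 # the 700 MB file from the post above
rate_megabytes = 400 * MB        # 400 MB/s (megabytes per second)
rate_megabits = 400 * MB // 8    # 400 Mb/s (megabits per second), expressed in bytes/s

print(transfer_seconds(movie, rate_megabytes))  # 1.75
print(transfer_seconds(movie, rate_megabits))   # 14.0
```

At 400 megabytes per second the file moves in 1.75 seconds, which matches "under 2 seconds"; at 400 megabits per second it would take 14 seconds.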

CCOLEMAN_ · ‎2012-11-30

What time does the alert go off? Check if it's running at the same time as dedupe. Also, what's the latency threshold set to on vCenter?

RICHARD_NIMBONA · ‎2013-10-15

Hello,

I am having the same FAS3240 performance issue. No backup is running on the storage. We are running SQL DB 2012. I really need your help.

>>sysstat -x 1

CPU NFS CIFS HTTP Total Net kB/s     Disk kB/s  Tape kB/s Cache Cache   CP  CP Disk OTHER  FCP iSCSI     FCP kB/s  iSCSI kB/s
                          in out   read  write read write   age   hit time  ty util                     in    out    in   out
59%   0    0    0  1593    7   1 348652  11776    0     0    0s   38% 100%  :f 100%     0  821     0 23240 143993     0     0
65%   0    0    0  1780    7   1 387860   2768    0     0    0s   39% 100%  :f 100%     0  887     0 25447 154070     0     0
66%   0    0    0  1686   15  12 374068   5872    0     0    0s   40%  25%  Fn 100%     0  891     0 35942 142496     0     0
69%   0    0    0  1684    7   0 348784 164248    0     0    0s   38% 100%  :s 100%     3  843     0 24607 143531     0     0
61%   0    0    0  1474    8   1 306632  65540    0     0    0s   47% 100%  :f 100%     0  777     0 18895 127186     0     0
58%   0    0    0  1524    4   0 301580  46352    0     0    0s   41% 100%  :f 100%     0  821     0 22283 129481     0     0
66%   0    0    0  1842    7   1 376376  28832    0     0    0s   38% 100%  :f 100%     0  866     0 29533 162650     0     0
59%   0    0    0  1498   15  12 348084   3644    0     0    0s   37%  60%   : 100%     0  814     0 32070 142793     0     0
59%   0    0    0  1569    6   1 324108      0    0     0     1   45%   0%   - 100%     3  803     0 18264 129762     0     0
61%   0    0    0  1965    7   0 331960     24    0     0    0s   40%   0%   - 100%     0  970     0 27089 135249     0     0
63%   0    0    0  1513    7   1 351728  14992    0     0    0s   38%  23%  Fn 100%     0  640     0 20032 130024     0     0
74%   0    0    0  1772    6   0 332234 260613    0     0    2s   39% 100%  :s 100%     0  671     0 24582 123022     0     0
68%   0    0    0  1398   15  12 302443  17401    0     0    2s   51% 100%  :f 100%     0  640     0 20967 119129     0     0
62%   0    0    0  1971    8   1 318543   7076    0     0    2s   47%  87%   :  99%     3  935     0 36118 128537     0     0
67%   0    0    0  2361    7   0 340559      0    0     0    3s   43%   0%   - 100%     0 1270     0 39767 153173     0     0
65%   0    0    0  1974    6   1 375216     32    0     0    3s   40%   0%   - 100%     0  966     0 30029 158790     0     0
68%   0    0    0  1654    8   0 374718 110440    0     0    3s   46%  60%  Fn 100%     0  850     0 27352 124105     0     0
79%   0    0    0  2307   16  12 372733 164793    0     0    3s   43% 100%  :f 100%     0  979     0 38797 154522     0     0
66%   0    0    0  1713    8   0 274684  37560    0     0    3s   53% 100%  :f 100%     3  870     0 37554 130175     0     0
60%   0    0    0  1737    9   1 298430   4000    0     0    3s   47%  60%   :  99%     0 1067     0 34960 133720     0     0
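For anyone who wants to summarize a capture like the sysstat -x output above instead of eyeballing it, here is a minimal Python sketch. It is not a NetApp tool: the field names in `FIELDS` are my own labels, the 23-column order is assumed from the header in this post, and `SAMPLE` holds just three of the rows copied from above.

```python
from collections import Counter

# Assumed column order for this `sysstat -x` capture (23 fields per row).
# The names are illustrative labels, not anything defined by Data ONTAP.
FIELDS = (
    "cpu nfs cifs http total net_in net_out disk_read disk_write "
    "tape_read tape_write cache_age cache_hit cp_time cp_ty disk_util "
    "other fcp iscsi fcp_in fcp_out iscsi_in iscsi_out"
).split()

# Three rows copied from the output above.
SAMPLE = """\
59% 0 0 0 1593 7 1 348652 11776 0 0 0s 38% 100% :f 100% 0 821 0 23240 143993 0 0
66% 0 0 0 1686 15 12 374068 5872 0 0 0s 40% 25% Fn 100% 0 891 0 35942 142496 0 0
59% 0 0 0 1569 6 1 324108 0 0 0 1 45% 0% - 100% 3 803 0 18264 129762 0 0
"""


def parse_rows(text):
    """Return one dict per data row; header, blank, and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 23 fields and start with a CPU percentage.
        if len(parts) != len(FIELDS) or not parts[0].endswith("%"):
            continue
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def summarize(rows):
    """Aggregate a few counters; assumes at least one parsed row."""
    return {
        "rows": len(rows),
        "disk_read_kb": sum(int(r["disk_read"]) for r in rows),
        "disk_write_kb": sum(int(r["disk_write"]) for r in rows),
        "min_disk_util": min(int(r["disk_util"].rstrip("%")) for r in rows),
        "cp_time_100": sum(r["cp_time"] == "100%" for r in rows),
        "cp_types": Counter(r["cp_ty"] for r in rows),
    }


rows = parse_rows(SAMPLE)
summary = summarize(rows)
print(summary)
```

On these three rows the disks read far more than they write and Disk util never drops below 100%, which looks more like a read-bound disk load than the back-to-back CP pattern from the original post. As far as I recall from the sysstat man page, a CP type starting with F means the CP was triggered by a full NVLog, B means back-to-back, ":" is a continuation from the previous interval, and "-" means no CP started; check the man page for your ONTAP release before relying on that.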

DAVE_WITHERS · ‎2013-10-17

We battled latency issues on our 3240 filers for the better part of 6 months. In our case we believe we narrowed it down to a network issue. We were experiencing microbursts on specific ports that would cause the filer to pause briefly while waiting for data, which would cascade into latency going up across all protocols until data was able to 'catch up'. We went the route of running 24x7 perfstats, eliminating every single hotspot that was found, balancing data on aggregates, and updating ONTAP up to 8.1.2 P4, and still experienced the issue. It wasn't until we reworked our SAN/data network that our situation was mostly resolved. Convincing our network team that we believed the issue was on their end was what took us so long to narrow it down.