
FAS3240 Performance - Latency Spike during Oracle DB Backup

ASUNDSTROM
7,597 Views

Our primary storage for the datacenter is a FAS3240 HA pair with 512GB PAM per controller. FAS1 serves DB and VMware storage from our 600GB SAS disks. We have one SAS aggregate with 69 disks divided into three 23-disk RAID groups. FAS2 uses our SATA drives, which are split into 4 aggregates with RAID groups of varying sizes and a mix of 2TB and 1TB disks. We now know that we should probably have balanced the performance load across the 2 filers by breaking the SAS aggregate into smaller aggregates and distributing them across the filers based on performance requirements. Multiple aggregates balanced across the filers would have confined possible disk utilization problems to separate aggregates.

We are now seeing latency alarms from our VMware vCenter alerts. We have narrowed the cause down to about 5 minutes of Oracle DB dump activity on a particular LUN, which is hosted on the same aggregate that provides the LUNs for our VMware infrastructure.

I have attached just a 1 minute sample during the time in question. Files were posted with the support discussion at https://forums.netapp.com/message/154420

We started running sysstat (1 sec interval for 1 min) and statit (1 min intervals) on the filer and noticed the following:

  1. sysstat - Disk util runs from 46-85% during the latency spikes
  2. sysstat - CP time runs at a consistent 100% during the 5 minutes
  3. sysstat - CP type = Bx in most samples during the 5 minutes. In some instances this is solid back-to-back CPs; at least it's not the lowercase "b"s
  4. statit - WAFL statistics (per second) - 0.66 back-to-back CP - 0.40 dirty_blk_cnt generated CP

With all the CPs going on, and keep in mind I have seen cases where the entire 5 minutes was nothing but B2B CPs, what is the bottleneck?

To me, disk utilization is not maxed out while CP time is at 100% and most CPs are B2B, so I would think that the load is too much for the NVRAM/main memory rather than the disks. I have been trying to read up on the write caching process, but in doing so I am confused as to why there are no counters in sysstat and statit for main memory. I would like to understand how the write caching actually works; if someone has more information, it would be greatly appreciated.

Optimizing Storage Performance and Cost with Intelligent Caching | WP-7107

"NetApp reduces this penalty by buffering NVRAM-protected writes in memory and then writing full RAID stripes plus parity whenever possible."

NetApp Flash Pool Design and Implementation Guide | TR-4070

"The commitment of the data to NVRAM immediately, before a disk write, allows the NetApp system to maintain low latency in its response to writes. In place of a large battery-backed, RAID-mapped memory for write caches, the NetApp system uses smaller NVRAM sizes. The NVRAM is accessed only in the event of a failure; it is strictly a backup of recent write operations. All of the actual write work occurs in regular system memory."
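
To make the mechanism in the quotes above concrete, here is a toy model of the write path, offered as a rough sketch and not as a description of ONTAP internals. It assumes the commonly described behavior that the NVRAM log is split into two halves: one half accepts writes while a consistency point (CP) flushes the data logged in the other, and a back-to-back CP occurs when the active half fills before the running CP has finished. The function name, the half size, and the rates are all hypothetical, and other CP triggers (timers, dirty buffer watermarks) are ignored.

```python
def simulate_nvlog(ingest_mb_s, flush_mb_s, half_mb, seconds, dt=0.01):
    """Toy two-half NVLog model (hypothetical, not ONTAP code).

    Returns (cps_started, back_to_back_cps, stalled_seconds).
    """
    active = 0.0     # MB logged in the half currently accepting writes
    flushing = 0.0   # MB the running CP still has to write to disk
    waited = False   # True once the active half filled during a running CP
    cps = b2b = 0
    stalled = 0.0

    for _ in range(int(round(seconds / dt))):
        # The running CP drains at the rate the disks can absorb.
        if flushing > 0:
            flushing = max(0.0, flushing - flush_mb_s * dt)

        if active >= half_mb:
            if flushing == 0:
                # Swap halves: the full half becomes the next CP.
                flushing, active = active, 0.0
                cps += 1
                if waited:
                    b2b += 1  # this CP starts right behind the previous one
                waited = False
            else:
                # Both halves are busy: incoming writes must wait.
                waited = True
                stalled += dt
                continue

        # Log incoming writes into the active half.
        active = min(half_mb, active + ingest_mb_s * dt)

    return cps, b2b, stalled


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print("light load:", simulate_nvlog(100, 400, 256, 60))
    print("heavy load:", simulate_nvlog(400, 200, 256, 60))
```

With these made-up numbers the first case never stalls, while the second spends time with both halves full, which is the point at which incoming writes have to wait and client latency rises.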

8 REPLIES

thomas_glodde

Hi there,

Are you reading the data and writing the backup on the same aggregate?

What version of ONTAP are you using?

ASUNDSTROM

Yes, both are on the same aggregate. We are in the process of moving the backup LUN to a separate filer, but in this case I was more interested in pinpointing the bottleneck.

thomas_glodde

Backup is a single-threaded sequential I/O operation, so it will constantly fill the NVRAM and lead to back-to-back CPs. Depending on the number of disks, you can go up to 400mb/sec and that's it.

You should at least be on ONTAP 8.1.2, by the way; most known performance problems seem to have been fixed in that release.

ASUNDSTROM

Is that 400MB/sec or 400Mb/sec?

thomas_glodde

400 megabytes per second: a 700 MB Xvid movie in under 2 seconds 😉
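
The two readings of "400mb/sec" differ by a factor of eight, which a quick calculation makes plain. This is only illustrative arithmetic; the helper name is made up.

```python
BITS_PER_BYTE = 8


def seconds_to_transfer(size_megabytes, rate, rate_in_bits=False):
    """Time to move size_megabytes at `rate` (MB/s, or Mb/s if rate_in_bits)."""
    rate_mb_per_s = rate / BITS_PER_BYTE if rate_in_bits else rate
    return size_megabytes / rate_mb_per_s


print(seconds_to_transfer(700, 400))                     # 400 MB/s -> 1.75
print(seconds_to_transfer(700, 400, rate_in_bits=True))  # 400 Mb/s -> 14.0
```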

CCOLEMAN_

What time does the alert go off? Check if it's running at the same time as dedupe. Also, what's the latency threshold set to on vCenter?

RICHARD_NIMBONA

Hello,

I am having the same FAS3240 performance issue. No backup is running on the storage. We are running SQL DB 2012. I really need your help.

>>sysstat -x 1

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out
59%      0      0      0    1593       7      1  348652  11776       0      0     0s    38%  100%  :f  100%       0    821      0   23240 143993       0      0
65%      0      0      0    1780       7      1  387860   2768       0      0     0s    39%  100%  :f  100%       0    887      0   25447 154070       0      0
66%      0      0      0    1686      15     12  374068   5872       0      0     0s    40%   25%  Fn  100%       0    891      0   35942 142496       0      0
69%      0      0      0    1684       7      0  348784 164248       0      0     0s    38%  100%  :s  100%       3    843      0   24607 143531       0      0
61%      0      0      0    1474       8      1  306632  65540       0      0     0s    47%  100%  :f  100%       0    777      0   18895 127186       0      0
58%      0      0      0    1524       4      0  301580  46352       0      0     0s    41%  100%  :f  100%       0    821      0   22283 129481       0      0
66%      0      0      0    1842       7      1  376376  28832       0      0     0s    38%  100%  :f  100%       0    866      0   29533 162650       0      0
59%      0      0      0    1498      15     12  348084   3644       0      0     0s    37%   60%  :   100%       0    814      0   32070 142793       0      0
59%      0      0      0    1569       6      1  324108      0       0      0     1     45%    0%  -   100%       3    803      0   18264 129762       0      0
61%      0      0      0    1965       7      0  331960     24       0      0     0s    40%    0%  -   100%       0    970      0   27089 135249       0      0
63%      0      0      0    1513       7      1  351728  14992       0      0     0s    38%   23%  Fn  100%       0    640      0   20032 130024       0      0
74%      0      0      0    1772       6      0  332234 260613       0      0     2s    39%  100%  :s  100%       0    671      0   24582 123022       0      0
68%      0      0      0    1398      15     12  302443  17401       0      0     2s    51%  100%  :f  100%       0    640      0   20967 119129       0      0
62%      0      0      0    1971       8      1  318543   7076       0      0     2s    47%   87%  :    99%       3    935      0   36118 128537       0      0
67%      0      0      0    2361       7      0  340559      0       0      0     3s    43%    0%  -   100%       0   1270      0   39767 153173       0      0
65%      0      0      0    1974       6      1  375216     32       0      0     3s    40%    0%  -   100%       0    966      0   30029 158790       0      0
68%      0      0      0    1654       8      0  374718 110440       0      0     3s    46%   60%  Fn  100%       0    850      0   27352 124105       0      0
79%      0      0      0    2307      16     12  372733 164793       0      0     3s    43%  100%  :f  100%       0    979      0   38797 154522       0      0
66%      0      0      0    1713       8      0  274684  37560       0      0     3s    53%  100%  :f  100%       3    870      0   37554 130175       0      0
60%      0      0      0    1737       9      1  298430   4000       0      0     3s    47%   60%  :    99%       0   1067      0   34960 133720       0      0
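
For anyone who wants to summarize output like the above instead of reading it by eye, here is a minimal sketch that tallies the CP trigger codes and averages the disk read/write columns. The field positions are an assumption based on the 23-column `sysstat -x` layout shown in this thread, the code meanings are paraphrased from the sysstat man page, and the function and variable names are made up for illustration.

```python
from collections import Counter

# First character of the "CP ty" column (only codes relevant to this thread).
CP_TRIGGER = {
    "-": "no CP started in this interval",
    ":": "continuation of a CP from the previous interval",
    "F": "CP caused by a full NVLog",
    "B": "back-to-back CP",
    "b": "deferred back-to-back CP",
    "T": "CP caused by timer",
}

# Four rows copied from the sample above.
SAMPLE = """\
59%      0      0      0    1593       7      1  348652  11776       0      0     0s    38%  100%  :f  100%       0    821      0   23240 143993       0      0
66%      0      0      0    1686      15     12  374068   5872       0      0     0s    40%   25%  Fn  100%       0    891      0   35942 142496       0      0
69%      0      0      0    1684       7      0  348784 164248       0      0     0s    38%  100%  :s  100%       3    843      0   24607 143531       0      0
59%      0      0      0    1569       6      1  324108      0       0      0     1     45%    0%  -   100%       3    803      0   18264 129762       0      0
"""


def summarize(text):
    """Return (trigger_counts, mean_disk_read_kb_s, mean_disk_write_kb_s)."""
    triggers = Counter()
    read_kb = write_kb = rows = 0
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a 23-column data row.
        if len(fields) != 23 or not fields[0].endswith("%"):
            continue
        triggers[fields[14][0]] += 1   # CP ty: first char is the trigger
        read_kb += int(fields[7])      # Disk kB/s read
        write_kb += int(fields[8])     # Disk kB/s write
        rows += 1
    if rows == 0:
        return triggers, 0.0, 0.0
    return triggers, read_kb / rows, write_kb / rows


if __name__ == "__main__":
    counts, mean_read, mean_write = summarize(SAMPLE)
    for code, n in counts.items():
        print(f"{code!r}: {n} x {CP_TRIGGER.get(code, 'other')}")
    print(f"mean disk read  kB/s: {mean_read:.0f}")
    print(f"mean disk write kB/s: {mean_write:.0f}")
```

Pasting the full capture into `SAMPLE` in place of the four rows gives the same summary for the whole window.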

DAVE_WITHERS

We battled latency issues on our 3240 filers for the better part of 6 months. In our case we believe we narrowed it down to a network issue. We were experiencing microbursts on specific ports that would cause the filer to pause briefly while waiting for data, which would cascade into latency going up across all protocols until data was able to 'catch up'. We went the route of running 24x7 perfstats, eliminating every single hotspot that was found, balancing data on aggregates, and updating ONTAP up to 8.1.2 P4, and still experienced the issue. It wasn't until we reworked our SAN/data network that our situation was mostly resolved. Convincing our network team that we believed the issue was on their end was what took us so long to narrow it down.
