Our primary storage for the datacenter is a FAS3240 HA pair with 512GB of PAM (Flash Cache) per controller. FAS1 provisions database and VMware storage on our 600GB SAS disks: one SAS aggregate of 69 disks, divided into 3 RAID groups of 23 disks each. FAS2 uses our SATA drives, organized into 4 aggregates with RAID groups of varying sizes and a mix of 2TB and 1TB disks. We know now that we probably should have balanced the performance load across the two filers by breaking the SAS aggregate into smaller aggregates and distributing them between the filers based on performance requirements. Multiple aggregates balanced across the filers would have confined disk-utilization hotspots to separate aggregate domains.
We are now seeing latency alarms generated by our VMware vCenter alerts. We have narrowed the cause down to about 5 minutes' worth of Oracle DB dump activity on a particular LUN, which is hosted on the same aggregate that provides the LUNs for our VMware infrastructure.
With all the CPs (consistency points) going on, and keep in mind I have seen stretches where the entire 5 minutes was nothing but back-to-back (B2B) CPs, what is the bottleneck?
To me, disk utilization is down while CPs are at 100% and most are B2B, so I would think the load is too much for NVRAM/main memory and not for the disks. I have been trying to read up on the write-caching process, but in doing so I am confused as to why there are no counters for main memory in sysstat and statit. I would like more information on how the write caching actually works; if someone has more info it would be greatly appreciated.
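One rough way to see why sustained B2B CPs point at the write path rather than instantaneous disk utilization is to model the NVLOG as two halves: while one half accepts new writes, the other is being flushed by a CP, and a B2B CP happens when the active half fills before the in-flight CP finishes. The sketch below is a toy model with made-up rates, not actual Data ONTAP internals; all names and numbers are illustrative assumptions.

```python
# Toy model (NOT ONTAP code): dual NVLOG halves and back-to-back CPs.
NVLOG_HALF = 100       # capacity of one NVRAM half, arbitrary write units
CP_FLUSH_RATE = 8      # units a consistency point can flush per tick

def simulate(ticks, incoming_rate):
    """Count B2B CPs: events where a half fills while a CP is still running."""
    active = 0         # fill level of the half currently accepting writes
    flushing = 0       # data remaining in the half a CP is flushing to disk
    b2b = 0
    for _ in range(ticks):
        active += incoming_rate
        if flushing > 0:
            flushing = max(0, flushing - CP_FLUSH_RATE)
        if active >= NVLOG_HALF:
            if flushing > 0:
                b2b += 1               # next CP must start the moment the old one ends
                while flushing > 0:    # writers stall until the in-flight CP completes
                    flushing = max(0, flushing - CP_FLUSH_RATE)
            flushing = active          # swap halves: full half starts flushing
            active = 0
    return b2b

print(simulate(100, 12))  # sustained overload: repeated B2B CPs
print(simulate(100, 4))   # flush keeps up: no B2B CPs
```

The point of the model: disks can look underutilized between flush bursts while the write rate still outruns what each CP can retire, which matches seeing low disk utilization alongside 100% CP time.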
Optimizing Storage Performance and Cost with Intelligent Caching | WP-7107
"NetApp reduces this penalty by buffering NVRAM-protected writes in memory and then writing full RAID stripes plus parity whenever possible."
NetApp Flash Pool Design and Implementation Guide | TR-4070
"The commitment of the data to NVRAM immediately, before a disk write, allows the NetApp system to maintain low latency in its response to writes. In place of a large battery-backed, RAID-mapped memory for write caches, the NetApp system uses smaller NVRAM sizes. The NVRAM is accessed only in the event of a failure; it is strictly a backup of recent write operations. All of the actual write work occurs in regular system memory."
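The quoted passage can be restated as a small conceptual sketch: the write is journaled to NVRAM and buffered in system memory, the client is acknowledged as soon as the journal entry is committed, and NVRAM is only ever read back to replay the journal after a failure. This is my own illustrative model of the described behavior, not NetApp code; the class and method names are invented.

```python
# Conceptual sketch of the write path described in TR-4070 (assumptions only).
class Filer:
    def __init__(self):
        self.memory = {}      # system memory buffer: all actual write work happens here
        self.nvram_log = []   # journal: written sequentially, read only after a failure
        self.disk = {}

    def write(self, block, data):
        self.nvram_log.append((block, data))  # commit to NVRAM first
        self.memory[block] = data             # buffer for the next consistency point
        return "ack"                          # low latency: no disk I/O on the write path

    def consistency_point(self):
        # Flush buffered blocks (ideally as full RAID stripes), then retire the journal.
        self.disk.update(self.memory)
        self.memory.clear()
        self.nvram_log.clear()

    def recover(self):
        # The only time NVRAM is actually read: replay the journal after a crash.
        for block, data in self.nvram_log:
            self.disk[block] = data

f = Filer()
assert f.write(0, "dbdump") == "ack"  # acknowledged before any disk write
f.consistency_point()
```

This also suggests why sysstat and statit expose no dedicated "main memory write cache" counter: in this model the buffered writes are just dirty data in regular system memory, and what you observe externally is the CP activity that drains it.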
We battled latency issues on our 3240 filers for the better part of 6 months. In our case we believe we narrowed it down to a network issue. We were experiencing microbursts on specific ports that would cause the filer to pause briefly while waiting for data, which would cascade into latency rising across all protocols until the data was able to catch up. We went the route of running 24x7 perfstats, eliminating every single hotspot that was found, balancing data across aggregates, and updating ONTAP to 8.1.2P4, and still experienced the issue. It wasn't until we reworked our SAN/data network that our situation was mostly resolved. Convincing our network team that we believed the issue was on their end was what took us so long to narrow it down.