2013-02-21 10:28 AM
We use a single AMS2500 behind two V-Series filers, a V3140 and a V3240, for NFS and CIFS file-based data. We've had this setup for a while with mixed results in terms of performance. I'm attempting to dig into the various settings on both the AMS and the V-Series to get an optimal configuration. Most of our data access is fairly low latency, but latency for VMware over NFS is higher than I would expect. I have read the best-practice guides, but I'm curious how others have set up the following attributes:
AMS Disk layout: RG vs. HDP, RAID Level, LUN size.
AMS LU settings: Multi-stream mode, Prefetch Next, Prefetch Base, Fixed Size, Base Size, Count of Judgement Sequential, Accelerated Wide Striping Mode.
Use of DP Optimization
AMS Controller LUN ownership
NetApp Aggregate RG Size
NetApp options disk.target_port.cmd_queue_depth setting
NetApp Read Reallocation Scan settings, especially when expanding an aggregate
2013-02-21 05:05 PM
Before I can provide any answers, I must ask a lot of questions. Please bear with me.
Can you describe your array? How many disks does it have?
Can you describe how the array LUNs are mapped to the physical disks? How many LUNs per RG? Do any of these LUNs belong to the same aggr?
Are you sharing array target ports between the V-Series?
Have you opened a support case?
You say latencies are higher than you expect. What do you expect? How are you measuring this? Are latencies high from the V-Series to the host, or from the array to the V-Series?
Are you seeing any disk queuing on the AMS?
Are these systems sending AutoSupports? Have you run a perfstat? http://support.netapp.com/NOW/download/tools/perfs
Do you have a NetApp account team? Do you know who your SE is?
I think that's enough questions for now.
- Dan Isaacs
2013-02-22 04:19 AM
The array has both SAS drives (300GB and 450GB, 15K) and 1TB SATA drives. The 300GB SAS is in 9+1 RAID5 DP Pools, the 450GB SAS is in 6+1 RAID5 DP Pools, and the 1TB SATA is in 8+2 RAID6 DP Pools. DP Pools map 1:1 to NetApp aggregates. Some of the LUNs are 300GB and some are 783GB. In most cases there are 3 to 5 parity groups per DP Pool. Total disk count is above 400.
As an example, we have a DP Pool with 4x parity groups of 9+1 300GB drives. In this DP Pool we have 32x 300GB LUNs allocated to a V3240 pair, all owned by one filer and in the same aggregate, with an RG size of 8.
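To make the numbers in that example easier to compare against other layouts, here is a rough Python sketch of the arithmetic. The function and field names are made up for illustration. It uses nominal drive sizes (not formatted capacity), and the "LUNs per parity group" figure is only an average share, since a DP Pool spreads LUNs across its parity groups rather than pinning them to one.

```python
import math


def summarize_pool(parity_groups, data_drives, parity_drives,
                   drive_gb, lun_count, lun_gb, aggr_rg_size):
    """Back-of-the-envelope summary of one DP Pool and its aggregate.

    All names here are hypothetical; sizes are nominal GB, not formatted.
    """
    total_drives = parity_groups * (data_drives + parity_drives)
    nominal_data_gb = parity_groups * data_drives * drive_gb
    allocated_gb = lun_count * lun_gb
    return {
        "total_drives": total_drives,
        "nominal_data_gb": nominal_data_gb,
        "allocated_gb": allocated_gb,
        # Average share only: the pool stripes LUNs over all parity groups.
        "avg_luns_per_parity_group": lun_count / parity_groups,
        # Number of NetApp RAID groups the array LUNs fall into.
        "aggr_raid_groups": math.ceil(lun_count / aggr_rg_size),
    }


# The example from the post: 4x (9+1) parity groups of 300GB drives,
# 32x 300GB LUNs, aggregate RG size of 8.
example = summarize_pool(
    parity_groups=4, data_drives=9, parity_drives=1,
    drive_gb=300, lun_count=32, lun_gb=300, aggr_rg_size=8,
)
for name, value in example.items():
    print(f"{name}: {value}")
```

Running the same function for the 450GB and SATA pools would give a quick side-by-side of drives per aggregate and LUNs per RAID group.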
Our AMS has 4 front-end ports per controller, 8 total. Each port is dedicated to a single filer, and each filer has a front-end port on each controller, giving two paths to each device.
I have opened cases in the past with both HDS and NetApp. NetApp told me that the latencies were caused by the HDS disk and HDS told me that the latencies I saw for the SAS LUNs at the time were expected. The latencies were between 25-50ms from the AMS on these SAS LUNs.
The higher-than-expected latencies are primarily on VMware NFS datastores. The majority of our volumes average between 2 and 10ms host latency. The VMware datastores average ~20ms, with peaks above 100ms, and these peaks are fairly common. These are host latencies as measured from both vCenter and NetApp Performance Advisor. We've been round and round with VMware performance, VMDK alignment, network best practices, etc. Things don't seem to change, and all the evidence continues to point to the back-end disk.
Am I seeing disk queuing on the AMS? I think the answer is yes, but I'm not sure how to quantify it. HDS offers a metric called "tag count", which they show on a per-drive and per-LUN basis. The tag count can range from 2 to 25.
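One rough way to put a number on queuing is Little's Law (outstanding requests = throughput x latency). The sketch below is only illustrative. It assumes the tag count approximates the number of outstanding commands, which I have not confirmed with HDS, and it pairs the tag counts with the 25-50ms SAS LUN latencies from the earlier cases purely as sample inputs. The function name is made up.

```python
def implied_iops(outstanding_commands, latency_ms):
    """Little's Law rearranged: throughput = outstanding / latency.

    ASSUMPTION: 'outstanding_commands' stands in for the HDS tag count,
    treated here as the average number of commands in flight.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return outstanding_commands * 1000.0 / latency_ms


# Sample inputs taken from the ranges mentioned in this thread.
for tags, latency in [(2, 25), (25, 50)]:
    rate = implied_iops(tags, latency)
    print(f"tag count {tags} at {latency}ms -> ~{rate:.0f} IOPS implied")
```

If the IOPS implied this way is far from what the array actually reports for the same LUN or drive, the assumption about tag count is probably wrong, or the latency and tag count samples were not taken under the same load.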
These systems do send AutoSupports. I have run perfstat in the past and have provided the output for NetApp's troubleshooting.
I have a NetApp Account Team and I do know who my SE is.
I don't expect this community to solve my problems; I am mostly asking to see what other folks are doing in a similar setup. Most of the settings I've listed above have very little documentation on their own, let alone for a V-Series on AMS.
2013-02-22 07:13 AM
Excellent. You appear to be doing things the right way.
Those latencies are not ideal. 20ms is the typical threshold for acceptable performance. If you have your NetApp case #, can you email that to me? They should still have your perfstat, and there is something I can check in that to see if the back-end is the problem.
2013-02-22 07:19 AM
The case was a while back and our layout has changed a bit since then. I can get you a more recent perfstat if you'd like. What perfstat interval and duration, etc. would you like me to use?