So I have been battling VMware performance issues with my S500 and I noticed something recently. I have been running "sysstat -x 1" a lot lately and watching the counters. I have noticed that right before my snapshot my "cache age" will be around 15-20 minutes, and as soon as the snapshot kicks off the cache age will drop to 0. For the next 30-45 minutes the cache age will stay below 1 minute. The disk activity will be less than 10% before a snapshot and climb to a sustained 50-60% for the next 30-45 minutes. I know NetApp claims that their snapshots are zero impact, but I can't imagine that high disk utilization and a young cache help performance. Is this normal?
I am snapshotting my NFS export (1100 GB), my CIFS share (150 GB), and 1 LUN (5GB) every hour.
I wanted to show a screenshot of the first 24 seconds after 2pm PDT. There should have been 3 snapshots taken at 2pm (NFS, CIFS, and 1 LUN) and no replications scheduled. Is it the deletion of the "hourly.23" snapshot that is causing the disk usage and the 0 cache age? Do the FAS series suffer from the 0 cache age problem with snapshots?
I'd love to help you resolve the VMware performance issues you're experiencing.
In order to do that, I think we should try to establish some baselines of what is going on, indications of performance and also set appropriate expectations.
What are some of the symptoms of the performance impact within VMware: issues, troubles, metrics, and the like?
Also, being that this is an S500, I imagine you have somewhere along the lines of 9-11 disks dedicated to the holding aggregate for the volumes you mentioned (NFS Export, CIFS share, and LUN)
You mention that you take these snapshots every hour, which is perfectly fine, and the snapshot itself is 0 impact, no issue.
Outside of the cache age you're reporting via sysstat -x 1, what impact do your applications or services reflect during the cache age situation?
Also, a question of your snapshot schedule - Do you experience the same level of impact/problems if you have your snapshots taken at different points during the hour? At times, due to resources on my filers (usually during snapmirror replications, bandwidth resources) I would stagger my snapshots to not all kick off at the same moment. Top of the hour, 15 past, half past, quarter to - and the like.
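To make the staggering idea concrete, here is a minimal sketch of how the offsets could be worked out. The job names are hypothetical placeholders, and this only computes minute offsets; it does not talk to the filer, and whether your unit lets you apply such offsets is a separate question.

```python
def stagger_minutes(job_names, period_minutes=60):
    """Spread jobs evenly across one period.

    Returns a dict mapping each job name to its minute offset
    within the period. Job names are illustrative only.
    """
    if not job_names:
        return {}
    step = period_minutes // len(job_names)
    return {name: index * step for index, name in enumerate(job_names)}


if __name__ == "__main__":
    # Hypothetical job names: three snapshot schedules plus one replication.
    jobs = ["nfs_export", "cifs_share", "lun", "snapmirror"]
    for name, minute in stagger_minutes(jobs).items():
        print(f"{name}: {minute:02d} minutes past the hour")
```

With four jobs this gives the top of the hour, 15 past, half past, and quarter to.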
If your problems are exclusive to VMware performance, and the performance is suffering always, or even just sporadically, I would advise a few things.
It is an excellent resource and can help resolve any ESX side issues, as well as ensuring that it is configured in conjunction with your S500.
Looking at it from another perspective, depending upon the workload your system may be under, you could be looking at a bottleneck at a Network layer (ESX, Switch, Filer), or even a Storage limitation (IOPS).
The best I can offer is based upon the information you've provided, so I'd love to know more details about the types of challenges you're experiencing so we can isolate and address the problems you're having.
Thank you for the detailed response. The one question I didn't seem to get answered from your last response is "Is it normal for taking a snapshot to dump your cache age back to zero?" Now let me answer some of the questions you raised. The main problem is that we have desktop applications that access SQL database VMs, and the user experience has become very slow. The SQL servers used to be physical machines and I did a P2V conversion. From reading the PDF that you linked me to, I noticed that my partitions are not on the correct offset. Do you think that is really a big deal? It's too bad there is not an easier solution. I also noticed a TCP receive window setting that I cannot set at the StoreVault command line; I think it might be a FAS-only setting. Also, through the StoreVault Manager application I am unable to change the offset of when the snapshots take place. It looks to be locked at the top of every hour.
My S500 has 12 500GB disks in RAID-DP with 1 hot spare. I would say CIFS IOPS averages 200-300, NFS IOPS averages 200-300, and iSCSI is almost 0. Is there a way to capture histories of performance counters and get real averages? I would say total IOPS is usually under 500, and rarely have I ever seen it spike over 1000. How many IOPS should the S500 be able to handle? I have both NICs on the S500 teamed into a virtual NIC for 2 Gb and the switch ports bonded. I have two ESX hosts, and each ESX host has 4 NICs: 2 for VM traffic and 2 for the VMkernel and console ports. This is all connected through an HP gigabit switch. The switch has QoS, so I have lowered the priority of SnapMirror traffic and raised the priority of NFS. Is there anything else that would be of use to you?
Also, what would happen if you put higher-density DDR memory in an S500 to give it 2GB or 4GB of cache?
The information you've provided is exactly what we're looking for!
Let me first start with Partition offset - This is quite possibly one of the most important things you can do.
When your partitions are not aligned, you will experience performance problems (So, that appears to be in sync with what we're seeing!)
So, if you're able to apply partition alignment best practices for your Windows VM's, you'll be better off as far as your performance goes.
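As a rough illustration of what "aligned" means here, the check below tests whether a partition's starting byte offset falls on a 4 KB boundary, which is the block size the filer works in. It is only a sketch: the function name is made up for this example, it assumes 512-byte sectors, and you would need to read the real starting sector from your guest OS yourself.

```python
BLOCK_BYTES = 4096   # 4 KB storage block size
SECTOR_BYTES = 512   # assumed sector size reported by the guest


def is_aligned(start_sector, sector_bytes=SECTOR_BYTES, block_bytes=BLOCK_BYTES):
    """Return True if the partition's starting byte offset is a multiple of the block size."""
    return (start_sector * sector_bytes) % block_bytes == 0


if __name__ == "__main__":
    # Sector 63 is the classic default start on older Windows installs.
    for sector in (63, 64, 128):
        offset = sector * SECTOR_BYTES
        state = "aligned" if is_aligned(sector) else "NOT aligned"
        print(f"start sector {sector} (byte offset {offset}): {state}")
```

A partition starting at sector 63 begins at byte 32256, which is not a multiple of 4096, so each guest block can straddle two blocks on the filer and a single guest I/O can turn into extra work at the storage layer.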
The inability to set the TCP receive window setting in the Storevault may not break you, if we're not experiencing network bottlenecks (Please advise)
From an IOPS perspective, those types of IOPS are not terrible, and should be within the boundaries of what 11 SATA disks can deliver. Long term, if you sustain and run on the higher side there, you can experience some limitations at the spindle level, but for the time being it isn't screaming spindle-bound performance issue.
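For a back-of-the-envelope feel for the spindle side, here is a small sketch. Everything in it is an assumption to adjust: it reads your 12 disks as 11 in the RAID-DP group plus 1 hot spare, counts 2 parity disks, and uses a rule-of-thumb figure of about 75 random IOPS per 7.2k SATA disk. It ignores cache hits and write coalescing, so treat the result as a rough reference point and not a rating for the S500.

```python
def estimate_spindle_iops(total_disks, hot_spares, parity_disks, iops_per_disk):
    """Rough random-IOPS estimate from data spindles only.

    All inputs are assumptions supplied by the caller; this does not
    account for controller cache, write coalescing, or workload mix.
    """
    data_disks = total_disks - hot_spares - parity_disks
    if data_disks <= 0:
        raise ValueError("no data disks left after removing spares and parity")
    return data_disks * iops_per_disk


if __name__ == "__main__":
    # Assumed layout: 12 disks, 1 hot spare, 2 RAID-DP parity disks,
    # and an assumed ~75 IOPS per 7.2k SATA spindle (rule of thumb).
    estimate = estimate_spindle_iops(total_disks=12, hot_spares=1,
                                     parity_disks=2, iops_per_disk=75)
    print(f"Estimated spindle-level IOPS: {estimate}")
```

Under those assumptions, a typical load under 500 IOPS sits below the estimate, while sustained spikes toward 1000 would be leaning on cache.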
Your network configuration sounds pretty solid, so I won't challenge that as part of your problems!
Increasing the memory from 2GB to 4GB in this particular unit is not supported, so I'd not suggest going that route. Otherwise though, if you are having performance issues with VMware induced by partition boundaries, all it would do is mask the issue. So I would suggest we hold off on unsupported memory upgrades! 🙂
For the time being, we can hold off on moving the timing of the Snapshots. I'll see what methods are possible there, but the option not being available in the SVM introduces a challenge which I don't think we should try to pursue.
And last but not least, your first question about Cache Age!
These two articles alone should give you a good idea of to what extent the problem you are experiencing exists.
If it is rather significant, I would suggest correcting the Alignment(s) if possible, and/or engaging NetApp Professional Services Virtualization teams, to help ensure you're meeting the maximum capabilities of your deployment.
Hope this helps get you off the ground and running on your issues Nick,
Thank you for your feedback, I have found it very helpful. Sadly I cannot access the 3 articles you linked me to. I am getting an unauthorized access message even though I am logged in. Can you please make them file attachments?