We have seen performance degrade significantly when backups run in the evening on a SAS aggregate. Based on analysis in Unified Manager, the contention is on the same node and the same aggregate, so we are planning to move the backup volumes over to a SATA aggregate.
Shifting data between aggregates/nodes is one of the great value propositions of cDOT, as it is seamless and essentially transparent to all protocols.
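As a sketch of what that move looks like from the cluster shell (the vserver, volume, and aggregate names here are all placeholders, not taken from the original post):

```
# Move a backup volume to the SATA aggregate non-disruptively;
# all names below are hypothetical
volume move start -vserver vs1 -volume backup_vol1 -destination-aggregate node2_sata_aggr1

# Watch progress; the cutover itself is transparent to clients
volume move show -vserver vs1 -volume backup_vol1
```

The move copies data in the background and only cuts over once the destination is in sync, which is why clients never notice.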
For this particular case, we have to look at a couple of things. You said that your analysis (based on Unified Manager) points to congestion on the node and aggregate where the backups are landing (if I read your original post correctly). The next question is whether that congestion is due to disk response time or to controller load.
The SATA aggregate will likely respond differently than the SAS aggregate; the question, again, is whether it runs faster or slower. Assume it isn't a controller load issue. It's possible that the SAS aggregate responds so poorly now that moving the backup load to the SATA aggregate actually lets the backups perform better, because they are no longer contending for the same disks. That implies backup data might come in faster. Similarly, the load still on the SAS aggregate might try to go faster. Suddenly you might have enough total load that the controller does become the issue, since it might not be able to drive both aggregates as fast as the disks can now operate. By splitting the load across aggregates, you are adding IOPS capability.
Of course, that is true only when considering the load on the SAS aggregate today. What about the load already on the SATA aggregate (unless it's brand new)? Will that now be impacted by a new chunk of backup load?
Then again, the controller might be the issue today. Granted, there is a lot of traffic to the SAS aggregate, but can the aggregate itself keep up while the controller can't feed it fast enough? If it is a controller-based issue, moving data may not change much of anything.
Will the SATA aggregate handle the backup load at the current IOPS rate, or will it slow down and create other issues for the backup traffic, or for the system as a whole, because the backup window grows?
There is no hard and fast answer, of course, but these are some of the possibilities. Given the very brief description, we can only highlight what might happen. As you transfer load between aggregates (if you can shift it in stages rather than all at once), you can watch the performance changes to make sure you are getting the results you want and adjust your plan as needed.
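One way to watch latency as each staged move lands is from the cluster shell (a sketch; the vserver, volume, and node names are placeholders):

```
# Per-volume latency broken down by cluster component
# (network, data processing, aggregate/disk, etc.)
qos statistics volume latency show -vserver vs1 -volume ora_vol2_backups

# Or sample node-level counters over time to see controller headroom
statistics show-periodic -node node2
```

The component breakdown is useful here precisely because it helps separate "disk is slow" from "controller is slow," which is the crux of this thread.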
If you are sure the disks in the SAS aggregate are the bottleneck, simply start moving some of the volumes away (preferably during low-utilization hours; SATA should be the preferred place for backups anyway). If instead the controller is the bottleneck, you could relocate an aggregate from one controller to its HA partner non-disruptively.
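For the controller-bottleneck case, aggregate relocation (ARL) moves ownership of a whole aggregate to the HA partner without copying any data. A sketch, with placeholder node and aggregate names:

```
# Hand ownership of the aggregate to the HA partner non-disruptively;
# names are hypothetical
storage aggregate relocation start -node node2 -destination node1 -aggregate-list node2_02_sas_aggr1

# Check relocation status
storage aggregate relocation show
```

Because only ownership changes hands, this completes far faster than a volume move, but it shifts the entire aggregate's load to the partner, so check the partner has headroom first.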
If you are not sure about the bottleneck, please open a support ticket with NetApp and let them take a look at it. If you have any further questions, please don't hesitate to contact me.
P.S. If you feel this answer is helpful, please give it a kudo or mark it as the correct answer to make it easier for others to find.
During the period of slow response, it was so bad that when the DBA typed a command, it took a very long time to get the result back. I am getting quite a few alerts from OnCommand Performance Manager, similar to the following:
ora_vol1_arch is slow due to ora_vol2_backups causing contention on node2_02_sas_aggr1
ora_vol3 is slow at the data processing node
ora_vol4 is slow due to ora_vol4 causing contention on the data processing node
Messages like the above repeated multiple times across different volumes.
So the contention could be coming from any one or two volumes, or from node2 itself (all the volumes mentioned above are on node2). Every time it happens, backups are running and are always the ones being blamed. Since the backup volumes should be the first candidates to move to the slower SATA aggregate anyway, I am thinking that no matter which volume is really causing the contention, moving the backup volumes off should be the first step.
The question is: is it really the aggregate or the controller causing the contention, or both? The alerts complain about both the node and the aggregate. I would guess the I/O was so heavy on the aggregate that it made the node too busy to keep up. Further, if it is the controller, since I can't do anything about that in the short term, the only option I have for now is to move volumes off.
What are your thoughts?
The other question:
Since only node2 has a SATA aggregate, how much contention can I actually reduce on the node2 controller by moving the backup volumes off the SAS aggregate?
First, one thing to keep in mind: for "traditional" processing, even when a system handles everything well during the day, the biggest overall stress generator is the backup cycle. Backups can kill perfectly good architectures simply because everything processes data at once, at massive I/O rates compared to regular processing. It is very common to push things too far, so the fact that backups are stressing your environment is nothing new. Everybody runs into it eventually.
So, back to the issue at hand. Without the detailed performance data you can collect at the moment OCUM/OPM declares an event, and working from just the event descriptions, yes, it sounds like you have a node processing-capability issue. Thus, on the surface, moving data to a new aggregate on the same node may not make a big difference: the same node is still processing all the data.
OPM is pretty good at detecting when issues are due to aggregates or nodes or network or any of the other processing elements in the entire data flow chain. You'll note that one event called out an aggregate in particular, whereas others called out the node. That *could* mean a couple of things - the aggregate called out might be slowing the node down for other processing and/or the node may simply be unable to keep up despite the heavy load on a single aggregate.
In a traditional "fix-it" mindset: first, do you have an aggregate on a partner node to which you could move part of the data in question, as opposed to just moving it to a different aggregate on the same node (even if SATA disk exists only on the one node)? Both options can be tried, but with the limited information available I can't say whether you will just relocate the performance issue or actually see improvement. My expectation, from the information at hand: if, based on total I/O, a performance (SAS) disk aggregate can't handle the load and the node is bogged down because of it, moving some of that load to a capacity (SATA) tier on the same node is not likely to make things better and may well make them worse. Worth a shot perhaps, but don't expect any miracles here.
So, now the non-traditional fixes. The backups appear to be Oracle backups, based on the volume names. I am going to assume they are RMAN-based: can you use the RMAN channel configuration in the backup job to limit the bandwidth available to the backup device and artificially throttle the backup load (the RATE parameter on the channel configuration, if I remember my Oracle correctly; it's been a while since I worked with Oracle DBs)? By default RMAN tries to go as fast as it possibly can; an artificial limit will increase your backup time but may also lower the total I/O load to and from your target volumes.
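If the backups are indeed RMAN-based, a throttled channel might look like the following (the 50M figure is purely illustrative; tune it against your backup window):

```
# Cap each disk channel so the backup stops saturating the
# aggregate (value below is an example, not a recommendation)
CONFIGURE CHANNEL DEVICE TYPE DISK RATE 50M;

# Or limit a single run without changing the persistent configuration
RUN {
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK RATE 50M;
  BACKUP DATABASE;
}
```

The RATE parameter caps read throughput per channel, so total backup I/O is roughly RATE times the number of channels.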
Then again, can you use Snapshot copies to handle your Oracle backups instead, either manually or through SnapManager for Oracle's integration with RMAN? Rather than streaming traditional backup data to a target volume, Snapshot-based backup and the Oracle integration products are part of the total value proposition of a NetApp solution.
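A manual version of that approach is roughly: put the database in hot backup mode, take a Snapshot copy, then take it back out. A sketch with placeholder names throughout; SnapManager for Oracle automates and catalogs these same steps:

```
# In SQL*Plus, quiesce the datafiles first:
#   ALTER DATABASE BEGIN BACKUP;

# On the cluster shell: the Snapshot copy is near-instant and
# generates almost no I/O (vserver/volume names are hypothetical)
volume snapshot create -vserver vs1 -volume ora_vol1 -snapshot nightly_backup

# Back in SQL*Plus:
#   ALTER DATABASE END BACKUP;
```

Because the Snapshot is a pointer-based copy rather than a data stream, the backup window shrinks from hours to seconds and the streaming I/O load on the aggregate largely disappears.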
[ soapbox mode on ]
In my opinion, if you don't leverage them everywhere you can, especially for situations like this, why spend the extra money on NetApp as opposed to just a dumb bunch of disks?
[ soapbox mode off ]
And lastly, assuming the systems are under maintenance, what are your account engineers and NetApp support teams suggesting with regards to this issue?