My FAS2650 2-node cluster is running out of gas. I'm pushing more I/O than this number of spindles can deliver. I need to micromanage performance to keep us up and running while we wait for additional resources to arrive. The filer is running ONTAP 9.3P2.
I'm noticing that when I back up volumes with NDMP (using NetVault Backup), the backup traffic can consume a significant portion of the available performance capacity. I'm looking for a way to throttle NDMP so it doesn't impact the overall performance of the filer.
I've looked at storage QoS, but I can't find a way to manage requests by protocol. In some cases, I want a volume to give high priority to NFS requests while giving low priority to NDMP requests to that same volume. I haven't found a way to do that.
Let's try to narrow down the issue. I can imagine that when nodes are running at 100% performance utilization, it can be tricky to balance the load. This is where QoS can be very handy, but again, a QoS limit is expressed in terms of either: 1) IOPS, or 2) throughput in MB/s.
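To make the "ceiling" idea concrete, here is a minimal Python sketch of a limit applied per time interval. It is a simplified fluid model assumed for illustration, not ONTAP's QoS scheduler: the function name `apply_ceiling` is hypothetical, and work above the limit is modeled as queued (which clients would see as added latency) rather than dropped. The same code works whether the unit is IOPS or MB/s.

```python
def apply_ceiling(demand: list[float], limit: float) -> tuple[list[float], list[float]]:
    """Hypothetical model of a QoS ceiling over consecutive intervals.

    demand: work requested in each interval (IOPS or MB/s, same unit as limit).
    Returns (work served per interval, backlog left after each interval).
    Unserved work is carried forward, i.e. delayed rather than rejected.
    """
    served: list[float] = []
    backlog_trace: list[float] = []
    backlog = 0.0
    for requested in demand:
        pending = backlog + requested   # new work plus anything still queued
        done = min(pending, limit)      # never serve more than the ceiling
        backlog = pending - done        # the rest waits for the next interval
        served.append(done)
        backlog_trace.append(backlog)
    return served, backlog_trace


served, backlog = apply_ceiling([800, 1500, 1500, 200], limit=1000)
print(served)   # served rate flattens at the ceiling
print(backlog)  # excess demand accumulates as queued work
```

With a 1000 IOPS ceiling and two consecutive seconds of 1500 IOPS demand, the served rate flattens at 1000 and the excess builds up as backlog; that queued work is what shows up as latency on the client side.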
My reasoning for why QoS doesn't apply per protocol: NDMP is not a general front-end file-system protocol; it's a data management protocol designed specifically for backups. In the usual case, a volume serving CIFS serves only CIFS, and the same applies to NFS and iSCSI, so applying a QoS limit (IOPS or throughput) does the job. However, that limit restricts all I/O to the volume, irrespective of the protocol it is serving. Hence, it gets tricky when the same volume is serving NFS and is also being backed up, because both consume IOPS under the same limit (NDMP backups consume read IOPS).
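The protocol-blind behaviour can be sketched the same way. In this hypothetical Python model, one volume-level ceiling is shared by every source of I/O; once total demand exceeds the limit, each source is assumed to be served in proportion to its demand. That proportional split is a simplifying assumption, not documented ONTAP behaviour, and the labels "nfs" and "ndmp" are just dictionary keys: nothing in the ceiling itself can tell them apart.

```python
def share_of_ceiling(demand_by_source: dict[str, float], limit: float) -> dict[str, float]:
    """Split one volume-level ceiling across all sources of I/O (hypothetical model).

    The ceiling only sees total I/O, so when the total exceeds the limit every
    source is scaled down by the same ratio, regardless of protocol.
    """
    total = sum(demand_by_source.values())
    if total <= limit:
        # Under the ceiling: every source gets what it asked for.
        return dict(demand_by_source)
    scale = limit / total
    return {source: demand * scale for source, demand in demand_by_source.items()}


# NFS alone fits under a 4000 IOPS ceiling.
print(share_of_ceiling({"nfs": 3000}, limit=4000))
# Adding NDMP read traffic squeezes NFS by the same ratio as the backup.
print(share_of_ceiling({"nfs": 3000, "ndmp": 5000}, limit=4000))
```

With a 4000 IOPS ceiling, 3000 IOPS of NFS traffic alone is fully served; add 5000 IOPS of NDMP reads and, in this model, NFS drops to 1500 IOPS while NDMP gets 2500. Lowering the ceiling to slow the backup would squeeze NFS by the same ratio, which is why a volume QoS policy on its own can't favour one protocol over the other.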
You mentioned that you have a 2-node cluster that is running out of gas. Could you give us some idea of how the volume/node/LIF layout looks? I'm asking because I want to know:
1) Are NDMP backups running on both nodes, and what is their schedule? 2) Is the performance headroom on both nodes crossing threshold limits for a) CPU (N-blade) and b) aggregate/disk (D-blade)?
This is one of the reasons it is recommended to use the secondary Mirror/Vault copy for NDMP and general backup purposes: that way, the primary filers are not touched by backups. Do those NFS volumes have replication going to another cluster?
You also mentioned that "the backup traffic can consume a significant portion of the available performance capacity." Is the NDMP backup throughput causing high network traffic?
Do you have OnCommand Unified Manager/Active IQ Unified Manager? It can be handy for getting some perspective on which volumes (IOPS, MB/s) and which nodes are under the most pressure; an analysis of the last week or the last 72 hours, lined up against the backup time frame, can be very useful in troubleshooting. Based on that, some decisions can be made, for example: can we move the data LIF that is serving NFS to the less-loaded HA node (say, Node-B) while the volumes are being backed up on Node-A? This is just an idea, but something along these lines can be done.
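Once the Unified Manager data has been exported (for example, as per-node utilization samples over the backup window), the decision above boils down to a couple of simple checks. The Python sketch below assumes a hypothetical input format, a dict mapping node name to utilization percentages, and hypothetical helper names; it is not a Unified Manager API. It flags nodes whose peak crosses a headroom threshold and picks the node with the lowest average load as a candidate for the NFS data LIF.

```python
from statistics import mean


def node_pressure(samples: dict[str, list[float]]) -> dict[str, float]:
    """Average utilization (%) per node over the analysis window."""
    return {node: mean(values) for node, values in samples.items()}


def over_threshold(samples: dict[str, list[float]], threshold: float) -> list[str]:
    """Nodes whose peak utilization crossed the threshold at least once."""
    return sorted(node for node, values in samples.items() if max(values) > threshold)


def least_loaded(samples: dict[str, list[float]]) -> str:
    """Node with the lowest average utilization: a candidate to host the NFS LIF."""
    pressure = node_pressure(samples)
    return min(pressure, key=lambda node: pressure[node])


# Hypothetical samples taken during the backup window (not real measurements).
samples = {"Node-A": [70, 95, 98], "Node-B": [40, 55, 60]}
print(over_threshold(samples, threshold=85))
print(least_loaded(samples))
```

With these hypothetical samples, `over_threshold(samples, 85)` flags Node-A and `least_loaded(samples)` suggests Node-B. The same checks can be run separately for CPU and aggregate/disk utilization, since either one can be the bottleneck.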