ONTAP Discussions

QOS Best Practices

TMADOCTHOMAS
8,901 Views

First some background: we are running ONTAP 9.5P14 on a 4-node AFF8080 cluster. I am moving volumes and shelves over to reduce this to a 2-node cluster by the end of this month.

 

We have never used QoS policies because we've never had a need with a 4-node AFF. Now that we're dropping to 2 nodes, I believe we may need to prevent a few test/dev volumes from dominating the CPU. As a result, I am researching the QoS feature in detail for the first time. I am reading the basic documentation in the Documentation Center; however, I have not been able to find a best practices guide for this feature. Is there one, and/or can anyone point to a good blog post or other article that provides some recommendations? I am not a performance expert and want to be sure I don't break anything. Would appreciate any recommendations! Thank you.

1 ACCEPTED SOLUTION

paul_stejskal
8,728 Views

Hi. This is great feedback. I'm one of the senior perf TSEs here in AMER and also have been working to improve our KB site....

 

I would say talking to the account team is definitely important here too. This is more of an architecture question about how to design and use the storage, whereas on the Support side we address problems as we identify them.

 

A QoS policy is literally set it and see. I would say start with a plain QoS policy and not worry about minimum throughputs or adaptive QoS just yet. Adaptive QoS behaves differently depending on the ONTAP version (its behavior changed in 9.7), and it will also throttle volumes outside of the policy. Here's a KB on it: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/What_is_Adaptive_QoS_and_how_does_it_work%3F

 

The qos statistics commands are live, and AIQUM will show the IOPS and throughput at 5-minute intervals. Setting the QoS policy itself is covered in the "What is QoS" KB, but you just create the policy and apply it to the volume. This KB talks about using AIQUM (you might try 9.7 or 9.8!) to analyze some of this: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Infrastructure_Management/Active_IQ_Unified_Manager/How_to_monitor_volume_latency_from_ActiveIQ_...

 

A lot of customers use a three-tier approach, and some of them use QoS on noisy neighbors (bully/shark workloads). You can definitely set it and see, and you can fire up a test volume with a synthetic workload to see what it is like. Be careful not to set limits too low (for example, 5 IOPS or 5 MB/s when the application wants 40,000 IOPS or 10,000 MB/s); otherwise it will overwhelm the network layer. Set it and monitor with qos statistics volume latency show -volume <volume> -vserver <svm name> and qos statistics volume performance show -volume <volume> -vserver <svm name>. Use both commands.
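As a rough illustration of the "don't set limits too low" caution, here is a minimal Python sketch that checks a proposed ceiling against IOPS samples you have already noted down (for example, by hand from the live qos statistics output or from AIQUM). The function name `fraction_throttled` and the sample numbers are hypothetical and are not part of ONTAP or any NetApp tool.

```python
def fraction_throttled(iops_samples, proposed_limit_iops):
    """Return the share of observed samples that exceed the proposed ceiling.

    iops_samples: IOPS readings noted down manually (illustrative input).
    proposed_limit_iops: the ceiling you are thinking of applying.
    """
    if not iops_samples:
        raise ValueError("need at least one IOPS sample")
    over = sum(1 for s in iops_samples if s > proposed_limit_iops)
    return over / len(iops_samples)


# Made-up readings for one volume, including two large bursts.
samples = [1200, 1500, 900, 40000, 38000, 1100]

print(fraction_throttled(samples, 5))      # a 5 IOPS limit is below every sample
print(fraction_throttled(samples, 2000))   # only the two bursts exceed 2000 IOPS
print(fraction_throttled(samples, 50000))  # no sample exceeds 50000 IOPS
```

A result close to 1.0 suggests the limit would throttle the workload almost all the time, which is the situation the warning above describes.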

 

Let me know if this helps.

 


7 REPLIES

darb0505
8,820 Views

Hi TMADOCTHOMAS,

 

I was not able to find any best practices for the QoS feature overall. There are some additional KBs that might help you with setting up QoS in your environment.

 

KBs:

Documentation Center:

 

Let us know if you have any additional questions or concerns about the QoS feature.


Thanks

 

 

Team NetApp

aladd
8,814 Views

Hello @TMADOCTHOMAS 

 

As stated by @darb0505, there are a few resources, but nothing hard and fast in the way of a best practice, since the right settings depend on the workload you are trying to manage.

 

There are a few additional resources that can help guide you through setting up a policy that works best for you.

 

FlackBox:

https://www.flackbox.com/netapp-storage-qos-tutorial

 

ONTAP 9 Documentation concerning QoS ceiling and floor:

https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.pow-perf-mon%2FGUID-77DF9BAF-4ED7-43F6-AECE-95DFB0680D2F.html

 

TMADOCTHOMAS
8,735 Views

Thank you @darb0505 and @aladd! I am checking out the links. I think I will likely create a policy with no settings and then review statistics to determine what the policy should be set to. One more question regarding this: does the qos statistics performance show command provide cumulative average statistics, or does it only show the current statistic? If the latter, how do you get the details to know which QoS policy to set? Can it be obtained from ActiveIQ UM (we are on 9.6)?


TMADOCTHOMAS
8,724 Views

Thank you @paul_stejskal! My only hesitation on 'set it and see' is the one you raised, and that I've seen in documentation - I don't want to inadvertently set the policy too low and cause a problem for the object in question. I do notice that AIQUM provides specific QoS recommendations for volumes that are triggering their performance thresholds, which is helpful. I will likely just make it very conservative to play it safe initially. Thank you again for the helpful links and info!
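One way to picture "very conservative" is to start from the highest IOPS you have observed and add generous headroom. The Python sketch below does only that arithmetic. The function name `conservative_ceiling`, the 50% headroom, and the rounding step are all assumptions made for illustration, not NetApp guidance, and the AIQUM recommendations mentioned above should take precedence where available.

```python
import math


def conservative_ceiling(iops_samples, headroom=0.5, round_to=500):
    """Suggest a starting IOPS ceiling: observed peak plus headroom, rounded up.

    headroom=0.5 means 50% above the observed peak (an arbitrary, cautious
    default chosen for this example).
    round_to rounds the result up to a tidy multiple.
    """
    if not iops_samples:
        raise ValueError("need at least one IOPS sample")
    peak = max(iops_samples)
    target = peak * (1 + headroom)
    return int(math.ceil(target / round_to) * round_to)


# Made-up readings for a test/dev volume.
samples = [1800, 2500, 4200, 3100]
print(conservative_ceiling(samples))  # peak 4200 * 1.5 = 6300, rounded up to 6500
```

Because a policy change takes effect almost immediately, a ceiling chosen this way can be tightened step by step while watching the qos statistics output.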

paul_stejskal
8,720 Views

Ah yes, I forgot that's part of AIQUM! I would definitely go by that. It knows your workloads mathematically, so it should have some idea.

 

As far as setting it and testing, it literally takes effect as soon as you apply it, and as soon as you remove it, within a couple seconds the policy is gone.

TMADOCTHOMAS
8,710 Views

Thank you @paul_stejskal , good to know!
