Node's Performance Capacity >100%, but Aggregate's Performance Capacity is ~30%, latency is low

heightsnj · ‎2020-01-24

Do we have a performance concern?

Over the last 4 days, the node's Performance Capacity has been consistently very high, reaching >100% or even >150%, while the AFF/SSD aggregate's Performance Capacity is only about 30%.

My understanding is that Performance Capacity just tells you that you cannot add more workloads. Does it necessarily tell you whether you are having a performance issue?

An application is slow. It runs on VMs on an NFS datastore, but the datastore volume seems okay: latency is not too high, only < 5 ms. The volume's latency graph doesn't match the slow response.

So we are not sure whether we have resource contention, and whether the node's Performance Capacity is indicating an issue.

kahuna · ‎2020-01-25

I guess you are referring to the graph in OnCommand.

100% just suggests that from this point on, latency will increase exponentially.

heightsnj · ‎2020-01-25

As I said, the node's Performance Capacity reached >150% and stayed there for many days, but the latency graph doesn't show high values, only about 5 ms.

Any other ideas?

kahuna · ‎2020-01-25

just focus on latency

paul_stejskal · ‎2020-01-27

Please give output of:

::> set d -c off; qos statistics workload resource cpu show -node XXXXXXXXXXXX

::> qos statistics volume latency show -volume XXXXXXXX -vserver XXXXXXXXXXXX

::> qos statistics volume characteristics show -volume XXXXXXXXXXXXX -vserver XXXXXXXXXXXXXX

What application is this? Also, I'd recommend upgrading to AIQUM 9.7, as it lets you add in your VM level. This will help you compare VM and filer latencies and IOPS.

heightsnj · ‎2020-01-28

@paul_stejskal,

Please find the outputs of the three commands you recommended, and let me know your thoughts. Thanks!

OUTPUT OF COMMAND1:

Workload          ID   CPU  Wafl_exempt  Kahuna  Network  Raid  Exempt  Protocol
----------------  --  ----  -----------  ------  -------  ----  ------  --------
-total- (2000%)    -  773%          63%      0%     203%    0%    507%        0%
System-Default     1  599%           0%      0%     165%    0%    434%        0%
_WAFL              7   85%          48%      0%       0%    0%     37%        0%
_WAFL_SCAN        19   44%          13%      0%       0%    0%     31%        0%
User-Default       2   37%           0%      0%      37%    0%      0%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
_CLOUD            25    1%           0%      0%       0%    0%      1%        0%
-total- (2000%)    -  688%          72%      0%     200%    0%    416%        0%
System-Default     1  508%           0%      0%     159%    0%    349%        0%
_WAFL              7   79%          54%      0%       0%    0%     25%        0%
_WAFL_SCAN        19   53%          16%      0%       0%    0%     37%        0%
User-Default       2   41%           0%      0%      41%    0%      0%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
_CLOUD            25    1%           0%      0%       0%    0%      1%        0%
-total- (2000%)    -  670%          67%      0%     180%    0%    421%        0%
System-Default     1  498%           0%      0%     145%    0%    353%        0%
_WAFL              7   80%          51%      0%       0%    0%     28%        0%
_WAFL_SCAN        19   49%          15%      0%       0%    0%     34%        0%
User-Default       2   35%           0%      0%      35%    0%      0%        0%
_USERSPACE_APPS   14    2%           0%      0%       0%    0%      2%        0%
-total- (2000%)    -  535%          79%      0%     168%    0%    288%        0%
System-Default     1  371%           0%      0%     132%    0%    239%        0%
_WAFL_SCAN        19   65%          21%      0%       0%    0%     44%        0%
_WAFL              7   59%          57%      0%       0%    0%      2%        0%
User-Default       2   36%           0%      0%      36%    0%      0%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
Workload          ID   CPU  Wafl_exempt  Kahuna  Network  Raid  Exempt  Protocol
----------------  --  ----  -----------  ------  -------  ----  ------  --------
-total- (2000%)    -  404%          78%      0%     164%    0%    162%        0%
System-Default     1  233%           0%      0%     122%    0%    111%        0%
_WAFL_SCAN        19   66%          21%      0%       0%    0%     45%        0%
_WAFL              7   60%          56%      0%       0%    0%      4%        0%
User-Default       2   41%           0%      0%      41%    0%      0%        0%
-total- (2000%)    -  457%          79%      0%     202%    0%    176%        0%
System-Default     1  285%           0%      0%     157%    0%    128%        0%
_WAFL_SCAN        19   66%          21%      0%       0%    0%     45%        0%
_WAFL              7   56%          56%      0%       0%    0%      0%        0%
User-Default       2   45%           0%      0%      45%    0%      0%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
-total- (2000%)    -  407%          76%      0%     208%    0%    121%        0%
System-Default     1  228%           0%      0%     160%    0%     68%        0%
_WAFL_SCAN        19   64%          20%      0%       0%    0%     44%        0%
_WAFL              7   61%          55%      0%       0%    0%      5%        0%
User-Default       2   47%           0%      0%      47%    0%      0%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
-total- (2000%)    -  455%          75%      0%     249%    1%    130%        0%
System-Default     1  274%           0%      0%     192%    0%     82%        0%
_WAFL_SCAN        19   65%          21%      0%       0%    0%     44%        0%
User-Default       2   57%           0%      0%      57%    0%      0%        0%
_WAFL              7   53%          53%      0%       0%    0%      0%        0%
Workload          ID   CPU  Wafl_exempt  Kahuna  Network  Raid  Exempt  Protocol
----------------  --  ----  -----------  ------  -------  ----  ------  --------
-total- (2000%)    -  633%          59%      0%     277%    0%    297%        0%
System-Default     1  446%           0%      0%     218%    0%    228%        0%
_WAFL              7   79%          45%      0%       0%    0%     34%        0%
User-Default       2   59%           0%      0%      59%    0%      0%        0%
_WAFL_SCAN        19   44%          13%      0%       0%    0%     31%        0%
_USERSPACE_APPS   14    1%           0%      0%       0%    0%      1%        0%
_CLOUD            25    1%           0%      0%       0%    0%      1%        0%

OUTPUT OF COMMAND2:

Workload         ID   Latency   Network   Cluster      Data      Disk       QoS     NVRAM     Cloud  FlexCache   SM Sync
------------  -----  --------  --------  --------  --------  --------  --------  --------  --------  ---------  --------
-total-           -  577.00us  129.00us  167.00us  194.00us   66.00us   13.00us       0ms    8.00us        0ms       0ms
XXXXXXX..     25859  387.00us   70.00us  146.00us  171.00us       0ms       0ms       0ms       0ms        0ms       0ms
-total-           -  578.00us  124.00us  177.00us  181.00us   82.00us       0ms       0ms   14.00us        0ms       0ms
XXXXXXX..     25859  298.00us   59.00us  156.00us   83.00us       0ms       0ms       0ms       0ms        0ms       0ms
-total-           -  559.00us  121.00us  169.00us  188.00us   67.00us       0ms       0ms   14.00us        0ms       0ms
XXXXXXXXXX..  25859  309.00us   68.00us  149.00us   90.00us    2.00us       0ms       0ms       0ms        0ms       0ms
-total-           -  891.00us  131.00us  207.00us  358.00us  181.00us       0ms       0ms   14.00us        0ms       0ms
XXXXXXXX..    25859  372.00us   78.00us  165.00us  129.00us       0ms       0ms       0ms       0ms        0ms       0ms

OUTPUT OF COMMAND3:

Workload         ID    IOPS   Throughput  Request Size  Read  Concurrency
------------  -----  ------  -----------  ------------  ----  -----------
-total-           -  159792  3479.28MB/s        22831B   43%          160
XXXXXXX..     25859      44   139.13KB/s         3213B    3%            0
-total-           -  129506  2289.89MB/s        18540B   33%          113
XXXXXXX..     25859      40    99.44KB/s         2545B    5%            0
-total-           -  146848  3286.96MB/s        23470B   42%          132
XXXXXXXX..    25859      56   126.82KB/s         2332B    1%            0
-total-           -  162914  4078.31MB/s        26249B   37%          194
XXXXXXXX..    25859      41    92.88KB/s         2301B    0%            0
-total-           -  151828  3366.92MB/s        23253B   40%          147
XXXXXXXX..    25859      31   114.78KB/s         3750B    0%            0
-total-           -  139929  3091.28MB/s        23164B   33%          140
XXXXXXXXX..   25859      32   121.75KB/s         3895B    1%            0

paul_stejskal · ‎2020-01-28

OK, so it looks like you're doing close to 3 GB/s in this cluster. That's pretty impressive. Most of those are writes. It looks like Exempt and Nwk_Exempt are probably the busiest, so I suspect that's due to the write workload. My honest feel from just those commands is that you are simply doing quite a bit of work and are at a comfortable limit of what the controller can handle without adding more work.
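To make the "busiest domains" reading reproducible, here is a minimal sketch that averages the -total- rows of the CPU output posted above. It rests on two assumptions: that "(2000%)" in the -total- label is the ceiling across all cores, and that "Nwk_Exempt" corresponds to the Network column. The names `TOTAL_ROWS`, `DOMAINS`, and `average_by_domain` are made up for this illustration.

```python
# -total- rows copied from the CPU output above, one row per sample.
# Columns: CPU, Wafl_exempt, Kahuna, Network, Raid, Exempt, Protocol (percent).
TOTAL_ROWS = """
773 63 0 203 0 507 0
688 72 0 200 0 416 0
670 67 0 180 0 421 0
535 79 0 168 0 288 0
404 78 0 164 0 162 0
457 79 0 202 0 176 0
407 76 0 208 0 121 0
455 75 0 249 1 130 0
633 59 0 277 0 297 0
"""

DOMAINS = ["Wafl_exempt", "Kahuna", "Network", "Raid", "Exempt", "Protocol"]


def average_by_domain(rows_text):
    """Return (average total CPU, {domain: average}) over all samples."""
    rows = [[int(v) for v in line.split()]
            for line in rows_text.strip().splitlines()]
    n = len(rows)
    cpu_avg = sum(r[0] for r in rows) / n
    domain_avg = {name: sum(r[i + 1] for r in rows) / n
                  for i, name in enumerate(DOMAINS)}
    return cpu_avg, domain_avg


cpu_avg, domain_avg = average_by_domain(TOTAL_ROWS)
# Sort domains from busiest to least busy.
ranked = sorted(domain_avg.items(), key=lambda kv: kv[1], reverse=True)

# Assumption: 2000% is the maximum (all cores fully busy).
print(f"average CPU: {cpu_avg:.0f}% of 2000% ({cpu_avg / 2000:.1%})")
for name, value in ranked:
    print(f"{name:12s} {value:6.1f}%")
```

Over these nine samples the average is 558% of 2000% (about 28%), with Exempt (roughly 280%) and Network (roughly 206%) well ahead of every other domain.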

If you want to open a case you can, or if you have a hostname/serial and perf archive I can take a deeper look at some actual numbers.

heightsnj · ‎2020-01-30

This is the approach I would like to see. I will send you a private message with the information you need, because I don't want to share the organization's information in public.

Now, why are you saying we are doing close to 3 GB/s in this cluster? Is it because of the following output:

-total- - 159792 3479.28MB/s 22831B 43% 160

Is 3Gb/s on this volume considered good? I checked several other volumes, and they are all around 3Gb/s.

paul_stejskal · ‎2020-01-30

Gigabytes, not Gigabits. B vs b. Big difference (factor of 8)!
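As a quick illustration of the factor of 8, the sketch below converts the 3479.28MB/s figure from the quoted -total- row into both units. It assumes decimal prefixes (1 GB = 1000 MB); whether ONTAP reports MB/s in decimal or binary units is not confirmed here, and `megabytes_to_gigabits` is just a helper name made up for the example.

```python
def megabytes_to_gigabits(mb_per_s):
    """Convert MB/s to Gb/s, assuming decimal prefixes and 8 bits per byte."""
    return mb_per_s * 8 / 1000


total_mb_s = 3479.28  # -total- throughput from the quoted output row

print(f"{total_mb_s / 1000:.2f} GB/s = "
      f"{megabytes_to_gigabits(total_mb_s):.2f} Gb/s")
```

Under that assumption, roughly 3.5 GB/s works out to nearly 28 Gb/s, not 3 Gb/s.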

That output is the overall throughput across the cluster.

And 3 GB/s honestly isn't bad, but I don't know how many nodes that is or which model.
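For scale, the six samples from the third command's output can be used to compare the -total- rows with the rows for the single volume (workload ID 25859). This is only a sketch: the values are copied from the posted output, decimal units are assumed (1 MB = 1000 KB), and the names `SAMPLES`, `iops_shares`, and `mean_total_mb_s` are made up for the example.

```python
# (total IOPS, total MB/s, volume IOPS, volume KB/s) per sample,
# copied from the "volume characteristics" output earlier in the thread.
SAMPLES = [
    (159792, 3479.28, 44, 139.13),
    (129506, 2289.89, 40, 99.44),
    (146848, 3286.96, 56, 126.82),
    (162914, 4078.31, 41, 92.88),
    (151828, 3366.92, 31, 114.78),
    (139929, 3091.28, 32, 121.75),
]

# Fraction of all IOPS that belong to the one volume, per sample.
iops_shares = [vol_iops / total_iops
               for total_iops, _, vol_iops, _ in SAMPLES]

# Fraction of all throughput that belongs to the volume
# (assumption: 1 MB = 1000 KB).
throughput_shares = [vol_kb / (total_mb * 1000)
                     for _, total_mb, _, vol_kb in SAMPLES]

mean_total_mb_s = sum(total_mb for _, total_mb, _, _ in SAMPLES) / len(SAMPLES)

print(f"mean -total- throughput: {mean_total_mb_s:.2f} MB/s")
print(f"largest volume share of IOPS: {max(iops_shares):.4%}")
print(f"largest volume share of throughput: {max(throughput_shares):.4%}")
```

In these samples the volume never accounts for more than about 0.04% of the IOPS, so the multi-GB/s figure describes the -total- rows, not this datastore volume.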
