CDOT 8.3.1 hot spindle in node root aggr

pjc0715 · ‎2015-12-08

Is it normal to see disk util near 100% constantly in the 3-disk node root aggregate on only one node of a two-node switchless cluster? It dips occasionally but stays near 100% for the most part. I'm seeing this on two different two-node switchless clusters.

sysstat -x 1
CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s Cache Cache    CP CP Disk   OTHER    FCP iSCSI     FCP   kB/s   iSCSI   kB/s
                                       in    out    read write    read write    age    hit time ty util                            in    out      in    out
61%      0      0      0      58       2      3 137268 169264       0      0     1s   100% 100% :f 100%      58      0      0       0      0       0      0
61%      0      0      0      44       2      2 140852 125420       0      0     1s   100%   91% Hf 100%      44      0      0       0      0       0      0
62%      0      0      0    2381       6      8 119665 108263       0      0     1s    99%   84% :    71%    2381      0      0       0      0       0      0
62%      0      0      0     205       3      4 146314 139197       0      0     1s   100%   79% Hf 100%     205      0      0       0      0       0      0
63%      0      0      0     526       2      3 146387 102423       0      0     1s   100%   82% Hs   94%     526      0      0       0      0       0      0
61%      0      0      0     429       2      4 132643 173920       0      0     1s   100% 100% :f   92%     429      0      0       0      0       0      0
57%      0      0      0     316       5      5 129926 98813       0      0     1s    99%   75% Hf 100%     316      0      0       0      0       0      0
56%      0      0      0    1122       8     11 108576 176384       0      0     1s    99% 100% :f 100%    1122      0      0       0      0       0      0
43%      0      0      0    1045      37     35   88519   4599       0      0     1s    99% 100% :f 100%    1045      0      0       0      0       0      0
49%      0      0      0    1651       1      1 107293     24       0      0     1s   100% 100% :f 100%    1651      0      0       0      0       0      0
32%      0      0      0     759       1      2   72448      0       0      0     1s   100% 100% #f 100%     759      0      0       0      0       0      0
6%      0      0      0     979       3      3    2324   1288       0      0     1s    96% 100% #f   96%     979      0      0       0      0       0      0
30%      0      0      0     551       5      5   63145 67485       0      0     1s    99% 100% Bf 100%     551      0      0       0      0       0      0
57%      0      0      0     322       2      4 108387 229695       0      0     0s    99% 100% :f 100%     322      0      0       0      0       0      0
55%      0      0      0     199       3      4 105489 164732       0      0     1s    99%   99% Hs 100%     199      0      0       0      0       0      0
56%      0      0      0     241       2      4 107648 212204       0      0     1s    99% 100% :v 100%     241      0      0       0      0       0      0
52%      0      0      0     287       2      2 137472   9045       0      0     1s    99%    9% Hn 100%     287      0      0       0      0       0      0
63%      0      0      0     815       7     11 129112 223251       0      0     2s   100%   97% :    99%     815      0      0       0      0       0      0

statit -e

disk             ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0_01/plex0/rg0:
5c.87             10   4.42    0.46   3.40 1356   3.83 18.41 1726   0.13 15.02   164   0.00   ....     .   0.00   ....     .
5a.52             12   4.56    0.46   3.42 2770   3.98 17.79 2081   0.12 13.00   240   0.00   ....     .   0.00   ....     .
5c.58             94 100.85   97.28   1.18 60458   2.63 26.37 3354   0.94   1.23 57593   0.00   ....     .   0.00   ....     .
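To make the imbalance in the statit output above easier to spot, here is a minimal Python sketch that flags disks whose ut% is at or above a threshold. The rows are copied from the post and truncated to the first few columns. The `hot_disks` helper and the 80% cutoff are assumptions for illustration, not part of any NetApp tool.

```python
# Rows copied from the statit -e output above, truncated to the leading columns.
# Column 1 is the disk name, column 2 is ut% (utilization).
STATIT_ROWS = """\
5c.87 10 4.42 0.46 3.40 1356 3.83 18.41 1726
5a.52 12 4.56 0.46 3.42 2770 3.98 17.79 2081
5c.58 94 100.85 97.28 1.18 60458 2.63 26.37 3354
"""


def hot_disks(text, threshold=80):
    """Return (disk, ut%) pairs at or above threshold (hypothetical cutoff)."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        disk, util = fields[0], int(fields[1])
        if util >= threshold:
            result.append((disk, util))
    return result


print(hot_disks(STATIT_ROWS))
```

With these three rows, only 5c.58 is reported: one disk is doing nearly all of the work while the other two sit at 10-12%.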

bobshouseofcards · ‎2015-12-08

Short answer - yes, in the right circumstances.

Consider - the node root aggregate is where a whole ton of node-based operations occur, from logging to configuration data handling, etc. The permanent copy of a lot of that data lives on the node root aggregate.

And with a three-disk aggregate the system puts all of that on 1 data disk (the other 2 are parity). If you build a configuration that is big enough, uses the right services setup, etc., you by design put load onto the node root aggregate, which is then limited to the speed of a single spindle. You don't have disk details listed, but if the disks are SATA class, the load is somewhat magnified because each access tends to be a bit slower. The fact that it is busy near 100% during a measurement interval, or steadily across measurement intervals, is not unexpected.

There is a "but" in all of this: it's a problem only if the node is not processing other requests fast enough because it is waiting for the node root aggregate. If total system performance is otherwise normal or within acceptable limits, then don't worry about it. If system performance isn't good enough, Perfstats and other load counters will reveal whether the workloads are waiting on something in the "processing" phase of the request, and you can then drill down to the node root aggregate if appropriate.

On heavily used systems, I have found a small but measurable difference in performance by increasing the node root aggregates to 5 total disks, giving you three data disks to better respond to node root aggregate I/O. Not huge, but after I switched to 5-disk node root aggregates many things just "felt" better and performance appeared to show a small % difference. At a large scale, if you have several hundred or a thousand disks in an HA pair, having 10 for node root aggregates isn't a huge deal. Not quite the same calculus if you have maybe 4 shelves across two nodes, of course.
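The disk arithmetic above can be sketched in a few lines of Python. This assumes two parity disks per aggregate, as described in the post (1 data disk out of 3, 3 out of 5). The `data_disks` function name is hypothetical.

```python
def data_disks(total_disks, parity_disks=2):
    """Data disks left after parity, assuming two parity disks per aggregate."""
    if total_disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return total_disks - parity_disks


for total in (3, 5):
    # Root aggregate size per node, and the disk cost across a two-node HA pair.
    print(f"{total} disks per node: {data_disks(total)} data, "
          f"{2 * total} disks total across the HA pair")
```

Going from 3 to 5 disks per node triples the data spindles behind node root I/O at a cost of 4 extra disks per HA pair, which is why the trade-off looks different on a small system than on one with hundreds of disks.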

I hope this helps you.

Bob Greenwald

Lead Storage Engineer

Huron Legal | Huron Consulting Group

NCDA, NCIE - SAN Clustered, Data Protection

Kudos and accepted solutions are always appreciated.

paul_stejskal · ‎2017-10-10

This is completely normal. Per https://kb.netapp.com/support/s/article/root-aggregate-intermittently-displays-100-disk-utilization and https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=806862 this won't be a problem. There are a lot of background tasks that use the root aggr, but since no user I/O is served from it, there will be no impact.

If there is an impact, then I'd love to see it, because 99% of the time there isn't. Most of what is done in the background is internal database updates or things like monitoring software.

This is coming from engineering as well as experience in support and in the field.

sgrant · ‎2017-10-11

Hi, just to expand a little on what Bob and Paul are saying in their posts. This is not a problem if it does not affect other services.

However, we are seeing a lot of high-watermark CPs (type H), meaning that whatever the cluster is doing is maxing out the NVRAM, even to the point where we see a back-to-back CP (type B), i.e. the disk just cannot keep up with the writes. The NVRAM serves the whole system, and if it fills, this affects the performance of the entire system, not just the root aggregate. Currently this is not an issue since the node is not serving any data.
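As a rough way to quantify that, the Python sketch below tallies the first character of the "CP ty" column for a handful of rows copied from the sysstat output in the original post. It assumes whitespace-separated rows in which "CP ty" is the 15th field, which holds for the rows shown here. The meanings in the comments follow my reading of the sysstat man page and should be treated as an assumption, and `cp_type_counts` is a hypothetical helper.

```python
from collections import Counter

# Six rows copied from the sysstat -x 1 output in the original post.
SYSSTAT_ROWS = """\
61% 0 0 0 58 2 3 137268 169264 0 0 1s 100% 100% :f 100% 58 0 0 0 0 0 0
61% 0 0 0 44 2 2 140852 125420 0 0 1s 100% 91% Hf 100% 44 0 0 0 0 0 0
62% 0 0 0 2381 6 8 119665 108263 0 0 1s 99% 84% : 71% 2381 0 0 0 0 0 0
63% 0 0 0 526 2 3 146387 102423 0 0 1s 100% 82% Hs 94% 526 0 0 0 0 0 0
32% 0 0 0 759 1 2 72448 0 0 0 1s 100% 100% #f 100% 759 0 0 0 0 0 0
30% 0 0 0 551 5 5 63145 67485 0 0 1s 99% 100% Bf 100% 551 0 0 0 0 0 0
"""

CP_TY_FIELD = 14  # zero-based index of "CP ty" in these rows (assumption)


def cp_type_counts(text):
    """Count CP types by the first character of the 'CP ty' column.

    H = high-water mark, B = back-to-back, ':' = continuation of the
    previous CP, '#' = continuation with the next CP already pending.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) > CP_TY_FIELD:
            counts[fields[CP_TY_FIELD][0]] += 1
    return counts


print(cp_type_counts(SYSSTAT_ROWS))
```

Run over a longer capture, a rising share of B and # entries relative to H and : would suggest the disks are falling behind the incoming writes.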

Since there are no data protocol IOPS, I'm assuming you have this in an active/passive configuration. In a failover event, would these operations back off sufficiently to serve data without impact, i.e. are they truly background as suggested in the KB article, so that they can be safely ignored?

Has a failover test been performed to measure the impact on service, if any? This will prove if it is an issue or not in your environment.

Thanks,

Grant.

paul_stejskal · ‎2017-10-11

Just to add to that, the above is not an issue in ONTAP 9 thanks to PACP (per-aggregate CPs).