Solved: Aggregate disks over-utilized - LEGIT?

bsnyder27 · 2017-03-24

OnCommand Performance Manager

Policy Name: Aggregate disks over-utilized

Max Data Disk Utilization value of 99% on aggr2_6Gb has triggered a WARNING event based on threshold setting of 95%.

------------------------------------

We moved 5 DS2246 shelves populated with 900GB SAS drives from a retired FAS to our current FAS8200. Ever since, we've been receiving these alerts during early morning disk scrubs.

Thing is, we haven't even created any volumes on either aggregate residing on these shelves. This is the #1 built-in alert we get from Performance Manager, and I've come to believe these alerts are not terribly noteworthy.

Both aggrs are the same and constructed as follows:

58 disks, raid-dp, rg size 20
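For anyone following along, here is a rough sketch of the arithmetic behind that layout. It assumes the RAID groups are filled up to the rg size in order and that RAID-DP uses two parity disks per group; the function name `raid_group_layout` is made up for illustration and is not an ONTAP command or API.

```python
def raid_group_layout(total_disks, rg_size, parity_per_group=2):
    """Split disks into RAID groups, filling each group up to rg_size in order.

    Assumption: groups are filled sequentially; parity_per_group=2 models RAID-DP.
    Returns (group sizes, total parity disks, total data disks).
    """
    groups = []
    remaining = total_disks
    while remaining > 0:
        size = min(rg_size, remaining)  # last group takes whatever is left
        groups.append(size)
        remaining -= size
    parity = parity_per_group * len(groups)
    return groups, parity, total_disks - parity


groups, parity, data = raid_group_layout(58, 20)
print(groups, parity, data)  # [20, 20, 18] 6 52
```

Under those assumptions each aggregate would have three RAID groups and 52 data disks, which is the set of disks the "Max Data Disk Utilization" figure would be drawn from.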

This alert and the 'Node disk fragmentation' canned alert come through a lot and, given they're built-in, cannot be adjusted. This tool has really been kinda just meh for us, as the granularity in alerting rules just isn't near what it used to be with DFM.

colsen · 2017-03-27

Hello,

We've run into both issues with new ONTAP 9 systems. The system we had reporting over-utilization was a FAS8060 with a Flash Pool-enabled RAID-TEC SATA aggregate. Anyway, we stood the system up and from ~1AM to ~5AM the aggregate would report over-utilized during raid-scrub. Once we started putting "real" workloads on the system, the opportunistic raid-scrubs died down to background noise, and now if we see an over-utilization it tends to be legit (usually our SnapMirror jobs).
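A quick way to sanity-check whether an alert lines up with that pattern is to test its timestamp against the scrub window. The sketch below is only an illustration: the 01:00 to 05:00 window is what we observed on our system, not a documented default, and `in_scrub_window` is a hypothetical helper, not something OCPM provides.

```python
from datetime import datetime, time

# Assumption: scrub window as observed on our system (~1AM to ~5AM local time).
SCRUB_START = time(1, 0)
SCRUB_END = time(5, 0)


def in_scrub_window(ts, start=SCRUB_START, end=SCRUB_END):
    """Return True if the alert timestamp falls inside the scrub window."""
    return start <= ts.time() < end


# An alert at 02:30 is probably scrub noise; one at 14:00 deserves a closer look.
print(in_scrub_window(datetime(2017, 3, 24, 2, 30)))  # True
print(in_scrub_window(datetime(2017, 3, 24, 14, 0)))  # False
```

Alerts that fall outside the window are the ones worth a closer look.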

The disk-fragmentation thing appears to be a BURT:

https://kb.netapp.com/support/s/article/ka11A0000001Z2w/oncommand-performance-manager-alert-node-disk-fragmentation

We've run into this on a couple of aggregates (no rhyme or reason why some are alerting and others aren't, but that's why it's a BURT, I s'pose).

Agreed that OCPM isn't all that DFM was. We've been able to sleuth out some ways of getting almost the equivalent information we'd get from DFM, but some of it is lacking. That said, we've had better luck tuning alerts like "vol growth rate abnormal" in OCPM than we ever did in OpsMgr/etc. Maybe I never understood it entirely in the 7-Mode monitoring, but we were able to tweak the std deviation/etc in OCPM in such a way that abnormal really means abnormal (whereas we just stopped using it with OpsMgr/7-Mode).

Hope that helps,

Chris

