Solved: NetApp Harvest : data-list update failed

ana_sgr_1985 · 2017-08-22

Greetings,

I am using NetApp Harvest with InfluxDB and am facing an issue where the data-list update fails for a few objects.

Error Logs:

[2017-08-23 06:10:00] [WARNING] [workload_volume] update of data cache failed with reason: Aggregated instances requested for the workload_volume object exceeds the data capacity of the performance subsystem, because it includes 55392 constituent instances. With the current counter set, use the -node, -vserver, or -filter flags to include at most 40329 constituent instances in order to stay within the data capacity. Alternatively, requesting fewer counters will also reduce the required data and may allow more instances to be requested.
[2017-08-23 06:10:00] [WARNING] [workload_volume] data-list update failed.

At first glance this looks like an error where Harvest is not able to scale to all volume objects in the cluster. Has anyone seen this error before?

Any help is appreciated.

Thanks

Anand

madden · 2017-08-24

hi @ana_sgr_1985

We caught up on Slack yesterday, but I will share here with the community too. The root cause is hitting a max memory limit for an aggregation performed by the counter manager subsystem in ONTAP. Some counter objects are aggregated counters, meaning they are summarized from more detailed counters. For example, workload_volume has one instance for each volume on the cluster. But the cluster actually keeps track of the metrics per volume per ONTAP node (in the workload_volume:constituent object), because access to a volume might include work done by multiple ONTAP nodes. So if you have 500 vols and a 4-node cluster you'd have 2000 instances in workload_volume:constituent, and ONTAP would then aggregate them to 500 in the workload_volume object. To avoid consuming too much memory during aggregation, ONTAP rejects especially large requests, and that is what is happening here.
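To make the arithmetic concrete, here is a small Python sketch (not Harvest code; the function names are made up for illustration). It assumes the simple model described above, one constituent instance per volume per node, and plugs in the two numbers from the warning in the log. Note that the 40329 figure applies only to the counter set that was requested at the time; it is not a fixed ONTAP constant.

```python
def constituent_instances(volumes: int, nodes: int) -> int:
    """Constituent instances under the simple model from the post:
    one instance per volume per ONTAP node."""
    return volumes * nodes


def over_limit_by(requested: int, limit: int) -> int:
    """How many constituent instances a request exceeds the limit by (0 if it fits)."""
    return max(0, requested - limit)


# Example from the post: 500 vols on a 4-node cluster.
print(constituent_instances(500, 4))   # 2000

# Numbers taken from the warning in the log above.
print(over_limit_by(55392, 40329))     # 15063
```

So in this case the request would need to shrink by roughly 15,000 constituent instances (via -node, -vserver, or -filter), or ask for fewer counters, before ONTAP would accept it.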

There are a few options. One is to not request counters from aggregated objects and instead do the aggregation in client code. Another might be to reduce the list of counters collected for each instance, which decreases the memory needed during the aggregation. A third option might be to use the counter manager archiver feature, like OATS does. I'll have a look at adapting Harvest to do the aggregation natively for workload_volume, but no guarantees it will make it to the top of the priority list :-0
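As a rough illustration of the first option (aggregating in client code), the Python sketch below sums per-node constituent rows into one record per volume. This is not how Harvest does it; the row layout and the counter names `ops` and `read_data` are placeholders chosen for the example, and it assumes the counters are simple totals that can be added. Averages such as latency would need an ops-weighted calculation instead of a plain sum.

```python
from collections import defaultdict


def aggregate_by_volume(constituents):
    """Sum counters from per-node constituent rows into one record per volume.

    Each row is assumed (hypothetically) to look like:
    {"volume": str, "node": str, "counters": {counter_name: number}}
    """
    totals = defaultdict(lambda: defaultdict(int))
    for row in constituents:
        for name, value in row["counters"].items():
            totals[row["volume"]][name] += value
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {volume: dict(counters) for volume, counters in totals.items()}


# Made-up sample data: vol1 is served by two nodes, vol2 by one.
rows = [
    {"volume": "vol1", "node": "node-a", "counters": {"ops": 120, "read_data": 4096}},
    {"volume": "vol1", "node": "node-b", "counters": {"ops": 30, "read_data": 1024}},
    {"volume": "vol2", "node": "node-a", "counters": {"ops": 7, "read_data": 512}},
]

print(aggregate_by_volume(rows))
```

The point of doing it this way is that the client asks ONTAP only for the constituent object, so the memory-limited aggregation step on the cluster is never triggered.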

Hope this helps.

Cheers,

Chris Madden

Solution Architect - 3rd Platform - Systems Engineering NetApp EMEA (and author of Harvest)

Blog: It all begins with data

If this post resolved your issue, please help others by selecting ACCEPT AS SOLUTION or adding a KUDO or both!
