We use collectd with the netapp plugin (which uses the netapp-manageability-sdk) to query our filers for performance data. This worked fine until we upgraded to the latest ONTAP version, 8.1.3P2 (7-Mode). Since the upgrade we get an error. We issued the following request with ZEDI:
<!-- Output of perf-object-get-instances-iter-start [Execution Time: 107 ms] -->
<results errno='13001' reason='Not enough memory to get instances for disk.Use perf-object-get-instances-iter-* calls and/or ask for specific counters to limit the amount of data retrieved' status='failed'/>
Funny thing: we already use the -iter-* calls. But the more curious thing is that it works on one filer and not on the other! Both filers run the same ONTAP version.
I agree, it looks like perf-object-get-instances and perf-object-get-instances-iter-start are not behaving correctly in 8.1.2P3. I searched and there's some ongoing discussion around these APIs in our bug tracker. I'll try to alert them to your findings.
I found a workaround that should get your call working. Set the instances value to all of the active disk instances. You can find them with:
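The command that originally followed here didn't survive, but the usual way to enumerate a perf object's active instances is the perf-object-instance-list-info ZAPI. Here is a minimal sketch of parsing its response; the XML below is a hand-written example of the assumed response shape (instances/instance-info/name), not output from a real filer. Against a live system you would get this XML back from the NMSDK's NaServer.invoke().

```python
# Sketch: pull disk instance names out of a perf-object-instance-list-info
# response. SAMPLE_RESPONSE is a hand-written example of the assumed
# response shape, not real filer output.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """\
<results status="passed">
  <instances>
    <instance-info><name>0a.16</name></instance-info>
    <instance-info><name>0a.17</name></instance-info>
    <instance-info><name>0b.24</name></instance-info>
  </instances>
</results>"""

def disk_instances(response_xml):
    """Return the list of perf instance names found in the response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall("./instances/instance-info/name")]

print(disk_instances(SAMPLE_RESPONSE))  # → ['0a.16', '0a.17', '0b.24']
```

You would then pass those names as the instances value of your perf call.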
Thanks for your answer. That's interesting, but not practical for us. We monitor several filers, so we can't bake per-filer disk instance lists into collectd. And the second point: the filer in question has approx. 100 disk instances... this is not practical 😕
My question: why does it work on one filer but not the other? Both are the exact same model with the same firmware version. They SHOULD behave the same!
It would certainly require a programmatic solution to always first query for all perf disk instances and then construct the perf-object-get-instances-iter-start call using those results. I don't know the particulars of working with collectd, so I'm not sure exactly how you'd accomplish that.
I'm guessing the reason it's working on one filer and not the other is that they have a different number of perf disk instances (non-spare disks), with only one having enough disk instances to run afoul of the memory check that's throwing the error telling you to use the iter-* calls. The memory check is probably in the wrong place as it's being triggered even when you're already using iter-* calls.
So it still does look like a problem with the API to me. I'm not aware of any other workarounds besides the one I described earlier. If you need help programmatically tying together the two API calls to first get the list of instances and then construct the iter-start call, I'd be happy to help.
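Tying the two calls together could look something like the sketch below: take the instance names discovered via perf-object-instance-list-info and split them into batches, so that each perf request asks for only a slice of the instances and stays under the memory limit. The batch size, counter name, and the exact child-element names (instances/instance, counters/counter) are assumptions based on the usual ZAPI request conventions, not a verified schema.

```python
# Sketch: batch a list of perf instance names into several smaller
# perf-object-get-instances requests. Element names and the batch size
# are illustrative assumptions, not a verified ZAPI schema.
import xml.etree.ElementTree as ET

def batched(seq, size):
    """Yield successive slices of seq, each at most `size` long."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def build_request(object_name, counters, instance_batch):
    """Build the XML body of one perf-object-get-instances call."""
    req = ET.Element("perf-object-get-instances")
    ET.SubElement(req, "objectname").text = object_name
    insts = ET.SubElement(req, "instances")
    for name in instance_batch:
        ET.SubElement(insts, "instance").text = name
    ctrs = ET.SubElement(req, "counters")
    for c in counters:
        ET.SubElement(ctrs, "counter").text = c
    return ET.tostring(req, encoding="unicode")

all_instances = ["0a.%d" % n for n in range(16, 46)]   # pretend 30 disks
requests = [build_request("disk", ["disk_busy"], batch)
            for batch in batched(all_instances, 10)]
print(len(requests))  # → 3
```

Each request in the resulting list would then be sent separately, and the per-instance results merged on the client side.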
In the meantime, I'll try to get some more eyes on this problem internally.
To follow up, I did find the BURT that describes this issue, #698743, so it is a known issue.
There is another workaround in addition to the one I described initially. You can modify the 7-Mode option which controls the memory limit at which the error is thrown. Increasing the limit should allow your calls to succeed but, according to the BURT, might make you hit the conditions described in BURT #472940. The option is perf.dblade_zapi.buf_limit.
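On the 7-Mode CLI that would look roughly like this (a sketch, not a recommendation; the value 400000 is only illustrative, and as noted below, values above ~400k may be reset when AutoSupport runs):

```shell
# Show the current value of the memory limit option
options perf.dblade_zapi.buf_limit

# Raise the limit (illustrative value; get Support's approval first)
options perf.dblade_zapi.buf_limit 400000
```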
The bug has some internal discussion about the exact issue you've run into. And unfortunately, it looks like it's possible for the perf option I mentioned to get reset when AutoSupport runs if it's set to a value higher than 400k, so increasing that limit might not be a good fix if 400k isn't high enough to ensure your calls succeed.
Yes, as of this week, a fix is in the works. I'm not sure what patch/release it will make it into though. BURT #698743 should be updated when a target is identified.
You are right: there are 216 disk instances on the filer where the call works. The other one has 270 instances, and there it fails.
The perf.dblade_zapi.buf_limit option is set to 262144 on both filers. Can we increase that parameter to get results from the perf calls without running into the other trouble described in #472940?
I'm very hesitant to give advice that might impact your filer performance since I am by no means an expert on the conditions described in #472940. You should probably contact Support and get their recommendation and approval.
As far as I can tell, if you use perf-object-get-instances-iter-* to limit the data by setting a maximum number of records per call, you should be able to safely increase the limit, provided you're not also using other data-hungry calls like disk-list-info, as mentioned in the BURT. But again, I'd recommend getting approval before making changes, as there might be further implications that I'm not aware of.