Problem with an API performance call

netapp_salk · ‎2013-11-13

Hi,

We use collectd with the netapp plugin (built on the netapp-manageability-sdk) to query our filers for performance data. This worked fine until we upgraded to the latest ONTAP version, 8.1.3P2 (in 7-Mode). Since the upgrade we get an error. We sent the following request with ZEDI:

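<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.19">
  <perf-object-get-instances-iter-start>
    <objectname>disk</objectname>
    <counters>
      <counter>cp_read_blocks</counter>
    </counters>
  </perf-object-get-instances-iter-start>
</netapp>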

And get this error:

<?xml version='1.0' encoding='UTF-8' ?>
<netapp version='1.1' xmlns='http://www.netapp.com/filer/admin'>
<results errno='13115' reason='can"t get instances of disk' status='failed'/>
</netapp>

The funny thing is that we already use the -iter-* calls. Even more curious: on one filer it works, on the other it does not, although both filers run the same ONTAP version.

The error is like the one in this other post, https://communities.netapp.com/thread/32804, which is already answered and solved.

Has anyone encountered this problem, or does anyone know how to get the data from the filer? To us it looks like an error in the API. And yes, we have added the parameter

Any help would be appreciated.

Best regards,

Christoph

zulanch · ‎2013-11-13

Hello again Christoph,

I agree, it looks like perf-object-get-instances and perf-object-get-instances-iter-start are not behaving correctly in 8.1.2P3. I searched and there's some ongoing discussion around these APIs in our bug tracker. I'll try to alert them to your findings.

I found a workaround that should get your call working. Set the instances value to all of the active disk instances. You can find them with:

<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.19">
  <perf-object-instance-list-info>
    <objectname>disk</objectname>
  </perf-object-instance-list-info>
</netapp>

That will return your perf disk instances, which will look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.1">
  <results status="passed">
    <instances>
      <instance-info>
        <name>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323030:00000000:00000000</name>
      </instance-info>
      <instance-info>
        <name>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323031:00000000:00000000</name>
      </instance-info>
      <instance-info>
        <name>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323032:00000000:00000000</name>
      </instance-info>
    </instances>
  </results>
</netapp>

Then, run your perf-object-get-instances-iter-start call, specifying all of the instances:

<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.19">
  <perf-object-get-instances-iter-start>
    <objectname>disk</objectname>
    <counters>
      <counter>cp_read_blocks</counter>
    </counters>
    <instances>
      <instance>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323030:00000000:00000000</instance>
      <instance>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323031:00000000:00000000</instance>
      <instance>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323032:00000000:00000000</instance>
    </instances>
  </perf-object-get-instances-iter-start>
</netapp>

That strategy should also work with the perf-object-get-instances API if you don't need to break up the data with the iter-* calls.
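The two steps above can be tied together programmatically. The following is a minimal Python sketch, not part of the NetApp SDK: it only parses a perf-object-instance-list-info response and builds the matching perf-object-get-instances-iter-start request as text. The function names (`instance_names`, `build_iter_start`) and the sample response are illustrative assumptions, and actually sending the request to the filer is left to whatever transport you already use.

```python
import xml.etree.ElementTree as ET

# Namespace used by the requests and responses shown in this thread.
NS = "http://www.netapp.com/filer/admin"

# Shortened sample response (two of the instance names quoted above).
SAMPLE_RESPONSE = """
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.1">
  <results status="passed">
    <instances>
      <instance-info>
        <name>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323030:00000000:00000000</name>
      </instance-info>
      <instance-info>
        <name>4E455441:50502020:56442D31:3030304D:422D465A:2D353230:33383238:33323031:00000000:00000000</name>
      </instance-info>
    </instances>
  </results>
</netapp>
"""


def instance_names(list_info_response):
    """Return every <name> found in a perf-object-instance-list-info response."""
    root = ET.fromstring(list_info_response)
    return [el.text for el in root.iter("{%s}name" % NS) if el.text]


def build_iter_start(objectname, counters, instances, version="1.19"):
    """Build the iter-start request with an explicit <instances> list."""
    root = ET.Element("netapp", {"xmlns": NS, "version": version})
    call = ET.SubElement(root, "perf-object-get-instances-iter-start")
    ET.SubElement(call, "objectname").text = objectname
    counters_el = ET.SubElement(call, "counters")
    for counter in counters:
        ET.SubElement(counters_el, "counter").text = counter
    instances_el = ET.SubElement(call, "instances")
    for name in instances:
        ET.SubElement(instances_el, "instance").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    names = instance_names(SAMPLE_RESPONSE)
    print(build_iter_start("disk", ["cp_read_blocks"], names))
```

Because the instance list is read from the filer each time, the same code would work unchanged for filers with different disks.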

-Ben

netapp_salk · ‎2013-11-14

Hi Ben,

Thanks for your answer. That's interesting, but not practical for us. We monitor several filers, each with different disk instances, so we can't build that into collectd for all of them. Second, the filer in question has approx. 100 disk instances... this is not practical 😕

My question is: why does it work on one filer but not on the other? Both are exactly the same model and run the same firmware version. They SHOULD behave the same!

Best regards,

Chris...

zulanch · ‎2013-11-14

It would certainly require a programmatic solution to always first query for all perf disk instances and then construct the perf-object-get-instances-iter-start call using those results. I don't know the particulars of working with collectd, so I'm not sure exactly how you'd accomplish that.

I'm guessing the reason it's working on one filer and not the other is that they have a different number of perf disk instances (non-spare disks), with only one having enough disk instances to run afoul of the memory check that's throwing the error telling you to use the iter-* calls. The memory check is probably in the wrong place as it's being triggered even when you're already using iter-* calls.

So it still does look like a problem with the API to me. I'm not aware of any other workarounds besides the one I described earlier. If you need help programmatically tying together the two API calls to first get the list of instances and then construct the iter-start call, I'd be happy to help.

In the meantime, I'll try to get some more eyes on this problem internally.

-Ben

zulanch · ‎2013-11-14

To follow up, I did find the BURT that describes this issue. It is #698743, so this is a known issue.

There is another workaround in addition to the one I described initially. You can modify the 7-mode option which controls the memory limit at which the error is thrown. Increasing the limit should allow your calls to succeed, but, according to the BURT, might make you hit the conditions described in BURT #472940. The option is:

> options perf.dblade_zapi.buf_limit

-Ben

netapp_salk · ‎2013-11-14

Hi Ben,

Thanks for your posts. collectd uses a compiled (C) plugin that talks to the NetApp API through the NetApp SDK, so it's not possible to use your solution with that.

I looked up bug #698743, but it only says 'Bug information not available'. What is the bug about? And is there a fix in sight?

I'll have a look at the perf parameter.

Best regards,

Chris...

zulanch · ‎2013-11-14

Hi Chris,

The bug has some internal discussion about the exact issue you've run into. And unfortunately, it looks like it's possible for the perf option I mentioned to get reset when AutoSupport runs if it's set to a value higher than 400k, so increasing that limit might not be a good fix if 400k isn't high enough to ensure your calls succeed.

Yes, as of this week, a fix is in the works. I'm not sure what patch/release it will make it into though. BURT #698743 should be updated when a target is identified.

-Ben

netapp_salk · ‎2013-11-15

Hi Ben,

You are right: the filer where the call works has 216 disk instances, and the one where it fails has 270.

The option perf.dblade_zapi.buf_limit is set to 262144 on both filers. Can we increase it to get results from the perf calls without running into other trouble related to #472940?
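As a rough back-of-the-envelope check of these numbers, the sketch below assumes the memory check grows linearly with the number of disk instances. That is purely an assumption (the real check may also depend on the counters requested and other factors), so the result is an estimate to discuss with Support, not a recommended setting.

```python
BUF_LIMIT = 262144   # perf.dblade_zapi.buf_limit on both filers
WORKING = 216        # instances on the filer where the call succeeds
FAILING = 270        # instances on the filer where the call fails


def per_instance_bounds(limit, ok_count, fail_count):
    """Bounds on bytes per instance k, assuming k*ok_count <= limit < k*fail_count."""
    return limit / fail_count, limit / ok_count


def worst_case_limit(limit, ok_count, target_count):
    """Limit covering target_count instances at the upper bound for k (integer ceiling)."""
    return (limit * target_count + ok_count - 1) // ok_count


if __name__ == "__main__":
    low, high = per_instance_bounds(BUF_LIMIT, WORKING, FAILING)
    print("bytes per instance between %.0f and %.0f" % (low, high))
    print("worst-case limit for %d instances: %d"
          % (FAILING, worst_case_limit(BUF_LIMIT, WORKING, FAILING)))
```

Under this linear assumption, the per-instance cost lies between roughly 971 and 1214 bytes, and the worst-case limit for 270 instances comes out at 327680.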

Best regards,

Chris...

zulanch · ‎2013-11-15

Hi Chris,

I'm very hesitant to give advice that might impact your filer performance since I am by no means an expert on the conditions described in #472940. You should probably contact Support and get their recommendation and approval.

As far as I can tell, if you use perf-object-get-instances-iter-* and limit the data by setting a maximum, you should be able to safely increase the limit, provided you're not using other data-hungry calls like disk-list-info, as mentioned in the BURT. But again, I'd recommend getting approval before making changes, as there might be further implications that I'm not aware of.
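For illustration, here is a minimal Python sketch of what the follow-up requests with a maximum might look like. It assumes that perf-object-get-instances-iter-next takes a `tag` (returned by the iter-start call) and a `maximum`, and that perf-object-get-instances-iter-end takes the same `tag`. Please check those element names against the SDK documentation for your ONTAP release. The helper names and the tag value are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.netapp.com/filer/admin"


def _wrap(call_name, fields, version="1.19"):
    """Wrap one API call and its simple child elements in a <netapp> envelope."""
    root = ET.Element("netapp", {"xmlns": NS, "version": version})
    call = ET.SubElement(root, call_name)
    for key, value in fields.items():
        ET.SubElement(call, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def iter_next_request(tag, maximum):
    """Request at most `maximum` records for an iteration started earlier (assumed fields)."""
    return _wrap("perf-object-get-instances-iter-next",
                 {"tag": tag, "maximum": maximum})


def iter_end_request(tag):
    """Close the iteration so the filer can release its resources (assumed field)."""
    return _wrap("perf-object-get-instances-iter-end", {"tag": tag})


if __name__ == "__main__":
    tag = "example-tag"  # hypothetical; use the tag from the iter-start response
    print(iter_next_request(tag, 50))
    print(iter_end_request(tag))
```

A smaller maximum means more round trips but less data per response.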

-Ben