I'm trying to figure out why some filers, but not others, have periods with no graph statistics in Performance Monitor.
I've done some troubleshooting, and what I've determined is that when a filer's statistics are not being displayed in the graphs, the "dfm host diag" command returns this error for the filer:
XML (https port 443) Timeout. Could not read API response.
Whereas working filers (displaying statistics in the graphs) always succeed with:
XML (https port 443) Passed (0 ms)
There are also periods when the troubled filers have working graphs, and during those times the "dfm host diag" command returns Passed (x ms) for the XML check. So the Timeout error does coincide with the periods when the Performance Monitor graphs aren't working.
During the periods of no statistics for a particular filer, the "dfm perf data list" command shows the Newest Record as being quite out of date as well (so the statistics aren't reaching DFM). My current belief is that some filers are not responding to API calls in a timely manner.
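One way to make the "quite out of date" judgment mechanical: parse the Newest Record timestamp from "dfm perf data list" and compare it to the current time. A minimal sketch, assuming you have already parsed the timestamp into a datetime; the 30-minute threshold is an arbitrary example, not a DFM default:

```python
from datetime import datetime, timedelta

def is_stale(newest_record, now=None, max_age_minutes=30):
    """Return True if the newest perf record is older than the threshold.

    `newest_record` is a datetime parsed from `dfm perf data list` output.
    The 30-minute default is illustrative only, not a DFM setting.
    """
    now = now or datetime.now()
    return (now - newest_record) > timedelta(minutes=max_age_minutes)

# Example: a record two hours old is flagged as stale.
# is_stale(datetime(2015, 1, 1, 10, 0), now=datetime(2015, 1, 1, 12, 0))
```

Running this periodically per filer would let you correlate staleness windows with the XML Timeout errors from "dfm host diag".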
So with that explanation/analysis in mind, I have the following questions:
- Is it possible to determine on the filer directly that it has reached some sort of API limit? (I'm not seeing anything in messages.)
- Is there anything on the filer that I can tune to allow more API calls or improve performance?
FYI: DFM option "snapDeltaMonitorEnabled" has been disabled (To reduce API activity caused by DFM).
Do you have TLS on (options tls.enable on) and SSL v2 and v3 off (options ssl.v2.enable off, options ssl.v3.enable off)? Might be something to try, because the negotiation of versions can sometimes fail, and v2 and v3 are cracked/unsafe anyway.
Have you verified your network is clean of errors? Do you have any firewalls that could be interrupting communications?
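For reference, those settings would look like this on a 7-Mode controller (a config sketch; verify the option names against your ONTAP release before applying):

```shell
options tls.enable on      # prefer TLS for admin/API traffic
options ssl.v2.enable off  # SSLv2 is broken; disable it
options ssl.v3.enable off  # SSLv3 is broken; disable it
```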
We experienced very high API latency. We tested it with a simple <system-get-info/> call in zexplore and noticed that at certain times of day the filers were very slow to respond to API requests, taking more than a minute.
We used pktt on the front-end management interface to determine the source of these API calls; a buggy VASA Provider service installed on a Windows server (used for vCenter) was the cause.
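The zexplore probe above can also be scripted directly against the 7-Mode ZAPI servlet to measure round-trip latency. A minimal sketch: the filer URL is a placeholder, authentication handling is elided (real calls need HTTP Basic auth), and the ZAPI version number is illustrative:

```python
import time
import urllib.request

# Hypothetical filer address; the servlet path is the usual 7-Mode ZAPI
# endpoint, but verify it for your release.
ZAPI_URL = "https://filer.example.com/servlet/netapp.servlets.admin.XMLrequest"

def build_zapi_request(api_name):
    """Wrap a single ZAPI call name in a 7-Mode request envelope."""
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<netapp version="1.7" xmlns="http://www.netapp.com/filer/admin">'
        f'<{api_name}/></netapp>'
    )

def time_zapi_call(url, api_name="system-get-info", timeout=60):
    """Return seconds elapsed for one ZAPI round trip.

    NOTE: auth is omitted for brevity; in real use, install an opener
    with urllib.request.HTTPBasicAuthHandler first.
    """
    body = build_zapi_request(api_name).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "text/xml"})
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start
```

Timing this probe every minute from a host outside DFM would show whether the slow responses line up with the graphing gaps.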
Thanks heaps for the responses! These are my comments:
Chris, I've checked those options. They are not set up as you suggested in your reply; however, the settings currently configured don't seem to be causing this issue for other systems.
I don't believe it's due to network errors or firewalls. The periods where the performance metrics are not updating are relatively predictable, and we often don't lose statistics for both controllers in an HA pair at the same time (if it were network related, I would generally expect both controllers in the HA pair to experience similar stat-outage periods: their networks are set up identically in terms of VLANs and they are connected to the same switches).
GidonMarcus, versions are as follows:
- ONTAP: 8.1.4P4 7-Mode
- DFM: 5.2.1RC1
François, that's very interesting information! While my VMware admin tells me we don't run VASA currently, it does sound like a similar situation (slow API response!). The controllers seem to have slow API responses around the times they are very busy. Periods of no statistics occur when:
- Server backups are running (in-guest backups).
- SnapMirrors are replicating (and sometimes for a period after the replication has completed).
It could be a combination of these system-heavy tasks and other management systems overloading the API buffer (I'm not sure if that's a thing, but you know what I mean). Thanks for the idea of zexplore and a packet trace of the mgmt interface. I should be able to do some comparisons and see if there is a particular management system causing API storms.
I really don't think it's a DFM issue, as it looks like the controller itself is simply not responding to DFM. There could be a workaround if I could configure DFM to wait longer for API responses, but I haven't found such an option yet (it may not even exist!).
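For the packet trace, 7-Mode's built-in pktt can capture on the management interface. A command sketch, assuming e0M is the management interface on your system (substitute yours, and check free space at the target path first):

```shell
pktt start e0M -d /etc/crash   # write trace files under /etc/crash
# ... reproduce the slow-API window ...
pktt stop e0M                  # resulting .trc file opens in Wireshark
```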
Interesting info that one node in the HA pair is ok while the other is not, and that it occurs on a predictable schedule which coincides with heavy workload.
In Data ONTAP, API and CLI requests are handled by threads that run at a lower priority than most others. As a consequence, if the system is extremely busy and out of total CPU, those processes may have to queue a long time for CPU (i.e., be starved). This design prioritizes front-end client work when resources are scarce. Maybe the system is just so busy that this situation is occurring? You could try running "priv set diag; sysstat -M 1" to see whether CPU is scarce in that time window. The most interesting columns are AVG (total CPU use) and Host (which includes API work). If you post a snippet here, we can take a look. Or you might want to open a support case to get more troubleshooting steps to isolate the root cause.
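Once the AVG column values are extracted from the "sysstat -M 1" output, the "is CPU scarce?" question can be reduced to a small check. A sketch over hypothetical sample values, not real sysstat output; the 90% threshold and 5-sample run length are assumptions, not ONTAP defaults:

```python
def cpu_scarce(avg_samples, threshold=90, min_run=5):
    """Flag sustained CPU scarcity.

    Returns True if `min_run` consecutive one-second AVG samples
    (total CPU %, as reported by `sysstat -M 1`) meet or exceed
    `threshold`. Both knobs are illustrative, not ONTAP defaults.
    """
    run = 0
    for pct in avg_samples:
        run = run + 1 if pct >= threshold else 0
        if run >= min_run:
            return True
    return False

# Five consecutive samples at or above 90% would count as scarce;
# a single spike would not.
```

A window where this flags True that lines up with the XML Timeout periods would support the starvation theory.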
Yeah, I'm thinking it's a controller-busy situation, in combination with a noisy API workload from other storage management systems. I wanted to verify the slow API via a system external to DFM (and I haven't got zexplore working yet), but I've heard about VSC having issues around this sort of time. So I went into VSC, and indeed the filers that weren't displaying graphs in DFM and were timing out on API calls were also timing out in VSC!
So I stopped the DFM services, and instantly VSC started to get quick responses to API calls and displayed filer information successfully. I tested this multiple times and, yep, stop DFM and VSC interacts rapidly with the filers again. Funnily enough, once I started the DFM service the final time, even the DFM API calls started responding (and the perf graphs displayed correctly in DFM).
Here is a sysstat -M example during a period of no graphing/API response (you might need to copy it into Notepad, sorry about that):
With that last bit of info from the dfm logs it becomes more clear. I think there is a very good chance you are encountering issues related to memory consumption on the storage controller. In short, the controller limits the amount of memory available to API calls to ensure they don't interfere with the stability of the system. If the system is busy (lots of memory in use), then the allocations for some perf APIs (which are big consumers of memory) are the first to fail. I think this is exactly what is happening in your situation. Fortunately, there are fixes available for DFM to use the APIs in a more efficient way, and in Data ONTAP to be more efficient when allocating memory to respond to those API calls.