In our FPolicy server application, written in C, we call ONTAP APIs such as "na_server_open" and "na_server_invoke_elem" repeatedly to get the status of the cluster and its vservers. These APIs hang and leave our thread stuck when all nodes in the cluster reboot. They do not come out of the hang state even after the cluster comes back up.
Is this known behavior? Does anybody know the reason for it? I need a solution or workaround to avoid this hang.
I have already tried the "na_server_set_timeout" API, but it does not resolve our issue.
As you can see, "na_server_set_timeout" requires a connection handle (na_server_t* srv) as its first input parameter, so we need to call "na_server_open" before we can call "na_server_set_timeout".
"na_server_open" itself hangs during the cluster reboot, so "na_server_set_timeout" does not make any difference.
Please note that, per our observation and testing, every ONTAPI call hangs during a cluster reboot.
This is strange, because "na_server_open()" just allocates memory for the na_server_t struct on the local system and has nothing to do with the remote system (ONTAP) rebooting or going down.
You mentioned you are repeatedly calling various ONTAP APIs. While doing so, are you calling "na_server_open()" in each iteration? If yes, are you calling "na_server_close()" at the end of each iteration to free the memory allocated for the na_server context?
Also, can you provide some more details:
(1) What version of NMSDK are you using?
(2) What is your host platform (e.g. RHEL 6.1 64-bit, Windows Server 2008)?
(3) What is the version of ONTAP?
(4) Can you share a snippet of your code which shows how you invoke the NMSDK core APIs and the ONTAP APIs? Also, how are you iterating/repeating the APIs? And how are you creating the thread?
(5) Do you observe this issue irrespective of the number of threads, or do you hit it only when the number of threads exceeds a given number?
No, we are not using the thread-safe NMSDK libraries. What is the consequence of not using the thread-safe libraries?
But as I specified earlier, the number of threads does not matter. The issue reproduces even with a single thread. The stack trace I mentioned earlier is from a test program with a single main thread in which the NMSDK APIs are called repeatedly in a loop.
In the stack trace that you provided, I do not see na_server_open() being called, but I can see server_open_conn(), which is part of the latencytest code (probably your own code).
server_open_conn() calls na_server_invoke(), which internally calls many functions, and it is getting stuck at shttpc_read() in zephyr.
My sense is that it is not hanging in na_server_open() but in na_server_invoke(), which is expected if the timeout is not set with na_server_set_timeout() or is set to a very long value.
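To illustrate why a read with no timeout can block forever when the peer goes silent, here is a minimal sketch using plain POSIX calls. It is not NMSDK code and says nothing about how shttpc_read() is actually implemented; read_with_timeout() and demo_silent_peer_times_out() are hypothetical names, and the socket pair only stands in for a connection to a node that rebooted mid-request.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of NMSDK): read from fd, but give up after
 * timeout_ms milliseconds instead of blocking forever.
 * Returns the number of bytes read (0 means the peer closed the connection),
 * or -1 with errno set (ETIMEDOUT when no data arrived in time).
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Wait for data; if a signal interrupts the wait, start it again. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0) {          /* peer stayed silent for the whole window */
        errno = ETIMEDOUT;
        return -1;
    }
    if (rc < 0)
        return -1;          /* poll itself failed; errno is already set */

    return read(fd, buf, len);  /* data, EOF, or a socket error is ready */
}

/*
 * Demo: a connected socket pair whose peer never writes stands in for a
 * server that went away mid-request.
 * Returns 1 if the read timed out, 0 if it did not, -1 on setup failure.
 */
int demo_silent_peer_times_out(void)
{
    int sv[2];
    char buf[16];
    ssize_t n;
    int timed_out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    n = read_with_timeout(sv[0], buf, sizeof buf, 200);
    timed_out = (n == -1 && errno == ETIMEDOUT);

    close(sv[0]);
    close(sv[1]);
    return timed_out;
}
```

A plain read() on sv[0] in the demo would never return, because nothing is ever written to sv[1]. That matches the symptom in the stack trace, and it is why the timeout value matters for the invoke call.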
It would be great if you could provide us with:
1) The stack trace where na_server_open() is hanging, if you have one.
2) The timeout value that you have set.
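As a possible workaround while this is being investigated, the application could check that the management port is reachable within a bounded time before it issues an ONTAPI call. The following is only a sketch using POSIX sockets. tcp_probe() is a hypothetical helper, not an NMSDK function, and the port to probe (for example 443 if HTTPS transport is used) is an assumption about your setup.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of NMSDK): try to open a TCP connection to
 * a numeric IPv4 address within timeout_ms milliseconds.
 * Returns 1 if the port accepted the connection, 0 if it was refused,
 * unreachable or timed out, and -1 if ip is not a valid IPv4 address
 * or the socket could not be set up.
 */
int tcp_probe(const char *ip, unsigned short port, int timeout_ms)
{
    struct sockaddr_in addr;
    struct pollfd pfd;
    int fd, flags, rc, err = 0;
    socklen_t errlen = sizeof err;
    int result = 0;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Non-blocking mode makes connect() return at once. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    rc = connect(fd, (struct sockaddr *)&addr, sizeof addr);
    if (rc == 0) {
        result = 1;                       /* connected immediately */
    } else if (errno == EINPROGRESS) {
        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        do {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);

        /* Writable means the attempt finished; SO_ERROR says how. */
        if (rc > 0 &&
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) == 0 &&
            err == 0)
            result = 1;
    }

    close(fd);
    return result;
}
```

The status loop could call tcp_probe() first and skip the iteration when it does not return 1. This only narrows the window: a node can still go down between the probe and the ONTAPI call, so a timeout on the invoke itself is still needed.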