ONTAP API for Cluster-Mode NetApp hangs when all nodes of the cluster reboot

ANSHUL_JAISWAL · ‎2014-04-06

Hi,

In our FPolicy server application, written in C, we call ONTAP APIs such as "na_server_open" and "na_server_invoke_elem" repeatedly to get the status of the cluster and its vservers. These calls hang and leave our thread stuck when all nodes of the cluster reboot. They do not come out of the hung state even after the cluster is back up after the reboot.

Is this known behavior? Does anybody know the reason for it? I need a solution or workaround to avoid this hang.

Thanks in anticipation.

sens · ‎2014-04-10

Hi Anshul,

I am guessing that the API invocation is done in a blocking connection (which is the default behaviour).

Can you try the same APIs over a non-blocking connection by setting a non-zero timeout?

Here is the core API definition (from SDK_help.htm) for na_server_set_timeout():

na_server_set_timeout

Prototype

int na_server_set_timeout(na_server_t* srv, int timeout);

Description

Sets a connection timeout for the following actions:

- Establishing a connection with the server.
- Reading and writing data to and from the server socket used for API communication.

Input     Type            Description
srv       na_server_t *   The pointer to the server context
timeout   int             The timeout value

This value is in seconds and should be zero or a positive number.

If it is non-zero, the server uses non-blocking socket connection. If it is zero, the server uses blocking connection, which is the default behavior.

Return value

1 (TRUE) on success
0 (FALSE) on failure
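
To illustrate what a non-zero timeout changes at the socket level, here is a minimal POSIX sketch. It is not the NMSDK implementation (which is not shown in this thread), and the names timed_recv and TIMED_RECV_TIMEOUT are made up for the example. It assumes the same convention as the description above: the value is in seconds, and zero means "block until data arrives".

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical return code for this sketch; not an NMSDK constant. */
#define TIMED_RECV_TIMEOUT (-2)

/*
 * Wait until fd is readable, then read from it.
 * timeout_sec > 0 : give up after that many seconds.
 * timeout_sec == 0: wait indefinitely (the blocking behaviour).
 * Returns the number of bytes read, 0 if the peer closed the
 * connection, -1 on error, or TIMED_RECV_TIMEOUT if nothing arrived.
 */
static ssize_t timed_recv(int fd, void *buf, size_t len, int timeout_sec)
{
    struct pollfd pfd;
    int wait_ms = (timeout_sec > 0) ? timeout_sec * 1000 : -1;
    int ready;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait (the full timeout restarts). */
    do {
        ready = poll(&pfd, 1, wait_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;                  /* poll() itself failed */
    if (ready == 0)
        return TIMED_RECV_TIMEOUT;  /* peer stayed silent for the whole wait */

    return recv(fd, buf, len, 0);   /* data, EOF, or a socket error is pending */
}
```

With a timeout of zero, a peer that goes silent (for example, a node that reboots mid-request) leaves the caller waiting in the read indefinitely. With a positive timeout, the caller gets control back and can decide whether to retry or report an error.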

Please do let us know if it solved the problem.

Regards,

Sen.

ANSHUL_JAISWAL · ‎2014-04-10

Hi Sen,

Thank you for the reply.

I have already tried the "na_server_set_timeout" API, but it does not resolve our issue.

As you can see, "na_server_set_timeout" requires a connection handle (na_server_t* srv) as its first input parameter, so we need to call "na_server_open" before calling "na_server_set_timeout".

"na_server_open" itself hangs during a cluster reboot, so "na_server_set_timeout" does not make any difference.

Please note that, as per our observation and testing, every ONTAPI call hangs during a cluster reboot.

sens · ‎2014-04-10

Hi Anshul,

This is strange because "na_server_open()" just allocates memory for na_server_t struct on local system and has nothing to do with the remote system (ONTAP) rebooting or going down.

You mentioned you are repeatedly calling various ONTAP APIs. While doing so, are you calling "na_server_open()" in each iteration? If yes, are you using "na_server_close()" at the end of each iteration to free the memory allocated by na_server context?

Also, can you provide some more details, like -

(1) What version of NMSDK are you using?

(2) What is your host platform? (e.g. RHEL 6.1 64 bit, Windows Server 2008, etc.)?

(3) What is the version of ONTAP?

(4) Can you share a snippet of your code which shows how you invoke the NMSDK core APIs and the ONTAP APIs? Also, how are you iterating/repeating the APIs? And how are you creating the thread?

(5) Did you observe this issue irrespective of the number of threads? I mean, do you hit the issue when the number of threads exceeds a given number?

Regards,

Sen.

ANSHUL_JAISWAL · ‎2014-04-11

Hi Sen,

I was able to resolve this issue with a workaround, but the fact remains that "na_server_open()" gets stuck if it is called while the cluster is rebooting.

Previously my code logic was:

---------------------------------------------------

while() {

    Open a server connection using "na_server_open()" and get the server context.

    Set a timeout on the server connection using "na_server_set_timeout()".

    Use the server context to read latency, fpolicy status, etc.

    Close the server context using "na_server_close()".

}

---------------------------------------------------

Workaround in my code logic to resolve the issue :

---------------------------------------------------

while() {

    if (global server context is not initialized) {

        Open a server connection using "na_server_open()", get the server context, and store it in some global space for use in the next iteration.

        Set a timeout on the server connection using "na_server_set_timeout()".

    }

    Use the global server context to read latency, fpolicy status, etc.

}

Close the global server context using "na_server_close()" when it is no longer required or the application is exiting.

---------------------------------------------------
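
The workaround logic above can be sketched in C roughly as follows. This is only an illustration of the "open once, reuse, close at exit" pattern: conn_t, fake_open() and fake_close() are hypothetical stand-ins so the sketch compiles without the SDK. In the real application they would be replaced by the na_server_t context and the "na_server_open()", "na_server_set_timeout()" and "na_server_close()" calls.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SDK's server context. */
typedef struct conn {
    int timeout_sec;
} conn_t;

static int g_open_calls = 0;    /* how many times the stand-in open ran */
static conn_t g_storage;        /* backing storage for the stand-in handle */
static conn_t *g_conn = NULL;   /* cached context shared by all iterations */

/* Stand-in for opening a server context (na_server_open() in the real code). */
static conn_t *fake_open(void)
{
    g_open_calls++;
    g_storage.timeout_sec = 0;
    return &g_storage;
}

/* Stand-in for closing a server context (na_server_close() in the real code). */
static void fake_close(conn_t *c)
{
    (void)c;
}

/* Return the cached context, opening and configuring it on first use only. */
static conn_t *get_conn(int timeout_sec)
{
    if (g_conn == NULL) {
        g_conn = fake_open();
        if (g_conn != NULL) {
            /* The real code would call na_server_set_timeout() here. */
            g_conn->timeout_sec = timeout_sec;
        }
    }
    return g_conn;
}

/* Drop the cached context when it is no longer needed or at exit. */
static void release_conn(void)
{
    if (g_conn != NULL) {
        fake_close(g_conn);
        g_conn = NULL;
    }
}
```

Each loop iteration calls get_conn() instead of opening a new context, and release_conn() runs once when the application is done. The sketch does not handle a cached context that has gone stale after a reboot; that would need an explicit release and reopen on error.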

With this workaround, "na_server_open()" is very rarely called during a reboot, but the fact remains that na_server_open() gets stuck when it is called during a cluster reboot.

Now answers to your questions are:

Yes, I am calling "na_server_open()" in each iteration.

Yes, I am using "na_server_close()" at the end of each iteration to free the memory allocated by na_server context.

(1) v.5.1 is the version of NMSDK used.

(2) Host platform: 64-bit Windows Server 2008, 6 GB RAM, Intel(R) Xeon(R) CPU (2 processors)

(3) v.8.2 is the version of ONTAP.

(4) I have a small VS project with code that will help you reproduce and understand the issue. Please let me know how I can share it with you. I hope the logic above gives you some idea.

(5) The number of threads does not matter. The issue reproduces even with a single main thread.

WinDbg thread stack when "na_server_open()" hangs, in case it helps:

--------------------------------------------

0 Id: 3f34.2250 Suspend: 1 Teb: 000007ff`fffdd000 Unfrozen

# Child-SP RetAddr Call Site

00 00000000`0021dd48 000007fe`fd7d0555 ntdll!NtWaitForSingleObject+0xa

01 00000000`0021dd50 000007fe`fd7d295e mswsock!_GSHandlerCheck_SEH+0x4269

02 00000000`0021ddd0 000007fe`ff532a7c mswsock!_GSHandlerCheck_SEH+0x776a

03 00000000`0021dec0 00000001`80016e4c WS2_32!recv+0x13c

04 00000000`0021df60 00000001`80002a15 zephyr!shttpc_read+0x8c

05 00000000`0021e3f0 00000001`80002ae7 zephyr!http_free_url+0x1c5

06 00000000`0021e430 00000001`80003af4 zephyr!http_strip_headers+0x37

07 00000000`0021ec90 00000001`80003c51 zephyr!http_open_url_socket+0x6d4

08 00000000`0021f5c0 00000001`80003f66 zephyr!http_post_request_ex+0x91

09 00000000`0021f610 00000001`800082eb zephyr!http_post_request+0x26

0a 00000000`0021f660 00000001`800093a3 zephyr!na_child_add_int+0x87b

0b 00000000`0021f780 00000001`80009407 zephyr!na_server_invoke_elem+0xf3

0c 00000000`0021f7b0 00000001`3fde175d zephyr!na_server_invoke+0x27

0d 00000000`0021f7e0 00000001`3fde1936 latencytest!server_open_conn(struct mx_cmd_s * cmd = 0x00000000`008167f0, char ** errstr = 0x00000000`00000279)+0xbd [c:\anshul\workspace\mywork\latencytest\latencytest\ltntap.c @ 52]

0e 00000000`0021f830 00000001`3fde1528 latencytest!ntap_open_vserv_conn(struct mx_cmd_s * cmd = 0x00000000`00000000)+0x56 [c:\anshul\workspace\mywork\latencytest\latencytest\ltntap.c @ 160]

0f 00000000`0021f860 00000001`3fde1fe2 latencytest!main(int argc = 0n0, char ** argv = 0x00000000`00000001)+0xb8 [c:\anshul\workspace\mywork\latencytest\latencytest\latencytest.c @ 260]

10 00000000`0021f8c0 00000000`7755f56d latencytest!__tmainCRTStartup(void)+0x11a [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\crtexe.c @ 582]

11 00000000`0021f8f0 00000000`77b53281 kernel32!BaseThreadInitThunk+0xd

12 00000000`0021f920 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

--------------------------------------------

sens · ‎2014-04-15

Hi Anshul,

Thanks for all the details. We will need to debug the issue.

Are you using the thread-safe libraries of NMSDK, basically the ones with '_md' suffix in the filename, e.g. libadtd_md.lib?

For debug versions of multi-threaded applications, use the ones with the '_mdd' suffix in the filename, e.g. libadtd_mdd.lib.

Regards,

Sen.

ANSHUL_JAISWAL · ‎2014-04-21

Hi Sens,

No, we are not using the thread-safe NMSDK libraries. What is the consequence of not using them?

But as I said earlier, the number of threads does not matter. The issue reproduces even with a single thread. The stack I posted earlier is from a test program with a single main thread in which the NMSDK APIs are called repeatedly in a loop.

pmbidara · ‎2014-04-24

Hi Anshul,

In the stack trace that you have provided, I do not see na_server_open() being called, but I can see server_open_conn(), which is part of the latencytest code (presumably your own code).

server_open_conn() calls na_server_invoke(), which internally calls many functions, and it gets stuck in shttpc_read() in zephyr.

My feeling is that the hang is not in na_server_open() but in na_server_invoke(). That is expected if no timeout has been set with na_server_set_timeout(), or if the timeout is set to a very long value.

It would be great if you could provide us with:

1) The stack trace where na_server_open() is hanging, if you have it.

2) The timeout value that you have set.