
SnapDrive process failure in Linux

Hi all,

I'm deploying a new filer and am having some trouble with SnapDrive 4.0 for Linux, specifically on CentOS 5.1 x86_64 (fully patched).

snapdrived starts up OK, and I can interact with it as far as setting the root password for the filer. When I try to perform a filer operation, however, things don't go so well. To start:

[root@db2 log]# snapdrive storage list -all

Status call to SDU daemon failed

[root@db2 log]# ps -ef | grep snapdri
root 7587 1 0 Jul24 ? 00:00:00 snapdrived start
root 11283 7587 0 13:40 ? 00:00:00 [snapdrived] <defunct>

Each run of a snapdrive storage command spawns a new defunct process. Commands such as "snapdrive config show" run fine.
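A process shown as <defunct> has exited but has not yet been reaped (waited on) by its parent, which here is snapdrived (PID 7587). As a minimal sketch, assuming a Linux /proc filesystem, the hypothetical function below lists such children by reading the state and parent-PID fields of /proc/<pid>/stat, which makes it easy to confirm that every storage command leaves one more behind:

```sh
# Print the PIDs of zombie (defunct) children of the given parent PID.
# Sketch only: relies on the Linux /proc/<pid>/stat layout.
list_defunct_children() {
  parent_pid=$1
  for stat_file in /proc/[0-9]*/stat; do
    # The process may exit between the glob and the read; skip it if so.
    stat_line=$(cat "$stat_file" 2>/dev/null) || continue
    pid=${stat_line%% *}
    # Drop "pid (comm) " so a command name containing spaces cannot
    # shift the fields; what remains starts with: state ppid ...
    rest=${stat_line##*\) }
    set -- $rest
    if [ "$1" = Z ] && [ "$2" = "$parent_pid" ]; then
      echo "$pid"
    fi
  done
}
```

For example, compare `list_defunct_children "$(pgrep -o -x snapdrived)" | wc -l` before and after running `snapdrive storage list -all`.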

And in sd-trace.log:

13:43:06 07/25/08 [f7f7cb90]?,2,2,Job tag: bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]?,2,2,snapdrive storage list -all
13:43:06 07/25/08 [f7f7cb90]v,2,6,FileSpecOperation::FileSpecOperation: 12
13:43:06 07/25/08 [f7f7cb90]v,2,6,StorageOperation::StorageOperation: 12
13:43:06 07/25/08 [f7f7cb90]i,2,2,Job tag bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred user id from soap context: root
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred uid:0 gid:0 userName:root
13:43:06 07/25/08 [f7f7cb90]F,0,0,Fatal error: Assertion detected in production code: ../sbl/StorageOperation.cpp:182: Test 'osAssistants.size() == 1' failed
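When sd-trace.log grows large, a small awk filter can pull out the fatal entries together with the job tag logged earlier under the same bracketed hex ID. This is a sketch based only on the sample lines above: it assumes the bracketed value identifies the thread handling a job, that the character right after the closing bracket is a severity letter (F for the fatal line), and that the message text starts after the third comma. The meaning of the two numeric fields is not known here, and the function name is hypothetical.

```sh
# Print "<job tag>: <message>" for every fatal (F) line in a SnapDrive trace log.
# Assumed line layout: "HH:MM:SS MM/DD/YY [thread]L,n,n,message"
fatal_jobs() {
  awk '
    match($0, /\[[0-9a-f]+\]/) {
      tid = substr($0, RSTART + 1, RLENGTH - 2)    # hex ID inside the brackets
      level = substr($0, RSTART + RLENGTH, 1)      # letter right after "]"
      msg = $0
      sub(/^[^,]*,[^,]*,[^,]*,/, "", msg)          # keep text after 3rd comma
      if (msg ~ /^Job tag: /) job[tid] = substr(msg, 10)
      if (level == "F") {
        tag = (tid in job) ? job[tid] : "?"
        print tag ": " msg
      }
    }' "$1"
}
```

Run it as `fatal_jobs sd-trace.log` from the directory holding the trace file.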

When I strace the snapdrive process, the trace ends with:

connect(3, {sa_family=AF_INET, sin_port=htons(4094), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
send(3, "POST / HTTP/1.1\r\nHost: localhost"..., 1555, 0) = 1555
recv(3, "HTTP/1.1 200 OK\r\nServer: gSOAP/2"..., 65536, 0) = 1722
shutdown(3, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
close(3) = 0
write(2, "Status call to SDU daemon failed"..., 33) = 33
munmap(0xf7f7d000, 135168) = 0
exit_group(104) = ?

This matches what I see in a packet capture: after a very brief exchange, the snapdrived port sends RSTs (presumably once the child process has gone defunct):

POST / HTTP/1.1
Host: localhoHTTP/1.1 200 OK
Server: gSOAP

Any input appreciated.

Thanks in advance.

Re: SnapDrive process failure in Linux

Hello Frans,

I have seen problems like this in the past, so I'll offer a few suggestions in the hope that we can get this taken care of.

1. Check the length of the LUN names. If they're excessively long, that could be part of your problem.

2. Make sure there are no stale snapdrive daemons and that the snapdrive port (4094) is not already in use:

a.) ps -ae | grep snap

b.) netstat -an | grep 4094

3. Try enabling TCP low latency to keep delayed ACK from kicking in.

Check with:

sysctl -a | grep net.ipv4.tcp_low_latency

By default it should report:

net.ipv4.tcp_low_latency = 0

Enable with:

sysctl -w net.ipv4.tcp_low_latency=1

4. Above all, if steps 1 through 3 don't turn up anything significant, contact support at 888-463-8277 (888-4NETAPP).
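The checks in steps 2 and 3 can also be scripted without grepping command output. A minimal sketch, assuming Linux's /proc interface: the hypothetical listening_on_port scans /proc/net/tcp for a socket on the given local port in LISTEN state (state code 0A), and the hypothetical read_sysctl reads the same value sysctl reports from /proc/sys. Since the client connects to snapdrived on 127.0.0.1:4094, something still listening on 4094 after the daemon has been stopped would point to a stale process holding the port.

```sh
# Succeed if a socket is listening on local port $1 (decimal).
# $2 is the socket table to scan; pass /proc/net/tcp6 for IPv6 sockets.
listening_on_port() {
  port_hex=$(printf '%04X' "$1")    # /proc/net/tcp prints ports in uppercase hex
  awk -v p="$port_hex" '
    NR > 1 {                          # skip the header line
      split($2, addr, ":")            # local_address is HEXIP:HEXPORT
      if (addr[2] == p && $4 == "0A") found = 1
    }
    END { exit !found }' "${2:-/proc/net/tcp}"
}

# Read a sysctl value by its dotted name, e.g. net.ipv4.tcp_low_latency.
# $2 overrides the /proc/sys root (useful for testing).
read_sysctl() {
  key_path=$(printf '%s\n' "$1" | tr . /)
  cat "${2:-/proc/sys}/$key_path"
}
```

Usage: `listening_on_port 4094 && echo "port 4094 is in use"` and `read_sysctl net.ipv4.tcp_low_latency`.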

Let us know if this helps, Frans!

Thanks,

Christopher

Re: SnapDrive process failure in Linux

Hi Chris,

I still get defunct SnapDrive processes after turning tcp_low_latency on. I'll open a ticket with NetApp.

Thanks,

Frans

Re: SnapDrive process failure in Linux

Thanks for the update, Frans.

I look forward to a speedy resolution to your problem!

Christopher

Re: SnapDrive process failure in Linux

Hello Frans!

Did you find a solution for this problem?

I have the same problem on my lab system.

Regards

Helge

Re: SnapDrive process failure in Linux

Hi Helge,

No, unfortunately I have not found a solution. NetApp does not support CentOS, but they advised me to try an older revision of SnapDrive. If, by chance, you are using RHEL and have support with NetApp, could you open a ticket?

Cheers,

Frans

Re: SnapDrive process failure in Linux

I'm getting the same issue here. I also opened a ticket but had no luck getting a better response.

There seems to be clear demand from the NetApp community for CentOS support. Frans dug into this about as far as an end user can. What must we do to get NetApp engineers to look into this? The error message states clearly which code this is puking on (../sbl/StorageOperation.cpp:182).

Help your loyal customers out NetApp, please!

Re: SnapDrive process failure in Linux

To add more information here, most of which was included in my ticket to NetApp support:

OS: CentOS 5.2 (also tried Fedora 7, with the same results)
Filer: FAS 3070
Connection: iSCSI
SnapDrive versions: 4.0, 3.0 and 2.2.1
sanlun version: 3.2.79.2486

SnapDrive 4.0:
All 'snapdrive config *' commands work; nothing else appears to. Most notably:
# snapdrive storage list -all
Status call to SDU daemon failed

SnapDrive 3.0:
Nothing really appears to work here. The error I typically get is:
0001-877 Admin error: HBA assistant not found. Commands involving LUNs should fail.

I have had the most success with SnapDrive 2.2.1.

SnapDrive 2.2.1:

'snapdrive config' works

I have had success with 'snapdrive snap create -fs [path_to_mounted_LUN]'

Running 'snapdrive snap restore' from the snapshot does NOT work; however, I successfully tested creating a FlexClone from the snapshot and mounting it.

As I found out just tonight, SnapDrive 2.2.1 does NOT work with multipathing, which IMHO is a requirement for production use.

Re: SnapDrive process failure in Linux

Use 'snapdrive storage show -all'.

Also check the /etc/hosts file for the host and filer IP addresses/aliases.
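To check the /etc/hosts side of this, a small awk sketch can look a name up in the file directly, ignoring comments and matching aliases case-insensitively. The function name is hypothetical, and it deliberately reads only the file: `getent hosts <name>` shows what the system resolver actually returns, which may also consult DNS.

```sh
# Print the IP address of the first /etc/hosts entry whose canonical
# name or alias matches $1 (case-insensitive). $2 overrides the file path.
hosts_lookup() {
  awk -v name="$1" '
    { sub(/#.*/, "") }               # strip comments; fields are re-split
    NF >= 2 {
      for (i = 2; i <= NF; i++)
        if (tolower($i) == tolower(name)) { print $1; exit }
    }' "${2:-/etc/hosts}"
}
```

For example, `hosts_lookup "$(hostname)"` for the host, and `hosts_lookup <filer-name>` for each filer; empty output means no entry.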

~Nikhil

Re: SnapDrive process failure in Linux

Neither 'snapdrive storage show -all' nor 'snapdrive storage list -all' works. They seem to be similar commands anyway.

Host entries exist for the filers and resolve correctly; otherwise, simply logging in to the filer would fail (you should not be able to run 'snapdrive config set root [filer]' without them).

The point of my post was to show that there is a real need for SnapDrive to work on CentOS, and that others are trying to make it work with very little success.