SnapDrive process failure in Linux

tribadmin

Hi all,

I'm deploying a new filer and am having some troubles with SnapDrive 4.0 for Linux - specifically CentOS 5.1 x86_64 (fully patched).

snapdrived starts up ok and I can interact with it to the extent of setting the root password for the filer. When I try to perform a filer operation, however, things don't go so well. To start,

[root@db2 log]# snapdrive storage list -all

Status call to SDU daemon failed

[root@db2 log]# ps -ef | grep snapdri
root 7587 1 0 Jul24 ? 00:00:00 snapdrived start
root 11283 7587 0 13:40 ? 00:00:00 [snapdrived] <defunct>

Each repeated run of a snapdrive storage command spawns a new defunct process, while commands such as "snapdrive config show" run fine.
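For anyone reproducing this, a minimal sketch for counting the defunct children of the daemon from `ps -ef` output. The function name `count_defunct_children` is made up for illustration, and it assumes the usual `ps -ef` column order (UID, PID, PPID, ...) shown in the listing above.

```shell
#!/bin/sh
# Count zombie (<defunct>) children of a given parent PID.
# Reads `ps -ef`-style lines on stdin: UID PID PPID C STIME TTY TIME CMD...
# (hypothetical helper, not part of SnapDrive)
count_defunct_children() {
    parent_pid="$1"
    awk -v ppid="$parent_pid" '
        # Column 3 is the parent PID; zombies end with the "<defunct>" marker.
        $3 == ppid && $NF == "<defunct>" { n++ }
        END { print n + 0 }
    '
}

# Usage on a live host (7587 is the snapdrived PID from the listing above):
#   ps -ef | count_defunct_children 7587
```

Running it before and after each "snapdrive storage" command should show the count going up by one per failed call if the behaviour matches what is described here.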

And in sd-trace.log:

13:43:06 07/25/08 [f7f7cb90]?,2,2,Job tag: bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]?,2,2,snapdrive storage list -all
13:43:06 07/25/08 [f7f7cb90]v,2,6,FileSpecOperation::FileSpecOperation: 12
13:43:06 07/25/08 [f7f7cb90]v,2,6,StorageOperation::StorageOperation: 12
13:43:06 07/25/08 [f7f7cb90]i,2,2,Job tag bEogRP90xw
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred user id from soap context: root
13:43:06 07/25/08 [f7f7cb90]i,2,6,Operation::setUserCred uid:0 gid:0 userName:root
13:43:06 07/25/08 [f7f7cb90]F,0,0,Fatal error: Assertion detected in production code: ../sbl/StorageOperation.cpp:182: Test 'osAssistants.size() == 1' failed
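A small sketch for pulling just the fatal entries out of a trace like the one above. It assumes the letter right after the bracketed thread ID is a severity code and that "F" marks fatal entries, which is inferred from this excerpt rather than from any SnapDrive documentation. The helper names are hypothetical.

```shell
#!/bin/sh
# Print only fatal-severity entries from sd-trace.log-style input on stdin.
# Assumption: field 3 looks like "[threadid]X,..." where X is the severity
# and "F" means fatal (inferred from the excerpt above).
fatal_trace_lines() {
    awk '$3 ~ /^\[[0-9a-f]+\]F,/'
}

# Print the source file and line of each failed assertion.
assertion_sites() {
    fatal_trace_lines |
        sed -n 's/.*Assertion detected in production code: \([^ ]*\): Test.*/\1/p'
}

# Usage (the path to sd-trace.log depends on the installation):
#   assertion_sites < sd-trace.log | sort | uniq -c
```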

When I strace the snapdrive process I see things conclude with:

connect(3, {sa_family=AF_INET, sin_port=htons(4094), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
send(3, "POST / HTTP/1.1\r\nHost: localhost"..., 1555, 0) = 1555
recv(3, "HTTP/1.1 200 OK\r\nServer: gSOAP/2"..., 65536, 0) = 1722
shutdown(3, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected)
close(3) = 0
write(2, "Status call to SDU daemon failed"..., 33) = 33
munmap(0xf7f7d000, 135168) = 0
exit_group(104) = ?

This matches what I see in the packet capture, where the snapdrived port sends RSTs (no doubt after the child process has gone defunct) after a very limited exchange:

POST / HTTP/1.1
Host: localhoHTTP/1.1 200 OK
Server: gSOAP

Any input appreciated.

Thanks in advance.

kusek

Hello Frans,

I have seen problems like this in the past, so I'm going to offer a few suggestions in the hope we can get this taken care of.

1. Check the current length of the lun names - If they're excessively long, this could be part of your problem.

2. Make sure you have no stale snapdrive daemons or that the snapdrive ports are not in use

a.) ps -ae | grep snap

b.) netstat -an | grep 4094

3. Try enabling low latency to keep delayed ACK from kicking in.

Check with:

sysctl -a| grep net.ipv4.tcp_low_latency

By default it should report:

net.ipv4.tcp_low_latency = 0

Enable with:

sysctl -w net.ipv4.tcp_low_latency=1

4. And above all, if steps 1 through 3 do not turn up anything significant, contact support - 888-463-8277 (888-4NETAPP)
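Steps 2 and 3 can be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and the parsing assumes the Linux `netstat -an` column layout (local address in column 4, state in the last column) and the `key = value` format printed by `sysctl`.

```shell
#!/bin/sh
# Exit 0 if `netstat -an`-style input on stdin shows a LISTEN socket
# on the given port, non-zero otherwise. (hypothetical helper)
port_is_listening() {
    port="$1"
    awk -v port="$port" '
        # Column 4 is the local address, e.g. 0.0.0.0:4094 or :::4094.
        $NF == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Print the value of net.ipv4.tcp_low_latency from `sysctl` output on stdin.
low_latency_value() {
    awk -F' *= *' '$1 == "net.ipv4.tcp_low_latency" { print $2 }'
}

# Usage on a live host:
#   netstat -an | port_is_listening 4094 && echo "port 4094 has a listener"
#   sysctl net.ipv4.tcp_low_latency | low_latency_value
```

If port 4094 still has a listener after snapdrived has been stopped, a stale daemon is probably holding it.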

Let us know if this helps Frans!

Thanks,

Christopher

tribadmin

Hi Chris,

I still get defunct SnapDrive processes after turning tcp_low_latency on. I'll open a ticket with NetApp.

Thanks,

Frans

kusek

Thanks for the update Frans,

I look forward to a speedy resolution to your problem!

Christopher

helgehegg

Hello Frans!

Did you find a solution for this problem?

I have the same problem on my lab system....

Regards

Helge

tribadmin

Hi Helge,

No, unfortunately I have not found a solution. NetApp does not support CentOS but advised that I try an older rev of SnapDrive. If, by chance, you are using RHEL and have support with NetApp, could you open a ticket?

Cheers,

Frans

jesseyoung

I'm getting the same issue here. I also opened a ticket but had no luck getting a better response.

It would seem there is a clear demand from the NetApp community for CentOS support. Frans did some good work digging into this as much as an end user can. What must we do to get NetApp Engineers to look into this? The error message states clearly what code this is puking on.

Help your loyal customers out NetApp, please!

jesseyoung

To add more information here, most of which was included in my ticket to NetApp support:

OS: CentOS 5.2 (Also tried with Fedora 7 with same results)
Filer: FAS 3070
Connection: iSCSI
SnapDrive Version: 4.0, 3.0 and 2.2.1
sanlun version: 3.2.79.2486

Snapdrive v4.0:
All 'snapdrive config *' commands work, nothing else appears to work. Mainly:
#snapdrive storage list -all
Status call to SDU daemon failed

Snapdrive v.3.0:
Nothing here really appears to work. The common error I get is:
0001-877 Admin error: HBA assistant not found. Commands involving LUNs should fail.

The most success I have had was with Snapdrive v 2.2.1

Snapdrive v 2.2.1:

'snapdrive config' works

I have had success with 'snapdrive snap create -fs [path_to_mounted_LUN]'

Doing a 'snapdrive snap restore' from the snap does NOT work, however I successfully tested making a FlexClone from the Snap and mounting it.

Snapdrive v2.2.1 does NOT work with multipathing, as I found out just tonight, and multipathing is a requirement for production use, IMHO.

nikhilm

Use "snapdrive storage show -all"

Also check /etc/hosts file for host and filer ip/alias

~Nikhil

jesseyoung

Neither 'snapdrive storage show -all' nor 'snapdrive storage list -all' works. They seem to be similar commands anywho.

A host entry exists for the filers and works, otherwise simply getting a login to the filer would fail (you should not be able to 'snapdrive config set [filer] root' without this existing).

The point of my addition to this thread was to show there is a need and want for snapdrive to work on CentOS and that others are trying to make it work with very little success.

nikhilm

1) "Status call to SDU daemon failed": I got this error when the filer DNS entry was removed.

Looks good in your case.

2) Admin Error: HBA assistant not found.

Check (a) sanlun lun show => is it working on your host?

(b) sanlun fcp show adapter -v => is it showing HBA information on the host?

"HBA assistant not found" means SnapDrive is not able to recognize the host HBA driver.

jesseyoung

(a) 'sanlun lun show'

This is working correctly. Results are showing filers, LUNs that are mapped to the igroup the initiator belongs to, lun-pathnames are correct as well as device filenames.

filer:    lun-pathname              device filename  adapter  protocol  lun size               lun state
fas-001:  /vol/vol_test4/lun_test4  /dev/sde         host1    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test4/lun_test4  /dev/sdc         host2    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test3/lun_test3  /dev/sdd         host1    iSCSI     400.0g (429523992576)  GOOD
fas-001:  /vol/vol_test3/lun_test3  /dev/sdb         host2    iSCSI     400.0g (429523992576)  GOOD

(b) 'sanlun fcp show adapter -v'

Unable to locate /usr/lib/libHBAAPI.so library
Make sure the package installing the library is installed & loaded

I am using iSCSI and I believe the original poster was as well. Should this still show some sort of result?

I found this KB article: https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb41496

which seems to indicate Linux Host Utils should install that library, but it does not exist on my systems. Is it installed with iSCSI Host Utils also or just FC Host Utils?
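Since each LUN in the 'sanlun lun show' output under (a) appears once per adapter, here is a rough sketch for counting paths per LUN. The helper name `paths_per_lun` is hypothetical, and it assumes data rows start with "<filer>:" followed by a "/vol/..." pathname, as in the output above.

```shell
#!/bin/sh
# Count how many device paths each LUN has in `sanlun lun show`-style
# output on stdin. (hypothetical helper)
# Assumes data rows look like: "<filer>: /vol/<vol>/<lun> <device> <adapter> ..."
paths_per_lun() {
    awk '
        # Skip the header row: its second column is not a /vol/ path.
        $1 ~ /:$/ && $2 ~ /^\/vol\// { count[$1 " " $2]++ }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}

# Usage on a live host:
#   sanlun lun show | paths_per_lun
```

A count above 1 for a LUN means the host sees it over more than one path, which is worth knowing before choosing a multipathing setting.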

nikhilm

(a) and (b) look good for iSCSI, since you are using the iSCSI HU.

1) What are the following parameters set to in your snapdrive.conf file?

"snapdrive config show"

E.g

default-transport="iscsi" # Transport type to use for storage provisioning, when a decision is needed

multipathing-type="none" # Multipathing software to use when more than one multipathing solution is available

fstype="ext3" # File system to use when more than one file system is available

vmtype="lvm" # Volume manager to use when more than one volume manager is available

use-https-to-filer=off # Communication with filer done via HTTPS instead of HTTP
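A minimal sketch for pulling just these five settings out of the config file with the trailing comments stripped. The function name is hypothetical and the file path is left to the caller, since the location of snapdrive.conf depends on the installation. It only matches uncommented lines, so a setting that is absent or commented out simply does not appear in the output.

```shell
#!/bin/sh
# Print selected settings from a snapdrive.conf-style file, with trailing
# "# ..." comments removed. (hypothetical helper)
show_settings() {
    conf_file="$1"
    grep -E '^(default-transport|multipathing-type|fstype|vmtype|use-https-to-filer)=' "$conf_file" |
        sed 's/[[:space:]]*#.*$//'
}

# Usage:
#   show_settings /path/to/snapdrive.conf
```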

jesseyoung

below obtained from active config, using grep to single out lines:

default-transport="iscsi" # Transport type to use for storage provisioning, when a decision is need

multipathing-type="none" # Multipathing software to use when more than one multipathing solution is available

fstype did not exist, so I added it by hand (this did not change anything)

vmtype did not exist either. We're not using LVM so I have not added this to the config

use-https-to-filer=on # Communication with filer done via HTTPS instead of HTTP

I have disabled all HTTP transactions on the filer end so I have this enabled. I can re-enable HTTP mode if you think this will make a difference and re-test.

It should be noted that I am using snapdrive 2.2.1 for these, as that is the version I have had the most success with. Thanks,

Jesse

nikhilm

Also, can you please let us know the snapdrive restore command you are using? Is it a live or dead filespec?

jesseyoung

'snapdrive snap restore -fs [local fs path] -snapname [snap_created]'

I've tried this with the LUN device both mounted and unmounted, with the same results. In my case, the restore command is less of an issue. I have successfully used a SNAP to create a FlexVol without issues. It's slightly annoying to have to get on the filer to do this, but not really a problem, especially given how rarely a SNAP is/will be restored.

nikhilm

Since you are not using a VM type, does that mean you created a raw LUN, made a filesystem on it, and used snapdrive to create the snapshot?

jesseyoung

That is right

In a condensed, pseudo-command fashion:

filer> create lun, map to igroup, etc

server> iscsiadm resync

server> fdisk to create partition on device (following KB article http://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb8190 to ensure proper alignment)

server> mke2fs w/ journal

server> mount LUN device partition

server> snapdrive snap create -fs [LUN device partition] -snapname test

nikhilm

Check the NetApp support matrix for snapdrive with CentOS.

Since this is an Assistant error, it may require a snapdrive code fix for CentOS.

jesseyoung

The support matrix doesn't have CentOS which is why calling up NetApp support doesn't help at all.

CentOS should be a 99.9% match to RHEL so a code fix shouldn't be too hard. Heck, the sd-trace file tells you exactly where it is bombing out. See OP for message from sd-trace.

What would be the correct route to try and get NetApp to have an engineer look at this? It would seem a small fix could make plenty of customers happy.

tribadmin

It would be nice if NetApp at least reviewed the problem without offering official support, but don't hold your breath.

What would be interesting would be to see a strace from a true RHEL system running the same kernel and RPMs. I'm afraid if we're going to solve this one we're going to solve it on our own.
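If someone does get a trace from a RHEL box, something like the sketch below could make the two traces comparable. The helper name and the trace file names are hypothetical. It assumes the traces were written with `strace -f -o <file>`, which prefixes each line with a PID, and it only masks hex addresses and PIDs, so other run-to-run differences (timestamps in arguments, byte counts) will still show up in the diff.

```shell
#!/bin/sh
# Normalise strace output on stdin so traces from two hosts can be diffed:
# mask hex addresses and drop the leading PID column that `strace -f -o`
# adds. (hypothetical helper)
normalize_strace() {
    sed -E -e 's/0x[0-9a-f]+/ADDR/g' -e 's/^[0-9]+ +//'
}

# Usage (file names are placeholders):
#   strace -f -o centos.trace snapdrive storage list -all   # on the CentOS host
#   strace -f -o rhel.trace   snapdrive storage list -all   # on the RHEL host
#   normalize_strace < centos.trace > centos.norm
#   normalize_strace < rhel.trace   > rhel.norm
#   diff rhel.norm centos.norm
```

Since the assertion fires in the daemon rather than in the CLI, attaching to snapdrived itself with `strace -f -p <pid>` on both hosts is likely to be the more telling comparison.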
