Network and Storage Protocols
Hello everyone,
I recently started working in an environment where Linux is used as the client. The home directories are mounted via Kerberized NFSv4.
Everything was fine until we upgraded the NetApp V3140 to ONTAP 8.1.2.
Now NFS is unstable.
When users browse through their home shares, the connection suddenly times out and the client crashes.
I tried to figure out what went wrong, but I am a bit stuck.
Perhaps one of you can help out.
Facts:
mountpoint via autofs:
* -fstype=nfs4,soft,timeo=10,sec=krb5,wsize=32768,rsize=32768 boiler:/vol/staff/& boiler:/vol/student/& boiler:/vol/system/& boiler:/vol/guest/&
export:
/vol/staff | -sec=krb5,rw=147.87.0.0/16 |
Volume security = Unix
nfs.acache.persistence.enabled on (value might be overwritten in takeover)
nfs.always.deny.truncate on (value might be overwritten in takeover)
nfs.assist.queue.limit 40 (value might be overwritten in takeover)
nfs.export.allow_provisional_access on (value might be overwritten in takeover)
nfs.export.auto-update off (value might be overwritten in takeover)
nfs.export.exportfs_comment_on_delete on (value might be overwritten in takeover)
nfs.export.harvest.timeout 1800 (value might be overwritten in takeover)
nfs.export.neg.timeout 3600 (value might be overwritten in takeover)
nfs.export.pos.timeout 36000 (value might be overwritten in takeover)
nfs.export.resolve.timeout 6 (value might be overwritten in takeover)
nfs.hide_snapshot off
nfs.ifc.rcv.high 66340
nfs.ifc.rcv.low 33170
nfs.ifc.xmt.high 16
nfs.ifc.xmt.low 8
nfs.ipv6.enable off
nfs.kerberos.enable on
nfs.locking.check_domain on (value might be overwritten in takeover)
nfs.max_num_aux_groups 32
nfs.mount_rootonly on
nfs.mountd.trace off
nfs.netgroup.strict off
nfs.nfs_rootonly off (value might be overwritten in takeover)
nfs.notify.carryover on
nfs.ntacl_display_permissive_perms off (value might be overwritten in takeover)
nfs.per_client_stats.enable on
nfs.require_valid_mapped_uid off
nfs.response.trace off (value might be overwritten in takeover)
nfs.response.trigger 60 (value might be overwritten in takeover)
nfs.rpcsec.ctx.high 0
nfs.rpcsec.ctx.idle 360
nfs.rpcsec.trace off (value might be overwritten in takeover)
nfs.tcp.enable on
nfs.thin_prov.ejuke off (value might be overwritten in takeover)
nfs.udp.enable on
nfs.udp.xfersize 32768 (value might be overwritten in takeover)
nfs.v2.df_2gb_lim off (value might be overwritten in takeover)
nfs.v2.enable on (value might be overwritten in takeover)
nfs.v3.enable on (value might be overwritten in takeover)
nfs.v4.acl.enable off (value might be overwritten in takeover)
nfs.v4.enable on (value might be overwritten in takeover)
nfs.v4.id.allow_numerics off
nfs.v4.id.domain bfh.ch
nfs.v4.read_delegation off (value might be overwritten in takeover)
nfs.v4.write_delegation off (value might be overwritten in takeover)
nfs.vstorage.enable off (value might be overwritten in takeover)
nfs.webnfs.enable off
nfs.webnfs.rootdir XXX
nfs.webnfs.rootdir.set off
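A listing like the one above is easier to compare against a pre-upgrade copy once it is parsed into name/value pairs. The following is a minimal Python sketch, assuming each line has the form "name value" with an optional takeover note at the end, as in the output above. The function names and the second snapshot in the demo are hypothetical and only serve to exercise the comparison.

```python
# Suffix that the listing above appends to some lines.
TAKEOVER_NOTE = "(value might be overwritten in takeover)"


def parse_options(text):
    """Parse 'name value [note]' lines into a dict of option name -> value."""
    options = {}
    for line in text.splitlines():
        line = line.replace(TAKEOVER_NOTE, "").strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


def diff_options(before, after):
    """Return {name: (old, new)} for every option whose value differs."""
    names = sorted(set(before) | set(after))
    return {
        name: (before.get(name), after.get(name))
        for name in names
        if before.get(name) != after.get(name)
    }


if __name__ == "__main__":
    # Three lines copied from the listing above.
    current = parse_options(
        "nfs.v4.enable on (value might be overwritten in takeover)\n"
        "nfs.v4.id.domain bfh.ch\n"
        "nfs.rpcsec.ctx.idle 360\n"
    )
    # Hypothetical older snapshot, invented purely to show the diff output.
    older = dict(current, **{"nfs.rpcsec.ctx.idle": "300"})
    for name, (old, new) in diff_options(older, current).items():
        print(f"{name}: {old} -> {new}")
```

Saving the option listing to a text file before and after an upgrade and feeding both files through parse_options would show whether any NFS setting changed along with the ONTAP version.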
Issue:
When the error occurs, the syslog on the Ubuntu client reports:
nfs v4 server returned a bad sequence-id error
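To see how often and when the message appears, the client log can be scanned for it. This is a small Python sketch, assuming the message text is exactly as quoted above (matched case-insensitively). The function names are hypothetical, and /var/log/syslog in the usage note is only the usual Ubuntu location, not something confirmed for this setup.

```python
# Message text as quoted above; compared in lower case.
NEEDLE = "bad sequence-id error"


def find_seqid_errors(lines):
    """Return the log lines that contain the bad sequence-id message."""
    return [line.rstrip("\n") for line in lines if NEEDLE in line.lower()]


def scan_file(path):
    """Open a log file and return its matching lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return find_seqid_errors(handle)
```

For example, scan_file("/var/log/syslog") returns the matching lines with their timestamps, which can then be lined up with the times at which users reported hangs.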
A Wireshark capture is attached as a .csv file.
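A CSV export of a capture can be searched for the replies that carry the error. The sketch below is in Python and assumes that the export shows the NFSv4 status by name, NFS4ERR_BAD_SEQID (the status defined in RFC 3530 for a bad sequence id), somewhere in a row. It deliberately checks every cell, so it makes no assumption about the column layout. The function names are hypothetical.

```python
import csv

# NFSv4 status name for a bad sequence id (RFC 3530).
STATUS = "NFS4ERR_BAD_SEQID"


def rows_with_bad_seqid(csv_lines):
    """Yield (row_number, row) for each CSV row where any cell mentions the status."""
    for number, row in enumerate(csv.reader(csv_lines), start=1):
        if any(STATUS in cell for cell in row):
            yield number, row


def scan_capture(path):
    """Read a CSV export from disk and return all matching rows."""
    with open(path, newline="", errors="replace") as handle:
        return list(rows_with_bad_seqid(handle))
```

The rows just before each hit usually show which operation (for example OPEN, LOCK or CLOSE) the server rejected, which is the kind of detail a support case tends to need.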
Any suggestions on what to debug?
Regards
Phil
Hi! This is probably not helpful for a resolution right now, but the information may be worth sharing.
We also started to see this error after we upgraded FAS6*** series heads from DOT 8.0.2P5 7-Mode to DOT 8.1.1P1 (in October 2012). We access the NFS shares from several versions of Linux, so it did not appear to be specific to a particular distribution. We have seen the error on at least openSUSE and CentOS.
The error was somewhat random, but it affected several of our NFSv4 volumes, especially under high load with certain software combinations such as OpenNX. After talking to NetApp support, a downgrade to 8.0.2P5 was not reasonable (major downtime of a few days or more).
We worked with NetApp technical support for quite a long time, including multiple network traces, and NetApp determined that it was a DOT bug. I was initially told that it would be fixed in the upcoming 8.1.3 release.
Based on what you reported, the bug appears to exist in 8.1.2 as well (that is, it has not been fixed as of 8.1.2). Our current workaround is to run the affected NFSv4 volumes on a filer pair running the older DOT 8.0.2P5, which may or may not be reasonable to implement in your situation.
We may need to wait until 8.1.3 is released...
Do you have a bug number?
Yes, it is 614395.
I was told by NetApp support the new release with this bug fix would be out in the next 2-4 months.
Hi Masaru,
Thank you very much. This did not help in finding a solution, but it is good to hear that I'm not the only one hitting this bug.
So let's wait until 8.1.3 is out.
I re-sniffed the connection and it is definitely the same problem. Turning off NFSv4 is not an option for us, so I hope NetApp releases the fix as soon as possible.
Regards
Phil
Hello,
Data ONTAP 8.1.2P3 now includes the fix for bug 614395 (27-MAR-2013).
https://support.netapp.com/NOW/download/software/ontap/8.1.2P3/
Regards,
Didier