NFSv4 NFS4ERR_STALE_STATEID

philipp_pluess
6,436 Views

Hello Everyone

I recently started working in an environment where Linux is used as the client. The home directories are mounted via Kerberized NFSv4.

Everything was fine until we upgraded the NetApp V3140 to Data ONTAP 8.1.2.

Since then, NFS has been unstable.

When users browse through their home shares, the connection suddenly times out and the client crashes.

I tried to figure out what went wrong, but I am a bit stuck.

Perhaps one of you can help.

Facts:

mountpoint via autofs:

*       -fstype=nfs4,soft,timeo=10,sec=krb5,wsize=32768,rsize=32768 boiler:/vol/staff/& boiler:/vol/student/& boiler:/vol/system/& boiler:/vol/guest/&

export:

/vol/staff -sec=krb5,rw=147.87.0.0/16

Volume security: Unix

nfs.acache.persistence.enabled on         (value might be overwritten in takeover)
nfs.always.deny.truncate     on         (value might be overwritten in takeover)
nfs.assist.queue.limit       40         (value might be overwritten in takeover)
nfs.export.allow_provisional_access on         (value might be overwritten in takeover)
nfs.export.auto-update       off        (value might be overwritten in takeover)
nfs.export.exportfs_comment_on_delete on         (value might be overwritten in takeover)
nfs.export.harvest.timeout   1800       (value might be overwritten in takeover)
nfs.export.neg.timeout       3600       (value might be overwritten in takeover)
nfs.export.pos.timeout       36000      (value might be overwritten in takeover)
nfs.export.resolve.timeout   6          (value might be overwritten in takeover)
nfs.hide_snapshot            off
nfs.ifc.rcv.high             66340
nfs.ifc.rcv.low              33170
nfs.ifc.xmt.high             16
nfs.ifc.xmt.low              8
nfs.ipv6.enable              off
nfs.kerberos.enable          on
nfs.locking.check_domain     on         (value might be overwritten in takeover)
nfs.max_num_aux_groups       32
nfs.mount_rootonly           on
nfs.mountd.trace             off
nfs.netgroup.strict          off
nfs.nfs_rootonly             off        (value might be overwritten in takeover)
nfs.notify.carryover         on
nfs.ntacl_display_permissive_perms off        (value might be overwritten in takeover)
nfs.per_client_stats.enable  on
nfs.require_valid_mapped_uid off
nfs.response.trace           off        (value might be overwritten in takeover)
nfs.response.trigger         60         (value might be overwritten in takeover)
nfs.rpcsec.ctx.high          0
nfs.rpcsec.ctx.idle          360
nfs.rpcsec.trace             off        (value might be overwritten in takeover)
nfs.tcp.enable               on
nfs.thin_prov.ejuke          off        (value might be overwritten in takeover)
nfs.udp.enable               on
nfs.udp.xfersize             32768      (value might be overwritten in takeover)
nfs.v2.df_2gb_lim            off        (value might be overwritten in takeover)
nfs.v2.enable                on         (value might be overwritten in takeover)
nfs.v3.enable                on         (value might be overwritten in takeover)
nfs.v4.acl.enable            off        (value might be overwritten in takeover)
nfs.v4.enable                on         (value might be overwritten in takeover)
nfs.v4.id.allow_numerics     off
nfs.v4.id.domain             bfh.ch
nfs.v4.read_delegation       off        (value might be overwritten in takeover)
nfs.v4.write_delegation      off        (value might be overwritten in takeover)
nfs.vstorage.enable          off        (value might be overwritten in takeover)
nfs.webnfs.enable            off
nfs.webnfs.rootdir           XXX
nfs.webnfs.rootdir.set       off
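
For readers comparing their own setup, here is a minimal Python sketch that splits the option string from the autofs map entry above into key/value pairs. The helper name `parse_mount_options` is made up for illustration and is not part of autofs or any NFS tool. The one point it highlights is that `timeo` is expressed in tenths of a second, so `timeo=10` means a 1-second timeout. This is only a reading aid, not a diagnosis of the problem in this thread.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Hypothetical helper for illustration; flags without a value map to True.
    """
    options = {}
    for item in option_string.lstrip("-").split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


# Option string copied from the autofs map entry in the post.
opts = parse_mount_options(
    "-fstype=nfs4,soft,timeo=10,sec=krb5,wsize=32768,rsize=32768"
)

# NFS 'timeo' is given in tenths of a second.
timeout_seconds = int(opts["timeo"]) / 10

print(opts["fstype"], opts["sec"], "soft" in opts, timeout_seconds)
```

With `soft`, the client eventually gives up on a request and returns an error to the application instead of retrying forever, so a short timeout leaves little slack when the server is slow to answer.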

Issue:

When the error occurs, the syslog on the Ubuntu box reports:

nfs v4 server returned a bad sequence-id error
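
To see how often and when the message shows up, something like the following Python sketch could be run over the client's syslog. It assumes classic syslog timestamps ("Mon DD HH:MM:SS") at the start of each line; the sample lines, the host name `client1`, and the function name `count_by_hour` are invented for illustration. The pattern only matches the part of the message quoted above.

```python
import re
from collections import Counter

# Matches the fragment of the kernel message quoted in the post.
PATTERN = re.compile(r"returned a bad sequence-id error", re.IGNORECASE)


def count_by_hour(lines):
    """Count matching lines per hour, keyed by the 'Mon DD HH' prefix.

    Assumes classic syslog timestamps at the start of each line.
    """
    counts = Counter()
    for line in lines:
        if PATTERN.search(line):
            counts[line[:9]] += 1  # e.g. "Mar  4 14"
    return counts


# Hypothetical sample lines, not taken from the original capture.
sample = [
    "Mar  4 14:02:11 client1 kernel: NFS: v4 server returned a bad sequence-id error!",
    "Mar  4 14:02:12 client1 kernel: some unrelated message",
    "Mar  4 14:40:57 client1 kernel: NFS: v4 server returned a bad sequence-id error!",
    "Mar  4 15:03:20 client1 kernel: NFS: v4 server returned a bad sequence-id error!",
]

for hour, n in sorted(count_by_hour(sample).items()):
    print(hour, n)
```

On a real system the lines would come from the log file instead of `sample`, for example by passing an open file object to `count_by_hour`.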

A Wireshark capture is in the attached .csv.

Any suggestions on what to debug?

Regards

Phil

5 REPLIES

masaru_ryumae

Hi! This probably is not helpful for a resolution now, but perhaps the information is good to share.

We also started to see this error when we upgraded FAS6*** series heads from DOT 8.0.2P5 7-Mode to DOT 8.1.1P1 (upgraded in Oct. 2012). We were accessing the NFS shares from multiple Linux distributions, so it did not appear to be specific to any one of them. We have seen the errors on at least openSUSE and CentOS.

The error was somewhat random, but it affected several of our NFSv4 volumes, especially under high load with certain software combinations such as OpenNX. After talking to NetApp support, downgrading to 8.0.2P5 was not reasonable (major downtime of a few days or more).

We worked with NetApp tech support for quite a long time, including multiple network traces, and NetApp determined it was a DOT bug. I was initially told that it would be fixed in the upcoming release, version 8.1.3.


Based on what you reported, the bug appears to exist in 8.1.2 as well (that is, it has not been fixed as of 8.1.2). Our current workaround is to run the affected NFSv4 volumes on a filer pair running the older DOT 8.0.2P5, which may or may not be reasonable to implement in your situation.

We may need to wait until 8.1.3 is released...

aborzenkov

Do you have a bug number?

masaru_ryumae

Yes, it is 614395.

I was told by NetApp support that the new release with this bug fix would be out in the next 2-4 months.

philipp_pluess

Hei Masaru

Thank you very much. This did not help in finding a solution, but it's good to hear that I'm not the only one having this bug.

So let's wait until 8.1.3 is out.

I re-captured the connection and it's definitely the same problem. Stopping NFSv4 is not an option for us, so I hope that NetApp is going to release the fix ASAP.

Regards

Phil

dperrinjaquet

Hello,

Data ONTAP 8.1.2P3 now includes the fix for bug 614395 (27-MAR-2013).

https://support.netapp.com/NOW/download/software/ontap/8.1.2P3/
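
For anyone checking a number of filers, a rough Python sketch of comparing a release string against 8.1.2P3 might look like this. It assumes plain version strings of the form `X.Y.Z` or `X.Y.ZPn` and does not handle other suffixes (RC or debug builds, for example). The function names are hypothetical. It only checks ordering: an older release such as 8.0.2P5 returns False even though, per this thread, it did not show the problem, and whether any other release contains the fix should be confirmed against NetApp's own release notes.

```python
import re


def parse_ontap_version(text):
    """Parse 'X.Y.Z' or 'X.Y.ZPn' into a comparable tuple of ints.

    Hypothetical helper; other version suffixes are rejected.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:P(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, maint, patch = match.groups()
    return (int(major), int(minor), int(maint), int(patch or 0))


# Release named in this thread as containing the fix for bug 614395.
FIXED_IN = parse_ontap_version("8.1.2P3")


def at_or_after_fix(version):
    """Return True if the version sorts at or after 8.1.2P3."""
    return parse_ontap_version(version) >= FIXED_IN


for v in ("8.1.1P1", "8.1.2", "8.1.2P3"):
    print(v, at_or_after_fix(v))
```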

Regards,

Didier
