NFSv4 NFS4ERR_STALE_STATEID

philipp_pluess · ‎2013-02-19

Hello Everyone

I recently started working in an environment where Linux is used as the client. The home directories are mounted via kerberized NFSv4.

Everything was fine until we upgraded the NetApp V3140 to ONTAP 8.1.2.

Since then, NFS has been unstable.

When users browse through their home shares, the connection suddenly times out and the client crashes.

I tried to figure out what went wrong, but I am a bit stuck.

Perhaps one of you can help out.

Facts:

mountpoint via autofs:

* -fstype=nfs4,soft,timeo=10,sec=krb5,wsize=32768,rsize=32768 boiler:/vol/staff/& boiler:/vol/student/& boiler:/vol/system/& boiler:/vol/guest/&
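For readers less familiar with these options, here is a minimal sketch that splits the option field of the map entry above into key/value pairs. The helper name `parse_mount_options` is made up for illustration. The only outside fact it relies on is that nfs(5) expresses `timeo` in tenths of a second, so `timeo=10` means the client waits about one second before retrying, and with `soft` it eventually returns an error to the application instead of retrying forever.

```python
def parse_mount_options(option_field):
    """Split an autofs option field such as '-fstype=nfs4,soft,...' into a dict.

    Flags without a value (for example 'soft') map to True.
    """
    options = {}
    for item in option_field.lstrip("-").split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


# Option field copied from the map entry in this post.
opts = parse_mount_options(
    "-fstype=nfs4,soft,timeo=10,sec=krb5,wsize=32768,rsize=32768"
)

# nfs(5): timeo is given in tenths of a second.
timeo_seconds = int(opts["timeo"]) / 10
print(opts["fstype"], opts["sec"], "soft" in opts, timeo_seconds)
```

This only shows what the client was asked to do; it does not by itself explain the errors described further down.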

export:

/vol/staff

-sec=krb5,rw=147.87.0.0/16

Volume security= Unix

nfs.acache.persistence.enabled on (value might be overwritten in takeover)

nfs.always.deny.truncate on (value might be overwritten in takeover)

nfs.assist.queue.limit 40 (value might be overwritten in takeover)

nfs.export.allow_provisional_access on (value might be overwritten in takeover)

nfs.export.auto-update off (value might be overwritten in takeover)

nfs.export.exportfs_comment_on_delete on (value might be overwritten in takeover)

nfs.export.harvest.timeout 1800 (value might be overwritten in takeover)

nfs.export.neg.timeout 3600 (value might be overwritten in takeover)

nfs.export.pos.timeout 36000 (value might be overwritten in takeover)

nfs.export.resolve.timeout 6 (value might be overwritten in takeover)

nfs.hide_snapshot off

nfs.ifc.rcv.high 66340

nfs.ifc.rcv.low 33170

nfs.ifc.xmt.high 16

nfs.ifc.xmt.low 8

nfs.ipv6.enable off

nfs.kerberos.enable on

nfs.locking.check_domain on (value might be overwritten in takeover)

nfs.max_num_aux_groups 32

nfs.mount_rootonly on

nfs.mountd.trace off

nfs.netgroup.strict off

nfs.nfs_rootonly off (value might be overwritten in takeover)

nfs.notify.carryover on

nfs.ntacl_display_permissive_perms off (value might be overwritten in takeover)

nfs.per_client_stats.enable on

nfs.require_valid_mapped_uid off

nfs.response.trace off (value might be overwritten in takeover)

nfs.response.trigger 60 (value might be overwritten in takeover)

nfs.rpcsec.ctx.high 0

nfs.rpcsec.ctx.idle 360

nfs.rpcsec.trace off (value might be overwritten in takeover)

nfs.tcp.enable on

nfs.thin_prov.ejuke off (value might be overwritten in takeover)

nfs.udp.enable on

nfs.udp.xfersize 32768 (value might be overwritten in takeover)

nfs.v2.df_2gb_lim off (value might be overwritten in takeover)

nfs.v2.enable on (value might be overwritten in takeover)

nfs.v3.enable on (value might be overwritten in takeover)

nfs.v4.acl.enable off (value might be overwritten in takeover)

nfs.v4.enable on (value might be overwritten in takeover)

nfs.v4.id.allow_numerics off

nfs.v4.id.domain bfh.ch

nfs.v4.read_delegation off (value might be overwritten in takeover)

nfs.v4.write_delegation off (value might be overwritten in takeover)

nfs.vstorage.enable off (value might be overwritten in takeover)

nfs.webnfs.enable off

nfs.webnfs.rootdir XXX

nfs.webnfs.rootdir.set off

Issue:

When the error occurs, the syslog on the Ubuntu box reports:

nfs v4 server returned a bad sequence-id error
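To see how often the message above shows up and whether it clusters around particular times, something like the following could be run against the client's syslog. It is a rough sketch, not a diagnostic tool: it assumes classic syslog lines that begin with a `Mon DD HH:MM:SS` timestamp, and the function name `count_bad_seqid` is hypothetical.

```python
import re
from collections import Counter

# Match the kernel message case-insensitively; the exact prefix can differ
# between kernel versions, so only the stable part is matched.
PATTERN = re.compile(r"bad sequence-id error", re.IGNORECASE)


def count_bad_seqid(lines):
    """Count matching lines per hour.

    Assumes each line starts with a classic syslog timestamp
    ('Mon DD HH:MM:SS'); the first 9 characters give 'Mon DD HH'.
    """
    per_hour = Counter()
    for line in lines:
        if PATTERN.search(line):
            per_hour[line[:9]] += 1
    return per_hour
```

On Ubuntu the lines would typically come from `/var/log/syslog`, for example `count_bad_seqid(open("/var/log/syslog", errors="replace"))`. Comparing the resulting hours with the timestamps in a network trace may help narrow down which requests trigger the error.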

A Wireshark capture is attached as a .csv file.

Any suggestions on what to debug?

Regards

Phil

masaru_ryumae · ‎2013-02-20

Hi! This probably is not helpful for a resolution now, but perhaps the information is good to share.

We also started seeing this error after upgrading FAS6*** series heads from DOT 8.0.2P5 7-Mode to DOT 8.1.1P1 (upgraded in Oct. 2012). We access the NFS shares from several Linux distributions, so it did not appear to be specific to any one of them. We have seen the errors on at least openSUSE and CentOS.

The error was somewhat random, but it affected several of our NFSv4 volumes, especially under high load with certain software combinations such as OpenNX. After talking to NetApp support, a downgrade to 8.0.2P5 was not reasonable (major downtime of a few days or more).

We worked with NetApp tech support for quite a long time, including multiple network traces, and NetApp determined it was a DOT bug. I was initially told that it would be fixed in the upcoming release, version 8.1.3.

Based on what you reported, the bug appears to exist in 8.1.2 as well (that is, it has not been fixed as of 8.1.2). Our current workaround has been to run the affected NFSv4 volumes on a filer pair running the older DOT 8.0.2P5, which may or may not be reasonable to implement in your situation.

We may need to wait until 8.1.3 is released...

aborzenkov · ‎2013-02-20

Do you have a bug number?

masaru_ryumae · ‎2013-02-21

Yes, it is 614395.

I was told by NetApp support that the new release with this bug fix would be out in the next 2-4 months.

philipp_pluess · ‎2013-02-21

Hei Masaru

Thank you very much. This did not help in finding a solution, but it is good to hear that I'm not the only one hitting this bug.

So let's wait until 8.1.3 is out.

I re-sniffed the connection and it is definitely the same problem. Disabling NFSv4 is not an option for us, so I hope that NetApp releases the fix ASAP.

Regards

Phil

dperrinjaquet · ‎2013-04-17

Hello,

Data ONTAP 8.1.2P3 now includes the fix for bug 614395 (27-MAR-2013).

https://support.netapp.com/NOW/download/software/ontap/8.1.2P3/
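For anyone checking a number of filers, here is a small sketch that compares a Data ONTAP version string against 8.1.2P3. It only orders version strings of the form `X.Y.Z` or `X.Y.ZPn` as they appear in this thread. The function names are hypothetical, and treating "at or above 8.1.2P3" as "contains the fix" is an assumption; the release notes for the specific version remain the authority. Note also that, per the earlier posts, 8.0.2P5 was not affected even though it sorts lower.

```python
import re


def parse_ontap_version(text):
    """Turn '8.1.2P3' into (8, 1, 2, 3); a missing P-level counts as 0."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:P(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, p_level = match.groups()
    return (int(major), int(minor), int(patch), int(p_level or 0))


def is_at_least(version, reference="8.1.2P3"):
    """Return True if 'version' sorts at or above 'reference'."""
    return parse_ontap_version(version) >= parse_ontap_version(reference)


for candidate in ("8.1.1P1", "8.1.2", "8.1.2P3"):
    print(candidate, is_at_least(candidate))
```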

Regards,

Didier