I have a couple of NFSv4 shares mounted on CentOS 5.* boxes, and for the past couple of weeks I have been seeing the following error:
NFS: v4 server returned a bad sequence id error
I googled around and found a RHEL bug related to this (https://bugzilla.redhat.com/show_bug.cgi?id=628889). I upgraded my kernels and nfs-utils, but I still see the error. The problem is that when the error occurs too often, my boxes become unresponsive. Here is some more information about my environment:
NetApp Release 8.1.2 7-Mode: Tue Oct 30 19:56:51 PDT 2012
Does anyone have an idea how I can resolve this issue?
We have Oracle Enterprise Linux 5.9 with NetApp Data ONTAP 8.1.2 7-Mode and we experience exactly the same issue. When it occurs, the NFSv4 client and the NFSv4 server (NetApp) get stuck in an endless loop, which hangs the NFSv4 mounts and eventually the client itself. We opened cases with both NetApp and Oracle. According to NetApp, there is no issue on the NetApp end. According to Oracle, Linux kernel 2.6.18 is too old to be used for NFSv4: its NFSv4 code consists largely of backports, and it is not recommended for NFSv4. We also noticed that the issue mainly affects Unix users' home directories mounted as NFSv4 exports on NFSv4 clients. We do not see any issues on Solaris 10 using NFSv4 exports from NetApp, so the problem seems to be limited to OEL Linux NFSv4 clients with NetApp.
When a client gets stuck in this state, its NFSv4 statistics show thousands of unnecessary calls. On the client, you can see this with nfsstat -c -o nfs -4: Renew and Write calls are made by the thousands every few seconds. This also increases the load on the NetApp in terms of ops/sec and network bandwidth usage. We have seen this issue with the 2.6.18-322.214.171.124.1.el5 kernel (RHEL 5.9).
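To put a number on "thousands every few seconds", one option is to sample the client counters twice and compute per-second rates. The following is a minimal Python sketch, assuming the counters are read from /proc/net/rpc/nfs (the file nfsstat reports from on Linux clients) and that its proc4 line has the form "proc4 <n> <c1> ... <cn>". The function names and the threshold are hypothetical, and counters are reported by position only, since which position corresponds to Renew or Write depends on the kernel version and should be cross-checked against the nfsstat output.

```python
def parse_proc4(text):
    """Return the NFSv4 client counters found in /proc/net/rpc/nfs content.

    Assumed line format: proc4 <n> <c1> <c2> ... <cn>.
    Counters are returned by position; the file carries no operation names.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "proc4":
            return [int(value) for value in fields[2:]]
    raise ValueError("no proc4 line found")


def call_rates(before, after, seconds):
    """Calls per second for each counter position between two samples."""
    return [(b - a) / float(seconds) for a, b in zip(before, after)]


def storming(rates, threshold):
    """Positions of the counters whose rate exceeds threshold calls/second."""
    return [i for i, rate in enumerate(rates) if rate > threshold]
```

To use it, read /proc/net/rpc/nfs, wait a few seconds, read it again, and pass both parsed samples to call_rates. Any position returned by storming for a threshold well above your normal load is a candidate for the Renew/Write storm described above.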
So far we have not found a solution, so we converted all NFSv4 clients to NFSv3. We run a lot of shell scripts from NFS exports, and for some reason, with NFSv4 on these clients, they get stuck in some sort of locking issue and never recover. On Solaris 10, there are no NFSv4 issues with the same export from NetApp.
The issue occurs when the NFSv4 client sends an old state-id to the NetApp, and the NetApp responds that it is not the state-id it expected. The NFSv4 client does not query the NetApp for the most recent state-id; instead it keeps sending the old one, and the NetApp keeps rejecting it. At this point the two are in an endless loop. If you capture traffic with tcpdump on the NFSv4 client, you will see this exchange repeating continuously. The client then hangs or becomes extremely slow. If you are running scripts from NFSv4 exports, they never complete and the corresponding processes pile up on the client.
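The loop described above can be illustrated with a toy model. This is only a sketch of the reasoning, not of the real protocol: the class, the function, the "OK"/"BAD" replies, and the numeric state-ids are all invented for illustration, and real NFSv4 state recovery is considerably more involved than copying a value from the server.

```python
class ToyServer:
    """Stand-in for the filer: accepts only the current state-id."""

    def __init__(self, current_state_id):
        self.current_state_id = current_state_id

    def handle(self, state_id):
        return "OK" if state_id == self.current_state_id else "BAD"


def run_client(server, state_id, resync, max_attempts=10):
    """Retry a request until it succeeds or max_attempts is reached.

    resync=False models the stuck client: it resends the same stale value.
    resync=True models a recovering client: after a rejection it picks up
    the server's current value before retrying (a deliberate simplification).
    Returns (attempts made, whether the request finally succeeded).
    """
    for attempt in range(1, max_attempts + 1):
        if server.handle(state_id) == "OK":
            return attempt, True
        if resync:
            state_id = server.current_state_id
    return max_attempts, False
```

With a stale starting value, the non-resyncing client uses up every attempt without success, while the resyncing client succeeds on its second attempt. In the real case there is no attempt limit, which is why the calls never stop.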
NetApp's response was:
"I can't pin it down to any specific ONTAP issue. The client sends us a state-id that is for a previous locking state and doesn't ever seem to try and correct this. Without seeing the trigger, it hard to say how this state is caused (we would need a capture as the problem occurs, showing the transitioning state). It's possible that if the issue was caused by an OPEN_DOWNGRADE call, that an upgrade would provide some relief from ONTAP perspective, but the client should be able to recover from this error. So far, this appears to be a client side issue which the client should be able to recover from."
Oracle's response was:
"At this point, we may not have a solution that will work properly on this very aging kernel (which was created before NFSv4 was even a draft spec; *everything* about its NFSv4 code is backported piecemeal).
As a rule of thumb, we highly recommend that anyone using NFSv4 in production environments make use of the Unbreakable Enterprise Kernel (UEK).
If at all possible, please try UEKr2 on at least one of these boxes and see whether the issue can be reproduced. The current version is 2.6.39-400.109.6.el5uek. The NFSv4 code in that kernel is considerably newer than in RHCK, with much more active NFSv4 production workloads. This is likely to be the easiest (and most stable) resolution available for this issue.
We will continue to look into this in RHCK, but if a UEK upgrade is possible, that will likely solve this issue as well as many more well-known bugs."