We've bought new hardware and are in the middle of migrating to CDOT.
One volume that I've migrated contains NFS homedirs that get automounted from most of our linux hosts.
The day that I transitioned that volume from the 7-mode node and into the new CDOT environment, users started noticing intermittent 60 second delays with the automounting of their homedir.
I've been troubleshooting this for the past few days and here are notes on it.
CDOT version we're running is 8.2P5
clients experiencing the issue are CentOS 6.5 with autofs-5.0.5-88.el6.x86_64
automount options used are -fstype=nfs,rw,nosuid,soft though I've tried various other options with no success
noticed most often when user connects via SSH
I can frequently reproduce the problem (though intermittently) by, as root, unmounting my homedir, waiting a minute or so, then executing 'ls ~username'. strace on this command will always show a 61 second pause.
during this 61-second hang and the prompt, if I open a second SSH session and sudo to root...I'm still unable to access that homedir during this 'hang time', BUT I can manually mount the homedir and successfully get a directory listing while the autofs command is in the delayed state.
I can't reproduce the problem using NFSv3 on the client
I can't reproduce the problem on a CentOS 5.5 client to the same path using NFSv3 or NFSv4
So since we still have our 7-mode system running, I created a setup to test the same with it. I cannot make the problem happen with an automount from the same client to an export on the 7-mode node.
Are there any known issues with autofs from RHEL 6 using an export on CDOT 8.2P5?
Some discovered information to add to this question.
the 60-second delays were a result of the default nfs timeo value of 600 (deciseconds) in the automount file
when the delay occurs, the 'retrans' count given from 'nfsstat -rc' is incremented by 1
in testing 50 iterations of doing sleep 660 - ls homedir, there was a pattern of the mount failing EVERY OTHER attempt consistent throughout.
According to what I'm finding on retransmissions being incremented, it's an indication that the number of available NFS kernel threads on the server is insufficient to handle the requests from the client.
How does this apply to CDOT 8.2 assuming it does actually apply in any way? Any ideas what might be triggering these retransmits?
So again providing more information to my original post in hope that someone finds interest in the issue...
I performed the same test as I noted before with 90 iterations doing 'sleep 60; ls ~homedir' and this time discovered the mount failing roughly every 20 minutes.
Something interesting discovered in this is that in capturing a packet trace, the issue occurs when the RHEL client sends a NFS call to the filer...the filer acknowledges receiving the request, but it never actually replies to the request. The timeout is then triggered and the retry succeeds.
Here are frames of question/concern from the trace. Note how there is no reply to the call in frame 177: