NFSv4 Intermittent Automount Delay

We've bought new hardware and are in the middle of migrating to CDOT.

One volume that I've migrated contains NFS homedirs that get automounted from most of our linux hosts.

The day that I transitioned that volume from the 7-mode node and into the new CDOT environment, users started noticing intermittent 60 second delays with the automounting of their homedir.

I've been troubleshooting this for the past few days and here are notes on it.

  • CDOT version we're running is 8.2P5
  • clients experiencing the issue are CentOS 6.5 with autofs-5.0.5-88.el6.x86_64
  • automount options used are -fstype=nfs,rw,nosuid,soft though I've tried various other options with no success
  • noticed most often when user connects via SSH
  • I can frequently reproduce the problem (though intermittently) by, as root, unmounting my homedir, waiting a minute or so, then executing 'ls ~username'.  strace on this command will always show a 61 second pause.
  • during this 61-second hang and the prompt, if I open a second SSH session and sudo to root...I'm still unable to access that homedir during this 'hang time', BUT I can manually mount the homedir and successfully get a directory listing while the autofs command is in the delayed state.
  • I can't reproduce the problem using NFSv3 on the client
  • I can't reproduce the problem on a CentOS 5.5 client to the same path using NFSv3 or NFSv4

So since we still have our 7-mode system running, I created a setup to test the same with it.  I cannot make the problem happen with an automount from the same client to an export on the 7-mode node.

Are there any known issues with autofs from RHEL 6 using an export on CDOT 8.2P5?

Re: NFSv4 Intermittent Automount Delay

Some discovered information to add to this question.

  • the 60-second delays were a result of the default nfs timeo value of 600 (deciseconds) in the automount file
  • when the delay occurs, the 'retrans' count given from 'nfsstat -rc' is incremented by 1
  • in testing 50 iterations of doing sleep 660 - ls homedir, there was a pattern of the mount failing EVERY OTHER attempt consistent throughout.

According to what I'm finding on retransmissions being incremented, it's an indication that the number of available NFS kernel threads on the server is insufficient to handle the requests from the client.

How does this apply to CDOT 8.2 assuming it does actually apply in any way?  Any ideas what might be triggering these retransmits?

NFSv4 Intermittent Automount Delay

So again providing more information to my original post in hope that someone finds interest in the issue...

I performed the same test as I noted before with 90 iterations doing 'sleep 60; ls ~homedir' and this time discovered the mount failing roughly every 20 minutes.

Something interesting discovered in this is that in capturing a packet trace, the issue occurs when the RHEL client sends a NFS call to the filer...the filer acknowledges receiving the request, but it never actually replies to the request.  The timeout is then triggered and the retry succeeds.

Here are frames of question/concern from the trace.  Note how there is no reply to the call in frame 177:

156 150.026052 -> NFS 218 V4 Call (Reply In 158) PUTROOTFH | GETATTR

157 150.122157 -> TCP 66 nfs > phonebook [ACK] Seq=9645 Ack=10581 Win=1063 Len=0 TSval=336266703 TSecr=855018471

158 151.066229 -> NFS 166 V4 Reply (Call In 156) PUTROOTFH | GETATTR Status: NFS4ERR_DELAY

177 151.171255 -> NFS 258 V4 Call ACCESS FH:0x614a1648, [Check: RD LU MD XT DL]

178 151.273173 -> TCP 66 nfs > phonebook [ACK] Seq=10813 Ack=12197 Win=1063 Len=0 TSval=336267854 TSecr=855019616

179 157.171013 -> TCP 66 phonebook > nfs [FIN, ACK] Seq=12197 Ack=10813 Win=65535 Len=0 TSval=855025616 TSecr=336267854

180 157.171139 -> TCP 66 nfs > phonebook [ACK] Seq=10813 Ack=12198 Win=1063 Len=0 TSval=336273751 TSecr=855025616

181 157.171190 -> TCP 66 nfs > phonebook [FIN, ACK] Seq=10813 Ack=12198 Win=1063 Len=0 TSval=336273751 TSecr=855025616

182 157.171206 -> TCP 66 phonebook > nfs [ACK] Seq=12198 Ack=10814 Win=65535 Len=0 TSval=855025616 TSecr=336273751

183 157.171286 -> TCP 74 [TCP Port numbers reused] phonebook > nfs [SYN] Seq=0 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=855025616 TSecr=336273751 WS=64

184 157.171428 -> TCP 78 nfs > phonebook [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=336273752 TSecr=855025616

185 157.171461 -> TCP 66 phonebook > nfs [ACK] Seq=1 Ack=1 Win=14528 Len=0 TSval=855025616 TSecr=336273752

186 157.171509 -> NFS 258 V4 Call (Reply In 187) ACCESS FH:0x614a1648, [Check: RD LU MD XT DL]

187 157.171982 -> NFS 298 V4 Reply (Call In 186) ACCESS, [Allowed: RD LU MD XT DL]