2014-03-13 09:46 AM
We've bought new hardware and are in the middle of migrating to CDOT.
One volume that I've migrated contains NFS homedirs that get automounted from most of our linux hosts.
The day that I transitioned that volume from the 7-mode node and into the new CDOT environment, users started noticing intermittent 60 second delays with the automounting of their homedir.
I've been troubleshooting this for the past few days and here are notes on it.
So since we still have our 7-mode system running, I created a setup to test the same with it. I cannot make the problem happen with an automount from the same client to an export on the 7-mode node.
Are there any known issues with autofs from RHEL 6 using an export on CDOT 8.2P5?
2014-03-25 12:32 PM
Some discovered information to add to this question.
According to what I'm finding on retransmissions being incremented, it's an indication that the number of available NFS kernel threads on the server is insufficient to handle the requests from the client.
How does this apply to CDOT 8.2 assuming it does actually apply in any way? Any ideas what might be triggering these retransmits?
2014-04-02 10:40 AM
So again providing more information to my original post in hope that someone finds interest in the issue...
I performed the same test as I noted before with 90 iterations doing 'sleep 60; ls ~homedir' and this time discovered the mount failing roughly every 20 minutes.
Something interesting discovered in this is that in capturing a packet trace, the issue occurs when the RHEL client sends a NFS call to the filer...the filer acknowledges receiving the request, but it never actually replies to the request. The timeout is then triggered and the retry succeeds.
Here are frames of question/concern from the trace. Note how there is no reply to the call in frame 177:
156 150.026052 192.168.1.231 -> 192.168.1.160 NFS 218 V4 Call (Reply In 158) PUTROOTFH | GETATTR
157 150.122157 192.168.1.160 -> 192.168.1.231 TCP 66 nfs > phonebook [ACK] Seq=9645 Ack=10581 Win=1063 Len=0 TSval=336266703 TSecr=855018471
158 151.066229 192.168.1.160 -> 192.168.1.231 NFS 166 V4 Reply (Call In 156) PUTROOTFH | GETATTR Status: NFS4ERR_DELAY
177 151.171255 192.168.1.231 -> 192.168.1.160 NFS 258 V4 Call ACCESS FH:0x614a1648, [Check: RD LU MD XT DL]
178 151.273173 192.168.1.160 -> 192.168.1.231 TCP 66 nfs > phonebook [ACK] Seq=10813 Ack=12197 Win=1063 Len=0 TSval=336267854 TSecr=855019616
179 157.171013 192.168.1.231 -> 192.168.1.160 TCP 66 phonebook > nfs [FIN, ACK] Seq=12197 Ack=10813 Win=65535 Len=0 TSval=855025616 TSecr=336267854
180 157.171139 192.168.1.160 -> 192.168.1.231 TCP 66 nfs > phonebook [ACK] Seq=10813 Ack=12198 Win=1063 Len=0 TSval=336273751 TSecr=855025616
181 157.171190 192.168.1.160 -> 192.168.1.231 TCP 66 nfs > phonebook [FIN, ACK] Seq=10813 Ack=12198 Win=1063 Len=0 TSval=336273751 TSecr=855025616
182 157.171206 192.168.1.231 -> 192.168.1.160 TCP 66 phonebook > nfs [ACK] Seq=12198 Ack=10814 Win=65535 Len=0 TSval=855025616 TSecr=336273751
183 157.171286 192.168.1.231 -> 192.168.1.160 TCP 74 [TCP Port numbers reused] phonebook > nfs [SYN] Seq=0 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=855025616 TSecr=336273751 WS=64
184 157.171428 192.168.1.160 -> 192.168.1.231 TCP 78 nfs > phonebook [SYN, ACK] Seq=0 Ack=1 Win=33580 Len=0 MSS=1460 SACK_PERM=1 WS=128 TSval=336273752 TSecr=855025616
185 157.171461 192.168.1.231 -> 192.168.1.160 TCP 66 phonebook > nfs [ACK] Seq=1 Ack=1 Win=14528 Len=0 TSval=855025616 TSecr=336273752
186 157.171509 192.168.1.231 -> 192.168.1.160 NFS 258 V4 Call (Reply In 187) ACCESS FH:0x614a1648, [Check: RD LU MD XT DL]
187 157.171982 192.168.1.160 -> 192.168.1.231 NFS 298 V4 Reply (Call In 186) ACCESS, [Allowed: RD LU MD XT DL]