I recently upgraded our FAS2240-4 system (EOSL) from ONTAP 8.3.2P12 to 9.1P20. All went well until our RHEL 7.7 machines started hitting NFS4 mount issues after a reboot. The NFS4 exports work fine until the server is actually rebooted; after that, the mount points no longer show up in "df -h". This happens only after a reboot of the Linux server. NFS3 over UDP works well, but NFS4 does not.
Please note that the export policy and the export-policy check both show that the client has RW access. I do not see any errors in the NetApp logs. However, a pktt trace shows the NFS server returning an "NFS4ERR_DENIED" error. Below are excerpts of the NetApp packet trace as viewed in Wireshark:
45 2.046195 10.XXX.XXX.156 10.XXX.XXX.36 NFS 394 NFS4_OK,NFS4_OK,NFS4_OK,NFS4_OK,NFS4_OK,NFS4_OK V4 Reply (Call In 44) OPEN StateID: 0x4dcd
46 2.046705 10.XXX.XXX.36 10.XXX.XXX.156 TCP 66 811 → 2049 [ACK] Seq=3053 Ack=3421 Win=24574 Len=0 TSval=1987690989 TSecr=467065747
47 2.046819 10.XXX.XXX.36 10.XXX.XXX.156 NFS 302 V4 Call (Reply In 48) LOCK FH: 0xf9bee644 Offset: 0 Length: <End of File>
48 2.047059 10.XXX.XXX.156 10.XXX.XXX.36 NFS 174 NFS4ERR_DENIED,NFS4_OK,NFS4ERR_DENIED V4 Reply (Call In 47) LOCK Status: NFS4ERR_DENIED
Frame 48: 174 bytes on wire (1392 bits), 174 bytes captured (1392 bits)
Ethernet II, Src: 02:xx:xx:36:xx:2e (02:xx:xx:36:xx:2e), Dst: Cisco_b8:00:fe (00:bf:77:b8:00:fe)
Internet Protocol Version 4, Src: 10.XXX.XXX.156, Dst: 10.XXX.XXX.36
Transmission Control Protocol, Src Port: 2049, Dst Port: 811, Seq: 3421, Ack: 3289, Len: 108
Remote Procedure Call, Type:Reply XID:0xf6be4f45
Network File System, Ops(2): PUTFH LOCK(NFS4ERR_DENIED)
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Status: NFS4ERR_DENIED (10010)
Operations (count: 2)
Opcode: PUTFH (22)
Opcode: LOCK (12)
Status: NFS4ERR_DENIED (10010)
locktype: WRITE_LT (2)
[Main Opcode: LOCK (12)]
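For a longer trace, the denied replies can be pulled out of the packet list without scrolling through it. The following is a minimal sketch, assuming the Wireshark packet list has been exported as plain-text summary lines in the same format as the excerpt above; the function name `denied_replies` and the file name `trace_summary.txt` are hypothetical.

```shell
#!/bin/sh
# Print "reply_frame call_frame" for every NFSv4 reply that carries
# NFS4ERR_DENIED. Reads Wireshark packet-list summary lines on stdin.
denied_replies() {
    awk '/NFS4ERR_DENIED/ && /V4 Reply/ {
        call = "?"
        # The summary contains "(Call In N)"; extract N.
        for (i = 2; i < NF; i++) {
            if ($i == "In" && $(i - 1) == "(Call") {
                call = $(i + 1)
                sub(/\)/, "", call)
            }
        }
        print $1, call
    }'
}
```

Usage would be `denied_replies < trace_summary.txt`; each output line gives the frame number of the denied reply followed by the frame number of the call it answers, so the matching LOCK call can be inspected directly.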
I'm unable to understand why the NetApp NFS server is returning this error. For some perspective on the Red Hat client side: we recently (about a week ago) installed McAfee AV as well. The server also runs IBM MQ application services in HA mode (primary and secondary servers as active-standby).
Please find below details of NFS4 mount error -
Verify that an export rule exists that allows the client access by using the check-access command. Can you share the output?
Cluster::> vserver export-policy check-access -vserver <vserver> -volume <volume> -path <path> -client-ip <clientIP> -auth <auth_type> -proto <proto> -access-type <type>
Please find below output -
Cluster001::> export-policy check-access -vserver Cluster001_SVM -volume vol10 -path /vol/vol10/UAT_MQHA_MQ -client-ip 10.XXX.XXX.36 -auth sys -proto nfs4 -access-type read-write
There are no entries matching your query.
Cluster001::> export-policy check-access -vserver Cluster001_SVM -volume vol10 -client-ip 10.XXX.XXX.36 -auth sys -proto nfs4 -access-type read-write
                                         Policy              Policy       Rule
Path                          Policy     Owner               Owner Type  Index Access
----------------------------- ---------- ------------------- ---------- ------ ----------
/                             default    Cluster001_SVM_root volume          1 read
/vol                          default    Cluster001_SVM_root volume          1 read
/vol/vol10                    default    vol10               volume          1 read-write
3 entries were displayed.
The second output shows that vol10 has proper read-write access.
Not sure why running the command with -path did not work. Looking at the packet trace, "LOCK Status: NFS4ERR_DENIED" means that an attempt to lock a file was denied. Are you able to mount vol10 alone over NFSv4, without the full path?
For example: mount -vvv SVM:/vol/vol10 /UAT_MQHA_MQ
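Because the trace shows a LOCK being denied rather than the mount itself, it may help to test from the client whether a lock can be taken on a file under the mount point. This is only a rough sketch, assuming the util-linux `flock` utility is installed on the client; the function name `try_lock` and the test file path are hypothetical, and the application may request its locks differently from how `flock` does.

```shell
#!/bin/sh
# Try to take an exclusive lock on FILE without waiting.
# Returns 0 if the lock was obtained, non-zero if it is held elsewhere
# or the file cannot be opened.
try_lock() {
    # -n: fail immediately instead of blocking when the lock is taken
    flock -n "$1" true
}
```

For example, `try_lock /UAT_MQHA_MQ/locktest.tmp && echo "lock ok" || echo "lock denied"` against a scratch file (not a live MQ file) gives a quick yes/no answer that can be compared before and after a reboot.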
The mounted path is the same, i.e. /vol/vol10 on /UAT_MQHA_MQ, and it is still mounted with NFS4. The problem starts only when we reboot the server: the mount point then fails to mount. Luckily we keep VM snapshots, so we can revert to a working state. NFS3 mounts perfectly, but we need an NFS4 mount point because it has to support multi-mode for the IBM MQ application. NFS3 doesn't support multi-mode.
What could be the reason for this issue? The NFS mount works perfectly as long as the server is up and running. Is it related to Kerberos?
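Since the symptom is a mount point missing from "df -h" after a reboot, a small client-side check can confirm whether the NFS4 mount actually came up. The sketch below assumes a Linux client, where /proc/mounts lists NFSv4 mounts with filesystem type nfs4; the function name `nfs4_mounted` is hypothetical.

```shell
#!/bin/sh
# Return 0 if MOUNTPOINT appears with filesystem type nfs4 in a
# /proc/mounts-style file (second argument, default /proc/mounts).
nfs4_mounted() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "nfs4" { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}
```

Running `nfs4_mounted /UAT_MQHA_MQ || echo "NFS4 mount missing"` right after boot separates "the mount never happened" from "the mount exists but the application cannot lock its files".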
Here are some troubleshooting steps to help narrow down the problem if it happens again.
Migrate the LIF to another node. Test the connection then migrate it back.
cluster::> net int migrate
Display information about the server connection.
cluster::> network connections active show-clients -remote-address <ip of the server>
Capture a packet trace while having the issue.
cluster::> network tcpdump start -node nodename -address serveripaddress -port portname
cluster::> network tcpdump stop
Check the event logs.
If the system is not busy,
Some of the commands mentioned earlier are not supported by ONTAP 9.1.
Note that you can still capture a packet trace using the pktt command.
As you can see in my first post, I have already provided an excerpt from the packet trace output. The problem starts only when the client server is restarted; an already-mounted NFS filesystem has no issue. I also tried migrating the LIF last time, but it did not help. Luckily we had VM snapshots taken prior to the reboot; we reverted to the snapshot and it started working again.
What could possibly go wrong here?