ONTAP Discussions
I am working with a pre-existing NetApp/VMware environment and I am trying to troubleshoot why Storage vMotions take hours for VMs that are only 30 GB in size. It does not appear to be a network issue, as VMs are able to move between iSCSI volumes within minutes. Only the NFS datastores have this issue. All of the ESXi hosts are on 7.0.3 and we are using a series of AFF-A400 storage devices.
For example, we have a storage VM which has been allocated 5 TB but, due to thin provisioning, is only actually using 28.38 GB of storage space. This VM typically takes 3.5 hours to Storage vMotion between NFS datastores. The same VM can Storage vMotion between two iSCSI volumes in a matter of minutes. I'm running out of things to test and try. I've checked the basics, like the session slot table settings and the TCP Maximum Transfer Size, and the NetApp VAAI plugin is installed on all of the ESXi hosts.
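For scale, here is a rough back-of-the-envelope calculation using only the numbers above. It is a sketch, not a diagnosis: it assumes decimal units and simply compares the average rate implied by the used space (28.38 GB) with the rate implied by the full provisioned size (5 TB) over the same 3.5 hours. The function and variable names are made up for illustration.

```python
# Average transfer rates implied by a 3.5 hour Storage vMotion.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB).

DURATION_S = 3.5 * 3600  # 3.5 hours expressed in seconds


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Return the average rate in MB/s for size_mb moved in the given time."""
    return size_mb / seconds


used_mb = 28.38 * 1000          # space actually consumed by the VM
provisioned_mb = 5 * 1_000_000  # thin-provisioned (allocated) size

used_rate = throughput_mb_s(used_mb, DURATION_S)
provisioned_rate = throughput_mb_s(provisioned_mb, DURATION_S)

print(f"If only used data were copied:      {used_rate:.2f} MB/s")
print(f"If the full 5 TB were walked/copied: {provisioned_rate:.1f} MB/s")
```

The first figure works out to roughly 2 MB/s and the second to roughly 400 MB/s. Which of the two better describes what the hosts are actually doing is something a packet trace or host-side statistics would have to confirm.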
My current hope is that something is misconfigured on the vserver with regard to NFS. I ran a vserver nfs show -instance command and have pasted the results below. I am hoping that something here jumps out to someone and helps explain why Storage vMotions are taking so extraordinarily long.
Command run - "vserver nfs show -instance -vserver vsphere02"
vserver: vsphere02
General NFS Access: True
RPC GSS Context Cache High Water Mark: 0
RPC GSS Context Idle: 0
NFS v3: enabled
NFS v4.0: enabled
UDP Protocol: disabled
TCP Protocol: enabled
Default Windows User: -
Enable NFSv3 EJUKEBOX error: true
Require All NFSv3 Reads to Return Read Attributes: false
Show Change in FSID as NFSv3 Clients Traverse Filesystems: enabled
Enable the Dropping of a Connection When an NFSv3 Request is Dropped: enabled
Vserver NTFS Unix Security Options: use_export_policy
Vserver Change Ownership Mode: use_export_policy
Force Usage of SpinNp Readdir Requests: false
NFS Response Trace Enabled: false
NFS Response Trigger (in secs): 60
UDP Maximum Transfer Size (bytes): 32768
TCP Maximum Transfer Size (bytes): 1048576
NFSv4.0 ACL Support: disabled
NFSv4.0 Read Delegation Support: disabled
NFSv4.0 Write Delegation Support: disabled
Show Change in FSID as NFSv4 Clients Traverse Filesystems: enabled
NFSv4.0 Referral Support: disabled
NFSv4 ID Mapping Domain: defaultv4iddomain.com
NFSv4 Validate UTF-8 Encoding of Symbolic Link Data: disabled
NFSv4 Lease Timeout Value (in secs): 30
NFSv4 Grace Timeout Value (in secs): 45
Preserves and Modifies NFSv4 ACL (and NTFS File Permissions in Unified Security Style): enabled
NFSv4.1 Minor Version Support: enabled
Rquota Enable: disabled
NFSv4.1 Implementation ID Domain: netapp.com
NFSv4.1 Implementation ID Name: NetApp Release 9.7P6
NFSv4.1 Implementation ID Date: Tue July 28 00:06:27 2020
NFSv4.1 Parallel NFS Support: enabled
NFSv4.0 Migration Support: disabled
NFSv4.1 Referral Support: disabled
NFSv4.1 Migration Support: disabled
NFSv4.1 ACL Support: disabled
NFS vstorage Support: enabled
NFSv4 Support for Numeric Owner IDs: enabled
Default Windows Group: -
NFSv4.1 Read Delegation Support: disabled
NFSv4.1 Write Delegation Support: disabled
Number of Slots in the NFSv4.x Session slot table: 128
Size of the Reply that will be Cached in Each NFSv4.x Session Slot (in bytes): 640
Maximum Number of ACEs per ACL: 400
NFS Mount Root Only: enabled
NFS Root Only: disabled
Qtree Exports Enabled: disabled
AUTH_SYS Extended Groups Enabled: disabled
AUTH_SYS and RPCSEC_GSS Auxillary Groups Limit: 32
Validation of Qtree IDs for Qtree File Operations: enabled
NFS mount Daemon Port: 635
Network Lock Manager Port: 4045
Network Status Monitor Port: 4046
NFS Quota Daemon Port: 4049
Permitted Kerberos Encryption Types: des, des3, aes-128, aes-256
Showmount Enabled: enabled
Set the Protocol Used for Name Services Lookups for Exports: udp
Map Unknown UID to Default Windows User: enabled
DNS Domain Search Enabled During Netgroup Lookup: enabled
Trust No-Match Result from Any Name Service Switch Source During Netgroup Lookup: disabled
Display maximum NT ACL Permissions to NFS Client: disabled
NFSv3 MS-Dos Client Support: disabled
Ignore the NT ACL Check for NFS User 'root': disabled
Time to Live Value (in msecs) of a Positive Cached Credential: 86400000
Time to Live Value (in msecs) of a Negative Cached Credential: 7200000
Time to Live Value (in msecs) of a Cached Entry for a Transient Error: 30000
Skip Permission Check for NFS Write Calls from Root/Owner: disabled
Use 64 Bits for NFSv3 FSIDs and File IDs: disabled
Ignore Client Specified Mode Bits and Preserve Inherited NFSv4 ACL When Creating New Files or Directories: disabled
Fallback to Unconverted Filename Search: disabled
I/O Count to be grouped as a session: 5000
Duration for I/O to be grouped as a session (Secs): 120
Enable or disable Checksum for Replay-Cache: enabled
Harvest timeout (in msecs) for a Cached Credential: 86400000
Idle Connection Timeout Value (in seconds): 360
Are Idle NFS Connections Supported: disabled
Hide Snapshot Directory under NFSv3 Mount Point: disabled
Allow Non-root User MOUNT Operations from Hadoop Connector: enabled
Provide Root Path as Showmount State: disabled
Use 64 Bits for NFSv4.x FSIDs and File IDs: enabled
Qtree QoS Enabled: disabled
Modebits and ACL Setting is Restricted: unrestricted
I'd get a packet trace to rule anything out.
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_view_network_connections_on_different_versions_of_ONTAP <-- check whether there are rexmit, ooorecv, or zero windows.
I'd rule out the network first.
Also, are you using copy offload at all? I've seen weird perf issues with that: https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/vStorage_APIs_-_Array_Integration_(VAAI)/How_to_disable_SAN_VAAI_Copy_Offload_o... But confirm with qos statistics volume latency show.
I tried disabling "HardwareAcceleratedMove" and it had very little, if any, impact on the time it took for a Storage vMotion to complete. I'm going to dig through the switch configs and check for packet errors after Christmas.
I've run "qos statistics volume latency show" but I don't have a frame of reference to know what is a good or bad value. QoS Max/Min are all 0, as I don't think any QoS rules are currently active. Volume latency varies from 150us up to 500us, but most values are between 200-250us. Network latency shows values of 40-100us, and for data I see 60-130us. For disk latency I see values of 14-100us.
OK, try a packet trace. "us" = microseconds, so all of those values are less than 1 ms of latency, which is very good.
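To make the unit conversion concrete, here is a small sketch that converts the ranges quoted in the earlier post from microseconds to milliseconds. The dictionary keys are just labels chosen for this example, and the 1 ms figure is only the threshold mentioned in this reply, not an official limit.

```python
# Latency ranges (low, high) in microseconds, as quoted in the post above.
latencies_us = {
    "volume": (150, 500),
    "network": (40, 100),
    "data": (60, 130),
    "disk": (14, 100),
}


def us_to_ms(microseconds: float) -> float:
    """Convert microseconds to milliseconds (1 ms = 1000 us)."""
    return microseconds / 1000


# Convert every (low, high) pair to milliseconds.
latencies_ms = {
    name: (us_to_ms(low), us_to_ms(high))
    for name, (low, high) in latencies_us.items()
}

# Highest value seen across all categories, in milliseconds.
worst_ms = max(high for _, high in latencies_ms.values())

for name, (low, high) in latencies_ms.items():
    print(f"{name:8s} {low:.3f} - {high:.3f} ms")
print(f"worst case: {worst_ms} ms (under 1 ms: {worst_ms < 1})")
```

Even the highest reported value, 500us, is only half a millisecond.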
Did you ever find what caused this?
I'm in the same boat: Storage vMotion of thin-provisioned VMs takes hours, even for small VMs, as long as they have more than 100 GB provisioned. Installing or removing the VAAI plugin didn't have much of an impact.