We had part of a VMware cluster go down this morning when a VMFS volume containing some iSCSI VMs apparently became unreachable for a period of time.
The only thing we see in the NetApp logs is:
Event: ksmf.svc.watchdog: "kSMF service thread held > 25 (sec) by application for table d_vdisk_lun"
Message Name: ksmf.svc.watchdog
Sequence Number: 1420474
Description: This message occurs when an application is taking too long to complete a kSMF operation.
Action: (NONE)
Anyone seen this?
Would this error reflect a lockup on the NetApp side, or would it be a side effect of the iSCSI network connectivity going away for a while? I could not find anything on Google for this error string.
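For anyone wanting to check an exported log for more of these events, here is a minimal Python sketch that pulls the hold time and table name out of the quoted message. The pattern is based only on the single sample line above, so other ONTAP versions may word the event differently; the names `WATCHDOG_RE` and `parse_watchdog` are made up for illustration.

```python
import re

# Pattern derived from the one sample event in this thread (an assumption:
# the wording may differ on other ONTAP releases).
WATCHDOG_RE = re.compile(
    r"kSMF service thread held > (?P<seconds>\d+) \(sec\) "
    r"by application for table (?P<table>\w+)"
)


def parse_watchdog(line):
    """Return (held_seconds, table_name), or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return int(match.group("seconds")), match.group("table")


sample = (
    'Event: ksmf.svc.watchdog: "kSMF service thread held > 25 (sec) '
    'by application for table d_vdisk_lun"'
)
print(parse_watchdog(sample))  # (25, 'd_vdisk_lun')
```

Running this over every line of a sanitised log would show whether the event is a one-off or recurring, and whether it always names the same table.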
Agree with kahuna. Please raise a ticket with NetApp.
In the meantime, let us know:
1) ONTAP version
2) Filer model
Please check the usage of the filer around the time this timeout occurred. Heavy load on the node (depending on the ONTAP version) could be the cause here. If you have OCUM in place, you can check the filer usage for a particular volume over a range of time.
In the worst case, if filer usage is very high, an I/O could take longer to execute and, depending on the client-side SCSI storage stack timeout setting, this might lead to failure/downtime. To be on the safe side in future, you might consider increasing the SCSI timeout settings as suggested in the following KB (applicable to MS software initiators). However, the filer performance issues must still be investigated.
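To make the timeout reasoning concrete, here is a deliberately simplified Python sketch. It assumes the watchdog hold time (more than 25 seconds in the event above) translates directly into a stall for a single I/O, and it ignores retries and multipath failover. The candidate timeout values are hypothetical examples for comparison, not recommended settings, and `io_survives_stall` is an illustrative name.

```python
def io_survives_stall(stall_seconds, disk_timeout_seconds):
    """True if an I/O delayed by stall_seconds completes before the host gives up.

    Simplification (assumption): one I/O, no retries, no multipath failover.
    """
    return stall_seconds < disk_timeout_seconds


# Lower bound taken from the watchdog event ("held > 25 (sec)").
observed_stall = 25

# Hypothetical host disk timeouts to compare against, in seconds.
for timeout in (20, 60):
    outcome = "survives" if io_survives_stall(observed_stall, timeout) else "times out"
    print(f"host timeout {timeout}s: I/O {outcome}")
```

Since the event only says the thread was held for more than 25 seconds, the real stall may have been longer, so a timeout only slightly above 25 seconds would not guarantee anything.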
I'm in the process of sanitising some logs so I can raise a ticket.
It seems we can narrow it down to one node and its only aggregate.
It's on ONTAP 9.5P12 on a FAS8080, with a single aggregate of ~250 1.2T SAS (2.5") drives in 14 RAID groups. (There are 10 nodes in the cluster.)
There's definitely a spike in latency on the aggr & node at the time, and performance capacity on the aggregate hits 101%!
There are over 100 volumes on the aggr, so we may have to move some to another node/aggr.
Client load is a mixture of Windows Server 2012 R2 with an old NetApp DSM, and VMware-presented RDMs - I have my doubts about the timeout settings for them, so I will get the various teams to double check.
I believe the NetApp Host Utilities will set the timeouts etc. for Windows? Does the NetApp DSM do something similar? (Host Utilities is NOT installed; the process to get that sort of update approved is somewhat drawn out - but maybe I can use this to push for it.)