"kSMF service thread held > 25 (sec)"

wsanderstii · 2017-08-28

We had part of a VMware cluster go down this morning when a VMFS volume containing some iSCSI VMs apparently became unreachable for a period of time.

The only thing we see in the NetApp logs is:

Event: ksmf.svc.watchdog: "kSMF service thread held > 25 (sec) by application for table d_vdisk_lun"
Message Name: ksmf.svc.watchdog
Sequence Number: 1420474
Description: This message occurs when an application is taking too long to complete a kSMF operation.
Action: (NONE)

Anyone seen this?

Would this error reflect a lockup on the NetApp side, or would it be a side effect of the iSCSI network connectivity going away for a while? I could not find any info on this error string in a Google search.

w

kahuna · 2017-08-29

might be a performance issue

do a search for 'toolong' in the EMS log
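If the EMS log has been exported to a plain text file, the search could be scripted roughly as below. This is only a sketch: the one-event-per-line text format and the file name passed on the command line are assumptions, and `find_events` is a hypothetical helper, not a NetApp tool.

```python
import sys
from typing import Iterable, List


def find_events(lines: Iterable[str], needles: Iterable[str]) -> List[str]:
    """Return every line that contains any of the needles (case-insensitive)."""
    lowered = [n.lower() for n in needles]
    return [
        line.rstrip("\n")
        for line in lines
        if any(n in line.lower() for n in lowered)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python find_events.py <exported_ems_log.txt>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in find_events(fh, ["toolong", "ksmf.svc.watchdog"]):
            print(hit)
```

Searching for the watchdog message name alongside 'toolong' makes it easier to see whether the two kinds of events cluster around the same time.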

Is the system sending AutoSupport? If yes, you can send me the serial number by private message and I'll have a look

Is there an open case?

OzStuV2 · 2022-01-23

Sorry to bring this one back from the dead, but we've just seen this happen on a 9.5 cluster with an associated loss of iSCSI disk.

Did you ever find out what caused it?

(and no, we don't have AutoSupport on 😞 )

Thanks in advance!

Ontapforrum · 2022-01-24

Agree with kahuna. Please raise a ticket with NetApp.

In the meantime, let us know:

1) ONTAP version
2) Filer model

Please check the usage of the filer around the time when this timeout occurred. Heavy usage of the node (depending upon the version of ONTAP) could be the cause here. If you have OCUM in place, you could check the filer usage for a particular volume over a time range.

In the worst case, if the filer usage is very high, an I/O could take longer to execute and, depending upon the client-side SCSI storage stack timeout setting, it might lead to failure or downtime. To be on the safe side in future, you might consider increasing the SCSI timeout settings as suggested in the following KB (applicable to Microsoft software initiators). However, the filer performance issues must still be investigated.

https://kb.netapp.com/?title=Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/iSCSI_timeouts_Q%26A
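The reasoning about client-side timeouts can be sketched as a simple comparison: a host is at risk when a storage stall lasts at least as long as its disk timeout. Everything below is illustrative. The host names and timeout values are made up, `hosts_at_risk` is a hypothetical helper, and real initiators also retry I/O, which this ignores. The 25-second figure is only the lower bound reported by the watchdog message.

```python
from typing import Dict, List


def hosts_at_risk(stall_sec: float, host_timeouts_sec: Dict[str, float]) -> List[str]:
    """Return hosts whose disk timeout is not longer than the storage stall."""
    return sorted(
        host for host, timeout in host_timeouts_sec.items() if timeout <= stall_sec
    )


# Hypothetical hosts and timeout values, for illustration only.
timeouts = {"win2012-a": 20.0, "win2012-b": 60.0, "esx-rdm": 30.0}
print(hosts_at_risk(25.0, timeouts))  # → ['win2012-a']
```

Raising a host's timeout moves it out of the at-risk list for a given stall, but it does not shorten the stall itself.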

OzStuV2 · 2022-01-24

Thank you OnTapForrum!

I'm in the process of sanitising some logs so I can raise a ticket.

It seems we can narrow it down to one node and its only aggregate.

It's on ONTAP 9.5P12 on a FAS8080 with a single aggregate of ~250 1.2 TB SAS (2.5") drives in 14 RAID groups. (There are 10 nodes in the cluster.)

There's definitely a spike in latency on the aggr & node at the time, and performance capacity on the aggregate hits 101%!

It's got over 100 volumes on the aggr, so we may have to move some to another node/aggr.

Client load is a mixture of Windows Server 2012 R2 with the old NetApp DSM, and VMware-presented RDMs - I have my doubts about the timeout settings for them, so I will get the various teams to double-check.

I believe the NetApp Host Utilities will set the timeouts etc. for Windows? Does the NetApp DSM do something similar? (Host Utilities is NOT installed; the process to get that sort of update approved is somewhat drawn out, but maybe I can use this to push for it.)

Thank you,

Stu

tahmad · 2022-03-23

Were you able to resolve the issue with NetApp support, @OzStuV2?