ONTAP Discussions

"kSMF service thread held > 25 (sec)"

wsanderstii

Part of one of our VMware clusters went down this morning when a VMFS volume containing some iSCSI VMs apparently became unreachable for a period of time.

 

The only thing we see in the NetApp logs is:

 

Event: ksmf.svc.watchdog: "kSMF service thread held > 25 (sec) by application for table d_vdisk_lun"
Message Name: ksmf.svc.watchdog
Sequence Number: 1420474
Description: This message occurs when an application is taking too long to complete a kSMF operation.
Action: (NONE)

 

Anyone seen this?

 

Would this error reflect a lockup on the NetApp side, or would it be a side effect of the iSCSI network connectivity going away for a while? I could not find anything in a Google search for this error string.

 

w


kahuna

This might be a performance issue.

 

Do a search for 'toolong' in the EMS log.
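If the EMS log has been exported to a plain-text file, a few lines of Python can pull out the matching entries. This is only a rough sketch: the function name `find_ems_matches` and the idea of a one-event-per-line export are assumptions for illustration, not an ONTAP tool or format. The second search term is simply the event name from the original post.

```python
from typing import Iterable, List, Tuple

# 'toolong' is the search term suggested above; 'ksmf.svc.watchdog' is the
# event name quoted in the original post.
DEFAULT_TERMS = ("toolong", "ksmf.svc.watchdog")


def find_ems_matches(
    lines: Iterable[str], terms: Iterable[str] = DEFAULT_TERMS
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any search term.

    Matching is case-insensitive. Line numbers start at 1.
    """
    lowered_terms = [term.lower() for term in terms]
    matches = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(term in text.lower() for term in lowered_terms):
            matches.append((number, text))
    return matches
```

For example, `find_ems_matches(open("ems_export.txt"))` would scan a saved copy of the log, where `ems_export.txt` is a hypothetical file name.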

 

Is the system sending AutoSupport? If so, you can send me the serial number by private message and I'll have a look.

 

Is there an open case?

OzStuV2

Sorry to bring this one back from the dead, but we've just seen this happen on a 9.5 cluster with an associated loss of iSCSI disk.

 

Did you ever find out what caused it?

(And no, we don't have AutoSupport on 😞)

 

Thanks in advance!

Ontapforrum

Agree with kahuna. Please raise a ticket with NetApp.

 

In the meantime, let us know:

1) ONTAP version
2) Filer model

 

Please check the utilization of the filer around the time this timeout occurred. Heavy load on the node (depending on the ONTAP version) could be the cause here. If you have OCUM in place, you can check the usage for a particular volume over a range of time.

 

In the worst case, if filer utilization is very high, an I/O can take longer to complete and, depending on the client-side SCSI storage stack timeout setting, that can lead to failures or downtime. To be on the safe side in future, you might consider increasing the SCSI timeout settings as suggested in the following KB (applicable to Microsoft software initiators). However, the filer performance issue must still be investigated.

 

https://kb.netapp.com/?title=Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/iSCSI_timeouts_Q%26A
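To make the timeout reasoning above concrete, here is a small sketch for a Windows host. Windows keeps its disk I/O timeout (in seconds) in the `TimeOutValue` registry value under `HKLM\SYSTEM\CurrentControlSet\Services\Disk`. The helper names below are hypothetical, and treating the 25 seconds from the watchdog message as a host-visible stall is only an assumption for the sake of the example. Use the KB for the actual recommended values.

```python
import sys

# Registry location of the Windows disk I/O timeout, in seconds.
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
DISK_VALUE = "TimeOutValue"


def read_windows_disk_timeout():
    """Return the configured disk timeout in seconds, or None if unavailable."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, but only present on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, DISK_VALUE)
            return int(value)
    except OSError:
        # Key or value missing, or access denied.
        return None


def stall_exceeds_timeout(stall_seconds, timeout_seconds):
    """True if a storage stall of this length would outlast the host timeout."""
    return stall_seconds > timeout_seconds
```

With a host timeout of 20 seconds, `stall_exceeds_timeout(25, 20)` is true, so a 25-second stall would surface as an I/O failure on the host. With a timeout of 60 seconds it would not.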

OzStuV2

Thank you OnTapForrum!

I'm in the process of sanitising some logs so I can raise a ticket.

 

It seems we can narrow it down to one node and its only aggregate.

It's on ONTAP 9.5P12 on a FAS8080, with a single aggregate of ~250 1.2 TB SAS (2.5") drives in 14 RAID groups. (There are 10 nodes in the cluster.)

 

There's definitely a spike in latency on the aggr & node at the time, and performance capacity on the aggregate hits 101%!

 

There are over 100 volumes on the aggr, so we may have to move some to another node/aggr.

 

Client load is a mixture of Windows Server 2012 R2 with an old NetApp DSM, and VMware-presented RDMs. I have my doubts about the timeout settings for them, so I will get the various teams to double-check.

 

I believe the NetApp Host Utilities will set the timeouts etc. for Windows? Does the NetApp DSM do something similar? (Host Utilities is NOT installed. The process to get that sort of update approved is somewhat drawn out, but maybe I can use this to push for it.)

 

Thank you,

Stu

tahmad

Were you able to resolve the issue with NetApp support, @OzStuV2?
