Data ONTAP Discussions

Watchdog Reset???

I encountered a watchdog reset on one of my filers yesterday. According to NetApp this is an extremely rare event. Trying to get to the bottom of it, I searched the RLM event logs and found the following messages right before the watchdog reset.

Record 5: Tue Feb 23 07:39:40 2005 [Agent Event.warning]: FIFO 0x8FFF - Agent XYZ, L1_WD_TIMEOUT asserted.
Record 6: Tue Feb 23 07:39:42 2005 [Agent Event.critical]: FIFO 0x8FFE - Agent XYZ, L2_WD_TIMEOUT asserted.

Has anyone experienced or seen these messages before? What do they mean? I assume it is some kind of hardware failure.
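For anyone who wants to pull records like these out of a saved copy of the event log, here is a minimal Python sketch. It assumes the log has been exported as plain text with one record per line in exactly the format quoted above. The function names and the `WD_TIMEOUT` substring filter are my own choices for illustration, not anything provided by NetApp.

```python
import re

# Pattern for lines shaped like the two records quoted in this post:
#   Record N: <timestamp> [<source>.<severity>]: <message>
RECORD_RE = re.compile(r"^Record (\d+): (.+?) \[(.+?)\.(\w+)\]: (.*)$")


def parse_record(line):
    """Return a dict for one record line, or None if the line does not match."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        return None
    number, timestamp, source, severity, message = match.groups()
    return {
        "record": int(number),
        "timestamp": timestamp,  # kept as text; no date parsing is attempted
        "source": source,
        "severity": severity,
        "message": message,
    }


def watchdog_records(lines):
    """Keep only the parsed records whose message mentions WD_TIMEOUT."""
    parsed = (parse_record(line) for line in lines)
    return [rec for rec in parsed if rec and "WD_TIMEOUT" in rec["message"]]


SAMPLE = [
    "Record 5: Tue Feb 23 07:39:40 2005 [Agent Event.warning]: FIFO 0x8FFF - Agent XYZ, L1_WD_TIMEOUT asserted.",
    "Record 6: Tue Feb 23 07:39:42 2005 [Agent Event.critical]: FIFO 0x8FFE - Agent XYZ, L2_WD_TIMEOUT asserted.",
]

if __name__ == "__main__":
    for rec in watchdog_records(SAMPLE):
        print(rec["record"], rec["severity"], rec["message"])
```

The timestamp is deliberately left as a string so that nothing is lost if the log's date format differs from what is shown here.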


Re: Watchdog Reset???

We had a watchdog reset error on a filer a few months ago. NetApp explained that the most likely cause was a motherboard (MB) fault, followed by the RLM and onboard memory. We replaced the MB first, but that didn't fix it. They then told us to replace the RLM, and we said we couldn't afford the downtime unless we were absolutely sure that replacing the RLM would fix the error, so we ended up replacing the head. Problem solved.

Re: Watchdog Reset???

So far we have had a number of systems go down with the cause identified as "Watchdog Reset", mainly FAS3070s, although potentially a FAS3040 as well. A possible cause is the PCI drivers from Broadcom, although there are no answers as yet. BURT 409491, in the P3 release of Data ONTAP 7.3.3, is meant to enhance the error reporting for these events.

Re: Watchdog Reset???

Just to chime in here in case someone from NetApp reads these: we've been running filers for years and have just started experiencing these watchdog reset issues on both of our clusters.

Between the two clusters, we've had five of these in 2010. Nothing in the logs. Motherboard diagnostics come back clean. It's very frustrating, and the techs don't seem to be able to do much more than shrug their shoulders. I think we will try your suggestion and upgrade Data ONTAP to 7.3.3.

We are currently running 7.3.2P3 on 3040s.

Watchdog Reset???

The "watchdog reset" is a failsafe measure to reset a system in the event that some part of a running system stops responding. This is done to avoid a complete deadlock and an unresponsive system that would otherwise have to be reset manually. The unresponsive component could be either a software task that fails to yield or a hardware component that is not responding quickly enough.
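The general idea can be sketched in a few lines of Python. This is only a toy model of a generic watchdog timer, not how Data ONTAP, the RLM, or the SP actually implement it. The `Watchdog` class, the 2-second timeout, and the simulated clock are all assumptions made for illustration. A healthy task "kicks" the watchdog regularly; once the task stops kicking for longer than the timeout, the watchdog fires its reset action.

```python
class Watchdog:
    """Toy watchdog: fires on_expire once if not kicked within `timeout`."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout        # allowed silence, in simulated seconds
        self.on_expire = on_expire    # action to take, e.g. reset the system
        self.last_kick = 0.0
        self.expired = False

    def kick(self, now):
        """Called by the monitored task to show it is still making progress."""
        self.last_kick = now

    def check(self, now):
        """Called periodically by the monitor; fires the action on timeout."""
        if not self.expired and now - self.last_kick > self.timeout:
            self.expired = True
            self.on_expire()
        return self.expired


def simulate():
    """Task kicks at t=0, 1, 2 and then hangs; the watchdog is checked each second."""
    events = []
    wd = Watchdog(timeout=2.0, on_expire=lambda: events.append("reset"))
    for now in range(0, 7):
        if now <= 2:                  # the task is healthy up to t=2
            wd.kick(float(now))
        if wd.check(float(now)):      # monitor notices the missed kicks
            events.append(f"expired at t={now}")
            break
    return events


if __name__ == "__main__":
    print(simulate())
```

In this simulation the last kick is at t=2, so the watchdog stays quiet at t=3 and t=4 and fires at t=5, the first check where more than 2 seconds have passed without a kick. A real hardware watchdog works on the same principle, but the reset is carried out by independent hardware so that it still happens when the main system is hung.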

RLM-based systems with RLM firmware version 4.0 or SP-based systems with SP firmware 1.2.3 or later can provide more information as to which component triggered a watchdog reset (WDR). Additionally, Data ONTAP version or Data ONTAP version 8.0.2P3 or later can provide additional information in the unlikely event that a system encounters a WDR event.

For help with troubleshooting watchdog resets, please open a case with NetApp Global Support.