First time I have posted on here, so apologies if this post is in the wrong place!
I have a customer running a 3210a, and they have recently upgraded ONTAP to 8.0.1p4 to fix the transient PSU issue that affected the 32XX model range. Since then, one of the heads has randomly panicked, causing a failover. They say this has happened twice since the systems were installed.
Looking through the ASUP logs, I can see that this is the cause of the problem:
[fas3210a: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(fas3210b), system_down because l2_watchdog_reset
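For anyone who wants to pull these events out of a larger log, here is a minimal Python sketch. The pattern is derived only from the single sample line above, so treat the field layout as an assumption; other Data ONTAP releases may format the message differently. The names `HW_ASSIST_RE` and `parse_hw_assist` are hypothetical.

```python
import re

# Pattern built from the one sample message in this post; it is an assumption
# that other hw_assist takeover alerts follow the same layout.
HW_ASSIST_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: "
    r"hw_assist: Received takeover hw_assist alert from "
    r"partner\((?P<partner>[^)]+)\), (?P<state>\w+) because (?P<reason>\w+)"
)


def parse_hw_assist(line):
    """Return the fields of a hw_assist takeover alert as a dict, or None."""
    match = HW_ASSIST_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "[fas3210a: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received "
    "takeover hw_assist alert from partner(fas3210b), system_down because "
    "l2_watchdog_reset"
)
fields = parse_hw_assist(sample)
print(fields["partner"], fields["reason"])
```

Note that the node that logs the message (fas3210a) is the survivor; the `partner` field names the head that actually went down.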
What I want to know is what l2_watchdog_reset means, and whether anybody has any information on it. The NOW site provides little insight, and Google is no better.
To answer myself, this is the official action plan for watchdog resets:
“If the system takes a single watchdog reset, in general, no action needs to be taken as the condition causing the watchdog reset most often is a transient problem and would have been cleared by the Reset process. The customer scenario sometimes demands a replacement of the Hardware. In this case, a replacement motherboard is the most prudent course.”
L1 watchdog resets mostly result in a panic with a core dump (which can be analysed for root cause).
L2 watchdog resets do not generate a core dump... They just tell you that "some device" has encountered a timeout. "Some device" could be anything: a disk, an adapter driver, a CPU....
However, officially a hardware replacement is only mandatory when it happens more than once...
I have some further information that I got from NGS, as I wanted to know more. The platforms this document relates to are older than the ones we have experienced this on, but that doesn't change anything, as the watchdogs are still the same.
Watchdog Reset Best Practices (FAS3040/3070)
Last Modified: October 15, 2007
Document Version: 0.2
What is a Watchdog?
The Watchdog is a timer mechanism built into the system, which when enabled, prevents the system from hanging forever for any reason. The purpose is to function as a mechanism of “last resort” to recover a system from the effects of an otherwise unrecoverable system error, generally of unknown origin to the system at the time of failure.
The Watchdog timer in the FAS3040/3070 architecture is a two-level timer with different actions associated with each level of the timeout. It is physically implemented in the Agent running on a PIC chip. The Agent has a dedicated NMI signal which is triggered by the Level 1 Watchdog Timer, the RLM or the LCD pushbutton. The Agent also sources the System Reset signal for the case of a Level 2 Watchdog Reset.
· Level 1: Timeout - system tries to panic and dump core in response to an NMI asserted by the Agent (PIC chip)
· Level 2: Reset - system resets via a hard reset signal from the Agent
In the FAS3040/3070 and derivative products, the watchdog timer is a separate timer implemented in a PIC chip and controlled by the Agent code running on that PIC chip. The Watchdog timer is enabled by Data ONTAP (DOT) during boot up.
In a running system, DOT attempts to reset (kick) the watchdog every 10ms. If kicked successfully, the watchdog timer count resets and system continues to run normally. If the watchdog timer is not kicked within ~1.5 seconds, the system takes a first level watchdog timeout.
· First Level watchdog timeout leads to a high priority interrupt, an NMI, to the processor(s). If the hardware is in a “sane” enough state to handle the NMI interrupt, the system Panics and dumps core. If not, the watchdog timer is not reset and continues to run to the second level watchdog timeout.
· Second Level watchdog timeout trips if the watchdog timer is not kicked in ~ 2 seconds after the First Level Watchdog resulted in an NMI. When this occurs, the system is reset – hence the term “watchdog reset”. This is a hard reset and the system goes back through initialization and boot.
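The two-level behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the Agent firmware: the ~1.5 second and ~2 second thresholds come from the text, while the function name, its parameters and the returned strings are hypothetical.

```python
# Thresholds taken from the description above; everything else is illustrative.
L1_TIMEOUT_S = 1.5  # no kick for ~1.5 s -> Level 1 timeout (NMI)
L2_TIMEOUT_S = 2.0  # no kick for a further ~2 s after the NMI -> Level 2 reset


def watchdog_outcome(seconds_since_last_kick, nmi_handled):
    """Model what the two-level watchdog does after a missed kick.

    seconds_since_last_kick: how long the OS has failed to kick the timer.
    nmi_handled: True if the hardware is "sane" enough to service the NMI
                 (panic and dump core) when Level 1 fires.
    """
    if seconds_since_last_kick < L1_TIMEOUT_S:
        return "running"
    if nmi_handled:
        return "L1 timeout: panic and core dump"
    if seconds_since_last_kick < L1_TIMEOUT_S + L2_TIMEOUT_S:
        return "L1 timeout: NMI asserted but not serviced"
    return "L2 reset: hard reset from the Agent"


for elapsed, handled in [(0.01, True), (2.0, True), (2.0, False), (3.5, False)]:
    print(elapsed, handled, "->", watchdog_outcome(elapsed, handled))
```

The model leaves out one case the document mentions later: a system that does service the NMI can still take a Level 2 reset if the watchdog expires during the core dump itself.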
Possible Causes of Watchdog Events
Watchdog events can be caused by Software (SW) failures, Hardware (HW) failures or a combination of SW and HW failures.
The SW causes are primarily corrupt code or unusually long code paths that do not allow the watchdog timer to be kicked. These generally result in a watchdog timeout and the system recovers. Since all primary code paths have been executed many times under almost every imaginable condition, it is usually the exception code paths that may take too long to execute under some obscure condition. The software cause is generally resolved by a Watchdog Timeout except in rare cases or some cases of recursive panics. During a system panic, the watchdog timer is not kicked every 10ms; rather, it is kicked at various stages of core dump code execution. Watchdog timeouts during a core dump could lead to a watchdog reset of the system.
The HW causes are primarily limited to catastrophic CPU failures (Uncorrectable Machine Check Errors) or failures of the Agent PIC chip. The PIC chip is a very simple processor and a true failure would generally result in repeated Watchdog Resets. The UMCK error from the processor may be transient or permanent. If of a permanent nature, this type of failure would also generally result in repeated Watchdog events.
On the FAS3040, FAS3070 and their derivative products, there is a set of cases that result in a false indication of “watchdog reset” upon reboot. These cases are the “Power” commands performed through the RLM. For these cases, the current Agent code cannot distinguish a watchdog reset from the RLM resets and the reboot reason is incorrectly indicated as watchdog reset. These commands are all performed with user intervention and thus these cases are easily identified by the system administrator.
An additional case has been found in which a watchdog reset results from the shutdown of a controller when a manual takeover is performed. Again, this is under administrator control and can be easily identified as an erroneous message.
Response to Watchdog Events
It is not necessary to “recover” from a Watchdog Timeout or a Watchdog Reset. These are both recovery mechanisms for other failures. The objective is to identify the failure(s) that caused the watchdog event and eliminate them, thus preventing watchdog events from occurring.
What is the appropriate response to a Watchdog Timeout (First Level Watchdog Event)?
A Watchdog Timeout should be treated just like any other system panic. The associated backtrace and/or the core should be analyzed for the possible root cause(s). A Watchdog Timeout not followed by a Watchdog Reset almost always has a software-based root cause.
What is the appropriate response to a Watchdog Reset (Second Level Watchdog Event)?
Check that this is not an erroneous message, that is, one due to the known manual interventions that can result in the “false” watchdog reset message. These include the RLM “power” commands and manual cluster failover commands. In the case of the RLM “power” commands, no actual watchdog reset occurred. For the manual cluster failover, the message can be disregarded. These “manual” events can be easily identified through the console messages and RLM log.
Check to see if the system was handling a panic or had panicked due to a Watchdog Timeout. If so, treat the panic. No hardware should be replaced unless the root cause of the initial panic is a hardware problem.
If the system takes a single watchdog reset, in general, no action needs to be taken as the condition causing the watchdog reset most often is a transient problem and would have been cleared by the Reset process. The customer scenario sometimes demands a replacement of the HW. In this case, a replacement motherboard is the most prudent course.
If a system takes multiple Watchdog Resets, look for previously logged errors associated with the CPU, motherboard, memory or I/O cards. Replace the appropriate FRU with a good one.
If there is no indication of an appropriate FRU to replace, then the most prudent action is to perform a head swap with both motherboard and all I/O cards being replaced by a tested set of the same FRUs in the same configuration. This is a last resort and should eliminate the repeated Watchdog Resets while allowing NetApp to test the system as a whole and to have a better chance of reproducing the error and getting root cause.
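The checks above amount to a small decision tree, which can be sketched in Python as follows. This is only a summary of the document's steps in code form, not a NetApp tool: the function name, its parameters and the wording of the returned recommendations are all hypothetical.

```python
def triage_watchdog_reset(manual_action, prior_panic, reset_count, logged_fru=None):
    """Suggest a next step for a Level 2 watchdog reset.

    manual_action: True if an RLM "power" command or a manual cluster
                   failover preceded the message.
    prior_panic:   True if the system was already handling a panic or had
                   panicked due to a Watchdog Timeout.
    reset_count:   number of watchdog resets seen on this system.
    logged_fru:    FRU implicated by previously logged errors, if any.
    """
    if manual_action:
        return "Erroneous message: disregard"
    if prior_panic:
        return "Treat the original panic; replace hardware only if it is the root cause"
    if reset_count <= 1:
        return "Likely transient: no action (motherboard swap if the customer demands it)"
    if logged_fru:
        return f"Replace FRU: {logged_fru}"
    return "Last resort: head swap (motherboard and all I/O cards)"


# Two resets, no manual action, no prior panic, nothing implicated in the logs.
print(triage_watchdog_reset(False, False, 2))
```

The order of the checks matters: the false positives are ruled out first, so hardware is only considered once the message is known to be genuine and not a side effect of an earlier panic.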
Does the Front Panel Reset Button Help?
First level watchdog functionality, the Watchdog Timeout, uses the same priority interrupt as the front panel halt button. Pressing this button should therefore produce no better results than the watchdog timeout.
The "watchdog reset" is a failsafe measure to reset a system in the event that some part of a running system stops responding. This is done to avoid a complete deadlock and an unresponsive system that would otherwise have to be reset manually. The unresponsive component could be either a software task that fails to yield or a hardware component that is not responding quickly enough.
RLM-based systems with RLM firmware version 4.0 or SP-based systems with SP firmware 1.2.3 or later can provide more information as to which component triggered a watchdog reset (WDR). Additionally, Data ONTAP version 220.127.116.11P1 or Data ONTAP version 8.0.2P3 or later can provide additional information in the unlikely event that a system encounters a WDR event.
For help with troubleshooting watchdog resets, please open a case with NetApp Global Support.