It should have, and it seems that it tried: "The mgwd/vifmgr service internal to Data ONTAP that is required for continuing data service was unavailable. The service failed, but was unsuccessfully restarted."
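If you want to trace the sequence yourself, the EMS log can be filtered per node from the clustershell (the node name below is a placeholder, not from your logs):

    ::> event log show -node node2 -severity ERROR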
There are service-specific logs (mgwd and vifmgr) that can be looked into on all the nodes to try to understand why they didn't start.
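For example, assuming the default log locations, the per-service logs live under the mlog directory and can be pulled over the SPI web interface (the cluster management IP and node name are placeholders):

    https://<cluster_mgmt_ip>/spi/<node_name>/etc/log/mlog/mgwd.log
    https://<cluster_mgmt_ip>/spi/<node_name>/etc/log/mlog/vifmgr.log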
It looks like something caused mgwd and vifmgr on node 2 to become unresponsive, which triggered a takeover by node 1.
Node 2 restarted, and after 600 seconds of normal operation node 1 gave the drives back to node 2 - everything back to normal.
The entire process from start to end took 18 minutes, of which the first 15 minutes were spent just sitting, waiting, and reporting that mgwd and vifmgr had not been accessible for 927 seconds. After that message, the takeover was triggered.
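As a sanity check, the HA state and the auto-giveback settings (the 600 seconds matches the usual auto-giveback delay, if I recall correctly) can be verified from the clustershell; the node name is a placeholder and field names may vary slightly by release:

    ::> storage failover show
    ::> storage failover show -node node2 -fields auto-giveback, delay-seconds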
From the attached logs, it appears to be running ONTAP 9.2.
In general, MGWD is a very critical component responsible for cluster management. When Data ONTAP is under heavy load, services required for cluster management can fail; they are designed to restart, but if the filer continues to be under high load, there is a chance that this will trigger a node panic. As a result, the node will write a core dump and fail over. Please note, additional load on mgwd can come from various sources; one such source is ZAPI load from OnCommand software. It all depends on the CPU and memory resources of the filer model.
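If you want to see whether the filer is CPU- or disk-bound when this happens, the nodeshell sysstat is a quick check (the node name is a placeholder; -x gives the extended view, here 10 samples at a 1-second interval):

    ::> system node run -node node2 -command sysstat -x -c 10 1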
I am sure NetApp must have been notified about this panic? They should do their best to analyze the core dump and mgwd logs to determine what eventually led to the panic (RCA).
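Assuming the panic wrote a core, it should be visible from the clustershell, and (on releases that still support it) can be uploaded to NetApp directly; the core name below is a placeholder:

    ::> system node coredump show
    ::> system node coredump upload -node node2 -corename <core_file_name>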
I found a few bugs related to MGWD causing disruption. Please take a look at these KBs; they should help point you in the right direction. One such example covers SNMPv3 and an mgwd process leak. However, the logs should reveal what caused it.
I will be keen to find out what the core dump and mgwd logs reveal.
OOQ means the MHost RDB apps went out of quorum on this node:
******* OOQ mtrace dump BEGIN ********* => RDB application Out-Of-Quorum ('Local unit offline').
Possible cause: RDB apps compete with the D-blade and N-blade for CPU and I/O cycles. Therefore, RDB apps can occasionally go OOQ on heavily loaded systems.
Additional info: ONTAP is called a true cluster because of its quorum, in which a majority of like RDB app instances are connected, with one instance selected as the master. Cluster membership and configuration are stored within the replicated sitelist. All RDB applications or rings (mgwd, vldb, vifmgr, bcomd, and so on) share the sitelist configuration. Therefore, a node that is Out of Quorum (OOQ) is simply no longer participating in the quorum, because the local apps on that node went OOQ due to a possible time-out on a heavily loaded system.
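You can see the per-ring quorum state directly with 'cluster ring show' at advanced privilege; it lists each RDB unit, its epoch, and which node is master. A healthy ring looks roughly like this (node names and counters are illustrative, not from your logs):

    ::> set -privilege advanced
    ::*> cluster ring show
    Node      UnitName Epoch    DB Epoch DB Trnxs Master    Online
    --------- -------- -------- -------- -------- --------- ---------
    node1     mgmt     8        8        1234     node1     master
    node1     vifmgr   8        8        98       node1     master
    node2     mgmt     8        8        1234     node1     secondary
    node2     vifmgr   8        8        98       node1     secondary

An OOQ node shows up with 'offline' in the Online column and, typically, an epoch that lags the rest of the ring.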
At 4:33, the local node started to boot up, as the N-blade interfaces started to come online.
Upgrading ONTAP makes sense (until then, you could just monitor your nodes' resources). NetApp phone/email support should be able to provide you with more insight into this issue.
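Until the upgrade, something like this from the clustershell gives a rolling view of CPU and ops for spotting load spikes (the interval and iteration counts are arbitrary):

    ::> statistics show-periodic -interval 5 -iterations 12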