Solved: mgwd and vifmgr failed - Take over after 927 seconds

marcinmf · ‎2019-09-27

Hello,

one of our FAS8040 controllers (nmg-fas8040-02) got issues with mgwd and vifmgr services where they become unresponsive.

Based on logs it took 927 seconds for Panic and Takeover.

That is about 15 min of no access to the storage untill 1st controller took over drives and they become available.

Is there a way to shorten this time to panic?

I attached the log for that event. Is there any document that woudl point me where i could tweek takeover settings?

Thanks,
Martin

Ontapforrum · ‎2019-10-01

Thanks for sharing logs:

00000074.0065e707 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b CN=4857822/3 VSID=-3 169.254.8.83:7818
00000074.0065e708 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] W [src/rdb/TM.cc 2774 (0x81d1ea100)]: TM 1000: Attempted commit of TID <5,981472,981472> 'job_manager job tx loop' failed: RPC_FAILURE (f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b CN=4857822/3 VSID=-3 169.254.8.83:7818).
00000074.0065e709 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/cluster_events.cc 78 (0x81d1ea100)]: Cluster event: node-event, epoch 5, site 1000 [could not commit transaction: RPC failure (f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b ].
00000074.0065e70a 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/TM.cc 1918 (0x81d1ea100)]: _localSiteHasFailed: set IS_FAILED epoch 5
00000074.0065e70b 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/quorumimpl.cc 941 (0x81d1ea100)]: Declare secondary failed in epoch: 5, state: WS_QuorumMember (transaction manager) fastPath 0 isFastPathDefault 1.
00000074.0065e70c 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_secondary.cc 303 (0x81d1ea100)]: secondaryFailed: FastPathDefault 1, Membership terminated by secondaryFailed call at 60788231s, _failedTillTime 60788234s
00000074.0065e70d 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_quorum_member.cc 64 (0x81d1ea100)]: QuorumMemberState::state2: WS_QuorumMember -> WS_Failed
00000074.0065e70e 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_secondary.cc 323 (0x81d1ea100)]: SecondaryState::stateUp2Secondary: WS_QuorumMember -> WS_Failed
00000074.0065e70f 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_state.cc 306 (0x81d1ea100)]: qmsPreferredCandidate_set till: 60788236s who: 1001.
00000074.0065e710 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_in_quorum.cc 49 (0x81d1ea100)]: InQuorumState::stateUp2InQuorum: WS_QuorumMember -> WS_Failed
00000074.0065e711 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/quorumimpl.cc 1833 (0x81d1ea100)]: local_offlineUpcall QM Upcall status: Secondary ==> Offline Epoch: 5 => 5 isFastPath 1 isFastPathOverride 0 membershipDisabled: 0
00000074.0065e712 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_state.cc 552 (0x81d1ea100)]: stateTrans: WS_QuorumMember -> WS_Failed at: 60788231s
00000074.0065e713 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] ******* OOQ mtrace dump BEGIN *********
00000074.0065e714 243bc4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] rdb::TM:Thu Sep 26 02:10:15 2019:src/rdb/TM.cc:2080 (thr_id:0x81b146e00)

OOQ means: MHost RDB apps went out of quorum on this node:

******* OOQ mtrace dump BEGIN ********* => RDB application Out-Of-Quorum ('Local unit offline').

Possible cause: RDB apps compete with D-blade and N-blade for CPU and I/O cycles. Therefore, RDB apps can go OOQ occasionally on heavily loaded systems.

Additional info: ONTAP is called a true cluster b'cos of a quorum, which is connected to majority of like RDB apps, with one instance selected as the master. Cluster membership and configuration is stored within the replicated sitelist. All RDB applications or rings such as (mgwd, vldb, vifmgr, bcomd, and so on) share the sitelist configuration. Therefore, a node that is Out of Quorum (OOQ) simply means, it is no longer participating in a quorum b'cos the local apps on that node went OOQ due to possible time-out on heavily loaded system.

@4:33 local node started to boot-up, as the N-blde interfaces started to get online.

Upgrading Ontap makes sense (Until then you could just monitor your Nodes resources). NetApp phone/email support should be able to provide you with more insight into this issue.

View solution in original post

GidonMarcus · ‎2019-09-28

Hi

it should have - and it's seems that it tried "The mgwd/vifmgr service internal to Data ONTAP that is required for continuing data service was unavailable. The service failed, but was unsuccessfully restarted"

There's a service-specific log (mgwd and lifmgr) that can be looked into on all the nodes to try and understand why it didn't started.

i'm also going to send you a quick DM about it.

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

SpindleNinja · ‎2019-09-28

is your cluster network healthy?

All lifs home / no degrated ports, all lifs on the cluster network pingable, and the cluster ring healthy?

marcinmf · ‎2019-09-28

Network is healthy, no alerts and all lifs home.

Looks like something caused mgwd and vifmgr on node 2 to be non responsive that triggere failover by node 1.

Node 2 restarted and after 600 sec of normal operation node 1 gave back drives to node 2 - everything abck to normal.

The entire process from start to end took 18 min whenre first 15 min was just sitting and waiting and reporting that mgwd and vifmgr is not accessible for 927 seconds. after that message takeover was triggered.

Martin

SpindleNinja · ‎2019-09-28

What version of ONTAP?

marcinmf · ‎2019-09-28

Running on 9.2

Thank you,

Martin

Ontapforrum · ‎2019-09-28

Hi,

From the attached logs, it appears to be running Ontap 9.2 (c).

In general, MGWD is a very critical compoment responsible for cluster management. When Data ONTAP is under heavy load, services required for cluster management can fail, but is designed to restart but if the Filer continues to be on higher load then there are chances that it might trigger Node panic. As a result, Node will write the core dump and will fail-over. Please note, additional load on mgwd load can come from various sources , one such is 'ZAPI load' from OnCommand softwares. It all depends on the CPU & Memory resources of the Filer Model.

I am sure NetApp must have been notified about this PANIC ? And, they should do their best to analyze the "core dump" & "mgwd" logs to determine what eventually led to panic (RCA).

I found few bugs related to MGWD causing disruption. Please take a look these KBs, it will help you point in the right direction. One such example talks about - SNMPv3 & mgwd process leak. However logs should reveal what caused it.

I will be keen to find out what core-dump & mgwd logs reveal.

MGWD panics and restarts on a regular basis; cluster node may panic regularly
https://kb.netapp.com/app/answers/answer_view/a_id/1087034

BUG: First fixed in 9.3P13
https://mysupport.netapp.com//NOW/cgi-bin/bol?Type=Detail&Display=1196390

MGWD loading on low CPU count systems takes two minutes or longer, resulting in disruption
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1163726

Disruption in ONTAP 9.2 might occur due to IGMP protocol traffic
https://kb.netapp.com/app/answers/answer_view/a_id/1074346/loc/en_US

marcinmf · ‎2019-09-30

Thank you for checking.

I disabled SNMPv3 and attached some logs from both nodes. Te issue started to come up around 4am on26th.

We will start plannign on updating our ontap to the latest supported version.

Martin

Ontapforrum · ‎2019-10-01

Thanks for sharing logs:

00000074.0065e707 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b CN=4857822/3 VSID=-3 169.254.8.83:7818
00000074.0065e708 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] W [src/rdb/TM.cc 2774 (0x81d1ea100)]: TM 1000: Attempted commit of TID <5,981472,981472> 'job_manager job tx loop' failed: RPC_FAILURE (f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b CN=4857822/3 VSID=-3 169.254.8.83:7818).
00000074.0065e709 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/cluster_events.cc 78 (0x81d1ea100)]: Cluster event: node-event, epoch 5, site 1000 [could not commit transaction: RPC failure (f_coord_commit_1: RPC: Timed out; netid=tcp fd=630 TO=30.0s TT=30.000s O=3680b I=0b ].
00000074.0065e70a 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/TM.cc 1918 (0x81d1ea100)]: _localSiteHasFailed: set IS_FAILED epoch 5
00000074.0065e70b 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/quorumimpl.cc 941 (0x81d1ea100)]: Declare secondary failed in epoch: 5, state: WS_QuorumMember (transaction manager) fastPath 0 isFastPathDefault 1.
00000074.0065e70c 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_secondary.cc 303 (0x81d1ea100)]: secondaryFailed: FastPathDefault 1, Membership terminated by secondaryFailed call at 60788231s, _failedTillTime 60788234s
00000074.0065e70d 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_quorum_member.cc 64 (0x81d1ea100)]: QuorumMemberState::state2: WS_QuorumMember -> WS_Failed
00000074.0065e70e 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_secondary.cc 323 (0x81d1ea100)]: SecondaryState::stateUp2Secondary: WS_QuorumMember -> WS_Failed
00000074.0065e70f 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_state.cc 306 (0x81d1ea100)]: qmsPreferredCandidate_set till: 60788236s who: 1001.
00000074.0065e710 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_in_in_quorum.cc 49 (0x81d1ea100)]: InQuorumState::stateUp2InQuorum: WS_QuorumMember -> WS_Failed
00000074.0065e711 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/quorumimpl.cc 1833 (0x81d1ea100)]: local_offlineUpcall QM Upcall status: Secondary ==> Offline Epoch: 5 => 5 isFastPath 1 isFastPathOverride 0 membershipDisabled: 0
00000074.0065e712 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] A [src/rdb/quorum/qm_states/qm_state.cc 552 (0x81d1ea100)]: stateTrans: WS_QuorumMember -> WS_Failed at: 60788231s
00000074.0065e713 243b8c4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] ******* OOQ mtrace dump BEGIN *********
00000074.0065e714 243bc4d Thu Sep 26 2019 04:04:48 -06:00 [kern_mgwd:info:2284] rdb::TM:Thu Sep 26 02:10:15 2019:src/rdb/TM.cc:2080 (thr_id:0x81b146e00)

OOQ means: MHost RDB apps went out of quorum on this node:

******* OOQ mtrace dump BEGIN ********* => RDB application Out-Of-Quorum ('Local unit offline').

Possible cause: RDB apps compete with D-blade and N-blade for CPU and I/O cycles. Therefore, RDB apps can go OOQ occasionally on heavily loaded systems.

Additional info: ONTAP is called a true cluster b'cos of a quorum, which is connected to majority of like RDB apps, with one instance selected as the master. Cluster membership and configuration is stored within the replicated sitelist. All RDB applications or rings such as (mgwd, vldb, vifmgr, bcomd, and so on) share the sitelist configuration. Therefore, a node that is Out of Quorum (OOQ) simply means, it is no longer participating in a quorum b'cos the local apps on that node went OOQ due to possible time-out on heavily loaded system.

@4:33 local node started to boot-up, as the N-blde interfaces started to get online.

Upgrading Ontap makes sense (Until then you could just monitor your Nodes resources). NetApp phone/email support should be able to provide you with more insight into this issue.

kahuna · ‎2019-09-29

was 'inode' mentioned in the logs prior to the 'event'?