It should have, and it seems that it tried: "The mgwd/vifmgr service internal to Data ONTAP that is required for continuing data service was unavailable. The service failed, but was unsuccessfully restarted."
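If you want to trace the sequence yourself, the EMS log can be filtered per node from the clustershell (the node name below is a placeholder, not from your logs):

    ::> event log show -node node2 -severity ERROR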
There are service-specific logs (mgwd and vifmgr) that can be looked into on all the nodes to try to understand why they didn't start.
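For example, assuming the default log locations, the per-service logs live under the mlog directory and can be pulled over the SPI web interface (the cluster management IP and node name are placeholders):

    https://<cluster_mgmt_ip>/spi/<node_name>/etc/log/mlog/mgwd.log
    https://<cluster_mgmt_ip>/spi/<node_name>/etc/log/mlog/vifmgr.log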
It looks like something caused mgwd and vifmgr on node 2 to become unresponsive, which triggered a takeover by node 1.
Node 2 restarted, and after 600 seconds of normal operation node 1 gave the drives back to node 2 - everything back to normal.
The entire process from start to end took 18 minutes, of which the first 15 minutes were spent just sitting, waiting, and reporting that mgwd and vifmgr had not been accessible for 927 seconds. After that message, the takeover was triggered.
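As a sanity check, the HA state and the auto-giveback settings (the 600 seconds matches the usual auto-giveback delay, if I recall correctly) can be verified from the clustershell; the node name is a placeholder and field names may vary slightly by release:

    ::> storage failover show
    ::> storage failover show -node node2 -fields auto-giveback, delay-seconds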
From the attached logs, it appears to be running ONTAP 9.2.
In general, MGWD is a very critical component responsible for cluster management. When Data ONTAP is under heavy load, services required for cluster management can fail; they are designed to restart, but if the filer continues to be under high load, there is a chance that this will trigger a node panic. As a result, the node will write a core dump and fail over. Please note, additional load on mgwd can come from various sources; one such source is ZAPI load from OnCommand software. It all depends on the CPU and memory resources of the filer model.
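If you want to see whether the filer is CPU- or disk-bound when this happens, the nodeshell sysstat is a quick check (the node name is a placeholder; -x gives the extended view, here 10 samples at a 1-second interval):

    ::> system node run -node node2 -command sysstat -x -c 10 1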
I am sure NetApp must have been notified about this panic? They should do their best to analyze the core dump and mgwd logs to determine what eventually led to the panic (RCA).
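Assuming the panic wrote a core, it should be visible from the clustershell, and (on releases that still support it) can be uploaded to NetApp directly; the core name below is a placeholder:

    ::> system node coredump show
    ::> system node coredump upload -node node2 -corename <core_file_name>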
I found a few bugs related to MGWD causing disruption. Please take a look at these KBs; they should help point you in the right direction. One such example covers SNMPv3 and an mgwd process leak. However, the logs should reveal what caused it.
I will be keen to find out what the core dump and mgwd logs reveal.
OOQ means the MHost RDB apps went out of quorum on this node:
******* OOQ mtrace dump BEGIN ********* => RDB application Out-Of-Quorum ('Local unit offline').
Possible cause: RDB apps compete with the D-blade and N-blade for CPU and I/O cycles. Therefore, RDB apps can occasionally go OOQ on heavily loaded systems.
Additional info: ONTAP is called a true cluster because of its quorum, in which a majority of like RDB app instances are connected, with one instance selected as the master. Cluster membership and configuration are stored within the replicated sitelist. All RDB applications or rings (mgwd, vldb, vifmgr, bcomd, and so on) share the sitelist configuration. Therefore, a node that is Out of Quorum (OOQ) is simply no longer participating in the quorum, because the local apps on that node went OOQ due to a possible time-out on a heavily loaded system.
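You can see the per-ring quorum state directly with 'cluster ring show' at advanced privilege; it lists each RDB unit, its epoch, and which node is master. A healthy ring looks roughly like this (node names and counters are illustrative, not from your logs):

    ::> set -privilege advanced
    ::*> cluster ring show
    Node      UnitName Epoch    DB Epoch DB Trnxs Master    Online
    --------- -------- -------- -------- -------- --------- ---------
    node1     mgmt     8        8        1234     node1     master
    node1     vifmgr   8        8        98       node1     master
    node2     mgmt     8        8        1234     node1     secondary
    node2     vifmgr   8        8        98       node1     secondary

An OOQ node shows up with 'offline' in the Online column and, typically, an epoch that lags the rest of the ring.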
At 4:33, the local node started to boot up, as the N-blade interfaces started to come online.
Upgrading ONTAP makes sense (until then, you could just monitor your nodes' resources). NetApp phone/email support should be able to provide you with more insight into this issue.
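Until the upgrade, something like this from the clustershell gives a rolling view of CPU and ops for spotting load spikes (the interval and iteration counts are arbitrary):

    ::> statistics show-periodic -interval 5 -iterations 12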