filer reboot

lasonsysadmn

We have a FAS3240 clustered filer running ONTAP 8.0. One of the filers panicked and rebooted, and the SP logs follow. Please help us identify the issue.

[IPMI Event.critical]: NMI

Record 2474: Mon Mar 31 21:01:07 2014 [IPMI.notice]: 7f04 | 02 | EVT: 6fc824ff | System_Watchdog | Assertion Event, "Timer interrupt"

Record 2475: Mon Mar 31 21:01:09 2014 [IPMI Event.critical]: L2 watchdog timeout hard reset

Record 2476: Mon Mar 31 21:01:09 2014 [Trap Event.critical]: hwassist l2_watchdog_reset (29)

Record 2477: Mon Mar 31 21:01:09 2014 [Trap Event.critical]: SNMP l2_watchdog_reset (29)

Record 2478: Mon Mar 31 21:01:09 2014 [IPMI Event.critical]: System reset

Record 2479: Mon Mar 31 21:01:09 2014 [IPMI Event.critical]: L2 watchdog action completed

Record 2480: Mon Mar 31 21:01:09 2014 [IPMI.notice]: L2 to L1 is 3(s) 195021(us)

Record 2481: Mon Mar 31 21:01:11 2014 [IPMI.notice]: 8004 | 02 | EVT: 6fc202ff | System_FW_Status | Assertion Event, "NVMEM initialization"

Record 2482: Mon Mar 31 21:01:11 2014 [IPMI.notice]: 8104 | 02 | EVT: 6fc104ff | System_Watchdog | Assertion Event, "Hard reset"

Record 2483: Mon Mar 31 21:01:11 2014 [IPMI.notice]: 8204 | 02 | EVT: 0301ffff | System_Fault | Assertion Event, "State Asserted"

Record 2484: Mon Mar 31 21:01:11 2014 [IPMI.notice]: 8304 | 02 | EVT: 0301ffff | Controller_Fault | Assertion Event, "State Asserted"

Record 2485: Mon Mar 31 21:01:11 2014 [SP.notice]: Delaying L2_WDOG ASUP email for 120 seconds

Record 2486: Mon Mar 31 21:01:52 2014 [SP.critical]: Filer Reboots

Record 2487: Mon Mar 31 21:02:11 2014 [IPMI.notice]: 8404 | 02 | EVT: 6fc213ff | System_FW_Status | Assertion Event, "System boot initiated"

Record 2488: Mon Mar 31 21:02:15 2014 [IPMI.notice]: 8504 | 02 | EVT: 6fc220ff | System_FW_Status | Assertion Event, "Bootloader is running"

Record 2489: Mon Mar 31 21:02:19 2014 [IPMI.notice]: 8604 | 02 | EVT: 6fc22fff | System_FW_Status | Assertion Event, "OnTap Kernel Started"

Record 2490: Mon Mar 31 21:02:19 2014 [IPMI.notice]: 8704 | 02 | EVT: 0300ffff | System_Fault | Assertion Event, "State Deasserted"

Record 2491: Mon Mar 31 21:02:20 2014 [IPMI.notice]: 8804 | 02 | EVT: 0300ffff | Controller_Fault | Assertion Event, "State Deasserted"

Record 2492: Mon Mar 31 21:04:05 2014 [ASUP.notice]: First notification email | (REBOOT (watchdog reset)) CRITICAL | Sent

Record 2493: Mon Mar 31 21:04:30 2014 [SP.normal]: Heartbeat received

fabrice_berrier

Hi,

Open /etc/messages on the other node and look for the message that explains why the partner node panicked.

lasonsysadmn

Please find below:

Tue Apr  1 02:23:26 IST [gun-nas-1: wafl.quota.qtree.exceeded:notice]: tid 17: tree quota exceeded on volume vol9. Additional warnings will be suppressed for approximately 60 minutes or until a 'quota resize' is performed.

Tue Apr  1 02:31:08 IST [gun-nas-1: cf.fsm.partnerNotResponding:notice]: Failover monitor: partner not responding

Tue Apr  1 02:31:08 IST [gun-nas-1: cf.fsm.takeoverCountdown:info]: Failover monitor: takeover scheduled in 10 seconds

Tue Apr  1 02:31:09 IST [gun-nas-1: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(gun-nas-2), system_down because l2_watchdog_reset.

Tue Apr  1 02:31:09 IST [gun-nas-1: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(gun-nas-2), system_down because l2_watchdog_reset.

Tue Apr  1 02:31:10 IST [iwarp-vfiler@gun-nas-1: ctrl.rdma.heartBeat:info]: High-availability interconnect status: Missed heartbeat to 192.168.1.240

Tue Apr  1 02:31:10 IST [gun-nas-1: cf.ic.xferTimedOut:error]: wafl interconnect transfer timed out

Tue Apr  1 02:31:10 IST [gun-nas-1: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER

Tue Apr  1 02:31:10 IST [gun-nas-1: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started

Tue Apr  1 02:31:10 IST [gun-nas-1: netif.linkDown:info]: Ethernet c0a: Link down, check cable.

Tue Apr  1 02:31:10 IST [gun-nas-1: netif.linkDown:info]: Ethernet c0b: Link down, check cable.

Tue Apr  1 02:31:10 IST [iwarp-vfiler@gun-nas-1: ctrl.rdma.heartBeat:info]: High-availability interconnect status: Missed heartbeat to 192.168.2.62

Tue Apr  1 02:31:10 IST [gun-nas-1: scsitarget.vtic.down:notice]: The VTIC is down.

Tue Apr  1 02:31:11 IST [gun-nas-2/gun-nas-1: coredump.host.spare.none:info]: No sparecore disk was found for host 1.

Tue Apr  1 02:31:12 IST [gun-nas-1: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)

Tue Apr  1 02:31:12 IST [gun-nas-1: raid.replay.partner.nvram:notice]: Replaying partner NVRAM.

Tue Apr  1 02:31:12 IST [gun-nas-1: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.

Tue Apr  1 02:31:12 IST [gun-nas-1: raid.stripe.replay.summary:info]: Replayed 0 stripes.

Tue Apr  1 02:31:16 IST [gun-nas-1: wafl.replay.done:info]: WAFL log replay completed, 3 seconds

Tue Apr  1 02:31:18 IST [gun-nas-2/gun-nas-1: fcp.service.startup:info]: FCP service startup

Tue Apr  1 02:31:18 IST [gun-nas-2/gun-nas-1: httpd.config.mime.missing:warning]: /etc/httpd.mimetypes file is missing.

Tue Apr  1 02:31:18 IST [gun-nas-2/gun-nas-1: iscsi.service.startup:info]: iSCSI service startup

Tue Apr  1 02:31:18 IST [gun-nas-2/gun-nas-1: net.ifconfig.noPartner:error]: ifconfig: 'c0a' cannot be configured: Address does not match any partner interface.

Tue Apr  1 02:31:18 IST [gun-nas-2/gun-nas-1: net.ifconfig.noPartner:error]: ifconfig: 'c0b' cannot be configured: Address does not match any partner interface.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noPartner:error]: ifconfig: 'e0P' cannot be configured: Address does not match any partner interface.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0a.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0b.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e1a.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e1b.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e4a.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e4b.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e4e4d.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: ip.drd.vfiler.info:info]: Although vFiler units are licensed, the routing daemon runs in the default IP space only.

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: cf_takeover:info]: relog syslog Tue Apr  1 02:30:14 IST [gun-nas-2: api_mpool_03:debug]: root@10.20.32.39:API:https in:<?xml version='1.0' encoding='u

Tue Apr  1 02:31:19 IST [gun-nas-2/gun-nas-1: cf_takeover:info]: relog syslog Tue Apr  1 02:30:14 IST [gun-nas-2: api_mpool_06:debug]: root@10.20.32.39:API:https in:<?xml version='1.0' encoding='u

Tue Apr  1 02:31:20 IST [gun-nas-2/gun-nas-1: cf_takeover:ALERT]: Warning: license setting for snapmanager_sharepoint is not the same on both systems

Tue Apr  1 02:31:20 IST [gun-nas-1: cf.rsrc.takeoverOpFail:error]: Failover monitor: takeover during license_check failed; takeover continuing...

Tue Apr  1 02:31:20 IST [gun-nas-1: net.ifconfig.takeoverError:warning]: WARNING: 10 errors detected during network takeover processing WARNING: Some network clients may not be able to access the cluster during takeover

Tue Apr  1 02:31:20 IST [gun-nas-1: cf.rsrc.takeoverOpFail:error]: Failover monitor: takeover during ifconfig_2 failed; takeover continuing...

Tue Apr  1 02:31:20 IST [gun-nas-2/gun-nas-1: cifs.startup.partner.succeeded:info]: CIFS: CIFS partner server is running.

Tue Apr  1 02:31:20 IST [gun-nas-2/gun-nas-1: proto_init03:info]: Vfiler discovery complete

Tue Apr  1 02:31:20 IST [gun-nas-1 (takeover): cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_replay=3148 {replay_log=3118, mark_replaying=29, enable_log=1, init=0, catalog_init=0, replay_log_missing=0, nvfail=0, partner_log=0, destroy_vvol=0}, wafl_sync=2145, rc=1111 {ifconfig=155, ifconfig=133, always_do_just_after_etc_rc=127, ifconfig=93, ifconfig=75, hostname=59, ifconfig=58, ifconfig=51, ifconfig=50, ifconfig=49}, wafl=550 {paggrs_to_done=228, prvol_to_done=194, pvvols_to_done=120, part_

Tue Apr  1 02:31:20 IST [gun-nas-1 (takeover): callhome.sfo.takeover:CRITICAL]: Call home for CONTROLLER TAKEOVER COMPLETE AUTOMATIC

Tue Apr  1 02:31:20 IST [gun-nas-1 (takeover): callhome.reboot.takeover:error]: Call home for PARTNER REBOOT (CONTROLLER TAKEOVER)

Tue Apr  1 02:31:20 IST [gun-nas-1 (takeover): cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed

Tue Apr  1 02:31:20 IST [gun-nas-1 (takeover): cf.fm.takeoverDuration:info]: Failover monitor: takeover duration time is 10 seconds

Tue Apr  1 02:31:22 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.sfo.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:31:28 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.reboot.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:31:43 IST [gun-nas-2/gun-nas-1: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.

Tue Apr  1 02:31:51 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.sfo.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:32:01 IST [gun-nas-1 (takeover): monitor.globalStatus.critical:CRITICAL]: This node has taken over gun-nas-2.

Tue Apr  1 02:32:10 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.reboot.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:32:14 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.sfo.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:32:22 IST [gun-nas-1 (takeover): callhome.performance.snap:info]: Call home for PERFORMANCE SNAPSHOT

Tue Apr  1 02:32:32 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.reboot.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:32:50 IST [gun-nas-1 (takeover): asup.general.collect.error:notice]: AutoSupport (callhome.sfo.takeover) might be missing content due to truncation in the (dblade) AutoSupport collection module.

Tue Apr  1 02:33:20 IST [gun-nas-2/gun-nas-1: asup.general.file.missing:error]: Unable to find file: /etc/log/cm_asup_stats

Tue Apr  1 02:33:26 IST [gun-nas-1 (takeover): netif.linkUp:info]: Ethernet c0a: Link up.

Tue Apr  1 02:33:32 IST [gun-nas-1 (takeover): netif.linkUp:info]: Ethernet c0b: Link up.

Tue Apr  1 02:33:36 IST [iwarp-vfiler@gun-nas-1 (takeover): ctrl.rdma.heartBeat:info]: High-availability interconnect status: Starting heartbeat to 192.168.2.237

Tue Apr  1 02:33:36 IST [iwarp-vfiler@gun-nas-1 (takeover): ctrl.rdma.heartBeat:info]: High-availability interconnect status: Starting heartbeat to 192.168.1.116

Tue Apr  1 02:33:42 IST [gun-nas-1 (takeover): cf.fsm.releasingReservations:info]: Failover monitor: Releasing disk reservations in preparation for giveback

Tue Apr  1 02:33:42 IST [gun-nas-1 (takeover): cf.fm.diskRelease:info]: Failover monitor: released disk reservations.

Tue Apr  1 02:33:46 IST [gun-nas-1 (takeover): scsitarget.vtic.up:notice]: The VTIC is up.

Tue Apr  1 03:00:00 IST [gun-nas-1 (takeover): kern.uptime.filer:info]:   3:00am up 64 days,  8:29 18732153008 NFS ops, 5277830570 CIFS ops, 48 HTTP ops, 10499862474 FCP ops, 418578985 iSCSI ops

Tue Apr  1 03:00:12 IST [gun-nas-1 (takeover): cf.partner.ready.giveback:info]: Partner is booted and ready for giveback.

Tue Apr  1 03:31:37 IST [gun-nas-1 (takeover): rlmauth_login_mgr:info]: root logged in from SP

Tue Apr  1 03:36:38 IST [gun-nas-1 (takeover): cf.misc.operatorGiveback:info]: Failover monitor: giveback initiated by operator

Tue Apr  1 03:36:38 IST [gun-nas-1: cf.fm.givebackStarted:notice]: Failover monitor: giveback started

Tue Apr  1 03:36:40 IST [gun-nas-2/gun-nas-1: iscsi.service.shutdown:info]: iSCSI service shutdown

Tue Apr  1 03:36:40 IST [gun-nas-2/gun-nas-1: fcp.service.shutdown:info]: FCP service shutdown

Tue Apr  1 03:36:45 IST [gun-nas-1: cf.rsrc.transitTime:notice]: Top Giveback transit times wafl=4958 {drain_msgs=2008, sync_clean=1090, finish=1066, giveback_sync=437, forget=353, vol_refs=3, abort_scans=1, mark_abort=0, wait_offline=0, wait_create=0}, snapmirror=635, wafl_gb_sync=553, ndmpd=366, nfsd=303, raid=232, registry_giveback=204, sanown_replay=164, vdisk=93, exports=34

Tue Apr  1 03:36:45 IST [gun-nas-1: callhome.sfo.giveback:info]: Call home for CONTROLLER GIVEBACK COMPLETE

Tue Apr  1 03:36:46 IST [gun-nas-1: cf.fm.givebackComplete:notice]: Failover monitor: giveback completed

Tue Apr  1 03:36:46 IST [gun-nas-1: cf.fm.givebackDuration:notice]: Failover monitor: giveback duration time is 8 seconds

fabrice_berrier

In /etc/messages on both nodes, search for the word "panic".
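
If a copy of each node's /etc/messages has been saved locally, that search can be scripted. Below is a minimal Python sketch, not an official NetApp tool. The function names are made up for illustration, and the extra search term "watchdog" is an assumption added because the logs posted above mention l2_watchdog_reset rather than a panic string.

```python
import re
from typing import Iterable, List, Tuple

# "panic" is the term suggested in this thread; "watchdog" is an added
# assumption based on the l2_watchdog_reset events in the posted logs.
DEFAULT_TERMS = ("panic", "watchdog")


def find_matches(lines: Iterable[str],
                 terms: Iterable[str] = DEFAULT_TERMS) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line containing any term, ignoring case."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_file(path: str,
              terms: Iterable[str] = DEFAULT_TERMS) -> List[Tuple[int, str]]:
    """Scan a locally saved copy of /etc/messages (the path is whatever you saved it as)."""
    # errors="replace" keeps the scan going if the log contains unexpected bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_matches(handle, terms)
```

For example, `scan_file("gun-nas-2_messages.txt")` (a hypothetical file name) returns the matching lines with their line numbers, so the entries just before the first hit can be checked on each node.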
