EF & E-Series, SANtricity, and Related Plug-ins

Another E2600 disaster. Can it be restored?

agapotron

We have an E2600 that has gone into lockdown mode. It all started with a report of a failed drive. Simple enough: we pulled a similar drive from one of our spare units and swapped it in. The unit immediately went into an endless reboot loop. Well, not quite endless, since it gives up after 5 attempts. Here is what it reports on the serial console (a sketch of how we captured this follows the log):

 

10/31/18-19:10:00 (tRAID): SOD Sequence is Normal, 0 on controller A
10/31/18-19:10:00 (tRAID): NOTE: loading flash file: ECTGenOEM
10/31/18-19:10:00 (tRAID): NOTE: unloading flash file: ECTGenOEM
10/31/18-19:10:01 (tRAID): NOTE: Turning on tray summary fault LED
10/31/18-19:10:01 (tRAID): NOTE: SODRebootLoop- Limit:5 Cnt:1
10/31/18-19:10:01 (tRAID): NOTE: Installed Protocols: <MTPs: SAS USB > <ITPs: RDMA IPOIB > <STPs: FCP RmtDMA iSCSI SAS >
10/31/18-19:10:01 (tRAID): NOTE: Required Protocols: <MTPs: SAS > <ITPs: UNK > <STPs: SAS >
10/31/18-19:10:03 (tRAID): NOTE: DSM: Current revision 7
10/31/18-19:10:03 (tRAID): NOTE: SYMBOL: SYMbolAPI registered.
10/31/18-19:10:03 (tRAID): NOTE: cmgrCtlr: Board Manager HostBoardData Model Name for slot 0: None
10/31/18-19:10:03 (tRAID): NOTE: RCBBitmapManager total RPA size = 1761607680
10/31/18-19:10:04 (tRAID): NOTE: init: ioc: 0, PLVersion: 13-075-70-00
10/31/18-19:10:04 (IOSched): NOTE: SAS Expander Added: expDevHandle:x11 enclHandle:x2 numPhys:33 port:2 ioc:0 channel:0 numEntries:22
10/31/18-19:10:04 (IOSched): NOTE: SAS Expander Added: expDevHandle:x12 enclHandle:x3 numPhys:33 port:3 ioc:0 channel:1 numEntries:22
10/31/18-19:10:05 (tSasExpChk): NOTE: Local Expander Firmware Version: 00.00.80.13
10/31/18-19:10:05 (tSasExpChk): NOTE: Enabling this controller's expander phys, channel=0
10/31/18-19:10:05 (sasEnPhys): NOTE: sasEnableDrivePhysTask: 0 phys enabled on channel 0
10/31/18-19:10:06 (tSasExpChk): NOTE: Alternate Expander Firmware Version: 00.00.80.13
10/31/18-19:10:06 (IOSched): NOTE: SAS Expander Added: expDevHandle:x15 enclHandle:x4 numPhys:17 port:2 ioc:0 channel:0 numEntries:17
10/31/18-19:10:06 (IOSched): NOTE: SAS Expander Added: expDevHandle:x16 enclHandle:x5 numPhys:17 port:2 ioc:0 channel:0 numEntries:17
10/31/18-19:10:06 (IOSched): NOTE: SAS Expander Added: expDevHandle:x17 enclHandle:x6 numPhys:17 port:2 ioc:0 channel:0 numEntries:17
10/31/18-19:10:06 (IOSched): NOTE: SAS Expander Added: expDevHandle:x18 enclHandle:x7 numPhys:17 port:2 ioc:0 channel:0 numEntries:17
10/31/18-19:10:06 (tSasExpChk): NOTE: Enabling alternate controller's expander phys, channel=1
10/31/18-19:10:06 (sasEnPhys): NOTE: sasEnableDrivePhysTask: 0 phys enabled on channel 1
10/31/18-19:10:07 (DcmmTask): NOTE: DCM Inquiry revision for itn:15 is 049A
10/31/18-19:10:07 (DcmmTask): NOTE: DCM Inquiry revision for itn:28 is 049A
10/31/18-19:10:07 (DcmmTask): NOTE: DCM Inquiry revision for itn:41 is 049A
10/31/18-19:10:08 (DcmmTask): NOTE: DCM Inquiry revision for itn:54 is 049A
10/31/18-19:10:11 (IOSched): NOTE: SAS Expander Added: expDevHandle:x4d enclHandle:x8 numPhys:17 port:2 ioc:0 channel:0 numEntries:17
10/31/18-19:10:11 (DcmmTask): NOTE: DCM Inquiry revision for itn:67 is 049A
10/31/18-19:10:12 (tSasDiscCom): NOTE: SAS: Initial Discovery Complete Time: 41 seconds since last power on/reset, 10 seconds since sas instantiated
10/31/18-19:10:12 (tRAID): NOTE: CrushMemoryPoolMgr: platform and memory (CPU MemSize:2048), adjusted allocating size to 262144 for CStripes
10/31/18-19:10:14 (tRAID): SOD: Instantiation Phase Complete
0xe4abf0 (tNetCfgInit): miiPhyInit check cable connection
10/31/18-19:10:26 (tNetCfgInit): NOTE: eth0: LinkDown event
10/31/18-19:10:29 (tNetCfgInit): NOTE: eth1: LinkUp event
10/31/18-19:10:30 (tNetCfgInit): NOTE: Acquiring network parameters for interface gei1 using DHCP
10/31/18-19:10:34 (ipdhcpc): NOTE: netCfgDhcpReplyCallback :: received OFFER on interface gei1, unit 1
10/31/18-19:10:35 (ipdhcpc): NOTE: DHCP server: 10.3.1.4
10/31/18-19:10:35 (ipdhcpc): WARN: **WARNING** The DHCP Server did not assign a permanent IP for gei1.
10/31/18-19:10:35 (ipdhcpc): WARN: Network access to this controller may eventually fail.
10/31/18-19:10:35 (ipdhcpc): NOTE: Client IP routers: 10.3.1.4
10/31/18-19:10:35 (ipdhcpc): NOTE: Client DNS name servers: 10.1.1.12
10/31/18-19:10:35 (ipdhcpc): NOTE: DNS domain name: gi.ucsc.edu
10/31/18-19:10:35 (ipdhcpc): NOTE: Assigned IP address: 10.3.200.1
10/31/18-19:10:35 (ipdhcpc): NOTE: Assigned subnet mask: 255.255.0.0
10/31/18-19:10:35 (tNetReset): NOTE: Network Ready
10/31/18-19:10:43 (tRAID): WARN: Failed to open Inter-Controller Communication Channels!
10/31/18-19:10:43 (tRAID): NOTE: LockMgr Role is Master
10/31/18-19:10:45 (tHckReset): NOTE: Firmware running line - OFF
10/31/18-19:10:46 (tHckReset): NOTE: HealthCheck: Alt Ctl: 1 Reset_Failure, state: 0 Start
10/31/18-19:11:13 (tRAID): NOTE: WWN baseName 00080080-e5247358 (valid==>SoftRst)
10/31/18-19:11:14 (tRAID): SOD: Pre-Initialization Phase Complete
10/31/18-19:11:22 (IOSched): WARN: SSM: Start Unit to Devnum 0x00010036 FAILED (error: 2)! Drive will not be available
10/31/18-19:12:01 (tRAID): WARN: ACS: testCommunication: Domi Exception caught: IconSendInfeasibleException Error
10/31/18-19:12:01 (tRAID): NOTE: ACS: autoCodeSync(): Process start. Comm Status: 0
10/31/18-19:12:01 (tRAID): WARN: ACS: autoCodeSync(): Skipped since alt not communicating.
10/31/18-19:12:02 (DcmmTask): WARN: Starting recovery on DCM Id: 1 (0x00000000), faultType: 2, force: False
10/31/18-19:12:02 (DcmmTask): WARN: Starting recovery on DCM Id: 3 (0x00000000), faultType: 2, force: False
10/31/18-19:12:02 (DcmmTask): WARN: Starting recovery on DCM Id: 5 (0x00000000), faultType: 2, force: False
10/31/18-19:12:02 (DcmmTask): WARN: Starting recovery on DCM Id: 7 (0x00000000), faultType: 2, force: False
10/31/18-19:12:02 (DcmmTask): WARN: Starting recovery on DCM Id: 9 (0x00000000), faultType: 2, force: False
10/31/18-19:12:03 (tRAID): NOTE: iocfi: SOD Bad Check-in: IconSendInfeasibleException Error
10/31/18-19:12:04 (tRAID): SOD: Code Synchronization Initialization Phase Complete
10/31/18-19:12:04 (NvpsPersistentSyncM): NOTE: NVSRAM Persistent Storage updated successfully
10/31/18-19:12:05 (tRAID): NOTE: SAFE: Process new features
10/31/18-19:12:05 (tRAID): NOTE: SAFE: Process legacy features
10/31/18-19:12:07 (tRAID): NOTE: IOManager::restoreData - dataSize:0x2800000, startAddress:0x4779f7e00
10/31/18-19:12:09 (tRAID): NOTE: ncb::IOManager::restoreData - Successful
10/31/18-19:12:09 (tRAID): NOTE: OBB Restore From Flash Completed
10/31/18-19:12:10 (bdbmBGTask): ERROR: sendFreezeStateToAlternate: Exception IconSendInfeasibleException Error
10/31/18-19:12:10 (tRAID): NOTE: SPM acquireObjects exception: IconSendInfeasibleException Error
10/31/18-19:12:10 (tRAID): NOTE: sas: SOD Bad Check-in: IconSendInfeasibleException Error
10/31/18-19:12:10 (tRAID): NOTE: ion: SOD Bad Check-in: IconSendInfeasibleException Error
10/31/18-19:12:11 (tRAID): WARN: Unable to intialize mirror device
10/31/18-19:12:11 (tRAID): NOTE: CacheMgr::cacheOpenMirrorDevice:: mirror device 0xfffffff
10/31/18-19:12:11 (tRAID): NOTE: CCM:readCheckpointFromNvsram for this 0x0 alt 0xffffffff
10/31/18-19:12:11 (tRAID): NOTE: CCM: validateCacheMem() cache memory is invalid
10/31/18-19:12:11 (tRAID): NOTE: CCM: validateCacheMem() Initializing my partition
10/31/18-19:12:12 (tRAID): WARN: CCM: setupRcbBitmapRegion(): Continue to use pstor/dbm for RCB bitmaps
10/31/18-19:12:12 (tRAID): NOTE: Cache: Switching pstor intent logging to disk mode, cacheBufsPerSyncRegion: 500
10/31/18-19:12:12 (tRAID): WARN: CCM: startRestoreProcessing() local backup was not complete, request alt takeover
10/31/18-19:12:12 (tRAID): NOTE: startRestoreProcessing: m_BackupInfoRecord.backupStatus = IN_PROGRESS
10/31/18-19:12:12 (tRAID): WARN: Backup Status in pstor does not match the status stored in NVSRVM, honoring NVSRAM
10/31/18-19:12:12 (tRAID): NOTE: Cache: Restoring RCB region of this controller ...
10/31/18-19:12:12 (tRAID): NOTE: Cache: RCB region restore of this controller completed
10/31/18-19:12:12 (tRAID): NOTE: Cache: Restoring RCB region of alternate controller ...
10/31/18-19:12:12 (tRAID): NOTE: Cache: RCB region restore of alternate controller completed
10/31/18-19:12:12 (tRAID): NOTE: restoreRCBs: m_MetadataIndexTable.size(): 308536, addr 0x3dccf7c
10/31/18-19:12:12 (tRAID): NOTE: Cache: Dirty cache blocks to be restored: 0
10/31/18-19:12:12 (tRAID): NOTE: Cache: Restoring dirty cache blocks ...
10/31/18-19:12:12 (tRAID): WARN: CCM: informRestoreCompleteToAlt() caught IconSendInfeasibleException Error
10/31/18-19:12:12 (tRAID): NOTE: Cache: Restore Completed
10/31/18-19:12:12 (tRAID): WARN: CCM: initialize() - exchangeRestoreStatusAlt caught IconSendInfeasibleException Error
10/31/18-19:12:12 (tRAID): WARN: CCM Failed to notify the alternate to update Cache Store in pstor, writeCacheStoreToPstor() caught IconSendInfeasibleException Error
10/31/18-19:12:12 (tRAID): WARN: CCM: sodMirrorCheckIn(): Software Version Mismatch - mirror checkin skipped
10/31/18-19:12:12 (tRAID): WARN: CCM: mirrorInit() alternate did not check in
10/31/18-19:12:13 (tRAID): WARN: CCM Failed to notify the alternate to update Cache Store in pstor, writeCacheStoreToPstor() caught IconSendInfeasibleException Error
10/31/18-19:12:13 (tRAID): NOTE: CCM: sodClearMOSIntentsAlt(), failure clearing MOS intents on alt
10/31/18-19:12:13 (tRAID): NOTE: doRecovery: myMemory:1, IORCB:0, NEW:0, OLD:0
10/31/18-19:12:13 (tRAID): NOTE: buildBackupCCBList: mirror:0, unmirror:0, wb:0, wbm:0, wt:0, backupListSize:0
10/31/18-19:12:13 (tRAID): NOTE: doRecovery: myMemory:0, IORCB:0, NEW:0, OLD:78
10/31/18-19:12:13 (tRAID): WARN: CCM Failed to notify the alternate to update Cache Store in pstor, writeCacheStoreToPstor() caught IconSendInfeasibleException Error
10/31/18-19:12:13 (tRAID): WARN: PSTOR: PstorRecordMgr: removeRecord failed
10/31/18-19:12:13 (tRAID): WARN: deleteBackupStatus: caught pstorRecordNotFoundException Line 1893 File ncbIOManager.cc
10/31/18-19:12:15 (tRAID): NOTE: UWMgr findIWLogs: Found IW log drive. Devnum 0x1003a tray=99 slot=59 ssd=0 qos=6 controller=0
10/31/18-19:12:15 (tRAID): NOTE: UWMgr findIWLogs: Found IW log drive. Devnum 0x1003b tray=99 slot=60 ssd=0 qos=6 controller=0
10/31/18-19:12:20 (tRAID): WARN: *** Volume 0 is now DEGRADED ***
10/31/18-19:12:20 (tRAID): WARN: *** Volume 1 is now DEGRADED ***
10/31/18-19:12:20 (tRAID): WARN: *** Volume 2 is now DEGRADED ***
10/31/18-19:12:21 (tRAID): WARN: *** Volume 3 is now DEGRADED ***
10/31/18-19:12:21 (tRAID): PANIC: CrushDrive::allocateExtent - Exception type N3adp6Device24ExtentAllocatedExceptionE message "N3adp6Device24ExtentAllocatedExceptionE"with extent:2787 of size:5578 for drive ordinal35

Stack Trace for tRAID:

******* At this point, some garbled characters appear, and the whole sequence starts again *******
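
In case it helps anyone hitting the same thing, here is roughly how the console output above can be captured. This is just a minimal pyserial sketch; the device path and baud rate are assumptions for our USB-serial adapter, so substitute whatever your adapter and the controller's console port actually use.

# Minimal serial-console capture sketch (assumes the pyserial package is installed).
# /dev/ttyUSB0 and 115200 are assumptions for our USB adapter -- adjust to match
# whatever the controller's serial console actually expects.
import serial

PORT = "/dev/ttyUSB0"   # assumed device path of the USB-to-serial adapter
BAUD = 115200           # assumed console speed

with serial.Serial(PORT, BAUD, timeout=5) as console, \
        open("e2600-console.log", "ab") as logfile:
    while True:
        line = console.readline()          # returns b"" on timeout
        if line:
            logfile.write(line)
            logfile.flush()
            print(line.decode("ascii", errors="replace"), end="")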

 

Any suggestions on what we can do to recover the data? The unit is old and out of warranty. We do have some saved files for it under /var/opt/SM from our SANtricity monitoring setup. I would think the data on the disks is still good except for the one failed drive, and since the unit was using disk pools it should be able to withstand a couple of drive failures... Of course, this unit was being used to back up our main system, and as soon as we lost it we found ourselves needing the backed-up data. Any help would be extremely appreciated. Thanks!
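
For what it's worth, here is a quick Python sketch to take stock of what is actually saved under /var/opt/SM before talking to support. It is purely a read-only listing; nothing in it is specific to the array, and the base path is just where our monitoring install keeps its files.

# List everything under /var/opt/SM, newest first, so the most recent saved
# configuration files are easy to spot. Read-only; changes nothing on disk.
import os
import time

BASE = "/var/opt/SM"   # where our SANtricity monitoring files live

found = []
for root, _dirs, files in os.walk(BASE):
    for name in files:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue
        found.append((st.st_mtime, st.st_size, path))

for mtime, size, path in sorted(found, reverse=True):
    print(time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime)),
          f"{size:>10}", path)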

3 REPLIES

eseries_trailblazer

After looking at the error messages and consulting with some of our experts internally, I'd recommend reaching out to NetApp Support, since the recovery process could be pretty involved.

agapotron

I have reached out to NetApp support, and they weren't very helpful. They chastised me for not having the unit under maintenance and asked why we were even using an end-of-life unit in the first place. Eventually, they rather grudgingly agreed to give me a quote (which they warned would probably be very expensive) to put the unit back on maintenance so that they could then investigate whether anything can be done about this. At this point I'm wondering if there are any other options out there...

eseries_trailblazer

Getting the unit back under maintenance would be the best course of action.
