ONTAP Discussions
Hello,
We recently did a detailed health check of our NetApp system and found about 6 disks in a bad state.
We replaced those 6 disks, but when the new disks were inserted, at least 9 more disks went to FAILED state. We now have 12 disks in FAILED state.
Before replacing those 6 disks, we had changed the cables connecting one shelf to our two controllers (Stor A and Stor B).
I will try to explain the change:
Stor A, port 3b, was connected to the lower connector of the shelf.
Stor B, port 3b, was connected to the upper connector of the same shelf.
I thought it would be better to have Stor A on the upper connector and Stor B on the lower one, so I swapped the cables.
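I have not yet verified the disk paths after that recabling. If Stor A will stay up long enough, I was planning to check them with something like the commands below (I am assuming these are the right commands for our 7-Mode version, and that every disk should still show two paths after the swap; please correct me if not):

*> storage show disk -p (should list the primary and secondary path for each disk)
*> fcadmin device_map (should show how the shelves and disks map onto each FC loop)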
After this change we had to halt both controllers (because of an unrelated problem).
Two weeks later I booted both controllers. Stor B was working fine, but Stor A was rebooting all the time. Then the 6 new disks were installed, and 9 disks went to FAILED state right after that change.
How could it be possible for 9 disks to fail at the same time?
What do you recommend? I don't want to replace 12 disks without any plausible explanation.
Since the 6 disks were replaced, Stor A has been rebooting constantly.
I entered the Maintenance mode shell, and this is what I can see:
*> disk show -v
Local System ID: 15xxx
DISK OWNER POOL SERIAL NUMBER DR HOME CHKSUM
------------ ------------- ----- ------------- ------------- -------
0c.27 FAILED Block
0a.45 FAILED Block
0c.24 FAILED Block
0a.35 FAILED Block
0a.26 Stor01 (151697145) Pool0 3LM5PDB200009909GPWQ Stor01 (151697145) Block
0a.42 Stor01 (151697145) Pool0 3LM5P33K000099090SLV Stor01 (151697145) Block
0c.38 Stor01 (151697145) Pool0 3LM5P33Q000099090NTN Stor01 (151697145) Block
0c.36 Stor01 (151697145) Pool0 3LM5P4JD00009909GPXX Stor01 (151697145) Block
0a.40 Stor01 (151697145) Pool0 3LM5P4F800009909GTJ4 Stor01 (151697145) Block
0a.32 Stor01 (151697145) Pool0 3LM5P4WG00009909NXNT Stor01 (151697145) Block
0c.28 Stor01 (151697145) Pool0 3LM5QGVY00009909P06D Stor01 (151697145) Block
0c.34 Stor01 (151697145) Pool0 3LM5P4Q800009909PXP5 Stor01 (151697145) Block
0a.28 Stor01 (151697145) Pool0 3LM5PC7000009909FZ3D Stor01 (151697145) Block
0a.22 Stor01 (151697145) Pool0 3LM5PC9Q00009907552X Stor01 (151697145) Block
0a.20 Stor01 (151697145) Pool0 JHW2869C Stor01 (151697145) Block
0c.20 Stor01 (151697145) Pool0 JLVGSU1C Stor01 (151697145) Block
0a.25 Stor02 (151697133) Pool0 3LM5PC1000009909GPC7 Stor02 (151697133) Block
0a.29 Stor02 (151697133) Pool0 3LM5PDZZ00009909FZPW Stor02 (151697133) Block
0c.41 Stor02 (151697133) Pool0 3LM5P4RJ00009909G0BG Stor02 (151697133) Block
0c.35 Stor02 (151697133) Pool0 3LM5P39V000099090UJS Stor02 (151697133) Block
0c.44 Stor02 (151697133) Pool0 3LM5QG7000009909PXP3 Stor02 (151697133) Block
0a.44 Stor02 (151697133) Pool0 3LM5P30A00009909GSE2 Stor02 (151697133) Block
0c.25 Stor02 (151697133) Pool0 3LM5NYC500009909FZ7G Stor02 (151697133) Block
0c.37 Stor02 (151697133) Pool0 3LM5P2WN00009909GSEG Stor02 (151697133) Block
0c.45 Stor02 (151697133) Pool0 3LM5QDGG00009909PX9E Stor02 (151697133) Block
0c.32 Stor01 (151697145) Pool0 3LM5P3EA000099077F9C Stor01 (151697145) Block
0c.21 Stor02 (151697133) Pool0 3LM5PDWE00009909PTBF Stor02 (151697133) Block
0c.43 Stor02 (151697133) Pool0 3LM1H0B900009745VZAJ Stor02 (151697133) Block
0a.16 Stor01 (151697145) Pool0 3LM5PC0J00009909FYSV Stor01 (151697145) Block
0c.33 Stor02 (151697133) Pool0 3LM5P3JM00009908AL3X Stor02 (151697133) Block
0a.21 Stor02 (151697133) Pool0 3LM5PC6X00009909FWQ4 Stor02 (151697133) Block
0c.29 Stor02 (151697133) Pool0 3LM5NYC700009909N0U7 Stor02 (151697133) Block
0a.23 Stor02 (151697133) Pool0 3LM5PBYV00009909GPFB Stor02 (151697133) Block
0a.33 Stor02 (151697133) Pool0 3LM5PB2Y00009909FZEN Stor02 (151697133) Block
0a.19 Stor02 (151697133) Pool0 3LM5PDHL00009909GPVT Stor02 (151697133) Block
0c.19 Stor02 (151697133) Pool0 3LM5P2X600009909PXDH Stor02 (151697133) Block
0c.23 Stor02 (151697133) Pool0 3LM5QDJH00009909MZ9W Stor02 (151697133) Block
0a.17 Stor02 (151697133) Pool0 3LM5P3JC00009908ANSP Stor02 (151697133) Block
0a.37 Stor02 (151697133) Pool0 3LM5P30Z000099090P4M Stor02 (151697133) Block
0c.39 Stor02 (151697133) Pool0 3LM5QG4900009909FV89 Stor02 (151697133) Block
0a.39 Stor02 (151697133) Pool0 3LM5P4LX00009909PTJA Stor02 (151697133) Block
3b.18 Stor01 (151697145) Pool0 J813LWGL Stor01 (151697145) Block
3b.16 Stor01 (151697145) Pool0 J817RR5L Stor01 (151697145) Block
3b.20 Stor01 (151697145) Pool0 J81623XL Stor01 (151697145) Block
3b.26 Stor01 (151697145) Pool0 J80Z1LML Stor01 (151697145) Block
4b.38 Stor01 (151697145) Pool0 PBJ10JZE Stor01 (151697145) Block
0c.17 (101183107) Pool0 JLVE95JC (101183107) Block
3b.22 Stor01 (151697145) Pool0 J815GWYL Stor01 (151697145) Block
4b.40 Stor01 (151697145) Pool0 PBJ1T6YE Stor01 (151697145) Block
4b.34 Stor01 (151697145) Pool0 PBJ1J6BE Stor01 (151697145) Block
3b.24 Stor01 (151697145) Pool0 J814TLBL Stor01 (151697145) Block
4b.36 Stor01 (151697145) Pool0 PBJ1KUYE Stor01 (151697145) Block
4b.32 Stor01 (151697145) Pool0 PBJ0Y8KE Stor01 (151697145) Block
3b.28 Stor01 (151697145) Pool0 J815GWRL Stor01 (151697145) Block
4b.42 Stor01 (151697145) Pool0 PBJ1U8YE Stor01 (151697145) Block
4b.44 Stor01 (151697145) Pool0 PBJ109VE Stor01 (151697145) Block
4b.18 Stor01 (151697145) Pool0 PAKUD3HE Stor01 (151697145) Block
4b.16 Stor01 (151697145) Pool0 PAKSAXZE Stor01 (151697145) Block
4b.22 Stor01 (151697145) Pool0 PAKUBLDE Stor01 (151697145) Block
4b.24 Stor01 (151697145) Pool0 PAKUDY5E Stor01 (151697145) Block
4b.20 Stor01 (151697145) Pool0 PAKU2BLE Stor01 (151697145) Block
4b.26 Stor01 (151697145) Pool0 PAKUG05E Stor01 (151697145) Block
4b.23 Stor02 (151697133) Pool0 PAKS9RHE Stor02 (151697133) Block
4b.21 Stor02 (151697133) Pool0 PAKUDTNE Stor02 (151697133) Block
4b.43 Stor02 (151697133) Pool0 PBJ1NN2E Stor02 (151697133) Block
4b.37 Stor02 (151697133) Pool0 PBJ0Y8NE Stor02 (151697133) Block
4b.25 Stor02 (151697133) Pool0 PAKUD17E Stor02 (151697133) Block
4b.19 Stor02 (151697133) Pool0 PAKUBZ6E Stor02 (151697133) Block
3b.27 Stor02 (151697133) Pool0 J815GW0L Stor02 (151697133) Block
4b.45 Stor02 (151697133) Pool0 PBJ10LXE Stor02 (151697133) Block
4b.41 Stor02 (151697133) Pool0 PBJ1U8XE Stor02 (151697133) Block
4b.17 Stor02 (151697133) Pool0 PAKSKJTE Stor02 (151697133) Block
4b.35 Stor02 (151697133) Pool0 PBJ0Y8EE Stor02 (151697133) Block
4b.39 Stor02 (151697133) Pool0 PBJ1S2ZE Stor02 (151697133) Block
3b.17 Stor02 (151697133) Pool0 J814TKXL Stor02 (151697133) Block
4b.33 Stor02 (151697133) Pool0 PBJ10J0E Stor02 (151697133) Block
3b.25 Stor02 (151697133) Pool0 J811YHTL Stor02 (151697133) Block
3b.19 Stor02 (151697133) Pool0 J814V9GL Stor02 (151697133) Block
3b.23 Stor02 (151697133) Pool0 J815GY8L Stor02 (151697133) Block
3b.21 Stor02 (151697133) Pool0 J813J9GL Stor02 (151697133) Block
3b.29 Stor02 (151697133) Pool0 J813LELL Stor02 (151697133) Block
0a.27 Stor02 (151697133) Pool0 JLVGH9UC Stor02 (151697133) Block
0c.22 Stor02 (151697133) Pool0 JLVGGUEC Stor02 (151697133) Block
0c.18 (101183107) Pool0 JLVEKWSC (101183107) Block
0a.43 Stor02 (151697133) Pool0 JLVGH1VC Stor02 (151697133) Block
0c.16 (101183107) Pool0 JLVGSPSC (101183107) Block
0c.26 (101183107) Pool0 JLVGPASC (101183107) Block
0a.41 FAILED 3LM5P398000099090U5Y Block
0a.34 FAILED 3LM5P3JN00009909GQ6K Block
0c.42 FAILED 3LM5P4NJ00009909GQYE Block
0a.24 FAILED 3LM5P4G6000099090PSC Block
0a.36 FAILED 3LM5P4RD00009909PTMM Block
0a.18 FAILED 3LM5PBZK00009909GPCY Block
0a.38 FAILED 3LM5P3FZ00009909FZNT Block
0c.40 FAILED 3LM5QGHA00009909NZ4F Block
*> aggr status
Jul 14 11:46:09 [localhost:fmmb.current.lock.disk:info]: Disk 0a.16 is a local HA mailbox disk.
Jul 14 11:46:09 [localhost:fmmb.instStat.change:info]: normal mailbox instance on local side.
Jul 14 11:46:09 [localhost:coredump.host.spare.none:info]: No sparecore disk was found for host 0.
Jul 14 11:46:09 [localhost:raid.assim.rg.missingChild:error]: Aggregate aggr1, rgobj_verify: RAID object 0 has only 7 valid children, expected 12.
Jul 14 11:46:09 [localhost:raid.assim.plex.missingChild:error]: Aggregate aggr1, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
Jul 14 11:46:09 [localhost:raid.assim.mirror.noChild:ALERT]: Aggregate aggr1, mirrorobj_verify: No operable plexes found.
Jul 14 11:46:09 [localhost:raid.assim.rg.missingChild:error]: Aggregate aggr0, rgobj_verify: RAID object 0 has only 7 valid children, expected 12.
Jul 14 11:46:09 [localhost:raid.assim.plex.missingChild:error]: Aggregate aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
Jul 14 11:46:09 [localhost:raid.assim.mirror.noChild:ALERT]: Aggregate aggr0, mirrorobj_verify: No operable plexes found.

Aggr    State    Status           Options
aggr0   failed   raid_dp, aggr    diskroot, lost_write_protect=off
                 partial
aggr1   failed   raid_dp, aggr    lost_write_protect=off
                 partial
aggr2   online   raid_dp, aggr    nosnap=on
                 32-bit
aggr3   online   raid_dp, aggr    nosnap=on
                 32-bit

No root aggregate or root traditional volume found.
You must specify a root aggregate or traditional volume with
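If it would help, I can also run the command below from the Maintenance shell and post the output (assuming it is available there), to see exactly which members aggr0 and aggr1 are missing and whether they all sit on the same loop:

*> aggr status -r (per-RAID-group view; should list the missing/failed disks of each failed aggregate)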
From the latest logs while it is rebooting in a loop:
Jul 14 10:18:59 [localhost:fci.adapter.link.online:info]: Fibre Channel adapter 0c link online.
Jul 14 10:19:01 [localhost:fci.device.quiesce:debug]: Adapter 0a encountered a command timeout on Disk device 0a.24 (0x05000018) LUN 0 cdb 0x28:0000a3f0:0008 retry: 2 Quiescing the device.
Jul 14 10:19:01 [localhost:scsi.cmd.checkCondition:debug]: Disk device 0a.24: Check Condition: CDB 0x28:0000a3f0:0008: Sense Data SCSI:aborted command - (0xb - 0x8 0x1 0x81)(40439).
Jul 14 10:19:01 [localhost:scsi.cmd.noMorePaths:debug]: Disk device 0a.24: No more paths to device: cdb 0x28:0000a3f0:0008. All retries have failed.
Jul 14 10:19:01 [localhost:disk.senseError:error]: Disk 0a.24: op 0x28:0000a3f0:0008 sector 41968 SCSI:aborted command - (b 8 1 81)
Jul 14 10:19:01 [localhost:cf.nm.nicReset:warning]: HA interconnect: Initiating soft reset on card 0 due to rendezvous reset.
Jul 14 10:19:01 [localhost:rv.connection.torndown:info]: HA interconnect: cfo_rv is torn down on NIC 0.
Jul 14 10:19:01 [localhost:cf.rv.notConnected:error]: HA interconnect: Connection for 'cfo_rv' failed.
Jul 14 10:19:01 [localhost:cf.nm.nicTransitionDown:warning]: HA interconnect: Link down on NIC 0.
Jul 14 10:19:01 [localhost:cf.rv.notConnected:error]: HA interconnect: Connection for 'cfo_rv' failed.
Jul 14 10:19:03 [localhost:cf.nm.nicTransitionUp:info]: HA interconnect: Link up on NIC 0.
Jul 14 10:19:04 [localhost:rv.connection.established:info]: HA interconnect: cfo_rv is connected on NIC 0.
Jul 14 10:19:04 [localhost:fmmb.current.lock.disk:info]: Disk 0a.16 is a local HA mailbox disk.
Jul 14 10:19:04 [localhost:fmmb.instStat.change:info]: normal mailbox instance on local side.
Jul 14 10:19:08 [localhost:fmmb.current.lock.disk:info]: Disk 0c.19 is a partner HA mailbox disk.
Jul 14 10:19:08 [localhost:fmmb.current.lock.disk:info]: Disk 0a.23 is a partner HA mailbox disk.
Jul 14 10:19:08 [localhost:fmmb.instStat.change:info]: normal mailbox instance on partner side.
Jul 14 10:19:08 [localhost:cf.fm.partner:info]: Failover monitor: partner 'Stor02'
Jul 14 10:19:08 [localhost:coredump.host.spare.none:info]: No sparecore disk was found for host 0.
Jul 14 10:19:08 [localhost:raid.assim.rg.missingChild:error]: Aggregate aggr0, rgobj_verify: RAID object 0 has only 7 valid children, expected 12.
Jul 14 10:19:08 [localhost:raid.assim.plex.missingChild:error]: Aggregate aggr0, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
Jul 14 10:19:08 [localhost:raid.assim.mirror.noChild:ALERT]: Aggregate aggr0, mirrorobj_verify: No operable plexes found.
Uptime: 8m7s
Jul 14 10:19:08 [localhost:raid.assim.rg.missingChild:error]: Aggregate aggr1, rgobj_verify: RAID object 0 has only 7 valid children, expected 12.
Jul 14 10:19:08 [localhost:raid.assim.plex.missingChild:error]: Aggregate aggr1, plexobj_verify: Plex 0 only has 0 working RAID groups (1 total) and is being taken offline
Jul 14 10:19:08 [localhost:raid.assim.mirror.noChild:ALERT]: Aggregate aggr1, mirrorobj_verify: No operable plexes found.
System rebooting...
Phoenix TrustedCore(tm) Server
Copyright 1985-2006 Phoenix Technologies Ltd.
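Since the command timeouts in these logs are on adapter 0a, I was also thinking of checking the adapters and the loop error counters before replacing any more disks, with something like the commands below (again assuming they apply to our version; my assumption is that a bad cable or shelf module would show up as link/CRC errors there):

*> storage show adapter (state and details of the 0a/0c adapters)
*> fcadmin link_stats 0a (link error counters for the 0a loop)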
Thank you very much