Hi Alex,
First of all, thank you very much for your reply.
However, it seems we now have a different problem. Let me explain it better than last time:
## The system had a problem and was down:
Mon Jan 28 07:04:59 GMT [localhost: rc:notice]: The system was down for 155109 seconds
## After that, the panic string appeared:
Mon Jan 28 07:50:08 CET [netapp1:sk.panic:ALERT]: Panic String: NVRAM contents are invalid... in SK process rc on release 8.1.2P4
## I don't know the reason, but all aggregates showed this issue as well:
messages.0:Mon Jan 28 08:04:35 CET [netapp1:raid.vol.reparity.issue:warning]: Aggregate aggr3_fc has invalid NVRAM contents.
messages.0:Mon Jan 28 08:04:35 CET [netapp1:raid.vol.reparity.issue:warning]: Aggregate aggr1 has invalid NVRAM contents.
messages.0:Mon Jan 28 08:04:35 CET [netapp1:raid.vol.reparity.issue:warning]: Aggregate aggr0 has invalid NVRAM contents.
To sum up, aggr0 has been unable to complete a reconstruction with several different disks; every attempt failed in the end:
Mon Jan 28 08:04:35 CET [netapp1:raid.vol.reparity.issue:warning]: Aggregate aggr0 has invalid NVRAM contents.
Mon Jan 28 08:04:36 CET [netapp1:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID '4d5c0920-2ac7-11df-8f8f-00a0982321ca' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
Mon Jan 28 08:04:36 CET [netapp1:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID '4d5c0920-2ac7-11df-8f8f-00a0982321ca' was built in 112 msec, after scanning 36 inodes and restarting 25 times with a final result of success.
Mon Jan 28 08:05:10 CET [netapp1:raid.rg.recons.resume:debug]: /aggr0/plex0/rg0: resuming reconstruction, using disk 1a.32 (block 529792, 0% complete)
Mon Jan 28 08:16:03 CET [netapp1:raid.rg.recons.done:debug]: /aggr0/plex0/rg0: reconstruction completed for 0c.32 in 10:52.56
Mon Jan 28 08:38:10 CET [netapp1:raid.rg.spares.low:warning]: /aggr0/plex0/rg0
Mon Jan 28 10:20:10 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/0c.32 Shelf 2 Bay 0 [NETAPP X269_WMARS01TSSX NA00] S/N [WD-WMATV4808646] failed.
Mon Jan 28 10:20:24 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Mon Jan 28 10:20:24 CET [netapp1:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr0/plex0/rg0: No matching disks available in spare pool
Mon Jan 28 11:00:00 CET [netapp1:monitor.raid.brokenDisk:warning]: data disk in RAID group /aggr0/plex0/rg0 is broken.
Mon Jan 28 16:47:53 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Mon Jan 28 16:47:53 CET [netapp1:raid.rg.recons.info:notice]: Spare disk 0c.32 will be used to reconstruct one missing disk in RAID group /aggr0/plex0/rg0.
Mon Jan 28 16:47:53 CET [netapp1:raid.rg.recons.start:notice]: /aggr0/plex0/rg0: starting reconstruction, using disk 0c.32
Mon Jan 28 16:52:49 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/0c.32 Shelf 2 Bay 0 [NETAPP X269_HJUPI01TSSX NA01] S/N [HZ3723PL] failed.
Mon Jan 28 16:52:49 CET [netapp1:raid.rg.disk.reconstruction.failed:notice]: /aggr0/plex0/rg0: reconstruction failed for a disk in the raidgroup
Mon Jan 28 16:52:49 CET [netapp1:raid.rg.recons.aborted:notice]: /aggr0/plex0/rg0: reconstruction aborted at disk block 5248 after 4:56.04
Mon Jan 28 16:52:49 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Mon Jan 28 16:52:49 CET [netapp1:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr0/plex0/rg0: No matching disks available in spare pool
Mon Jan 28 17:00:00 CET [netapp1:monitor.raiddp.vol.singleDegraded:warning]: data disk in RAID group /aggr0/plex0/rg0 is broken.
Mon Jan 28 23:13:29 CET [netapp1:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID '4d5c0920-2ac7-11df-8f8f-00a0982321ca' was built in 0 msec, after scanning 0 inodes and restarting -1 times with a final result of starting.
Mon Jan 28 23:13:29 CET [netapp1:wafl.aggr.btiddb.build:info]: Buftreeid database for aggregate 'aggr0' UUID '4d5c0920-2ac7-11df-8f8f-00a0982321ca' was built in 79 msec, after scanning 36 inodes and restarting 23 times with a final result of success.
Mon Jan 28 23:13:45 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Mon Jan 28 23:13:45 CET [netapp1:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr0/plex0/rg0: No matching disks available in spare pool
Mon Jan 28 23:14:02 CET [netapp1:monitor.raid.brokenDisk:warning]: data disk in RAID group /aggr0/plex0/rg0 is broken.
Mon Jan 28 23:19:00 CET [netapp1:raid.rg.spares.low:warning]: /aggr0/plex0/rg0
Tue Jan 29 00:00:00 CET [netapp1:monitor.raiddp.vol.singleDegraded:warning]: data disk in RAID group /aggr0/plex0/rg0 is broken.
Wed Jan 30 13:54:18 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Wed Jan 30 13:54:18 CET [netapp1:raid.rg.recons.info:notice]: Spare disk 0c.34 will be used to reconstruct one missing disk in RAID group /aggr0/plex0/rg0.
Wed Jan 30 13:54:18 CET [netapp1:raid.rg.recons.start:notice]: /aggr0/plex0/rg0: starting reconstruction, using disk 0c.34
Wed Jan 30 14:03:59 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/1a.34 Shelf 2 Bay 2 [NETAPP X269_HJUPI01TSSX NA01] S/N [N03479DL] failed.
Wed Jan 30 14:03:59 CET [netapp1:raid.rg.disk.reconstruction.failed:notice]: /aggr0/plex0/rg0: reconstruction failed for a disk in the raidgroup
Wed Jan 30 14:03:59 CET [netapp1:raid.rg.recons.aborted:notice]: /aggr0/plex0/rg0: reconstruction aborted at disk block 5248 after 9:41.17
Wed Jan 30 14:04:10 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Wed Jan 30 14:04:10 CET [netapp1:raid.rg.recons.info:notice]: Spare disk 1a.44 will be used to reconstruct one missing disk in RAID group /aggr0/plex0/rg0.
Wed Jan 30 14:04:10 CET [netapp1:raid.rg.recons.start:notice]: /aggr0/plex0/rg0: starting reconstruction, using disk 1a.44
Wed Jan 30 14:13:29 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/1a.44 Shelf 2 Bay 12 [NETAPP X269_HJUPI01TSSX NA01] S/N [N12V1VZL] failed.
Wed Jan 30 14:13:29 CET [netapp1:raid.rg.disk.reconstruction.failed:notice]: /aggr0/plex0/rg0: reconstruction failed for a disk in the raidgroup
Wed Jan 30 14:13:29 CET [netapp1:raid.rg.recons.aborted:notice]: /aggr0/plex0/rg0: reconstruction aborted at disk block 5248 after 9:18.82
Wed Jan 30 14:13:29 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Wed Jan 30 14:13:29 CET [netapp1:raid.rg.recons.info:notice]: Spare disk 1a.42 will be used to reconstruct one missing disk in RAID group /aggr0/plex0/rg0.
Wed Jan 30 14:13:59 CET [netapp1:raid.rg.recons.start:notice]: /aggr0/plex0/rg0: starting reconstruction, using disk 0c.42
Wed Jan 30 14:35:35 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/0c.42 Shelf 2 Bay 10 [NETAPP X269_HJUPI01TSSX NA01] S/N [J80PGUBL] failed.
Wed Jan 30 14:35:35 CET [netapp1:raid.rg.disk.reconstruction.failed:notice]: /aggr0/plex0/rg0: reconstruction failed for a disk in the raidgroup
Wed Jan 30 14:35:35 CET [netapp1:raid.rg.recons.aborted:notice]: /aggr0/plex0/rg0: reconstruction aborted at disk block 5248 after 21:35.78
Wed Jan 30 14:35:35 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Wed Jan 30 14:35:35 CET [netapp1:raid.rg.recons.info:notice]: Spare disk 0c.32 will be used to reconstruct one missing disk in RAID group /aggr0/plex0/rg0.
Wed Jan 30 14:35:35 CET [netapp1:raid.rg.recons.start:notice]: /aggr0/plex0/rg0: starting reconstruction, using disk 0c.32
Wed Jan 30 14:40:58 CET [netapp1:raid.config.filesystem.disk.failed:error]: File system Disk /aggr0/plex0/rg0/0c.32 Shelf 2 Bay 0 [NETAPP X269_HJUPI01TSSX NA01] S/N [J80PH3UL] failed.
Wed Jan 30 14:40:58 CET [netapp1:raid.rg.disk.reconstruction.failed:notice]: /aggr0/plex0/rg0: reconstruction failed for a disk in the raidgroup
Wed Jan 30 14:40:58 CET [netapp1:raid.rg.recons.aborted:notice]: /aggr0/plex0/rg0: reconstruction aborted at disk block 5248 after 5:23.22
Wed Jan 30 14:40:58 CET [netapp1:raid.rg.recons.missing:notice]: RAID group /aggr0/plex0/rg0 is missing 1 disk(s).
Wed Jan 30 14:40:58 CET [netapp1:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /aggr0/plex0/rg0: No matching disks available in spare pool
Wed Jan 30 15:00:00 CET [netapp1:monitor.raiddp.vol.singleDegraded:warning]: data disk in RAID group /aggr0/plex0/rg0 is broken.
So we now have aggr0 in the following state:
Aggregate aggr0 (online, raid_dp, degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (degraded, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 1a.16 1a 1 0 FC:A - ATA 7200 847555/1735794176 847827/1736350304
parity 0c.17 0c 1 1 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 1a.18 1a 1 2 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 1a.19 1a 1 3 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.20 0c 1 4 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.21 0c 1 5 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 1a.22 1a 1 6 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 1a.23 1a 1 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.24 0c 1 8 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.25 0c 1 9 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 1a.26 1a 1 10 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.27 0c 1 11 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.28 0c 1 12 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 1a.29 1a 1 13 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data FAILED N/A 847555/ -
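In case it helps, these are the commands I am using to check the state (standard 7-Mode commands, as far as I know):
    aggr status -r aggr0    (the RAID layout shown above)
    aggr status -s          (list the spare disks and whether they are zeroed)
    vol status -f           (list the broken/failed disks)
I can send the full output of any of them if you need it.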
But every time we try to replace a disk, we see the message "has bad label", so it is my understanding that I have to apply the following commands:
disk unfail -s 1a.32
disk zero spares
After applying them, the reconstruction process seems to start, but it fails again in the end, as you can see in the messages above.
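For reference, this is the full sequence I am trying each time (1a.32 is just the latest example, and I run disk unfail from advanced mode since, as far as I know, it requires it; please correct me if this is not the right procedure):
    priv set advanced
    disk unfail -s 1a.32    (clear the failed/"bad label" state and make the disk a spare again)
    priv set
    disk zero spares        (zero the new spare so reconstruction can use it)
    aggr status -s          (check that the disk shows up as a zeroed spare)
    aggr status -r aggr0    (watch whether the reconstruction restarts)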
Do you have any idea, a new approach, or any comment on this behaviour?
Thanks in advance!
Regards
Cristian