There are not enough spare disks.... Need pointing in the right direction

NetAppCommChris · ‎2018-05-31

Hi there. My name is Chris. Long story short, I came on board at a company whose two network engineers left at about the same time (not on good terms, I believe). The company has a NetApp FAS3020 running Data ONTAP 7.3.5.1P1. Of course there is no service contract on it, and I'm pretty sure it has already reached EOL. The IT manager knows basically nothing about it other than the login credentials and the fact that all our VMware VMs run on it.

I've never used a NetApp SAN. All I have are the login credentials, and I can already tell there are a few things wrong with it. See attached image below.

In addition, it looks like a complete RAID group has failed. Beyond that, we have a spare 4th shelf, complete with disks, sitting there untouched. I don't know why we are not using it, but I would like to install it in the rack and connect it to expand our storage space and provide more spare disks for the RAID groups. From what I can tell, there seems to be only 1 spare disk left.

Could anyone point me in the right direction on fixing the error message about not enough spare disks, and to a guide on how to install and integrate the spare 4th shelf?

Any help would be appreciated.

Thanks.

Here is the output of sysconfig -r

Aggregate aggr0 (failed, raid_dp, foreign, partial) (block checksums)
Plex /aggr0/plex0 (offline, failed, inactive)
RAID group /aggr0/plex0/rg1 (partial)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 272000/557056000
parity FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data 0d.41 0d 2 9 FC:B - FCAL 10000 272000/557056000 280104/573653840 (prefail)
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
data FAILED N/A 272000/557056000
Raid group is missing 11 disks.
Plex is missing 2 RAID groups.

Aggregate aggr2 (online, raid_dp) (block checksums)
Plex /aggr2/plex0 (online, normal, active)
RAID group /aggr2/plex0/rg0 (normal)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0b.16 0b 1 0 FC:B - ATA 7200 847555/1735794176 847827/1736350304
parity 0b.17 0b 1 1 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0b.18 0b 1 2 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0b.19 0b 1 3 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.20 0c 1 4 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.21 0c 1 5 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.22 0c 1 6 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0c.23 0c 1 7 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0b.24 0b 1 8 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0b.25 0b 1 9 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.26 0c 1 10 FC:A - ATA 7200 847555/1735794176 847827/1736350304
data 0b.27 0b 1 11 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0c.28 0c 1 12 FC:A - ATA 7200 847555/1735794176 847827/1736350304

Aggregate aggr1 (online, raid_dp) (block checksums)
Plex /aggr1/plex0 (online, normal, active)
RAID group /aggr1/plex0/rg0 (normal)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0d.16 0d 1 0 FC:B - FCAL 10000 272000/557056000 280104/573653840
parity 0d.17 0d 1 1 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0a.18 0a 1 2 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0a.22 0a 1 6 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0d.39 0d 2 7 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0a.19 0a 1 3 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0a.20 0a 1 4 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0a.23 0a 1 7 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0d.25 0d 1 9 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.21 0d 1 5 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0a.26 0a 1 10 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0a.32 0a 2 0 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0a.33 0a 2 1 FC:A - FCAL 10000 272000/557056000 280104/573653840
data 0d.34 0d 2 2 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.35 0d 2 3 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.36 0d 2 4 FC:B - FCAL 10000 272000/557056000 280104/573653840

RAID group /aggr1/plex0/rg1 (normal)

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0d.37 0d 2 5 FC:B - FCAL 10000 272000/557056000 280104/573653840
parity 0d.27 0d 1 11 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.38 0d 2 6 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.28 0d 1 12 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0d.29 0d 1 13 FC:B - FCAL 10000 272000/557056000 280104/573653840
data 0a.44 0a 2 12 FC:A - FCAL 10000 272000/557056000 274845/562884296
data 0d.42 0d 2 10 FC:B - FCAL 10000 272000/557056000 274845/562884296
data 0d.45 0d 2 13 FC:B - FCAL 10000 272000/557056000 274845/562884296
data 0d.43 0d 2 11 FC:B - FCAL 10000 272000/557056000 274845/562884296
data 0a.40 0a 2 8 FC:A - FCAL 10000 272000/557056000 280104/573653840

Spare disks

RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare 0b.29 0b 1 13 FC:B - ATA 7200 847555/1735794176 847827/1736350304
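
For readers who want to check the spare situation in output like the above, here is a minimal Python sketch that tallies spares per disk type. It assumes the whitespace-separated column order shown in the `sysconfig -r` listing (role, device, HA, shelf, bay, channel, pool, type, RPM, used, phys); the function names `parse_rows` and `spares_by_type` are hypothetical helpers, not part of any NetApp tool, and only a handful of rows are copied in as sample data.

```python
from collections import Counter

# A few rows copied from the sysconfig -r output above.
SAMPLE = """\
dparity 0b.16 0b 1 0 FC:B - ATA 7200 847555/1735794176 847827/1736350304
data 0d.41 0d 2 9 FC:B - FCAL 10000 272000/557056000 280104/573653840 (prefail)
data FAILED N/A 272000/557056000
dparity 0d.16 0d 1 0 FC:B - FCAL 10000 272000/557056000 280104/573653840
spare 0b.29 0b 1 13 FC:B - ATA 7200 847555/1735794176 847827/1736350304
"""

# First-column values that mark a disk row (headers and separators are skipped).
ROLES = {"dparity", "parity", "data", "spare"}


def parse_rows(text):
    """Turn disk rows into dicts; assumes the column order shown in the listing."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ROLES:
            continue
        if len(fields) >= 2 and fields[1] == "FAILED":
            # Missing disk: no device or type information is printed.
            rows.append({"role": fields[0], "device": None, "type": None,
                         "failed": True, "prefail": False})
            continue
        if len(fields) < 11:
            continue  # not a complete disk row
        rows.append({"role": fields[0], "device": fields[1], "type": fields[7],
                     "failed": False, "prefail": "(prefail)" in line})
    return rows


def spares_by_type(rows):
    """For every disk type in use, count how many spares of that type exist."""
    in_use = Counter(r["type"] for r in rows
                     if r["role"] != "spare" and not r["failed"])
    spares = Counter(r["type"] for r in rows if r["role"] == "spare")
    return {disk_type: spares.get(disk_type, 0) for disk_type in in_use}


print(spares_by_type(parse_rows(SAMPLE)))  # {'ATA': 1, 'FCAL': 0}
```

On the sample rows this reports one ATA spare and no FCAL spare, which matches the listing: the only spare, 0b.29, is an ATA disk.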

aborzenkov · ‎2018-05-31

The failed aggregate is foreign, so I suspect it is a ghost left over from a second-hand disk, probably used as a replacement. Such disks are often sold “as is”, without being zeroed first, so NetApp detects that the disk was once part of an aggregate. Anyway, if your VMware admins are not screaming yet, there was probably nothing important on it even if I’m wrong 🙂

GidonMarcus · ‎2018-05-31

hi

Agree with the last comment. You can technically destroy this foreign aggregate and use the disk as a spare, but since it is already marked as "prefail" it might not stay a spare candidate for long (it will fail itself).

Note that the one actual spare disk you have is of a type (ATA) that can only replace a failed disk in aggr2 and cannot replace a failed disk in aggr1. You also didn't mention what type of shelf and I/O module the other shelf has. It could be a third type, or not compatible.

Anyhow, this hardware is 10-15 years old. There have been three generations of disk shelves and heads since, and a fourth is coming. It's about time for this system to be decommissioned. I think that touching it and trying to expand it adds more risk to its stability, and getting parts for it will not be easy.

Gidi

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

NetAppCommChris · ‎2018-06-01

I see. Yes, it looks like it has prefailed, so I wouldn't even bother using it as a spare.

Ohh, you're right about that spare. I did not see that before. But now that I think of it (and correct me if I'm wrong), isn't the whole purpose of having aggregates in a RAID setup to allow for disks going bad? I mean, that is what the parity is for. So even if I don't have a spare, I shouldn't sweat it, because if a disk does go bad we can just order a new one and swap it in to rebuild the RAID group.

Beyond that, even though the nature of RAID is fault tolerance, I would feel much safer knowing that there are at least two appropriately sized spares for each disk size (847827 MB and 280104 MB). That is why I want to add that shelf. The other shelf that we have is a DS14MK2 (pic below). It looks very similar to the first 3 shelves. We installed it in the rack and powered it on not too long ago. It's simply not connected via the fiber optic cable.

I know this system is very old, but they do not want to spend any money on upgrading to a new system. Therefore I just want to make sure this system lasts as long as it can.

What do you think the best course of action would be?

1. Blow away aggr0, remove all the disks, replace them with new 850GB 7200RPM disks, and keep those as spares, or
2. Add the 4th shelf, zero all the disks on that shelf, and keep those as spares?

Or, in lieu of spares, I could just join them to the aggregate, thus increasing the storage capacity.

aborzenkov · ‎2018-06-01

1. It is possible to rely on a “cold spare”; in the end it depends on how valuable the data is and how quickly you can notice the problem and replace the failed disk. Your RAID group will be unprotected (or less protected) until the rebuild has completed; any error during this time means potential data loss.

2. The problem is not disk size (a larger disk can be used as a spare for a smaller one) but disk type. By default, FC-AL and ATA disks cannot be mixed in one aggregate.

3. Your unused shelf is AT, which means it cannot be used as a spare for the small FC-AL disks.

I think there was an option to allow it; you need to check the documentation. Think twice before doing it, though, if there is any serious load on these disks.
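
The rule in points 2 and 3 can be written down as a small check. This is only a simplified model of the rule as described in this thread, not of how Data ONTAP actually selects spares: the `Disk` class, the `can_replace` function and the `allow_type_mixing` flag are all hypothetical names, and the flag merely stands in for the unnamed option mentioned above. Sizes are taken from the `sysconfig -r` output earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    name: str
    disk_type: str  # "FCAL" or "ATA", as printed in the Type column
    used_mb: int    # capacity the RAID group uses on this disk
    phys_mb: int    # physical capacity of the disk


def can_replace(spare, member, allow_type_mixing=False):
    """Simplified model: the type must match (unless mixing is allowed),
    and the spare must be at least as large as the space used on the member."""
    if spare.disk_type != member.disk_type and not allow_type_mixing:
        return False
    return spare.phys_mb >= member.used_mb


# Values copied from the output earlier in the thread.
spare = Disk("0b.29", "ATA", 847555, 847827)
aggr1_disk = Disk("0d.16", "FCAL", 272000, 280104)
aggr2_disk = Disk("0b.16", "ATA", 847555, 847827)

print(can_replace(spare, aggr2_disk))  # True: same type, large enough
print(can_replace(spare, aggr1_disk))  # False: ATA spare, FC-AL member
# Size alone would not be the obstacle if type mixing were allowed:
print(can_replace(spare, aggr1_disk, allow_type_mixing=True))  # True
```

The last call shows why size is not the obstacle here: the ATA spare is larger than the FC-AL disks, so only the type rule blocks it.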