You can't replace a failed disk in a RAID group with a smaller disk. It appears you have expanded the degraded RAID group with a smaller-capacity drive, which is not what you intended. You should probably open a support case at this point to review your options.
Because you added this disk to the aggregate, you cannot remove it now (you can replace it with a bigger disk once you have restored RAID functionality).
If you still have an empty slot, now is the time to add a correctly sized disk and let the system do the rebuild.
If you don't have a spare disk/slot in the first place, I would suggest changing your aggregate's RAID level to RAID4. That frees up one of the parity disks, which can then be used as a data disk and allow you to restore the aggregate. This suggestion will leave you in a very degraded situation, so it's at your own risk of course….
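For reference, a minimal sketch of that RAID level change on 7-Mode, assuming the aggregate name "bigaggr" used later in this thread. Note that this permanently drops double-parity protection until you convert back with raidtype raid_dp:

```
aggr options bigaggr raidtype raid4
```

The freed dparity disk then shows up as a spare and can be used for reconstruction of the failed data disk.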
Thank you for your detailed answers and solutions!
May I please ask some additional questions (confirmations) about answers 2 and 3?
ANSWER 2: Does it mean I cannot remove it now, but if I have an empty slot and add a new 1TB hard drive to that slot, I will have a chance to replace it with the 1TB disk? As a result, will the 500GB disk be removable from this aggregate? If yes, could you give me a hint on how to do that?
ANSWER 3: Actually I still have an empty slot (0c.00.10), so I am planning to add a 1TB hard drive to it. If I follow the procedure below, will the rebuild start automatically?
1) Insert the new 1TB hard drive into the empty slot (0c.00.10).
2) The hard drive will then be assigned as a spare disk. I run "disk zero spares" to zero it.
3) After zeroing finishes, I run "aggr add bigaggr -d 0c.00.10" to add the spare disk to this aggr.
Then the rebuild will start. Is my thinking and procedure correct?
Sorry for asking so many beginner questions. If you can give me a reply, it will be a really big help.
Thank you for giving me such a helpful and detailed explanation! The procedure you gave me is really good learning for me. Thank you very much!
Following your advice, I'm going to add a new 1TB disk and make it a spare (I still have an empty slot in bay 0c.00.10, so I will use this slot). Then, as you mentioned, I will wait until the rebuild is completed.
At this point I have only two questions:
1) After the rebuild is completed, will the following line in aggr status disappear? And can I see the remaining rebuild progress while it is rebuilding?
data FAILED N/A 847555/1735794176
2) I know the 500GB disk in this aggregate is not a suitable size, but is it still functioning for this aggregate? You mentioned I should run the disk replace command to move the 500GB disk to a new 1TB disk, but I won't have an empty slot anymore after I add a new 1TB disk to bay 0c.00.10. So can I just keep the 500GB disk in this aggregate?
Anyway, at this point I don't want to take any risks, so I will focus on keeping the aggregate healthy.
You are really giving me big help! Thank you very much!
Yes, as you said, the "empty slot" I mentioned actually has a failed hard drive inserted. I think the "data FAILED" entry in aggregate "bigaggr" is it.
I knew it was broken only because its LED turned amber, but I can't see any information about it in the "sysconfig -r" output. That's why I called it an empty slot. As you pointed out, strictly speaking I should have said it is a failed disk.
Below is the current output of "sysconfig -r".
RAID Disk Device    HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------    ------------- ---- ---- ----- ---- ----------------- -----------------
dparity   0c.00.11  0c  0     11  SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
parity    0c.00.9   0c  0     9   SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
data      0c.00.6   0c  0     6   SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
data      0c.00.7   0c  0     7   SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
data      FAILED        N/A                            847555/1735794176
data      0c.00.5   0c  0     5   SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
data      0c.00.0   0c  0     0   SA:A  0   SATA  7200 847555/1735794176 847884/1736466816
data      0c.00.3   0c  0     3   SA:A  0   SATA  7200 423111/866531584  423946/868242816
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device    HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------    ------------- ---- ---- ----- ---- ----------------- -----------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare     0c.00.8   0c  0     8   SA:A  0   SATA  7200 423111/866531584  423946/868242816
As you advised, my next step is to replace the failed hard drive (the "empty slot" 0c.00.10) with a new 1TB hard drive, and let the new 1TB drive become a spare disk.
After that, I will watch for the rebuild to start automatically, and confirm that the "data FAILED" entry disappears from the aggregate (bigaggr). I think keeping the aggregate healthy is my most important mission now...
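A minimal sketch of the commands for this plan on 7-Mode, assuming disk ownership is assigned automatically (otherwise the new disk would first need "disk assign"):

```
disk zero spares          # zero the new spare so reconstruction can use it right away
sysconfig -r              # during rebuild, the replacement disk row shows reconstruction with a percentage
aggr status -r bigaggr    # per-aggregate view of the same RAID/reconstruction status
```

Once reconstruction completes, the "data FAILED" row should be gone from both views.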
I'm really a beginner with NetApp, so if my plan has any problems and you can point them out to me, it will really be a big help!!
Thank you for your kind explanation! Very helpful!
You have one spare disk, and a larger spare can also stand in for smaller disks. So what I said before still holds:
physically replace the 500GB spare with a 1TB spare
run "disk replace" to swap the 500GB disk in bigaggr with the 1TB spare; this will leave you with a 500GB spare again
physically replace that 500GB disk with one more 1TB disk
This will leave you with one 1TB spare suitable for both aggr0 and bigaggr.
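As a rough sketch of the "disk replace" step, assuming 7-Mode and the disk names from the sysconfig -r output above: 0c.00.3 is the 500GB data disk in bigaggr, and 0c.00.8 is assumed here to be the name the new 1TB spare gets if it goes into the old spare's bay (the actual name depends on the bay used):

```
disk replace start 0c.00.3 0c.00.8   # copy 0c.00.3's contents onto the 1TB spare, then swap it out
disk replace status                  # monitor the copy progress
```

The copy runs in the background, so the aggregate stays online; the old 500GB disk becomes a spare when the copy finishes.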
If you have the possibility to do so, I'd consider moving the contents of aggr0 to bigaggr, replacing the 500GB disks with 1TB disks and adding them to bigaggr. This would improve space utilization and resiliency, and may give a slight performance boost.