Failed Disk + No Spares in Cluster Mode

TMADOCTHOMAS · ‎2020-03-03

Hello,

I have a pretty basic question that I cannot find an answer for in the knowledge base. Hopefully someone here can help!

First some background: we have a four node AFF8080 cluster. Unfortunately, the pro services company that set us up only configured one spare disk per node. The oldest of the systems has been running for four years and we just had our first failed disk early this morning (which is impressive, but that's another story). When the disk was in pre-fail state and copying to the spare, we got the low spares warning from Unified Manager and I got a call.

I am now waiting for the disk to arrive so I can replace the failed disk. In the meantime I looked around in the knowledge base to be sure I would know what to expect on the off-chance that a second disk failed before I could replace the one that had already failed. (While it seems unlikely, I have seen cases where suddenly multiple disks fail on a system).

I found the KB article shown below for 7-mode systems, but absolutely nothing for cluster mode. I spoke to a NetApp tech who was running the failed disk case but they just pointed me to other 7-mode articles, or a couple of cluster mode articles that didn't fit my scenario. Specifically: when you have 0 spare disks, what happens when a disk fails IN CLUSTER MODE? Can anyone shed light on the answer to this question and preferably point me to a KB article? Thanks so much!

https://kb.netapp.com/app/answers/answer_view/a_id/1004549/~/what-happens-when-a-disk-fails-and-does-not-have-a-hot-spare-installed%3F-

SpindleNinja · ‎2020-03-03

Give this a look through:

https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.onc-sm-help-900%2FGUID-C8D33937-4945-48F7-8EFA-D668176F1C53.html

Depending on your disk count and aggr config, one spare per node is probably enough for best practice. Granted, some people do opt for extra spares.

As far as the what-if goes, that's where RAID-DP comes into play: each RAID group has 2 parity disks.

GidonMarcus · ‎2020-03-03

Following the last message in the thread, I just want to make sure it's clear that one spare isn't the best practice.

https://www.netapp.com/us/media/tr-3437.pdf (see section 5.4 on page 13)

Gidi Marcus (Linkedin) - Storage and Microsoft technologies consultant - Hydro IT LTD - UK

SpindleNinja · ‎2020-03-04

I was checking newer sources due to OP saying AFF8080.

Page 152 of this is the same as the docs page: https://docs.netapp.com/ontap-9/topic/com.netapp.doc.onc-sm-help-900/Cluster%20management%20using%20System%20Manager%209.2%20and%20earlier.pdf

Also see page 10 of the Disk and aggr power guide:

https://docs.netapp.com/ontap-9/topic/com.netapp.doc.dot-cm-psmg/Disk%20and%20aggregate%20management.pdf

TMADOCTHOMAS · ‎2020-03-04

Thank you both for very helpful replies! I didn't even think to look in the System Manager guide.

From what I've read, it appears that two spares per node would be a better configuration, but if we were to lose two disks at the same time we wouldn't experience an outage, at least not for 24 hours. Having said that, it appears that there would likely be a performance hit until one or both failed disks were replaced. Of course the disks would be replaced well before 24 hours had passed so I wouldn't expect that to be an issue. Thank you both again for the links and tips!

SpindleNinja · ‎2020-03-04

No problem. And like I said, it can vary depending on comfort level. I have folks that want RAID-TEC and 2 spares per aggr (on drives below 15TB). At the other end there's the C190 case, where there are no spares.

TMADOCTHOMAS · ‎2020-03-04

@SpindleNinja I'm interested in your comment on the C190 as we were thinking about buying one for a remote office. Is that a specific recommendation for that model? Do you have the option of adding spare disks if you want them?

SpindleNinja · ‎2020-03-04

@TMADOCTHOMAS there are a lot of catches with the C190 (fewer cores, can't add a shelf, etc.) but it does have its fit with some use cases. One big advantage is lower cost, and a spare is a drive you're paying for that you're not using for data storage.

Most folks I'm working with are looking at the 8-12 drive option when it comes to the C190. And with 8 drives, that leaves 7 for data if you're thinking of a shared spare. But like you said in your first post, "first failed disk early this morning, which is impressive". The SSDs NetApp uses have a really low failure rate. And you have RAID-DP on top of it, so you could lose two drives and still be OK.

TMADOCTHOMAS · ‎2020-03-04

Thanks @SpindleNinja. If we deployed a C190 we would likely do 12 drives. For years we ran 2220s in 7-mode with 12 drives: a RAID4 aggregate on node 2, and all data on node 1 in a single RAID-DP aggregate, with one spare on each side. That gave us 4.1TB usable. We'd likely do something similar with the C190 if spares are an option. I'm aware of the other limits but wasn't aware you might not be able to use spares at all.

@paul_stejskal thanks for your comment as well! Yeah it's amazing how notable it was that one of our flash drives failed.

SpindleNinja · ‎2020-03-04

The note is "The AFF C190 defaults to no spare drive. This exception is fully supported."

I would think you can manually create an aggr with a spare, but if you use the wizard, it would default to none.

TMADOCTHOMAS · ‎2020-03-04

Ah, makes sense. Thank you! Good to know.

paul_stejskal · ‎2020-03-04

One spare should be fine. You can always reassign a spare from the partner node as a temporary measure if needed. You should have RAID-DP or RAID-TEC giving you 2x or 3x parity protection for that reason alone.
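
To make the parity arithmetic discussed in this thread concrete, here is a minimal Python sketch. It only counts parity disks per RAID group (1 for RAID4, 2 for RAID-DP, 3 for RAID-TEC) against disks that have failed and not been rebuilt, which is the situation when no spare is available. The function names and the "normal" / "degraded" / "failed" labels are made up for illustration; they are not ONTAP commands or ONTAP status strings, and the sketch ignores timers, performance impact, and anything else ONTAP does while degraded.

```python
# Parity disks per RAID group for each RAID type mentioned in the thread.
PARITY_DISKS = {"raid4": 1, "raid_dp": 2, "raid_tec": 3}


def raid_group_status(raid_type: str, failed_disks: int) -> str:
    """Classify one RAID group by its count of failed, not-yet-rebuilt disks.

    The returned labels are illustrative only, not ONTAP output.
    """
    if failed_disks < 0:
        raise ValueError("failed_disks must be non-negative")
    parity = PARITY_DISKS[raid_type]
    if failed_disks == 0:
        return "normal"
    if failed_disks <= parity:
        # Data is still served by reconstructing from parity.
        return "degraded"
    # More disks lost than the parity disks can cover.
    return "failed"


def remaining_tolerance(raid_type: str, failed_disks: int) -> int:
    """How many more disk failures the RAID group can absorb."""
    return max(PARITY_DISKS[raid_type] - failed_disks, 0)


if __name__ == "__main__":
    # One failed disk in a RAID-DP group with no spare to rebuild onto.
    print(raid_group_status("raid_dp", 1), remaining_tolerance("raid_dp", 1))
```

With no spare, the failed count stays where it is until a replacement disk arrives, so a RAID-DP group with one failed disk can still absorb one more failure, and a RAID-TEC group two more.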