Root aggregate and rebuilds

brendanheading · 2010-10-06

Hi all,

I know this subject has been done to death, but I was hoping to clarify a few details around aggregate consistency checks (WAFL_check), specifically in the scenario common on smaller FASen where there is no dedicated root aggregate.

1. The idea behind having a separate root aggregate is that the controller in question can be brought up quickly in the event of corruption. But it doesn't affect the length of time the data being served will be inaccessible, since the aggregate being checked must remain offline until the check completes (apart from the obvious fact that, all other things being equal, an aggregate without a root volume would have slightly less data to check). Is this correct?

2. Is it possible to leave a volume online even though it has been marked for a consistency check, in order (for example) to delay having to take the filer offline until a convenient point, e.g. outside business hours? If so, what are the risks? Looking at the documentation, it does look as if there is an option to start even if the volume is marked inconsistent. In my head I'm thinking that, in an emergency, one could use snapmirror to move volumes over to another aggregate on the other controller and serve them from there with minimal downtime.

3. Is it possible to estimate how long a check can take, given the size of the aggregate and the amount of active data stored therein? Given the scenario where we've got two 11-disk aggregates (9+2 parity) on a FAS2020, I'm hoping that it's something that would take hours rather than days.

4. Do I understand correctly that the wafliron command can bring up the volumes on an aggregate one by one, as they are checked off, provided that there are no root volumes therein? If so, it sounds like the quickest option to return to service would be to keep a recent copy (e.g. with snapmirror) of the root volume; in the event of a failure, redesignate the SM copy as the "real" root and then run wafliron?

erick_moore · 2010-12-06

Let's bump up this thread because these are all valid and important questions. I have heard it both ways from NetApp on whether you really need a separate aggregate for your root volume. Some seem to think having it separated is a must; others seem to think that it isn't needed anymore with the newer releases of ONTAP. I am somewhat torn on the issue. We have never experienced any kind of data corruption, even when we had the power pulled on both controllers of an HA pair during peak processing. The part where I am torn, though, is the "what if" case. My concern, like yours, is how long my data will be unavailable. If the system has to work through the entire aggregate the root volume sits on before bringing up the system, that could be an issue, especially if you have your root vol sitting on a large 64-bit aggregate.

Can anyone from NetApp chime in here? What about system integrators: best practices, things you have seen in the field? I too am curious, as we are thinking about forgoing the separate root aggregate if there are sufficient means to bring data online quickly.

Thanks,

E-