Hi all,
I know this subject has been done to death but I was hoping to clarify a few details around aggregate consistency checks (wafl_check), specifically in the scenario common on smaller FASen where there is no dedicated root aggregate.
1. The idea behind having a separate root aggregate is that the controller in question can be brought up quickly in the event of corruption. But it doesn't affect how long the data will be inaccessible, since the aggregate being checked must remain offline until the check completes (apart from the obvious fact that, all other things being equal, an aggregate without a root volume would have slightly less data to check). Is this correct?
2. Is it possible to leave a volume online even though it has been marked for a consistency check, in order (for example) to delay taking the filer offline until a convenient point, e.g. outside business hours? If so, what are the risks? The documentation does appear to describe an option to start even if the volume is marked inconsistent. My thinking is that, in an emergency, one could use snapmirror to move volumes over to another aggregate on the other controller and serve them from there with minimal downtime.
3. Is it possible to estimate how long a check will take, given the size of the aggregate and the amount of active data stored on it? In our scenario, with two 11-disk aggregates (9+2 parity) on a FAS2020, I'm hoping that it would take hours rather than days.
4. Do I understand correctly that the wafliron command can bring up the volumes on an aggregate one by one, as they are checked off, provided that there are no root volumes on it? If so, it sounds like the quickest way to return to service would be to keep a recent copy (e.g. with snapmirror) of the root volume and, in the event of a failure, redesignate the SM copy as the "real" root and then run wafliron. Is that right?