Re: How to physically move disks to another position or shelf without changing ownership. Story :)

Alfs29 · 2018-07-12

This is not a question but a real-life experience with moving disks, so any comments are welcome.

System: ONTAP 8.1.x 7-Mode HA.

Given: FAS3220 HA + 3 shelves (one shelf is not populated with drives; how to add a shelf to a stack is another story, and it can be googled and found).

Task: I want to physically move some disks from shelf 1 and shelf 2 to shelf 3. Why? To have 2 DS2246 shelves holding only data raid groups, with the ability to add more FULL shelves later as RG(x) if needed, and one shelf holding both root aggrs and possibly anything else I would like to keep out of the regular data_aggr (rg0, rg1, rg(x)).

1) If you have an aggregate (not the ROOT aggr!!!!!!!) that you want to move to another shelf (same controller, same owner, just another place) and you can take the aggregate offline, it is easy.

Just offline the LUN, vol and aggr, pull the drives and put them in the new place. Wait a minute or so until the controller sees them in the new place, then bring everything back online the same way you took it offline (in reverse order of the subtasks, of course 🙂).
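
The ordering described above (offline from the LUN down to the aggregate, online in reverse) can be written down as a small sketch. This is only an illustration: the function names and the object names (`/vol/vol1/lun0`, `vol1`, `aggr1`) are made-up placeholders, and the strings merely follow the 7-mode `lun`/`vol`/`aggr` `offline`/`online` command style. Whether `aggr offline` is accepted in normal mode for an aggregate that still contains volumes is questioned later in this thread, so treat the list as a checklist, not as something to paste into a console.

```python
def _objects(lun_path, volume, aggregate):
    # Innermost object first: LUN, then its volume, then the aggregate.
    return [("lun", lun_path), ("vol", volume), ("aggr", aggregate)]


def offline_plan(lun_path, volume, aggregate):
    """Commands to take everything offline, from the LUN down to the aggregate."""
    return [f"{cmd} offline {name}" for cmd, name in _objects(lun_path, volume, aggregate)]


def online_plan(lun_path, volume, aggregate):
    """Commands to bring everything back, in exactly the reverse order."""
    return [f"{cmd} online {name}" for cmd, name in reversed(_objects(lun_path, volume, aggregate))]


if __name__ == "__main__":
    # Placeholder names, assumed for illustration only.
    for step in offline_plan("/vol/vol1/lun0", "vol1", "aggr1"):
        print(step)
    print("# ... physically move the drives, wait until the controller sees them ...")
    for step in online_plan("/vol/vol1/lun0", "vol1", "aggr1"):
        print(step)
```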

2) A more challenging task is to move the root aggregate. If you can disable takeover and shut down (NOT THE TAKEOVER WAY!!!!!!!) the particular controller, then it is easy.

cf disable, halt .... move drives ... wait a minute or so until the other controller sees the partner drives in their new place .... boot the halted node, and that's it.

BE WARNED: NEVER NEVER NEVER TRY DOING THIS IN TAKEOVER MODE. It does not matter that the particular controller is halted .... the partner monitors the halted controller's mailbox disks, and if they "disappear" you will get a core dump on the last controller alive ---> more silvery hairs on your head!!!! Yes, it will recover (mostly), but if you are serving VM hosts from the taken-over data aggregate you will get freezes/reboots and possibly many, many other problems. One more time: DON'T DO THIS WITH THE ROOT AGGR in takeover mode.

So if you cannot halt the controller with all its respective LUNs etc. offline, then you must go the root MIGRATION way and move root to another aggregate. That procedure can be found by googling.
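
The decision for the root aggregate, as described in this post, can be summarised in a short sketch. It encodes only the rules stated above; the enum, the function and its two parameters are hypothetical names made up for illustration and are not part of any NetApp tool.

```python
from enum import Enum


class RootMove(Enum):
    ABORT = "do not touch the disks while the HA pair is in takeover mode"
    HALT_AND_MOVE = "cf disable, halt the node, move the drives, boot the node"
    MIGRATE_ROOT = "migrate root to another aggregate instead of moving disks"


def choose_root_move(in_takeover: bool, can_halt_node: bool) -> RootMove:
    """Pick an approach for relocating root aggregate disks (hypothetical helper)."""
    if in_takeover:
        # The partner watches the halted node's mailbox disks; pulling them
        # reportedly leads to a core dump on the surviving controller.
        return RootMove.ABORT
    if can_halt_node:
        return RootMove.HALT_AND_MOVE
    return RootMove.MIGRATE_ROOT


if __name__ == "__main__":
    print(choose_root_move(in_takeover=False, can_halt_node=True).value)
```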

I suspect that if you can halt both controllers (the whole system), then all you have to do is move the drives and power the system back on .... Pros, your comments are welcome on this last statement, as I haven't tried it that way .... I had enough fun/adrenaline already with the core dump when I tried moving the root aggr in takeover mode 😄

D_BEREZENKO · 2018-07-12

I do not see a reason why you could not just pull the disks out and put them back in one by one, one drive at a time.
I've done it a few times with two disks, and it worked for me without RAID reconstruction or any other problem.

WAFL can work perfectly well without a drive for a short period of time, and without RAID reconstruction after you put it back.
If RAID reconstruction does start for any reason, then of course it will take a long time to wait for, let's say, 12 drives. But in my experience, if you pull a drive and put it back in another shelf, no reconstruction happens.

Alfs29 · 2018-07-12

@D_BEREZENKO wrote:
I do not see a reason why you could not just pull the disks out and put them back in one by one, one drive at a time.
I've done it a few times with two disks, and it worked for me without RAID reconstruction or any other problem.

WAFL can work perfectly well without a drive for a short period of time, and without RAID reconstruction after you put it back.
If RAID reconstruction does start for any reason, then of course it will take a long time to wait for, let's say, 12 drives. But in my experience, if you pull a drive and put it back in another shelf, no reconstruction happens.

WOW WOW WOW .... how can a RAID array live for "some short time" (and what is a short time? pull out and put back?) without 1 drive and not start reconstruction to a spare etc. (assuming the aggregate is not marked offline) .... ???

OK, let it be a WAFL feature, but still .... if WAFL or a client needs data that is on that lost/removed drive, it will start reconstruction .... maybe there are some options you can turn on or off, but I guess this is a totally wrong approach.

aborzenkov · 2018-07-13

@D_BEREZENKO wrote:
I do not see a reason why you could not just pull the disks out and put them back in one by one, one drive at a time.
I've done it a few times with two disks, and it worked for me without RAID reconstruction or any other problem.

I do not see how it can work unless a) you do not have spares available and b) there is no write activity on the aggregate. Yes, NetApp can assimilate a "lost" drive under some conditions, but in this case a spare would likely kick in before you could put the drive back.

In any case, this is not how it is intended and documented to work, so it is definitely not something that anyone should recommend.

@D_BEREZENKO wrote:
WAFL can work perfectly well without a drive for a short period of time, and without RAID reconstruction after you put it back.

Define "short". Yes, WAFL can work for some time even under write load, as background disk firmware update demonstrates, but if you look carefully it still performs reconstruction, just partial, not full. And in that case it knows in advance that the drive is going to be removed temporarily. Here it is a surprise removal, which is equivalent to a drive failure.

Anyway, returning to the original post

@Alfs29 wrote:
Just offline lun, vol, aggr

You cannot offline an aggregate containing volumes in normal mode, so the usual disclaimer about diag mode applies. And in any case, an offline aggregate means loss of access to data, at which point you can just as well simply halt the filer. Otherwise yes, physical disk location does not matter - you can rearrange disks to your heart's content when the filer is not up.

It is always better to organize half an hour of planned downtime than to risk half a day of unplanned downtime. And in your case I do not even understand the purpose of this exercise.

D_BEREZENKO · 2018-07-13

Guys, I am telling you this from real experience. I do not know whether it is an official feature, and I do not remember on which versions it worked.

But it worked for me about 2-3 years ago on a FAS system with cDOT. It was surely 8.something ...

So check whether you can confirm this in your case.

PS. And of course, customers should keep hot spare disks in case a full reconstruction does occur.

AlexDawson · 2018-07-22

ouch.. these really aren't recommended practices.. please don't do this.

But the stories are interesting, for sure 🙂