Hot Shelf Removal - check for mailbox disks before removing ownership

scottgelb · ‎2015-07-27

BURT 931034 was filed for this - thank you Steve! It is worth noting for anyone searching on hot shelf removal that one extra check is needed. The hot shelf removal feature is great, but the documentation at https://library.netapp.com/ecmdocs/ECMP1367947/html/GUID-2B80FBD2-007D-4D19-8EB1-9CCEED211001.html
is missing a check. The edge case is when you had mroot on those disks previously and then moved mroot to other disks/shelves ahead of the hot removal.

If the disks on the shelf or shelves being removed include mailbox disks, the node will crash right after you remove ownership from them (even though they are all zeroed spares). The HA partner will not be able to take over, and will then also crash with an inconsistent mailbox. We had both nodes drop after removing ownership from a shelf that was being removed. It is an easy fix to get to maintenance mode and run mailbox destroy local, but add the check to your run book for hot shelf removal. We didn't have any data on the HA pair for the maintenance (the great thing about cDOT, which averted any outage even during this maintenance). We recovered by reassigning the disks and destroying the mailbox (had to do that twice); both nodes then came back up after the coredump, the mailbox was recreated, and storage failover could be enabled again.

::> set adv
::*> stor failover mailbox-disk show
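
A minimal sketch of that run-book check, assuming you have already copied the mailbox disk names out of the command output above by hand and have a list of the disks on the shelf being removed. The function name and the sample disk names are hypothetical, and nothing here parses ONTAP output or talks to a cluster.

```python
def mailbox_disks_on_shelf(mailbox_disks, shelf_disks):
    """Return the mailbox disks that sit on the shelf being removed, sorted."""
    return sorted(set(mailbox_disks) & set(shelf_disks))


# Hypothetical example data: names typed in by hand, not real command output.
mailbox = ["0c.00.0", "0c.00.1"]
to_remove = ["0c.00.0", "0c.00.1", "0c.00.2", "0c.00.3"]

blocked = mailbox_disks_on_shelf(mailbox, to_remove)
if blocked:
    print("Do NOT remove ownership yet; mailbox disks on this shelf:", blocked)
else:
    print("No mailbox disks on this shelf.")
```

If the list comes back non-empty, stop and deal with the mailbox before removing ownership from anything on that shelf.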

One other note on the move mroot procedure at https://kb.netapp.com/support/index?page=content&id=3013873&locale=en_US
The article now says to turn on nvfail (I submitted that update a while ago), but some systems also had create_ucode enabled on vol0, so I check "vol options vol0" before the move, then match those settings after the move in addition to nvfail. Also, if you have cdpd.enable on and ip.fastpath.enable off, both need to be set again, since the mroot move resets them to the defaults of cdpd off and fastpath on (8.3P1).
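
A small sketch of the before/after comparison, assuming you record the settings by hand into two dictionaries before and after the mroot move. The function name and all values below are hypothetical example data (the "after" values for nvfail and create_ucode in particular are made up for illustration); it does not parse "vol options" output.

```python
def option_drift(before, after):
    """Return {option: (old, new)} for each option whose value changed or is missing."""
    drift = {}
    for name, old in before.items():
        new = after.get(name)
        if new != old:
            drift[name] = (old, new)
    return drift


# Hypothetical values noted down by hand around the mroot move.
before = {"nvfail": "on", "create_ucode": "on",
          "cdpd.enable": "on", "ip.fastpath.enable": "off"}
after = {"nvfail": "off", "create_ucode": "off",
         "cdpd.enable": "off", "ip.fastpath.enable": "on"}

for name, (old, new) in sorted(option_drift(before, after).items()):
    print(f"{name}: was {old}, now {new} -> set back to {old}")
```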

Darkstar · ‎2015-07-28

Can't you just "cf disable" (or rather "storage failover modify...") before removing disk ownership from those 2 mailbox disks, and re-enable it again afterwards?

aborzenkov · ‎2015-07-28

Mailboxes must also be present without HA.

Darkstar · ‎2015-07-28

Yes, but they are not actively used. And if one fails or is removed, the system should automatically elect another disk as a replacement (otherwise every failed mailbox disk would crash the filer in the same way). So if you give the system enough time between the "remove ownership" commands, it should work without causing a panic.

scottgelb · ‎2015-07-29

I need to test this on a cDOT system with storage failover modify -enable false... This is one of the big differences between cDOT and 7-Mode... I did a 7-Mode test moving the root volume. Immediately after marking the new volume as root with vol options volname root, the mailbox disks moved automatically to the new disks. In cDOT we only mark the new volume as root in maintenance mode, which does not do this... and then rebuild mroot, instead of copying the old root to the new root (ndmpcopy, vol copy, snapmirror, etc.) as in 7-Mode. If cDOT handled the mailbox move when the volume is marked as root in maintenance mode, we would be all good here. The case I ran into is when you move mroot from the old disks to new disks... not so much of an edge case, since that is a typical step before hot removing drives.

Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.0 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.1 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.current.lock.disk:info]: Disk 0c.00.7 is a local HA mailbox disk.
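
A rough sketch for pulling the mailbox changes out of log lines like the three above, in case you want to confirm after a root move which disks left the mailbox set and which disk is now a mailbox disk. It only assumes the two message formats shown in this post; the function name is hypothetical and other fmmb events are ignored.

```python
import re

# Matches only the two event types seen in the log excerpt above.
EVENT = re.compile(
    r"fmmb\.(lock\.disk\.remove|current\.lock\.disk):info\]: Disk (\S+) "
)


def mailbox_changes(log_text):
    """Return (removed, current) lists of disk names found in fmmb log lines."""
    removed, current = [], []
    for kind, disk in EVENT.findall(log_text):
        (removed if kind == "lock.disk.remove" else current).append(disk)
    return removed, current


log = """\
Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.0 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.1 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.current.lock.disk:info]: Disk 0c.00.7 is a local HA mailbox disk.
"""

removed, current = mailbox_changes(log)
print("removed from mailbox set:", removed)
print("current mailbox disks:", current)
```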

scottgelb · ‎2015-07-29

I don't have a system I can test on... Running disk remove_ownership with HA failover disabled could crash the system again. I do know that booting to maintenance mode while taken over, then mailbox destroy local, then disk remove_ownership works with no issue, but it would be nice to do this online without taking the node to option 5... I just don't trust removing disk ownership from a mailbox disk, even if it does work. Takeover then maintenance mode is extra work but safe.

shatfield · ‎2015-07-29

That's an interesting scenario. I've hot removed many times and not hit this case, fortunately.

We can get the list of mailbox disks with storage failover mailbox-disk show, then it's a question of finding the most elegant way to encourage ONTAP to pick a more appropriate disk.

I suspect that if ownership were removed from one mailbox disk at a time, ONTAP would pick another disk, and you could repeat until the shelf was clear. Need some lab time to repro.

scottgelb · ‎2015-07-30

I found a 2240 I'm zeroing now...should be able to get back to it by tomorrow to do a test... move mroot to another aggr, stor fail modify -enable false, then remove mailbox disks one at a time and see what happens.

shatfield · ‎2015-07-30

I tried to reproduce this scenario, but when I moved root it also moved the mailbox disks. I ran through it on 8.2.1, so I am wondering:

What version of ONTAP was the node running when root was moved?

scottgelb · ‎2015-07-31

8.3P1 cDOT... In 7-Mode it also moves the mailbox disks automatically when you mark the volume as root... but from maintenance mode on 8.3P1 it did not.

shatfield · ‎2015-07-31

OK, I was on cDOT 8.2.1, but I'll try it again on 8.3P1.

Were you able to crash the 2240?

scottgelb · ‎2015-08-01

I just ran a test on the 2240 with 8.3.1RC1... the mailbox disks did move to the new drives automatically. Let me revert to 8.3P1 later and try again

scottgelb · ‎2015-08-01

The only other difference is this is a root partitioned system...even though I zeroed it out with the option set to false...

scottgelb · ‎2015-08-01

8.3P1 also worked. The mailbox disks automatically moved to the new mroot aggregate. Again, the only difference is the root partition. I will chalk this up to an anomaly... NetApp support has the logs and ASUPs that went out when we ran into this and should be able to see it all (RSA was also enabled with updates). But I will always check mailbox location before removing disk ownership... although cDOT did save the day, since we were able to do this with no data on the HA pair...

RichardPittman · ‎2015-08-14

I ran into this exact problem last week! I have a loop of FC shelves that I am replacing with a SAS stack. I had to do the NDO procedure to move the node root aggr off of the FC disks. That worked like a charm, but apparently the mailbox disks didn't get moved along with the root aggr/vol. As soon as I tried to unown the now "unused" FC disks, the node panicked and went down. The partner node stayed up though. I was able to boot into maintenance mode and run the 'mailbox destroy local' command to get the mailboxes to rebuild on the SAS drives. The node came back up, but I'm not convinced it is 100% kosher yet. I have a feeling I'm looking at another planned outage window to take both nodes down and rebuild the mailboxes on both sides.

I've done a bunch of hot-shelf removals but never encountered this. This is the first time I've had to move the node root aggr though, and you can be sure I'll be checking mailbox disk location from now on!

scottgelb · ‎2015-08-14

Support has the log files and is looking into why it didn't move... In every other case it seems to move, but we think it may be a lock on the mailbox that doesn't happen often... If we add a mailbox check to the procedure, we can see whether the mailbox is on the old disks... do a takeover... then when the partner is down, destroy local and boot... then giveback. Should be all NDU 🙂 Just a check needed to make sure... or fix the bug where it doesn't move, or give us the ability to move mailboxes hot... Lots of ways to go, but my advice is to check mailbox location, then remediate NDU if needed. Could be worse, but what we don't know can cause a panic.
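
The check-then-remediate flow above can be written down as a simple ordered checklist. This is only a sketch of the sequence described in this thread, with a hypothetical function name and the steps as plain text rather than exact commands; it runs nothing against a cluster.

```python
def remediation_plan(mailbox_on_old_disks):
    """Return the ordered steps from this thread as plain-text checklist items."""
    if not mailbox_on_old_disks:
        return ["proceed with hot shelf removal"]
    return [
        "take over the affected node from its HA partner",
        "boot the taken-over node to maintenance mode",
        "run mailbox destroy local",
        "boot the node",
        "give back",
        # Assumption: the re-check is an added precaution, not stated in the thread.
        "check mailbox disk location again before removing ownership",
    ]


for number, step in enumerate(remediation_plan(True), start=1):
    print(f"{number}. {step}")
```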