2015-07-27 10:09 AM
BURT 931034 was filed for this - thank you Steve! For anyone searching on hot shelf removal, it is worth adding one extra check. The hot shelf removal feature is great, but the documentation at https://library.netapp.com/ecmdocs/ECMP1367947/html/GUID-2B80FBD2-007D-4D19-8EB1-9CCEED211001.html
is missing a check for an edge case: mroot was previously on those disks, and you then moved mroot to other disks/shelves ahead of the hot removal.
If the shelf or shelves still hold mailbox disks, the node will crash right after you remove ownership from those disks (even though they are all zeroed spares). The HA partner will not be able to take over, and will then also crash with an inconsistent mailbox. We had both nodes drop after removing ownership from a shelf that was being removed. It is an easy fix to get to maintenance mode and run mailbox destroy local, but add that to your run book for hot shelf removal. We didn’t have any data on the HA pair during the maintenance (the great thing about cDOT, which averted any outage even during this maintenance). We recovered by reassigning the disks and deleting the mailbox (had to do that twice); both nodes then came back up after the coredump, the mailbox was recreated, and storage failover could be enabled again. To see which disks are currently mailbox disks:
::> set adv
::*> stor failover mailbox-disk show
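For the run book, the extra check boils down to "is any current mailbox disk on the shelf I am about to pull?". Here is a minimal sketch of that comparison in Python. The function name and the disk names are hypothetical placeholders typed in by hand; nothing here parses real command output, so you would fill in both lists from what the mailbox-disk show command and your shelf inventory report.

```python
# Hypothetical pre-check for a hot shelf removal run book.
# Both lists are entered by hand; the disk names are placeholders.

def mailbox_disks_at_risk(mailbox_disks, disks_to_remove):
    """Return the mailbox disks that sit on hardware scheduled for removal."""
    return sorted(set(mailbox_disks) & set(disks_to_remove))


if __name__ == "__main__":
    mailbox = ["0c.00.0", "0c.00.1"]                        # placeholder values
    removing = ["0c.00.0", "0c.00.1", "0c.00.2", "0c.00.3"]  # placeholder values
    at_risk = mailbox_disks_at_risk(mailbox, removing)
    if at_risk:
        print("STOP: mailbox disks on the shelf being removed:", ", ".join(at_risk))
    else:
        print("No mailbox disks on the shelf being removed.")
```

If the result is not empty, deal with the mailbox first before removing ownership from any of those disks.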
One other note on the move mroot procedure https://kb.netapp.com/support/index?page=content&id=3013873&locale=en_US The article now says to turn on nvfail (I submitted that update a while ago), but some systems also had create_ucode enabled on vol0, so I check “vol options vol0” prior to the move, then match those settings after the move in addition to nvfail. Also, if you have cdpd.enable on and ip.fastpath.enable off, both need to be set again, since the mroot move resets them to the defaults of cdpd off and fastpath on (8.3P1).
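One way to keep track of the settings that need to be matched after the move is to record them before and after and compare. This is only a sketch: the function name is hypothetical, the option names are the ones mentioned above, and the values are illustrative ones entered by hand rather than anything read from a system.

```python
# Hypothetical before/after comparison of settings around an mroot move.
# Values are illustrative and entered by hand, not read from a system.

def options_to_restore(before, after):
    """Return {option: (value_before, value_after)} for options that differ after the move."""
    return {
        name: (value, after.get(name))
        for name, value in before.items()
        if after.get(name) != value
    }


if __name__ == "__main__":
    before = {"nvfail": "on", "create_ucode": "on",
              "cdpd.enable": "on", "ip.fastpath.enable": "off"}
    # Assumed post-move state, for illustration only.
    after = {"nvfail": "off", "create_ucode": "off",
             "cdpd.enable": "off", "ip.fastpath.enable": "on"}
    for name, (old, new) in sorted(options_to_restore(before, after).items()):
        print(f"{name}: was {old}, now {new} -> set back to {old}")
```

Anything the comparison reports is a setting to put back by hand once the new root volume is in place.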
2015-07-28 01:38 AM
Can't you just "cf disable" (or rather "storage failover modify...") before removing disk ownership from those 2 mailbox disks, and re-enable it again afterwards?
2015-07-28 02:09 AM
Yes, but they are not actively used. And if one fails or is removed, the system should automatically elect another one as a replacement (otherwise every failed mailbox disk would crash the filer in the same way), so if you give the system enough time between the "remove ownership" commands, it should work without causing a panic.
2015-07-29 08:34 AM
I need to test this on a cDOT system with storage failover modify -enable false... This is one of the big differences between cDOT and 7-Mode... I did a 7-Mode test moving the root volume. Immediately after marking the new volume as root with vol options volname root, the mailbox disks move automatically to the new disks. In cDOT we only mark the new volume as root in maintenance mode, which does not do this, and then rebuild mroot instead of copying the old root to the new root as in 7-Mode (ndmpcopy, vol copy, snapmirror, etc.). If cDOT handled the mailbox move when the volume is marked as root in maintenance mode, we would be all good here. The case I ran into is when you move mroot from the old disks to new disks, which is not so much of an edge case since that is a typical step before hot removing drives.
Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.0 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.1 removed from local mailbox set.
Tue May 22 22:15:30 GMT [node1: fmmb.current.lock.disk:info]: Disk 0c.00.7 is a local HA mailbox disk.
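If you want to confirm from the logs which disks left the mailbox set and which one is current, messages like the three above can be summarized with a short script. This is a rough sketch that only understands the two event names shown here (fmmb.lock.disk.remove and fmmb.current.lock.disk); the function name and regular expression are my own, and treating the second event as "this disk is in the current set" is an assumption based on the message text.

```python
import re

# Matches lines shaped like the three messages quoted above.
LINE_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<event>fmmb\.[\w.]+):\w+\]: Disk (?P<disk>\S+) "
)


def summarize_mailbox_events(lines):
    """Return (removed_disks, current_disks) as sorted lists of disk names."""
    removed, current = set(), set()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a mailbox message in the expected shape
        event, disk = match.group("event"), match.group("disk")
        if event == "fmmb.lock.disk.remove":
            removed.add(disk)
            current.discard(disk)
        elif event == "fmmb.current.lock.disk":
            current.add(disk)
    return sorted(removed), sorted(current)


if __name__ == "__main__":
    log = [
        "Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.0 removed from local mailbox set.",
        "Tue May 22 22:15:30 GMT [node1: fmmb.lock.disk.remove:info]: Disk 0c.00.1 removed from local mailbox set.",
        "Tue May 22 22:15:30 GMT [node1: fmmb.current.lock.disk:info]: Disk 0c.00.7 is a local HA mailbox disk.",
    ]
    removed, current = summarize_mailbox_events(log)
    print("removed from mailbox set:", removed)
    print("current mailbox disks:", current)
```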
2015-07-29 08:36 AM
I don't have a system I can test on with the procedure of disk remove_ownership with HA failover disabled... It could crash the system again. I do know that a maintenance mode boot while taken over, then mailbox destroy local, then disk remove_ownership does work with no issue, but it would be nice to do this online without taking the node to option 5... I just don't trust removing disk ownership from a mailbox disk even if it does work. Takeover then maintenance mode is extra work but safe.
2015-07-29 09:47 PM - edited 2015-07-30 07:14 PM
That's an interesting scenario. I've hot removed many times and not hit this case, fortunately.
We can get the list of mailbox disks with storage failover mailbox-disk show, then it's a question of finding the most elegant way to encourage ONTAP to pick a more appropriate disk.
I suspect that if ownership of one mailbox disk at a time were removed, ONTAP would pick another disk, and you could repeat until the shelf was clear. Need some lab time to repro.
2015-07-30 03:27 PM
I found a 2240 I'm zeroing now...should be able to get back to it by tomorrow to do a test... move mroot to another aggr, stor fail modify -enable false, then remove mailbox disks one at a time and see what happens.
2015-07-30 08:28 PM
I tried to reproduce this scenario, but when I moved root it also moved the mailbox disks. I ran through it on 8.2.1, so I am wondering:
What version of ONTAP was the node running when root was moved?
2015-07-31 08:28 AM
8.3P1 cDOT. In 7-Mode it also moves the mailbox disks automatically when you mark the volume as root, but from maintenance mode on 8.3P1 it did not.