Solved: Epsilon and root move from one aggregate to another

gmilazzoitag · ‎2018-03-12

Hi everybody,

Last Friday, during an operation billed as non-disruptive (NDO), we had a service interruption on the NAS component.

We had to move the root aggregate from some old disks to new ones, and we followed the procedure reported here to the letter (our cDOT version is 8.3.2P9):

https://kb.netapp.com/app/answers/answer_view/a_id/1030179

In short, it says:

Check for epsilon on the node you have to migrate and move it to another node
- there's a warning about SAN protocol interruptions, but we did NOT have SAN protocols running, only NFS/CIFS.
LIF migration after the aggregate relocation

Well, NFS was restarted and all the servers and apps depending on it went down! I'll let you imagine the customer's reaction...

Also, the console, after this command:

system node modify -node node01 -eligibility false

gave us a warning about SAN disruption. As I wrote, that did not concern us.

Only afterwards did we find this in the manual; but manuals are usually less up to date than the knowledge base, so it would have been the last place to look for fresh information!

https://library.netapp.com/ecmdocs/ECMP1367947/html/GUID-AB52F821-3A25-4E02-B1EF-1B09EBE4009D.html

Moving epsilon for certain manually initiated takeovers

Note: Although cluster formation voting can be modified by using the cluster modify -eligibility false command, you should avoid this except for situations such as restoring the node configuration or prolonged node maintenance. If you set a node to be ineligible, it stops serving SAN data until the node is reset to eligible and rebooted. NAS data access to the node might also be affected when the node is ineligible.

And what does "might be" mean? I translate that as "nobody knows, try it..."

Now the most important question (we must migrate three more nodes!) is this:

Assuming we have understood correctly that the order is 1. migrate the LIFs and only then 2. set epsilon to false, is there an official answer or document with updated information confirming that this is the right procedure to avoid interrupting NAS protocols as well?

Thank you very much,

AlexDawson · ‎2018-03-12

Hi there,

As you've experienced service disruption at a client site, I would suggest you log a support case with our support centre, and ask for clarification of the documentation and the correct steps to avoid an outage in the future.

I've brought this issue and this thread to the attention of the writer of that KB article.

Sebastian_Goetze · ‎2018-03-13

Just for clarification, what Giacomo is referring to is this:

Hmmmmm,

Going through the steps in the KB, I would have done the epsilon and eligibility steps (Step 1 in the KB) right before the reboot (Step 8), *after* moving the aggregate and the LIFs away from the node to be worked on.

At that point in time it shouldn't disturb anything, since no user traffic should pass through this node's interfaces or disks.

What do you think?

(I'm a little unclear about the meaning of "NFS was restarted", but I have a feeling the above change in sequence should help)

Also, if you look at the revert steps, the KB first restores eligibility and HA failover and only at the end reverts aggregates and LIFs.
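
As a rough sketch only, the reordered sequence could look like the following. This is not an official procedure and should be confirmed with support before use. The node names (node01 as the node being worked on, node02 as its partner, node03 as another healthy node), the SVM, LIF, port and aggregate names are all placeholders, and the `#` lines are annotations rather than commands. The KB's other steps are not reproduced here.

```
# 1. First move the data aggregates and the data LIFs away from node01
storage aggregate relocation start -node node01 -destination node02 -aggregate-list aggr_data1
network interface migrate -vserver svm1 -lif lif1 -destination-node node02 -destination-port e0c

# 2. Only then, right before the reboot, do the epsilon and eligibility step
set -privilege advanced
cluster show -epsilon *
cluster modify -node node01 -epsilon false
cluster modify -node node03 -epsilon true
system node modify -node node01 -eligibility false
```

The point of the ordering is that node01 is made ineligible only once nothing is being served through its interfaces or from its data aggregates.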

Regards

Sebastian

gmilazzoitag · ‎2018-03-13

Thank you very much Sebastian.

Now we'll try to get an official answer from support on the sequence you're suggesting, which I completely agree with; it is what we suspected after that fault.

My best regards,

gmilazzoitag · ‎2018-03-13

Hi,
thank you for the quick answer.
That is what I should do, also because the customer needs an official position before letting us run the steps in the inverted order. Otherwise he will plan an outage.

Another experienced SE told me that the sequence should be inverted, also because, if you look at the sequence after the reboot in that KB, epsilon eligibility is the first thing restored, so it should be the last thing changed, after the aggregate relocation and LIF move.

Regards