Failover Monitor: unable to transit - giveback process is hung (vfiler_low_level) in SK

CSTOCKDA · ‎2021-05-10

Hello,

We have a system still on 8.1.4 on which we replaced the motherboard (the full procedure was followed). However, when we try to do the giveback, we get the message below:

Panic string: Failover Monitor: unable to transit - giveback process is hung (vfiler_low_level) in SK process cf_main on release 8.1.4P1

We have attempted the following:

vfiler run * cifs terminate

partner vfiler run * cifs terminate

This still resulted in a filer panic when we attempted a giveback.

cf giveback and cf giveback -f both resulted in panics.

It seems impossible to clear. At this moment I cannot think of any option other than to reboot the node or, as a last-ditch effort, a forcegiveback. The data is migrated off, so there is no risk of data loss.

Has anyone come across this error before, and does anyone have suggestions for clearing it other than drastic measures such as a hard reboot or forcegiveback?

Thanks

Reverett · ‎2021-05-10

Which node panics during the cf giveback -f?

Is it the node issuing the giveback that panics, or the node we are trying to boot after the motherboard replacement?

Can you also provide the cf monitor and cf status outputs, please?

JPoorboy · ‎2021-05-10

It is the up node which panics.

Here is what we tried this evening.

Terminated CIFS on Node1 as well as in the partner context
Stopped all vfilers
cf giveback showed the same message ("vfiler_low_level", 110 of 158 modules)
This time, though, the filer completed the giveback successfully
Restarted both CIFS and the vfilers
Attempted a new takeover for testing
Disabled just CIFS and the vfilers in the partner context and attempted a giveback
Giveback successful
Did a takeover from Node2 this time
Takeover successful
Attempted a giveback without shutting down CIFS or the vfilers
Node2 showed the same message ("vfiler_low_level", 110 of 158 modules)
Giveback took over 30 minutes but did complete
Took over from Node1 again
Terminated CIFS and the vfilers in the partner context; the giveback has hung again

Reverett · ‎2021-05-10

Is anything done differently when the giveback is successful and when it fails, or does it just seem to work on occasion?

From the above, it appears the giveback works or fails regardless of whether the vfilers are running.

JPoorboy · ‎2021-05-10

So, we got the giveback to work on Node1 twice and both times vfilers and CIFS had been terminated.
Yesterday, without stopping both, the giveback resulted in the up node (Node1) panicking.

We tried a takeover from Node2 as well tonight for the first time and the giveback took over 30 minutes to complete.

The one thing that has me stumped is that the only place I see any mention of this process ("vfiler_low_level") is in an ONTAP 9 doc for giveback veto.
I can't find anything for ONTAP 8...

Reverett · ‎2021-05-10

We currently do not appear to have any public KBs specifically for ONTAP 8 7-Mode relating to this panic.

Based on other panics we have seen with the same message, this panic has several known causes. Identifying which one applies requires analysis of the core file that is dumped with the panic.

ttran · ‎2021-06-01

Hello @CSTOCKDA,

There are some very old bugs involving a race condition, with CIFS workload, between tearing down the vfiler too soon on the up node and rebuilding it on the node that was given back. As @Reverett mentioned, the core dump will need to be analyzed so we can look at the stack trace during the panic and concretely identify the cause. Unfortunately, ONTAP 8.1.x is End of Support, so if possible, upgrade to ONTAP 8.2.5P5. Without looking at the stack trace, we also won't be able to link an exact bug or KB.

Data ONTAP 8.2.5P5

Regards,

Team NetApp

Ethy · ‎2021-06-28

Try this as a workaround; we tested it and it works:

"Disabling HTTP/HTTPS on all the vfilers 20 minutes before initiating the takeover/giveback phases should prevent the same panics.

Once the protocols have been disabled, monitor the sessions open on the systems with the command "netstat -na" to confirm that no sessions are open on ports 80 and 443."
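
For anyone who wants to script the port check in that last step, here is a minimal Python sketch. It only parses saved "netstat -na" text and does not talk to the filer. The function names and the sample lines are hypothetical, and the column layout is an assumption: it takes the first field ending in ".port" or ":port" as the local address and counts only lines in the ESTABLISHED state, so adjust it to whatever your system actually prints.

```python
import re

WATCHED_PORTS = frozenset({80, 443})  # HTTP and HTTPS, per the workaround


def trailing_port(endpoint):
    """Return the port at the end of 'addr.port' or 'addr:port', else None."""
    match = re.search(r"[.:](\d+)$", endpoint)
    return int(match.group(1)) if match else None


def open_web_sessions(netstat_text, ports=WATCHED_PORTS):
    """Return netstat lines for established sessions on a watched local port."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if "ESTABLISHED" not in fields:
            continue
        # Assumption: the first endpoint-looking field is the local address.
        found = [p for p in map(trailing_port, fields) if p is not None]
        if found and found[0] in ports:
            hits.append(line.strip())
    return hits


# Hypothetical sample text, not captured from a real system.
SAMPLE = """\
tcp 0 0 10.0.0.1.443 10.0.0.50.51000 ESTABLISHED
tcp 0 0 10.0.0.1.2049 10.0.0.51.900 ESTABLISHED
tcp 0 0 *.80 *.* LISTEN
tcp 0 0 10.0.0.1:80 10.0.0.52:40000 ESTABLISHED
"""

for session in open_web_sessions(SAMPLE):
    print("still open:", session)
```

On the sample above, the first and last lines are reported and the NFS and LISTEN lines are ignored. An empty result would be the signal that it is safe to start the takeover/giveback.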