Solved: Re: Root volume is damaged

bashirgas · ‎2016-03-07

Hi,

I have a two-node cluster (vsim 8.3) running on VMware Fusion. While doing some exercises I changed the MTU on my cluster ports using network port broadcas.... After a few minutes one of the nodes went down, and when it came back up the following message appeared.

***********************
** SYSTEM MESSAGES **
***********************

The contents of the root volume may have changed and the local management
configuration may be inconsistent and/or the local management databases may be
out of sync with the replicated databases. This node is not fully operational.
Contact support personnel for the root volume recovery procedures

After a while the other node also showed the same message.

I cannot see my root volume in the clustershell, and I cannot use any of the subcommands of vol/volume....

Anyone who had a similar problem with VSIM 8.3?

How can I recover my root volumes or repair them in the clustershell?

If I enter the nodeshell, I have no option to create additional volumes using the vol command...

I'm stuck..

Thank you

Bash

Emile-Bodin · ‎2016-03-07

You can connect to the node management IP and run

::> node run local

This will drop you down to the nodeshell, and you should be able to manage the volumes 7-Mode style.

cmemile01::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cmemile01-01> vol status
Volume   State   Status        Options
vol0_n1  online  raid4, flex   root, nosnap=on, nvfail=on
                 64-bit

What do you get running "::> cluster show" from the cluster shell?

If you get RPC errors and the cluster cannot communicate over the interconnect, the RDB is out of sync. Try to undo the changes that you made to the MTU.

::> net port modify -node cmemile01-01 -port e0a -mtu 1500

Another option is to redeploy the sims and start fresh.

bashirgas · ‎2016-03-07

I can access the nodeshell and the output looks ok

n2> vol status vol0
Volume   State   Status         Options
vol0     online  raid_dp, flex  root, nvfail=on
                 64-bit
Volume UUID: b450b0c7-fe1f-4d49-adf5-160776524770
Containing aggregate: 'aggr0_n2'

The vol0 has no space issues

n2> df -g vol0
Filesystem   total  used  avail  capacity  Mounted on
/vol/vol0/   7GB    1GB   5GB    19%       /vol/vol0/
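
A full root volume is named later in this thread as the usual trigger for the recovery flag on the simulator, so the capacity column above is worth checking mechanically. What follows is a minimal Python sketch, not a NetApp tool: `root_volume_usage` and the 90% threshold are hypothetical, chosen only for illustration, and the sketch assumes the `df -g` text is pasted in as shown in this post.

```python
import re

# `df -g vol0` output pasted from the nodeshell, as posted above.
DF_OUTPUT = """\
Filesystem total used avail capacity Mounted on
/vol/vol0/ 7GB 1GB 5GB 19% /vol/vol0/
"""


def root_volume_usage(df_output, mount="/vol/vol0/"):
    """Return the capacity percentage reported for `mount`, or None if absent."""
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows look like: <filesystem> <total> <used> <avail> <NN%> <mount>
        if len(fields) == 6 and fields[0] == mount:
            match = re.fullmatch(r"(\d+)%", fields[4])
            if match:
                return int(match.group(1))
    return None


used_pct = root_volume_usage(DF_OUTPUT)
# The 90% threshold is an arbitrary illustration, not a documented limit.
nearly_full = used_pct is not None and used_pct >= 90
print(f"vol0 capacity: {used_pct}%, nearly full: {nearly_full}")
```

With the numbers posted here the check agrees with the poster: vol0 is not short on space, so a full root volume is unlikely to be the cause in this case.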

n2> vol
The following commands are available; for more information
type "vol help <command>"
autosize lang options size
container offline restrict status
destroy online

I cannot create new volumes inside nodeshell

clu1::> aggr show -fields has-mroot
aggregate has-mroot
--------- ---------

Warning: Only local entries can be displayed at this time.
aggr0_n2 true
n2aggr false
2 entries were displayed.

The Volume subcommands are not available

clu1::*> volume ?

Error: "" is not a recognized command

clu1::*> volume

Only the node lif is available

clu1::*> net int show
(network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
clu1
            c1n2mgmt   up/up      192.168.101.37/28  n2            e0c     true

cluster show is not available

clu1::*> cluster setup

Error: command failed: Exiting the cluster setup wizard. The root volume is damaged. Contact support personnel for the
root volume recovery procedures. Run the "cluster setup" command after the recovery procedures are complete.

I changed the MTU on both nodes in the cluster, but something caused the cluster configuration to be lost. I cannot figure it out!

clu1::*> net port show
(network port show)
                                                             Speed (Mbps)
Node   Port      IPspace      Broadcast Domain Link  MTU     Admin/Oper
------ --------- ------------ ---------------- ----- ------- ------------
n2
       e0a       Cluster      -                up    1500    auto/1000
       e0b       Cluster      -                up    1500    auto/1000
       e0c       Default      -                up    1500    auto/1000
       e0d       -            -                up    1500    auto/1000
       e0e       -            -                up    9000    auto/1000
5 entries were displayed.
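
For anyone comparing MTUs across many ports, the table above can be scanned with a few lines of Python. This is only a sketch under stated assumptions: `ports_with_unexpected_mtu` is a hypothetical helper, it expects the `network port show` rows pasted as whitespace-separated text, and it would misread a broadcast domain name that contains spaces.

```python
# `network port show` rows as posted above (whitespace-separated columns).
PORT_SHOW = """\
n2
e0a Cluster - up 1500 auto/1000
e0b Cluster - up 1500 auto/1000
e0c Default - up 1500 auto/1000
e0d - - up 1500 auto/1000
e0e - - up 9000 auto/1000
5 entries were displayed.
"""


def ports_with_unexpected_mtu(output, expected_mtu=1500, ipspace=None):
    """Return (node, port, ipspace, mtu) for rows whose MTU differs.

    If `ipspace` is given, only ports in that IPspace are considered.
    """
    mismatches = []
    node = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 1 and not fields[0].startswith("-"):
            # A line holding a single word is the node name for the rows below it.
            node = fields[0]
        elif len(fields) == 6 and fields[4].isdigit():
            # Port rows: <port> <ipspace> <broadcast domain> <link> <mtu> <speed>
            port, port_ipspace, mtu = fields[0], fields[1], int(fields[4])
            if ipspace is not None and port_ipspace != ipspace:
                continue
            if mtu != expected_mtu:
                mismatches.append((node, port, port_ipspace, mtu))
    return mismatches


print(ports_with_unexpected_mtu(PORT_SHOW))
print(ports_with_unexpected_mtu(PORT_SHOW, ipspace="Cluster"))
```

On this output only e0e, which is not in the Cluster IPspace, still carries MTU 9000; both cluster ports already report 1500.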

At the advanced or diagnostic privilege level, I cannot run cluster ring show. It's like the cluster services are not available!

Never seen this before.

SeanHatfield · ‎2016-03-07

I think you've hit on an interesting scenario. Jumbo frames don't work on VMware Fusion, and by attempting to use jumbo frames for the cluster network something has gone wrong with the RDBs.

Did they panic and reboot?

Do you know which one had epsilon?

If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.

bashirgas · ‎2016-03-08

That's what I have noticed. All 5 RDBs are gone but somehow the cluster shell remains!!!

No panic, but node2 (the one with the epsilon role) rebooted.

SeanHatfield · ‎2016-03-08

I tried to break one on purpose last night, but it's still up this morning. Which version are you using specifically? 8.3GA, RC1? My test box was 8.3.1.

I modified the MTU with:

network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 9000

Cluster ping identifies the MTU problem, but operationally it's all still working:

cluster1::*> cluster ping -node tst831-02
Host is tst831-02
Getting addresses from network interface table...
Cluster tst831-02_clus1 tst831-02 e0a       169.254.181.27 
Cluster tst831-02_clus2 tst831-02 e0b       169.254.59.32  
Cluster tst831_clus1    tst831-01 e0a       169.254.115.129 
Cluster tst831_clus2    tst831-01 e0b       169.254.113.38 
Local = 169.254.181.27 169.254.59.32
Remote = 169.254.115.129 169.254.113.38
Cluster Vserver Id = 4294967293
Ping status:
.... 
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
................ 
Detected 1500 byte MTU on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
Larger than PMTU communication fails on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
RPC status:
2 paths up, 0 paths down (tcp check)
2 paths up, 0 paths down (udp check)
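
The pattern in the output above (basic connectivity succeeds, larger-than-PMTU traffic fails) is what points at an MTU mismatch rather than a dead link. As a rough sketch, it can be pulled out of a pasted `cluster ping` transcript with Python. `summarize_cluster_ping` and the `mtu_mismatch` key are hypothetical names, and the regular expressions only cover the summary lines that appear in this post.

```python
import re

# Summary lines excerpted from the cluster ping output above.
CLUSTER_PING = """\
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
Detected 1500 byte MTU on 4 path(s):
Larger than PMTU communication fails on 4 path(s):
"""

# One capture group per counter; only lines seen in this thread are handled.
PATTERNS = {
    "basic_ok": r"Basic connectivity succeeds on (\d+) path",
    "basic_failed": r"Basic connectivity fails on (\d+) path",
    "pmtu_failed": r"Larger than PMTU communication fails on (\d+) path",
    "detected_mtu": r"Detected (\d+) byte MTU on \d+ path",
}


def summarize_cluster_ping(output):
    """Collect the path counters and flag the 'small frames pass, large frames fail' case."""
    summary = {"basic_ok": 0, "basic_failed": 0, "pmtu_failed": 0, "detected_mtu": None}
    for line in output.splitlines():
        line = line.strip()
        for key, pattern in PATTERNS.items():
            match = re.match(pattern, line)
            if match:
                summary[key] = int(match.group(1))
    # Links are up but frames above the path MTU are dropped.
    summary["mtu_mismatch"] = (
        summary["basic_ok"] > 0
        and summary["basic_failed"] == 0
        and summary["pmtu_failed"] > 0
    )
    return summary


print(summarize_cluster_ping(CLUSTER_PING))
```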

bashirgas · ‎2016-03-08

Only the node management LIFs and a clustershell (via direct ssh to the node) are available.

There are no cluster LIFs available. I get a clustershell when I log on directly to the node using ssh.

cluster show or cluster ring show are not available.

The nodeshell does not give me the possibility to create new volumes. I could reinstall a new two-node cluster but would like to repair vol0 if there is a way.

Thank you

SeanHatfield · ‎2016-03-08

OK. We need to try to revive the RDBs. Shut down the non-epsilon node, halt the epsilon node and boot to the loader. Unset the boot_recovery bootarg and see if it will come back up.

unsetenv bootarg.init.boot_recovery

If the epsilon node comes back up, make sure the cluster ports are at MTU 1500, then try to bring up the other node.
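
Because the order matters (epsilon node first, the other node last), the steps above can be written out as an ordered checklist. The Python below is only an illustrative sketch: `recovery_steps` is a hypothetical helper, the node name `n1` is assumed (only `n2` appears in this thread), and the strings are reminders for a person at the console, not commands sent anywhere.

```python
def recovery_steps(nodes, epsilon_node):
    """Return the simulator recovery checklist in the order described in this post."""
    if epsilon_node not in nodes:
        raise ValueError(f"{epsilon_node!r} is not one of the cluster nodes")
    others = [name for name in nodes if name != epsilon_node]

    # Non-epsilon nodes go down first so the epsilon node can be recovered alone.
    steps = [f"Shut down non-epsilon node {name}" for name in others]
    steps += [
        f"Halt {epsilon_node} and boot it to the loader",
        f"At the {epsilon_node} loader prompt: unsetenv bootarg.init.boot_recovery",
        f"Boot {epsilon_node} and see whether it comes back up",
        f"Make sure the cluster ports on {epsilon_node} are at MTU 1500",
    ]
    # The remaining nodes are brought up only after the epsilon node is back.
    steps += [f"Try to bring up {name}" for name in others]
    return steps


# n2 held epsilon in this thread; n1 is an assumed name for the other node.
for number, step in enumerate(recovery_steps(["n1", "n2"], epsilon_node="n2"), start=1):
    print(f"{number}. {step}")
```

This applies to the simulator only; on a production system the advice in this thread is to contact support.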

bashirgas · ‎2016-03-09

Hi,

Removing the recovery flag "unsetenv bootarg.init.boot_recovery" did the trick.

I did it first on the node with the epsilon role. When it came back online, the RDBs started one after another, beginning with mgmt, vifmgr, bcomd.... and so on... of course after a few RPC connect errors... The vldb took a while to start but could fetch all info of my volumes again 🙂

Then I ran it on the other node..... Now my node management and cluster shell as well as OnCommand System Manager are fully functional again...

What I would like to know, if you can assist me:

1) What did unsetenv bootarg.init.boot_recovery do to the boot process and logs so that it bypassed a set of configurations and tried to launch the RDBs again?

2) What triggered the continuation of "my cluster inconsistency" after I switched the MTU settings right after the node with epsilon rebooted?

Thank you so much for your professional help

// Bash

SeanHatfield · ‎2016-03-09

That flag gets set when a condition is detected that casts doubt on the state of the root volume. In the simulator this usually happens when the root volume gets full, but I've also seen it after some kinds of panics. Once set, it stays set until you unset it. This makes sure that someone diagnoses the condition that led to it getting set in the first place before the node is placed back into service.

This particular message seems to indicate the RDB and local management DBs didn't match at boot time, possibly from a poorly timed panic. The simulated nvram has its limits.

If you encounter this in real life you should contact support.

elideckel · ‎2016-05-09

I thought I'd be lucky like you when I read the last remedy with the 'unsetenv' command, but my vSIMs are still coming back with the same errors - Recovery Required!

I wonder if there are additional flag(s) that can fix the corrupted vol0?

I had those sims and a Windows DC installed on VMware Workstation, and apparently Microsoft decided to apply some Windows patches and rebooted the machine. This caused the sims to boot up in an unstable state, as you described. Prior to that, I had already taken care of both vol0 sizes (53% avail), and this together with the 'unsetenv' command did not help get my cDOT 8.3.2 sims back to full function as I had them running yesterday.

SeanHatfield · ‎2016-05-10

In your case the file system is actually damaged.

Recovering from that is more involved. If any one node has a good root vol you may be able to reseed the failed nodes, but if they are all damaged you may need to restore the last good system configuration backup.

This can happen during a dirty shutdown because by default the simulated nvram operates in a non-persistent mode. If you have them running on SSD you could change that, but if they are on HDD the performance impact could be significant.

elideckel · ‎2016-05-10

Yep, no backup, no snapshots taken. I already created a new pair and I'll make sure to suspend them next time I walk away...

thanks for confirming the cause.

dkorns · ‎2016-08-29

I also have to use this occasionally (after lab power failures). In my case (single-node cluster vsims) it seems to take 10 to 20 minutes for the full recovery to take place after the system boots. I'm wondering if there is anything one can look at (a log file, a show command, etc.) that gives a clue as to RDB rebuild progress. I just run 'network interface show' commands until I see all the LIFs show up again.
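
The manual polling described here can be expressed as a small loop. The Python below is a sketch only: `count_lifs` and `wait_until` are hypothetical helpers, the 30-minute default timeout is an arbitrary margin above the 10 to 20 minutes mentioned, and how the `network interface show` text is obtained (ssh, console capture) is left to the caller. It does not reveal RDB rebuild progress, it only automates the waiting.

```python
import re
import time


def count_lifs(interface_show_output):
    """Count rows that carry an Admin/Oper status such as up/up or up/down."""
    status = re.compile(r"\b(?:up|down)/(?:up|down)\b")
    return sum(1 for line in interface_show_output.splitlines() if status.search(line))


def wait_until(check, timeout_s=1800.0, interval_s=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Canned text stands in for three successive polls of `network interface show`;
# the last one reuses the LIF row posted earlier in this thread.
polls = iter(["", "", "c1n2mgmt up/up 192.168.101.37/28 n2 e0c true"])
recovered = wait_until(lambda: count_lifs(next(polls)) >= 1,
                       timeout_s=60, interval_s=0, sleep=lambda _seconds: None)
print("LIFs back:", recovered)
```

In real use `check` would fetch fresh output each time and compare the count against the number of LIFs the cluster normally has.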

tdada · ‎2016-09-05

SOLVED!!! Had a single-node cluster (vsim 9.0) running on VMware Workstation 11.

Interrupted the normal boot into VLOADER and ran the cmd "unsetenv bootarg.init.boot_recovery"

Rebooted and problem solved. Cluster now healthy. Cheers.

Davidjai · ‎2016-10-10

Hi, this is Jai. My root volume is damaged and I am not able to recover from the error. I used "unsetenv bootarg.init.boot_recovery", but this command is not working; I am getting the same problem again and again.

Thanks in advance.

SeanHatfield · ‎2016-10-10

Then you likely need to make some free space first.

Start by deleting aggregate snapshots and vol0 snapshots if any are present, then try it again.

Davidjai · ‎2016-10-11

Thank you so much, Mr. Hat, I got the solution.

Regards and Thanks

JAI

rsankuru · ‎2017-09-04

Hey, how did you resolve this?

@Davidjai wrote:
Thank you so much, Mr. Hat, I got the solution.

Regards and Thanks

JAI

How did you resolve this issue?

Davidjai · ‎2017-09-04

I simply followed the recovery procedure and used "unsetenv bootarg.init.boot_recovery".

I also deleted the damaged volume in CLI mode.

SupraJari · ‎2018-04-24

Hi, trying to recover with 9.3 simulator.

VLOADER> unsetenv bootarg.init.boot_recovery

no such file or directory

Also, printenv does not show such an option.

Has something changed in 9.3?

-Jari