Root volume is damaged

Hi,

 

I have a two-node cluster (vsim 8.3) running on VMware Fusion. As an exercise, I changed the MTU on my cluster ports using network port broadcas.... After a few minutes one of the nodes went down, and when it came back up the following message appeared.

 

***********************
** SYSTEM MESSAGES **
***********************

The contents of the root volume may have changed and the local management
configuration may be inconsistent and/or the local management databases may be
out of sync with the replicated databases. This node is not fully operational.
Contact support personnel for the root volume recovery procedures

 

After a while the other node also showed the same message.

 

I cannot see my root volume in the clustershell, and I cannot use any of the vol/volume subcommands....

 

Has anyone had a similar problem with VSIM 8.3?

How can I recover or repair my root volumes in the clustershell?

 

If I enter the nodeshell, I have no option to create additional volumes using the vol command...

 

I'm stuck..

 

Thank you

Bash

Re: Root volume is damaged

You can connect to the node management IP and run

::> node run local

 

This will drop you down to the nodeshell, and you should be able to manage the volumes 7-mode style.

 

cmemile01::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cmemile01-01> vol status
Volume State Status Options
vol0_n1 online raid4, flex root, nosnap=on, nvfail=on 64-bit

 

What do you get running "::> cluster show" from the cluster shell?

If you get RPC errors and the cluster cannot communicate over the interconnect, the RDB is out of sync. Try undoing the changes you made to the MTU.

 

::> net port modify -node cmemile01-01 -port e0a -mtu 1500

 

Another option is to redeploy the sims and start fresh.

 

 

Re: Root volume is damaged

 

I can access the nodeshell, and the output looks OK:

 

n2> vol status vol0
Volume State Status Options
vol0 online raid_dp, flex root, nvfail=on
64-bit
Volume UUID: b450b0c7-fe1f-4d49-adf5-160776524770
Containing aggregate: 'aggr0_n2'

 

The vol0 has no space issues

n2> df -g vol0
Filesystem total used avail capacity Mounted on
/vol/vol0/ 7GB 1GB 5GB 19% /vol/vol0/

 

n2> vol
The following commands are available; for more information
type "vol help <command>"
autosize lang options size
container offline restrict status
destroy online

 

I cannot create new volumes inside nodeshell

 

clu1::> aggr show -fields has-mroot
aggregate has-mroot
--------- ---------

Warning: Only local entries can be displayed at this time.
aggr0_n2 true
n2aggr false
2 entries were displayed.

 

The Volume subcommands are not available

clu1::*> volume ?

Error: "" is not a recognized command

clu1::*> volume

 

Only the node lif is available

clu1::*> net int show
(network interface show)
Logical Status Network Current Current Is
Vserver Interface Admin/Oper Address/Mask Node Port Home
----------- ---------- ---------- ------------------ ------------- ------- ----
clu1
c1n2mgmt up/up 192.168.101.37/28 n2 e0c true

 

cluster show is not available

 

clu1::*> cluster setup

Error: command failed: Exiting the cluster setup wizard. The root volume is damaged. Contact support personnel for the
root volume recovery procedures. Run the "cluster setup" command after the recovery procedures are complete.

 

I changed the MTU on both nodes in the cluster, but something caused the cluster configuration to be lost, and I cannot figure out what.

 

clu1::*> net port show
(network port show)
Speed (Mbps)
Node Port IPspace Broadcast Domain Link MTU Admin/Oper
------ --------- ------------ ---------------- ----- ------- ------------
n2
e0a Cluster - up 1500 auto/1000
e0b Cluster - up 1500 auto/1000
e0c Default - up 1500 auto/1000
e0d - - up 1500 auto/1000
e0e - - up 9000 auto/1000
5 entries were displayed.
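
For anyone who wants to check a pasted port listing like the one above without eyeballing it, here is a minimal Python sketch. It assumes the flattened six-column row layout shown in this post (port, IPspace, broadcast domain, link, MTU, admin/oper speed); the function names and the sample text are hypothetical and not part of ONTAP.

```python
def parse_port_mtus(text):
    """Return {port: (ipspace, mtu)} from 'network port show' style rows.

    Assumes each data row has exactly six whitespace-separated fields,
    as in the output pasted in this thread.
    """
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        # Header, separator, and summary lines either have a different
        # field count or a non-numeric fifth field, so they are skipped.
        if len(fields) == 6 and fields[4].isdigit():
            ports[fields[0]] = (fields[1], int(fields[4]))
    return ports


def mismatched_cluster_ports(ports, expected_mtu=1500):
    """List ports in the Cluster IPspace whose MTU differs from expected_mtu."""
    return sorted(
        port
        for port, (ipspace, mtu) in ports.items()
        if ipspace == "Cluster" and mtu != expected_mtu
    )


# Sample rows copied from the listing above.
SAMPLE = """\
n2
e0a Cluster - up 1500 auto/1000
e0b Cluster - up 1500 auto/1000
e0c Default - up 1500 auto/1000
e0d - - up 1500 auto/1000
e0e - - up 9000 auto/1000
5 entries were displayed.
"""

if __name__ == "__main__":
    ports = parse_port_mtus(SAMPLE)
    print(mismatched_cluster_ports(ports))
```

With the listing above, both Cluster ports are already back at 1500, so nothing is flagged; only the unassigned e0e still carries 9000.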

 

At the advanced or diagnostic privilege level, I cannot run cluster ring show. It's as if the cluster services are not available.

Never seen this before.

 

 

Re: Root volume is damaged

I think you've hit on an interesting scenario.  Jumbo frames don't work on VMware Fusion, and by attempting to use jumbo frames for the cluster network something has gone wrong with the RDBs.

Did they panic and reboot?

Do you know which one had epsilon?  

 

If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.

Re: Root volume is damaged

That's what I have noticed. All 5 RDBs are gone but somehow the cluster shell remains!!!

No panic, but the node2 (the one with the epsilon role) rebooted.

 

 

 

 

Re: Root volume is damaged

I tried to break one on purpose last night, but it's still up this morning.  Which version are you using specifically?  8.3GA, RC1?  My test box was 8.3.1.

I modified the MTU with:

network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 9000

Cluster ping identifies the MTU problem, but operationally it's all still working:

cluster1::*> cluster ping -node tst831-02
Host is tst831-02
Getting addresses from network interface table...
Cluster tst831-02_clus1 tst831-02 e0a       169.254.181.27 
Cluster tst831-02_clus2 tst831-02 e0b       169.254.59.32  
Cluster tst831_clus1    tst831-01 e0a       169.254.115.129 
Cluster tst831_clus2    tst831-01 e0b       169.254.113.38 
Local = 169.254.181.27 169.254.59.32
Remote = 169.254.115.129 169.254.113.38
Cluster Vserver Id = 4294967293
Ping status:
.... 
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
................ 
Detected 1500 byte MTU on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
Larger than PMTU communication fails on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
RPC status:
2 paths up, 0 paths down (tcp check)
2 paths up, 0 paths down (udp check)
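
If you want to spot this condition in a saved copy of the output, a rough Python sketch follows. It only matches the two summary lines exactly as they are worded in the paste above; the function name and result keys are made up for illustration, and other ONTAP versions may word these lines differently.

```python
import re


def summarize_cluster_ping(text):
    """Pull the detected MTU and the large-packet failure count from the text.

    Matches the 'Detected N byte MTU on M path(s)' and
    'Larger than PMTU communication fails on M path(s)' lines as worded
    in the output pasted in this thread.
    """
    detected = re.search(r"Detected (\d+) byte MTU on (\d+) path", text)
    failing = re.search(
        r"Larger than PMTU communication fails on (\d+) path", text
    )
    return {
        "detected_mtu": int(detected.group(1)) if detected else None,
        "detected_paths": int(detected.group(2)) if detected else 0,
        "pmtu_fail_paths": int(failing.group(1)) if failing else 0,
    }


# Summary lines copied from the cluster ping output above.
PING_SAMPLE = """\
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
Detected 1500 byte MTU on 4 path(s):
Larger than PMTU communication fails on 4 path(s):
"""

if __name__ == "__main__":
    summary = summarize_cluster_ping(PING_SAMPLE)
    # A nonzero failure count alongside a 1500-byte detected MTU suggests
    # the ports are configured for a larger MTU than the path can carry.
    print(summary)
```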

Re: Root volume is damaged

Only the node management LIFs and a clustershell (via direct SSH to a node) are available.

There are no cluster LIFs available. I get a clustershell only when I log on directly to a node using SSH.

cluster show or cluster ring show are not available.

 

 

The nodeshell does not give me a way to create new volumes. I could reinstall a new two-node cluster, but I would like to repair vol0 if there is a way.

 

Thank you

Re: Root volume is damaged

OK.  We need to try to revive the RDBs.  Shut down the non-epsilon node, then halt the epsilon node and boot it to the loader.  Unset the boot_recovery bootarg and see if it will come back up.

 

unsetenv bootarg.init.boot_recovery

If the epsilon node comes back up, make sure the cluster ports are set to MTU 1500, then try to bring up the other node.

 

 


Re: Root volume is damaged

Hi,

 

Removing the recovery flag with "unsetenv bootarg.init.boot_recovery" did the trick.

 

I did it first on the node with the epsilon role. When it came back online, the RDBs started one after another, beginning with mgmt, vifmgr, bcomd.... and so on, of course after a few RPC connect errors. The vldb took a while to start but could fetch all the info about my volumes again :)

 

Then I ran it on the other node. Now my node management and clustershell, as well as OnCommand System Manager, are fully functional again.

 

Here is what I would like to know, if you can assist me:

 

1) What does unsetenv bootarg.init.boot_recovery do to the boot process and logs so that it bypasses a set of configurations and tries to launch the RDBs again?

2) What caused "my cluster inconsistency" to persist after I switched the MTU settings back, right after the node with epsilon rebooted?

 

Thank you so much for your professional help

 

// Bash

 

Re: Root volume is damaged

That flag gets set when a condition is detected that casts doubt on the state of the root volume.  In the simulator this usually happens when the root volume gets full, but I've also seen it after some kinds of panics.  Once set, it stays set until you unset it.  This makes sure that someone diagnoses the condition that led to it getting set in the first place before the node is placed back into service.

 

This particular message seems to indicate the RDB and local management DBs didn't match at boot time, possibly from a poorly timed panic.  The simulated NVRAM has its limits.

 

If you encounter this in real life you should contact support.

 
