2016-03-07 02:27 AM
I have a two-node cluster (vsim 8.3) running on VMware Fusion. While doing some exercises I changed the MTU on my cluster ports using network port broadcas.... After a few minutes one of the nodes went down, and when it came back up, the following message appeared:
** SYSTEM MESSAGES **
The contents of the root volume may have changed and the local management
configuration may be inconsistent and/or the local management databases may be
out of sync with the replicated databases. This node is not fully operational.
Contact support personnel for the root volume recovery procedures
After a while the other node showed the same message.
I cannot see my root volume in the clustershell, and I cannot use any of the subcommands of vol/volume....
Has anyone had a similar problem with vsim 8.3?
How can I recover or repair my root volumes from the clustershell?
If I enter the nodeshell, I have no option to create additional volumes using the vol command...
2016-03-07 06:58 AM
You can connect to the node management IP and run
::> node run local
This will drop you down to the nodeshell, where you should be able to manage the volumes 7-Mode style.
cmemile01::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cmemile01-01> vol status
Volume State Status Options
vol0_n1 online raid4, flex root, nosnap=on, nvfail=on 64-bit
What do you get running "::> cluster show" from the cluster shell?
If you get RPC errors and the cluster cannot communicate over the interconnect, the RDB is out of sync; try to undo the changes you made to the MTU.
::> net port modify -node cmemile01-01 -port e0a -mtu 1500
Another option is to redeploy the sims and start fresh.
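Note that on 8.3, if the cluster ports are members of the Cluster broadcast domain, the MTU may be governed at the broadcast-domain level rather than per port, so reverting it there may also be needed. A sketch, using the same command form the MTU was originally changed with (adjust names to your setup):

```
::> network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 1500
```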
2016-03-07 12:20 PM
I can access the nodeshell and the output looks ok
n2> vol status vol0
Volume State Status Options
vol0 online raid_dp, flex root, nvfail=on
Volume UUID: b450b0c7-fe1f-4d49-adf5-160776524770
Containing aggregate: 'aggr0_n2'
The vol0 has no space issues
n2> df -g vol0
Filesystem total used avail capacity Mounted on
/vol/vol0/ 7GB 1GB 5GB 19% /vol/vol0/
The following commands are available; for more information
type "vol help <command>"
autosize lang options size
container offline restrict status
I cannot create new volumes inside the nodeshell.
clu1::> aggr show -fields has-mroot
Warning: Only local entries can be displayed at this time.
2 entries were displayed.
The Volume subcommands are not available
clu1::*> volume ?
Error: "" is not a recognized command
Only the node-management LIF is available
clu1::*> net int show
(network interface show)
Logical Status Network Current Current Is
Vserver Interface Admin/Oper Address/Mask Node Port Home
----------- ---------- ---------- ------------------ ------------- ------- ----
c1n2mgmt up/up 192.168.101.37/28 n2 e0c true
cluster show is not available
clu1::*> cluster setup
Error: command failed: Exiting the cluster setup wizard. The root volume is damaged. Contact support personnel for the
root volume recovery procedures. Run the "cluster setup" command after the recovery procedures are complete.
I changed the MTU on both nodes in the cluster, but something caused the cluster configuration to be lost; I cannot figure it out!
clu1::*> net port show
(network port show)
Node Port IPspace Broadcast Domain Link MTU Admin/Oper
------ --------- ------------ ---------------- ----- ------- ------------
e0a Cluster - up 1500 auto/1000
e0b Cluster - up 1500 auto/1000
e0c Default - up 1500 auto/1000
e0d - - up 1500 auto/1000
e0e - - up 9000 auto/1000
5 entries were displayed.
In advanced or diagnostic privilege level, I cannot run cluster ring show. It's as if the cluster services are not available!
Never seen this before.
2016-03-07 09:26 PM
I think you've hit on an interesting scenario. Jumbo frames don't work on VMware Fusion, and by attempting to use jumbo frames for the cluster network something has gone wrong with the RDBs.
Did they panic and reboot?
Do you know which one had epsilon?
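Normally you can see which node holds epsilon from cluster show at the advanced privilege level, which adds an Epsilon column to the output. A sketch (with the RDBs down, this may well fail in your current state):

```
::> set -privilege advanced
::*> cluster show
```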
2016-03-08 07:59 AM
I tried to break one on purpose last night, but it's still up this morning. Which version are you using specifically? 8.3 GA, RC1? My test box was 8.3.1.
I modified the MTU with:
network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 9000
Cluster ping identifies the MTU problem, but operationally it's all still working:
cluster1::*> cluster ping -node tst831-02
Host is tst831-02
Getting addresses from network interface table...
Cluster tst831-02_clus1 tst831-02 e0a 169.254.181.27
Cluster tst831-02_clus2 tst831-02 e0b 169.254.59.32
Cluster tst831_clus1 tst831-01 e0a 169.254.115.129
Cluster tst831_clus2 tst831-01 e0b 169.254.113.38
Local = 169.254.181.27 169.254.59.32
Remote = 169.254.115.129 169.254.113.38
Cluster Vserver Id = 4294967293
Ping status:
....
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
................
Detected 1500 byte MTU on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
Larger than PMTU communication fails on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
RPC status:
2 paths up, 0 paths down (tcp check)
2 paths up, 0 paths down (udp check)
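The "Detected 1500 byte MTU" lines mean large frames are being dropped on the path even though the ports are configured for 9000. If you want to probe a path's MTU from a generic Linux host with ping's don't-fragment option, the payload size has to leave room for the 20-byte IPv4 header and the 8-byte ICMP header. A small sketch of that arithmetic (not part of the thread itself):

```python
# Largest ICMP payload that fits in a given MTU without fragmentation:
# MTU minus 20-byte IPv4 header minus 8-byte ICMP header.
def max_icmp_payload(mtu: int) -> int:
    return mtu - 20 - 8

# e.g. "ping -M do -s 8972 <cluster-lif-ip>" on Linux tests a 9000-byte path;
# if the path only carries 1500, anything above 1472 will fail.
print(max_icmp_payload(9000))  # 8972
print(max_icmp_payload(1500))  # 1472
```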
2016-03-08 12:39 PM
Only the node management LIFs are available; there are no cluster LIFs. I get a clustershell when I log on directly to a node over SSH.
cluster show and cluster ring show are not available.
The nodeshell does not let me create new volumes. I could reinstall a new two-node cluster, but I would like to repair vol0 if there is a way.
2016-03-08 04:42 PM - edited 2016-03-08 04:43 PM
OK. We need to try to revive the RDBs. Shut down the non-epsilon node, then halt the epsilon node and boot it to the loader. Unset the boot_recovery bootarg and see if it will come back up.
If the epsilon node comes back up, make sure the cluster ports are at MTU 1500, then try to bring up the other node.
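At the LOADER prompt, the sequence looks roughly like this (a sketch; unsetenv and boot_ontap are standard LOADER commands):

```
LOADER> unsetenv bootarg.init.boot_recovery
LOADER> boot_ontap
```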
2016-03-09 02:51 AM
Removing the recovery flag with "unsetenv bootarg.init.boot_recovery" did the trick.
I did it first on the node with the epsilon role. When it came back online, the RDB units started one after another, beginning with mgmt, vifmgr, bcomd... and so on, of course after a few RPC connect errors. The vldb took a while to start but could fetch all the info about my volumes again.
Then I did the same on the other node. Now my node management and clustershell, as well as OnCommand System Manager, are fully functional again.
What I would like to know, if you can assist me:
1) What does unsetting bootarg.init.boot_recovery do to the boot process and logs, in order to bypass a set of configurations and try to launch the RDBs again?
2) What triggered the continuation of my cluster inconsistency after I switched the MTU settings, right after the node with epsilon rebooted?
Thank you so much for your professional help
2016-03-09 06:58 PM
That flag gets set when a condition is detected that casts doubt on the state of the root volume. In the simulator this usually happens when the root volume gets full, but I've also seen it after some kinds of panics. Once set, it stays set until you unset it; this ensures that someone diagnoses the condition that led to it being set in the first place before the node is placed back into service.
This particular message seems to indicate the RDB and local management DBs didn't match at boot time, possibly from a poorly timed panic. The simulated NVRAM has its limits.
If you encounter this in real life you should contact support.