Simulator Discussions

Root volume is damaged

bashirgas

Hi,

 

I have a two-node cluster (vsim 8.3) running on VMware Fusion. While doing an exercise I changed the MTU on my cluster ports using network port broadcast-domain. After a few minutes one of the nodes went down, and when it came back up, the following message appeared.

 

***********************
** SYSTEM MESSAGES **
***********************

The contents of the root volume may have changed and the local management
configuration may be inconsistent and/or the local management databases may be
out of sync with the replicated databases. This node is not fully operational.
Contact support personnel for the root volume recovery procedures

 

After a while, the other node also showed the same message.

 

I cannot see my root volume in the clustershell, and I cannot use any of the vol/volume subcommands....

 

Has anyone had a similar problem with vsim 8.3?

How can I recover or repair my root volumes from the clustershell?

 

If I enter the nodeshell, I have no option to create additional volumes using the vol command...

 

I'm stuck.

 

Thank you

Bash


haopengl

1.  Bring the node to the LOADER prompt:

::*>halt -node <node>

2.  Check to see if the following bootargs have been set:

LOADER>printenv bootarg.init.boot_recovery
LOADER>printenv bootarg.rdb_corrupt

3.  If either bootarg has been set to a value, unset it and boot ONTAP:

LOADER>unsetenv bootarg.init.boot_recovery
LOADER>unsetenv bootarg.rdb_corrupt
LOADER>bye

SeanHatfield

The sim's loader has no saveenv, so when you run bye, none of the changes will be saved. Instead you have to boot and let it proceed to at least the boot menu for changes made at the loader prompt to be persistent.
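In other words, the persistent version of step 3 above would look something like this (a sketch; boot_ontap is the loader's boot command in the sim):

LOADER> unsetenv bootarg.init.boot_recovery
LOADER> unsetenv bootarg.rdb_corrupt
LOADER> boot_ontap
(let the node reach at least the boot menu before halting it again, so the change sticks)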


Emile-Bodin

You can connect to the node management IP and run:

::> node run local

 

This will drop you into the nodeshell, where you should be able to manage the volumes 7-Mode style.

 

cmemile01::*> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
cmemile01-01> vol status
         Volume State           Status            Options
        vol0_n1 online          raid4, flex       root, nosnap=on, nvfail=on
                                64-bit

 

What do you get when you run "::> cluster show" from the clustershell?

If you get RPC errors and the cluster cannot communicate over the interconnect, the RDB is out of sync; try to undo the changes you made to the MTU.

 

::> net port modify -node cmemile01-01 -port e0a -mtu 1500
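Since in this case the MTU was changed at the broadcast domain level, reverting it there should push 1500 back to all member cluster ports; something like:

::> network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 1500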

 

Another option is to redeploy the sims and start fresh.

 

 

bashirgas

 

I can access the nodeshell and the output looks OK:

 

n2> vol status vol0
         Volume State           Status            Options
           vol0 online          raid_dp, flex     root, nvfail=on
                                64-bit
                Volume UUID: b450b0c7-fe1f-4d49-adf5-160776524770
                Containing aggregate: 'aggr0_n2'

 

vol0 has no space issues:

n2> df -g vol0
Filesystem               total       used      avail capacity  Mounted on
/vol/vol0/                 7GB        1GB        5GB      19%  /vol/vol0/

 

n2> vol
The following commands are available; for more information
type "vol help <command>"
autosize    lang       options    size
container   offline    restrict   status
destroy     online

 

I cannot create new volumes inside the nodeshell.

 

clu1::> aggr show -fields has-mroot
aggregate has-mroot
--------- ---------

Warning: Only local entries can be displayed at this time.
aggr0_n2 true
n2aggr false
2 entries were displayed.

 

The volume subcommands are not available:

clu1::*> volume ?

Error: "" is not a recognized command

clu1::*> volume

 

Only the node management LIF is available:

clu1::*> net int show
(network interface show)
            Logical    Status     Network            Current       Current Is
Vserver     Interface  Admin/Oper Address/Mask       Node          Port    Home
----------- ---------- ---------- ------------------ ------------- ------- ----
clu1
            c1n2mgmt   up/up      192.168.101.37/28  n2            e0c     true

 

cluster show is not available, and cluster setup fails:

 

clu1::*> cluster setup

Error: command failed: Exiting the cluster setup wizard. The root volume is damaged. Contact support personnel for the
root volume recovery procedures. Run the "cluster setup" command after the recovery procedures are complete.

 

I changed the MTU back on both nodes in the cluster, but something caused the cluster configuration to be lost; I cannot figure out what!

 

clu1::*> net port show
(network port show)
                                                           Speed (Mbps)
Node   Port      IPspace      Broadcast Domain Link   MTU  Admin/Oper
------ --------- ------------ ---------------- ----- ----- ------------
n2
       e0a       Cluster      -                up     1500 auto/1000
       e0b       Cluster      -                up     1500 auto/1000
       e0c       Default      -                up     1500 auto/1000
       e0d       -            -                up     1500 auto/1000
       e0e       -            -                up     9000 auto/1000
5 entries were displayed.

 

At the advanced or diagnostic privilege level, I cannot run cluster ring show. It's like the cluster services are not available!

Never seen this before.

 

 

SeanHatfield

I think you've hit on an interesting scenario.  Jumbo frames don't work on VMware Fusion, and by attempting to use jumbo frames for the cluster network something has gone wrong with the RDBs.

Did they panic and reboot?

Do you know which one had epsilon?  

 


bashirgas

That's what I noticed. All 5 RDBs are gone, but somehow the clustershell remains!

No panic, but node2 (the one with the epsilon role) rebooted.

 

 

 

 

SeanHatfield

I tried to break one on purpose last night, but it's still up this morning. Which version are you using specifically? 8.3 GA, RC1? My test box was 8.3.1.

I modified the MTU with:

network port broadcast-domain modify -broadcast-domain Cluster -ipspace Cluster -mtu 9000

Cluster ping identifies the MTU problem, but operationally it's all still working:

cluster1::*> cluster ping-cluster -node tst831-02
Host is tst831-02
Getting addresses from network interface table...
Cluster tst831-02_clus1 tst831-02 e0a       169.254.181.27 
Cluster tst831-02_clus2 tst831-02 e0b       169.254.59.32  
Cluster tst831_clus1    tst831-01 e0a       169.254.115.129 
Cluster tst831_clus2    tst831-01 e0b       169.254.113.38 
Local = 169.254.181.27 169.254.59.32
Remote = 169.254.115.129 169.254.113.38
Cluster Vserver Id = 4294967293
Ping status:
.... 
Basic connectivity succeeds on 4 path(s)
Basic connectivity fails on 0 path(s)
................ 
Detected 1500 byte MTU on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
Larger than PMTU communication fails on 4 path(s):
    Local 169.254.181.27 to Remote 169.254.113.38
    Local 169.254.181.27 to Remote 169.254.115.129
    Local 169.254.59.32 to Remote 169.254.113.38
    Local 169.254.59.32 to Remote 169.254.115.129
RPC status:
2 paths up, 0 paths down (tcp check)
2 paths up, 0 paths down (udp check)

bashirgas

Only the node management LIFs and a clustershell (via direct SSH to the node) are available.

There are no cluster LIFs available.

Neither cluster show nor cluster ring show is available.

 

 

The nodeshell does not give me the option to create new volumes. I could reinstall a new two-node cluster, but I would like to repair vol0 if there is a way.

 

Thank you

SeanHatfield

OK. You need to try to revive the RDBs. Shut down the non-epsilon node, halt the epsilon node, and boot to the loader. Unset the boot_recovery bootarg and see if it will come back up.

 

unsetenv bootarg.init.boot_recovery

If the epsilon node comes back up, make sure the cluster ports are at MTU 1500, then try to bring up the other node.
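As a rough sketch of the whole sequence (node names are placeholders; the sim's loader uses the VLOADER prompt):

::> system node halt -node <non-epsilon-node>
::> system node halt -node <epsilon-node>
VLOADER> unsetenv bootarg.init.boot_recovery
VLOADER> boot_ontap
::> network port show          (confirm the cluster ports are back at MTU 1500)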

 

 


Davidjai

Hi, this is Jai. My root volume is damaged and I can't recover it from the error. I used "unsetenv bootarg.init.boot_recovery", but this command is not working; I am getting the same problem again and again.

 

 

Thanks in advance.

SeanHatfield

Then you likely need to make some free space first.

 

Start by deleting aggregate snapshots and vol0 snapshots if any are present, then try it again.
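A nodeshell sketch of that cleanup (the node name, aggregate name, and snapshot names are placeholders):

n1> snap list -A aggr0
n1> snap delete -A aggr0 <snapshot-name>
n1> snap list vol0
n1> snap delete vol0 <snapshot-name>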

 

 

 


Davidjai

Thank you so much Mr. Hatfield, I got the solution.

 

Regards and Thanks

 

JAI

 

 

rsankuru

Hey @Davidjai, how did you resolve this issue?

Davidjai

I just followed the recovery procedure and used "unsetenv bootarg.init.boot_recovery".

 

And I deleted the damaged volume in CLI mode.

SupraJari

Hi, I'm trying to recover with the 9.3 simulator.

 

VLOADER> unsetenv bootarg.init.boot_recovery

no such file or directory

 

Also, printenv does not show such an option.

 

Has something changed in 9.3?

 

-Jari

SeanHatfield

That's the response you get when you try to unset a variable that has not been set. Maybe something changed, or maybe you are hitting a different scenario. What's the message you get after a normal boot?
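Before unsetting anything, it may be worth checking which recovery bootargs are actually set, as in the steps earlier in this thread:

VLOADER> printenv bootarg.init.boot_recovery
VLOADER> printenv bootarg.rdb_corrupt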

 


Greg_Wilson

This option is no longer there in 9.5:

unsetenv bootarg.rdb_corrupt

Any ideas how to recover a cluster from a power outage?

tdada

SOLVED! I had a single-node cluster (vsim 9.0) running on VMware Workstation 11.

 

I interrupted the normal boot to get into the VLOADER and ran the command "unsetenv bootarg.init.boot_recovery".

 

Rebooted and problem solved. Cluster now healthy. Cheers.
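For anyone else trying this, the interrupt looks roughly like the following (a sketch; the exact autoboot prompt text can vary between vsim releases):

Hit [Enter] to boot immediately, or any other key for command prompt.
<press any key other than Enter>
VLOADER> unsetenv bootarg.init.boot_recovery
VLOADER> boot_ontap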

 

 

bashirgas

Hi,

 

Removing the recovery flag "unsetenv bootarg.init.boot_recovery" did the trick.

 

I did it first on the node with the epsilon role. When it came back online, the RDBs started one after another, beginning with mgmt, vifmgr, bcomd... and so on, of course after a few RPC connect errors. The vldb took a while to start, but it could fetch all the info about my volumes again 🙂

 

Then I ran it on the other node. Now my node management and clustershell, as well as OnCommand System Manager, are fully functional again.
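(For anyone verifying the same recovery: RDB health can be checked at the advanced privilege level with something like the following.)

::> set -privilege advanced
::*> cluster ring show
::*> cluster show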

 

What I want to know, if you can assist me:

 

1) What did unsetenv bootarg.init.boot_recovery do to the boot process and logs in order to bypass a set of configurations and try to launch the RDBs again?

2) What triggered the continuation of my cluster inconsistency after I switched the MTU settings back, right after the node with epsilon rebooted?

 

Thank you so much for your professional help.

 

// Bash

 
