Simulator Discussions

Unable to recover the local database of Data Replication Module

Peetrk
5,728 Views

Installed a 2-node cluster in simulator 9.8.

 

The systems did NOT go down gracefully, and now after startup both systems report the following:

 

node-01:

nac101::> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
nac101-01> vol status vol0
         Volume State           Status                Options
           vol0 online          raid_dp, flex         root, nvfail=on, space_slo=none
                                64-bit
Volume UUID: b33f166a-7c4a-4fbe-9d97-bb7e60809652
Containing aggregate: 'aggr0_nac101'
nac101-01> df -g vol0
Filesystem               total       used      avail capacity  Mounted on
/vol/vol0/                 0GB        0GB        0GB      82%  /vol/vol0/
/vol/vol0/.snapshot        0GB        0GB        0GB     263%  /vol/vol0/.snapshot

 

node-02:

***********************
** SYSTEM MESSAGES **
***********************

CRITICAL. This node is not healthy because the root volume is low on space
(<10MB). The node can still serve data, but it cannot participate in cluster
operations until this situation is rectified. Free space using the nodeshell or
contact technical support for assistance.

Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.


nac101::> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
nac101-02> df -g vol0
Filesystem               total       used      avail capacity  Mounted on
/vol/vol0/                 0GB        0GB        0GB     100%  /vol/vol0/
/vol/vol0/.snapshot        0GB        0GB        0GB     541%  /vol/vol0/.snapshot
nac101-02> vol status vol0
         Volume State           Status                Options
           vol0 online          raid_dp, flex         root, nvfail=on, space_slo=none
                                64-bit
Volume UUID: 0c714e69-f7c9-4590-847c-a5f6a4b27677
Containing aggregate: 'aggr0_nac102'

 

Please advise how to recover, and how to gracefully shut down simulator node 1 and node 2.

1 ACCEPTED SOLUTION

Peetrk
5,494 Views

Did everything advised; the simulators kept falling into a stalled state, with the low-space and database recovery events coming back over and over.

 

Installed simulator 9.8 on VMware Workstation and ESXi 7.0; same issues on both.

 

After installing simulator 9.7 instead, all is OK.


10 REPLIES

hmoubara
5,687 Views

Hello, 

 

Most likely something is filling up the root vol, so you will either have to clear it out or grow the volume. You can review the KB below to help you check what is filling the root vol.

 

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_find_what_is_filling_a_node's_root_volume

 

As for doing a graceful shutdown, you can perform a takeover and giveback one node at a time.

 

Thanks

Peetrk
5,672 Views

Thank you for your response

 

I can't access the KB article; opening the link ends in: "You do not have permission to view this page."

Peter
Platform Engineer

Peetrk
5,671 Views

About the shutdown situation:

I've set up a simulator cluster in my personal lab, which I want to shut down at end of business.

Shutting down the simulator nodes/cluster damages the cluster database and system disks, so any time I want to work with the simulator I have to rebuild the nodes and cluster from scratch, which takes a lot of time.

 

Is there another way?

hmoubara
5,668 Views

Hello,


Sorry, I missed that you are working with a simulator.

For the shutdown, you have to shut the simulator down properly to avoid data loss, because the non-volatile RAM is simulated and is not persistent. You can do either of the following:

1. Issue a "shut down guest" from VMware.

2. Issue the halt command from the CLI, wait until it completes, then power the VM off manually.

 

As for the space issue, you will need to give the root vol more space or check what's eating the space. For example, check whether any coredumps are saved for that node and remove them, or delete any snapshots of the root vol:

 

cluster::> system node coredump show

cluster::> system node coredump delete-all

 

To check for snapshots on vol0 and delete them:

 

cluster::> system node run -node <node name> -command "snap list vol0"

cluster::> system node run -node <node name> -command "snap delete -a vol0"
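
If you want to check the df output from several nodes quickly, here is a minimal Python sketch that pulls the capacity column out of pasted nodeshell df text and lists anything at or above a limit. The function names and the 90% default are my own assumptions, not anything from ONTAP; it only parses text that looks like the df output shown earlier in this thread.

```python
def parse_df(text):
    """Return (filesystem, capacity_percent) pairs from pasted nodeshell df output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a /vol/ path; the 5th field is the capacity, e.g. "82%".
        if len(parts) >= 6 and parts[0].startswith("/vol/") and parts[4].endswith("%"):
            rows.append((parts[0], int(parts[4].rstrip("%"))))
    return rows


def over_threshold(text, limit=90):
    """Return the rows whose capacity is at or above `limit` percent (assumed default)."""
    return [(fs, pct) for fs, pct in parse_df(text) if pct >= limit]


if __name__ == "__main__":
    sample = """Filesystem total used avail capacity Mounted on
/vol/vol0/ 0GB 0GB 0GB 100% /vol/vol0/
/vol/vol0/.snapshot 0GB 0GB 0GB 541% /vol/vol0/.snapshot"""
    for fs, pct in over_threshold(sample):
        print(f"{fs} is at {pct}%")
```

A snapshot line far above 100% means the snapshots have outgrown their reserve and are spilling into the active file system, which is why deleting them frees space on vol0.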

 

Thanks

Peetrk
5,660 Views

Thank you, this helped to make the node available for login again; the cluster, however, still does not respond.

There were no coredumps, but there were snapshots, and I deleted them all.

 

After a reboot I can log in to the node management IP, but what is holding up the cluster management service is:

***********************
** SYSTEM MESSAGES **
***********************

Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.

jcolonfzenpr
5,641 Views

Hello,

 

If this is sim 9.8, you can run these commands on each node in the cluster:

 

::> set -priv diag

::*> system configuration recovery node mroot-state clear -recovery-state rdb

::*> reboot
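
Since two different login banners come up in this thread, here is a small Python sketch that maps each banner to the fix suggested for it here. The REMEDIES table and the suggest function are hypothetical names of mine; the phrases are copied from the banners quoted in this thread, and the fixes are only the ones suggested in the replies, not an official procedure.

```python
# Hypothetical mapping: banner phrase quoted in this thread -> fix suggested in this thread.
REMEDIES = {
    "root volume is low on space":
        "free space on vol0 (delete snapshots and coredumps) or grow the root volume",
    "Cannot open corrupt replicated database":
        "at diag privilege, clear the rdb recovery state with mroot-state clear, then reboot",
}


def suggest(banner):
    """Return the suggested fixes for every known phrase found in the banner text."""
    flat = " ".join(banner.split())  # undo the console line wrapping
    return [fix for phrase, fix in REMEDIES.items() if phrase in flat]


if __name__ == "__main__":
    banner = """Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled."""
    for fix in suggest(banner):
        print(fix)
```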

 

Hope this helps.

Jonathan Colón | Blog | Linkedin

Peetrk
5,611 Views

The recovery was successful.

When I log in to the node, I now see:

***********************
** SYSTEM MESSAGES **
***********************

CRITICAL. This node is not healthy because the root volume is low on space
(<10MB). The node can still serve data, but it cannot participate in cluster
operations until this situation is rectified. Free space using the nodeshell or
contact technical support for assistance.

It looks like I'm closing in on a healthy cluster setup.

hmoubara
5,603 Views

Hello,

 

It seems that something is eating space on the root vol, or your root vol is too small.

Common practice is to add another disk to the aggr, grow the root volume, and turn off snapshots.

 

cluster> node run local

node> snap delete -a vol0

node> vol options vol0 nosnap on

node> ctrl+D

cluster> reboot

 

Reboot the simulator and that should do the trick.

 

Thanks

 

 
