Solved: Unable to recover the local database of Data Replication Module

Peetrk · ‎2021-07-07

I installed a 2-node cluster using simulator 9.8.

The systems went down ungracefully, and now after startup both nodes report the following:

node-01:

nac101::> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
nac101-01> vol status vol0
         Volume State           Status                Options
           vol0 online          raid_dp, flex         root, nvfail=on, space_slo=none
                                64-bit
Volume UUID: b33f166a-7c4a-4fbe-9d97-bb7e60809652
Containing aggregate: 'aggr0_nac101'
nac101-01> df -g vol0
Filesystem               total       used      avail capacity  Mounted on
/vol/vol0/                 0GB        0GB        0GB      82%  /vol/vol0/
/vol/vol0/.snapshot        0GB        0GB        0GB     263%  /vol/vol0/.snapshot

node-02:

***********************
** SYSTEM MESSAGES **
***********************

CRITICAL. This node is not healthy because the root volume is low on space
(<10MB). The node can still serve data, but it cannot participate in cluster
operations until this situation is rectified. Free space using the nodeshell or
contact technical support for assistance.

Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.

nac101::> node run local
Type 'exit' or 'Ctrl-D' to return to the CLI
nac101-02> df -g vol0
Filesystem               total       used      avail capacity  Mounted on
/vol/vol0/                 0GB        0GB        0GB     100%  /vol/vol0/
/vol/vol0/.snapshot        0GB        0GB        0GB     541%  /vol/vol0/.snapshot
nac101-02> vol status vol0
         Volume State           Status                Options
           vol0 online          raid_dp, flex         root, nvfail=on, space_slo=none
                                64-bit
Volume UUID: 0c714e69-f7c9-4590-847c-a5f6a4b27677
Containing aggregate: 'aggr0_nac102'

Please advise how to recover, and how to gracefully shut down simulator node 1 and node 2.

hmoubara · ‎2021-07-07

Hello,

Most likely something is filling up the root volume, so you will either have to clear out whatever is using the space or grow the volume. The KB article below can help you check what is filling the root volume.

https://kb.netapp.com/Advice_and_Troubleshooting/Data_Storage_Software/ONTAP_OS/How_to_find_what_is_filling_a_node's_root_volume

As for a graceful shutdown, you can perform a takeover and giveback, one node at a time.

Thanks

Peetrk · ‎2021-07-08

Thank you for your response

I can't access the KB article; opening the link ends in: "You do not have permission to view this page."

Peter
Platform Engineer

Peetrk · ‎2021-07-08

About the shutdown situation:

I've set up a simulator cluster in my personal lab, which I want to shut down at the end of the day (EOB).

Shutting down the simulator nodes/cluster damages the cluster database and system disks, so any time I want to work with the simulator I have to rebuild the nodes and cluster from scratch, which takes a lot of time.

Is there another way?

hmoubara · ‎2021-07-08

Hello,

Sorry, I missed that you are working with a simulator.

You have to shut the simulator down properly to avoid data loss, because the non-volatile RAM (NVRAM) is simulated and is not persistent. You can do either of the following:

1. Issue a "shut down guest" from VMware.

2. Issue the halt command from the CLI, wait until it completes, then power the VM off manually.

As for the space issue, you will need to give the root volume more space or check what is eating the space. For example, check whether any core dumps are saved for that node and remove them, or delete the snapshots of the root volume:

cluster::> system node coredump show

cluster::> system node coredump delete-all

To check for snapshots on vol0 and delete them:

cluster::> system node run -node <node name> -command "snap list vol0"

cluster::> system node run -node <node name> -command "snap delete -a vol0"
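To make the df output from the original question easier to read before and after deleting snapshots, here is a minimal Python sketch that parses the pasted nodeshell `df` lines and flags the two conditions seen in this thread. The function names, the regular expression and the 95% threshold are my own assumptions, not anything provided by ONTAP; the sample text is copied from the node-02 output above.

```python
import re

# Matches one data row of nodeshell `df` output as pasted in this thread:
# filesystem, total, used, avail, capacity percent, mount point.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)%\s+(\S+)$")


def parse_df(text):
    """Return a list of (filesystem, capacity_percent) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # the header line has no percent figure, so it is skipped
            rows.append((m.group(1), int(m.group(5))))
    return rows


def findings(rows, full_at=95):
    """Describe rows that need attention; full_at is an arbitrary threshold."""
    out = []
    for fs, pct in rows:
        is_snap = fs.endswith("/.snapshot")
        if is_snap and pct > 100:
            out.append(f"{fs}: snapshots exceed their reserve ({pct}%)")
        elif not is_snap and pct >= full_at:
            out.append(f"{fs}: active file system nearly full ({pct}%)")
    return out


# Sample copied from the node-02 output in the original question.
NODE2_DF = """\
Filesystem total used avail capacity Mounted on
/vol/vol0/ 0GB 0GB 0GB 100% /vol/vol0/
/vol/vol0/.snapshot 0GB 0GB 0GB 541% /vol/vol0/.snapshot
"""

for finding in findings(parse_df(NODE2_DF)):
    print(finding)
```

A .snapshot line above 100% is the one to watch: it means the snapshots have outgrown their reserve, so deleting them should give space back to vol0.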

Thanks

Peetrk · ‎2021-07-08

Thank you, this helped to make the node available for login again. The cluster, however, still does not respond.

There were no core dumps, but there were snapshots, and I deleted them all.

After a reboot I can log in to the node management IP, but this is what is holding up the cluster management service:

***********************
** SYSTEM MESSAGES **
***********************

Internal error. Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.

jcolonfzenpr · ‎2021-07-08

Hello,

On simulator 9.8, you can run these commands on each node of the cluster:

::> set -priv diag

::*> system configuration recovery node mroot-state clear -recovery-state rdb

::*> reboot

Hope this helps.

Jonathan Colón

Peetrk · ‎2021-07-09

The recovery was successful.

When I log in to the node:

***********************
** SYSTEM MESSAGES **
***********************

CRITICAL. This node is not healthy because the root volume is low on space
(<10MB). The node can still serve data, but it cannot participate in cluster
operations until this situation is rectified. Free space using the nodeshell or
contact technical support for assistance.

It looks like I'm closing in on the final healthy setup of the cluster.

hmoubara · ‎2021-07-09

Hello,

It seems that something is eating space on the root volume, or your root volume is too small.

Common practice is to add another disk to the aggregate, grow the root volume, and turn off snapshots.

cluster> node run local

node> snap delete -a vol0

node> vol options vol0 nosnap on

node> ctrl+D

cluster> reboot

Reboot the simulator and that should do the trick.
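As a rough illustration of why deleting snapshots and setting nosnap helps: when df shows .snapshot above 100%, the snapshots have outgrown the snapshot reserve, and the excess is taken from the active file system of vol0. A minimal Python sketch of that arithmetic follows; the function name and the 40 MB reserve are hypothetical, and only the 541% figure comes from this thread.

```python
def snapshot_spill_kb(reserve_kb, capacity_pct):
    """KB of snapshot data that overflow the reserve into the active file system."""
    used_kb = reserve_kb * capacity_pct // 100
    return max(0, used_kb - reserve_kb)


# Hypothetical 40 MB snapshot reserve; 541% is the figure node-02 reported.
print(snapshot_spill_kb(40 * 1024, 541))
```

With a small root volume like the simulator's, even a modest reserve overrun can account for most of the missing free space.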

Thanks

jcolonfzenpr · ‎2021-07-09

Take a look at this link:

https://community.netapp.com/t5/Simulator-Discussions/Question-about-Simiulator-vol0-usage-size/m-p/164595/highlight/true#M2702

Jonathan Colón

Peetrk · ‎2021-07-12

I did everything that was advised, but the simulators kept falling into a stalled state over and over, with the low-space and database-recovery events coming back.

I installed simulator 9.8 on both VMware Workstation and ESXi 7.0, with the same issues on both.

After installing simulator 9.7 instead, everything is OK.