Re: Problem recovering from steelstore critical failure

justanotheritguy · ‎2015-07-30

I was moving large folders (several GB) from one top-level folder (served by SteelStore CIFS) to another using Windows File Explorer when the copy failed halfway through. At that point the entire SteelStore appliance stopped responding, including the web console (it displayed a basic SteelStore page indicating that the web server was up but no console was running behind it, so not quite a 404 error).

I waited 24 hours and the web console eventually returned of its own accord; however, the alarms triggered were:

Appliance Health: Critical

Storage Optimization Service: Critical

Storage Optimization Service Down: Critical

A colleague recommended I stand up a new SteelStore appliance (now based on NetApp) and follow the upgrade procedure in the "NetApp® AltaVault® Cloud Integrated Storage 4.0 Installation and Service Guide for Cloud Appliances" (ECMP12455065), which I did.

The first hurdle (after importing the config from the old console and attaching the required volumes to the new instance except /dev/sda1 and /dev/sdk) was when I executed the "megastore guid reset" line in the doco....

I received the warning/error: "Deleting megastore.guid in cloud bucket returned 110", about which I can find no information online.
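One possible reading of that return value, offered as an assumption rather than anything taken from the documentation: if `110` is a raw Linux errno passed through by the appliance, it corresponds to ETIMEDOUT (connection timed out), which would point at the network path to the bucket rather than at the GUID file itself. A minimal Python sketch of that lookup, using a small hand-written table of standard Linux errno values (the function name is hypothetical):

```python
# Assumption: the "returned 110" value is a raw Linux errno.
# The table holds standard Linux errno numbers for common network failures.
LINUX_NET_ERRNOS = {
    101: ("ENETUNREACH", "Network is unreachable"),
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def describe_return_code(code: int) -> str:
    """Return a readable description for a suspected Linux errno value."""
    name, text = LINUX_NET_ERRNOS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    print(describe_return_code(110))
```

If the appliance uses its own numbering for this message, the mapping above does not apply.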

I pushed forward and ran "service enable", and after many hours of waiting on "Starting optimization service..." I got a "Storage Optimization Service: initialization error".

The current console shows the same alarm and service states:

Alarms Triggered:

Appliance Health: Critical

Storage Optimization Service: Critical

Storage Optimization Service Down: Critical

Optimization Service:

Service: running

Status: not ready
Mode: Optimized for backup workloads

I am not sure how to proceed in recovering the system to get access to the data in the S3 bucket. I want to decommission this appliance anyway, but I want to recover the data that it served, as some of it is important. There isn't a lot of doco on recovery from critical failure, so any advice would be appreciated. I'll include some other ssh console output (below) which I've logged along the way in case it helps.

steelstore01 (config) # show log
ng error message to mgmtd
Jul 28 23:24:07 steelstore01 rfsd[6600]: [replicator.ERR] (6602) test_cloud failed in restore mode:invalid argument
Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.ERR] (6602) Cloud test failed: invalid argument
Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.ERR] (6602) Cloud test failed
Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.INFO] (6600) tearing down RfsContext
Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.INFO] (6600) Megamount not running
Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.INFO] (6600) Shutting down backend threads
Jul 28 23:24:07 steelstore01 rfsd[6600]: [mgmt/mgmtd.NOTICE] (6600) rfsd sent event to mgmtd: /rbt/rfsd/events/notready
Jul 28 23:24:07 steelstore01 mgmtd[2443]: [mgmtd.INFO]: EVENT: /rbt/rfsd/events/notready
Jul 28 23:24:07 steelstore01 mgmtd[2443]: [mgmtd.INFO]: in rfsd_notup
Jul 28 23:24:07 steelstore01 mgmtd[2443]: [mgmtd.ERR]: Error no message binding from rfsd.
Jul 28 23:24:07 steelstore01 cli[31032]: [cli.INFO]: user admin: Executing command: show log
Jul 28 23:24:07 steelstore01 cli[31032]: [cli.INFO]: user admin: Command show log authorized
lines 24877-24889/24889 (END) steelstore01 (config) #
steelstore01 (config) #
steelstore01 (config) # show log
Jul 29 09:03:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Calling module apply function for 15 modules
Jul 29 09:03:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished calling module apply functions for 14 modules
Jul 29 09:03:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished database commit
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Starting database commit
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Commit side-effects loop executed 1 times
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Calling node apply functions and sysctl key handling functions for 0 nodes
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished calling node apply functions for 0 nodes
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished applying sysctl node values for 0 nodes
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Calling module apply function for 15 modules
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished calling module apply functions for 14 modules
Jul 29 09:13:15 steelstore01 mgmtd[2443]: [mgmtd.INFO]: Finished database commit
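With almost 25,000 lines in the log, it can help to pull out only the error-level entries before reading further. The following is a rough Python sketch for doing that on a saved copy of the log, run on a workstation rather than on the appliance. It assumes every entry carries a `[component.LEVEL]` tag like the ones in the excerpt above; the function name and sample text are illustrative only.

```python
import re
from typing import List, Sequence

# Matches the "[component.LEVEL]" tag seen in the log excerpt,
# e.g. "[rfsd.ERR]" or "[mgmt/mgmtd.NOTICE]". The process id in
# "rfsd[6600]" is skipped because it contains no dot.
TAG = re.compile(r"\[(?P<component>[\w/]+)\.(?P<level>[A-Z]+)\]")


def error_lines(log_text: str, levels: Sequence[str] = ("ERR",)) -> List[str]:
    """Return the log lines whose level tag is one of `levels`."""
    hits = []
    for line in log_text.splitlines():
        match = TAG.search(line)
        if match and match.group("level") in levels:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    sample = (
        "Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.ERR] (6602) Cloud test failed\n"
        "Jul 28 23:24:07 steelstore01 rfsd[6600]: [rfsd.INFO] (6600) Shutting down backend threads\n"
        "Jul 28 23:24:07 steelstore01 mgmtd[2443]: [mgmtd.ERR]: Error no message binding from rfsd.\n"
    )
    for line in error_lines(sample):
        print(line)
```

Passing `levels=("ERR", "NOTICE")` widens the filter to include the notready event as well.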

chriswong · ‎2015-07-31

Hi,

I'm sorry to hear you're having problems. While I don't know the cause of the original problem (the SteelStore going down during the copy operation), I can assist at this time with the proper move of the appliance.

The first question I need to clarify, however, is whether you're using a virtual SteelStore (AltaVault) appliance or a cloud-based SteelStore (AltaVault) appliance. The information below suggests that you're using a virtual appliance (i.e. VMware), but the manual you pointed out is the cloud-based appliance manual. The virtual appliance install guide is actually this one: https://library.netapp.com/ecm/ecm_download_file/ECMP12455064. Assuming you're on a virtual appliance and attempting to upgrade from SteelStore 3.x to AltaVault 4.0, you would follow the directions listed in Appendix A.

To provide additional clarification around steps 5-8:

5. Deploy a new 4.0.0 AltaVault virtual appliance. Do not go through the step of adding a second disk (you'll reattach the original SteelStore one to this AltaVault instance).

5a. Power on the appliance, and run through the CLI wizard to IP the appliance management interface.

5b. Connect to the appliance GUI.
6. Import a shared-only configuration on a 4.0 VM:
a. Go to the UI and choose Settings > Setup Wizard.
b. Select Import Configuration.
c. Click on the option, Import Shared Data Only, while specifying the configuration file to import.

d. After the import completes, do not restart the service! Connect back to the appliance CLI.
7. Reset the Megastore GUID on the 4.0 AltaVault virtual appliance by using the CLI command:
CLI > megastore guid reset
8. Associate the 3.x datastore disks from vCenter to the 4.0 VM:
a. Navigate to the VM in the vCenter UI.
b. Right-click on the VM and choose Edit Settings.
c. Click on the tab, Virtual Hardware.
d. Select New device > Existing Hard Disk, and click Add.
e. Select the path of the disk file that you noted in the previous Step 4e.
f. Click OK.

If you need more information about doing DR (which would be different from attempting to migrate the datastore disk from the 3.x to AltaVault 4.0), you can actually refer to the Deployment Guide chapter on DR. That guide is available here: https://library.netapp.com/ecm/ecm_download_file/ECMP12434738

Let us know if the above helps.

Regards,

Christopher

justanotheritguy · ‎2015-08-02

Hi Chris,

Thank you for your reply. The original system that failed was actually an Amazon AWS steelstore virtual appliance backed by an S3 bucket. The bucket is still available and I used these instructions - sorry I included the wrong link in my original post!

All other information I've given is accurate. After following all the steps in the chapter "Launching and Configuring the New AltaVault AMI Instance Upgrade" it still shows:

Alarms Triggered:

Appliance Health: Critical

Storage Optimization Service: Critical

Storage Optimization Service Down: Critical

Optimization Service:

Service: running

Status: not ready
Mode: Optimized for backup workloads

Are there any special commands I can run to determine what part of the system is failing?

Thank you.

chriswong · ‎2015-08-03

Hi Jason,

OK, sorry for the confusion. Most of the information in my last update above won't apply, so disregard it.

To reiterate: You've got the cloud-based AMI AltaVault instance, you tried to use the instructions from "Ch 2. Upgrade Process for AltaVault AMI", and after the steps were taken you failed to start the service successfully. You noted that the megastore GUID reset didn't work correctly - this could potentially be one of the problems. The megastore.GUID file is used to track ownership of the AltaVault that owns the cloud bucket. In your case, we issue a reset to say this new (AltaVault) appliance will now be the owner of the bucket, rather than the previous instance which you powered down during these instructions.

I'm not clear why the megastore GUID reset didn't work, but I think we need to resolve that error first, which will probably help things along. Assuming that the migration steps were handled correctly, let's do the following:

1. Reboot the appliance from the CLI:

enable

conf t

no service enable

reload

2. When the appliance comes back up, reconnect to it via CLI and issue:

enable

conf t

no service enable (it should indicate the service is still disabled)

megastore guid reset

service enable

3. If the service fails to start up at that point, please copy the system log from the GUI (Settings > System Log) and email it to me (christopher.wong@netapp.com) to review. Make sure it captures messages from the megastore guid reset forward; note that these may span more than one page, depending on where the page breaks occur. Errors are colored red.

Thanks,

Christopher

chriswong · ‎2015-08-03

Hi,

Sorry, I don't know why I called you Jason (must've misread it from somewhere else), just realized that right now! 🙂 Anyways, I wanted to provide you another technical diagnosis step. From the CLI, can you issue:

enable

conf t

cloudctl exec "-a list"

and provide the output that appears? If it is successful it will connect to the cloud provider (I'm assuming Amazon) and provide you a listing of the buckets.

If the listing doesn't appear and you get an error, this could indicate a problem with the credentials, or with the IAM permissions of the user whose credentials you've applied to the appliance. IAM security requirements are listed in the appendix of the AltaVault/SteelStore user guide. For example: https://library.netapp.com/ecm/ecm_download_file/ECMP12031271

Thanks,

Christopher

chriswong · ‎2015-08-06

Hi,

Checking in to see if you were able to get the AltaVault appliance upgrade done to resolve the error?

Regards,

Christopher

justanotheritguy · ‎2015-08-09

Hi Chris,

Thanks for the follow-up. I ran your suggested command and left it to run overnight (it took a while), but it seems to have timed out.

steelstore01 (config) # cloudctl exec "-a list"

Failed to get bucket list: 7: Couldn't connect to server : Connection timed out

I was very careful to disconnect and connect the appropriate drives as per the upgrade documentation (i.e. moving sdb, sdc, sdd, sde, sdf, sdg, sdh and sdi across from the old SteelStore and leaving the existing /dev/sda1 and sdk).

Any ideas? Thank you.

chriswong · ‎2015-08-10

Hi,

Thanks for the information. Did you also try the megastore GUID reset, or just the cloudctl command? The cloudctl command usually takes only a little time to run (about 5-10 seconds at most), so this is unusual. The timeout suggests that the appliance is unable to connect to the cloud storage target, possibly because the cloud configuration is not set. You can double-check this by going back to the GUI, selecting Storage > Cloud Settings, and verifying or re-applying the cloud credentials to see if it can connect. Note that after doing this you will probably also need to repeat the steps from my earlier response about resetting the megastore GUID:

1. Reboot the appliance from the CLI:

enable

conf t

no service enable

reload

2. When the appliance comes back up, reconnect to it via CLI and issue:

enable

conf t

no service enable (it should indicate the service is still disabled)

megastore guid reset

service enable

3. If the service fails to start up at that point, please copy the system log from the GUI (Settings > System Log) and email it to me (christopher.wong@netapp.com) to review. Make sure it captures messages from the megastore guid reset forward; note that these may span more than one page, depending on where the page breaks occur. Errors are colored red.
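Separately from the steps above, it may be worth confirming basic network reachability to S3. The `7` in the cloudctl message matches libcurl's "couldn't connect" error code (that cloudctl reports libcurl codes is an assumption on my part), and a connect timeout from an EC2 instance is often down to security group, route table or NAT/internet gateway settings rather than credentials. Below is a minimal Python sketch of a plain TCP check, intended to be run from another host in the same subnet as the appliance, since I am assuming the appliance CLI offers no Python. The endpoint shown is only an example and should be replaced with the S3 endpoint for the bucket's region.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[bool, str]:
    """Try a plain TCP connection and report whether it succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Example endpoint only; substitute the endpoint for the bucket's region.
    ok, detail = can_connect("s3.amazonaws.com")
    print(ok, detail)
```

If this check also times out from the same subnet, the problem is more likely in the VPC networking than in the AltaVault configuration.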

Regards,

Christopher

ITnuB · ‎2015-08-19

I am having the same problem on a 3030 appliance. I am getting a "storage optimization service not ready - initialization error", but I am also still working through some IAM list permission issues on the S3 side that I have to resolve with a third party. Will the optimization service remain in the not-ready state until I successfully connect to AWS?

chriswong · ‎2015-08-19

Hi,

Yes, that is correct: the service cannot enter the healthy state until the appliance can properly connect to and communicate with a cloud storage target. Note that while the service initialization error is the same, the underlying cause is significantly different from the one discussed higher up in this thread (which pertains to the cloud-based AltaVault AMI appliance).

Regards,

Christopher

TonyWu · ‎2016-08-26

Hi Chris,

The AVA 4.2 for Hyper-V has the initialization error. Please advise.

I have already emailed the log and screen captures to you.

Many thanks.

Best Regards,

Tony