BlueXP Services

Data unavailable after AWS appliance migration

AaronHauck
4,141 Views
We have a situation where we've attempted to migrate a AWS AMI SteelStore appliance to a new instance.  We are now in a situation where the new instance is on line but it is not presenting any of the data previously stored by the original instance.
 
Full details of the problem are as follows:
  • Original instance was an AWS market place instance running from the following AWS AMI ID: SteelStore_apollo_730_GA-23c1c006-945c-41c7-83f2-f45427c2ef86-ami-70e67218.2 (ami-5f046a65) using S3 cloud storage.
  • We noticed some large disk queueing on the cache storage of the instance and wanted to change to SSD backed storage from the default magnetic.  Additionally we wanted to enable "EBS Optimised" on the instance.
  • Plan was to perform a "DR Migration" to a new instance, following the online documentation for this process.
Process was as follows:
  1. Log into original appliance, export configuration and shut down
  2. Launch a new instance from the AWS marketplace (Same VPC, subnet, security group etc. but enable EBS Optimised and configure all storage as SSD instead of magnetic)
  3. SSH to new instance, set the admin password, and initialise storage with the following commands:

    enable
    configure terminal
    username admin password 0 xxxxx
    aws setup data partition
    write memory
  4. Log in to the web interface of the new appliance and import the configuration exported from the original appliance.  Supply encryption key and import shared data only.
  5. SSH back into the appliance and set the datastore recovery mode with the following commands

    enable
    configure terminal
    no service enable
    replication recovery enable
    service enable

  6. Wait for the cloud restore process to complete (3+hours)
  7. Once the service is in a running state attempt to browse a CIFS share, most of which appear empty

    One share has a folder structure and a couple of files out of the thousands of expected files.  This share was previously pinned.

    Am able to write to any share.

  8. Attempt to run a prepopulation job on one of the shares which fails immediately.
  9. At this point I decided to roll back by stopping the new appliance and restarting the old appliance.  The storage optimisation service on the old appliance would not start.  There was a message in the logs saying something along the lines of "cloud is ahead of appliance" as per the following log.

    May 28 12:10:35 steelstore01 rfsd[5679]: [ctl.INFO] (5679) /etc/rfsd.conf set gc.mode=1
    May 28 12:10:35 steelstore01 rfsd[5679]: [ctl.ERR] (5679) no such node prepop.restore_percent_limit_per_hour
    May 28 12:10:35 steelstore01 rfsd[5679]: [ctl.INFO] (5679) delaying setting prepop.restore_percent_limit_per_hour=5: not available
    May 28 12:10:35 steelstore01 rfsd[5679]: [rfsd.INFO] (5681) rbt_internal=/data
    May 28 12:10:35 steelstore01 rfsd[5679]: [shelf_config.INFO] (5681) Adding shelf: path=/data device=
    May 28 12:10:35 steelstore01 rfsd[5679]: [shelf_config.INFO] (5681) Adding shelf evicter config: num_entries= 8388630 reserve=4194301 reserve_incr=500000 max_secs_between_syncs=1800 max_unsynced_changes=1000000
    May 28 12:10:35 steelstore01 mgmtd[2370]: [mgmtd.INFO]: session 36: accepted client 'rfsd-5679' for authenticated user 'admin' (uid 0, gid 0) authorization key 'admin', interactive 'false'
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) rbt_oids read from rbt_oids.dat on disk:
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) OID_PREFIX : 6360455
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) SLAB_PREFIX : 3368957
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ANCHOR_HASH_SYNC : 18690
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) HOT_SLAB_PREFIX : 2
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ESC_SLAB_PREFIX : 1099511699414
    May 28 12:10:35 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) PREPOP_TASK_PREFIX : 13
    May 28 12:10:35 steelstore01 rfsd[5679]: [ha.INFO] (5681) Local datastore is not empty
    May 28 12:10:35 steelstore01 rfsd[5679]: [ha.INFO] (5681) Configured as a standalone WW
    May 28 12:10:35 steelstore01 rfsd[5679]: [mgmt/mgmtd.INFO] (5682) connected to mgmtd
    May 28 12:10:35 steelstore01 statsd[2885]: [statsd.INFO]: session 4: accepted client 'rfsd-5679' for authenticated user 'admin' (uid 0, gid 0) authorization key 'admin', interactive 'false'
    May 28 12:10:35 steelstore01 rfsd[5679]: [ha.INFO] (5681) Nothing to do
    May 28 12:10:35 steelstore01 rfsd[5679]: [mgmt/statsd.INFO] (5683) connected to statsd
    May 28 12:10:35 steelstore01 rfsd[5679]: [rfsd.INFO] (5681) Continuing RfsContext initialization
    May 28 12:10:35 steelstore01 rfsd[5679]: [rfsd.NOTICE] (5681) Checking cloud connection.
    May 28 12:10:35 steelstore01 rfsd[5679]: [rfsd/graphite.INFO] (5684) Starting Graphite statistics thread
    May 28 12:10:36 steelstore01 rfsd[5679]: [replicator.INFO] (5681) Using bucket in01-steelstore-01.t1cloud.com
    May 28 12:10:36 steelstore01 rfsd[5679]: [replicator.INFO] (5681) Validating local and cloud copies of rbt_oids.dat
    May 28 12:10:36 steelstore01 rfsd[5679]: [replicator.INFO] (5681) Reading local rbt_oids.dat
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) rbt_oids read from rbt_oids.dat on disk:
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) OID_PREFIX : 6360455
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) SLAB_PREFIX : 3368957
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ANCHOR_HASH_SYNC : 18690
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) HOT_SLAB_PREFIX : 2
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ESC_SLAB_PREFIX : 1099511699414
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) PREPOP_TASK_PREFIX : 13
    May 28 12:10:36 steelstore01 rfsd[5679]: [replicator.INFO] (5681) Reading remote rbt_oids.dat
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) rbt_oids read from rbt_oids.dat.cloud on disk:
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) OID_PREFIX : 6394006
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) SLAB_PREFIX : 3380368
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ANCHOR_HASH_SYNC : 18759
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) HOT_SLAB_PREFIX : 2
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) ESC_SLAB_PREFIX : 1099511699455
    May 28 12:10:36 steelstore01 rfsd[5679]: [oidgen.INFO] (5681) PREPOP_TASK_PREFIX : 15
    May 28 12:10:36 steelstore01 rfsd[5679]: [replicator.CRIT] (5681) Cloud is ahead of appliance. Please contact Riverbed Support
    May 28 12:10:36 steelstore01 pm[2309]: [pm.ERR]: Output from rfsd: /opt/rbt/bin/rfsd (pid 5679) received signal 6 (SIGABRT) dumping core

  10. Attempt to clear the local data store on the original appliance failed as follows:

    # datastore format local
    WARNING: The local datastore has data pending to be replicated.
    Format local datastore? (y/N) y
    Could not stat Not --- No such file or directory

    The device apparently does not exist; did you specify it correctly?
    Failed to clean /data.
    Failed to format local datastore.

  11. At this time the new AMI AltaVault became avaiable so I attempted to deploy this version with a similar result but saw a little bit more information in the logs

    May 28 15:27:12 amnesiac rfsd[4055]: [restore_progress_meter.INFO] (4057) Cloud restore phase progress: 82.5% done, estimated time left: 00:36:16, rate: 529.4 objects per second
    May 28 15:27:40 amnesiac rfsd[4055]: [thread_tuner.INFO] (4508) Reached update time Current time = 15:27:40, Last calibration time = 15:24:40
    May 28 15:27:41 amnesiac hald[2985]: [hald.INFO]: Support query handler called
    May 28 15:27:41 amnesiac webasd[5908]: [web.INFO]: web: User admin viewing System Logs page.
    May 28 15:27:42 amnesiac rfsd[4055]: [restore_progress_meter.INFO] (4057) Cloud restore phase progress: 82.7% done, estimated time left: 00:35:39, rate: 529.4 objects per second
    May 28 15:27:45 amnesiac hald[2985]: [hald.INFO]: Support query handler called
    May 28 15:27:45 amnesiac webasd[5908]: [web.INFO]: web: User admin viewing System Logs page.
    May 28 15:27:47 amnesiac hald[2985]: [hald.INFO]: Support query handler called
    May 28 15:27:47 amnesiac webasd[5908]: [web.INFO]: web: User admin viewing System Logs page.
    May 28 15:27:50 amnesiac hald[2985]: [hald.INFO]: Support query handler called
    May 28 15:27:50 amnesiac webasd[5908]: [web.INFO]: web: User admin viewing System Logs page.
    May 28 15:28:05 amnesiac rfsd[4055]: [replicationdb.INFO] (4622) Level-0 table #2845: started
    May 28 15:28:05 amnesiac rfsd[4055]: [replicationdb.INFO] (4622) Level-0 table #2845: 3263661 bytes OK
    May 28 15:28:05 amnesiac rfsd[4055]: [replicationdb.INFO] (4622) Delete type=0 #2842
    May 28 15:28:12 amnesiac rfsd[4055]: [restore_progress_meter.INFO] (4057) Cloud restore phase progress: 83.0% done, estimated time left: 00:35:03, rate: 1285.7 objects per second
    May 28 15:28:42 amnesiac rfsd[4055]: [restore_progress_meter.INFO] (4057) Cloud restore phase progress: 83.2% done, estimated time left: 00:34:51, rate: 818.2 objects per second
    May 28 15:28:43 amnesiac rfsd[4055]: [namespace_restore.INFO] (4057) Cloud restore phase complete. Starting namespace generation phase
    May 28 15:28:45 amnesiac rfsd[4055]: [namespace_restore.NOTICE] (4057) File /data/fs/SQLBackups/IN01-SQL02-SQL0/RestoreCheckpointDB5.CKP has already been restored with oid 6376843;dropping the restore for oid 0x614661
    May 28 15:28:47 amnesiac rfsd[4055]: [meta_data_log_entry.ERR] (4057) Failed to decrypt file name: Encrypt/decrypt failed
    May 28 15:28:47 amnesiac rfsd[4055]: [namespace_restore.ERR] (4057) Failed to read buffer for oid 0x611380: Encrypt/decrypt failed
    May 28 15:28:47 amnesiac rfsd[4055]: [mgmt/mgmtd.NOTICE] (4057) rfsd sent event to mgmtd: /rbt/rfsd/events/notready
    May 28 15:28:47 amnesiac rfsd[4055]: [namespace_restore.ERR] (4057) Failed to create_shell_file for oid 0x611380: Encrypt/decrypt failed
    May 28 15:28:47 amnesiac mgmtd[2478]: [mgmtd.INFO]: EVENT: /rbt/rfsd/events/notready
    May 28 15:28:47 amnesiac mgmtd[2478]: [mgmtd.INFO]: in rfsd_notup
    May 28 15:28:47 amnesiac mgmtd[2478]: [mgmtd.INFO]: message Mismatch in encryption key
    May 28 15:28:47 amnesiac rfsd[4055]: [megamount.ERR] (4057) Error in restoring user namespace : Encrypt/decrypt failed
    May 28 15:28:47 amnesiac rfsd[4055]: [rfsd.INFO] (4055) tearing down RfsContext
    May 28 15:28:47 amnesiac rfsd[4055]: [rfsd.INFO] (4055) Megamount not running
    May 28 15:28:47 amnesiac rfsd[4055]: [rfsd.INFO] (4055) Shutting down backend threads
    May 28 15:28:47 amnesiac rfsd[4055]: [rfsd.INFO] (4055) Stopping restore of user namespace
    May 28 15:28:47 amnesiac rfsd[4055]: [reclaim_file.INFO] (4055) Processor ReclaimFile shutting down 1 threads
    May 28 15:28:47 amnesiac rfsd[4055]: [vnode_evicter_syncer.INFO] (4055) Processor VnodeEvicterSyncer shutting down 32 threads
    May 28 15:28:47 amnesiac rfsd[4055]: [megastore.INFO] (4055) shutting down megastore instance for volume rfsd_volume
    May 28 15:28:47 amnesiac rfsd[4055]: [megastore.INFO] (4055) shutting down megastore instance rfsd_volume
    May 28 15:28:47 amnesiac rfsd[4055]: [online_mfsck.INFO] (4055) shutting down online_mfsck completed
    May 28 15:28:47 amnesiac rfsd[4055]: [gc.INFO] (4055) Shutting down GC
    May 28 15:28:47 amnesiac rfsd[4055]: [gc.INFO] (4055) GC not currently running, so nothing to cancel
    May 28 15:28:47 amnesiac rfsd[4055]: [reaper.INFO] (4055) Shutting down reaper
    May 28 15:28:47 amnesiac rfsd[4055]: [reaper.INFO] (4055) Waiting for main thread completion
    May 28 15:28:47 amnesiac rfsd[4055]: [reaper.INFO] (4055) Reaper shutdown complete
    May 28 15:28:47 amnesiac rfsd[4055]: [dataprepop.INFO] (4512) prepop task processor thread shutting down.
    May 28 15:28:47 amnesiac rfsd[4055]: [dataprepop.INFO] (4513) dryrun prepop task processor thread shutting down.
    May 28 15:28:47 amnesiac rfsd[4055]: [dataprepop.INFO] (4514) prepop job monitor thread shutting down.
    May 28 15:28:47 amnesiac rfsd[4055]: [prefetch.INFO] (4055) Processor prefetch_processor shutting down 1 threads
    May 28 15:28:47 amnesiac rfsd[4055]: [megastore.INFO] (4055) synching anchor hash
    May 28 15:28:47 amnesiac rfsd[4055]: [anchor_hash.INFO] (4055) Shutting down AnchorHash Cleaner
    May 28 15:28:47 amnesiac rfsd[4055]: [replicator.INFO] (4055) Shutting down replication manager thread
    May 28 15:28:47 amnesiac rfsd[4055]: [retry_monitor.INFO] (4509) Replication retry monitoring thread exiting
    May 28 15:28:47 amnesiac rfsd[4055]: [replicator.INFO] (4055) Shutting down backend threads
    May 28 15:28:47 amnesiac rfsd[4055]: [replicator.INFO] (4507) Slab task wait thread shutting down
    May 28 15:28:47 amnesiac rfsd[4055]: [replicator.INFO] (4506) slab replicator shutting down : Server is shutting down
    May 28 15:28:47 amnesiac rfsd[4055]: [replicator.INFO] (4505) replication manager shutting down.
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4508) Shutting down thread tuner
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) Shutting down replicator
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) Shutting down all replication worker threads
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) Processor replicator shutting down 20 threads
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) Completed shutdown of replication worker threads
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) Replicator shutdown complete
    May 28 15:28:48 amnesiac rfsd[4055]: [replicator.INFO] (4055) ReplicationManager shutdown complete
    May 28 15:28:48 amnesiac rfsd[4055]: [megastore.INFO] (4055) Time stamp until which data is restorable : Thu May 28 08:05:57 2015
    May 28 15:28:48 amnesiac rfsd[4055]: [mutable_slab_manager.INFO] (4055) releasing partial slabs
    May 28 15:28:48 amnesiac rfsd[4055]: [evicter.INFO] (4501) evicter monitor thread stopping
    May 28 15:28:48 amnesiac rfsd[4055]: [evicter.INFO] (4500) evicter thread stopping
    May 28 15:28:48 amnesiac rfsd[4055]: [evict.INFO] (4055) Processor EvictProcessor shutting down 257 threads
    May 28 15:28:48 amnesiac rfsd[4055]: [megastore.NOTICE] (4055) Megastore rfsd_volume shut down cleanly!
    May 28 15:28:48 amnesiac rfsd[4055]: [bs_index.INFO] (4055) Enqueuing 0 Index blocks for syncing
    May 28 15:28:48 amnesiac rfsd[4055]: [index_syncer.INFO] (4055) Processor IndexSyncer shutting down 1 threads
    May 28 15:28:48 amnesiac rfsd[4055]: [slab_consistency.INFO] (4055) Shutting down all slab consistency worker threads
    May 28 15:28:48 amnesiac rfsd[4055]: [slab_consistency.INFO] (4055) Processor slab_consistency shutting down 16 threads
    May 28 15:28:48 amnesiac rfsd[4055]: [slab_consistency.INFO] (4055) Completed shutdown of slab consistency worker threads
    May 28 15:28:48 amnesiac rfsd[4055]: [anchor_inserter.INFO] (4055) Processor AnchorHashInserter shutting down 1 threads
    May 28 15:28:48 amnesiac rfsd[4055]: [megastore.INFO] (4055) Transaction Statistics for Transaction Manager rfsd_volume
    May 28 15:28:48 amnesiac rfsd[4055]: [megastore.INFO] (4055) Log Records Per Transaction: Samples: 0 **bleep**. Total: 0 Min: 0 Max: 0 Mean: -nan Median: 0 Median Err: 5
    May 28 15:28:48 amnesiac rfsd[4055]: [megastore.INFO] (4055) Payload Bytes Logged Per Transaction: Samples: 0 **bleep**. Total: 0 Min: 0 Max: 0 Mean: -nan Median: 0 Median Err: 5
    May 28 15:28:49 amnesiac rfsd[4055]: [slabcache.INFO] (4055) SlabCache is empty
    May 28 15:28:49 amnesiac rfsd[4055]: [slab_syncer.INFO] (4055) Processor SlabSyncer shutting down 32 threads
    May 28 15:28:49 amnesiac rfsd[4055]: [slab_expander.INFO] (4055) Processor SlabExpander shutting down 32 threads
    May 28 15:28:54 amnesiac rfsd[4055]: [mgmt/mgmtd.NOTICE] (4055) rfsd sent event to mgmtd: /rbt/rfsd/events/notready
    May 28 15:28:54 amnesiac mgmtd[2478]: [mgmtd.INFO]: EVENT: /rbt/rfsd/events/notready
    May 28 15:28:54 amnesiac mgmtd[2478]: [mgmtd.INFO]: in rfsd_notup
    May 28 15:28:54 amnesiac mgmtd[2478]: [mgmtd.ERR]: Error no message binding from rfsd.
    May 28 15:28:54 amnesiac rfsd[4055]: [rfsd.INFO] (4055) Destroying fuse object
    May 28 15:29:08 amnesiac hald[2985]: [hald.INFO]: Support query handler called
    May 28 15:29:08 amnesiac webasd[5908]: [web.INFO]: web: User admin viewing System Logs page.

Am looking for a solution to either getting the old instance back online or a new instance presenting all of the data.

1 REPLY 1

chriswong
4,114 Views

Hi,

There is a formal upgrade process to migrate from an existing SteelStore AMI appliance to the AltaVault AMI appliance. It is documented in the install guide, chapter 2. Please check out this forum post which has the link to the documentation and goes over that process. You will need to use that process to complete the conversion. I am hoping that the actions you've taken already haven't damaged the original datastore volumes, which will need to be migrated across to the new appliance. Please use the established procedure for migration and let us know how things go.

 

Regards,

Christopher

Public