Simulator Discussions
Has anyone had any issues with multiple disk failures on the ONTAP simulator? We have two single-node simulators set up (running on VMware 6.0), and have had multiple occasions over the past year where several disks fail in very quick succession, which ends up failing the aggregate. I can't imagine what would cause some, but not all, of the virtual disks to fail, since they are all part of the same disk that has been assigned to the simulator in VMware.
At this point, the simulator is dead in the water, because the root volumes for all the vservers were on the failed aggregate. I tried unfailing the disks, but now they are "orphaned" and the hot spares that tried to replace the failed disks are marked as "reconstruct stalled". It won't even let me delete any of the volumes, or the failed aggregate.
Does anyone have suggestions for how to recover from this, short of blowing away the entire simulator and rebuilding from scratch? Any ideas on why this may be happening and how to prevent it in the future (or at least how to recover from it quickly) would be appreciated as well.
Thanks!
The disks are sparse files on IDE1:1 (/dev/ad3), which is probably getting full. You can check it on a working sim with:
set d;systemshell local df -h /sim
The default disk population of the ESX version of the simulator is 56 x 4 GB disks, so if you use them all it will eventually run out of space. You can use fewer disks, or you can replace IDE1:1 with a bigger disk after deploying the OVA, but before powering on the simulator.
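If you want to see where the space is actually going, you can also list the sparse disk files themselves and compare them against the backing store. This is just a rough sketch, using the /sim/dev/,disks directory the sim keeps its disk files in (exact filenames vary by release):

List the per-disk files and how much space each one really consumes:
set d
systemshell local "ls -lhs /sim/dev/,disks/"

Then check overall usage of the IDE1:1 backing store:
systemshell local df -h /sim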
Thank you, @SeanHatfield, that does indeed seem to be the issue:
Filesystem Size Used Avail Capacity Mounted on
/dev/ad3 223G 223G -18G 109% /sim
So, I'm assuming there is no way to save the simulator in its current state?
You mentioned using fewer disks. You are correct that it is set up with 56 x 4 GB disks. I have left 13 of them as spares... or are those still considered in use? How would I not use them? The couple of times I have rebuilt this, I have just gone into the boot menu of the simulator and told it to wipe the disks and do a fresh setup. Is there a way from there to adjust the number and size of the disks?
Thanks again!
On a new sim that's never been booted before, you can control it with bootargs in the loader. Once it's up and running, you can make adjustments from the systemshell and nodeshell.
This will get you a list of disks:
run local disk show -v
Then you can pull one out of the shelf (must be owned by the local node):
run local disk simpull v0.29
And then delete it:
systemshell local "sudo rm /sim/dev/,disks/,pulled/v0.29*"
Repeat as needed until you've removed all the unwanted disks.
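If you'd rather start clean with a smaller disk population instead, the loader route mentioned above looks roughly like this on a sim that has never been booted. Treat it as a sketch: the vdevinit string is type:count:shelf, and the type code and counts below (type 31, two shelves of 14 disks) are only an example — check the simulator install notes for the disk type codes and sizes your release supports.

At the VLOADER prompt, before the first boot:
setenv bootarg.vm.sim.vdevinit "31:14:0,31:14:1"
setenv bootarg.sim.vdevinit "31:14:0,31:14:1"
boot_ontap

That would build two shelves of 14 disks instead of the default four, leaving more room on the IDE1:1 backing store.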
@SeanHatfield , it looks like deleting the disks got the space down:
Filesystem  Size  Used  Avail  Capacity  Mounted on
/dev/ad3    223G  181G  24G    88%       /sim
But, the disks are still showing as orphaned and the aggregate is offline. Do you have any other sorcery to recover this aggregate, or is a rebuild of the sim my only option? Not a huge deal if it is, but I'd like to avoid it if possible.
Aggregate aggr1 (failed, raid_dp, partial) (block checksums)
  Plex /aggr1/plex0 (offline, failed, inactive)
    RAID group /aggr1/plex0/rg0 (partial, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM   Used (MB/blks)  Phys (MB/blks)
      --------- ------  ------------- ---- ---- ----- ----- --------------  --------------
      dparity   v2.16   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      parity    v3.16   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.26   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448  (reconstruct stalled)
      data      v0.21   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v2.17   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v3.17   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.20   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v0.22   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v2.18   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v3.18   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.21   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v0.24   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v2.19   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v3.19   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.22   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v0.25   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v2.20   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v3.20   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.24   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v0.28   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448  (reconstruct stalled)
      data      v2.21   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v3.21   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v1.25   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v0.27   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      v2.22   v2  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      data      FAILED          N/A                               4020/ -
      Raid group is missing 1 disk.

Unassimilated aggr1 disks

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM   Used (MB/blks)  Phys (MB/blks)
      --------- ------  ------------- ---- ---- ----- ----- --------------  --------------
      orphaned  v0.26   v0  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      orphaned  v1.19   v1  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
      orphaned  v3.22   v3  -     -   FC:A 0    FCAL  15000 4020/8233984    4027/8248448
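(For reference, raid-group detail like the above can be pulled from the nodeshell on either sim; this is generic and not specific to this failure:

run local aggr status -r aggr1
run local sysconfig -r

The second command also lists spare and broken disks, which makes it easier to see what's left to work with.)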
If nothing else, I think your suggestions have saved my other sim from the same fate, and should prevent this from happening again in the future.
Thanks again!
Another suggestion for quick recovery without starting from scratch:
1) Take a VM snapshot once (at least once the cluster is set up with all the basic configuration)
2) Take another snapshot whenever you have made meaningful changes, such as provisioning iSCSI, NFS, or CIFS, or setting up mirroring/vaulting or SnapCenter, so that even if the sim core-dumps or crashes you don't have to repeat the baseline setup again.
Simulators are prone to core dumps for known and unknown reasons, one of the most common being the root volume running out of space, and sometimes it's too late to remediate by the time you notice. In such scenarios, having a snapshot (at least of the baseline) to go back to makes life easier, because you can then get to the node level and delete the logs and snapshots, which is otherwise not possible once the simulator has crashed.
It's not ideal, but it's just a simulator (a test/demo device); instead of troubleshooting, you can simply roll back to your baseline state.
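To make the root-volume scenario less likely in the first place, it's worth checking vol0 from the nodeshell now and then. A rough sketch (the snapshot name is only an example; delete whatever snap list actually shows and you no longer need):

Check space on the node root volume:
run local df -h vol0

List and prune old snapshots on vol0:
run local snap list vol0
run local snap delete vol0 nightly.0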
@Ontapforrum wrote:
1) Take a VM snapshot once (at least once the cluster is set up with all the basic configuration)
That is a great suggestion, @Ontapforrum; unfortunately, this customer will not let us keep snapshots for longer than a week or so.