Simulator Discussions

Disk Failures on ONTAP Simulator 9.5

Jim_Robertson
6,346 Views

Has anyone had any issues with multiple disk failures on the ONTAP simulator?  We have two single-node simulators set up (running on VMware 6.0), and on multiple occasions over the past year several disks have failed in very quick succession, which ends up failing the aggregate.  I can't imagine what would cause some, but not all, of the virtual disks to fail, since they are all part of the same disk that has been assigned to the simulator in VMware.

At this point, the simulator is dead in the water, because the root volumes for all the vservers were on the failed aggregate.  I tried unfailing the disks, but now they are "orphaned" and the hot spares that tried to replace the failed disks are marked as "reconstruct stalled".  It won't even let me delete any of the volumes, or the failed aggregate.

Does anyone have any suggestions for how to recover from this short of blowing away the entire simulator and rebuilding from scratch?  Also, any suggestions for why this may be happening and how to prevent it in the future would be nice as well (or, at least how to quickly recover from it).
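(For context, a quick way to see this failure state from the clustershell on a standard 9.x sim is something like the following; aggr1 is just an example aggregate name:)

::> storage disk show -broken
::> storage aggregate show -fields state,raidstatus
::> storage aggregate show-status -aggregate aggr1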

Thanks!

6 REPLIES

SeanHatfield
6,322 Views

The disks are sparse files on IDE1:1 (/dev/ad3), which is probably getting full.  You can check it on a working sim with:

set d;systemshell local df -h /sim

The default disk population of the ESX version of the simulator is 56 x 4 GB disks, so if you use them all it will eventually run out of space.  You can use fewer disks, or you can replace IDE1:1 with a bigger disk after deploying the OVA, but before powering on the simulator.
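For reference, the same check can also be run in full syntax from the clustershell (a sketch; "local" resolves to the local node, and the /sim paths are the ones described above):

::> set -privilege diagnostic
::*> systemshell -node local -command "df -h /sim"
::*> systemshell -node local -command "du -sh /sim/dev/,disks"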

 

If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.

Jim_Robertson
6,313 Views

Thank you, @SeanHatfield, that does indeed seem to be the issue:

 

Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/ad3      223G    223G    -18G   109%    /sim

So, I'm assuming there is no way to save the simulator in its current state?

You mentioned using fewer disks.  You are correct that it is set up with 56 x 4 GB disks.  I have left 13 of them as spares... or are those still considered in use?  How would I not use them?  The couple of times I have rebuilt this, I have just gone into the boot menu of the simulator and told it to wipe the disks and do a fresh setup.  Is there a way from there to adjust the number and size of the disks?

Thanks again!

SeanHatfield
6,307 Views

On a new sim that's never booted before you can control it with bootargs in the loader.  Once it's up and running you can make adjustments from the systemshell and nodeshell.
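As a sketch of the loader approach: before the very first boot you can set the vdevinit bootargs, where each entry is type:count:shelf.  The type codes map to disk sizes and vary by simulator release, so treat the values below as placeholders and verify them against the simulator guide for your version:

VLOADER> setenv bootarg.vm.sim.vdevinit "23:14:0,23:14:1"
VLOADER> setenv bootarg.sim.vdevinit "23:14:0,23:14:1"
VLOADER> boot

That example would populate two shelves of 14 disks instead of the default four shelves of 14.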

 

This will get you a list of disks:

run local disk show -v

Then you can pull one out of the shelf (must be owned by the local node):

run local disk simpull v0.29

And then delete it:

systemshell local "sudo rm /sim/dev/,disks/,pulled/v0.29*"

Repeat as needed until you've removed all the unwanted disks.
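Put together, a single pass for one disk might look like this (v0.29 is just an example device name; only pull disks that show as spares and are owned by the local node):

::> storage disk show -container-type spare
::> run local disk show -v
::> run local disk simpull v0.29
::> set -privilege diagnostic
::*> systemshell -node local -command "sudo rm /sim/dev/,disks/,pulled/v0.29*"
::*> systemshell -node local -command "df -h /sim"

The last command just confirms that space on /sim was actually freed.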

 

If this post resolved your issue, help others by selecting ACCEPT AS SOLUTION or adding a KUDO.

Jim_Robertson
6,132 Views

@SeanHatfield, it looks like deleting the disks got the space down:

Filesystem    Size    Used   Avail Capacity  Mounted on
/dev/ad3      223G    181G     24G    88%    /sim


But, the disks are still showing as orphaned and the aggregate is offline.  Do you have any other sorcery to recover this aggregate, or is a rebuild of the sim my only option?  Not a huge deal if it is, but I'd like to avoid it if possible.

Aggregate aggr1 (failed, raid_dp, partial) (block checksums)
  Plex /aggr1/plex0 (offline, failed, inactive)
    RAID group /aggr1/plex0/rg0 (partial, block checksums)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   v2.16   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      parity    v3.16   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.26   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448 (reconstruct stalled)
      data      v0.21   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v2.17   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v3.17   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.20   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v0.22   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v2.18   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v3.18   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.21   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v0.24   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v2.19   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v3.19   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.22   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v0.25   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v2.20   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v3.20   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.24   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v0.28   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448 (reconstruct stalled)
      data      v2.21   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v3.21   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v1.25   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v0.27   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      v2.22   v2    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      data      FAILED          N/A                        4020/ -
      Raid group is missing 1 disk.

  Unassimilated aggr1 disks

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      orphaned  v0.26   v0    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      orphaned  v1.19   v1    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448
      orphaned  v3.22   v3    -   -   FC:A   0  FCAL 15000 4020/8233984      4027/8248448


If nothing else, I think your suggestions have saved my other sim from the same fate, and will hopefully prevent this from happening again in the future.

 

Thanks again!

Ontapforrum
6,308 Views

Another suggestion for quick recovery, without starting from scratch:


1) Take a VM snapshot once (at least once the cluster is set up with all the basic configuration; see the ESXi example after this list)
2) Take another snapshot whenever you have made changes or provisioned iSCSI, NFS, or CIFS, set up mirroring/vaulting, or configured SnapCenter, so that even if the sim core-dumps or crashes you don't need to repeat the baseline setup again.
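For example, from the ESXi host shell (the VM name and ID below are placeholders; PowerCLI or the vSphere client work just as well):

# find the sim's VM id
vim-cmd vmsvc/getallvms | grep -i sim
# take a named snapshot (42 is an example VM id; last two args: include memory, quiesce)
vim-cmd vmsvc/snapshot.create 42 "baseline" "cluster setup complete" 0 0
# list existing snapshots
vim-cmd vmsvc/snapshot.get 42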

 

Simulators are prone to core-dumping for known and unknown reasons, one of the most common being the root volume running out of space, and sometimes it's too late to remediate.  In such scenarios, having snapshots (at least the baseline) to go back to makes life easier, because you can actually get back to node level and delete all the logs and snapshots, which is otherwise not possible once the simulator has crashed.

 

It's not ideal, but it's just a simulator (a test/demo device); instead of troubleshooting, you can simply roll back to your baseline state.

Jim_Robertson
6,134 Views

@Ontapforrum wrote:

1) Take a VM snapshot once (at least once the cluster is set up with all the basic configuration)

 

That is a great suggestion, @Ontapforrum; unfortunately, this customer will not let us keep snapshots for longer than a week or so.
