Solved: ActiveIQ Unified Mgr Cluster Lacks Spare Disks issue

atwilson · ‎2020-11-11

Hi,
Hoping someone can shed some light on this issue; several days of working with support have not so far.
On Nov. 7 we lost an SSD, and Unified Mgr generated an alert warning that both node7 and node8 had low spares.
The disk was replaced on Nov. 9. I expected the Unified Mgr warnings to clear, but instead the alert changed to warn that node8 has low spares.
The warning stopped for node7 only.

Before disk 11.17 failed on Nov. 7, the spares looked like this:
node7:
RAID Disk  Device      HA  SHELF  BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      --------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.23P3  0a  0      23   SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.22P3  0d  0      22   SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.23    0c  11     23   SA:A  0     SSD   N/A    3662580/7500964352  3662830/7501476528

node8:
RAID Disk  Device      HA  SHELF  BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      --------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.11P3  0a  0      11   SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.10P3  0d  0      10   SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.11    0c  11     11   SA:B  0     SSD   N/A    3662580/7500964352  3662830/7501476528

Unified Mgr seemed to be OK with the above, as there were no warnings.

After disk 11.17 failed and was replaced, the spares now look like this:
node7:

RAID Disk  Device      HA  SHELF  BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      --------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.23P3  0a  0      23   SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.22P3  0d  0      22   SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.11P2  0c  11     11   SA:A  0     SSD   N/A    1803679/3693936128  1803687/3693952512
spare      0c.11.17    0c  11     17   SA:A  0     SSD   N/A    3662580/7500964352  3662830/7501476528

node8:
RAID Disk  Device      HA  SHELF  BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      --------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.11P3  0a  0      11   SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.11P3  0c  11     11   SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.10P3  0d  0      10   SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.11P1  0c  11     11   SA:B  0     SSD   N/A    1803679/3693936128  1803687/3693952512

Unified Mgr is warning that node8 has low spares.
Is this expected behaviour?
Is there something I have to do to address the warning for node8?

thanks

atwilson · ‎2020-11-12

Finally resolved after several hours with support.

The first support engineer didn’t seem very knowledgeable and we fumbled about quite a bit trying various things.

Eventually we managed to switch the problem from node7 over to node8. Haha/sigh....

He brought in Jim W, a senior engineer who was able to sort it out.

Jim’s analysis is that ONTAP is functioning correctly, but UM appears to have a quirk: it wants to see an entire, unshared spare disk assigned to each node, which is what we had before the disk failure.

Jim’s summary:

As discussed, in this HA pair (nodes CLUS01-N7 / CLUS01-N8) there was (1) whole spare owned by node -08 and (1) partitioned spare (with the container owned by node -07) distributed across the HA pair. This was causing OnCommand UM to report a SPARES LOW event.

As the nodes have a mix of whole-disk RAID groups and partitioned RAID groups, ONTAP requires (2) whole spares across the HA pair: (1) that is maintained as whole and (1) that is partitioned, to be available as required in the event of a whole or container disk failure.
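
Read literally, the requirement above can be checked against a spare listing. The following is only a sketch under stated assumptions, not ONTAP's actual logic: it treats a device name without a `P<n>` suffix as a whole disk, and it assumes partitioned disks on this system carry exactly P1, P2 and P3 (the suffixes that appear in the listings in this thread). The names `classify`, `meets_requirement` and `EXPECTED_PARTITIONS` are made up for illustration.

```python
import re
from collections import defaultdict

# All spare entries across the HA pair, from the listing taken after the replacement.
HA_PAIR_SPARES = [
    "0a.00.23P3", "0d.00.22P3", "0c.11.11P2", "0c.11.17",    # node7
    "0a.00.11P3", "0c.11.11P3", "0d.00.10P3", "0c.11.11P1",  # node8
]

# Assumption: partitioned disks in this system carry exactly P1, P2 and P3.
EXPECTED_PARTITIONS = {"P1", "P2", "P3"}

# Split a device name into the physical disk and an optional partition suffix.
DEVICE = re.compile(r"^(?P<disk>.+?)(?P<part>P\d+)?$")


def classify(spares):
    """Return (whole spare disks, partitioned disks whose every partition is spare)."""
    whole = set()
    parts = defaultdict(set)
    for device in spares:
        m = DEVICE.match(device)
        if m.group("part") is None:
            whole.add(m.group("disk"))
        else:
            parts[m.group("disk")].add(m.group("part"))
    fully_spare = {d for d, p in parts.items() if p == EXPECTED_PARTITIONS}
    return whole, fully_spare


def meets_requirement(spares):
    """One whole spare plus one fully spare partitioned disk across the pair."""
    whole, fully_spare = classify(spares)
    return len(whole) >= 1 and len(fully_spare) >= 1


whole, fully_spare = classify(HA_PAIR_SPARES)
print("whole spares:", sorted(whole))
print("fully spare partitioned disks:", sorted(fully_spare))
print("requirement met:", meets_requirement(HA_PAIR_SPARES))
```

For the post-replacement listing this finds 0c.11.17 as the whole spare and 0c.11.11 as the fully spare partitioned disk, which is consistent with the analysis that ONTAP itself was functioning correctly.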

The OnCommand UM spares-low message that was seen on the system was cleared by unpartitioning disk 0c.11.17 (owned by node -07) back to a whole disk, using the steps shown below.

From the cluster shell:

cluster_CLI::> storage disk option modify -node CLUS01-N7 -autoassign off

cluster_CLI::> storage disk option modify -node CLUS01-N8 -autoassign off

Then, from the node shells of the nodes:

CLUS01-N8> priv set diag

CLUS01-N8*> disk assign 0c.11.17P1 -s unowned -f

CLUS01-N7> priv set diag

CLUS01-N7*> disk assign all

CLUS01-N7*> disk unpartition 0c.11.17

This returned disk 0c.11.17 as a whole spare on node -07.

Note that in the event of a container disk failure, the whole disk on either node can be auto-partitioned by ONTAP to be used as required.

Additionally, the auto-partitioning of the replaced disk 0c.11.17 is a normal operation of the ONTAP RAID_LM (RAID Layout Manager) subsystem; it did not need to be altered, and we did so only to alleviate the OCUM reporting.

Should the issue persist with OCUM, I would recommend opening a new case specifically for the version of OCUM that you are using and letting the respective team investigate it accordingly.

fyi - we are currently using UM v.9.7P1


AlainTansi · ‎2020-11-11

Hi,

This is expected behavior, depending on the version of ONTAP you are currently on. https://mysupport.netapp.com/site/bugs-online/product/ONTAP/BURT/908898

This is due to the way UM identifies spare drives. Since you are using root partitioning, UM does not identify the spare partitions as spares.

Before, you did not get any alert because node 7 and node 8 had 0c.11.23 and 0c.11.11 respectively, which were not partitioned.

And now UM doesn't see any spare on node 8 because all of its spares are partitions (P1 and P3).
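
As a rough illustration of the behaviour described above (not UM's actual logic, which is not shown in this thread), the sketch below classifies each spare entry from the post-replacement listing as a whole disk or a partition by its `P<n>` suffix and counts whole-disk spares per node. The function names and the suffix rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Spare device names per node, copied from the listing taken after the disk was replaced.
SPARES = {
    "node7": ["0a.00.23P3", "0d.00.22P3", "0c.11.11P2", "0c.11.17"],
    "node8": ["0a.00.11P3", "0c.11.11P3", "0d.00.10P3", "0c.11.11P1"],
}

# Assumption: a trailing "P<digits>" marks a partition; anything else is a whole disk.
PARTITION_SUFFIX = re.compile(r"P\d+$")


def is_whole_disk(device: str) -> bool:
    """True when the device name has no partition suffix."""
    return PARTITION_SUFFIX.search(device) is None


def whole_spares_per_node(spares: dict[str, list[str]]) -> Counter:
    """Count spares per node, ignoring partitions (the hypothesised UM view)."""
    counts = Counter({node: 0 for node in spares})
    for node, devices in spares.items():
        counts[node] += sum(is_whole_disk(d) for d in devices)
    return counts


counts = whole_spares_per_node(SPARES)
for node, n in sorted(counts.items()):
    status = "ok" if n > 0 else "would be flagged as low on spares"
    print(f"{node}: {n} whole-disk spare(s), {status}")
```

Run against the pre-failure listing instead (where 0c.11.23 and 0c.11.11 were whole spares), the same count gives one whole-disk spare per node, which matches the absence of warnings at that point.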

So if you are not on a fixed version of ONTAP, you should upgrade.
https://kb.netapp.com/Advice_and_Troubleshooting/Data_Infrastructure_Management/OnCommand_Suite/OnCommand_Unified_Manager_for_cluster_mode_reports_err...

atwilson · ‎2020-11-11

thanks for the info

we are already on a fixed version 9.5P13

AlainTansi · ‎2020-11-12

Thanks for your reply,

Was this information useful?

If you still need assistance with the technical case, which you mentioned is taking a long time, you can email me the details of the case at alainb@netapp.com and I can assist you further internally, if you don't mind.

Otherwise, please share any feedback you may have.

Regards

atwilson · ‎2020-11-12

I'll be working with support to remove ownership from disk 11.11 and reassign it to node8 unpartitioned.

It seems there is a bit of a discrepancy between what ONTAP alerts on and what UM alerts on.

ONTAP did not generate a low spares alert when the disk was replaced. I guess ONTAP sees that there are spares available between the 2 nodes, whether they are partitioned or not, while UM wants to see a full unpartitioned disk for each node.

Is this a common issue with replacing partitioned failed disks and auto assignment? I was under a delirious impression that ONTAP would auto assign disks and partitions appropriately. Should I expect to have to periodically manually adjust the assignments after a disk fails? Or is it hit and miss, sometimes it works out ok and sometimes you have to intervene?

I'm pretty sure we have replaced disks on other partitioned systems without this issue. Perhaps we just got lucky and ONTAP assigned them in a way that satisfied UM.

AlainTansi · ‎2020-11-12

That is correct,

UM sees spares differently from how ONTAP sees them and, as I mentioned earlier, this was already identified and fixed under this bug: https://mysupport.netapp.com/site/bugs-online/product/ONTAP/BURT/908898

I'm not particularly sure why this is coming up if you are on a fixed version.

Since we now have an understanding of the issue and support is involved, it is best to continue the investigation at the support level, as it may take a deeper look into the system to see why this is happening.

The workaround you have should resolve the issue, but further details can only come through the case with the support team.

Thanks

atwilson · ‎2020-11-12