Hi, hoping someone can shed some light on this issue; working with support for several days has not so far. On Nov. 7 we lost an SSD, and an alert was generated in Unified Mgr warning that both node7 and node8 had low spares. The disk was replaced on Nov. 9. I was expecting the Unified Mgr warnings to clear, but instead the alert changed to warn only that node8 has low spares. It stopped warning for node7.
Before disk 11.17 failed on Nov. 7, the spares looked like this:

node7:

RAID Disk  Device      HA SHELF BAY   CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      -------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.23P3  0a 0     23    SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.22P3  0d 0     22    SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.23    0c 11    23    SA:A  0     SSD   N/A    3662580/7500964352  3662830/7501476528

node8:

RAID Disk  Device      HA SHELF BAY   CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------      -------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.11P3  0a 0     11    SA:B  0     SSD   N/A    55176/113000448     55184/113016832
spare      0d.00.10P3  0d 0     10    SA:A  0     SSD   N/A    55176/113000448     55184/113016832
spare      0c.11.11    0c 11    11    SA:B  0     SSD   N/A    3662580/7500964352  3662830/7501476528
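For anyone comparing spare listings like the ones above before and after a disk swap, here is a rough Python sketch that pulls the `spare` rows out of pasted output and flags which ones are partitions. The `Spare` tuple, the `parse_spares` helper and the regex are my own illustration, not a NetApp tool, and the "ends in P<number> means partition" rule is an assumption based only on the device names shown in this thread.

```python
import re
from typing import List, NamedTuple


class Spare(NamedTuple):
    device: str
    shelf: int
    bay: int
    disk_type: str
    used_mb: int
    partitioned: bool


# One "spare" row: device, HA, shelf, bay, chan, pool, type, RPM, used, phys.
# Matches whether the rows are on separate lines or run together on one line.
SPARE_ROW = re.compile(
    r"\bspare\s+(?P<device>\S+)\s+(?P<ha>\S+)\s+(?P<shelf>\d+)\s+(?P<bay>\d+)\s+"
    r"(?P<chan>\S+)\s+(?P<pool>\d+)\s+(?P<type>\S+)\s+(?P<rpm>\S+)\s+"
    r"(?P<used_mb>\d+)/\d+\s+\d+/\d+"
)

# Assumption: a device name ending in P<number> (e.g. 0a.00.11P3) is a partition.
PARTITION_SUFFIX = re.compile(r"P\d+$")


def parse_spares(text: str) -> List[Spare]:
    """Return every spare row found in the pasted listing."""
    return [
        Spare(
            device=m["device"],
            shelf=int(m["shelf"]),
            bay=int(m["bay"]),
            disk_type=m["type"],
            used_mb=int(m["used_mb"]),
            partitioned=bool(PARTITION_SUFFIX.search(m["device"])),
        )
        for m in SPARE_ROW.finditer(text)
    ]


# The node8 rows as posted above.
NODE8_SAMPLE = """
Spare disks for block checksum
spare 0a.00.11P3 0a 0 11 SA:B 0 SSD N/A 55176/113000448 55184/113016832
spare 0d.00.10P3 0d 0 10 SA:A 0 SSD N/A 55176/113000448 55184/113016832
spare 0c.11.11 0c 11 11 SA:B 0 SSD N/A 3662580/7500964352 3662830/7501476528
"""

if __name__ == "__main__":
    for spare in parse_spares(NODE8_SAMPLE):
        kind = "partition" if spare.partitioned else "whole disk"
        print(f"{spare.device}: {kind}, {spare.used_mb} MB usable")
```

Run against the node8 rows, this reports two partition spares and one whole-disk spare, which is the layout UM was happy with before the failure.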
Unified mgr seemed to be ok with the above as there were no warnings.
After disk 11.17 failed and was replaced the spares look like this now: node7:
If you still need assistance with the technical case, which as you mentioned is taking a long time, you can email me the details of the case at email@example.com and I can assist you further internally, if you don't mind.
Otherwise, let us know any feedback that you may have.
I'll be working with support to remove ownership from disk 11.11 and reassign it to node8 unpartitioned.
It seems there is a bit of a discrepancy between what ONTAP alerts on and what UM alerts on.
ONTAP did not generate a low spares alert when the disk was replaced. I guess ONTAP sees there are spares available between the 2 nodes whether they are partitioned or not.
While UM wants to see a full unpartitioned disk for each node.
Is this a common issue with replacing partitioned failed disks and auto assignment? I was under a delirious impression that ONTAP would auto assign disks and partitions appropriately. Should I expect to have to periodically manually adjust the assignments after a disk fails? Or is it hit and miss, sometimes it works out ok and sometimes you have to intervene?
I'm pretty sure we have replaced disks on other partitioned systems without this issue. Perhaps we just got lucky and ONTAP assigned them in a way that satisfied UM.
Not particularly sure why this is coming up if you are on a fixed version.
Since we now have an understanding of the issue and support is involved, it is best to continue at the support level, as it may take a deeper investigation into the system to see why this is happening.
The workaround you have should resolve the issue, but you can only get further details through the case with the support team.
Finally resolved after several hours with support.
The first support engineer didn’t seem very knowledgeable and we fumbled about quite a bit trying various things.
Eventually we managed to switch the problem from node7 over to node8. Haha/sigh....
He brought in Jim W, a senior engineer who was able to sort it out.
Jim’s analysis is that ONTAP is functioning correctly, but it looks like UM has a quirk because it wants to see an entire spare disk assigned to each node, not shared, which we originally had before the disk failure.
As discussed, in this HA pair (nodes CLUS01-N7 / CLUS01-N8) there was (1) whole spare owned by node -08 and (1) partitioned spare (with the container owned by node -07) distributed across the HA pair. This was causing OnCommand UM to report a SPARES LOW event.
As the nodes have a mix of whole-disk RAID groups and partitioned RAID groups, ONTAP requires (2) whole spares across the HA pair: (1) that is maintained as whole and (1) that is partitioned, to be available as required in the event of a whole or container disk failure.
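To make the difference between the two views concrete, here is a toy Python model of the behaviour as described in this thread. It is only a sketch of my understanding, not the actual logic of ONTAP or OnCommand UM. The `SpareDisk` class, both function names and the placeholder disk name `whole-spare-n8` are hypothetical; only `0c.11.17` and the node names come from the posts above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpareDisk:
    name: str
    owner: str         # owning node (container owner if the disk is partitioned)
    partitioned: bool


def nodes_without_whole_spare(spares: List[SpareDisk], nodes: List[str]) -> List[str]:
    """UM-style view (assumed): each node should own a whole, unpartitioned spare."""
    return [
        node
        for node in nodes
        if not any(s.owner == node and not s.partitioned for s in spares)
    ]


def pair_has_two_spare_disks(spares: List[SpareDisk]) -> bool:
    """ONTAP-style view (assumed): two spare physical disks across the HA pair,
    whether or not one of them has been partitioned."""
    return len(spares) >= 2


NODES = ["CLUS01-N7", "CLUS01-N8"]

# State described above: one whole spare on node -08, one partitioned spare
# whose container is owned by node -07.
before_fix = [
    SpareDisk("whole-spare-n8", "CLUS01-N8", partitioned=False),  # placeholder name
    SpareDisk("0c.11.17", "CLUS01-N7", partitioned=True),
]

# State after 0c.11.17 is unpartitioned and returned to node -07 as a whole spare.
after_fix = [
    SpareDisk("whole-spare-n8", "CLUS01-N8", partitioned=False),  # placeholder name
    SpareDisk("0c.11.17", "CLUS01-N7", partitioned=False),
]

if __name__ == "__main__":
    for label, spares in (("before", before_fix), ("after", after_fix)):
        print(label, pair_has_two_spare_disks(spares), nodes_without_whole_spare(spares, NODES))
```

In this model the pair-wide check passes in both states, while the per-node check only passes once each node owns a whole spare, which is the gap between the two products that the thread describes.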
The OnCommand UM spares low message that was seen on the system was cleared by unpartitioning the whole disk 0c.11.17 (owned by node -07) with the steps shown below.
From the cluster shell:
cluster_CLI::> storage disk option modify -node CLUS01-N7 -autoassign off
cluster_CLI::> storage disk option modify -node CLUS01-N8 -autoassign off
Then, from the node shells of the nodes:
CLUS01-N8> priv set diag
CLUS01-N8*> disk assign 0c.11.17P1 -s unowned -f
CLUS01-N7> priv set diag
CLUS01-N7*> disk assign all
CLUS01-N7*> disk unpartition 0c.11.17
which returned the 0c.11.17 disk back as a whole spare on node -07.
Note that in the event of a container disk failure, the whole disk on either node can be auto-partitioned by ONTAP to be used as required.
Additionally, the auto-partitioning of the replaced disk 0c.11.17 is a normal ONTAP RAID_LM (RAID Layout Manager) subsystem operation and is not required to be altered as we did in order to alleviate the OCUM reporting.
Should the issue persist with OCUM, I would recommend opening a new case specifically for the version of OCUM that you are using and letting the respective team investigate it accordingly.