2014-08-14 08:35 AM
After confirming with our storage team, it looks like OCI is reporting a false path outage. I'd like to clear the violation, but I'm hitting a roadblock somewhere.
In the violation, OCI is picking up an incomplete volume name that does not exist and reporting a path outage against it. OCI also sees the correct volume name, so I'm not sure where this incomplete, apparently nonexistent volume is coming from. Does this sound familiar to anyone? Or is there a way around it? I've tried changing the path/host policies to see if I can override the violation somehow, but it won't budge. Any thoughts on where this violation is coming from and how to clear it?
2014-08-14 09:05 AM
The violation should go away when you dismiss it, and generally should stay gone. You would see a Path Outage if a volume by that name used to exist and has since been removed.
If it comes back after being dismissed, it might be worth considering whether there is some process automatically creating and destroying the volume in the background. Sites that mount SYM snapshot volumes to TSM servers for backups will see path outages on the temporary snapshot volumes each time they are deleted, for instance.
Take a look at the list of volumes for the storage array in question, and make sure that they correspond to actual volumes on the storage array. If there's a pattern of volumes coming across with wrong names, you would have a data source issue. This is unlikely.
2014-08-14 09:22 AM
I've tried clearing it, but it keeps coming back. This is originating from a UCS cluster, and we just recently were able to add the associated NPV chassis switches to help with our view. We still have a couple of NPV switches that we are trying to identify at the moment, and perhaps once we get those ID'd in OCI, it may clear it. This is the only blade reporting a violation in the cluster (10 total), so it's curious. Specifically, it's a volume on a NetApp 7-Mode array.
Looking at the Volumes view, the volumes that OCI reports are the actual volumes on the storage arrays, and the names are correct; we don't see the "non-existent" volume in the Volumes view. It's only with this violation where we see this non-existent volume name. There hasn't been a pattern of this on any other server/cluster, but this is also a relatively new cluster.
I'll need to ask the storage team about temporary snapshot volumes, and whether any of them happen to have similar volume names.
2014-08-14 09:58 AM
It is classified as a "Path Outage": port connectivity was changed. What is throwing me off is that the violation view is reporting a volume that doesn't seem to exist. More specifically, the name of the volume is missing one letter compared with the correct volume name. This incorrect volume name is only seen in the Violation view (right-click > Analyze Violation).
I'm trying to find out from our storage team if there are temporary volumes that might get a similar but slightly modified name. This is the first time I've seen a violation with a "volume does not exist". Then again, we haven't identified all of the fiber interconnects for the NPV switches, and that might be impacting our acquisition, too.
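For anyone wanting to check a name like this against an exported volume list, here is a rough sketch using Python's standard `difflib`. It is not an OCI feature or API; the volume names and the `closest_real_volumes` helper are made up for illustration, and it assumes the real volume names have already been copied out of the Volumes view into a list.

```python
import difflib

def closest_real_volumes(violation_name, known_volumes, cutoff=0.8):
    """Return known volume names that closely resemble the name shown in the violation.

    Results are ordered from most to least similar; names scoring below
    `cutoff` (0..1 similarity ratio) are dropped.
    """
    return difflib.get_close_matches(violation_name, known_volumes, n=3, cutoff=cutoff)

# Hypothetical names: the violation shows a name with one letter missing.
known = ["vol_esx_data01", "vol_esx_data02", "vol_backup"]
matches = closest_real_volumes("vol_esx_dta01", known)
print(matches)
```

The first entry in the result is the best candidate for the volume the violation is really about, which helps narrow down what to look for in the Changes view.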
2014-08-14 10:05 AM
#1. When you analyze the violation, the first tab in the resulting window has a filtered Changes view. If someone renamed the volume, OCI is likely to perceive it as a volume deletion and a volume creation in the same time period, because OCI has no way of knowing that "lunX" was previously "lunV". OCI only sees that lunV is gone and that lunX now exists. In this scenario, OCI throws a path outage violation for the name that disappeared (lunV).
#2. OCI might keep generating the violation because someone configured a path policy for this particular path. As long as the individual path policy exists, OCI expects the volume to exist.
Under Assurance -> SAN Path Policies, see if there are any path policies for the host + volume in question. You can right-click -> Remove Policy in this view.
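To make the two points above concrete, here is a toy model in Python. It is only a sketch of the reasoning, not how OCI is implemented: it assumes an inventory keyed purely by volume name, and the names `diff_volumes`, `outages`, `blade07`, `lun_old_name`, and `lun_new_name` are all hypothetical.

```python
def diff_volumes(before, after):
    """Compare two name-keyed inventories; a rename shows up as one removal plus one addition."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    return removed, added

def outages(policies, current_volumes):
    """Every (host, volume) policy whose volume is missing from the inventory is flagged."""
    return [(host, vol) for host, vol in policies if vol not in current_volumes]

# Inventory before and after someone renames a volume on the array.
before = {"lun_old_name", "lun_other"}
after = {"lun_new_name", "lun_other"}
removed, added = diff_volumes(before, after)

# A per-path policy still references the old name, so the outage is
# re-raised on every check, no matter how often it is dismissed.
policies = [("blade07", "lun_old_name")]
flagged = outages(policies, after)

# Removing the stale policy is what finally clears it.
policies.remove(("blade07", "lun_old_name"))
cleared = outages(policies, after)
print(removed, added, flagged, cleared)
```

In this model, dismissing the violation changes nothing because the stale policy is re-evaluated each time; only removing the policy (or recreating the volume under the old name) makes the flagged list empty.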
2014-08-14 12:03 PM
It took three attempts, but "Remove Policy" finally worked. I'm not sure why it took so many attempts, but for the moment the violation has cleared. It was one specific LUN that this blade was using and that the others don't access, hence no violation on the other 9 blades.
Thank you for walking me through this!