2018-03-22 01:39 PM
Since upgrading to ONTAP 9.3, OCUM intermittently reports "Cluster not reachable". Every 10-15 minutes OCUM raises a "Cluster not reachable" alert (new), which clears almost immediately (obsolete). Pinging the cluster shows that the mgmt interface keeps responding without interruption. OCUM also says the pairing status is bad. This has now occurred on two FAS systems, each time as soon as we upgraded from 9.1 to 9.3P2.
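A clean ping only shows that ICMP gets through. It says nothing about how quickly the management LIF accepts the TCP connections a monitoring server makes. The following is a minimal sketch for timing a TCP connect from the OCUM host, assuming the monitoring traffic uses HTTPS on port 443. The function name and the default port are illustrative assumptions, not taken from OCUM itself.

```python
import socket
import time


def tcp_connect_time(host, port=443, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    The port (443) is an assumption about how the monitoring server reaches
    the cluster management LIF; adjust it for your environment.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return None


# Example use (the hostname is hypothetical):
#   print(tcp_connect_time("cluster-mgmt.example.com"))
```

Running this in a loop across one or two alert cycles and comparing the failures with the OCUM event timestamps would show whether the alerts line up with slow or failed connects.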
OCUM 6.4P1 and OCUM 7.3 both show the same behavior (we are in the middle of an OCUM version transition).
Has anyone else run into this?
2018-03-23 08:24 AM
We saw this a lot with an OnCommand server that was monitoring a physically distant cluster. We opened a support case, and support was able to modify a timeout value to allow for longer response times. Once this was set, the issue went away. You could try that, and at the very least support may be able to identify any other issues.
2018-03-26 01:47 PM
Thanks for the reply. We implemented a couple of changes that may have fixed this. In general, here are the things we tried:
1. Remove and re-add the cluster into OCPM (warning, you will lose historical events)
2. Rediscover the cluster in OCUM
3. Upgrade to OCUM v7 (latest minor release)
4. Renew SSL certs for the cluster
One other thing I stumbled on may be an issue for you too, if you use underscores in the DNS names of your hosts (cluster mgmt LIFs).
This became apparent when we upgraded to ONTAP 9.3P2 and I could no longer reach the System Manager interface by name, but could still reach it by IP. (I guess this was because the DNS entry for the cluster mgmt LIF included an underscore, which 9.3 does not permit.)
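To spot names like this before an upgrade, here is a small sketch that checks each DNS label against the general host name rules (letters, digits, and hyphens only, per RFC 952 / RFC 1123). Whether ONTAP 9.3 enforces exactly these rules is only my guess from the behavior above. The function name and the example hostnames are made up for illustration.

```python
import re

# One DNS label under the RFC 952 / RFC 1123 host name rules: letters, digits,
# and hyphens only, 1-63 characters, no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def invalid_labels(hostname):
    """Return the labels of hostname that break the host name rules."""
    name = hostname.rstrip(".")  # tolerate a trailing root dot
    return [label for label in name.split(".") if not _LABEL.match(label)]


# Hypothetical names; substitute the DNS names of your cluster mgmt LIFs.
for name in ("cluster_mgmt.example.com", "cluster-mgmt.example.com"):
    bad = invalid_labels(name)
    status = f"invalid labels: {bad}" if bad else "ok"
    print(f"{name} -> {status}")
```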
2018-03-29 06:40 PM
I had this issue on multiple clusters. The issue turned out to be the routing tables. Apparently in 9.x you MUST have proper static routes. If you don't, you will experience this error. The way it was explained to me, in prior versions a packet that came in through a LIF/route went back out the same way. Now it only goes out via a static route, unless there is a dynamic route set up.
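One way to sanity-check this is to list the static routes (for example from the output of `network route show`) and confirm that at least one of them covers the address of the OCUM server. The sketch below does that check offline. The route list is typed in by hand, and all addresses and the helper name are hypothetical.

```python
import ipaddress


def covering_routes(routes, peer_ip):
    """Return the (network, gateway) routes whose destination contains peer_ip.

    routes is a list of (destination_cidr, gateway) pairs entered by hand.
    The result is sorted most specific first, mirroring longest-prefix matching.
    """
    peer = ipaddress.ip_address(peer_ip)
    matches = []
    for destination, gateway in routes:
        net = ipaddress.ip_network(destination)
        if peer.version == net.version and peer in net:
            matches.append((net, gateway))
    return sorted(matches, key=lambda m: m[0].prefixlen, reverse=True)


# Hypothetical data: one subnet route, no default route.
routes = [("10.20.0.0/16", "10.1.1.1")]
ocum_server = "192.168.50.10"

found = covering_routes(routes, ocum_server)
if found:
    print("reply traffic would use:", found[0])
else:
    print("no static route covers", ocum_server)
```

With the sample data no route matches, which is the situation described above: the reply has no static route to leave by. Adding a default route such as `("0.0.0.0/0", "10.1.1.1")` to the list makes the check pass.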
a week ago
Do you know the specific timeout value that was modified? We are experiencing similar issues with systems and are looking to modify timeout value, but not sure which options to use or which value range to use.
Sorry about the delay; things have been crazy here. As for "Cluster not reachable", you need to make sure you have routes in place. See my previous description.