Re: MetroCluster has been in failover for weeks, seeing problems

findlay · ‎2015-04-24

We have a MetroCluster providing storage for our VMware environment over NFS. We lost one half of it 6 weeks ago and have not yet been able to replace the failed unit.

Everything seemed fine until this week, when our VMware hosts started reporting "all paths down" on volumes hosted by the MetroCluster. This happens for 30 seconds or so, once or twice a day. Volumes hosted on other storage and mounted on the same hosts are fine.

There's nothing in the network switch logs to indicate a network problem.

Does anyone know if there are issues with a MetroCluster being in takeover mode for long periods? One thing we've noticed is that ASIS seems to have stopped working on the volumes which have been taken over. Is this expected?

We're running a v7 MetroCluster.

Thanks

aborzenkov · ‎2015-04-24

I doubt this is directly related to it being a MetroCluster; the failover is relevant only to the extent that your surviving controller has been carrying increased load for a long time. But APD on NFS with VMware does ring a bell ... I would start by following "How to troubleshoot NFS APD (All-Paths-Down) issues on VMware ESXi".

findlay · ‎2015-04-24

Thanks for the helpful suggestion. Unfortunately we've already tried that and it didn't help.

mglanville2 · ‎2015-04-25

During those 30-second periods, is there an increase in 'pause frames' in the switch port or NIC statistics?

Is flow control 'off' on all the ports?

You may also want to check the port statistics for the links between switches, if there are any.

A sudden burst in network traffic or guest IOPS that lasts less than 60 seconds may not be spotted by normal performance monitoring tools.

There could be a 'shark' swimming somewhere in this environment, biting you once in a while. You may need to collect some stats every second or so to pinpoint where the shark is hiding.
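As a rough illustration of the per-second sampling idea, here is a minimal Python sketch. It assumes you already have some way to read a cumulative counter (pause frames, bytes, NFS ops); the `read_counter` callable is a hypothetical placeholder for that, not a real NetApp, VMware, or switch API. The sketch polls the counter at a fixed interval and flags any interval where the rate jumps above a threshold you choose.

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[float, int]          # (timestamp in seconds, cumulative counter)
Burst = Tuple[float, float, float]  # (interval start, interval end, rate per second)


def collect(read_counter: Callable[[], int], count: int, interval: float = 1.0,
            clock: Callable[[], float] = time.time,
            sleep: Callable[[float], None] = time.sleep) -> List[Sample]:
    """Poll a cumulative counter `count` times, `interval` seconds apart.

    `read_counter` is a hypothetical hook: supply whatever reads the counter
    in your environment (SNMP query, CLI scrape, stats file, ...).
    """
    samples: List[Sample] = []
    for i in range(count):
        samples.append((clock(), read_counter()))
        if i < count - 1:
            sleep(interval)
    return samples


def find_bursts(samples: List[Sample], threshold_per_sec: float) -> List[Burst]:
    """Return the intervals whose counter rate exceeds `threshold_per_sec`."""
    bursts: List[Burst] = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        delta = v1 - v0
        # Skip unusable pairs: no elapsed time, or the counter went backwards
        # (reset or wrap), which would otherwise produce a bogus rate.
        if elapsed <= 0 or delta < 0:
            continue
        rate = delta / elapsed
        if rate > threshold_per_sec:
            bursts.append((t0, t1, rate))
    return bursts
```

With samples taken every second, a burst that would be averaged away in a 5-minute graph shows up as a single flagged interval, which you can then line up against the APD timestamps on the hosts.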

Matt