Intermittent Loss of Disk on Client Servers

kevinmelton · 2011-02-02

I have been contracted by a client to resolve an issue they are currently experiencing with their NetApp and their VMware infrastructure.

In the client's current configuration, they run VMware ESX on IBM hardware. The ESX hosts communicate with the NetApp in a specific VLAN. The client uses iSCSI to connect from their VMware infrastructure to the NetApp to reach their VMDKs. I believe they are also using NFS from the VMware infrastructure to the NetApp for storage.

The problem that the client is experiencing is that sporadically they have a VM display a Windows error message indicating "No Disk" (picture of error attached to case).

This does not happen every day, but it does happen from time to time. Resolving this is currently the client's #1 IT priority.

I have used a sniffer to capture traffic on the VLAN that carries the transactions between the VMware infrastructure and the NetApp. I see what I would call very clean IP communications between VMware and the NetApp (less than 1% retransmissions or other TCP errors).
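
One caveat on that figure: a retransmission rate averaged over a whole capture can hide a short burst of errors, which is the kind of event that might line up with an intermittent failure. A minimal Python sketch of a per-window breakdown follows. It assumes the sniffer can export each packet as a timestamp plus a retransmission flag; the function name, the input format, and the 60-second window are illustrative assumptions, not features of any particular tool.

```python
from collections import defaultdict


def retransmission_rate_by_window(packets, window_seconds=60):
    """Group packets into fixed time windows and return the
    retransmission fraction for each window.

    packets: iterable of (timestamp_seconds, is_retransmission) tuples
             (a hypothetical export format, not a sniffer API).
    Returns a dict {window_start_seconds: fraction_retransmitted}.
    """
    totals = defaultdict(int)
    retransmitted = defaultdict(int)
    for timestamp, is_retransmission in packets:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[window_start] += 1
        if is_retransmission:
            retransmitted[window_start] += 1
    return {w: retransmitted[w] / totals[w] for w in sorted(totals)}


# Made-up data: one packet per second for two minutes, with all six
# retransmissions falling in the second minute.
sample = [(t, 60 <= t < 66) for t in range(120)]
rates = retransmission_rate_by_window(sample)
for window_start, rate in rates.items():
    print(f"window starting at {window_start}s: {rate:.1%} retransmitted")
```

In this made-up sample the overall rate is 5%, but the per-window view shows that every retransmission falls in the second minute. Windows with unusually high rates are the ones worth comparing against the times the "No Disk" error appeared.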

I am hoping that perhaps some of you may have seen this behavior before, and may be able to provide suggestions toward resolution.

Thanks for any input.

Kevin

BrendonHiggins · 2011-02-02

Hi, welcome to the community.

I would confirm the ESX, SnapDrive, DoT, SMVI, HBA versions on the matrix and make sure all the versions play nice together. https://now.netapp.com/matrix/
Get the event time from the VMs (Windows events log) and then check the ESX and then filer logs for events at the same time
If they have Operations Manager (DFM) I would review what else was happening on the filers at the same time - any strange 'activity' or spikes?
If they have VSC v2, have all the recommended NetApp and VMware settings been configured? - Great tool
Is it just VMs on one ESX box, datastore, DRS group or Farm?
Are the events always at the same time? - Cleaner unplugging LAN switch to hover, etc.
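
The log correlation suggested above (take the event time from the VM, then look at the ESX and filer logs around the same time) can be sketched in a few lines of Python. This is only a rough illustration: it assumes the log entries have already been parsed into (timestamp, source, message) tuples, that all clocks are synchronised and in the same time zone, and that a two-minute window is wide enough. The function name and the sample entries are made up.

```python
from datetime import datetime, timedelta


def entries_near(event_time, log_entries, window=timedelta(minutes=2)):
    """Return the log entries whose timestamp is within +/- window
    of event_time.

    log_entries: iterable of (datetime, source, message) tuples,
                 already parsed from whatever format each log uses.
    """
    return [entry for entry in log_entries
            if abs(entry[0] - event_time) <= window]


# Hypothetical data: the time of a "No Disk" event on a VM, plus
# placeholder entries standing in for parsed ESX and filer log lines.
no_disk_event = datetime(2011, 2, 1, 3, 15, 0)
parsed_entries = [
    (datetime(2011, 2, 1, 3, 14, 30), "esx", "placeholder message A"),
    (datetime(2011, 2, 1, 2, 0, 0), "filer", "placeholder message B"),
    (datetime(2011, 2, 1, 3, 16, 59), "filer", "placeholder message C"),
]

matches = entries_near(no_disk_event, parsed_entries)
for timestamp, source, message in matches:
    print(timestamp.isoformat(), source, message)
```

Running the same check for every recorded "No Disk" time shows quickly whether the same ESX host or filer message keeps turning up next to the error.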

The above should allow you to rule loads out and look very busy for the client.

Good luck

Bren

kevinmelton · 2011-02-03

Thanks for the response Bren.

I am going to have the NetApp/VMware admin look at your step 1 suggestions and confirm. Step 2, the Windows logs, was something I was going to ask whether they check when the problem occurs. Ideally I would like to have a sniffer capturing data 24/7 until the problem shows up again, so that we can correlate network occurrences, if any, with the timestamps in the event logs on the affected Windows box. Steps 3 and 4 I will need them to confirm as well. I would say that they have a farm of ESX hosts, as there are multiple servers. The datastore is located on the NetApp.
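
For the 24/7 capture, one possible approach is a ring buffer of capture files, so the disk does not fill up while waiting for the next occurrence. The Python sketch below only builds and prints a tcpdump command line. It assumes the sniffer is a Linux host with tcpdump installed. The interface name, output prefix, and size values are hypothetical, and the filter assumes the default iSCSI (3260) and NFS (2049) ports.

```python
import shlex


def build_capture_command(interface, output_prefix,
                          file_size_mb=100, file_count=50, snaplen=128):
    """Build a tcpdump command that writes a rotating set of capture files.

    -C sets the size at which tcpdump starts a new file (in millions of
    bytes) and -W caps the number of files; used together they overwrite
    the oldest file once the cap is reached. -s limits the bytes kept per
    packet, so mostly headers are stored.
    """
    return [
        "tcpdump",
        "-i", interface,
        "-n",                     # do not resolve addresses to names
        "-s", str(snaplen),
        "-C", str(file_size_mb),
        "-W", str(file_count),
        "-w", output_prefix,
        # Assumed default ports: 3260 for iSCSI, 2049 for NFS.
        "port", "3260", "or", "port", "2049",
    ]


# "eth0" and the output path are placeholders for the real sniffer setup.
command = build_capture_command("eth0", "/var/tmp/storage_vlan.pcap")
print(shlex.join(command))
```

With these example values the capture never holds more than roughly 5 GB, so the sizes need to be chosen so that the buffer still covers the moment of failure by the time someone notices the error and stops the capture.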

The events are not at the same time. What did you mean when you said "cleaner unplugging LAN switch to hover, etc" ? Are you saying to unplug the LAN connection to the switch when the incident occurs? There are some times when it happens in the early hours of the morning...

Thanks Bren

Kevin

BrendonHiggins · 2011-02-07

In the UK it is an urban myth that the cleaner comes into the computer room during the night and removes the power plug so that they can use the vacuum cleaner...