VMware Solutions Discussions

RDM LUNs not fully removed from VM

GRAEMEOGDEN
13,657 Views

I have an issue where RDMs disconnected from the Guest OS (in this case, Windows 2008R2 64bit) using SnapDrive 6.3.1 are sometimes not fully removed from the virtual machine. The disk is not visible in the OS or in SnapDrive so appears to have been removed, but going to Edit Settings on the VM will still show the RDM as connected, and the LUN still exists and appears mapped on the filer. I've also noticed when viewing the storage paths on the host a number of 'dead' paths, which I assume are old RDM connections.

This happens on both ESXi4 and ESXi5 hosts.

This can cause problems when a LUN is removed from the filer: it appears to be disconnected from the Guest OS, yet the VM still thinks it is connected.

Also posted at VMware community.

http://communities.vmware.com/thread/340362

1 ACCEPTED SOLUTION

vloiseau
12,081 Views

You should try SnapDrive for Windows (SDW) 6.4:

https://now.netapp.com/NOW/download/software/snapdrive_win/6.4/

See the release notes:

https://now.netapp.com/NOW/knowledge/docs/snapdrive/relsnap64/pdfs/rnote.pdf

bug 515927 - Removing a LUN from an ESX host causes multipath software to report that all the paths to the LUN are down.

17 REPLIES

vloiseau
13,595 Views

Hi,

Is SnapDrive configured to talk to vCenter, or directly to the ESX host?

Valéry.

GRAEMEOGDEN
13,595 Views

Hi,

We use vCenter, and both the authentication details and the Transport Protocol Settings within SnapDrive are configured correctly.

Thanks,

Graeme

vloiseau
13,596 Views

If you replace vCenter with the ESX host in the SnapDrive configuration, can you confirm whether things improve? That would tell me whether I have had the same problem as you.

vloiseau
13,597 Views

Alternatively, restarting the vCenter service also clears it.

GRAEMEOGDEN
13,596 Views

I will try authenticating with the ESX host rather than vCenter.

Restarting the vCenter service isn't really a solution, as this happens quite regularly.

Thanks

vloiseau
13,597 Views

Hi,

OK, tell me if that temporarily resolves the problem. I will take a look in the internal bug database to see if there is an explanation and a fix.

PETERG865
13,595 Views

In vCenter, have you tried rescanning the datastores?

GRAEMEOGDEN
13,595 Views

Hi Peter,

Rescanning the datastores will remove the dead paths. However, we have an automated process which maps/unmaps LUNs via SnapDrive every hour to update some databases. Whenever a LUN is removed there's a chance of a dead path remaining. Eventually these accumulate and seem to cause performance issues.

Of course I could manually rescan the datastores every week or so, but a root cause would be nice!

Thanks,

Graeme

itstaff
13,594 Views

This exact thing is also happening to us. The rescan drops the dead LUNs, but manual rescanning isn't a good enough solution. We have also had hosts disconnect because SnapDrive 6.3.1 mounted the LUNs to the ESX host's local datastore too. I've opened many tickets with NetApp and VMware but still have no concrete solution.

At first it seemed to be the version of SnapDrive we were using; things clear up and then it happens again out of the blue. I have now been able to recreate the host disconnection issue: it happens during a SnapManager for SQL backup. As soon as VMware scans the HBAs, the host disconnects. I haven't tried connecting SnapDrive directly to the host because of HA: I wouldn't think you'd want to do that if the machine running SnapDrive migrated off that host, right?

GRAEMEOGDEN
12,025 Views

I would be interested to know how you were able to recreate the issue, so I could open support cases with NetApp/VMware.

At the moment I'm thinking of scheduling a PowerShell script to rescan the HBAs every month or so to try to work around this problem.
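A rough sketch of the check such a scheduled job might run before deciding to rescan, written here in Python rather than PowerShell. It assumes the host's path listing (for example the text printed by `esxcli storage core path list` on ESXi 5) has one `State:` line per path, with `dead` marking a dead path. That output layout is an assumption, and `count_dead_paths` and `needs_rescan` are hypothetical helper names, not part of any VMware or NetApp tool.

```python
import re

# One "State: <value>" line per path is assumed in the listing text.
STATE_LINE = re.compile(r"^[ \t]*State:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def count_dead_paths(path_listing: str) -> int:
    """Count the paths whose State line reads 'dead'."""
    states = STATE_LINE.findall(path_listing)
    return sum(1 for state in states if state.lower() == "dead")


def needs_rescan(path_listing: str, threshold: int = 1) -> bool:
    """Return True once the number of dead paths reaches the threshold."""
    return count_dead_paths(path_listing) >= threshold
```

Counting first and rescanning only above a threshold would also give a record of how quickly the dead paths accumulate, which may help when raising a support case.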

vloiseau
12,025 Views

After looking in our bug database: this bug is known as the APD (all paths down) condition.

APD is discussed in VMware KB articles 1016626 and 1015084, and in related articles on this problem, which is known to VMware.

In our bug database, see 346071 and 515927, among many others, some public and some not.

Still, please open a case so that you are recorded as a customer impacted by this problem, are associated with the bug, and can get confirmation that it will be fixed.

GRAEMEOGDEN
12,025 Views

Interesting that http://kb.vmware.com/kb/1016626 suggests this was resolved in ESXi 4.1 Update 1 as we're seeing this on ESXi 5!

Not sure if it's related, but I have noticed that although the LUN is disconnected, the vmdk pointer file sometimes isn't removed from the datastore. All mounts/dismounts are performed by the SnapDrive CLI, which I thought removed RDMs fully?

I'll try raising a support case and see if they can help, but it's very hard to simulate!

vloiseau
12,025 Views

Maybe it still affects ESXi 5? The fact that the VMDK pointer file is still there afterwards needs investigating. Maybe an interruption in the SnapDrive / vCenter execution of the RDM LUN removal?

This is documented in the SDW 6.3.1R1 release notes under Known issues:

http://now.netapp.com/NOW/knowledge/docs/snapdrive/relsnap631r1/pdfs/rnote.pdf

Title: Removing a LUN from an ESX host causes multipath software to report that all the paths to the LUN are down

Issue: When you remove a LUN from an ESX 4.0 or 4.1 host, the multipath software reports that all the paths to the LUN are down.

Corrective action: For hosts running ESX 4.0 Update 3 or ESX 4.1 Update 1, before removing the LUN, perform the following steps:

1. Enable VMFS3.FailVolumeOpenIfAPD.

2. Remove the LUN.

3. Perform a complete rescan.

4. Disable VMFS3.FailVolumeOpenIfAPD.
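The four steps above can be sketched as an ordered sequence in Python. This is only an illustration under stated assumptions: `set_option`, `remove_lun` and `rescan_all` are hypothetical callables you would supply yourself (they are not SnapDrive or VMware APIs), and the option is assumed to be toggled with 1 (enabled) and 0 (disabled). The `try`/`finally` goes beyond the release note; it is added so that the option is switched back off even if the removal or the rescan fails.

```python
from typing import Callable

APD_OPTION = "VMFS3.FailVolumeOpenIfAPD"


def remove_lun_with_apd_guard(
    set_option: Callable[[str, int], None],  # hypothetical: sets a host advanced option
    remove_lun: Callable[[], None],          # hypothetical: disconnects/unmaps the LUN
    rescan_all: Callable[[], None],          # hypothetical: performs a complete rescan
) -> None:
    """Run the release-note steps in order, always disabling the option at the end."""
    set_option(APD_OPTION, 1)      # step 1: enable the option (1 assumed to mean "on")
    try:
        remove_lun()               # step 2: remove the LUN
        rescan_all()               # step 3: perform a complete rescan
    finally:
        set_option(APD_OPTION, 0)  # step 4: disable the option again
```

Passing the three operations in as callables keeps the ordering logic separate from whichever tooling (PowerCLI, esxcli, the SnapDrive CLI) actually performs each step.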

vloiseau
12,082 Views

You should try SnapDrive for Windows (SDW) 6.4:

https://now.netapp.com/NOW/download/software/snapdrive_win/6.4/

See the release notes:

https://now.netapp.com/NOW/knowledge/docs/snapdrive/relsnap64/pdfs/rnote.pdf

bug 515927 - Removing a LUN from an ESX host causes multipath software to report that all the paths to the LUN are down.

GRAEMEOGDEN
12,025 Views

Looks like this version of SnapDrive will solve the issue, thanks! I'll get this tested and rolled out soon.

GRAEMEOGDEN
9,675 Views

Wanted to update this post as a warning to others as I've just had confirmation from NetApp Support on what I've suspected for a month or so...

SnapDrive 6.4 does include a workaround for this issue. However, the workaround breaks again when ESXi hosts are upgraded to ESXi 5.0 Update 1. Do NOT upgrade to Update 1: we found it made our hosts completely unmanageable. VMs kept running, but the vpxa service would crash, so vMotion or any changes to the VMs (adding LUNs etc.) were impossible while the host was in this state. After a while the host would disconnect from vCenter and require a restart. Due to the service crash, it was also impossible to obtain any ESX log files, which made troubleshooting particularly difficult.

We have reverted to ESXi 5.0 and the hosts are now working again. We also have a case open with VMware to investigate.

Cheers,

Graeme

bsti
9,675 Views

I'm having a very similar issue, but I'm not using SnapDrive to connect/disconnect the LUNs. Does anybody know exactly what is causing the issue?
