2-node Exchange DAG (built on vSphere):
SME - 6.0.4.1008
SDW - 7.0.1
Windows OS - 2008 R2 EE SP1
ONTAP - 8.1.2P4
FCP
A stale LUN is left attached to the DAG node (the verification node) after a failed verification:
Storage System Name: filer01
LUN Path: /vol/sdw_cl_vol_maildata_DAGNODE1_System_0/lun_maildata_DAGNODE1_System
LUN Size: 60 GB
Volume Name: \\?\Volume{22b06c02-dc83-11e4-ba5c-0050569c4f7e}\
Backed by Snapshot path: /vol/vol_maildata_DAGNODE1_System/.snapshot/exchsnap_maildata_DAGNODE1_04-09-2015_07.22.09/lun_maildata_DAGNODE1_System
SnapMirror: No
Disk Type: Disk in Snapshot Read Write
Disk Serial Number: B7sUt+C0gEVP
PartitionType: GPT
Restore Status: Normal
RDM File Name : [NetApp SATA - vmware Servers Config Files] DAGNODE1 (Dag Server)/DAGNODE1 (Dag Server)_SD_filer01_B7sUt+C0gEVP_0.vmdk
Data Store Name : NetApp SATA - vmware Servers Config Files
I know how to clean this up, but I am looking for an RCA: the customer has seen this occur a number of times now and would like to get to the bottom of why it is happening.
Looking in the SME log for the failed verification job, we see the failure occurs between 10:35 and 10:37:
SME Log...
[10:25:56.250] [DAGNODE1] Running Integrity Verification using CheckSgFiles API
[10:25:56.609] [DAGNODE1] File: C:\Program Files\NetApp\SnapManager for Exchange\SnapMgrMountPoint\MPDisk114\Exchange Mailstores\System Management\System Management.edb
...
[10:35:22.388] [DAGNODE1] Operation completed successfully in 566.174 seconds.
[10:35:22.388] [DAGNODE1] Dismounting LUN [C:\Program Files\NetApp\SnapManager for Exchange\SnapMgrMountPoint\MPDisk114] of Snapshot [exchsnap__mail-DAGNODE1_04-09-2015_07.22.09]...
[10:37:01.317] [DAGNODE1] [SnapDrive Error]:Failed to delete disk in virtual machine, Failed to delete virtual disk: Unable to connect to the remote server.
(SnapDrive Error Code: 0xc0040414)
[10:37:14.530] [DAGNODE1] [SnapDrive Error]:The LUN may not be connected, because its mount point cannot be found.
(SnapDrive Error Code: 0xc0041085)
[10:37:14.530] [DAGNODE1] SnapDrive failed to dismount the snapshot.
[10:37:15.013] [DAGNODE1] Unknown Error, Error Code: 0xc0041085
[10:37:15.013] [DAGNODE1] Re-trying to force dismounting LUN...
[10:37:15.294] [DAGNODE1] [SnapDrive Error]:The LUN with serial number 'B7sUt+C0gEVP' is not found on the system
(SnapDrive Error Code: 0xc0040374)
[10:37:15.294] [DAGNODE1] SnapManager will pause 70 seconds after force dismount, please wait...
[10:38:25.303] [DAGNODE1] SnapDrive failed to dismount the snapshot.
[10:38:25.303] [DAGNODE1] [SnapDrive Error]:The LUN with serial number 'B7sUt+C0gEVP' is not found on the system
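To line up each SnapDrive error with its error code, a small Python sketch like the one below could be run over the SME report. The regular expressions are assumptions based only on the lines quoted above, and `snapdrive_errors` is a hypothetical helper name, not part of SME or SDW.

```python
import re

# Assumed layout, inferred from the excerpt: "[HH:MM:SS.mmm] [HOST] message"
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}\.\d{3})\] \[([^\]]+)\] (.*)$")
CODE_RE = re.compile(r"\(SnapDrive Error Code: (0x[0-9a-fA-F]+)\)")
PREFIX = "[SnapDrive Error]:"

# Lines copied from the SME log excerpt above.
SAMPLE = """
[10:37:01.317] [DAGNODE1] [SnapDrive Error]:Failed to delete disk in virtual machine, Failed to delete virtual disk: Unable to connect to the remote server.
(SnapDrive Error Code: 0xc0040414)
[10:37:14.530] [DAGNODE1] [SnapDrive Error]:The LUN may not be connected, because its mount point cannot be found.
(SnapDrive Error Code: 0xc0041085)
[10:37:14.530] [DAGNODE1] SnapDrive failed to dismount the snapshot.
[10:37:15.013] [DAGNODE1] Unknown Error, Error Code: 0xc0041085
[10:37:15.013] [DAGNODE1] Re-trying to force dismounting LUN...
[10:37:15.294] [DAGNODE1] [SnapDrive Error]:The LUN with serial number 'B7sUt+C0gEVP' is not found on the system
(SnapDrive Error Code: 0xc0040374)
[10:37:15.294] [DAGNODE1] SnapManager will pause 70 seconds after force dismount, please wait...
[10:38:25.303] [DAGNODE1] SnapDrive failed to dismount the snapshot.
[10:38:25.303] [DAGNODE1] [SnapDrive Error]:The LUN with serial number 'B7sUt+C0gEVP' is not found on the system
"""


def snapdrive_errors(lines):
    """Return (timestamp, message, code) for each [SnapDrive Error] entry.

    The code is None when no "(SnapDrive Error Code: ...)" line follows
    directly after the error line.
    """
    errors = []
    pending = False  # True while the last error is still waiting for a code
    for raw in lines:
        line = raw.strip()
        m = LINE_RE.match(line)
        if m:
            pending = m.group(3).startswith(PREFIX)
            if pending:
                errors.append([m.group(1), m.group(3)[len(PREFIX):], None])
            continue
        c = CODE_RE.search(line)
        if c and pending:
            errors[-1][2] = c.group(1)
            pending = False
    return [tuple(e) for e in errors]


for ts, msg, code in snapdrive_errors(SAMPLE.splitlines()):
    print(ts, code, msg)
```

Sorting the output by timestamp makes it easier to see that the RDM delete failure (0xc0040414) comes first and the later "not found" errors follow from the retry.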
Going to the SDW debug logs for more detail, it looks like the mount point removal completes (although an empty mount point directory is still left in that location?) but the RDM removal fails:
SDW Debug log...
04/09-10:22:46.643 | PID:4744 | TID:2344 | SnapShot.cpp@4151 | | MountSnapShotInternal: mch='MAIL-DAGNODE1', livedsk='G', snap='exchsnap__mail-DAGNODE1_04-09-2015_07.22.09', bwritble=1, bAuto=0, mntPtPath='C:\Program Files\NetApp\SnapManager for Exchange\SnapMgrMountPoint\MPDisk114', smdest=0, dfile='none', dvol='none', mcalback=-1, shared='0', clustergroup='none', SnapVaultSecondaryConnect='0', IgnoreCloneLicense='0' |
..
04/09-10:25:55.844 | PID:4744 | TID:2344 | DrvLetEdit.cpp@708 | | Start addVolumeMountPoint(), volumeName:'\\?\Volume{22b06c02-dc83-11e4-ba5c-0050569c4f7e}\', mount point:'C:\Program Files\NetApp\SnapManager for Exchange\SnapMgrMountPoint\MPDisk114\' |
..
04/09-10:35:29.251 | PID:4744 | TID:9132 | DrvLetEdit.cpp@673 | | Start removeVolumeMountPoint(), volumeName:'\\?\Volume{22b06c02-dc83-11e4-ba5c-0050569c4f7e}\', mount point:'C:\Program Files\NetApp\SnapManager for Exchange\SnapMgrMountPoint\MPDisk114\' |
..
04/09-10:35:29.251 | PID:4744 | TID:9132 | DrvLetEdit.cpp@677 | | Finish removeVolumeMountPoint() |
..
04/09-10:35:30.297 | PID:4744 | TID:9132 | FCPVdisk.cpp@4325 | | CFCPVdisk::DeleteRawDeviceMapping() Lun Serial No=B7sUt+C0gEVP |
04/09-10:35:58.063 | PID:4744 | TID:5156 | System.cpp@4738 | | MonitorCacheThrdProc(): remaining time=9 minute(s) |
04/09-10:37:01.302 | PID:4744 | TID:9132 | FCPVdisk.cpp@4345 | | Failed to delete the virtual Disk through raw device mapping, , error code: '0x80004005', error description: 'Failed to delete virtual disk: Unable to connect to the remote server' |
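To measure how long the RDM delete call ran before it failed, the SDW debug lines can be parsed with a short Python sketch such as the one below. The pipe-separated field layout is inferred from the excerpt, the year 2015 is taken from the snapshot name, and `parse_sdw_line` and `elapsed_on_thread` are hypothetical helper names.

```python
from datetime import datetime

# Lines copied from the SDW debug excerpt above.
SDW_LINES = [
    "04/09-10:35:30.297 | PID:4744 | TID:9132 | FCPVdisk.cpp@4325 | | CFCPVdisk::DeleteRawDeviceMapping() Lun Serial No=B7sUt+C0gEVP |",
    "04/09-10:35:58.063 | PID:4744 | TID:5156 | System.cpp@4738 | | MonitorCacheThrdProc(): remaining time=9 minute(s) |",
    "04/09-10:37:01.302 | PID:4744 | TID:9132 | FCPVdisk.cpp@4345 | | Failed to delete the virtual Disk through raw device mapping, , error code: '0x80004005', error description: 'Failed to delete virtual disk: Unable to connect to the remote server' |",
]


def parse_sdw_line(line, year=2015):
    """Split one debug line into timestamp, PID, TID, source and message.

    Assumed layout: "MM/DD-HH:MM:SS.mmm | PID | TID | file@line | | message |".
    The log carries no year, so one is supplied by the caller.
    """
    parts = line.split("|", 5)
    stamp = datetime.strptime(
        f"{year}/{parts[0].strip()}", "%Y/%m/%d-%H:%M:%S.%f"
    )
    message = parts[5].strip()
    if message.endswith("|"):
        message = message[:-1].rstrip()  # drop the trailing field separator
    return {
        "time": stamp,
        "pid": parts[1].strip(),
        "tid": parts[2].strip(),
        "source": parts[3].strip(),
        "message": message,
    }


def elapsed_on_thread(lines, tid):
    """Seconds between the first and last entries logged by one thread."""
    entries = [parse_sdw_line(line) for line in lines]
    same_thread = [e for e in entries if e["tid"] == tid]
    return (same_thread[-1]["time"] - same_thread[0]["time"]).total_seconds()


print(elapsed_on_thread(SDW_LINES, "TID:9132"))
```

On the quoted lines, thread 9132 spends about 91 seconds between entering DeleteRawDeviceMapping() and logging the failure. That may point to a connection attempt timing out rather than being refused outright, though this is only a guess from the timing.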
Same error again, no more detail.
There is nothing in the Windows event logs around this time to indicate connectivity or other issues. The same goes for the filer logs.
So the questions I'm currently looking for answers on are:
1. "Unable to connect to the remote server" - is the remote server vCenter?
2. Are there any bugs or known issues that could relate to this? I've trawled the support resources and can't find anything to pin this on.