Two nights ago one of our production SQL servers crashed and was rebooted. The only errors in the Windows log files around the time of the crash are from OntapDSM and MPIO. The errors relate to multipathing and a missing disk/LUN (DSM ID: 0300041e), but I am having trouble identifying which LUN this relates to.
I cannot resolve the DSM ID in the error logs to any drive on the server or LUN on our filers.
DSM ID 0300041e seems an odd ID, as the other DSM IDs in the output of the "dsmcli path list" command all end in a number and not a letter.
Example: C:\Users\srvspeback>dsmcli path list
Path Info for W-XMJJ/ZolmA:
Number of Paths: 4
DSM ID    NexusID   Initiator Address        Target Portal
======    =======   =================        =============
03000500  03000502  21:00:00:24:ff:03:b8:39  50:0a:09:84:99:cb:9e:85 Slot:v.0a
03000400  03000401  21:00:00:24:ff:03:b8:39  50:0a:09:84:89:cb:9e:85 Slot:0a
02000500  02000502  21:00:00:24:ff:03:b7:ab  50:0a:09:83:99:cb:9e:85 Slot:v.0c
02000400  02000401  21:00:00:24:ff:03:b7:ab  50:0a:09:83:89:cb:9e:85 Slot:0c
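For anyone wanting to check a DSM ID from an event against every path on a host, here is a minimal Python sketch. It assumes the "dsmcli path list" output has the same shape as the sample above (one "Path Info for ..." header per LUN, then one row per path); the function names parse_paths and find_dsm_id are made up for illustration. The IDs in the sample look like eight-digit hexadecimal values, which would explain an ID ending in a letter, but that reading is my assumption and not something from NetApp documentation.

```python
import re

# Sample copied from the post; real dsmcli output may be laid out differently.
SAMPLE = """\
Path Info for W-XMJJ/ZolmA:
Number of Paths: 4
DSM ID    NexusID   Initiator Address        Target Portal
======    =======   =================        =============
03000500  03000502  21:00:00:24:ff:03:b8:39  50:0a:09:84:99:cb:9e:85 Slot:v.0a
03000400  03000401  21:00:00:24:ff:03:b8:39  50:0a:09:84:89:cb:9e:85 Slot:0a
02000500  02000502  21:00:00:24:ff:03:b7:ab  50:0a:09:83:99:cb:9e:85 Slot:v.0c
02000400  02000401  21:00:00:24:ff:03:b7:ab  50:0a:09:83:89:cb:9e:85 Slot:0c
"""

# Assumed row shape: DSM ID, NexusID, initiator WWPN, target WWPN, slot label.
WWPN = r"(?:[0-9a-f]{2}:){7}[0-9a-f]{2}"
ROW = re.compile(
    rf"^\s*([0-9a-f]{{8}})\s+([0-9a-f]{{8}})\s+({WWPN})\s+({WWPN})\s+(\S+)\s*$",
    re.IGNORECASE,
)
LUN_HEADER = re.compile(r"^\s*Path Info for (.+?):\s*$")


def parse_paths(text):
    """Return one dict per path row, tagged with the LUN header above it."""
    paths, lun = [], None
    for line in text.splitlines():
        header = LUN_HEADER.match(line)
        if header:
            lun = header.group(1)
            continue
        row = ROW.match(line)
        if row:
            dsm_id, nexus_id, initiator, target, slot = row.groups()
            paths.append({
                "lun": lun,
                "dsm_id": dsm_id,
                "nexus_id": nexus_id,
                "initiator": initiator,
                "target": target,
                "slot": slot,
            })
    return paths


def find_dsm_id(paths, wanted):
    """Return every parsed path whose DSM ID matches (case-insensitive)."""
    wanted = wanted.lower()
    return [p for p in paths if p["dsm_id"].lower() == wanted]


paths = parse_paths(SAMPLE)
print(len(paths), "paths parsed")
print(find_dsm_id(paths, "0300041e") or "0300041e is not in the current path list")
```

An empty result for 0300041e is consistent with the path having already been removed by the time the command was run.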
The errors in the Windows Event logs are as follows:
The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 (0x0000000000000000, 0x0000000000000002, 0x0000000000000008, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 020111-28922-01.
All paths have failed. \Device\MPIODisk118 will be removed.
ONTAP reported that the LUN on DSM ID 0300041e is not supported. The data section of this log entry contains additional information.
DSM ID 0300041e has initiated a fail-over.
All paths have failed. \Device\MPIODisk118 will be removed.
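To pull the useful fields out of messages like these, a rough Python sketch follows. The regular expressions are written against the wording quoted above and are assumptions about the message format, not a documented schema. The parameter meanings for bugcheck 0xD1 (DRIVER_IRQL_NOT_LESS_OR_EQUAL) follow Microsoft's bug check reference: memory referenced, IRQL, access type, and the address that referenced the memory. An execute access at address 0 hints at a call through a null pointer in some driver, but that is only an inference; the MEMORY.DMP would need to be analysed to name the driver.

```python
import re
from collections import Counter

# Messages copied from the post (shortened); treat them as sample input only.
EVENTS = [
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 "
    "(0x0000000000000000, 0x0000000000000002, 0x0000000000000008, "
    "0x0000000000000000). A dump was saved in: C:\\Windows\\MEMORY.DMP.",
    "All paths have failed. \\Device\\MPIODisk118 will be removed.",
    "ONTAP reported that the LUN on DSM ID 0300041e is not supported.",
    "DSM ID 0300041e has initiated a fail-over.",
    "All paths have failed. \\Device\\MPIODisk118 will be removed.",
]

BUGCHECK = re.compile(r"bugcheck was:\s*(0x[0-9a-f]+)\s*\(([^)]*)\)", re.IGNORECASE)
DSM_ID = re.compile(r"DSM ID:?\s*([0-9a-f]{8})\b", re.IGNORECASE)
MPIO_DISK = re.compile(r"\\Device\\(MPIODisk\d+)")

# Access-type values for parameter 3 of bugcheck 0xD1.
ACCESS_0XD1 = {0: "read", 1: "write", 2: "execute", 8: "execute"}


def decode_bugcheck(message):
    """Return the bugcheck code and parameters found in a message, or None."""
    match = BUGCHECK.search(message)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p, 16) for p in match.group(2).split(",")]
    decoded = {"code": code, "params": params}
    if code == 0xD1 and len(params) == 4:
        decoded.update(
            name="DRIVER_IRQL_NOT_LESS_OR_EQUAL",
            memory_referenced=params[0],
            irql=params[1],
            access=ACCESS_0XD1.get(params[2], "unknown"),
            referencing_address=params[3],
        )
    return decoded


def summarize(events):
    """Count DSM IDs and MPIO disks mentioned, and collect decoded bugchecks."""
    dsm_ids, disks, bugchecks = Counter(), Counter(), []
    for message in events:
        dsm_ids.update(i.lower() for i in DSM_ID.findall(message))
        disks.update(MPIO_DISK.findall(message))
        decoded = decode_bugcheck(message)
        if decoded:
            bugchecks.append(decoded)
    return dsm_ids, disks, bugchecks


dsm_ids, disks, bugchecks = summarize(EVENTS)
print(dict(dsm_ids), dict(disks))
print(bugchecks[0]["name"], "IRQL", bugchecks[0]["irql"], bugchecks[0]["access"])
```

The DSM IDs collected this way can then be compared with whatever "dsmcli path list" reports at the time.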
I have logged a case with Netapp (2001979460) regarding this issue. They have acknowledged that there is an issue with OntapDSM version 3.3.x and have advised us to upgrade to OntapDSM version 3.4 as the issue is resolved in this release.
The OntapDSM 3.4 release notes specify that Windows Hotfixes be installed prior to upgrade (particularly MS KB 979743). This hotfix is required because of an issue with MS MPIO. See Release notes attached.
We are running Windows 2008 R2 servers (SQL 2010 cluster) in both the test and prod environments. Each environment has LUNs on the same filers. Snapdrive 6.3 and Snapmanager for SQL are installed.
We have installed the Hotfix and OntapDSM 3.4 onto the servers in our test environment and are monitoring to see if the above errors re-occur. I will update this discussion with the outcome.
This has been a tricky issue so I hope this information helps others out.
After installing OntapDSM 3.4 and the Hotfix and rebooting the servers, we are still seeing the errors in the Windows Event logs. I have escalated this to Netapp Support and they have asked for the OntapWinDC diagnostic tool to be run on the SQL Cluster servers. I will then upload the results to their FTP site for analysis.
I have been trying to organise the various support teams to get this sorted out, but through leave or sickness I have been thwarted so far.
The Windows Server team are not yet able (or willing) to install SP1 for Windows Server 2008 R2. I believe they are testing this in our test environment.
I tried a few things with the Backup team, including excluding the C: and D: drives from the backup. These are the Windows System drive (C:) and the drive for installing various apps (D:), if needed. The errors still persisted.
I then asked the Windows Server team to uninstall the Netbackup client from the hosts (with the permission of the Backup team). This resolved the errors (i.e. we got rid of the errors altogether). I have now had the Netbackup client re-installed on the host and the errors have come back.
What I am trying to do now is get consensus on whether the teams require a backup of the System State, C: and D:. If they do not need these backups then I will ask the teams to completely uninstall the Netbackup client.
However, this seems a strange way to fix this issue as there must be other organisations out there that run Netbackup client on servers where snapshotting of SQL LUNs is also done. If so can anyone tell me how they handled this issue (if they saw it) in their own environments?
This issue surfaces when the Netbackup backups are running (around 7pm each night). The problem occurs when Netbackup creates and mounts its snapshots for backup purposes. OntapDSM sees the new LUNs and queries the Storage System about them, but when the Storage System returns no details about them, OntapDSM throws the error about a LUN that is not supported (Event ID: 61085).
"ONTAP reported that the LUN on DSM ID 0300041e is not supported. The data section of this log entry contains additional information"
Thanks Wayne !!! This information you posted is very useful.
I remember having an issue with the VSS writers: whenever I ran vssadmin list writers, two of them showed as waiting for completion. When I engaged VERITAS, they stated this may be an issue caused by the OS…. The job would always miss the following five files.
This is what is being reported from VERITAS's NetBackup application:
10/28/2009 1:51:39 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\System Files\System Files (BEDS 0xE0009421: No component files present on the snapshot.)
10/28/2009 1:51:41 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\Automated System Recovery\BCD\BCD (BEDS 0xE0009421: No component files present on the snapshot.)
10/28/2009 1:51:42 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\COM+ Class Registration Database\COM+ REGDB (BEDS 0xE0009421: No component files present on the snapshot.)
10/28/2009 1:51:43 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\Registry\Registry (BEDS 0xE0009421: No component files present on the snapshot.)
10/28/2009 1:51:45 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System Service\Windows Management Instrumentation\WMI (BEDS 0xE0009421: No component files present on the snapshot.)
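If you have a lot of these warnings to go through, a small Python sketch like the one below can group them by BEDS code and list the affected components. The regular expression is written against the lines quoted above and is only my assumption about the bpbrm warning format; the names WARNING and parse_warnings are made up for the example.

```python
import re
from collections import defaultdict

# Two of the warnings from the post, used as sample input.
LOG = r"""
10/28/2009 1:51:39 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\System Files\System Files (BEDS 0xE0009421: No component files present on the snapshot.)
10/28/2009 1:51:43 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\Registry\Registry (BEDS 0xE0009421: No component files present on the snapshot.)
"""

# Assumed line shape, taken from the quoted warnings.
WARNING = re.compile(
    r"^(?P<when>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) - Warning "
    r"bpbrm\(pid=(?P<pid>\d+)\) from client (?P<client>[^:]+): WRN - "
    r"can't open object: (?P<object>.+?) \(BEDS (?P<code>0x[0-9A-Fa-f]+): "
    r"(?P<reason>.+)\)\s*$"
)


def parse_warnings(text):
    """Return a dict of fields for every line that matches the warning shape."""
    warnings = []
    for line in text.splitlines():
        match = WARNING.match(line.strip())
        if match:
            warnings.append(match.groupdict())
    return warnings


warnings = parse_warnings(LOG)

# Group the affected objects under the BEDS code that was reported for them.
by_code = defaultdict(list)
for warning in warnings:
    by_code[warning["code"]].append(warning["object"])

for code, objects in by_code.items():
    print(code, len(objects), "object(s)")
    for obj in objects:
        print("   ", obj)
```

Lines that do not match the assumed shape are skipped silently, so it is worth checking the count of parsed warnings against the raw log.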
After deleting the Shadows and Shadow Storage, all of the above errors were gone with the exception of the following:
serversp02: WRN - can't open object: Shadow Copy Components:\System State\System Files\System Files (BEDS 0xE000FEDD: A failure occurred accessing the object list.)
11/5/2009 4:44:00 PM - end writing; write time: 00:30:43
This issue was really caused by the earlier version of VERITAS NetBackup in use; VERITAS then stated that it needed to be updated to the most current version.
When VERITAS first started supporting Windows 2008 at 6.5.2 a number of issues were reported, but they were not fixed until 6.5.4. NetBackup cannot successfully perform a DR of a Windows 2008 box until 6.5.4. Any previous version will result in an error when performing a DR on the box.
What could be happening in your case, with whatever version of Netbackup you are using, is that after a backup of the System State completes it does not reset the flags it puts on the LUN to lock it, and OnTapDSM then generates Event ID: 61085.
Message was edited by: miketexas
Sorry, forgot to include the supporting documentation from VERITAS. Even though you may have the latest Netbackup Client, you may want to contact VERITAS Technical Support to see if they have a NEW fix that addresses your issue with OnTapDSM.
Fixed in 6.5.4 (ET1506354) A potential for data loss has been discovered in NetBackup Enterprise Server when backing up hard links on a Windows 2008 Server or Vista client, if these hard links are also Shadow Copy Components. This does not affect user data at this time.
Was there ever a resolution to this problem? I have a similar setup: Windows Server 2008 R2 and a SQL cluster. Every so often our main SQL cluster loses access to the NetApp LUNs and they go offline, which takes SQL offline and wreaks havoc on our system. I end up having to re-initiate the host's connection to the LUNs. This seems to happen during times of heavy workload.
I have SnapDrive 6.3 and SMSQL 5.1. I'm not quite sure what DSM version I'm running.
This issue was caused (for us anyway) by "Ghost" LUNs (.rws LUNs) remaining on the server when they should have been removed. As described above, this was caused by the Netapp snapshot process and the Netbackup process overlapping: when Netbackup saw the Netapp .rws LUNs it knew nothing about them and threw the errors (mentioned above). It then froze the .rws LUNs in place and Data Ontap/Snapdrive could not remove them.
The process we followed to fix this was to upgrade the OntapDSM and Netbackup software and exclude the Netapp LUNs from the Netbackup backup schedules.