DSM error prior to SQL cluster reboot

waynehapu

Hi All,

Two nights ago one of our production SQL servers crashed and was rebooted. The only errors in the Windows log files around the time of the crash are from OntapDSM and MPIO. They relate to multipathing and a missing disk/LUN (DSM ID: 0300041e), but I am having trouble identifying which LUN this is.

I cannot resolve the DSM ID in the error logs to any drive on the server or LUN on our filers.

DSM ID 0300041e seems an odd ID, as the other DSM IDs in the output of the "dsmcli path list" command end in a number and not a letter.

Example:
C:\Users\srvspeback>dsmcli path list

Path Info for W-XMJJ/ZolmA:
Number of Paths: 4
DSM ID        NexusID       Initiator Address               Target Portal
====== ======= ================= =============
03000500 03000502 21:00:00:24:ff:03:b8:39 50:0a:09:84:99:cb:9e:85 Slot:v.0a
03000400 03000401 21:00:00:24:ff:03:b8:39 50:0a:09:84:89:cb:9e:85 Slot:0a
02000500 02000502 21:00:00:24:ff:03:b7:ab 50:0a:09:83:99:cb:9e:85 Slot:v.0c
02000400 02000401 21:00:00:24:ff:03:b7:ab 50:0a:09:83:89:cb:9e:85 Slot:0c
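
For anyone wanting to check this kind of mismatch quickly, here is a minimal Python sketch that parses the path rows from "dsmcli path list" output and tests whether a given DSM ID is listed. The row layout is assumed from the sample output above (the fifth column is treated as a slot label), and parse_paths is a made-up helper name, not part of any NetApp tool. The IDs look like hexadecimal values, so a trailing letter such as "e" would be a valid digit rather than a malformed ID.

```python
import re

# One path row: DSM ID, NexusID, initiator WWPN, target WWPN, slot label.
# This layout is an assumption based on the sample output in this thread.
ROW = re.compile(
    r"^(?P<dsm_id>[0-9a-fA-F]{8})\s+(?P<nexus_id>[0-9a-fA-F]{8})\s+"
    r"(?P<initiator>\S+)\s+(?P<target>\S+)\s+(?P<slot>\S+)$"
)

SAMPLE = """\
Number of Paths: 4
DSM ID        NexusID       Initiator Address               Target Portal
====== ======= ================= =============
03000500 03000502 21:00:00:24:ff:03:b8:39 50:0a:09:84:99:cb:9e:85 Slot:v.0a
03000400 03000401 21:00:00:24:ff:03:b8:39 50:0a:09:84:89:cb:9e:85 Slot:0a
02000500 02000502 21:00:00:24:ff:03:b7:ab 50:0a:09:83:99:cb:9e:85 Slot:v.0c
02000400 02000401 21:00:00:24:ff:03:b7:ab 50:0a:09:83:89:cb:9e:85 Slot:0c
"""


def parse_paths(text):
    """Return one dict per path row; header and separator lines are skipped."""
    paths = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            row = match.groupdict()
            row["dsm_id"] = row["dsm_id"].lower()
            row["nexus_id"] = row["nexus_id"].lower()
            paths.append(row)
    return paths


if __name__ == "__main__":
    wanted = "0300041e"
    listed = {p["dsm_id"] for p in parse_paths(SAMPLE)}
    print(wanted, "listed" if wanted in listed else "not listed", sorted(listed))
```

Running the same check against the output saved for every LUN on the host would show whether the ID from the event log belongs to any path that is still present.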

=====================================================================================================

Platforms:

SQL server Win2K8 R2

DSM version is 3.3.25186

LUNs via FC (SnapDrive 6.3)

Data Ontap 7.3.4 (FAS3140)

=====================================================================================================

Any help would be much appreciated - Thanks in advance.

Kind Regards

Wayne H

     ==========================================================

The errors in the Windows Event logs are as follows:

The computer has rebooted from a bugcheck. The bugcheck was: 0x000000d1 (0x0000000000000000, 0x0000000000000002, 0x0000000000000008, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 020111-28922-01.
All paths have failed. \Device\MPIODisk118 will be removed.
ONTAP reported that the LUN on DSM ID 0300041e is not supported. The data section of this log entry contains additional information.
DSM ID 0300041e has initiated a fail-over.
All paths have failed. \Device\MPIODisk118 will be removed.


Log Name: System
Source: mpio
Date: 1/02/2011 7:11:50 PM
Event ID: 16
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: hppewscl2xx.xx

Description:
A fail-over on \Device\MPIODisk118 occurred.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="mpio" />
<EventID Qualifiers="49160">16</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2011-02-01T08:11:50.443267400Z" />
<EventRecordID>47381</EventRecordID>
<Channel>System</Channel>
<Computer>hppewscl2xx.xx</Computer>
<Security />
</System>
<EventData>
<Data>\Device\MPIODisk118</Data>
<Binary>000008000100000000000000100008C00200000000000000000000000000000000000000000000000104000300000000</Binary>
</EventData>
</Event>

Log Name: System
Source: mpio
Date: 1/02/2011 7:11:50 PM
Event ID: 23
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: hppewscl2xx.xx

Description:
All paths have failed. \Device\MPIODisk118 will be removed.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="mpio" />
<EventID Qualifiers="49160">23</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2011-02-01T08:11:50.443267400Z" />
<EventRecordID>47380</EventRecordID>
<Channel>System</Channel>
<Computer>hppewscl2xx.xx</Computer>
<Security />
</System>
<EventData>
<Data>\Device\MPIODisk118</Data>
<Binary>000000000100000000000000170008C0170000000E0000C000000000000000000000000000000000</Binary>
</EventData>
</Event>
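
To put events like the two mpio entries above into order, a small Python sketch can pull the key fields out of each Event XML block and sort by record ID. The XML below is trimmed from those two events; summarize is a hypothetical helper name, and only the standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Event> element in the Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

EVENT_16 = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="mpio" />
<EventID Qualifiers="49160">16</EventID>
<TimeCreated SystemTime="2011-02-01T08:11:50.443267400Z" />
<EventRecordID>47381</EventRecordID>
</System>
<EventData>
<Data>\Device\MPIODisk118</Data>
</EventData>
</Event>"""

EVENT_23 = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="mpio" />
<EventID Qualifiers="49160">23</EventID>
<TimeCreated SystemTime="2011-02-01T08:11:50.443267400Z" />
<EventRecordID>47380</EventRecordID>
</System>
<EventData>
<Data>\Device\MPIODisk118</Data>
</EventData>
</Event>"""


def summarize(event_xml):
    """Extract provider, event ID, record ID, UTC time and non-empty data strings."""
    root = ET.fromstring(event_xml)
    system = root.find("e:System", NS)
    data = [
        d.text.strip()
        for d in root.findall("e:EventData/e:Data", NS)
        if d.text and d.text.strip()
    ]
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "record_id": int(system.find("e:EventRecordID", NS).text),
        "time_utc": system.find("e:TimeCreated", NS).get("SystemTime"),
        "data": data,
    }


# Record IDs increase as events are written, so sorting by them gives log order.
ordered = sorted((summarize(x) for x in (EVENT_16, EVENT_23)),
                 key=lambda ev: ev["record_id"])

if __name__ == "__main__":
    for ev in ordered:
        print(ev["record_id"], ev["provider"], ev["event_id"], ev["data"])
```

Sorting by record ID rather than timestamp is a deliberate choice here, since both events carry the same SystemTime value.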

Log Name: System
Source: ontapdsm
Date: 1/02/2011 7:11:50 PM
Event ID: 61077
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: hppewscl2xx.xx

Description:
DSM ID 0300041e has initiated a fail-over.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="ontapdsm" />
<EventID Qualifiers="33024">61077</EventID>
<Level>3</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2011-02-01T08:11:50.256067400Z" />
<EventRecordID>47379</EventRecordID>
<Channel>System</Channel>
<Computer>hppewscl2xx.xx</Computer>
<Security />
</System>
<EventData>
<Data>
</Data>
<Data>0300041e</Data>
<Binary>0F002C00020054000000000095EE00810400000000000000000000000000000000000000000000001E04000301040003850100C00A0520000000840205250000000000000000000000000000000000007200FFFF</Binary>
</EventData>
</Event>

Log Name: System
Source: ontapdsm
Date: 1/02/2011 7:11:50 PM
Event ID: 61085
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: hppewscl2xx.xx

Description:
ONTAP reported that the LUN on DSM ID 0300041e is not supported. The data section of this log entry contains additional information.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="ontapdsm" />
<EventID Qualifiers="49408">61085</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2011-02-01T08:11:50.256067400Z" />
<EventRecordID>47378</EventRecordID>
<Channel>System</Channel>
<Computer>hppewscl2xx.xx</Computer>
<Security />
</System>
<EventData>
<Data>
</Data>
<Data>0300041e</Data>
<Binary>0F002C0002005400000000009DEE00C10400000000000000000000000000000000000000000000001E04000301040003850100C00A0520000000840205250000000000000000000000000000000000007200FFFF</Binary>
</EventData>
</Event>
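
The ontapdsm events say the data section contains additional information. The layout of that Binary field is not documented in this thread, so the following Python sketch only does one thing that can be verified: it searches the hex for the DSM ID stored as a little-endian 32-bit value and prints the word that follows it. EXCERPT is a short fragment copied from the Binary field of the two ontapdsm events above; find_id and word_at are made-up helper names, and any meaning attached to the neighbouring words is a guess, not something confirmed by NetApp.

```python
def find_id(binary_hex, dsm_id):
    """Return byte offsets where dsm_id occurs as a little-endian 32-bit value."""
    data = bytes.fromhex(binary_hex)
    needle = bytes.fromhex(dsm_id)[::-1]  # "0300041e" -> bytes 1e 04 00 03
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def word_at(binary_hex, offset):
    """Read 4 bytes at offset as a little-endian integer, formatted as 8 hex digits."""
    data = bytes.fromhex(binary_hex)
    return format(int.from_bytes(data[offset:offset + 4], "little"), "08x")


# Fragment of the <Binary> value from the ontapdsm events (not the full field).
EXCERPT = "00001E04000301040003850100C0"

if __name__ == "__main__":
    for off in find_id(EXCERPT, "0300041e"):
        print("DSM ID found at byte offset", off)
        print("next 32-bit word:", word_at(EXCERPT, off + 4))
```

In this fragment the word after the DSM ID decodes to 03000401, which also appears as a NexusID in the "dsmcli path list" output earlier in the post. That may hint at which path the failing ID sat on, but it is only an observation from the bytes.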



waynehapu

...Update....

I have logged a case with NetApp (2001979460) regarding this issue. They have acknowledged that there is an issue with OntapDSM version 3.3.x and have advised us to upgrade to OntapDSM version 3.4, where the issue is resolved.

The OntapDSM 3.4 release notes specify that Windows hotfixes must be installed prior to the upgrade (particularly MS KB 979743). This hotfix is required because of an issue with MS MPIO. See the attached release notes.

We are running Windows 2008 R2 servers (SQL 2010 cluster) in both our test and prod environments. Each environment has LUNs on the same filers and uses SnapDrive 6.3 and SnapManager for SQL.

We have installed the hotfix and OntapDSM 3.4 on the servers in our test environment and are monitoring to see if the above errors recur. I will update this discussion with the outcome.

This has been a tricky issue, so I hope this information helps others out.

Cheers.

WH

miketexas

Wayne, you posted the following:

"We have installed the Hotfix and OntapDSM 3.4 onto the servers in our test environment and are monitoring to see if the above errors re-occur. I will update this discussion with outcome."

Do you have any updates on your testing/monitoring after updating the OntapDSM and MPIO?

Thanks!

waynehapu

Hi there,

After installing OntapDSM 3.4 and the hotfix and rebooting the servers, we are still seeing the errors in the Windows Event logs. I have escalated this to NetApp Support, and they have asked for the OntapWinDC diag tool to be run on the SQL cluster servers. I will then upload the output to their FTP site for analysis.

WH

miketexas

Hi Wayne,

Has NetApp provided you any updates yet? Have you tried installing Windows Server 2008 R2 SP1 to see if it makes any difference or alleviates the situation?

Regards,

Mike

waynehapu

Hi Mike,

Sorry about the delay in replying.

I have been trying to organise the various support teams to get this sorted out, but through leave or sickness I have been thwarted so far.

The Windows Server team are not yet able (or willing) to install SP1 for Windows Server 2008 R2. I believe they are testing this in our test environment.

I tried a few things with the Backup team, including excluding the C: and D: drives from the backup. These are the Windows system drive (C:) and the drive used for installing various apps (if needed). The errors still persisted.

I then asked the Windows Server team to uninstall the NetBackup client from the hosts (with the permission of the Backup team). This resolved the errors (i.e. we got rid of them altogether). I have now had the NetBackup client re-installed on the host and the errors have come back.

What I am trying to do now is get consensus on whether the teams require a backup of the System State, C: and D:. If they do not need these backups then I will ask the teams to completely uninstall the NetBackup client.

However, this seems a strange way to fix this issue, as there must be other organisations out there that run the NetBackup client on servers where snapshotting of SQL LUNs is also done. If so, can anyone tell me how they handled this issue (if they saw it) in their own environments?

Cheers

WayneH

waynehapu

Hi Mike,

I missed out part of my findings:

This issue surfaces when the NetBackup backups are running (around 7pm each night). The problem occurs when NetBackup creates and mounts its snapshots for backup purposes. OntapDSM sees the new LUNs and queries the storage system about them, but when the storage system returns no details about them, OntapDSM throws the error about a LUN that is not supported (Event ID: 61085).

"ONTAP reported that the LUN on DSM ID 0300041e is not supported. The data section of this log entry contains additional information"
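
One way to check whether the errors really line up with the backup window is to bucket the event timestamps by local hour. The sketch below is in Python and uses the two SystemTime values from the events in the first post. The UTC+11 offset is an assumption inferred from the log itself (the Date field shows 7:11:50 PM where SystemTime shows 08:11:50Z), and local_hour is a made-up helper name.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed local offset, inferred from the Date vs SystemTime fields in the log.
LOCAL_OFFSET = timedelta(hours=11)

# SystemTime values copied from the mpio and ontapdsm events in the first post.
TIMES = [
    "2011-02-01T08:11:50.443267400Z",
    "2011-02-01T08:11:50.256067400Z",
]


def local_hour(system_time):
    """Convert a SystemTime string to the local hour of day (0-23)."""
    # The log carries 7+ fractional digits; %f accepts at most 6, so truncate.
    utc = datetime.strptime(system_time[:26], "%Y-%m-%dT%H:%M:%S.%f")
    return (utc + LOCAL_OFFSET).hour


hours = Counter(local_hour(t) for t in TIMES)

if __name__ == "__main__":
    for hour, count in sorted(hours.items()):
        print(f"{hour:02d}:00  {count} event(s)")
```

Fed with timestamps exported from several nights of logs, a strong cluster in the 19:00 bucket would support the link to the backup schedule, while a spread across the day would argue against it.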

Hope this helps..

Regards

WayneH

miketexas

Thanks Wayne !!!  This information you posted is very useful.

I remember having an issue with the VSS writers, where two of them showed as waiting for completion whenever I executed vssadmin list writers. When I engaged VERITAS, they stated this may be an issue caused by the OS. The job would always miss the following five files.

This is what is being reported from VERITAS's NetBackup application:

10/28/2009 1:51:39 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\System Files\System Files (BEDS 0xE0009421: No component files present on the snapshot.)

10/28/2009 1:51:41 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\Automated System Recovery\BCD\BCD (BEDS 0xE0009421: No component files present on the snapshot.)

10/28/2009 1:51:42 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\COM+ Class Registration Database\COM+ REGDB (BEDS 0xE0009421: No component files present on the snapshot.)

10/28/2009 1:51:43 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System State\Registry\Registry (BEDS 0xE0009421: No component files present on the snapshot.)

10/28/2009 1:51:45 PM - Warning bpbrm(pid=3664) from client serversp02: WRN - can't open object: Shadow Copy Components:\System Service\Windows Management Instrumentation\WMI (BEDS 0xE0009421: No component files present on the snapshot.)


After deleting the shadow copies and shadow storage, all the above errors were gone with the exception of the following:

serversp02: WRN - can't open object: Shadow Copy Components:\System State\System Files\System Files (BEDS 0xE000FEDD: A failure occurred accessing the object list.)

11/5/2009 4:44:00 PM - end writing; write time: 00:30:43

This issue was really being caused by the earlier version of VERITAS NetBackup in use, and they then stated that it needed to be updated to the most current version.

When VERITAS first started supporting Windows 2008 at 6.5.2, a number of issues were reported, but they were not fixed until 6.5.4. NetBackup cannot successfully perform a DR of a Windows 2008 box until 6.5.4. Any previous version will result in an error when performing a DR on the box.

What could be happening in your case, with whatever version of NetBackup is being used, is that after a backup of the System State is done it is not resetting the flags it puts on the LUN to lock it, and OntapDSM then generates Event ID 61085.

Message was edited by: miketexas

Sorry, forgot to include the supporting documentation from VERITAS.  Even though you may have the latest Netbackup Client, you may want to contact VERITAS Technical Support to see if they have a NEW fix that addresses your issue with OnTapDSM.

Fixed in 6.5.4 (ET1506354) A potential for data loss has been discovered in NetBackup Enterprise Server when backing up hard links on a Windows 2008 Server or Vista client, if these hard links are also Shadow Copy Components. This does not affect user data at this time.

http://support.veritas.com/docs/318083

Fixed in 6.5.4 (ET1469321) When performing a full restore to a Windows 2008 or Vista client, the restore job fails in recovering Boot Configuration Data (BCD), preventing the system from booting.

http://support.veritas.com/docs/321132

Fixed in 6.5.4 (ET1432045) BUG REPORT: After a DR Restore of a Windows 2008 Client, the Event Viewer Service will not start.

http://support.veritas.com/docs/315024

waynehapu

Thanks Mike,

We are using NetBackup client 7.0.1. I will circulate your email to the backup team for their comment.

Thanks again.

Regards

Wayne Hapuku

Enterprise Storage Administrator

Dept of Education and Training (NSW)

Email: Wayne.Hapuku@det.nsw.edu.au

Deskphone: 02 9942 9715

dopp10

Was there ever a resolution to this problem? I have a similar setup: Windows Server 2008 R2 and a SQL cluster. Every so often our main SQL cluster loses access to the NetApp LUNs and they go offline, wreaking havoc on our system and taking SQL offline. I end up having to reinitiate the host's connection to the LUNs, and this seems to happen during times of heavy workload.

I have SnapDrive 6.3 and SMSQL 5.1. I'm not quite sure what DSM version I'm running.

waynehapu

Hi dopp10,

This issue was caused (for us anyway) by "ghost" LUNs (.rws LUNs) remaining on the server when they should have been removed. As mentioned above, this was caused by the NetApp snapshot process and the NetBackup process overlapping: when NetBackup saw the NetApp .rws LUNs it knew nothing about them and threw the errors mentioned above. It then froze the .rws LUNs in place, and Data ONTAP/SnapDrive could not remove them.

The process we followed to fix this was to upgrade the OntapDSM and NetBackup software and exclude the NetApp LUNs from the NetBackup backup schedules.

Hope this helps.
