Roundtable: DR for Microsoft Applications with VMware SRM and NetApp

by NetApp Staff on 2010-04-23 11:28 AM

The February 2010 issue of Tech OnTap featured an article on virtualizing Microsoft applications using VMware®, NetApp®, and Cisco technologies. As a follow-on to that article, Tech OnTap sat down with Wen Yu of VMware and Larry Touchette of NetApp to dig into the details of disaster recovery for Microsoft® applications and learn why the adoption rate for DR is higher in VMware/NetApp environments.

Tech OnTap: People seem hesitant to use online disaster recovery widely in Microsoft application environments. What are the factors that you find contribute to this?

Wen Yu (VMware): At VMware, we typically find that there are three key reasons. First, and perhaps biggest, is the cost involved with doing DR. You don’t just need a second facility—you need a number of additional servers, network gear, and twice the storage. These costs can be prohibitive regardless of whether you’re working with physical or virtual servers.

Second, there has traditionally been a high degree of complexity associated with performing DR, especially in physical server environments and even more so when you try to implement DR across multiple applications. You can end up with a confusing combination of products and technologies to get the job done. A lot of products out there also require that you have an almost identical configuration on both sides, adding to the cost.

Finally, the network bandwidth required to achieve an adequate RPO can be a limitation for many. A lot of Windows® shops may not have the necessary bandwidth in place to do replication and may hesitate to invest in the bandwidth needed to make it feasible.
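To make the bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes asynchronous replication in which each update must finish shipping within the replication interval in order to keep up. The function name and the sample figures are illustrative only and do not come from the roundtable.

```python
def required_mbps(changed_gb: float, window_minutes: float) -> float:
    """Average throughput (megabits per second) needed to ship
    changed_gb of data within window_minutes.

    Uses decimal units (1 GB = 8000 megabits) and ignores protocol
    overhead, deduplication, and compression.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    megabits = changed_gb * 8 * 1000
    return megabits / (window_minutes * 60)


if __name__ == "__main__":
    # Hypothetical workload: 9 GB of changed data per hourly update.
    print(round(required_mbps(9, 60), 1))  # 20.0
```

Halving the window doubles the required throughput, which is why a tighter RPO tends to translate directly into a larger WAN bill.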

The joint solution that NetApp, Cisco, and VMware created addresses a lot of these issues.

Larry Touchette (NetApp): To elaborate on Wen’s last thought, NetApp and VMware take a lot of the cost and complexity out of DR, so you can deploy a solution that covers a much greater number of applications—your entire VMware environment if you want. Some joint customers have been able to offset, or even completely finance, storage for a DR environment with the money saved by using NetApp deduplication on primary VMware storage. Anyone who’s been reading Tech OnTap fairly regularly knows about the benefits of NetApp deduplication in combination with VMware. [This article is a good starting point to learn more about VMware DR and dedupe.—TOT Editor]

The adoption of DR in joint VMware/NetApp environments is quite high. I think these factors are driving that higher adoption.

TOT: Why would someone choose to use VMware Site Recovery Manager (SRM) in a VMware/NetApp DR configuration?

Wen: The most critical part of DR for virtualized application servers is the execution of the steps necessary to connect, inventory, reconfigure, and power up virtual machines at your DR site. Manual execution of these tasks can be complicated and error prone, especially when you’ve got dependencies that require one VM to start before another. Scripts can be written to try to automate DR processes and address these problems, but they are often costly to implement and difficult to maintain.

Site Recovery Manager simplifies the management of the entire DR process, including discovery and configuration, failover, and DR testing. The recovery plan you create during the setup phase of SRM configuration allows you to preconfigure your entire plan. Built-in discovery capabilities and close integration with vCenter accelerate the process.

Once a plan exists, it can be executed automatically with minimal user intervention. SRM enables all necessary steps to be performed and virtual machines to be started in the correct order. For example, virtual machines supporting the infrastructure such as Active Directory® (AD) and DNS servers can be started first, followed by database servers, application servers, and then Web servers.
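The ordering Wen describes can be pictured as a tiered startup list. The sketch below is not SRM's API (recovery plans are configured in the product itself); it is a minimal illustration with hypothetical VM names and role labels, showing how VMs group into batches that start in sequence.

```python
from collections import defaultdict

# Hypothetical tiers: lower numbers power on first.
TIER = {"infrastructure": 0, "database": 1, "application": 2, "web": 3}


def startup_order(vms):
    """Group (name, role) pairs into ordered startup batches.

    Returns a list of lists. Every VM in one batch may start together,
    and a batch starts only after the previous batch is up.
    """
    batches = defaultdict(list)
    for name, role in vms:
        batches[TIER[role]].append(name)
    return [sorted(batches[tier]) for tier in sorted(batches)]


if __name__ == "__main__":
    inventory = [
        ("web01", "web"),
        ("sql01", "database"),
        ("dc01", "infrastructure"),
        ("app01", "application"),
        ("dns01", "infrastructure"),
    ]
    for batch in startup_order(inventory):
        print(batch)
```

Keeping the order in one table, rather than scattered through scripts, is the same idea that makes a preconfigured recovery plan easier to maintain than hand-written automation.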

The ability to perform testing is another big advantage. With most DR solutions it is nearly impossible to test without disrupting normal production operations and ongoing replication. SRM and NetApp make it easy and efficient to perform DR testing. For example, one thing you have to do is create an isolated testing network (so that you don’t inadvertently have two active instances of each VM on your corporate network). SRM automates the process so your tests stay isolated.

Larry: Using NetApp FlexClone® technology in combination with SRM DR testing, you can bring up your DR site and run tests there without using a huge amount of additional storage and without disrupting ongoing replication between sites or operations at the primary site. This gives you an easy way to run tests to validate DR without impacting the production site or agreed-upon SLAs.

Some replication solutions require two times the capacity to create replicas of storage at the DR site so that replication can continue while you’re performing the test. This wastes a lot of time and reduces the length of time you can keep the test environment around or how often you can perform tests. Using FlexClone significantly reduces the amount of storage needed and accelerates the process.

Figure 1) Incremental storage requirement for DR testing with FlexClone.
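As a rough illustration of the storage difference, the sketch below compares a full second copy with a clone that consumes space only for blocks written during the test. The dataset size and change fraction are made-up example values, not measurements from the article.

```python
def extra_storage_gb(replica_gb: float, change_fraction: float,
                     use_clone: bool) -> float:
    """Additional DR-site capacity needed to run a test.

    A full copy needs space equal to the replica. A clone is assumed
    to need space only for the fraction of data changed during the test.
    """
    if not 0 <= change_fraction <= 1:
        raise ValueError("change_fraction must be between 0 and 1")
    if use_clone:
        return replica_gb * change_fraction
    return replica_gb


if __name__ == "__main__":
    # Hypothetical: 2 TB replica, 5% of blocks rewritten during the test.
    print(extra_storage_gb(2000, 0.05, use_clone=False))
    print(extra_storage_gb(2000, 0.05, use_clone=True))
```

Under these assumptions the clone-based test needs a small fraction of the capacity of a full copy, which is what lets the test environment stay around longer.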

TOT: So what are the major considerations for someone who wants to deploy a DR solution using VMware SRM and NetApp storage?

Wen: From the standpoint of SRM, there are a number of considerations. First of all, you’ve got to have a VMware vCenter server at each site, along with a Microsoft SQL Server to store the SRM database and servers running supported versions of ESX.

The primary and recovery sites must be connected by a reliable IP network, and the recovery site should also have access to the same public and private networks as the primary site. Last but not least, the recovery site should have up-to-date Active Directory and DNS servers.

When it comes to the actual replication between sites, SRM relies on storage—in this case, NetApp—to do that. Customers running tier-1 applications can achieve a zero RPO by configuring SnapMirror to replicate synchronously. In addition to replication, maintaining consistency at both the OS and application level is key.

Larry: NetApp uses a number of components to provide consistent replication for both the VMs themselves and for Microsoft applications (Exchange, SQL Server, and SharePoint Server). The key consideration for both VMs and applications is that it’s not enough to simply replicate the data periodically; it has to be in an application-consistent state from which each component can be restarted. We’ve described the whole approach in some detail in a recent tech report. Wen reviewed this TR to make sure we had the VMware information correct.

The VMs reside in shared datastores, either VMFS (FC or iSCSI LUNs) or NFS. NetApp SnapManager for Virtual Infrastructure provides consistent Snapshot™ copies and replication for VM data.

A key design element is that we keep application data separate from VM datastores by storing it in physical-mode RDM LUNs (either iSCSI or FC). This allows us to use the NetApp SnapManager suite of products to create consistent recovery points for each application, and we can also have different replication schedules for each application to accommodate different RPOs by creating different numbers of recovery points.

Figure 2) Replication architecture.

TOT: We did a lot of work to make it possible to have multiple recovery points from which to restart applications. Can you tell our readers a little more about that?

Larry: NetApp SnapManager products for SQL Server, Exchange, and SharePoint increase flexibility by allowing the creation and verification of multiple recovery points replicated to the recovery site. The SnapManager applications create full backups, which are verified to be application consistent, plus more frequent backups that include only the incremental logs of changes that have occurred between full backups. These incremental backups are referred to as frequent recovery point, or FRP, backups. Adjusting the time between FRP backups provides the flexibility to set the desired RPO for each application separately.
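The relationship between full backups, FRP backups, and RPO can be sketched as a simple timeline. This is not SnapManager's scheduler or API. The function and its parameters are hypothetical, and it assumes the FRP interval divides evenly into the full-backup interval.

```python
from datetime import datetime, timedelta


def recovery_points(start, hours, full_every_h, frp_every_min):
    """List (timestamp, kind) recovery points over a period.

    kind is "full" when the point falls on a full-backup boundary,
    otherwise "frp" (a log-only frequent recovery point).
    """
    if frp_every_min <= 0 or full_every_h <= 0:
        raise ValueError("intervals must be positive")
    full_step = timedelta(hours=full_every_h)
    frp_step = timedelta(minutes=frp_every_min)
    end = start + timedelta(hours=hours)
    points = []
    t = start
    while t < end:
        on_full_boundary = (t - start) % full_step == timedelta(0)
        points.append((t, "full" if on_full_boundary else "frp"))
        t += frp_step
    return points


if __name__ == "__main__":
    # Hypothetical schedule: full backup every 4 hours, FRP every 15 minutes.
    pts = recovery_points(datetime(2010, 4, 23), 8, 4, 15)
    print(len(pts), sum(1 for _, kind in pts if kind == "full"))  # 32 2
```

Ignoring replication lag, the newest recovery point is never older than the FRP interval, so shortening `frp_every_min` for one application tightens its RPO without changing the schedule for any other.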

If any issues are detected with the recovered application data at the recovery site, individual applications may be reverted to any previous recovery point. SnapManager can roll forward any uncommitted database logs if the applications are reverted to a previous recovery point to prevent the loss of any new data that was written at the recovery site after failover.

SRM allows you to insert custom commands into your recovery plan. We use this capability to execute a command in the recovery plan that configures SnapDrive® to enable VMs running at the DR site to see the full history of backups that were replicated from the production site. For those with access to the NetApp NOW™ (NetApp on the Web) site, this process is described more fully in KB56952.

TOT: Can one of you explain the importance of Active Directory in an SRM environment?

Wen: Microsoft applications are highly dependent on Active Directory and DNS for correct operation, so it’s really critical to have this configured correctly at your recovery site. When you perform DR testing, you also have to be certain to provide a correctly configured and up-to-date Active Directory server on the isolated test network. When you fail back to the primary site from the recovery site, you again have to be certain to deal correctly with Active Directory/DNS servers. If you fail to do so, you may experience update sequence number (USN) rollback problems and Active Directory database corruption. These problems are described more fully in Microsoft knowledge base article 875495.

The easiest way to make sure that Active Directory is correct at the recovery site is to maintain at least one Active Directory server at the recovery site that is synchronized with the primary site.

For DR testing, you have to clone this AD server just prior to running the DR test. Once the cloning is done, but before powering on the VM, make sure the cloned AD server is connected only to the DR test network. After the AD VM is powered on in the test network, five FSMO (Flexible Single Master Operation) roles in the Active Directory forest must be seized according to the procedure described in Microsoft knowledge base article 255504.

This cloning process is not necessary when a real failover occurs, but seizing the FSMO roles is still required and must be done manually. Once you’ve recovered from your disaster—whatever it is—and prior to failback, you must reestablish Active Directory services at the original site. This can be done by recovering the AD servers at that site and forcing them to resynchronize with the newer AD servers at the recovery site or by establishing new AD servers.

All of these actions are covered in a fair amount of detail in NetApp TR-3822, which Larry mentioned previously.

TOT: To wrap up, can you both talk a little about the methods available for failing back to the original site?

Wen: As I just suggested, the first step once your original site is up and running is to get Active Directory up. SRM doesn’t provide fully automated reversal and failback yet, but we still recommend that you use SRM to do the failback by reconfiguring the software to fail in the opposite direction.

Larry: In order to fail back you’ve got to synchronize the data between the recovery site and the original site. SnapMirror relationships are easy to reverse and resynchronize. The resynchronization process will depend somewhat on the failure that occurred. If the original storage wasn’t destroyed in the disaster, SnapMirror will only have to replicate the delta—the changes that occurred while the original site was offline. Otherwise, a full resync will be required. Of course, NetApp deduplication and SnapMirror compression can reduce the WAN impact in either case. Dedupe reduces the total amount of data in your VMware environment by eliminating the duplication that results from having many, many copies of the same guest operating systems, and compression makes sure that any data that is transmitted over the WAN uses the least bandwidth possible.
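The choice Larry outlines between a delta resync and a full resync can be sketched as below. The function, its parameters, and the compression ratio are illustrative assumptions. This is not a SnapMirror command or API, and real savings depend on the data.

```python
def resync_plan(original_storage_intact: bool, total_gb: float,
                changed_gb: float, compression_ratio: float = 1.0):
    """Return (mode, estimated_wan_gb) for failback resynchronization.

    If the original storage survived, only the changes made while the
    site was offline are sent. Otherwise the full dataset is sent.
    compression_ratio is the assumed logical-to-wire size ratio.
    """
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be at least 1")
    if original_storage_intact:
        mode, logical_gb = "incremental", changed_gb
    else:
        mode, logical_gb = "full", total_gb
    return mode, logical_gb / compression_ratio


if __name__ == "__main__":
    # Hypothetical: 2 TB dataset, 120 GB changed, 2:1 compression.
    print(resync_plan(True, 2000, 120, 2.0))   # ('incremental', 60.0)
    print(resync_plan(False, 2000, 120, 2.0))  # ('full', 1000.0)
```

The gap between the two results shows why preserving the original storage, when possible, shortens the failback window so much.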

We hope the above information summarized from a roundtable has been helpful, and we would love to hear what you think about this article. For complete details on the topics discussed, see TR-3822.

Wen Yu
Senior Technical Alliance Manager
VMware

Wen has been with VMware for over five years, supporting and evangelizing virtualization products for continuous availability, disaster recovery, and desktop. He is currently a member of the Infrastructure Alliance Technology Team.

Larry Touchette
Technical Marketing Engineer
NetApp

Larry has been with NetApp for nine years, supporting, implementing, and designing NetApp storage and disaster recovery solutions. He is currently a part of the NetApp Server Virtualization Technical Marketing Team.

Comments

The problem we are having when engineering this solution specifically for SQL Server is how the system databases are replicated to the DR site. We have vSphere 4.0 Update 1 at both locations and have SRM working well. We are using NFS datastores for the OS/binaries and RDM LUNs through the ESX iSCSI software initiator for the application data, as detailed in TR-3822. The issue I have come across that I cannot get an answer to is how to replicate my system databases (which reside on their own LUN). Because of the inability (I understand this is an MS limitation) to take a snapshot of the system databases (they are instead backed up to the SnapInfo LUN), I am unable to bring up those databases in a quiesced state on the DR side via SRM. I could manually take snapshots and then SnapMirror them over to the DR site, but these would not be quiesced and would potentially be unusable. The answer NetApp has provided is to restore the system databases once they are over at the DR site, but because you have a non-quiesced copy of those databases, you are put into a "chicken or egg" situation. You need the system databases online to perform a restore of the system databases, but since they are not stable, you cannot start the SQL services to perform the restore.

The two recommendations NetApp made were, first, to do a repair of the system databases on the DR side during a failover and then do a restore. I would think this is far from an optimal DR solution and almost impossible to script. The other idea offered was to perform a database "pause" nightly, take a snapshot, and replicate that system LUN to the DR site. I am not sure of the effect of this database "pause" on the entire SQL instance. This also does not tie into SMSQL as cleanly as you would expect.

As a result of these limitations, we are now considering a DoubleTake DR solution for our SQL Servers, which is a shame with all of this great SAN replication and virtual infrastructure.

The questions I have are:

  1. What are other folks doing with respect to failing over their SQL Server from one vSphere environment to another?
  2. Why isn't this MAJOR gotcha documented in any TRs? Has this issue not been raised by anyone?
  3. Am I missing a simpler solution to this problem (I hope this is the case), or do I really need to look at a third-party solution for my DR despite the investment made in NetApp and VMware?

Any assistance you can provide would be very greatly appreciated.

Thanks,

Joe

larryet Former NetApp Employee

FYI Joe's last comment has been responded to in the thread for virtualized MS Apps...

http://communities.netapp.com/docs/DOC-5171#comment-3407

Larry
