Wednesday, November 24, 2010
High-Availability depends on how quickly you can recover a production system after an incident that has caused a failure. This requires planning, and documentation. If you get a Disaster Recovery Plan wrong, it can make an incident into a catastrophe for the business. Hugo Shebbeare discusses some essentials, describes a typical system, and provides sample documentation.(une traduction en Français de ce Plan de Relève est déjà en cours, et sera publié prochainement.)
In this article, I'll lay out the technical details of implementing a simple Disaster Recovery Plan (DRP) for production applications running Microsoft SQL Server. My goal is to provide you with generic documentation to use as the basis of your own production system failover strategy. You will, of course, need to alter it with your own details and keep it updated any time that changes are made to your production systems, but this should give you a good departure point from which to build your own strategy.
I'll describe the steps to follow in the event of the failure of a database production system, and annotate the process as I go along. This is largely based on a Disaster Recovery Plan I had to design recently (all the better for you to download and personalize), so it is deliberately written in the style of a business strategy document. I’ll also explain the advantages of automatic restoration of compressed backup files from a failover server. I'll also be the first to admit that this topic might seem a little dry, but having a DRS will make it worth the read – I promise.
Part 1 of this article will describe the basic steps necessary to set up a ‘hot' standby server (the recovery method I used when drafting this DRP), and Part 2 is an annotated transcript of Disaster Recovery document, including steps to be taken in the event of a disaster, and information for the unfortunate DBA tasked with recovering from it. Here we go:
PART 1 - Automatic Restoration of backup files to a failover server
SQL Servers' norecovery mode keeps the database stable and ready to accept the changes you're progressively applying as part of the backup process. This means that it's only necessary to apply the latest differential or log backup before the database is ready to be accessed by users and applications.
The disaster recovery method used is to have a ‘hot' standby server (SQL2), which is already installed, stable and, most importantly, is an exact copy of the production server's configuration. The standby server should already have the most recent operational databases fully-restored in norecovery mode.
Implementing a Hot Standby Server
After SQL Server has been installed on the failover server, you need to check that Robocopy is installed in the sysroot\windows\system32 folder. Secondly, Red Gate's SQL Backup software must connect to the server and be configured by clicking the small grey square next to server listing in left pane – this is for instance auto-configuration, if it has not been done already.
Figure 1 - SQL Backup's auto-configuration system.
Robocopy is much better than ( the soon-to-be-discontinued) Xcopy, by the way. And since Windows Server 2003, Robocopy has been the recommended / future-proofed tool of choice. As far as I know, Xcopy will no longer be available in future versions of Windows Server.
Next, for the stored procedures that execute Robocopy (we place these procedures in a local database on each server called DBA_tools), you need to allow the advanced option xp_cmdshell to run:
-- To allow advanced options to be changed.
EXEC sp_configure 'show advanced options', 1
GO -- To update the currently configured value -- for advanced options.
-- To enable the feature.
EXEC sp_configure 'xp_cmdshell', 1
-- To update the currently configured value for this feature.
In order to copy the backup files, each database on the standby server needs a database-specific SQL Server Agent job running Robocopy at the required interval to copy full and differential backups from the production server to the standby server. These jobs can be run at whatever frequency needed, be it daily, hourly or even more often if your operations require it.
Robocopy is the first step in all automated restore jobs, unless you want to add validation steps prior to the backup file copy. The following example copies all differential database backups from a production server to a DRP server:
EXEC dbo.usp_RoboCopy '\\PRODserver\drive$\ProdServerBackupShare\Diff', '\\DRPserver\Drive$\ProdServerDbBackupFolder\Diff', 'database1_* database2_*'
-- This case just handles the differential folder, so we’re assuming you’ll also have -- a Tlog and Full folder.
A database-specific SQL Server Job will restore these backups daily to the hot standby server (DRP) using stored procedures specifically created for this setup, such as:
usp_DB_Restore_Master or usp_DB_Restore_Master_Multi
A consideration for the DBA regarding the level of database recovery
If you are currently in Simple Recovery mode, and provided there are regular Transaction Log and differential backups (as in, several times a day), you can switch your recovery model over to Bulk-Logged in production to restore up to a specific point in time. This will naturally minimize the amount of data lost from the work session prior to any downtime.
Full Recovery mode is recommended for critical databases that require auditing compliance.
In the event of failure, the most recent log or differential backup is ready to be applied to the standby database sitting in norecovery mode, and you’re up and running quickly with minimal down-time.
An alternative method for a much smaller database, where the total restore time is below five minutes, is to apply the complete restore every hour to the failover server, in which case you don’t need to worry about norecovery mode.
PART 2 - Instructions to follow in the event of a disaster to the production system
If you haven’t heard from them directly already, please contact FIRST LINE DBA SUPPORT at [INSERT NUMBER] or SECONDARY DBA at [INSERT NUMBER]
After the production/original data publisher server failure (SQL1), the restore / backup-subscriber server (SQL2) will be used as the primary database server (a.k.a. DRP server). Inform everyone in the department by E-mail (It's also worth thinking about who will inform internal/external clients).
Once the switch occurs to the DRP server and the downtime of SQL1 actually happens, all application connection strings need to be changed to access SQL2. The CGI should handle this step automatically.
Disable Automatic Restore SQL Agents on SQL2.
Disable all SQL Agent jobs on failed server SQL1 if possible.
Enable all maintenance and backup jobs on newly active server SQL2
Please note that restoring a log backup is not possible if the production database recovery model is set to Simple. For fine-grained restoration, the database needs to have been using the Full recovery model - Thankfully, the default setting is the Full Recovery model. If point in time recoveries are requested by management on a regular basis, then we can also change the database recovery level to Bulk-Logged, if space is an issue, and Full otherwise - Perhaps with deserved hesitation from the side of the Database Administrators, as Bulk-logged recovery is much more space efficient.
How the automation of the restore from compressed backup is benefitial to your production environment. Ideally you should keep two full backups, one on the Test server and one on the DRP server. Having this second copy of the production databases will allow you to do some useful and intensive work which you don’t want to have to run on live databases, such as a full DBCC CheckDB – console commands that can check the integrity of your exact database restore copy.
A log of what has been restored shall be placed in the following directory:
As soon as a restore is completed, we should have an automatic purge of old backups – done perhaps every week (maximum 14 days manually, or automatically in a SQL Server Maintenance Plan), and which can be automated using a batch file or PowerShell Script.
To ensure a smooth restore process, we should read the restore parameters directly from the backup log system tables - such as BackupHistory, BackupSet, BackupFile or
Backuplog - unless a backuplog table is explicitly created in a local database or exists in msdb. This is to ensure that the essential restore parameters (such as the backup file name and position) are immediately available
Posted by Renil Pillai at 7:26 AM
Did you know that Digital Storm have outed their new gaming PC called Hailstorm that is relentlessly equipped by liquid-chilled Sub-Zero cooling system (TEC peltier coolers) to avoid toasting its powerful Core i7-980X processor. The gaming rig also has 6GB of DDR3 memory, a trio of NVIDIA GeForce GTX 580 (1.5GB) GPUs and a 1200-watt power supply to cover all the power you need. I have been a good boy since birth, and I don't think my mom would be so happy to shoulder $6,903 worth of my gaming fantasy.
Posted by Renil Pillai at 6:45 AM