The better prepared an IT infrastructure is against a disaster, the less likely one seems to occur. This would not appear to make much sense in the case of natural disasters such as earthquakes, floods, or tornadoes. Over the years we have known IT specialists who believe they can command the physical elements, but we have yet to see it demonstrated. When it comes to more localized events such as broken water pipes, fires, or gas leaks, being fully prepared to deal with their consequences to your infrastructure can reduce the probability of their causing adverse impacts to your computer systems.
We'll discuss how to plan for, and recover from, either a localized or a natural disaster. We begin with a definition of disaster recovery, which leads us into the steps required to design and test a disaster recovery plan. We will explain the differences among disaster recovery, contingency planning, and business continuity. Some of the more nightmarish events that can be associated with poorly tested recovery plans are presented, as well as some tips on how to make testing more effective.
Definition of Disaster Recovery
Disaster Recovery: A methodology to ensure the continuous operation of critical business systems in the event of widespread or localized disasters to an infrastructure environment.
There are several key phrases in this definition. The continuous operation of critical business systems is another way of saying business continuity, meaning that a disaster of any kind will not substantially interrupt those processes required to keep a company in business. Widespread disasters are normally major natural disasters such as floods, earthquakes, or tornadoes.
It's interesting to note that almost all major telephone companies and line carriers will not enter into formal SLAs about the availability of their services, because they say they cannot control disruptions caused by natural disasters or by human negligence, such as backhoes digging up telephone lines. The key point of the definition is that disaster recovery is a methodology involving planning, preparation, testing, and continual updating.
There are 13 steps required to develop an effective disaster recovery process. Disaster recovery (DR) is a process within a process, in that some of these steps involve contracting for outside services. Depending on the size and scope of the shop, not every disaster recovery process will require an outside service provider, but a sizable percentage of shops do use one.
Step 1: Acquire executive support
The acquisition of executive support is the first step necessary for developing a robust DR process. There are many resources required to design and maintain an effective program, and these all need funding approval from senior management to initiate the effort and to see it through to completion. Another reason is that managers are the first to be notified in the event of an actual disaster. This sets off a chain of events involving management decisions about deploying the IT recovery team, declaring an emergency to the disaster recovery service provider, notifying facilities and physical security, and taking whatever emergency preparedness actions may be necessary. By involving management early in the design process, and by securing their emotional as well as financial buy-in, you increase the likelihood of management understanding and flawlessly executing its roles when a disaster does occur.
There are several other responsibilities of a disaster recovery executive sponsor:
1. selecting a process owner
2. acquiring support from managers of the cross-functional team, such as direct reports, peers within IT, or managers outside of IT
3. demonstrating ongoing support by requesting and reviewing frequent progress reports, offering suggestions for improvement, inquiring about elements of the plan, and resolving conflicts
Step 2: Select a Process Owner
The process owner for disaster recovery must assemble and lead the cross-functional team in such diverse activities as preparing the business impact analysis, identifying and prioritizing requirements, developing business continuity strategies, selecting an outside service provider, and conducting realistic tests of the process. The finished plan must be well documented and updated regularly. The process owner must be able to:
1. effectively communicate with IT executives, IT customers, and IT developers
2. provide knowledge of network and systems software and components, applications, backup systems, software/hardware configurations, database systems, and desktop hardware/software
3. think and plan strategically and tactically
Thinking strategically here means designing a recovery process that keeps the strategic business priorities of the company in mind when deciding which business processes need to be recovered first.
Step 3: Assemble a cross-functional team
Members of appropriate departments, from areas both inside and outside of IT, should be assembled into a cross-functional design team. The specific departments vary from shop to shop, but they typically include:
1. computer operations
2. applications development
3. key customer departments
4. facilities
5. data security
6. physical security
7. network operations
8. server and systems administration
9. database administration
This team will work on requirements, conduct a business impact analysis, select an outside service provider, design the final overall recovery process, identify members of the recovery team, conduct tests of the recovery process, and document the plan.
Step 4: Conduct a Business Impact Analysis
Even the most thorough of disaster recovery plans will not be able to cost-justify the expense of including every business process and application in the recovery. An inventory and prioritization of critical business processes should be taken, representing the entire company. Processes that need to resume within 24 hours to prevent serious business impact, such as loss of revenue or major impact to customers, are rated priority A. Processes that need to resume within 72 hours are rated B, and those that can wait longer than 72 hours are rated C. These identifications and prioritizations will be used to propose business continuity strategies.
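The A/B/C rating above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `BusinessProcess` class, the `max_downtime_hours` field, and the sample inventory are all hypothetical, and "within 24 hours" and "within 72 hours" are assumed to include the boundary values.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    # Longest outage the business can tolerate, in hours (hypothetical field).
    max_downtime_hours: float


def recovery_priority(max_downtime_hours: float) -> str:
    """Map a tolerable outage to the A/B/C rating described in Step 4."""
    if max_downtime_hours <= 24:
        return "A"  # must resume within 24 hours
    if max_downtime_hours <= 72:
        return "B"  # must resume within 72 hours
    return "C"      # can wait longer than 72 hours


def prioritize(processes):
    """Sort processes by rating, then by how soon they must resume."""
    return sorted(
        processes,
        key=lambda p: (recovery_priority(p.max_downtime_hours), p.max_downtime_hours),
    )


# Illustrative inventory; real entries would come from the company-wide survey.
inventory = [
    BusinessProcess("archival reporting", 168),
    BusinessProcess("payroll", 48),
    BusinessProcess("order entry", 4),
]

for process in prioritize(inventory):
    print(recovery_priority(process.max_downtime_hours), process.name)
```

Keeping the thresholds in one function makes it easy to adjust them if the cross-functional team agrees on different recovery windows.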
Step 5: Identify and Prioritize Requirements
One of the first activities of the cross-functional team is to brainstorm the requirements for the process, including business, technical, and logistical requirements. Business requirements include defining the specific criteria for declaring a disaster and determining which processes are to be recovered and in what time frame. Technical requirements include what type of platforms will be eligible as recovery devices for servers, disks, and desktops, and how much bandwidth will be needed. Logistical requirements include the amount of time allowed to declare a disaster and transportation arrangements at both the disaster site and the recovery site.
Step 6: Assess possible business continuity strategies
Based on the business impact analysis and the list of prioritized requirements, the cross-functional team should propose and assess several alternative business continuity strategies. These will likely include alternative remote sites within the company and geographic hot sites supplied by an outside provider.
Step 7: Develop an RFP for outside services
Presuming the shop is large enough to require outside services for business continuity, the cross-functional team should develop a request for proposal (RFP) asking outside providers to supply disaster recovery services. Options should include multiple-year pricing, guaranteed minimum amount of time to become operational, cost of testing, provisions for local networking, and types of onsite support provided. Criteria should be weighted to facilitate the evaluation process.
Step 8: Evaluate Proposals and Select the best offering
The cross-functional team now uses the weighted criteria it established to evaluate the responses to the RFP. The winning proposal should go to the bidder who provides the greatest overall benefit to the company, not simply to the lowest-cost provider.
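One way to apply weighted criteria is a simple weighted average, sketched below in Python. Everything specific here is an assumption for illustration: the criterion names loosely follow the RFP options listed in Step 7, while the weights, the 1-to-5 rating scale, and the two providers are invented.

```python
def weighted_score(ratings, weights):
    """Weighted average of a proposal's ratings across all criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"proposal is missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_proposals(proposals, weights):
    """Return (provider, score) pairs, best overall score first."""
    scored = [(name, weighted_score(r, weights)) for name, r in proposals.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights agreed by the cross-functional team (higher = more important).
weights = {
    "multi_year_pricing": 3,
    "time_to_operational": 5,
    "cost_of_testing": 2,
    "local_networking": 2,
    "onsite_support": 3,
}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
proposals = {
    "Provider X": {  # the cheapest bid
        "multi_year_pricing": 5, "time_to_operational": 2, "cost_of_testing": 4,
        "local_networking": 3, "onsite_support": 2,
    },
    "Provider Y": {
        "multi_year_pricing": 3, "time_to_operational": 5, "cost_of_testing": 3,
        "local_networking": 4, "onsite_support": 4,
    },
}

for name, score in rank_proposals(proposals, weights):
    print(f"{name}: {score:.2f}")
```

In this made-up example the cheapest bidder ranks second, because it rates poorly on the most heavily weighted criterion, which is the situation Step 8 warns about.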
Step 9: Choose Participants and clarify their Roles for the Recovery Team
The cross-functional team chooses the individuals who will participate in the recovery activities after any declared disaster. The recovery team may be similar to the cross-functional team but should not be identical. Additional members should include representatives of the outside service provider, key customers identified by the business impact analysis, and the executive sponsor. Once the recovery team has been selected, it is imperative that each individual's role and responsibility be clearly defined, documented, and communicated.
Step 10: Document the Disaster Recovery Plan
The cross-functional team should document the disaster recovery plan for use by the recovery team, which will then have responsibility for maintaining its accuracy, accessibility, and distribution. Documentation of the plan must also include up-to-date configuration diagrams of the hardware, software, and network components involved in the recovery.
Step 11: Plan and execute regularly scheduled tests of the plan
The disaster recovery plan should be tested a minimum of once per year. During the test, a checklist should be maintained to record the disposition and duration of every task that was performed, for later comparison to those of the planned tasks. World-class infrastructures with disaster recovery programs test at least twice per year. When first starting out, particularly for complex environments, consider developing a test plan that spans up to three years. Tests held every six months can become progressively more involved, starting with program and data restores, followed by processing loads and print tests, then initial network connectivity tests, and eventually full network and desktop load and functionality tests. Dry runs should be thoroughly planned, widely communicated, and given high visibility.
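The test checklist can be kept in any form, but a small Python sketch shows the comparison of actual against planned results. The `TaskResult` fields, the disposition labels, the 10 percent tolerance, and the sample tasks are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str
    planned_minutes: int
    actual_minutes: int
    disposition: str  # hypothetical labels: "completed", "failed", "skipped"


def needs_review(checklist, tolerance=0.10):
    """Return tasks that did not complete or ran over plan by more than tolerance."""
    flagged = []
    for result in checklist:
        ran_late = result.actual_minutes > result.planned_minutes * (1 + tolerance)
        if result.disposition != "completed" or ran_late:
            flagged.append(result)
    return flagged


# Illustrative results recorded during one test of the plan.
checklist = [
    TaskResult("restore program libraries", 60, 55, "completed"),
    TaskResult("restore databases", 120, 180, "completed"),
    TaskResult("verify network connectivity", 30, 30, "failed"),
]

for result in needs_review(checklist):
    print(
        f"{result.task}: planned {result.planned_minutes} min, "
        f"actual {result.actual_minutes} min, {result.disposition}"
    )
```

The flagged tasks give the team a concrete list to examine when it reviews how the test went.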
Step 12: Conduct a lessons-learned postmortem after each test
The intent of the lessons-learned postmortem is to review exactly how the test was executed as well as to identify what went well, what needs to be improved, and what enhancements or efficiencies could be added to improve future tests.
Step 13: Continually maintain, update, and improve the plan
An infrastructure environment is forever changing. New applications, expanded databases, additional network links, and upgraded server platforms are just a few of the changes that can render even the most thorough disaster recovery plans inaccurate, incomplete, or obsolete. A constant vigil must be maintained to keep the plan up to date and effective. Changes in personnel, which affect training, documentation, and even budgeting for tests, are additional concerns to keep in mind when maintaining a disaster recovery plan.
When assessing and streamlining an infrastructure's disaster recovery process, use a worksheet to rate the overall quality, efficiency, and effectiveness of the process. The worksheet weights the individual factors and service metrics used to gauge how well the process performs.
We hope this article provides pertinent information for planning your infrastructure's disaster recovery process.