disaster recovery

February 02, 2009

Linux, Where's Your VSS?

It’s been nearly six years since Microsoft released Windows Server 2003 and its accompanying Volume Shadow Copy Service (VSS).  Most people think of VSS as Microsoft’s storage volume snapshot.  Yes, it includes snapshots, but it’s much more than that: it is a snapshot framework with a rich set of APIs for third-party software integration.  This lets storage array vendors such as EMC or Network Appliance integrate their array-based snapshot solutions with VSS.  It also lets application vendors build their applications to participate in the snapshot process, so that the volume snapshot of an application’s data is a consistent image.  Here’s an architectural diagram of VSS (courtesy Microsoft):

 

[Figure: MS VSS architecture diagram]

The framework of VSS is the most important aspect of this feature.  It forms the basis for enabling data protection using server-less backup.  In the architectural diagram, applications are the “Writers” and server-less backup software or snapshot management consoles are the “Requesters.”  Microsoft includes providers for its own storage subsystem, and storage vendors can plug their own providers into the framework.

The nice thing about VSS is that applications, especially databases, are given a signal to make their data in storage consistent.  They can flush buffers and transactions as if they were cleanly shutting down and then signal back that they are ready.  The snapshot then gets a good-looking image of the application’s data, one that the application knows it can easily access without requiring lengthy repairs or replaying redo logs.
So now let’s go back to the beginning of this post.  It’s been nearly six years since Microsoft released this feature.  Storage demands have continued growing and don’t look to stop anytime soon.  The number of server-less backup products available for Windows Server continues to grow, and enterprises have been deploying these products over the past couple of years to shorten the dreaded “backup window.”

But where’s Linux?
Linux distributions don’t have any equivalent feature.  Yes, modern Linux distributions include the dm-snapshot module, but that just does the snapshot; there is no framework for application developers, storage vendors, or backup vendors to integrate with.  The result has been a few one-off solutions focused on a particular application stack, but for the most part applications just suffer from inconsistent snapshots on Linux.  These are what we call crash-consistent snapshots, meaning that the application is responsible for figuring out the state of its data within the snapshot and getting it to a point where it can begin running.  Most databases can take upwards of twenty minutes to mount a crash-consistent snapshot.
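
To make the difference concrete, here is a minimal Python sketch of the two kinds of snapshot. It is not VSS or dm-snapshot code: the Writer class and both snapshot functions are hypothetical names invented for illustration, and the "volume" is just an in-memory list. The only point is the ordering. A bare snapshot copies whatever happens to be on disk, while a coordinated snapshot first asks every writer to flush and pause.

```python
from dataclasses import dataclass, field


@dataclass
class Writer:
    """Hypothetical application that buffers writes before they reach disk."""
    name: str
    disk: list = field(default_factory=list)    # records already on the volume
    buffer: list = field(default_factory=list)  # records not yet flushed
    frozen: bool = False

    def write(self, record):
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen for a snapshot")
        self.buffer.append(record)

    def freeze(self):
        """Flush buffered records to disk and stop accepting writes."""
        self.disk.extend(self.buffer)
        self.buffer.clear()
        self.frozen = True

    def thaw(self):
        """Resume normal operation after the snapshot is taken."""
        self.frozen = False


def crash_consistent_snapshot(writers):
    """Bare volume snapshot: copies only what is on disk right now."""
    return {w.name: list(w.disk) for w in writers}


def coordinated_snapshot(writers):
    """Framework-style flow: freeze every writer, snapshot, then thaw."""
    for w in writers:
        w.freeze()
    try:
        return {w.name: list(w.disk) for w in writers}
    finally:
        for w in writers:
            w.thaw()
```

With one writer holding two unflushed transactions, crash_consistent_snapshot returns an image that is missing both of them, while coordinated_snapshot captures both and leaves the writer running afterwards.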

So this is a call to the Linux community:  Please band together and architect a snapshot framework for all Linux application developers, storage developers, and users to benefit from.

[Posted by: Richard Jones]

August 21, 2008

Microsoft’s licensing revisions and their effect on disaster recovery

Chris Wolf and I were briefed by Microsoft on the licensing revisions for their server-based applications, which make the licensing terms more favorable to virtualized environments.  Chris blogged on these details here.  There’s another aspect of the licensing revisions that I want to talk about: disaster recovery in your business continuity plans.

Prior to the revisions, a user could not transfer an application license to another physical server more often than once every 90 days.  Legally, this didn’t allow for disaster recovery testing with only one license for your application instance, nor for any type of disaster event that would result in failing a service over to a recovery site for less than 90 days.  You would need to either stay at your failover site for 90 days minimum, or you would need to purchase additional licenses for your recovery sites, even though they would not be in use except during a disaster event or testing.

The new licensing terms are more favorable to disaster recovery and business continuity practices, but not for everyone.  The new terms define the concept of a “server farm” which is composed of no more than two data centers that are no further than four time zones apart from each other.  You can transfer a license between physical machines as part of virtual mobility, disaster recovery, or business continuity within your server farm as often as you like.

My first reaction to the "server farm" restrictions was "what?!?" and "why would you do that?".  So I questioned Microsoft during the briefing as to why they chose to define a “server farm” as no more than two data centers located within four time zones of each other.  Their answer was clear:  They do not want customers transferring a software license around the world within a 24-hour period in a “follow the sun” fashion.  They regard this as misuse of a license.  Personally, I see this restriction as unnecessary.  As long-distance data transmission latencies decrease, the business division of a worldwide company in Japan can use the same license as the division in New York just by both remotely accessing a common data center in Colorado.  Why make the restriction in the first place?  I know of a smaller company located in the Phoenix, AZ area that placed its production servers in a co-location center on the Isle of Man (British Isles), and now manages them remotely, to save money.

So what does this mean for you, the customer?  It has a series of implications.  For those who have disaster recovery centers within four time zones of your production data center, you now only need one license per covered Microsoft server application instance.  This can translate into savings in your business continuity plan.  However, those of you who have off-shored, or are looking to off-shore, your disaster recovery solution may not be able to reap the benefits of this licensing revision.  You will want to take this into account when calculating your potential savings from off-shoring; it may change your plans.  Those of you who have already set up disaster recovery data centers of your own across an ocean, such as a primary data center in New York City with a recovery center in Ireland, will not be able to take advantage of these licensing revisions between those data centers.
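
As a rough way to check a site pairing against the rule as described above, here is a small Python sketch. It rests on assumptions that are not from Microsoft's terms: time zone distance is taken to be the difference between standard UTC offsets in hours, measured the shorter way around the globe, and the function names are hypothetical. The actual license language may measure distance differently, so treat this as arithmetic on the post's description, not as licensing advice.

```python
def zones_apart(offset_a, offset_b):
    """Hours between two UTC offsets, going the shorter way around the globe.

    Assumption: 'time zones apart' means the difference in standard UTC offsets.
    """
    diff = abs(offset_a - offset_b) % 24
    return min(diff, 24 - diff)


def is_server_farm(utc_offsets, max_sites=2, max_zones=4):
    """True if the sites fit the 'server farm' definition described in the post:
    at most two data centers, no more than four time zones apart."""
    if not 1 <= len(utc_offsets) <= max_sites:
        return False
    return all(
        zones_apart(a, b) <= max_zones
        for a in utc_offsets
        for b in utc_offsets
    )


# Standard-time offsets: New York is UTC-5, Ireland is UTC+0, Colorado is UTC-7.
new_york_and_ireland = is_server_farm([-5, 0])    # five zones apart
new_york_and_colorado = is_server_farm([-5, -7])  # two zones apart
```

Under these assumptions the New York and Ireland pairing falls outside the definition, which matches the example in the paragraph above, while New York and Colorado fall inside it.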

Note that Microsoft specifically indicated that what they want to discourage is license movement that “follows the sun.”  You can still reap the benefits of the licensing revision with a long-distance recovery site if, for example, you set up a data center in North America with a disaster recovery center in South America.

[Posted by Richard Jones]

January 31, 2008

NetApp's purchase of Onaro and other musings

NetApp’s purchase of Onaro (maker of storage network management software) closed Tuesday, 1/29. See the previous announcement here. The big fish eats the little fish… NetApp has about 6,600 employees; Onaro has about 50.

Onaro claims penetration into 32% of the Fortune 50 – that will include some mighty big storage area networks (SANs). Having NetApp backing them up should definitely put some muscle behind Onaro’s development teams.

Onaro, for those not in the know, monitors the comings and goings of a SAN, measuring the traffic through, say, a fibre port, and keeping track of what’s connected to what. One of the neat things about the software is its ability to verify that the leg bone really is connected to the knee bone… for instance, say you want to launch a new VM or hook a server to the SAN but you’re not sure everything is connected correctly to do that – well, that’s where Onaro’s software comes in. It provides a picture of the parts and pieces in a SAN so that you can tell what’s what, and it lets you know when things change. Nice.

Maybe not coincidentally, Onaro recently announced expanded support for server virtualization and NAS network storage environments – adding “end to end” visibility to the NAS topology as well as the SAN.

So now put it all together and you’ve got a pretty decent way to peer into the bowl of SAN and NAS spaghetti and keep tabs on all of it. And maybe best of all, given that Onaro’s software provides topology intelligence, when the datacenter folks start switching SAN cables or equipment around, Onaro’s software will help prevent inadvertently bringing down a VM environment and getting to find out just how good that disaster recovery plan really is.

Now the question – will Onaro disappear into the NetApp labyrinth? And how will customers not inclined to deal with NetApp get at it? This is often an issue when small independent outfits get recognized for their good works and end up in the hands of a big player.  Certainly Onaro is a nice gem to add to NetApp’s collection, assuming Onaro scales down to smaller environments, but one has to wonder what will happen when non-NetApp shops come looking for this snazzy piece of software.

All in all this could be a big win for NetApp. It ties nicely into ReplicatorX, for instance, making sure the necessary SAN topology is in place to create replicas. As long as NetApp can keep the current Onaro base happy and bring in new players, who may not want the whole NetApp package, life will be good.

On a slight tangent, given that Onaro can help ensure levels of service within a SAN topology, here’s a suggestion: marry the Onaro offering to a business continuity platform such as the one offered by (cleverly named) Continuity Software - a little startup (aren’t they all) out of Israel.

Think consolidation.

Continuity Software sells a package intended to scan an environment for databases and email to, first of all, know that they are there, and then secondly, to point out the sometimes embarrassing fact that some of those data sets are not being replicated as thought. It’s easy to get into this situation when moving data hither and yon in the course of doing business, as is likely when deploying virtual machines and new storage.

Replication schedules may be in place but incomplete – leaving out data sets that have been added or moved. Imagine if an admin could not only get a handle on the SAN topology hooking servers and storage together but also, as an added bonus, ensure that data protection schemes are properly replicating the right stuff connected by that topology.
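
The check described here boils down to comparing two inventories. The Python sketch below shows the idea under stated assumptions: the function name and the data set names are hypothetical, and this is not how Continuity Software's product is implemented. It simply reports data sets that were discovered but are not in the replication schedule, plus schedule entries that no longer match anything discovered.

```python
def replication_gaps(discovered, scheduled):
    """Compare data sets found in the environment with the replication schedule."""
    discovered, scheduled = set(discovered), set(scheduled)
    return {
        # Present in the environment but never replicated.
        "unprotected": sorted(discovered - scheduled),
        # Still in the schedule but no longer found (moved or removed).
        "stale": sorted(scheduled - discovered),
    }


# Hypothetical inventory: a mail store was moved and a new HR database was added.
gaps = replication_gaps(
    discovered=["crm-db", "mail-store", "hr-db"],
    scheduled=["crm-db", "mail-store-old", "mail-store"],
)
```

Both lists matter. The "unprotected" entries are the data at risk, and the "stale" entries usually point at the move or rename that caused the gap in the first place.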

So, just as Onaro tracks a SAN’s topology, its performance bottlenecks, and changes to it, business continuity software such as that offered by Continuity Software could track data additions and changes and whether that data is properly hooked into a protection scheme. From a business continuity point of view, combining these two functions into a common pane of glass could be a nice step towards SAN management consolidation and simplification.

Comments?

[Posted by Gene Ruth]

September 13, 2007

Are you ready for VMware's Site Recovery Manager?

I was intrigued by the product announcement VMware made at the VMworld conference about their Site Recovery Manager (SRM) product.

While the concepts behind SRM are not new (various vendors have offered business continuity tools to automate site-to-site service failover with a "single button" in recent years), it is new to the x86 virtualization solution space.  SRM includes tools to organize the virtual resources into a proper recovery sequence to reduce possible administrative mistakes, integrates with storage arrays to leverage array capabilities such as replication and LUN management between primary and recovery sites, and includes a test mode.
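
The idea of a "proper recovery sequence" can be sketched as a dependency ordering problem. The Python example below is an illustration only and has nothing to do with SRM's actual interfaces: the function name and the resource names in the plan are hypothetical, and it uses the standard library's graphlib module (Python 3.9 or later) to produce a start-up order in which every resource comes after the things it depends on.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on):
    """Return a start-up order in which every resource follows its dependencies.

    depends_on maps each resource to the resources that must be up first.
    A circular dependency makes graphlib raise CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical recovery plan for a simple three-tier service.
plan = {
    "storage-replica": [],
    "database-vm": ["storage-replica"],
    "app-server-vm": ["database-vm"],
    "web-frontend-vm": ["app-server-vm"],
}

order = recovery_order(plan)
```

Writing the dependencies down once and deriving the order from them is what removes the room for administrative mistakes: nobody has to remember at 3 a.m. that the database must come up before the application servers.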

The SRM product has all the fundamental features needed for streamlining a disaster recovery solution.  I had the opportunity to conduct in-depth market and product research with my previous employer to develop a similar solution (Novell's Business Continuity Clustering [BCC] product, which first shipped in August 2004) and was pleased to see that VMware's findings on required features matched ours very well.  However, I had found an interesting problem in BCC product deployment and acceptance in the industry.  While senior IT directors liked the concepts and ideas of the product, it had the effect of forcing the IT storage management team to work with the IT server and applications management teams.  Interestingly enough, this became more of a show stopper than I had ever anticipated.  I fear VMware's SRM product may face similar issues.

I'm not alone in my findings.  At the Wednesday keynote at VMworld, attendees were entertained by Cisco's CEO, John Chambers.  John is a very engaging speaker who captured the interest of the audience, and especially mine.  One of John's main points was that IT organizations must change for the future.  They must move from siloed organizations to collaborative organizations.  John has been forcing this transition within Cisco's internal IT department, and while painful, it has produced results.  John claimed that Cisco has saved over $200 million in IT costs, has delayed data center expansion needs by 4 years, and is able to respond to new technology needs in days as opposed to months.  He attributed these improvements and savings not only to the use of technology, but to organizational change:  Siloed -> Collaborative.  John's experience illustrates that CEO buy-in and drive are required for such a change.

In working with organizations attempting to implement disaster recovery or business continuity solutions, I've found that those that can break down their organizational silos are the most effective in building proven solutions.

So if you are looking at Site Recovery Manager, I'd say that you should first look at your IT organization.  If you are a siloed organization (the storage team does their own thing, etc.), then I expect you may realize little or no value from SRM.  However, if your IT organization has transformed into a collaborative team, focused on managing business processes and the applications/services that feed those processes, then SRM can significantly help your organization achieve a leaner, more reliable disaster recovery strategy.

[Posted by Richard Jones]

June 24, 2007

The Disaster Recovery Test was a Disaster!

As my colleagues and I have talked with a number of businesses, we've been surprised how many have told us that they have tried to test their disaster recovery plans, failed, and simply given up. They said the experience was too demoralizing and more pressing issues awaited them.

This is frightening, especially when you read the many surveys on disaster preparedness and see huge variations in the responses. I saw one such survey that indicated nearly 75% had tested disaster recovery plans, yet another indicated slightly under 50% were prepared. A recent survey showed that executives and IT staffers were not on the same page, or not communicating. Whatever the reason, they were not singing the same tune about the importance of Disaster Recovery Preparedness.

I have to ask myself: "Why are so many unprepared?" "Why do the surveys show such broad differences?"

Let's explore the survey issue first. There are a number of reasons surveys can produce great variations in results: limiting the survey to a specific market (such as one vertical market segment or geographical region), asking specific questions to "lead the witness", or simply using an inadequate sample size. While this can explain some variation, it can't explain all of it. I'm of the opinion that beyond surveys, many companies don't have a concrete understanding of their disaster recovery preparedness. When asked by a surveyor, they take their best guess.

Without hard metrics and testing, there is really no way to obtain good data on preparedness.

Now the second issue: why are so many unprepared? There are certain vertical industries, such as the financial services sector, which have been required by government regulations for decades to prove their preparedness. Much like commercial airline pilots or emergency medical technicians, they are required to re-certify and take continuing training and education to prove their preparedness. But for everyone else, pressure for disaster preparedness has increased only recently. HIPAA and Sarbanes-Oxley indicate that data must be "available" but with no time frame specified. Business failures resulting from disasters have made the news, adding some pressure to the fear drivers for DR preparedness.

A number have pushed off planning due to budget constraints and other more pressing issues they are faced with. Frankly, these are only excuses, possibly stemming from not really knowing how to attack the problem. The world has taught us that "given a will, there is a way". So how to nurture that will?

Smart business executives have recognized the importance of an all-encompassing plan with continued testing and education of their whole organization, not just the IT department. Focusing on business process recovery is required for success.

What's really neat are the new technologies on the market, such as server virtualization, which can significantly simplify and speed the IT recovery process. Virtualization has been shown to reduce the costs of disaster recovery and, more importantly, to make the IT recovery process more deterministic: no more failed disaster recovery tests.

So if you are one who is unprepared, now is the time to start. Don't try to "eat the elephant" in one sitting; you will fail. Start one step at a time and be persistent.

Posted by: Richard Jones
