How to Take Your Project from Disaster to Recovery

Author Ondřej Flídr
aka when life gives you lemons, you need to know where the salt and tequila are.

Almost everyone who runs an online project has probably experienced it: an SMS or automated call from monitoring that something is not working. Your beloved project is ailing and needs your attention.

It doesn't even have to be a catastrophe like a data centre fire or an earthquake. A burst pipe in a building, an admin error on a network in India or a careless user intervention can take systems down just as well.

At vshosting~ we do everything we can to prevent such situations - read more about this in our article about our uncompromising security. Nevertheless, not all risk factors are within our control, and sudden incidents cannot be ruled out. The scenario is always the same. Suddenly, without any warning, everything goes down and the monitoring lights up red. Nothing works. And now what?

You can complain, grumble, panic… or activate your disaster recovery plans. Scenarios you've written down in times of calm in case something goes very, very wrong. Documents you hoped you'd never have to read.

What disaster recovery plans are and what they should look like

Disaster recovery plans are documents that clearly describe how to get your project back to working order. They take many forms and cover many possible scenarios. They can be simple lists of crisis credentials or complex procedures that involve actions across multiple departments in the company.

They should always include:

- what needs to be done
- who should do it
- how to do it
- where backups are available
- in what order to bring services back up
- when to restore which data

It's a good idea to include login details for crisis accounts so you have them handy. The goal of disaster recovery plans is always to minimize the business impact of the "disaster" and get the project back on its feet as quickly as possible. They must be written clearly and concisely. Because if you have to follow them, it certainly won't be in times of peace and quiet.
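One item from the list above, the order in which services are brought back up, can often be derived from their dependencies rather than remembered under stress. The following is a minimal Python sketch, assuming a hypothetical project with four services (`database`, `cache`, `api`, `web`); the service names and their dependencies are illustrative only. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical example: each service maps to the services it depends on.
DEPENDENCIES = {
    "database": set(),
    "cache": set(),
    "api": {"database", "cache"},
    "web": {"api"},
}


def startup_order(dependencies):
    """Return services in an order where every dependency is started first."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Print a numbered checklist that could be pasted into the plan.
    for step, service in enumerate(startup_order(DEPENDENCIES), start=1):
        print(f"{step}. start {service}")
```

Keeping the dependencies in one place means the written plan can be regenerated whenever the architecture changes, instead of going quietly out of date.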

It is common that disaster recovery plans are not purely IT in nature and overlap with other departments in the company. In addition to the technical management of the problem, you also need to maintain contact with customers or provide documentation for the legal department. Therefore, when writing disaster recovery plans, you need to enlist the help of representatives from other departments and consult with them on the individual steps. You need to make sure that the procedures make sense to everyone and that you haven't forgotten any non-technical impact.

We can base their design on the principles of aviation. If there is a problem on the plane, the procedure is "Aviate, Navigate, Communicate", i.e. keep the plane in the air, find out where you are and where you can go, and then deal with communication.

The same approach works in IT. First, try to save what you can (i.e. minimize direct impacts, even by shutting down systems if needed). Get the system to a stable enough state, and only then take care of communication with the outside world. If you have multiple administrators, the process can and should be parallelized. Can you imagine a worse situation than an admin being pestered with questions by colleagues in sales and customer support, asking what to tell clients?

Step 1: Keep the project "afloat"

When the plane hits the ground, it's over. You've lost the plane, the crew, the cargo, and the passengers. That's why you need to keep it in the air at all costs.

In IT it works the same way. If you lose one service, you need to minimize the impact on others. Prevent a cascading effect. You can't afford to lose everything. Even at the cost of downtime. If you lose all your data and all your systems, it's a complete existential risk to the company. Short-term project downtime is a much lesser evil than long-term data recovery.

A typical example of this is a bug in an application that saves corrupted data during an update. In this case, it is much better to shut down the application and prevent more data from being destroyed than to leave it running and work on issuing a fix.
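A softer variant of shutting the application down is a "kill switch" that stops all writes while leaving the rest of the system alone. The sketch below is only an illustration under stated assumptions: the `save_record` function, the in-memory `storage` list and the kill-switch file are hypothetical stand-ins for a real application's write path.

```python
from pathlib import Path


class WritesDisabled(RuntimeError):
    """Raised when the application is in emergency read-only mode."""


def save_record(record: dict, storage: list, kill_switch: Path) -> None:
    """Store a record unless the (hypothetical) kill-switch file exists."""
    # An admin creates this file during an incident to stop further damage.
    if kill_switch.exists():
        raise WritesDisabled("writes are disabled during incident response")
    storage.append(record)
```

Creating the file stops new writes immediately, without waiting for a deployment; removing it restores normal operation once the fix is in place.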

Step 2: Get your bearings

Well done, you have successfully prevented the destruction of all data in the first step. However, the damage is still done, the application is inaccessible, clients are getting angry. This is a good time to start figuring out what happened, what the overall impact is, and develop a recovery plan.

Can you repair the app with a simple fix? Do you need to restore any data from a backup? How long will it take to return to a minimum functional state? And how long will a complete fix take? How much data have you irretrievably lost? These are all questions that are on the table right now, and you need to answer them quickly.

Step 3: Communicate with your customers

At this stage you already know what has happened, what you need to do, what the implications are for your company, and roughly how long the repair will take. At this point you can give customers an estimate of the time required.

We recommend that one staff member be designated as the communication liaison: someone who will talk to sales, support and customers from the beginning, and who will also serve as a shield standing in front of the admins who are saving the day. This role is a good one to assign to someone in project management; in an agile world this would be the product owner. It's also important to make sure everyone in the company knows who to contact with questions.

Communication outward is undoubtedly important, and in a crisis even more so. However, it should not come at the expense of operational recovery. The information you pass on, especially to sales and support, must be current and valid at the moment you give it. But it doesn't have to be 100% precise and unchanging. You always build on what you know at a given point in time.

The situation may seem easy to fix at the outset, but further investigation may reveal much more dramatic consequences. Or vice versa. That's why we recommend never communicating deadlines and estimates as definitive. It is always necessary to say "but the situation may change", and you should always allow for a significant extension of the recovery period just to be safe. Everyone in the company needs to take this into account.

Crisis management authority

In calm times, it is understandable that large investments or infrastructure interventions need to be discussed with management and planned thoroughly. You can take longer to select the most suitable supplier, negotiate good business terms with them, and analyse and test the impact of the solution. But once the crisis hits, that kind of deliberation largely goes by the wayside.

The rescue team needs to be empowered to make decisions quickly and not be concerned with entirely optimal efficiency or economy. If a key piece of hardware "dies" and you don't have a replacement in stock, there is no room for negotiation with suppliers. You need to take the company card, get in the car and head to the nearest store that has a new piece in stock. And that comes at a higher cost. Is long downtime cheaper for your project?

If you're operating in the cloud, the recovery team must have the power to launch additional instances immediately - regardless of the infrastructure budget. Alternatively, set budget limits in advance that you can't go beyond, but consider the cost for every minute your project is down.
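The trade-off between emergency spending and downtime can be put into numbers before anything breaks. The following is a rough sketch with made-up figures: the loss per minute, the prices and the durations are purely hypothetical and should be replaced with your own estimates.

```python
def total_cost(fix_cost, downtime_minutes, loss_per_minute):
    """Cost of a recovery option: what you pay plus what the outage costs."""
    return fix_cost + downtime_minutes * loss_per_minute


# Hypothetical figures for illustration only (any currency).
LOSS_PER_MINUTE = 50  # revenue lost per minute of downtime

options = {
    # Cheaper hardware, but two days of waiting for delivery.
    "negotiate with supplier": total_cost(
        fix_cost=2_000, downtime_minutes=48 * 60, loss_per_minute=LOSS_PER_MINUTE
    ),
    # More expensive hardware, back online in about three hours.
    "buy from nearest store": total_cost(
        fix_cost=3_500, downtime_minutes=3 * 60, loss_per_minute=LOSS_PER_MINUTE
    ),
}

cheapest = min(options, key=options.get)

if __name__ == "__main__":
    for name, cost in options.items():
        print(f"{name}: {cost}")
    print(f"cheapest overall: {cheapest}")
```

With these assumed numbers the more expensive purchase is by far the cheaper option overall, which is the argument for agreeing on emergency spending limits ahead of time.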

Backups, backups, backups

Backups and their recovery are an integral part of any disaster recovery plan. And not just having backups, but testing their functionality regularly and setting the optimal backup frequency. Regardless of whether you have infrastructure in the cloud or not.

Even in the cloud, you need to back it up!

We often see the opinion that infrastructure and data in the cloud do not need to be backed up. We hear: "that's why we pay for the cloud, so we don't have to spend money on backups". This approach is flawed and can lead to the bankruptcy of a company.

In the cloud, you're buying computing capacity, storage space and related services, but you're not buying the security of your data. You're only buying the security of knowing that if there's a problem, you can restore your data back to the cloud and start a new set of servers. However, it is your responsibility to have a copy of the data to restore, to have information about the configuration of the servers, and to have the application deployment procedure described.

The cloud gives you the platform, but you give it the business value. The cloud can help you protect that value with geo-replication or a backup service. But it is up to you to actually use these services, to configure them sensibly, and to have a data recovery process in place.

Backup quality testing and optimal backup frequency

You also need to be sure that you can restore your backups. All too often, a company has backed up diligently, but when a restore was actually needed, the backups turned out to be unusable. And the fault may not even lie in the backup itself: backups can also be inaccessible or incomplete at the moment you need them. That's why they need to be tested regularly, both to confirm that you are able to recover data from them and to confirm that you are backing up everything you need, and in sufficient quantity.
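Part of a regular restore test can be automated. The sketch below assumes a workflow that is not described in this article: at backup time you record a manifest of SHA-256 checksums, and after a test restore into a scratch directory you compare the restored files against it. The function names and the manifest format are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest: dict, restored_dir: Path) -> list:
    """Return a list of problems; an empty list means the test passed.

    `manifest` maps relative file paths to the checksums recorded
    when the backup was taken.
    """
    problems = []
    for relative_path, expected in manifest.items():
        restored = restored_dir / relative_path
        if not restored.is_file():
            problems.append(f"missing: {relative_path}")
        elif sha256_of(restored) != expected:
            problems.append(f"corrupted: {relative_path}")
    return problems
```

A check like this only proves that the files came back intact. It does not replace an occasional full rehearsal in which the application is actually started from the restored data.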

You will never have 100% of your data backed up; you will always lose something. The question is whether you can afford to lose a week, a day or an hour of data. How much data are you willing to lose? And how much money is your company willing to invest in data backup and recovery? The relationship is non-linear: if backing up once a day costs XY Kč, backing up twice a day doesn't cost 2 XY Kč, it can cost 5 XY or 10 XY. You need to ask yourself whether the data that falls between backups is worth that much, or whether it is more economical to accept losing it.
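The "how much data can we lose" question is commonly called the recovery point objective. As a simple sketch, assuming evenly spaced backups and ignoring how long a backup takes to complete, the worst-case loss equals one full backup interval. The function names below are hypothetical.

```python
from datetime import timedelta


def worst_case_loss(backups_per_day: int) -> timedelta:
    """With evenly spaced backups, at most one full interval of data is lost."""
    return timedelta(days=1) / backups_per_day


def meets_target(backups_per_day: int, max_acceptable_loss: timedelta) -> bool:
    """Check a backup frequency against the loss the business can tolerate."""
    return worst_case_loss(backups_per_day) <= max_acceptable_loss


if __name__ == "__main__":
    for frequency in (1, 2, 24):
        print(f"{frequency}x per day -> lose up to {worst_case_loss(frequency)}")
```

Comparing these intervals with the price of each backup frequency gives a concrete basis for deciding where the extra cost stops being worth it.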

What to take away from this

Nothing is 100% and mistakes happen. Every admin has experienced a similar situation and it's never pleasant. Still, you can always prepare for the situation in advance and minimize its impact on the company's operations.

It's not free and it's not easy. And in quiet times it may seem like it does nothing for you, but it will prevent big problems in the future. And it's always worth it. Think about it.

Want advice on the optimal backup mode for your project? Contact our experts: consultation@vshosting.eu. They will prepare a tailor-made solution for you free of charge.