Disaster recovery is a term closely associated with the IT industry, referring to the continuity of business processes following a catastrophic failure caused by natural disaster or human error.
It has long been vital in our industry too: the digitisation of assets is a major financial and logistical commitment for broadcasters, so when a digital archive is installed it is imperative that the stored material survives any technological meltdown.
Disaster recovery is a wide-reaching term, and in this industry the complexity of the techniques and technologies deployed varies with the size of the failure, or potential failure.
A post production facility's perception of disaster recovery, for example, will differ from that of a transmission playout provider; taking that a stage further, a playout provider that transmits content for third parties will have a different perception again.
A company working through the disaster recovery process from an archive point of view should first carry out a Failure Mode and Effects Analysis (FMEA) to identify potential failures in components small and large.
Small components include the likes of disk drives, tape drives and power supplies, while large components include tape libraries, networks or, in the worst case, complete denial of access - site meltdown.
By examining the solutions an FMEA suggests, a structure can be put in place to counter these failures, taking each component in turn. Against small failures a broadcaster can increase protection, for example by installing more than the minimum number of drives and by using RAID-protected servers.
Against large component failures, protection would include a second tape library, redundant disk storage and an infrastructure in which applications are not all hosted on a single server, providing a distributed approach to controlling the software environment.
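As an illustration, the FMEA prioritisation described above can be sketched in a few lines of Python. The component list and the severity, occurrence and detectability scores are illustrative assumptions, not measured values; a real analysis would score each failure mode from operational data.

```python
# Minimal FMEA sketch: rank archive components by Risk Priority Number
# (RPN = severity x occurrence x detectability). All scores (1-10) are
# illustrative assumptions, not measured values.

FAILURE_MODES = [
    # (component, severity, occurrence, detectability)
    ("disk drive",         5, 6, 2),
    ("tape drive",         4, 5, 3),
    ("power supply",       6, 4, 2),
    ("tape library",       8, 2, 2),
    ("disk storage array", 8, 3, 3),
    ("site loss",         10, 1, 1),
]

def rank_by_rpn(modes):
    """Return (component, RPN) pairs sorted by descending RPN."""
    return sorted(
        ((name, sev * occ * det) for name, sev, occ, det in modes),
        key=lambda item: item[1],
        reverse=True,
    )

for name, rpn in rank_by_rpn(FAILURE_MODES):
    print(f"{name:20s} RPN={rpn}")
```

The counter-measures in the text then follow the ranking: the highest-RPN components are the first to receive redundancy.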
A broadcaster can also take a multi-site distributed approach to its archive: instead of putting all of its content in one site, it can distribute it to multiple locations and link those sites together.
With a distributed approach it is imperative to link the metadata to the content. It is all very well sending content to a remote site, but if the building that houses the database is lost, the assets held elsewhere are as good as useless. The database at the remote site therefore needs to be synchronised with the main site.
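That synchronisation requirement can be expressed as a simple consistency check: every asset held at the remote site should have a matching metadata record in the remote database, and vice versa. A minimal sketch, assuming assets are identified by plain IDs (the record shapes here are hypothetical):

```python
# Sketch of a metadata/content consistency check between two sites.
# Asset IDs and record shapes are hypothetical.

def unsynchronised_assets(remote_content_ids, remote_db_ids):
    """Report assets whose essence is at the remote site but whose
    metadata record has not been replicated there, and vice versa."""
    content = set(remote_content_ids)
    metadata = set(remote_db_ids)
    return {
        "content_without_metadata": content - metadata,
        "metadata_without_content": metadata - content,
    }

report = unsynchronised_assets({"A", "B", "C"}, {"A", "B"})
print(report)  # asset C has essence at the remote site but no metadata
```

Any non-empty result means the remote site could not stand alone after a disaster, which is exactly the failure the text warns against.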
A very simple, cost-effective way to protect against the loss of a library is to have a standalone back-up in addition to the main library.
This back-up can be connected to the archive management system, and if the electronics that control the main library fail, tapes can be physically removed and loaded into it. Because the archive management system tracks each tape's barcode, it simply recognises that the tape is now in an alternative system.
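The barcode-driven behaviour described above can be sketched as follows; the class and method names are hypothetical, standing in for whatever the archive management system actually provides.

```python
# Sketch: the archive manager keys every tape by barcode, so moving a
# cartridge to the back-up library is just a location update; restore
# requests then resolve to whichever library currently holds the barcode.
# Class, method and library names are hypothetical.

class ArchiveManager:
    def __init__(self):
        self.location = {}  # barcode -> library name

    def register(self, barcode, library):
        self.location[barcode] = library

    def relocate(self, barcode, new_library):
        # The physical move of the cartridge; the catalogue entry follows it.
        self.location[barcode] = new_library

    def library_for(self, barcode):
        return self.location[barcode]

mgr = ArchiveManager()
mgr.register("TAPE0001", "main-library")
mgr.relocate("TAPE0001", "backup-library")  # main library electronics fail
print(mgr.library_for("TAPE0001"))          # prints "backup-library"
```

The key design point is that the barcode, not the library slot, is the primary key, so the content survives the loss of the robotics around it.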
Another option is to take a distributed approach to maintaining content across multiple sites. This scenario highlights a fundamental weakness, however, because there is still a single database at the centre of it. So how do you protect the database?
The answer is twofold. First, build a clustered architecture around the database so that multiple host servers are attached to the storage that holds it.
Second, provide RAID protection for that storage, with the database engines hosted on the clustered architecture. By taking this approach the broadcaster also creates regular back-ups of its database as part of standard archive schedules.
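The back-up element of that schedule might look something like this minimal sketch; the naming convention and the retention count of seven dumps are assumptions, not a recommendation from the text.

```python
# Sketch: name database dumps on the standard archive schedule and prune
# old ones. The prefix and retention count are illustrative assumptions.

from datetime import datetime, timezone

def backup_name(prefix="archive-db"):
    """Timestamped dump name; lexical order matches chronological order."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}-{stamp}.dump"

def prune(backups, keep=7):
    """Return the dump names to delete, keeping the `keep` most recent."""
    return sorted(backups)[:-keep] if len(backups) > keep else []
```

Because the names sort chronologically, the pruning step needs no extra metadata to decide which dumps are oldest.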
The ultimate disaster recovery scenario is to have two completely separate content management systems, each with its own archive, at independent sites. In this instance each archive is used locally.
They are archives in their own right, servicing local needs for playout automation and production environments. Because they are owned by the same company, there is the opportunity to link the two.
This then becomes intelligent disaster recovery, because the broadcaster can set up rules engines at both sites that define which content is transferred between them.
This can, however, raise rights management issues when content is transferred for disaster recovery purposes. A facility might have the rights to a movie, but only the rights to hold two copies across its organisation for a limited period.
If the plan is to play the content out three times, the facility needs to make sure it is available. To achieve this, one copy remains on the main site and, rather than making the duplicate on that same site, duplication takes place at the remote facility, once again using a rules engine to define how and when the material is moved across.
Similarly, if content is deleted from one archive, rules can be defined to make sure it is deleted from the other. Again, it is vital that the databases remain synchronised. The archives are not mirrors of each other, because defined business process rules move content from one to the other. A model is evolving that will see content pushed in both directions, where site A is the redundant site for site B and vice versa.
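A rules engine of the kind described above, covering both replication and deletion between the two sites, can be sketched as follows; the rule predicates and the asset fields are illustrative assumptions modelled on the two-copies rights example.

```python
# Sketch of a rules engine deciding replication and deletion between two
# sites. Rule predicates and asset fields are illustrative assumptions.

def transfer_rule(asset):
    # Duplicate at the remote site only when rights allow a second copy
    # and the asset is actually scheduled for playout.
    return asset["copies_allowed"] >= 2 and asset["scheduled_playouts"] > 0

def deletion_rule(asset, deleted_at_peer):
    # Mirror deletions so neither site holds an orphaned copy.
    return deleted_at_peer

def actions_for(assets, deleted_ids):
    """Walk the catalogue and emit (action, asset_id) pairs."""
    actions = []
    for asset in assets:
        if asset["id"] in deleted_ids and deletion_rule(asset, True):
            actions.append(("delete", asset["id"]))
        elif transfer_rule(asset):
            actions.append(("replicate", asset["id"]))
    return actions
```

Because the rules run at both sites, each archive independently reaches the same decision about an asset, which is what keeps the two databases synchronised without one being a blind mirror of the other.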
Digitising material and creating a digital workflow is just the start. Ensuring that material is always protected in the event of a failure of any size is imperative.
Likely causes of archive loss
Small component failure:
- Disk drives
- Tape drives
- Power supplies
- Network adapters, etc.
Large component failure:
- Tape libraries
- Disk storage arrays
- Host application servers
- Complete denial of access - site meltdown