The disaster outlined in this paper was a major fire event in one of the two main data centres at The University of Queensland’s main campus at St Lucia, Brisbane, which occurred at 4.30am on Sunday 22 August 2010. The fire started when a UPS (uninterruptible power supply) located in the data centre caught fire. The resulting power failure cut off the power supply to the banks of servers that host the University’s digital services, while the fire released acrid smoke and particulates into this sensitive environment.
I will discuss how this fire event affected The University of Queensland Art Museum’s day-to-day business operations, digital records and record keeping systems – which were impacted for up to four weeks. Additionally, it is important to note that the fire was several buildings away from The University of Queensland Art Museum (UQAM), and that no Collection items were in any physical danger. Firstly, a bit of background …
UQAM digital environment
The University’s extensive computing services support the teaching, research and administration needs of a large organisation with more than 6,000 staff and over 35,000 students across a multi-campus network. The system features complex requirements for operations, security, network capabilities and storage. The UQAM Service Level Agreement with IT Services (ITS) covers the supply and maintenance of hardware, including computers and printers; the purchase and installation of software; email; storage; help desk support; and the supply of specialised servers to host the Collection Management System (CMS) and the web interface for the online catalogue.
The University’s IT service is run from two main data centres, each in a different building, on the St Lucia campus. A data centre is, effectively, a room full of equipment racks stacked with banks of servers that are connected to each other and networked across the campuses by many kilometres of wiring.
UQAM’s digital records, including the CMS, are hosted on virtual machines (VMs). This means that, rather than dedicated physical servers, which are expensive to establish and upgrade, the data is placed in a partitioned space on a bank of linked servers. Apart from cost savings, the main advantage is that at times of higher activity, for instance when multiple users are running large reports, processes can be redirected across the cluster to utilise computational power more efficiently, providing a faster service to the end user. Additionally, unused space on the VM cluster can be made available for maintenance, for instance to duplicate systems during testing and upgrades, and to allow for the speedy expansion of server space to accommodate increased volumes of data without purchasing and installing physical hardware.
UQAM’s CMS is hosted on a specialised production server named Homer, running Red Hat Enterprise Linux 5, which is located in a VM cluster in one data centre on the St Lucia campus. Homer hosts the CMS service (the software) populated by the data in the form of text and multimedia files.1 The system and data are backed up to a staging server, then de-staged to tape in the tape library in the opposite data centre once per week, at 4am on Saturday, with incremental data (i.e. additions or changes to data) backed up daily. This arrangement provides security for the CMS through a dedicated server that is backed up in a secondary location. (See diagram 1)
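To make the weekly-full / daily-incremental rhythm above concrete, the sketch below shows one way such a rotation could be expressed. It is purely illustrative: the paths, archive names and scheduling are my assumptions, not the scripts ITS actually runs, and the de-staging to tape at the second data centre is left out.

```python
"""Illustrative sketch only: a weekly-full / daily-incremental backup
rotation similar in shape to the one described above. Paths, names and
retention choices are assumptions, not UQ's actual configuration."""
import tarfile
from datetime import datetime
from pathlib import Path

SOURCE = Path("/data/cms")          # hypothetical data to protect
STAGING = Path("/backup/staging")   # hypothetical staging area (later de-staged to tape)


def last_full_backup() -> Path | None:
    """Return the most recent full archive in the staging area, if any."""
    fulls = sorted(STAGING.glob("full-*.tar.gz"))
    return fulls[-1] if fulls else None


def run_backup(now: datetime | None = None) -> Path:
    now = now or datetime.now()
    STAGING.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%d-%H%M")
    previous_full = last_full_backup()

    # Full backup on Saturdays (weekday 5), or if no full archive exists yet.
    if now.weekday() == 5 or previous_full is None:
        target = STAGING / f"full-{stamp}.tar.gz"
        with tarfile.open(target, "w:gz") as tar:
            tar.add(SOURCE, arcname=SOURCE.name)
        return target

    # Otherwise, an incremental: only files modified since the last full backup.
    cutoff = previous_full.stat().st_mtime
    target = STAGING / f"incr-{stamp}.tar.gz"
    with tarfile.open(target, "w:gz") as tar:
        for path in SOURCE.rglob("*"):
            if path.is_file() and path.stat().st_mtime > cutoff:
                tar.add(path, arcname=path.relative_to(SOURCE.parent))
    return target


if __name__ == "__main__":
    print(f"Wrote {run_backup()}")
```

In the real environment the staging archives are then written to tape in the opposite data centre, which is what made the recovery described later in this paper possible.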
Development of the UQAM Collection Management System (CMS)
Previously, UQAM’s CMS was on Filemaker Pro with limited capabilities … there were no images attached to records, and report and search options were basic. The data from Filemaker Pro was migrated into the KE Emu Collection Management System on 15 August 2007. The more sophisticated functions of the CMS assisted with the laborious task, over four years, of editing records, attaching multimedia files and checking the physical objects against digital data and hard-copy files.
When the CMS was purchased, the Assurance and Risk Management Office conducted an audit of the system. The audit reviewed procedures for backing up, access protocols, technical specifications, vendor support and hardware requirements. The last action for the audit was to conduct a test backup restore, which was successfully completed on 12 July 2010.
4:30am Sunday 22 August 2010
The fire event was brought under control quite quickly, and, after emergency services had stabilised the area, ITS staff were on site inspecting the damage as early as 7.30am, which gave them time to plan response and recovery.
8:00am Monday 23 August 2010
I arrived at work, along with many others, and turned my computer on to discover that there was no access to email or the shared drives and CMS. The help desk was inundated with calls … the usual method for communicating system downtime is email, of course, which was not operating.
The help desk staff patiently explained the situation to each person ringing in and asked them to pass the information on to other staff in their workgroup. At this stage, the University community was alerted that the system could be down for quite a few days.
The extent of affected services across the University was significant, and, needless to say, UQAM was low on the recovery priority list.
- Student systems were first priority. The event occurred in week 7 of second semester, a critical time in the academic year, and the e-learning systems were the first to be recovered, becoming operational within one day.
- Communications was the second priority, with email operating by midday Sunday but not yet available across the whole organisation; UQAM was without email for 1.5 days. The reinstatement of email enabled ITS to disseminate regular reports and alerts for downtime on systems undergoing recovery.
- The main administrative systems supporting the University’s operations, such as HR and Finance, were also prioritised.
Impact on day-to-day work at UQAM
Even though email services were available within 1.5 days, the archive and the contacts lists were not accessible. This meant that, even though we could send and receive email, we might not have email addresses or archived email trails of correspondence.
The main shared, networked drives were not available, which meant that we did not have access to working folders, digital archives and pro-formas, such as incoming loan receipts and correspondence templates.
When email and access to shared drives were available, after two days, we were advised not to update documents because, as the systems were being restored, they were online and offline intermittently. Users were provided with warnings prior to services going offline.
When the shared drives were online, we could copy folders or files onto the local drives of our computers. In doing so, we needed to consider that local drives are of limited size, are not networked and need manual backup by the user.
The biggest impediment at the time was that the multifunction networked printers, which also scanned and faxed, could not be used for two weeks. Luckily, there was one small, old printer in the back room, which was set up on the (also old) hot desk computer to handle all the printing for a busy office. Staff saved documents onto a memory stick and queued at the hot desk for printing.
Reinstating the networked printers took additional time to resolve. Whereas the drives were managed through the centralised server room, the printer software needed to be reinstalled on individual computers by an administrator, which, at UQ, is the responsibility of ITS staff only, not the user of the machine.
The multifunction printers are also photocopiers, and fortunately, as photocopying is not networked, this function was not affected.
Until the systems were fully restored, four weeks from the date of the disaster, staff were advised to keep a record of newly created files and updated documents on local machines and to copy them into the shared drives once restoration was complete.
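As a purely illustrative sketch of what that interim record keeping might look like in practice – the folder paths and the cutoff date are assumptions, not UQAM’s actual set-up – a short script can list every local file created or modified after the date of the fire, so it can be reviewed and copied back once the shared drives return.

```python
"""Sketch only: list local files created or updated after a cutoff date
so they can be copied back to the shared drives once restored.
Paths and the cutoff date are illustrative assumptions."""
import shutil
from datetime import datetime
from pathlib import Path

CUTOFF = datetime(2010, 8, 22)            # date of the fire event
LOCAL = Path("C:/Users/staff/interim")    # hypothetical local working folder
SHARED = Path("//uq-share/uqam/working")  # hypothetical restored shared drive


def changed_since_cutoff() -> list[Path]:
    """Return local files modified on or after the cutoff date."""
    cutoff_ts = CUTOFF.timestamp()
    return [p for p in LOCAL.rglob("*")
            if p.is_file() and p.stat().st_mtime >= cutoff_ts]


def copy_back(dry_run: bool = True) -> None:
    """Copy changed files to the shared drive, preserving folder structure."""
    for src in changed_since_cutoff():
        dest = SHARED / src.relative_to(LOCAL)
        print(f"{src} -> {dest}")
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)


if __name__ == "__main__":
    copy_back(dry_run=True)   # review the printed list before copying for real
```

Run with dry_run=True, it simply prints the candidate files, which can serve as the kind of record staff were asked to keep.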
Unfortunately, some files were lost during the recovery process, highlighting the importance of key documents also existing as hard-copy records.
CMS recovery
The CMS is a specialised service, as opposed to the standard systems outlined above. Generally, systems hosted on a VM cluster are quicker to restore, in comparison to systems that are reliant on specialised physical hardware.2 As UQAM’s CMS is virtualised, even though UQAM was low on the priority list for recovery, we were able to ‘jump the queue’ because the system was not subject to insurance claims, vendor supply warranties or many of the other issues that held up the recovery for higher priority, more complicated systems.
The recovery team for the CMS included UQAM, ITS and the CMS software Vendor. Going into the recovery phase, it was reassuring for all concerned that the test back-up restore had been successfully completed one month prior to the fire event. The fire affected the server room housing Homer, the sole database host for the CMS. The scheduled tape back-ups were safely housed in the tape library at the secondary location.
The CMS was restored at 9:18am on the Thursday after the fire, resulting in downtime of only 3 days. However, even though the full back-up tapes were available, the server could not ‘read’ the incremental back-up tapes. This meant that the system was initially restored with the back-up tapes from 16 August. Work conducted in the previous week included the yearly reconciliation of the stocktake, with final catalogue edits and location updates recorded on the 22 August update – the culmination of three months’ work. If lost, the data would have needed to be re-entered from the hard-copy working documents.
Finally, one of the Vendors realised that the fault was due to incompatibility between the tapes (LTO version 3) and the tape drives (LTO version 4), and this was simply resolved once a version 3 drive was installed in the tape library, allowing the incremental back-up to be restored.3 This situation highlighted the issues we face as both hardware and software are constantly upgraded. Different hardware and software may be affected differently under the same set of circumstances, and each disaster provides a different set of circumstances.
Even though the CMS was fully restored within a short time, the use of the CMS was impacted for four weeks, mainly due to the load placed on the tape library to recover such a large system. In this situation, we were advised to maintain the integrity of the last back-up and the CMS was managed as follows:
- CMS users were advised to treat the CMS as ‘read only’, with only limited editing. User rights could have been changed to enforce this, but we chose not to do so.
- Location changes were entered in the hard copy Location Log, until they could be updated on the CMS. The log and the 22 August locations on the CMS were both needed to identify the locations of works.
- Edits to existing records on the CMS were avoided where possible, or recorded and noted. This period coincided with the final preparations for a major exhibition of collection and loan works. The final checks of work details and artists’ biographies were added to the CMS to run the exhibition labels, which are produced in house. Usually, the label report is run and edited, and the catalogue is then updated to run a corrected report. In this instance, once the label report was run, it was edited and the changes recorded, to be entered into the CMS once it was fully operational.
- Skeleton records were added, each accompanied by a worksheet report or a printed or handwritten file note placed in a dedicated hard-copy file, to be checked once the CMS was fully operational.
- After four weeks, ITS staff advised that the digital environment was restored and the regular back-ups scheduled.
Review
The subsequent investigation of the fire event concluded that one of the UPS batteries spontaneously combusted. To avoid this occurring in the future, the batteries are now on a regular replacement schedule and the server room has been re-designed so that the UPS supply is located outside the room, away from the sensitive servers and tape library, and separated from them by a physical fire wall.
The recording and recovery of the fire event was coordinated by the University’s IT Services team (ITS), and highlighted significant, previously unidentified factors affecting computer systems in fire events. For instance, the fire did not destroy the entire server room and, as only one of the UPS power supplies was compromised, the others were still operational. The servers supplied by the fire-damaged UPS shut down almost immediately. However, the others continued to operate, their fans sucking in smoke and particulates that settled on the fragile components inside, resulting in these servers sustaining more serious damage. The extent of damage caused by acidic smoke from burning plastic on delicate computer components was hitherto unknown.
The dedicated server hosting the CMS, with back-up in a secondary location, allowed for successful recovery in this disaster. While the respective buildings are some distance apart, it is feasible that a future disaster could affect both locations.
One month is a long time in a busy work environment, and the projects at UQAM impacted by the disruption to services included the annual Collection stocktake, preparation for two feature exhibitions celebrating the University’s Centenary, a Board meeting and a large volume of gifts to the Collection as a result of the Centenary Gift drive.
The main issue for staff was to work out ‘other ways’ to complete relatively simple tasks with the tools that were available, in order to maintain business continuity. For example, a simple Collection enquiry, normally handled within 5 minutes, involved looking up hard-copy records, photocopying, checking addresses, sending via snail mail and noting that the task had been done so that activity logs could be updated at a later date – all of which impacted on staff time.
The two-week period with limited access to printing was the most pressing issue to resolve on a day-to-day basis. The small, single-tray printer was frustrating for the team to use. Print tasks ranged across letterhead, white and coloured paper, and sticker paper for exhibition labels and mailouts. Additionally, it did not print double-sided, print at A3 size or collate pages. If there was a ‘next time’, I would be tempted to purchase a number of small, cheap printers connected to staff machines for interim use. However, from a sustainability perspective, this ‘quick fix’ is short sighted, as cheaper printers have higher consumables costs, and they would most likely be discarded once systems are restored. From a business continuity perspective, we were also advised retrospectively that the costs to maintain reasonable services would have been covered by insurance under an ‘increased cost of doing business’ clause.
The reliance at UQAM on digital record keeping for general business meant that some key documents were not accessible, and some documents were ‘lost’. This highlighted the need to identify records that should be printed and stored in hard copy, as well as exploring the option of moving priority documents to one of the higher security digital archive formats at the University, such as the HP TRIM Records Management System.
During the recovery process, ITS staff advised that ‘missing’ files could be recovered if they knew the file name, but, once a file is ‘not there’, unless a file path is included on a document, we generally do not know its exact file name. We were assured that documents or the incremental backups of the documents would be on the tapes in the tape drives, but ITS staff would need to know the file names and the dates they were created or updated in order to recover them. This issue stressed the importance of logical and consistent naming protocols for digital records.
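By way of illustration only, the sketch below checks file names against a hypothetical convention of a date prefix, a category code and a short description; the pattern is my assumption, not UQAM’s actual protocol, but it shows how a consistent scheme can be audited automatically, and how an embedded date and category would help ITS locate a file on the back-up tapes.

```python
"""Sketch only: validate file names against a hypothetical naming
convention of the form YYYY-MM-DD_CATEGORY_short-description.ext.
The convention is illustrative, not UQAM's actual protocol."""
import re
from pathlib import Path

# e.g. 2010-08-16_LOAN_incoming-receipt-smith.docx
PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}_"        # ISO date the record was created or updated
    r"[A-Z]{2,10}_"               # category code, e.g. LOAN, EXH, ACQ
    r"[a-z0-9-]+"                 # short hyphenated description
    r"\.[A-Za-z0-9]+$"            # file extension
)


def non_conforming(folder: Path) -> list[Path]:
    """Return files whose names do not match the convention."""
    return [p for p in folder.rglob("*")
            if p.is_file() and not PATTERN.match(p.name)]


if __name__ == "__main__":
    folder = Path(".")            # hypothetical shared-drive folder to audit
    for path in non_conforming(folder):
        print(f"Rename needed: {path}")
```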
When the test back-up restore was conducted on the CMS, it did highlight some ‘glitches’, which could be documented in the Service Level Agreement for future implementation. The recovery process also brought to light issues involving Vendor log on authentication and the tape back-up systems (see Endnote 3). This highlights the importance of conducting regular test back-up restores, especially as equipment and software are upgraded so rapidly. Test back-up restores are scheduled yearly, and are also conducted after major hardware and software upgrades.
Even though the fire event was small, in terms of physical ‘damage’ to buildings and the time it took to control the fire, the ‘damage’ to the operations of the University, due to the location of the fire, was significant. At UQAM, ‘downtime’ where no services were available, was limited to 1.5 days, and the total time impacted was four weeks. UQAM is fortunate to be located within a larger organisation that is so heavily dependent on IT services, and which was able to implement the recovery so speedily.
Recently, I was researching early Collection records, which are held in the University’s Archives. As I leafed through the well-worn manila folders, I was struck with the thought that they also trace the history of hard-copy record keeping – handwritten flourishes in rich blue ink – the slick lines of a Biro indicating that I’ve reached the 1960s – smudged carbon copies of typed correspondence on delicate sheets of pink and blue tissue – yellow telegrams – Roneo-printed agendas and minutes – and fading faxes on thin, shiny paper. These records were created with the technologies available at the time, and have specific preservation requirements. Digital technologies continue to exponentially expand the capacity and speed with which we conduct our work, with the technology changing at a rapid rate. This highlights the need for regular revision of the processes available to create, manage and preserve our work.
By Kath Kerswell
Presented at the 2012 ARC Conference
———-
Endnotes
1. Homer is running Red Hat Enterprise Linux 5, which is an operating system, much like Windows or OS X, and the software for the CMS is KE Emu.
2. Note from Christian Unger, ITS, clarifies that the issue here is that physical systems (just like the physical hosts that make up the VM Cluster) require physical hardware, which was damaged in the fire. Because Homer is a single instance, complexity arises. Previously UQAM had two servers hosting the CMS (named Odyssey and Illiad) and thus had spare capacity; in this event, that would probably have resulted in only a one-hour outage. Because Homer is unique, failover was not possible and, had it been physical, recovery would have been slower due to ITS needing to obtain a server and deploy it. The advantage of Homer being virtual is that it could be brought online on spare capacity in the VM Cluster. Had the VM Cluster been wiped out (i.e. no redundancy in another data centre, or insufficient capacity in the redundant part of the cluster), obtaining replacement physical hosts would have been the issue. Thus, in this regard, VMs are more flexible because, as long as there is some capacity, a VM can be placed on that capacity. By contrast, a physical host, if there is no partner already in place, needs to be obtained and installed. Had there not been capacity for a VM, the lack of physical hosts would have remained the constraint.
3. Note from Tim Cridland, ITS, explains that ‘under CMS recovery, what we found with the version 4 vs version 3 hardware was that the LTO4 tape drives would only sometimes read correctly from LTO3 format tapes. There is some backwards compatibility, but the revelation was that the backwards compatibility is not perfect, and there were some instances where the chance of being able to restore from a tape depended on exactly how it was written. In particular what we found was that if an LTO3 tape was last written to by an LTO4 drive, the recovery almost always worked from an LTO4 drive.’