Chronology of the Events of October 4th and 5th
Tuesday, September 23, 2003 - A drive failed in a set of hard drives containing major parts of the main web server. Information Services began ordering replacement parts and spares.
Saturday, October 4, 2003 at 5:40 am - SRP caused an interruption in power to much of the north side of campus including the AD building - until approximately 6:20 am when the power was restored. At that time, most of the systems were able to return to running status with a few problems. Information Services personnel noticed something had happened around 9:15 am, but at that time, it wasn't clear what had occurred. Information Services personnel arrived on campus around 10:20 am. The systems administrators were called in to look at the servers in the AD building.
Saturday, October 4 at 11 am - Once the systems administrators arrived, they found the Battery Backup Unit had a bad power module. The UNIX systems administrator, Mr. Harris, also discovered that a set of hard drives attached to one of the UNIX servers had a problem with an interface board, but it was working through the redundant interface. To correct a problem on the tape backup system, the database server was shut down and the tape backup system was restarted, after which both began functioning normally. Mr. Harris shut down the Mail/WebCT server since it had come up before its set of hard drives was ready. After this was done, all services on the Mail/WebCT server were up and running, but a call to Sun Microsystems would need to be made on Monday to get the interface board replaced. At about noon, Mr. Harris began working on the web server. Only the drive that had failed on Tuesday, September 23 indicated a failure, so he proceeded to bring up the web server. All services were brought up in a few minutes, giving access to web and home directory drive space. Since the Full Backup scheduled to start at ten minutes after midnight had been interrupted, Mr. Harris then restarted the backup service so that a current full backup would be available.
The Microsoft systems administrator, Mr. Trottier, checked on the Windows servers and applied some critical security updates and performed some routine maintenance while he was on site.
Saturday, October 4 at 1:30 pm - All UNIX and Windows servers in the AD building were functioning normally.
Saturday, October 4 at 3:38 pm - A second drive failed in the same set of hard drives that had a drive failure on Tuesday, September 23.
Sunday, October 5 at 9:30 am - Mr. Harris received a call from Information Services personnel indicating there was another problem with the web server. He arrived on campus at about 10:30 am and found the second drive had failed. He tried to bring the volume back online with some basic troubleshooting methods, but this was unsuccessful. He then contacted Bill DeHaan at about 11:30 am and brought him up to date on the situation. Mr. Harris informed Bill that he would initialize the set of disks that had failed and restore the information from tape. The initialization of the set of disks that had failed took approximately one and a half hours.
Sunday, October 5 at 12:12 pm - Mr. Harris sent another e-mail to all MCC employees to let them know about the problem and that it was being worked on. After sending the e-mail and initializing the set of disks, he began recovering the information for the web server volume from tape. He was able to use the backup he had just completed on Saturday to restore the information.
Sunday, October 5 at 7:37 pm - Mr. Harris sent another e-mail to all MCC employees to provide a status update.
Sunday, October 5 at 9:39 pm - Mr. Harris sent a final e-mail message to all MCC employees to inform them that all of the directories on the web server had been restored.
Successes experienced during the weekend's events:
- Information Services was able to retrieve and restore all of the data on the web server in a timely manner. This was aided by the fact that a full backup had been completed the previous day.
- Information Services responded to the problems encountered on Saturday in a timely fashion.
- Information Services kept the campus community informed as to the status of the problems as events unfolded.
What can we improve on or what has been improved:
- Information Services is ordering additional hard drives and one will be installed in each array as a hot-spare in case of failure. In the event of a drive failure, the data will automatically be rebuilt on the hot-spare and the hot-spare used until a replacement drive is installed.
- Information Services will improve the notification systems to enable us to react to system failures in a more timely manner.
- Information Services is evaluating our power and battery backup systems to improve our resistance to power failures.
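The hot-spare behavior described in the first improvement above can be illustrated with a small model. This is only a sketch under stated assumptions: the report does not name the array type, so the model assumes an array that tolerates exactly one missing drive, treats the rebuild as instantaneous, and uses hypothetical names (DriveArray, d1, s1) that do not come from the actual systems.

```python
from dataclasses import dataclass, field


@dataclass
class DriveArray:
    """Toy model of a redundant array (hypothetical, not the real hardware).

    Assumption: the array survives exactly one missing drive. A second
    missing drive takes the volume offline.
    """
    active: list
    hot_spares: list = field(default_factory=list)
    missing: int = 0  # failed members not yet rebuilt or replaced

    def fail(self, drive):
        """Record a drive failure and rebuild onto a hot spare if possible."""
        self.active.remove(drive)
        self.missing += 1
        # A rebuild is only possible while the array is still readable,
        # that is, while exactly one member is missing.
        if self.hot_spares and self.missing == 1:
            spare = self.hot_spares.pop(0)
            self.active.append(spare)  # rebuild modeled as instantaneous
            self.missing -= 1

    @property
    def state(self):
        if self.missing == 0:
            return "healthy"
        if self.missing == 1:
            return "degraded"
        return "offline"


# Without a hot spare: the first failure leaves the array degraded,
# and a second failure takes the volume offline.
no_spare = DriveArray(active=["d1", "d2", "d3", "d4"])
no_spare.fail("d2")
state_after_first = no_spare.state
no_spare.fail("d3")
state_after_second = no_spare.state

# With one hot spare: the first failure is absorbed by the spare,
# so a second failure only degrades the array.
with_spare = DriveArray(active=["d1", "d2", "d3", "d4"], hot_spares=["s1"])
with_spare.fail("d2")
spare_state_after_first = with_spare.state
with_spare.fail("d3")
spare_state_after_second = with_spare.state
```

In this model the array without a spare goes from degraded to offline on the second failure, while the array with a spare is healthy again after the first failure and only degraded after the second. A real rebuild takes time, so a second failure during the rebuild would still be a risk.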