[Biochem-users] Biochem emergency downtime
Borries Demeler
demeler at biochem.uthscsa.edu
Fri Aug 31 09:10:10 CDT 2007
Dear Colleagues,
Yesterday we experienced a crash caused by multiple hard drives failing
on biochem, which forced us to take biochem down overnight and rebuild
from backup. Fortunately, our backup was very recent (11:40 am 8/30/07),
so not much was lost. However, any mail that arrived between 11:40 am
and 6:00 pm yesterday and that you had not yet read may have been lost.
All mail sent to biochem after approximately 6:00 pm was queued and will
be delivered within the next 2-3 hours.
We apologize for this problem, but the circumstances leading up to
it were beyond our control. The machine is functional again,
although our current configuration is only temporary. The desired
final configuration will require spare parts, which are currently on order.
If you are interested in the details, please read on:
Originally, our configuration included 4 x 300 GB hard drives
configured in software RAID5 mode. This configuration provides
striped access to the hard drives (improving performance) in such a way
that parity information is distributed over all drives. If one of
the hard drives fails, the remaining 3 can take over and no
information is lost. The array then runs in degraded mode
until the bad drive is replaced and the array is regenerated.
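
For readers curious how parity makes this possible, here is a minimal
sketch in Python. It is an illustration only, not the code our software
RAID actually runs: the function names, the 4-byte block size, and the
sample data are all made up for the example. The idea is that the parity
block is the byte-wise XOR of the data blocks in a stripe, so any one
missing block equals the XOR of the blocks that survive.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks, parity_index):
    """Build one stripe: the data blocks plus a parity block at parity_index."""
    stripe = list(data_blocks)
    stripe.insert(parity_index, xor_blocks(data_blocks))
    return stripe

def rebuild_block(stripe, failed_index):
    """Reconstruct the block of ONE failed drive from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Hypothetical 4-drive array: each stripe holds 3 data blocks and 1 parity block.
# RAID5 places the parity block on a different drive for each stripe; here it
# sits on drive 1 for this one stripe.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = make_stripe(data, parity_index=1)

lost = 2  # pretend drive 2 has failed
recovered = rebuild_block(stripe, lost)
print(recovered == stripe[lost])
```

The same XOR only works while exactly one block per stripe is missing.
If a second drive fails before the rebuild has finished, there is not
enough information left to reconstruct either missing block.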
Precisely this situation happened on Tuesday. We issued a 24 hr notice
to replace one bad drive, for which we had a replacement on hand. This
was done as scheduled last night after work, around 6:00 pm. The array
started to rebuild as expected, but halfway through the rebuild 2 more
drives generated I/O errors. RAID5 can only compensate for a single
failed drive, so without sufficient parity information we were not able
to recover the data and rebuild the array. We had to restore the system
from backup onto one of the good drives, and it is currently functional
in this configuration.
At this point, we are not sure if the error resulted from a bad
controller, 3 bad drives, or from the software RAID5 configuration.
We are testing the drives to get to the bottom of this. The biochem
system was restored to run on a single drive, which means we are
vulnerable right now to another disk failure and the performance will
be noticeably slower.
Our remedy plan involves purchasing a new hardware RAID controller
and replacing all drives with new drives. In about a week we
will hopefully have all parts on hand and can restore biochem to normal
operation.
Thank you for your patience, -Borries
---
Borries Demeler, Ph.D.
Assistant Professor
The University of Texas Health Science Center at San Antonio
Dept. of Biochemistry, MC 7760
7703 Floyd Curl Drive, San Antonio, Texas 78229-3901
Voice: 210-567-6592, Fax: 210-567-1136, Email: demeler at biochem.uthscsa.edu