Summary of events:

  • On Thursday, May 12th, two disks on filer 13 failed simultaneously. That evening, just before the RAID reconstruction finished, a third disk failed (on new 'quality' hardware). We therefore stopped the servers to avoid corrupting the volume group, and tried to recover the data throughout the night and on Friday.

  • On Friday morning, we contacted the customers on filer 13 and began reconstructing the servers. Given the large number of servers, the process was rather time-consuming, but it should have been completed between Friday night and Saturday... except that the filer used for the reconstruction began to show the same symptoms as filer 13, as did the ones that followed. Depending on the attitude of the manufacturer involved over the coming days, we are not ruling anything out, given the harm caused by their defective hardware.

  • After several hours spent setting up new filers from another manufacturer, the reconstruction process resumed on Sunday night, and by Tuesday morning 90% of the customers were ready to go. We then spent a day manually processing the remaining 10% and dealing with the side effects, notably on GandiMail, which suffered (though the situation has since returned to normal). We are allocating substantial resources to GandiMail so as to be able to provide an excellent level of service.

  • Today, all the servers on filer 13 have been reconstructed, and all the customers who ordered a server since Thursday have been served.


Damage caused:

  • Customers impacted by filer 13 and the subsequent loss of all their data are in the process of being fully refunded for the entire amount paid from their prepaid accounts. The service is still in beta testing, but we are uncompromising on this point.

  • We have extended the 5-day period of validity for the shares of everyone who was blocked between Thursday and Monday.



Actions being taken:

  • The decision to change the architecture, in order to deliver the high availability promised at the launch of the service, was made even before filer 13 crashed. As things stand, if the machine on which your server is hosted fails, the incident will be transparent to you; if a disk (or two) stops working, RAID 6 will ensure that the incident also goes unnoticed by the customer. We wanted this to be the case even if an entire filer failed. The storage architecture is therefore being modified. The final version of the service will be announced once this takes effect.

  • A backup system is also being prepared that will ensure that not all of your data is stored in the same data center as your main disks. The centers are either completed or being fine-tuned.

  • We are holding discussions with the supplier of the defective filers (and I strongly suggest that they recognize the gravity of the situation) and with other manufacturers in order to take all necessary measures.


I would like to give my sincere thanks to all our customers who, despite the disruptions encountered on this advanced technology in beta testing (this must always be kept in mind), continued to post their positive and indispensable support.

I also wish to say to my team that I am proud to work with them. What they have done is admirable, as is what they continue to do. Thank you. (Stephan)