Explanation of the outage that lasted 4 hours
By Ryan on Thursday 2 July 2009, 18:08 - Hosting - Permalink
As some of you may have noticed, we experienced a power outage from 11:30 AM to 3:30 PM (CET) today. We apologise for the inconvenience caused.
The company that manages the datacenter housing one of our server rooms performed maintenance this morning on some of its electrical equipment. Unfortunately, a human error by the provider undertaking the maintenance led to a total blackout of the datacenter for several minutes. The outage impacted many major companies on the internet (Dailymotion, Skyblog, Pixmania, etc.), as well as a portion of our customers.
Our system automatically transfers data to another server room in the event of a problem, and it was working normally until the threshold was reached too quickly. There are several reasons for this: a massive outage on a scale seldom seen, our choice to build up the other server locations gradually rather than all at once, a server rebooting process that still requires perfecting, and the success of our free server operation. In short, a chain of events caused this outage for some of you.
It goes without saying that we will examine what we could have done at our end to prevent this, for example, building up the other server locations more quickly (the hosting operation helps with this enormously) and continuing to optimize the rebooting script (in the event that the electricity gods or the datacenter's subcontractor are still out to get us).
Additionally, we will ensure we refund you for today, and are at your disposal for any additional requests you may have. We will be adding one day to the expiration date for all of the affected hosting resources throughout the day.
Please accept our sincere apologies for this inconvenience. We are working to ensure that our infrastructure is built up in a way that assures good performance and excellent management, so that ALL of you can benefit from our technology.
Technical note: If you have a Gandi AI server that responds but one of its services (web, ftp, etc.) is not working correctly, you can fix this by simply stopping and starting your server from Gandi's website. If you have an expert server, we recommend that you activate your account's emergency console and run an fsck command on your disks.
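To tell whether your server is in the "responds but a service is down" situation described in the technical note, a quick TCP check of each service port can help. The following is only a minimal sketch: the function names, the `SERVICES` table and its port numbers are illustrative assumptions, not part of any Gandi tool, and a successful connection only shows that something is listening, not that the service behaves correctly.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Hypothetical service table: adjust names and ports to what your server runs.
SERVICES = {"web": 80, "ftp": 21, "ssh": 22}


def check_services(host: str, services: dict = SERVICES) -> dict:
    """Map each service name to True (port answers) or False (it does not)."""
    return {name: service_is_up(host, port) for name, port in services.items()}
```

For example, `check_services("server.example.org")` (a placeholder hostname) returns a dictionary such as `{"web": False, "ftp": True, "ssh": True}` when only the web service is not answering.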

Comments
Guys - my virtual Xen site is still down in spite of various attempts to reboot from the website CP - This is the Nth time this year due to hardware or other problems that I've experienced downtime with Gandi - I think that I'm moving to Slicehost
My server has been down for nearly 24 hours now, it's just not good enough. This should have been fixed long ago. I was expecting more from Gandi, I am very angry about it.
Almost everybody who came to us with an issue on their server has been taken care of since 9 this morning, as soon as their request arrived. If you have sent an email to our customer care, you should have received an answer by now.
I appreciate the communications and transparency regarding the situation, and the full explanation. But this has really given me pause about hosting any major sites on Gandi, which I was almost ready to do.
Your promotional explanations on the Gandi site regarding hosting suggest that the chain of events you've described should not have happened at all. In other words, the promise of reliability and redundancy, so prominent on the Gandi site, clearly cannot be delivered. A huge disappointment, and Gandi's hosting is not ready for prime time at all. Really unhappy that infrastructure stability was emphasised when the reality was a very unstable hardware architecture and configuration.
FAIL.
OK, the part of the presentation about UFOs is probably a little bit exaggerated. I consider the UPS expert who came and switched off, in perfect harmony, the power to our datacenter 1 to be an alien.
After this kind of event, we check the RAID status before restarting any server. Some of the affected customers were automatically and quickly transferred to room 2. The rest (still a lot, about 3000 servers) tried to start simultaneously in room 1. This took far too much time at the beginning, but we found where the problem was and the last 80% was done in about 30 minutes.
Now we are ready to react faster to the next alien assault.
So yes, we had never tested this kind of crash with so many servers before, and we didn't anticipate the problem that stalled the full restart. In a way, I fully admit we failed on that.
But the promise of reliability, flexibility and redundancy is still in this technology.
We can lose a machine, a rack, or several racks, but not yet the whole of the biggest room we have.
My virtual server is STILL down - 2 days later. This is insane. Can't restart from the console either. Rock on Amazon EC2
Stories of extraterrestrials intervening aside, the claims made by Gandi cannot be backed up. It's like promising 100% uptime, unless too many servers go down, or safe backup of 1 TB, only to find that the backup disk is 500 GB. Those promises are then a gamble that you'll never actually need a double set of servers and/or equivalent disk space. Of course, this is how much of the hosting industry, by competitive necessity, works (with its "unlimited" resources on offer), but those companies are usually very careful about fronting words like reliability and redundancy in the blurb.
It would be more accurate to say that every system has a set degree of those factors, which is directly proportional to the "spares" at hand. Given that you have loyal and paying customers (I have been with you since the first week of the beta), I sincerely hope that all those free servers you cite as a contributing factor for the meltdown did not tip the odds for an event like this to happen the way it did. Now, that would actually upset me.
I hope for two things:
1. Monitoring of services. This has been on the wishlist since the start with zero progress. There is obviously a need.
2. A red-alert channel for support. I have only contacted support two or three times. Replies have taken more than 24 hours. Clearly, that equals inadequate support when servers refuse to respond. A filter that channels those really urgent issues into much faster responses would be appreciated. Even plumbers have a hotline for water leaks.
Jordi: Have you contacted our customer care service? All servers are up, but services on some servers are having difficulties. We ran an automatic fsck on Gandi AI servers, but some Expert servers are stuck at this step. If you are in expert mode, read this part: http://www.gandibar.net/post/2009/0...
If you are in AI mode, our customer care team is working today and will take care of you.
Mc: Point 1 is planned for this week.
Point 2 is currently being tested. Clicking on "my server is locked" when you contact customer care duplicates your request to the emergency line.
My server came back online rather quickly, but the data disk wasn't mounted anymore. A reboot didn't help either, I had to mount it manually and add it to /etc/fstab. I run my mail and web daemons from there, and they didn't start at reboot. To those that have problems with services that fail to restart, check if your data drive is mounted, even when the web interface tells you the drive is attached to your server.
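The check described above (is the data drive really mounted, and will it be remounted at boot?) can be scripted. This is a minimal sketch under assumptions: the mount point `/srv/data` in the usage note is a hypothetical example, and the fstab parsing only looks at the second whitespace-separated field of each non-comment line.

```python
import os


def is_mounted(path: str) -> bool:
    """Return True if path is currently a mount point."""
    return os.path.ismount(path)


def in_fstab(mount_point: str, fstab_path: str = "/etc/fstab") -> bool:
    """Return True if fstab has a non-comment entry for mount_point."""
    wanted = mount_point.rstrip("/")
    try:
        with open(fstab_path) as fstab:
            for line in fstab:
                # Drop comments, then split into whitespace-separated fields.
                fields = line.split("#", 1)[0].split()
                # The second field of an fstab entry is the mount point.
                if len(fields) >= 2 and fields[1].rstrip("/") == wanted:
                    return True
    except FileNotFoundError:
        return False
    return False
```

If `is_mounted("/srv/data")` is false, the disk needs to be mounted by hand; if `in_fstab("/srv/data")` is false, it will not come back by itself after the next reboot, whatever the web interface says about the disk being attached.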
Every hosting provider I have experience with either professionally or personally has had these sorts of problems to some extent, so it's not that big a deal in my opinion... But Gandi needs to learn from it and I hope all the things that went wrong are properly analysed so next time something big goes down there are fewer side effects.
After all, there are a lot of bad electricians and plumbers on the market...
My server was down for a few hours and came back online by itself. I had troubles in the past, and I assume there'll be trouble in the future, but as Gandi hasn't been a hosting company for long and the virt technology is still a bit young, I don't really blame them. I'll pay attention to what the future will tell us and will look for other solutions for sure, but big outages like that do happen sometimes, at Gandi or at other places...
I guess you learned how to handle this kind of outage; I guess we learned (again?) to have some backup server at hand...
Good work, but please do better next time
Was there an outage? Neither I nor my server seems to have noticed (last reboot was sometime in December).
I do sympathise with you though, it's just one of those things which just jump at you out of the blue. I too had an electrician come around to do some maintenance and lean on a certain panic button, which stopped a multi-million dollar operation dead in its tracks. Admittedly the button should have been properly guarded. It was in a cramped space and the electrician in question was a professional wrestler, thus a bit of a tight fit. I guess we all learned from that one.