Disk problems
By Ryan on Friday 16 May 2008, 12:00 - Gandi
Despite extremely low odds considering the type of hardware used, we lost 2 disks on a filer in under 3 minutes yesterday. While we do our best to warn you of any anticipated problems, we again stress the importance (and your contractual obligation) of maintaining an up-to-date backup of your server at an external location in the unlikely event that just such an incident might arise.
At 7:50 PM (GMT) we lost a third disk and all of its data
Our teams spent the night trying to recover the RAID 6 volume of filer 13, though this was eventually deemed not possible.
To be totally transparent, we are not going to hide behind the fact that we are still in the Beta testing phase, or that our hosting contract requires you to maintain a backup of your data on an external machine; the loss of your data is totally unacceptable. Even if what happened had nearly no chance of occurring, it did, and those of you who were affected by this will be given a full refund.
Let us now go into detail about the changes we are making in the platform's disk architecture.
Right from the start, we have been having rather frequent disk problems with:
- sporadic disk loss (which went unnoticed because of RAID6 - until yesterday),
- temporary freezing that occurred during disk access,
- the fact that a filer failure might lead to the loss of data for our customers.
We have therefore decided to extend the Beta testing phase until we have addressed all these issues. Our plan, even if it leads to additional costs, is to change the RAID structure so that your data is constantly replicated on 2 different filers. Work is advancing quite nicely towards this goal, and we hope to be able to give you good news about this soon.
In the meantime, what has occurred once can occur again, and we strongly advise you to keep regular, up-to-date backups of your data on a local disk at your own location. For those of you who are uncomfortable with Linux, we are in the process of writing a tutorial that will help you do this.
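This is not the tutorial mentioned above, only a minimal sketch of what a scripted backup might look like, written in Python. The function name `make_backup` and the example paths are hypothetical; it simply packs one directory into a timestamped archive.

```python
import tarfile
import time
from pathlib import Path


def make_backup(source_dir, dest_dir):
    """Pack source_dir into a timestamped .tar.gz file inside dest_dir.

    Returns the path of the archive that was written.
    """
    source = Path(source_dir).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # The timestamp keeps successive backups from overwriting each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname stores paths relative to the directory name,
        # not the absolute path on the server.
        tar.add(str(source), arcname=source.name)
    return archive


# Hypothetical usage: archive a web root into a local backup directory.
# make_backup("/var/www", "/root/backups")
```

An archive that stays on the same server does not protect against a filer failure; it would still need to be copied to a machine at another location, for example with `scp` or `rsync`.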
Note: at 3:00 PM (GMT) we were forced to reboot all of the servers to stop the snowball effect that was starting to increase the load on the machines dangerously.
Note 2: The filers in question are all from the same manufacturer. Others are currently being configured. A big thank you to the teams that have been working non-stop since yesterday morning. Keep up the good work, I'm proud of you (Stephan)
Note 3 (Friday 7:35 PM GMT): the VMs that were directly impacted by the failure of filer 13 will probably not be brought back online until Sunday afternoon. In the meantime, please write here if there is an emergency.
Comments
I had a petition running on one of my VMs. It only started 2 days ago - so although I have a backup from last night I've lost around 1000 signatures.
This REALLY sucks.
oh yeah - and another thing: where can I see what parts of my VM are affected? The system disk wouldn't be too much of a problem. Is there any chance that my data disk is on a different filer?
How do we know if we were affected? My VM claims to be still running but I can't connect...no emails from Gandi either.
Hi guys,
AFAIK I did not perceive service disruption until one hour ago, this morning around 9:30 GMT. Please keep us informed of ongoing maintenance and thanks for the hard work! (Sh.. happens)
Woodomat: yes, both your vm and your disk were unfortunately affected. I just sent you a personal e-mail with more information.
Nick and enz: if you did not get our mail then you were not affected by the incident. However, the e-mail addresses you both used for logging into the Gandi Bar are not tied to any Gandi handles. I encourage you to contact Customer Support with your account details for more information: http://www.gandi.net/faq/form_conta...
Hi Ryan,
I received the mail from Nicolas, and at that time, my server was still doing fine (apparently). Only later I noticed a partial, followed by a complete disruption of service. I opened my 1-share server with IP 92.243.13.39 on May 8th.
Cheers,
Enz
To all: we are experiencing some side effects from the non-responsive filer. We will probably reboot some servers in order to quickly solve the problem and stabilize the platform.
I saw my machine (92.243.1.73) shut down at Fri May 16 15:53:59 2008, but it is still unreachable... Is that normal (more than 25 minutes to reboot)?
Hey there,
Seems like there are some more problems at Gandi now. My disk was unaffected until now, but suddenly all my servers are down. Just before the weekend, which is when we actually make money.
I understand Gandi is doing its best to recover but there must be a way to avoid such situations.
Hey, it is good that these kinds of problems come out during the testing phase - it is a whole lot worse for it to happen in production! Kudos for providing up front information about the situation. It makes me feel confident in choosing you for my hosting when the service is production-ready.
Despite this problem I still like the Gandi hosting.
If this had been any other company, they probably would have hidden behind some PR and I would have left them on the spot, the only reason I would stay with Gandi is this, they tell us exactly what is happening.
Thank You and I hope this problem is a part of the past soon.
Since I did not get any email I would assume that my servers were not affected by this incident.
But as "GT894" asked before - - how long does it take to reboot servers?
GT894: no, the average time for a reboot is more like 2 minutes, but as the 64 servers on your machine reboot at the same time, it takes longer.
Broadcast message from root (Fri May 16 16:37:03 2008):
The system is going down for system halt NOW!
Saw that while ssh'd in. Hope that's the reboot you talked about. The VM appeared to work fine until then. It's taking a while to do the reboot now, though.
Same for me. As Nicolas mentioned, some (or all) of these physical machines have 64 servers, and if one server takes 2 minutes then we are talking about a little more than 2 hours.
Let's hope it goes well!
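The estimate in the comment above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 64 servers restart strictly one after another at 2 minutes each, which is the commenter's reading rather than anything Gandi confirmed, and the function name is made up for the example.

```python
def sequential_reboot_minutes(servers, minutes_each):
    """Total time if servers are rebooted one at a time (assumed worst case)."""
    return servers * minutes_each


total = sequential_reboot_minutes(64, 2)
hours, minutes = divmod(total, 60)
print(f"{total} minutes = {hours} h {minutes} min")  # 128 minutes = 2 h 8 min
```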
Although the downtime was unexpected, it is certainly nice that Gandi is fully transparent on this matter. As with new technologies, a few glitches may happen from time to time (another recent example was the Debian OpenSSL bug).
Works again and disks are fine.
Me too: I haven't received any e-mail, and my server and websites are still down!!
Kudos to Gandi being open about this!
Server still not accessible... Hope it won't last too long
I used my server for just 15 minutes... then it froze
I guess it was a bad time for me to register, just before this problem occurred.. lol!!
I hope this problem will be fixed as soon as possible
More news on the French thread... all reboots may be completed within one hour... They may not be running servers on eeePC
Quite an off-topic comment, but why does my server seem to have varying ports open at the moment?
Just now these for example:
21/tcp open ftp
554/tcp closed rtsp
3052/tcp open PowerChute
7070/tcp closed realserver
They definitely aren't my services (and nothing actually seems to be listening on them, even though they appear open when probed with nmap)
And yes, otherwise my server seems to be down as well
(Sorry for double post)
I was wondering why my site went down yesterday. At around 2am last night I got the e-mail (although I didn't get up until just now). It would have been nice if I had been e-mailed earlier.
In any case, my server was still unresponsive today, I couldn't SSH into 92.243.13.150. Logged into Gandi and tried the big reboot button but it's been stuck on "Server being rebooted (Pending)" for a while now.
I'm using GandiAI for the ease of it. The way I read it, my problems should have been fixed by now and I could have SSHed in without a problem this morning, but that does not appear to have happened.
My server started responding to ping a few minutes ago. But SSH access doesn't work.
Is it possible for Gandi support to give us an update on what is happening with the general reboot, and the expected timeline, so that we can also tell our clients what is happening?
Best.
"More news on the French thread... all reboots may be completed within one hour..."
1. I'm not French!!
2. Server still not rebooted!!
full reboot done for 80% of total servers.
Notus: ssh will come, os is starting
Well, this is why it's beta testing. At least Gandi is honest about the problems and trying to find a solution.
This "disk loss" and "temporary freezing" – is it a problem with RAID 6 or some specific hardware issues? Sounds like a substantial problem for hardware that's supposed to protect against data loss.
28. On Friday 16 May 2008, 19:05 by Nicolas (Gandi)
full reboot done for 80% of total servers.
Notus: ssh will come, os is starting
----------
It looks like you are having troubles other than just rebooting. I still cannot access SSH, and one hour looks like more than enough for an OS to start.
Continuing the transparency policy you have followed today, can you tell us what problems you are still facing, why the reboot is taking so long, and whether you think more disks than the ones initially mentioned might also have problems (even for the people who didn't receive the disaster e-mail)?
And please be honest: if the problem will only be solved on Monday (I hope not), please tell us now, rather than saying "it is almost solved" at one-hour intervals until Monday.
I would prefer to tell my clients we are down until Monday than to tell them it will be up in one hour, only for it not to be, and have them call me again, just as I am writing to you again.
Keep up the good work and thanks for the support
I can assure you guys that Gandi is working hard on it:
All my servers are back on track!
My server is still unresponsive, as it has been since about 9am AZ time yesterday. It isn't a problem for me because I'm still developing my site, so that is at least fortunate. I'm just antsy to work on my project and I'm sure I'm not the only one. At least we are being kept updated though.
Servers not directly impacted by filer 13 and the data loss should all be up and running now... the filer 13 replacement process is still ongoing
Hi,
Is the creation of new data disks also having problems? I'm trying to create a new disk but the operation is taking much longer than the last times.
I'm also trying to stop my server 92.243.5.252 so that I can delete it and create a new one, but the operation is pending and I cannot delete it.
Is it also possible to create new servers at this stage?
Best
Sorry, new disk and server creations have been put on hold until those affected by filer 13 are back online; we are concentrating our efforts on that before reopening server creation.
1) Can somebody explain how a problem on 1 filer (filer 13) can affect the whole platform? Does that mean that every time a filer goes down (hopefully not ever happening again) all servers on Gandi Hosting are affected?
2) I had 2 server drives fail within a few minutes of each other about 10 years ago. They were on RAID 1. All data was lost. Ever since then, on mission-critical projects I've been using 6+1. I think this is what you are trying to accomplish as well. What is important, however, is to have a backup tool. Hard disk storage comes very cheap these days. I think it is imperative that Gandi provides an easy backup and restore tool, if only to safeguard against occurrences like today's. Backing up off-site is nice but will result in heavy daily traffic at your data centers as well (unless this is a way to increase profit from data transfers). Since you have multiple data centers, I think it should be quite doable to have an on-site and off-site backup that syncs every day.
Right now you are positioning yourself as a great hosting alternative at an unbeatable price. I would hope that you keep that unbeatable edge by keeping the price the same AND providing a free backup. If you do that instead of increasing the price, I am convinced that your hosting success will be enormous, as people worldwide will swarm to your company.
It may lose you money in the short term to build these extra backup pools, but it will make you money in the long term, because customers who move to your company will stay. Good price, safe data. Clean and simple. If you build your company on that, Gandi will soon be a brand recognized worldwide for excellence in hosting.
Good luck and thank you for providing us with honest information.
Best regards,
Chris
24 hours passed.. nothing happened =_="
Can someone update me on the current status? I have a 1-share VM with Debian, but since the crash it has been inaccessible through SSH. I tried to reboot it from the web panel (where the VM appears to be running), but after 3 hours it is still pending.
I agree with Chris. I'm building a server with 10 shares and 100 GB of data disk holding the data of 300 clients, which I intend to launch when you reach the final phase. It will be impossible to have an effective remote backup policy with that amount of data, and it will create unnecessary traffic that will penalize your infrastructure more than my server. And if the data on this server is lost, my business will be ruined and I will have 300 angry people after me.
And at 10€ a share (10x10 = 100€), if something like this happened and I lost data, you can be sure I would terminate the service with you the next minute, because I could find similar prices on the market. At the current price (6x10 = 60€), I would probably take more time deciding whether to leave or not.
I can add that even now I'm a little scared to go ahead with you, seeing that one filer brought your whole hosting infrastructure down, creating a downtime of almost 6 hours for everyone, and a loss of data plus a downtime of over 60 hours and counting for others.
With your hosting approach, a solid backup policy, and prices a little lower than the ones you are planning for the final phase, I have no doubt that you will become a world reference in hosting.
Best,
36 hours... still nothing
Too bad I'm among the 20% of servers that Gandi still hasn't restarted
I'm developing a website that has a deadline. I need my server to be up or else I'm wasting my time.. I can't do anything
Anyhow, is there any way to move my server to another filer/cluster/place or whatever you may call it?
Anyhow, is there any way to move my server to another filer/cluster/place or whatever you may call it?
--------------
One of my servers stopped on Thursday at around 7 PM and has been down ever since. I was trying to delete the affected server to create a new one, but Gandi has also blocked the creation of new servers, so we are stuck.
My server is now in a new state, "blocked". Can you tell me what is happening?
One thing that is missing from your side (Gandi) is an estimate of how long it will take to solve the problem for the affected servers. And please give us the worst-case scenario and not a vague "ASAP"
Good luck!
They might not have any spare drives sitting around and may need to order them or whatnot. Ever rebuilt multi-TB arrays before (and run fsck)? It can take hours and hours...
I suggest that those who have their server up again check their system memory usage. At least my server seems to now have some kind of problem on that side after all servers had the forced reboot. Although with one share the system can see 256MB total ram, there's actually only 138MB usable and after that the system starts swapping... and that isn't nice.
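The comment does not say how these figures were measured. For anyone wanting to run a similar check, here is a rough Python sketch that reads the standard Linux `/proc/meminfo` file. Treating "usable" as free memory plus buffers plus page cache is an assumption made for this example, and the function names are hypothetical.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (most fields are in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values


def usable_percent(info):
    """Rough share of RAM not held by processes: free + buffers + page cache."""
    usable = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return 100.0 * usable / info["MemTotal"]


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # only present on Linux
    info = parse_meminfo(meminfo.read_text())
    print(f"MemTotal: {info['MemTotal']} kB, "
          f"roughly usable: {usable_percent(info):.0f}%")
```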
@Chris Boyle, I also had that before the disk troubles, but it wasn't that severe. Before the forced reboot I had about 80% of 256MB usable and now it's down to 54%. That 80% was still enough to run the services I need, but 54% is quite limiting. I asked about that memory issue two weeks ago on #gandi-hosting (@freenode) and got this reply from aegiap (Nicolas Chipaux) of Gandi:
"It's related to Xen memory page. We are investigating to fix this problem."
So that would indicate that they at least know about the problem but it would be nice to have it fixed since I don't see much point for me to buy another share to get more memory when the current share isn't having the memory it should.
I should have checked here before I sent another support ticket. Looks like I'm not the only one who still can't create or install anything.
Welp, it's been 3 days now and I still can't access my site. Good luck with everything but I'm moving on.
I am still having problems with my site, and the support department does not answer my e-mails (I don't know why I keep trying, only 15% of my e-mails to them are answered)
By "problems" I mean that the total amount of RAM assigned to it is so low that the services can't start. I am only using 3 shares. I tried rebooting the server but nothing changed.
Could you (Gandi) please update me on the status of the recovery? My clients are getting nervous