Lottery Post Journal

Lottery Post Web Server Update

Day by day I continue making progress on the Lottery Post web server, in my attempt to restore things to normal.

(For those who weren't aware of the issue, I'd suggest reading this post first.)

It's been a very long week since this whole thing started, so I guess I'll just start by giving some insight into the problem, and then talk about where we're at.

Last Monday the primary Lottery Post web server died right before my eyes.  Normally problems happen when I'm off doing something else and I discover it several minutes later, but this time was unique in that I was right in the middle of working on something on the server, and ..... blink ..... it stopped working.

I called the data center and asked them to help me cycle the power on the machine, and when it didn't start back up I knew something was wrong, in a bad way.  Even if the server had blue-screened (crashed), a power cycle would at least bring it back up.

So my first course of action was to bring up the site (and a bunch of other sites that I host on the same server) on a backup server.  Fortunately, I have been keeping a "warm spare" ready ever since the site had some problems a few months ago, so it wasn't quite as daunting as it could have been.

All told, between the various attempts at restarting the primary web server, bringing up fresh code on the backup, and a bunch of tasks required to get it all operational, Lottery Post was down for about two hours.  Not too bad, all things considered.  It could have been much worse.

After that, I spent a couple days trying to diagnose exactly what went wrong, forming a plan to get everything fixed and back in original condition, and continuing work with the backup server to get additional sites running and as many parts of Lottery Post operational as possible.  It's at times like this when I realize just how gargantuan and complex this site is.  There are so many working pieces that you can't really consider them all at once — you need to tackle them piece by piece.

By Wednesday I decided that this problem would require a trip to the data center, which is much more involved than some people would think.  Many assume that Lottery Post is run on a computer that I can walk across the room to get to, but in reality Lottery Post is hosted on a series of computers located in a data center in a different part of the country from me.  It's a major trip (and expense) for me to go there, so I avoid it except when it's really necessary.

Also on Wednesday I determined the root cause of the failure: two of the three hard disks in the web server's RAID-5 array went bad.  It's something with minuscule odds of ever happening — having two of three hard drives failing simultaneously — so my guess is that one drive failed and then the other drive failed soon after.  Maybe the second failure happened when the remaining drives were picking up the load from the first one that failed.

RAID-5 arrays are great things to have, because they allow a server to keep running without interruption if any single hard drive in the array fails.  (In a three-drive array like mine, one of the three drives can die without causing a problem, and the array rebuilds itself once the bad drive is replaced.)  Hard drives rarely fail anyway, so having the ability to lose a drive and keep running means you can theoretically go without ever having a server die from hard drive failure.
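For anyone curious about the mechanics, here's a minimal Python sketch of the single-parity idea behind RAID-5 (purely illustrative and simplified; a real controller stripes data and parity across all the disks at the block level):

    # Illustrative sketch of single-parity (RAID-5 style) recovery.
    # With one parity block, any ONE missing block can be rebuilt by
    # XOR-ing the survivors; two missing blocks cannot.

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Three "drives": two data blocks plus one parity block.
    d1 = b"hello123"
    d2 = b"world456"
    parity = xor_blocks([d1, d2])

    # Lose any single drive, and the other two rebuild it:
    assert xor_blocks([d2, parity]) == d1
    assert xor_blocks([d1, parity]) == d2

    # Lose two drives at once (what happened here), and the single
    # surviving block is not enough information to recover either of
    # the missing ones.

That single parity block is exactly why the array can tolerate one failure but not two.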

Of course, theory is one thing, and reality is another.

So I caught an early Thursday morning flight, and by the afternoon I was in the data center.  (I'll avoid saying exactly where the data center is located.)  I spent the next 48 hours working with the existing servers, installing another new server, and taking the opportunity to install some new equipment that should help me remotely diagnose and fix problems much more quickly in the future than I can today.

Things went as planned in the data center, and everything is in good shape there.

The one big missing piece in all of this is the RAID-5 array from the primary web server — the disks that failed.  I sent the disks to a data recovery company, which spent several days diagnosing them.  Apparently at least one of the disks actually had something mechanically wrong with it — not just a failed sector or two.  The analysis took longer than expected because they had to remove the platters from the drive and place them in a different housing, and then perform a very detailed sector-by-sector analysis on each disk.

By Monday (yesterday) I finally got an answer from them.  They believe they will be able to extract all, or nearly all, of the files on the disks.  I don't expect that any member who uploaded files to the server will lose any of them.  That's the good news.

The bad news (at least for me) is that extraction will take a couple weeks and comes at a very steep price.  So while I'm happy that I'll probably be able to set everything straight within a couple weeks, I could literally buy a few extra servers with the cost of recovering the data.  There's no "company" that pays for it — I do.  To say it is painful is an understatement.

There were some other twists and turns that made things interesting this week — like when I found over the weekend that the network interface cards in the new web server were malfunctioning and causing crashes.  I guess the new server didn't want to get left out of the action.

If you noticed (up until this afternoon) that Lottery Post was freezing up every now and then, that was the network card crashing the server.  I was able to get a new NIC (network interface card) shipped to the data center and installed by this afternoon, and everything has been running well since then.

So that's my update, for anyone who may be wondering what's going on.  Things should be relatively stable for the next couple weeks, and then I hope to get everyone's uploaded files back from the data recovery folks and reloaded onto the server.

If anyone has questions pertaining to their individual situation, please send me a PM.  Let's keep specific questions like that in private, and off the forums and blogs.  If there are general or larger questions and/or comments, feel free to leave a comment here.

11 Comments:

  • Thanks for the update Todd.

    I guess the new server didn't want to get left out of the action.

    At least you still have some humor left.

    By Tenaj, at 10:05 PM

  • Todd, thanks. I have been dealing with restoring a mirror set (old OS/2 architecture and hardware); my problems are minor compared to what you are dealing with. We truly appreciate your work and talent. Data recovery from a catastrophic failure (or failures) is nerve-racking, and it is also especially difficult when your data center is in a geographically different location. Thanks again.

    By jarasan, at 10:18 PM

  • WOW! Thanx...lol

    I'm speechless. All I can say is thank you for your never-ending passion to keep this site alive. Traveling to another place to make sure it's done correctly is an exceptional quality on your part.

    After all, the Northrop name is in league with innovation, such as the collaboration behind the unveiling of the stealth bomber. It runs in your blood...lol

    By pacattack05, at 10:32 PM

  • Wow, that's a lot of work, boss, to keep things going... sorry to hear that there was a major malfunction. Glad you are able to piece it together.

    I won't pretend to understand the whole idea, but if I may go overboard here, isn't there a way to back up Lottery Post and your other sites in some way, like a tape reel or something like that? In the company I used to work for, they had thousands of files for our machine shops, and at the end of the day they would always tell us not to mess with their central unit as it was backing up files.

    By four4me, at 1:25 AM

  • @four4me: I can't afford to hire an IT staff for managing stuff like backups, and indeed it costs a lot in time and money to manage tape backups. So I do different kinds of backups, which is why we're in as good shape as we are. When an active server goes down -- a server that actively collects new data, like Lottery Post does -- you *always* lose something; it's just a matter of what and how much. In any environment a server going down is disruptive. For someone like me, who basically manages everything himself, having the ability to bring up a bunch of sites within two hours with minimal losses after the web server dies is really pretty good. Most web sites run by a single person could be down for days before they are able to get a similar site running again.

    I'm not knocking your question -- it's a good question. I'm just trying to explain how in my environment I do some things that are completely different than corporations with big IT budgets running tapes, but have similar effectiveness. I'm sure the next time something goes wrong I'll be in even better shape, because each time you experience something new you can learn new ways and methods of being prepared for anything.

    By Todd, at 8:15 AM

  • Thank you for everything you do. Don't understand much of what it is, but not sure I really want to.
    This has to be a real hobby for you with all the work and expense. The thought crossed my mind that you either have a really good paying job, or you hit the lottery sometime back. LOL.
    Anyway, just know that I like this site, and appreciate it and what you do.

    By rcbbuckeye, at 8:42 AM

  • @rcbbuckeye: I wish I did, but I'm not so lucky as that. I do what I do because I made a personal commitment to do so, that's all. Thanks for the compliments.

    By Todd, at 8:49 AM

  • Thank you very much for your hard work ... as always, good wishes your way to restore everything perfectly so it gives no more problems ... ever!

    By konane, at 10:05 AM

  • Wow! I guess I haven't been on here as much as I thought, lately. I couldn't even tell there was anything wrong. A telltale sign of a good webmaster. You have every right to give yourself a major pat on the back, Todd. Very nice job!

    By spy153, at 11:26 AM

  • Todd

    I'm just curious: how do you recover data when a hard drive fails to be detected?

    Well, it's pretty easy when there's a burned chip on the hard drive's main board; we can use the same board from another hard drive. But with a mechanical error (maybe the disk controller), like in this picture:

    http://www.fotosearch.com/comp/UNY/UNY789/the-inside-of-a-computer-hard-disk-drive-~-u11640557.jpg

    I've no idea how to recover it :( I thought hard drive manufacturers, or maybe the military, had a device that could handle any malfunctioning hard drive: just put in the platters and start copying the data :)

    By sysp34, at 1:08 PM

  • @sysp34: There are data recovery companies worldwide. You should find one or more that are close to you and speak with them. Use the one you feel is the most professional and uses the best technology to recover everything possible.

    By Todd, at 1:55 PM
