Lottery Post Journal

Frozen in Grand Central Station

This is a pretty cool stunt!

http://www.maniacworld.com/frozen-in-grand-central-station.html


Enjoyed Kentucky

I just spent four days in Kentucky (Louisville area), and I wanted to mention to all the members from that state what a nice place it is.

The people were very friendly, and the weather even cooperated.  (I left just before the snow hit on Monday.)

Seems like a nice place to live.

Lottery Post Web Server Update

Day by day I continue making progress on the Lottery Post web server in my attempt to restore things to normal.

(For those who weren't aware of the issue, I'd suggest reading this post first.)

It's been a very long week since this whole thing started, so I guess I'll just start by giving some insight into the problem, and then talk about where we're at.

Last Monday the primary Lottery Post web server died right before my eyes.  Normally problems happen when I'm off doing something else and I discover it several minutes later, but this time was unique in that I was right in the middle of working on something on the server, and ..... blink ..... it stopped working.

I called the data center and asked them to help me cycle the power on the machine, and when it didn't start back up I knew something was wrong, in a bad way.  Even if the server had blue-screened (crashed) a power cycle would at least bring it back up.

So my first course of action was to bring up the site (and a bunch of other sites that I host on the same server) on a backup server.  Fortunately, I have been keeping a "warm spare" ready ever since the site had some problems a few months ago, so it wasn't quite as daunting as it could have been.

All told, between the various attempts at restarting the primary web server, bringing up fresh code on the backup, and a bunch of tasks required to get it operational, Lottery Post was down for about two hours.  Not too bad, all things considered.  It could have been much worse.

After that, I spent a couple days trying to diagnose exactly what went wrong, forming a plan to get everything fixed and back in original condition, and continuing work with the backup server to get additional sites running and as many parts of Lottery Post operational as possible.  It's at times like this when I realize just how gargantuan and complex this site is.  There are so many working pieces that you can't really consider them all at once — you need to tackle them piece by piece.

By Wednesday I decided that this problem would require a trip to the data center, which is much more involved than some people would think.  Many assume that Lottery Post is run on a computer that I can walk across the room to get to, but in reality Lottery Post is hosted on a series of computers located in a data center in a different part of the country from me.  It's a major trip (and expense) for me to go there, so I avoid it except when it's really necessary.

Also on Wednesday I determined the root cause of the failure: two of the three hard disks in the web server's RAID-5 array went bad.  It's something with minuscule odds of ever happening (two of three hard drives failing simultaneously), so my guess is that one drive failed and then the other failed soon after.  Maybe the second failure happened when the remaining drives were picking up the load from the first failed drive.
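Just to put some rough numbers behind "minuscule odds" (these failure rates are made-up, purely for illustration, not measurements from my actual drives), here's the kind of back-of-the-envelope math I mean:

    # Hypothetical per-drive failure rates -- illustrative numbers only.
    annual_failure_rate = 0.03              # assume a 3% chance a given drive dies in a year
    p = annual_failure_rate / 52            # rough chance of dying in any given week

    # If the three drives fail independently, the chance of two of them
    # dying in the same week is roughly 3 * p^2 (three possible pairs).
    independent_double_failure = 3 * p * p

    # More realistic: one drive dies, and the extra stress on the survivors
    # makes a second failure far more likely before anything is replaced.
    p_second_under_stress = 0.10            # hypothetical 10% chance under stress
    correlated_double_failure = 3 * p * p_second_under_stress

    print(f"Independent double failure in a week: {independent_double_failure:.9f}")
    print(f"Correlated double failure in a week:  {correlated_double_failure:.9f}")

The stressed scenario comes out orders of magnitude more likely than the independent one, which is why I suspect the second drive was a casualty of the first.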

RAID-5 arrays are great things to have, because they allow a server to keep running without interruption if any single hard drive fails.  (In my three-drive array, that means one disk can die without causing a problem.)  Hard drives rarely fail anyway, so having the ability to lose a drive and keep going means you can theoretically go without ever having a server die from hard drive failure.
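For anyone curious how that works, here's a tiny sketch of the idea behind RAID-5 parity (greatly simplified; this has nothing to do with my actual controller): each stripe stores an XOR parity block, so the data from any single lost drive can be rebuilt from the surviving drives.

    # Simplified RAID-5 idea: parity is the XOR of the data blocks in a stripe.
    # Lose any ONE drive and its block can be rebuilt by XOR-ing the rest.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    # Three "drives": two hold data, one holds parity
    # (real RAID-5 rotates the parity block across drives per stripe).
    drive_a = b"lottery "
    drive_b = b"post.com"
    parity  = xor_blocks([drive_a, drive_b])

    # Simulate losing drive_b: rebuild it from the survivor plus the parity.
    rebuilt_b = xor_blocks([drive_a, parity])
    assert rebuilt_b == drive_b

    # Lose TWO of the three drives, though, and there is nothing left
    # to XOR against -- which is exactly the hole my server fell into.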

Of course, theory is one thing, and reality is another.

So I caught an early Thursday morning flight, and by the afternoon I was in the data center.  (I'll avoid saying exactly where the data center is located.)  I spent the next 48 hours working with the existing servers, installing another new server, and taking the opportunity to install some new equipment that should help me remotely diagnose and fix problems in the future much more quickly than I can today.

Things went as planned in the data center, and everything is in good shape there.

The one big missing piece in all of this is the RAID-5 array from the primary web server, the disks that failed.  I sent the disks to a data recovery company, which spent several days diagnosing them.  Apparently at least one of the disks actually had something mechanically wrong with it, not just a failed sector or two.  The analysis took longer than expected because they had to remove the platters from the drive, place them in a different housing, and then do a very detailed sector-by-sector analysis on each disk.

By Monday (yesterday) I finally got an answer from them.  They believe they will be able to extract all, or nearly all, of the files on the disks.  I don't expect that any member who uploaded files to the server will lose any of them.  That's the good news.

The bad news (at least for me) is that the extraction will take a couple of weeks and comes at a very steep price.  So while I'm happy that I'll probably be able to set everything straight, I could literally buy a few extra servers for the cost of recovering the data.  There's no "company" that pays for it; I do.  To say it is painful is an understatement.

There were some other twists and turns that made things interesting this week — like when I found over the weekend that the network interface cards in the new web server were malfunctioning and causing crashes.  I guess the new server didn't want to get left out of the action.

If you noticed (up until this afternoon) that Lottery Post was freezing up every now and then, that was the network card crashing the server.  I was able to get a new NIC (network interface card) shipped to the data center and installed by this afternoon, and everything has been running well since then.

So that's my update, for anyone who may be wondering what's going on.  Things should be relatively stable for the next couple of weeks, and then I hope to get everyone's uploaded files back from the data recovery folks and re-load them to the server.

If anyone has questions pertaining to their individual situation, please send me a PM.  Let's keep specific questions like that in private, and off the forums and blogs.  If there are general or larger questions and/or comments, feel free to leave a comment here.