I was really looking forward, after 3 months of ludicrous intensity, to having a bit of a celebration with the launch of 5.1.
Except that one tiny oversight (the spark) and a secretly ticking time bomb conspired to make this evening one of the most fraught periods we’ve ever had.
One of the ‘plumbing’ features (not something you could use directly, but something that worked with our server behind the scenes) was only of use to paid users, yet was accidentally left turned on for free users (a leftover from our early freemium years).
That meant that by the time 5.1 had finished auto-updating for all our users, we had massively greater demand on our servers than we expected.
Cue very unhappy servers that slowed to a crawl under the unexpected burden.
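For the technically curious, here is a minimal sketch of the kind of guard that keeps a paid-only plumbing feature from firing for free users. It is not our actual code: the names `Plan`, `FEATURE_PLANS`, `serverSync` and `isFeatureEnabled` are all made up for illustration. The one deliberate choice is that anything not explicitly listed is off by default, so a forgotten setting fails quietly rather than sending extra traffic to the server.

```typescript
// Illustrative sketch only: every name here is hypothetical.
type Plan = "free" | "paid";

// Which plans may trigger each server-backed feature.
// "serverSync" is an invented example of a 'plumbing' feature.
const FEATURE_PLANS: Record<string, Plan[]> = {
  serverSync: ["paid"],
};

// Returns true only if the feature is explicitly listed for this plan.
// Unlisted features are treated as off (deny by default).
function isFeatureEnabled(feature: string, plan: Plan): boolean {
  const allowed = FEATURE_PLANS[feature];
  return allowed !== undefined && allowed.indexOf(plan) !== -1;
}
```

With this in place, `isFeatureEnabled("serverSync", "free")` is false, so a free user's copy would never make the request in the first place.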
Why did that affect Firefox?
One of our early developers had unwisely made one of ActiveInbox’s critical requests to the server lock the browser up until it got a response.
While technically this is a bad thing to do, no one spotted it because when our server was healthy and responding instantly, you’d never notice.
But as soon as the server slowed down, Firefox started feeling like it was locking up for seconds at a time.
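To make the difference concrete, here is a rough sketch of the non-blocking alternative: the request runs in the background, and if the server takes too long the caller gets a fallback value instead of waiting. This is an illustration under assumptions, not the fix we shipped. `simulatedServerRequest` is a made-up stand-in for a real network call, and `requestWithTimeout` is a hypothetical helper name.

```typescript
// Illustrative sketch only: both function names are hypothetical.

// Stand-in for a real server call; delayMs simulates how loaded the server is.
function simulatedServerRequest(delayMs: number): Promise<string> {
  return new Promise<string>((resolve) => {
    setTimeout(() => resolve("ok"), delayMs);
  });
}

// Resolves with the server's answer, or with `fallback` if the answer
// has not arrived within timeoutMs. The caller is never blocked while waiting.
function requestWithTimeout<T>(
  request: Promise<T>,
  timeoutMs: number,
  fallback: T
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), timeoutMs);
    request.then(
      (value) => {
        clearTimeout(timer); // answer arrived in time, cancel the fallback
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

Calling `requestWithTimeout(simulatedServerRequest(300), 20, "timed out")` settles with the fallback after roughly 20ms, while a healthy server that answers within the limit yields its real response. Either way, the rest of the browser stays responsive in the meantime.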
How did we tackle it?
Tom, Adhip & I frantically began trying to understand what was happening. Once we had it understood, we began fixing everything.
But we were in a slightly unusual situation in that discovering & developing the fix didn’t instantly make everything better. We had to wait a few hours for everyone to upgrade to the new version (5.1.2) before the demand on our server started to level off, and individuals were able to work smoothly again.
What’s the moral of the story?
First, never assume that launch day will go smoothly, even with 1000s of beta testers stress testing a new version. We’ll always over-power our servers on launch day in future, just in case.
We don’t actually like doing ‘big’ launches and this has further increased our caution: there’s simply too much shock with changes, and too much that can go wrong with big new systems. We’ll be switching back to a mostly incremental approach. (Although I confess I did at least enjoy the response to unveiling something like 5.1, which we’re massively proud of.) The benefit to you is that we’ll be rolling out little refinements faster!
And as we continue to change the way our servers work, today has been undeniably educational. We’ve learnt more in the last few hours than in the last 3 months. We’ll walk more confidently down the path of revamping the servers to robustly handle unexpected demand (if you’re interested, we’ll do this primarily by breaking it up into independent, optimised components; and building on Amazon’s world class infrastructure).
Ultimately, I (Andy) just want to apologise. The reason we were so frantic (and I had the doubly harrowing job of trying to relay what was going on to the forum while simultaneously trying to figure it all out) is that I hate letting you all down, even for an afternoon.
And my gratitude to everyone on the forum who first reported it, and then patiently gave us updates as time went on. As ever, you were wonderfully helpful – thank you!
This post was written by andymitchell9496