ActiveInbox Blog

Handling The Outage Today

Hi all,

It’s been a few weeks since I posted an update (more on why in a mo), but I wanted to write this partly to give me a little perspective, so we can improve our emergency processes in the future; but mostly to say sorry.

Even though we technically had no warning of the Gmail change, I still feel an immense tightening of the chest imagining everyone sat there wondering why ActiveInbox isn’t loading, and the frustration that causes.

And perhaps more so that, after 10 years, I still haven’t anticipated and safeguarded against every conceivable way that Gmail can break ActiveInbox. Hopefully this post-mortem will take us another step closer.

What Caused The Outage?

A small pleasantry in one of our minor features – the ability to show your name against your calendars when picking a due date – relied on a piece of data buried deep in Gmail.

Today, Gmail did a small update that broke that request for your name, which set off a series of escalating events that tripped up ActiveInbox and stopped it loading.

How Was It Detected?

Around 10am, we started getting the first notifications that two people couldn’t load ActiveInbox. Past experiences mean we react very quickly to these types of issues, as it’s often a “canary in the mine” suggestion that Gmail is changing, and we have a limited window before many people are affected.

As a consequence, by 11am I was doing a screenshare with Dale in the UK (one of our oldest customers), who very kindly let me run my diagnostics tools on his Gmail to find the problem.

How Quickly Was Everyone Informed?

While I was talking to Dale, Lisa tweeted that we were aware of the problem, and began responding to everyone who emailed in. The Get Satisfaction post that Dale had started became our official channel around 2pm.

How Long Did The Fix Take?

As a team we stopped everything to tackle this. The actual fix took about 2 hours, and was published to the Chrome Web Store as soon as we were done.

However, frustratingly, in recent months Chrome has slowed down our release of updates from 30 minutes to 24 hours. This has been the biggest toll on our responsiveness.

Is there a workaround in the meantime?

Joeri Cohen found that by going back to the old Gmail it would work (because it didn’t include the damaging Gmail change). Very kindly, that info was shared on the forum thread – thank you Joeri!

A more basic solution was to access your Gmail tasks via labels, because that’s how ActiveInbox works (it tries to store as much data as it can entirely within Gmail).

E.g. for your Low Priority items, look for the label “!Low Priority”. Or for items due today, look for the label ZD/20180710 (10th July 2018).
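
If you'd rather work out the due-date label than guess it, here's a rough sketch in Python. It assumes the pattern is simply "ZD/" followed by the year, month and day, which is inferred from the one example above. The function name is made up for illustration and isn't part of ActiveInbox.

```python
from datetime import date

# Prefix copied from the example label in this post (ZD/20180710).
DUE_LABEL_PREFIX = "ZD/"


def due_date_label(day: date) -> str:
    """Build the Gmail label name for items due on `day`.

    Assumption: the label is the prefix plus the date as YYYYMMDD,
    as suggested by the single example ZD/20180710.
    """
    return DUE_LABEL_PREFIX + day.strftime("%Y%m%d")


if __name__ == "__main__":
    # The date used in the example above: 10th July 2018.
    print(due_date_label(date(2018, 7, 10)))  # ZD/20180710
    # Today's label, to search for in Gmail.
    print(due_date_label(date.today()))
```

Searching Gmail for the label it prints should bring up the items due that day, as long as the naming assumption holds.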

How Could We Handle It Better In The Future?

We’ve had time to reflect on how we could improve to reduce the chance of this happening in the future. (As engineers, we never say never – but we want an extremely high likelihood of perfect running).

In terms of raw development speed, I don’t think we could have actually fixed it any faster than we did, and I’m immensely grateful to Dale.

The bottleneck at present is in getting updates distributed. To reduce this, we’re going to try to minimise the causes of our emergency responses:

Anything Else?

You may also notice I’ve been a little quiet for the last 5 weeks. It’s because, after the major Gmail change of a few months ago, we’re still dealing with the aftershocks, and I’ve had to go back to coding to help out the rest of the team.

The good news is, as a consequence of what we’ve been working on, another major improvement to the ActiveInbox code is about to begin testing. It will include:
