Incident Report

All US communities were offline for several hours on 17-Nov

  • 18 November 2022

We had a major incident tonight that took all US communities offline for more than 6 hours. You can find the incident page here.


We are deeply sorry for this unacceptable lapse in service. At inSided, we realize that incidents happen. In this case, however, our out-of-office-hours response was not good enough: US customers were left with no service, no acknowledgment from us, and no way to contact us. We are treating this with the highest possible priority and will, of course, learn from it to prevent any repeat in the future. We have outlined several points we can act on quickly to reduce a significant point of failure. Beyond those, we need to improve our internal incident escalation process in case anything like this happens again out of hours.
 

Brief overview:

We had a total downtime of 6 hours and 47 minutes, which impacted all US-hosted customers and took all of their communities completely offline.

This occurred due to a critical failure in US-based infrastructure. Mitigating actions were taken late because of lapses in the incident procedures in place.

The end result was that between 00:00 UTC and 06:47 UTC (01:00-07:47 CET; 4pm-10:47pm Pacific) all users were faced with an unbranded, technical 502 error message.
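For anyone who wants to double-check the time conversions above, here is a minimal sketch. It assumes the window fell on 18 November 2022 in UTC (inferred from the post date and the 17-Nov US date in the title) and uses fixed standard-time offsets for CET (UTC+1) and Pacific (UTC-8), which apply in mid-November. The names `window_in` and `downtime` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumed calendar date: the post is dated 18 November 2022 and the
# title refers to 17-Nov in US time, so 00:00 UTC is taken as 18 Nov.
START_UTC = datetime(2022, 11, 18, 0, 0, tzinfo=timezone.utc)
END_UTC = datetime(2022, 11, 18, 6, 47, tzinfo=timezone.utc)

# Fixed standard-time offsets (no daylight saving in mid-November).
CET = timezone(timedelta(hours=1), "CET")
PST = timezone(timedelta(hours=-8), "PST")


def window_in(tz):
    """Return the outage window as (start, end) strings in the given zone."""
    fmt = "%Y-%m-%d %H:%M"
    return (
        START_UTC.astimezone(tz).strftime(fmt),
        END_UTC.astimezone(tz).strftime(fmt),
    )


def downtime():
    """Return the total length of the outage window."""
    return END_UTC - START_UTC


if __name__ == "__main__":
    print("CET:", window_in(CET))
    print("PST:", window_in(PST))
    print("Downtime:", downtime())
```

Under those assumptions, the window comes out as 01:00 to 07:47 CET on 18 November and 16:00 to 22:47 Pacific on 17 November, for a total of 6 hours and 47 minutes.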

The post-mortem will give an overview of the timeline, incident breakdown, actions taken, and actions planned. It can be found here: https://status.insided.com/incidents/25j27dc1n57d


4 replies

Hello Alex. Thank you very much for the candid company response. During this outage I checked the inSided status site multiple times, and it showed as active with no issues, which led me to do more troubleshooting in an attempt to remedy what I believed to be an issue on my end. Is the status page updated manually by the inSided team? If so, what is the average time from incident occurrence to status page update? Thank you.


This might be a very obvious tip, but if you have problems, it is always a good idea to check whether other communities (in the same region) are working, no matter what the status page says.
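If you want to automate that check, here is a rough sketch using only the Python standard library. It is one possible approach under stated assumptions, not an official inSided tool: the function names (`fetch_status`, `is_down`, `diagnose`) are made up for illustration, and you would supply your own community URL plus the URLs of a few other communities you know are hosted in the same region.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still useful (e.g. 502).
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return None


def is_down(status):
    """Treat unreachable hosts and 5xx responses as 'down'."""
    return status is None or status >= 500


def diagnose(own_url, peer_urls, fetch=fetch_status):
    """Compare your community with peers hosted in the same region."""
    if not is_down(fetch(own_url)):
        return "own community is reachable"
    if not peer_urls:
        return "own community is down; no peers to compare against"
    if all(is_down(fetch(url)) for url in peer_urls):
        return "likely a platform-wide outage"
    return "likely specific to your community or network"
```

You would call it as `diagnose(your_community_url, [peer_url_1, peer_url_2])`. The `fetch` parameter exists so the logic can be tried with canned status codes without touching the network.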


Hi @Jesse.Wilson.
The Status Page is a combination of automated and manual steps.
The first step is automated: when a customer calls the support number, it triggers a process that alerts the Out Of Office Hours (OOH) team so that the complete incident process starts.
In this case, the OOH team was not alerted. There was an inexcusable break in the incident process.
As you can imagine, Friday was not a good day at the office, above all because the most important thing is to serve our customers with a reliable platform, and we did not.
This requires a much deeper dive into the complete incident response process, which we have already started.
I'll add a longer post in the first week of December, with more detail on the incident, the action items, and the learnings.


 

This might be a very obvious tip, but if you have problems, it is always a good idea to check whether other communities (in the same region) are working, no matter what the status page says.

 

I also like to use this free online tool, which instantly checks whether a website is down for everyone or whether the problem is just on your end.

DownForEveryoneOrJustMe.com 
