(Simulation) Main database failure causing outage
Incident Report for SalesCandy
Postmortem

What happened?

This was a simulation only. There was no downtime, and we did not perform an actual restore of the production database.

On the afternoon of Wednesday, 26 January 2022, we hosted a full drill simulation test, a requirement of our Business Continuity Management, in which all Candyholics engaged in a brainstorming discussion around a test scenario to discover and evaluate a given plan.

This Business Continuity full drill simulation test practiced the disaster recovery of the main SalesCandy database (CandyBase). We tested the design and effectiveness of the full plan with a simulated disaster event and a simulated restoration procedure. Our goals were to evaluate participants' understanding of the plan and their roles, and to let them experience how a test is carried out. We also evaluated the internal and external communication and reporting components of the plan, and tested its assumptions and preparedness, e.g. ensuring each participant has the appropriate access to perform their role. From this event, we are able to evaluate the plan design and identify the improvements needed.

The scenario began on a normal working day at SalesCandy, with the system confirmed to be functional, when the CandyBase service failed. As a result, all users were unable to use the SalesCandy system. However, the database server itself was still up and running normally, showing no errors in monitoring.

Around 3.35PM, we started to receive reports from users that they were unable to log in to the platform. The technical team was unaware of the problem until CST reported that something was wrong. In this scenario, we were simulating an event where a developer made a mistake and caused the data in the production database to be corrupted.

The operation was taking a long time to complete under the circumstances, which CST reported at 3.42PM. In addition, the staging database was switched off during failover in this drill, so we needed PRP to send leads to check that they all reached the database once it was back up. The operation completed in almost one hour and thirty-three minutes, at 5.15PM. By this time, we observed that performance was back to normal, and we continued monitoring and fine-tuning the configuration.

At 5.11PM, having observed no further issues, we officially declared the incident resolved.

The same evening, we continued observations and checked whether our database handling functioned as expected during the incident.

Learnings

On further investigation, we found that this happened because one of our developers had mistakenly used the production database for his test, overwriting it with the wrong data. To fix this issue, we restored the database to an earlier point, a backup at 4.55PM.

In future, we will be more careful and place more restrictions on database access. The necessary action has been taken internally regarding the mistake made by the developer.
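
As an illustration of one possible restriction on database access, the sketch below shows a guard that a test script could call before connecting. It is only a minimal example under stated assumptions: the host name in PRODUCTION_HOSTS, the function ensure_not_production, and the URL format are all hypothetical and do not describe the actual CandyBase setup.

```python
from urllib.parse import urlparse

# Hypothetical host names; the real CandyBase hosts are not given in this report.
PRODUCTION_HOSTS = {"candybase-prod.example.internal"}


class ProductionAccessError(RuntimeError):
    """Raised when a test run points at a production database."""


def ensure_not_production(database_url: str, allow_production: bool = False) -> str:
    """Return the URL unchanged unless it targets a known production host."""
    host = urlparse(database_url).hostname
    if host in PRODUCTION_HOSTS and not allow_production:
        raise ProductionAccessError(
            f"refusing to run tests against production host {host!r}"
        )
    return database_url
```

A test script would call ensure_not_production(url) before opening a connection, so that a mistyped or copied production URL fails immediately instead of overwriting data. An allow-list of test hosts would be a stricter alternative to this block-list.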

Posted Feb 08, 2022 - 15:38 GMT+08:00

Resolved
This incident has been resolved.
Posted Jan 26, 2022 - 17:11 GMT+08:00
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 26, 2022 - 16:55 GMT+08:00
Update
We are continuing to work on a fix for this issue.
Posted Jan 26, 2022 - 16:11 GMT+08:00
Update
We are continuing to work on a fix for this issue.
Posted Jan 26, 2022 - 16:04 GMT+08:00
Identified
The DEV team replaced the user table on the server with the wrong test table. We are continuing to work on a fix for this issue.
Posted Jan 26, 2022 - 16:03 GMT+08:00
Update
We are continuing to investigate this issue.
Posted Jan 26, 2022 - 16:01 GMT+08:00
Update
We are continuing to investigate this issue.
Posted Jan 26, 2022 - 15:58 GMT+08:00
Investigating
We received 20 reports from users that they are unable to log in to their accounts on both the manager portal and the mobile app. CST tried to log in and succeeded, but noticed that the data shown is not the data we have in production, and CST is unable to find the client account.
Posted Jan 26, 2022 - 15:54 GMT+08:00
This incident affected: Lead Sources (CandyNumber, CandyPixel, Email Parser, Facebook Lead Ads, iProperty API, Lead Source API, WhatsApp Business API), Mobile Application (Authentication (Logins), Real-time Messaging, Management API), Manager Portal (Authentication (Logins), Application), and Graph API, CandySync Webhook, Lead Routing.