(Simulation) Users having trouble with login
Incident Report for SalesCandy
Postmortem

What happened?

This was a simulation only. There was no downtime, and we did not actually restore the production database.

On the afternoon of Wednesday, 29 December 2021, we hosted a tabletop test, a requirement of our Business Continuity Management. All Candyholics took part in a brainstorming discussion around a test scenario in order to explore and evaluate a given plan.

This Business Continuity tabletop test practiced disaster recovery of the main SalesCandy database (CandyBase). We tested the design and effectiveness of the full plan with a simulated disaster event and a simulated restoration procedure. Our goals were to evaluate participants' understanding of the plan and of their roles, and to let them experience how a test is carried out. We also evaluated the internal and external communication and reporting components of the plan, and tested its assumptions and preparedness, e.g. by ensuring that each participant has the appropriate access to perform their role. From this event, we can evaluate the plan design and identify the improvements needed before the next test, which will be a full drill.

The scenario began on a normal working day at SalesCandy with the system fully functional, when the CandyBase service failed. As a result, all users were unable to use the SalesCandy system. However, the database server itself was still up and running normally, showing no errors in monitoring.
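
The scenario above, where the server looks healthy while the data is wrong, suggests that a liveness check alone is not enough. The following is a minimal sketch of a check that also verifies a known sentinel record. It uses an in-memory SQLite database purely for illustration; the `accounts` table, the sentinel values, and the function names are assumptions and do not describe the real CandyBase schema or our monitoring setup.

```python
import sqlite3

# Hypothetical sentinel row that should always exist in a healthy database.
SENTINEL_EMAIL = "canary@example.invalid"
SENTINEL_NAME = "Canary Account"


def check_database(conn):
    """Return (alive, data_ok) for the given connection."""
    try:
        # Liveness: the server accepts and answers a trivial query.
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        return False, False
    try:
        # Data check: the sentinel row exists and holds the expected value.
        row = conn.execute(
            "SELECT name FROM accounts WHERE email = ?", (SENTINEL_EMAIL,)
        ).fetchone()
    except sqlite3.Error:
        return True, False
    return True, row is not None and row[0] == SENTINEL_NAME


def make_demo_db(corrupt=False):
    """Build a throwaway in-memory database for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (email TEXT PRIMARY KEY, name TEXT)")
    conn.execute(
        "INSERT INTO accounts VALUES (?, ?)", (SENTINEL_EMAIL, SENTINEL_NAME)
    )
    if corrupt:
        # Simulate test data overwriting the real rows.
        conn.execute("DELETE FROM accounts")
        conn.execute(
            "INSERT INTO accounts VALUES ('test@example.invalid', 'Test User')"
        )
    return conn


if __name__ == "__main__":
    print(check_database(make_demo_db()))
    print(check_database(make_demo_db(corrupt=True)))
```

In the corrupted case the database still reports as alive, but the data check fails, which is the signal that was missing in this scenario.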

Around 3 PM, we started to receive reports from users that they were unable to log in to the platform. The technical team was unaware of the problem until CST submitted reports that something was wrong. In this scenario, we simulated an event where a developer made a mistake that corrupted the data in the production database.

The operation took a long time to complete under the circumstances, which CST reported at 3:10 PM. It completed after almost an hour, at 3:50 PM. By then, we observed that performance was back to normal, and we continued monitoring and fine-tuning the configuration.

By 4 PM we had observed no further issues and officially declared the incident resolved.

The same evening, we continued observations and checked whether our database handling functioned as expected during the incident.

Learnings

On further investigation, we found that this happened because one of our developers had mistakenly used the production database for a test, overwriting it with wrong data. To fix the issue, we restored the database to an earlier point, a backup taken at 2 PM.

In future, we will be more careful and place more restrictions on database access. The necessary action has been taken internally regarding the developer's mistake.
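
One possible form such a restriction could take is a guard that test code calls before connecting to any database. The sketch below is illustrative only: the host name, the `ALLOW_PRODUCTION_DB` variable, and the function names are hypothetical and are not part of our actual configuration.

```python
import os

# Hypothetical production endpoint; replace with the real host names.
PRODUCTION_HOSTS = {"candybase-prod.example.invalid"}
# Hypothetical opt-in variable for the rare cases that truly need production.
OVERRIDE_VAR = "ALLOW_PRODUCTION_DB"


class ProductionAccessError(RuntimeError):
    """Raised when a test or script targets production without opting in."""


def require_non_production(db_host, environ=None):
    """Return db_host if it is safe to use, otherwise raise."""
    environ = os.environ if environ is None else environ
    is_production = db_host.strip().lower() in PRODUCTION_HOSTS
    if is_production and environ.get(OVERRIDE_VAR) != "1":
        raise ProductionAccessError(
            f"Refusing to connect to production host {db_host!r}; "
            f"set {OVERRIDE_VAR}=1 to override deliberately."
        )
    return db_host
```

A test suite could call `require_non_production(host)` in its setup step, so that a misconfigured connection string fails immediately instead of overwriting data. A guard like this complements, and does not replace, access controls enforced on the database itself.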

Posted Dec 30, 2021 - 13:38 GMT+08:00

Resolved
CandyBase has been restored through a DB cluster restore, and the database has been updated.

We observed that performance was back to normal and continued monitoring and fine-tuning the configuration.

At 3:59 PM we had observed no further issues and officially declared the incident resolved.
Posted Dec 29, 2021 - 15:59 GMT+08:00
Update
The issue has been identified. We are continuing to investigate while a fix is being implemented.
Posted Dec 29, 2021 - 15:50 GMT+08:00
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 29, 2021 - 15:48 GMT+08:00
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 29, 2021 - 15:48 GMT+08:00
Update
We are continuing to investigate this issue.
Posted Dec 29, 2021 - 15:45 GMT+08:00
Update
We are continuing to investigate this issue.
Posted Dec 29, 2021 - 15:35 GMT+08:00
Investigating
We have received 20 reports so far from users who are unable to log in to their accounts on both the manager portal and the mobile app. CST tried logging in to their own account and was able to log in, but noticed that the data shown is not the data we have in production, and CST is unable to find the client account.
Posted Dec 29, 2021 - 15:31 GMT+08:00
This incident affected: Lead Sources (CandyNumber, CandyPixel, Email Parser, Facebook Lead Ads, iProperty API, Lead Source API, WhatsApp Business API), Mobile Application (Authentication (Logins), Real-time Messaging, Management API), Manager Portal (Authentication (Logins), Application), and Graph API, CandySync Webhook, Lead Routing.