Database stability issues
Incident Report for SalesCandy
Postmortem

What happened?

On 20th May at around 1.30PM (MYT), all users actively using the Manager Portal and Mobile App began to experience authentication errors, i.e. they were logged out and unable to log back in.

This event was caused by our primary database, which experienced very high load and rebooted itself. Although the functions automatically failed over to the secondary database, which would normally have meant only a short service interruption, the high-load conditions prior to the event left some of the database tables stuck in a deadlock state. These tables could not be updated until a manual action was taken to reset the database.

Unfortunately, we only became aware of this situation after we received escalated user reports of not being able to log in, because our monitoring system had reported that the database was active and receiving connections normally. However, at the time we were already investigating some abnormalities that followed the initial restart event, such as lower-than-expected CPU usage, and so were able to respond quickly.
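The gap described above, where monitoring saw a database that accepted connections while some tables could not be updated, is the kind of condition a write-based health probe is meant to catch. The following is a minimal sketch only, not a description of our monitoring system. It uses Python's built-in sqlite3 module as a stand-in for the production database, and the function name write_probe and the table name health_probe are hypothetical.

```python
import sqlite3
import time


def write_probe(conn, table="health_probe"):
    """Return True only if a small write actually commits.

    A check that merely opens a connection would still pass when tables
    are locked; this probe fails in that case. `table` is a hypothetical,
    trusted name (it is interpolated into the SQL, so never pass user input).
    """
    try:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, ts REAL)"
        )
        # Overwrite a single row so the probe table never grows.
        conn.execute(
            f"INSERT OR REPLACE INTO {table} (id, ts) VALUES (1, ?)",
            (time.time(),),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        # Lock timeouts and closed connections both land here.
        try:
            conn.rollback()
        except sqlite3.Error:
            pass
        return False
```

In this sketch the connection's own timeout (for example sqlite3.connect(path, timeout=0.1)) bounds how long the probe waits on a lock, so a stuck table shows up as a failed probe rather than a hung check.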

After it was clear that the database needed to be reset and rebooted again, we took action and service was restored at 1.50PM.

Only functions related to authentication and authorisation, i.e. login and permissions checking, were affected. Other functions such as lead ingestion and routing were unaffected.

Learnings

On further investigation, we found that the high-load event was caused by some new functions related to authentication and authorisation that were introduced in our most recent deployment on 18th May. These new functions were more demanding on our database resources, and on 19th May we noticed elevated error rates caused by hitting resource limits, which led to intermittent API request failures on that day. That night, we upgraded our database servers to match the higher demand. However, the new functions, which no longer encountered errors and were able to complete, did so relatively slowly and, crucially, became slower as the day’s load increased. At some point, requests to these functions began to exceed our applications' request timeout period, whereupon the applications started to retry their requests. The retries contributed even more load and finally triggered the database’s safety mechanisms, which caused it to reboot.
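A common way to keep client retries from amplifying load during a slowdown like this is to cap the number of retries and space them out with exponential backoff and random jitter. The following is a generic sketch of that technique, not the change we deployed. The names call_with_retry and backoff_delays and the default values are hypothetical, and it assumes a timed-out request surfaces as Python's built-in TimeoutError.

```python
import random
import time


def backoff_delays(max_retries=3, base=0.5, cap=8.0, rng=random.random):
    """Yield one delay per retry: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, max_retries=3, base=0.5, cap=8.0,
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on TimeoutError wait a jittered, growing delay and retry.

    After max_retries retries the last TimeoutError is re-raised, so a
    struggling backend sees a bounded number of extra requests.
    """
    delays = backoff_delays(max_retries, base, cap, rng)
    while True:
        try:
            return fn()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the error
            sleep(delay)
```

The jitter spreads retries from many clients over time instead of letting them arrive together, and the retry cap bounds how much additional load a slow backend receives.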

We have now completed optimisations of the new functions, which have greatly reduced execution time and the load on the database. Users will notice improved response times in the Manager Portal and Mobile App (v3.3+), and can be assured of the stability of the database.

Posted May 24, 2021 - 18:32 GMT+08:00

Resolved
This incident has been resolved.
Posted May 20, 2021 - 15:05 GMT+08:00
Monitoring
The database issue was solved and services are back to normal. We are investigating the root causes.
Posted May 20, 2021 - 13:53 GMT+08:00
Identified
The issue has been identified and a fix is being implemented.
Posted May 20, 2021 - 13:39 GMT+08:00
Investigating
We are investigating a database instability issue causing disruptions to Manager Portal and Mobile Application operations.
Posted May 20, 2021 - 13:32 GMT+08:00
This incident affected: Manager Portal (Application) and Mobile Application (Management API).