Core database connectivity issues

Incident Report for SalesCandy

Postmortem

What Happened?

12.30pm MYT

We received reports from users that new leads were not being routed to salespersons. On checking, we found that the routing service could not query the core database (“CandyBase”) and was therefore unable to complete its tasks.

It was unclear at this time whether other services or functions were also experiencing similar errors, but it was possible.

2.45pm MYT

The cause was found to be a permissions error on the database. The previous day, a database administrator’s user account had been removed from the database, which left other database users unable to access the orphaned stored objects (e.g. database views and stored procedures) that the account had created.

2.50pm MYT

The affected object permissions were reset and the affected services were restored to working order. The incident was considered closed at 3.00pm MYT.

Learnings

This incident was a new experience for our team because we were not aware that removing a user from a database could have this impact. We learned that a better understanding of MySQL Stored Object Access Control would let us make our database more resilient against this type of error. Typically, personal user accounts would not be privileged enough to create such objects, and the objects would instead be owned by a system account that is unlikely to be deleted. Almost all of our stored objects are owned by such an account, but in this case one particular database view happened to have been created by this user, probably by mistake.
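As an illustration of the ownership problem described above, the following is a minimal sketch of how orphaned stored objects could be found before they cause errors. It is not the procedure we actually used. The queries read the DEFINER column of the MySQL information_schema tables and the account list in mysql.user; the helper names (StoredObject, find_orphaned_objects) and the example account and view names are hypothetical, and running the queries against a live server is left out.

```python
from typing import Iterable, List, NamedTuple

# Queries that list each stored object together with its definer account.
# They are shown for reference only; executing them needs a live connection.
DEFINER_QUERIES = {
    "view": "SELECT TABLE_SCHEMA, TABLE_NAME, DEFINER FROM information_schema.VIEWS",
    "routine": "SELECT ROUTINE_SCHEMA, ROUTINE_NAME, DEFINER FROM information_schema.ROUTINES",
    "trigger": "SELECT TRIGGER_SCHEMA, TRIGGER_NAME, DEFINER FROM information_schema.TRIGGERS",
    "event": "SELECT EVENT_SCHEMA, EVENT_NAME, DEFINER FROM information_schema.EVENTS",
}

# Existing accounts, formatted like the DEFINER column ("user@host").
ACCOUNTS_QUERY = "SELECT CONCAT(User, '@', Host) FROM mysql.user"


class StoredObject(NamedTuple):
    kind: str     # "view", "routine", "trigger" or "event"
    schema: str
    name: str
    definer: str  # "user@host" as reported by information_schema


def find_orphaned_objects(
    objects: Iterable[StoredObject], accounts: Iterable[str]
) -> List[StoredObject]:
    """Return the objects whose definer account no longer exists.

    Assumption: an exact string match is good enough for this sketch.
    """
    known = set(accounts)
    return [obj for obj in objects if obj.definer not in known]


# Hypothetical data standing in for the query results.
example_objects = [
    StoredObject("view", "candybase", "lead_summary", "system_account@%"),
    StoredObject("view", "candybase", "lead_assignment", "former_dba@%"),
]
example_accounts = ["system_account@%", "routing_service@%"]
orphans = find_orphaned_objects(example_objects, example_accounts)
```

Running such a check immediately before and after removing an account would show which objects, if any, depend on that account.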

As the timeline shows, the issue was resolved within 5 minutes once the root cause was understood. However, the issue was not detected sooner, and once it was detected it took more than 2 hours to analyse, because:

1. The error was unexpected: the deleted user account was not used by any codebase, so no testing or monitoring was deemed necessary after the change.
2. The error escaped detection because the routing service appeared to behave normally. The service user was still able to connect to the database and perform most functions until it reached this error, which occurred only when it tried to assign a user to a lead. That was a relatively infrequent event, so the problem became apparent only after unassigned leads had accumulated.
3. As mentioned in (2), the service user was still able to connect to the database, but the error logs showed access being denied to the user in a way that appeared random (i.e. we thought that the database was dropping connections randomly, implying a networking or configuration issue). This was a red herring that complicated the root cause analysis.
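One way the gap in post-change testing could be narrowed is a small smoke test, run as the service user after any account change, that touches every view the services rely on. The sketch below is an assumption about how such a check might look, not something that existed at the time of the incident. The function name smoke_test_views and the run_query callable are hypothetical; run_query stands in for whatever database client the service uses and is expected to raise an exception when a statement fails. View names are assumed to be trusted identifiers, since they are interpolated into the statement.

```python
from typing import Callable, Iterable, List, Tuple


def smoke_test_views(
    views: Iterable[str], run_query: Callable[[str], object]
) -> List[Tuple[str, str]]:
    """Try a zero-row SELECT on each view and collect the failures.

    Returns a list of (view name, error message) pairs; an empty list
    means every view could be opened by the current user.
    """
    failures = []
    for view in views:
        try:
            # LIMIT 0 returns no rows, but the server still has to
            # resolve the view and check access to it.
            run_query(f"SELECT 1 FROM {view} LIMIT 0")
        except Exception as exc:  # broad on purpose: any failure is reported
            failures.append((view, str(exc)))
    return failures
```

A non-empty result points directly at the failing object, which may help avoid mistaking the error for a networking or configuration problem.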

Hopefully with this new experience, similar issues can be avoided and diagnosed more quickly in the future.

From the learnings of this event, the following tasks were undertaken and completed:

The privileges of the remaining administrators' accounts have been reduced so that they can no longer create objects that are prone to being orphaned when those accounts are eventually deleted.
The access control settings of all stored objects were reviewed and altered where applicable to avoid similar incidents in the future.
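A review of this kind could be repeated periodically with a check like the sketch below, which lists the accounts that still hold privileges allowing them to create stored objects. This is a hedged illustration, not the exact review that was carried out. The query reads the GRANTEE and PRIVILEGE_TYPE columns of information_schema.USER_PRIVILEGES and SCHEMA_PRIVILEGES, so table-level grants are not covered. The function name, the choice of privileges to flag, and the example account names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# Global and schema-level grants, one privilege per row.
# Shown for reference; executing it needs a live connection.
PRIVILEGES_QUERY = """
SELECT GRANTEE, PRIVILEGE_TYPE FROM information_schema.USER_PRIVILEGES
UNION ALL
SELECT GRANTEE, PRIVILEGE_TYPE FROM information_schema.SCHEMA_PRIVILEGES
"""

# Privileges that allow an account to create views, routines,
# triggers or events (assumed list for this sketch).
OBJECT_CREATING_PRIVILEGES = {"CREATE VIEW", "CREATE ROUTINE", "TRIGGER", "EVENT"}


def accounts_able_to_create_objects(
    rows: Iterable[Tuple[str, str]], exempt: Iterable[str] = ()
) -> Dict[str, List[str]]:
    """Map each non-exempt grantee to its object-creating privileges.

    `rows` are (GRANTEE, PRIVILEGE_TYPE) pairs. GRANTEE values are quoted
    by the server, e.g. "'user'@'host'", so `exempt` must use that form.
    """
    exempt_set = set(exempt)
    found: Dict[str, set] = {}
    for grantee, privilege in rows:
        if privilege in OBJECT_CREATING_PRIVILEGES and grantee not in exempt_set:
            found.setdefault(grantee, set()).add(privilege)
    return {grantee: sorted(privs) for grantee, privs in found.items()}


# Hypothetical rows standing in for the query result.
example_rows = [
    ("'system_account'@'%'", "CREATE VIEW"),
    ("'admin_a'@'%'", "CREATE VIEW"),
    ("'admin_a'@'%'", "SELECT"),
    ("'admin_a'@'%'", "CREATE ROUTINE"),
]
flagged = accounts_able_to_create_objects(example_rows, exempt=["'system_account'@'%'"])
```

An empty result would mean that only the exempted system account can create such objects.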

Posted Jan 05, 2023 - 23:52 GMT+08:00

Resolved

This incident has been resolved.

Posted Jan 05, 2023 - 15:02 GMT+08:00

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 05, 2023 - 14:50 GMT+08:00

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 05, 2023 - 14:42 GMT+08:00

Update

We are continuing to investigate this issue.

Posted Jan 05, 2023 - 13:22 GMT+08:00

Investigating

We are investigating database connectivity issues that are preventing lead routing from working for some tenants.

Posted Jan 05, 2023 - 12:31 GMT+08:00

This incident affected: Lead Routing.