12.30pm MYT
We received reports from users that new leads were not being routed to salespersons. On checking, we discovered that the routing service was unable to query the core database (“CandyBase”) and was therefore unable to complete its tasks.
It was unclear but possible that other services or functions were also experiencing similar errors at this time.
2.45pm MYT
The cause was found to be a permissions error on the database. The previous day, a database administrator’s user account had been removed from the database, leaving other database users unable to access the orphaned stored objects (e.g. database views and stored procedures) that the user had created.
2.50pm MYT
The affected object permissions were reset and the affected services were restored to working order. The incident was considered closed at 3.00pm MYT.
This incident was a new experience for our team: we were not aware that removing a user from a database could have this impact. We learned that a better understanding of MySQL Stored Object Access Control would help us make our database more resilient against this type of error. Typically, user accounts would not be privileged enough to create such objects, and such objects would instead be owned by a system account that is unlikely to be deleted. Almost all our stored objects are owned by such an account, but in this case one particular database view had been created by this user, probably by mistake.
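One way to catch this class of problem before it causes an outage is to compare the definer of each stored object against the accounts that still exist. The sketch below is illustrative only and is not the check our team ran. It assumes the definers are read from the `DEFINER` columns of `information_schema.VIEWS` and `information_schema.ROUTINES`, and the accounts from `mysql.user`. The function names and the sample object and account names are hypothetical.

```python
# Illustrative queries: run them with any MySQL client and pass the rows
# to find_orphaned_objects(). They are not executed by this sketch.
DEFINER_QUERY = """
SELECT 'VIEW' AS object_type, TABLE_SCHEMA, TABLE_NAME, DEFINER
  FROM information_schema.VIEWS
UNION ALL
SELECT ROUTINE_TYPE, ROUTINE_SCHEMA, ROUTINE_NAME, DEFINER
  FROM information_schema.ROUTINES
"""
ACCOUNT_QUERY = "SELECT CONCAT(User, '@', Host) FROM mysql.user"


def normalize_account(account):
    """Reduce 'user'@'host' or user@host to a comparable user@host string."""
    user, _, host = account.rpartition("@")
    user = user.strip().strip("'`\"")
    host = host.strip().strip("'`\"")
    # Host names are compared case-insensitively; user names are not.
    return f"{user}@{host.lower()}"


def find_orphaned_objects(objects, accounts):
    """Return the objects whose definer is not among the existing accounts.

    objects:  iterable of (object_type, schema, name, definer) tuples
    accounts: iterable of 'user@host' strings
    """
    existing = {normalize_account(a) for a in accounts}
    return [o for o in objects if normalize_account(o[3]) not in existing]


if __name__ == "__main__":
    # Hypothetical sample data standing in for real query results.
    sample_objects = [
        ("VIEW", "candybase", "v_leads", "alice@%"),
        ("PROCEDURE", "candybase", "route_lead", "'svc_app'@'localhost'"),
    ]
    sample_accounts = ["svc_app@localhost"]
    for obj in find_orphaned_objects(sample_objects, sample_accounts):
        print("orphaned:", obj)
```

Running a check like this before removing an account, with that account left out of the account list, would show which objects need a new owner first.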
As the timeline shows, the issue was resolved within 5 minutes once the root cause was understood. However, the issue was not detected earlier, and once it was detected it took more than 2 hours to analyse, because:
Hopefully with this new experience, similar issues can be avoided and diagnosed more quickly in the future.
Based on the lessons from this event, the following tasks were undertaken and completed: