Database issue causing inaccurate reporting data
Incident Report for SalesCandy
Postmortem

What Happened?

12.11pm MYT

We received reports from Manager Portal users that the dashboard summary data did not match the expected numbers when compared with lead lists. On investigation, we found that the database responsible for reporting functions (“CandyLake”) was in an unhealthy state. One of its nodes had restarted automatically following a routine maintenance event, but an issue during the restart caused some data to be lost, which produced the inaccurate numbers.

3.46pm MYT

We restored the database from a recent snapshot taken before the data loss and started reindexing to bring the database up to date with the latest data.

8.00pm MYT

The database restoration was completed and we began monitoring for performance issues.

9.00pm MYT

The database and reports appeared to be performing as expected, and the incident was considered resolved.

Learnings

  • Our database is a cluster of multiple nodes, with data replicated across more than one node to reduce the risk of data loss. However, data loss can still happen for various reasons. In this case, a database node was restarted and then failed to fully recover its data from the replicas on other nodes.
  • In this instance, we had to recover the data from automated snapshots and then reindex so that the database had all the latest data.
  • We were able to diagnose the issue and solve it using observational methods and prebuilt scripts which had been created in anticipation of such an event.
  • The database was successfully restored without any complications.
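
As an illustration of the snapshot-then-reindex recovery described in the learnings above, here is a minimal sketch in Python. It is not the actual CandyLake tooling or the prebuilt scripts mentioned, which this report does not describe. The `Record` type, the `restore_and_reindex` function, and the in-memory dictionaries standing in for the snapshot and the source data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """A hypothetical reporting record; field names are illustrative only."""
    record_id: str
    updated_at: datetime
    payload: str


def restore_and_reindex(snapshot, snapshot_time, source_records):
    """Rebuild a reporting index from a snapshot plus newer source records.

    snapshot: dict mapping record_id -> Record, as captured at snapshot_time.
    source_records: iterable of Record from an assumed system of record.
    Returns the rebuilt index and the number of records that were reindexed.
    """
    # Step 1: restore. Start from a copy so the snapshot itself stays untouched.
    index = dict(snapshot)

    # Step 2: reindex. Only records changed at or after the snapshot time need
    # replaying; writing by record_id is an upsert, so replaying a record that
    # the snapshot already holds is harmless.
    reindexed = 0
    for record in source_records:
        if record.updated_at >= snapshot_time:
            index[record.record_id] = record
            reindexed += 1
    return index, reindexed


if __name__ == "__main__":
    snapshot_time = datetime(2023, 1, 2, 6, 0)
    snapshot = {"lead-1": Record("lead-1", datetime(2023, 1, 1, 9, 0), "new")}
    source = [
        Record("lead-1", datetime(2023, 1, 2, 10, 0), "contacted"),  # changed later
        Record("lead-2", datetime(2023, 1, 2, 11, 0), "new"),        # created later
    ]
    index, count = restore_and_reindex(snapshot, snapshot_time, source)
    print(count, sorted(index))
```

Replaying only the records changed since the snapshot keeps the reindex shorter than a full rebuild, provided the source data carries a reliable modification time.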
Posted Jan 02, 2023 - 22:00 GMT+08:00

Resolved
This incident has been resolved.
Posted Jan 02, 2023 - 21:08 GMT+08:00
Update
The database recovery activity has been completed. We are monitoring for further issues.
Posted Jan 02, 2023 - 20:07 GMT+08:00
Monitoring
The root cause has been fixed and the database was restored from a previous snapshot. We are reindexing the database to sync with fresh data. This will take several hours.
Posted Jan 02, 2023 - 15:10 GMT+08:00
Identified
We have received reports that the Manager Portal dashboard is showing inaccurate data. On investigation it was found that our database holding the data for reporting is experiencing an issue that is causing it to fail to ingest new data.

The issue only affects reporting features in the Mobile App and Manager Portal that are generated from anonymised leads and lead logs data.

We are currently working to correct the issue.
Posted Jan 02, 2023 - 14:31 GMT+08:00
This incident affected: Manager Portal (Application).