Lead routing disrupted because of a database issue
Incident Report for SalesCandy
Postmortem

What Happened?

We encountered a performance issue with the database backing our lead routing and real-time messaging services. This resulted in intermittent errors from functions using the database, which impacted at least three users who reported different issues. The incident necessitated temporarily halting the routing service and carrying out a data restoration exercise to mitigate any loss of data.

The incident was finally resolved by addressing a malfunctioning component and improving the overall database performance.

We apologise for any inconvenience to our customers. Please contact support@salescandy.com if you notice any of the following data inconsistencies due to the incident:

  • More or fewer leads in routing than expected being reported in the Manager Portal
  • Dropped or Won leads being rerouted to another salesperson
  • Undelivered messages and notifications
  • Erratic app status updates (frequent blinking of “red” light showing connection errors)

In some cases, such as undelivered messages, we regret that the data may be unrecoverable within a practical timeframe, as it may not have been successfully written to the database during the performance issue.

Timeline of Events

On the day of the incident, we received user reports such as:

  • A lead being routed to a second user after it had already been accepted by another user
  • A user receiving repetitive messages every minute

On our observation dashboards, we noted increased error rates from functions that depend on this particular database, as well as abnormal metrics from the database itself.

Oct 14, 9.15pm The incident was declared by Reza Rosli, once the errors reported by the users mentioned above had been traced to this database issue.

The root cause was found in a Lambda function that processes one of several message queues that feed into the database. Sometime on October 13th, the function failed to encrypt the message contents it received within the time allocated, causing it to time out and not return the appropriate response, though not before it had retransmitted the data and saved it to the database. Because no success response was returned, the messages remained in this particular queue and were processed again by the function with the same error, ad infinitum.
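For illustration, the sketch below (in Python, with hypothetical names; this is not the actual SalesCandy code) shows how this kind of loop can arise with a queue-triggered Lambda function: when the work overruns the function timeout, the batch is never acknowledged, so the queue makes the same messages visible again and the error repeats.

    # Hypothetical sketch of the failure mode, assuming an SQS-triggered Lambda.
    # If the handler is killed by the Lambda timeout before it returns, the
    # whole batch is treated as failed and redelivered after the visibility
    # timeout, even though some records were already written to the database.
    import json
    import time


    def encrypt(payload: dict) -> bytes:
        """Stand-in for the encryption step that overran the time allocated."""
        time.sleep(2)  # imagine this pushing the invocation past its timeout
        return json.dumps(payload).encode()


    def save_to_database(blob: bytes) -> None:
        """Stand-in for the database write, which did complete before the timeout."""
        print(f"wrote {len(blob)} bytes")


    def handler(event: dict, context: object) -> dict:
        for record in event["Records"]:
            payload = json.loads(record["body"])
            save_to_database(encrypt(payload))  # earlier records are written...
        # ...but if the timeout strikes before this return, no success response
        # is sent, the messages reappear in the queue, and the same records are
        # processed (and written) again on every retry, ad infinitum.
        return {"statusCode": 200}


    if __name__ == "__main__":
        demo_event = {"Records": [{"body": json.dumps({"leadId": "example-1"})},
                                  {"body": json.dumps({"leadId": "example-2"})}]}
        handler(demo_event, context=None)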

The situation caused increased memory and CPU pressure on the database, until it started returning errors due to out-of-memory conditions. This began to impact the reliability of SalesCandy’s lead routing and real-time messaging.

9.15pm The routing service (“Carousel”) was stopped to reduce pressure on the database.

9.42pm The malfunctioning Lambda function was given additional time to complete successfully, resolving the root cause. The message queue was cleared.

9.47pm The routing service was restarted; observation continued.

10.29pm Database performance was noted to be suboptimal but tolerable, and it did not worsen. The incident was considered resolved. Further investigation continued to determine the impact on the data within the database and to tune the database to its new performance envelope.

Oct 15, 4.00am It was found that, because the database is an in-memory database, it had started to evict data as it reached its memory limits. The evicted data was repopulated in a recovery process, which completed at 4.00am.
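The report does not name the database engine; assuming a Redis-compatible in-memory store, a quick check along the lines of the sketch below would show whether keys had been evicted once the memory limit was reached, which is the condition the recovery process had to repair.

    # Hedged sketch, assuming a Redis-compatible in-memory store (engine and
    # endpoint are assumptions). Once used memory reaches `maxmemory`, the
    # configured eviction policy silently drops keys; a growing `evicted_keys`
    # counter is the signal that data needs to be repopulated.
    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)  # hypothetical endpoint

    memory = r.info("memory")
    stats = r.info("stats")

    print("used_memory_peak :", memory["used_memory_peak"])  # peak use, not the average
    print("maxmemory        :", memory["maxmemory"])         # 0 means no explicit limit
    print("maxmemory_policy :", memory["maxmemory_policy"])  # e.g. allkeys-lru
    print("evicted_keys     :", stats["evicted_keys"])       # > 0 means keys were dropped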

6.00am New database settings were put into effect. We continued to monitor its performance.

3.26pm Database performance had not improved, and we observed that the dependent services were still impacted, although at a lower rate than before. A follow-up observation ticket for this incident was created while we investigated a solution to restore the database to optimal performance. https://salescandy.statuspage.io/incidents/y6ty27tggbmt

5.30pm Database performance was restored, and we observed that the error rates of the dependent services had reduced significantly. The incident was considered closed.

Learnings

The error that started this chain of events revealed a weak spot in our system. It also revealed a systemic bug that allowed a lead to be routed to another salesperson after it had already been assigned; this bug was also fixed during the incident.

Critical message queues, such as the one involved in this error, should be more carefully monitored and safeguarded against inadvertently causing floods. In this case, the error occurred at a low rate (once a minute) and did not trigger the throttling we had expected would prevent it. However, even that slow rate was enough to destabilise the impacted database. We are looking into improving this process.
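As one example of the kind of safeguard being considered (assuming the queue is Amazon SQS, consistent with the Lambda consumer; the queue URL and ARN below are hypothetical), a dead-letter queue with a maximum receive count parks a repeatedly failing message for inspection instead of letting it retry forever, even when the retry rate is too low to trip throttling.

    # Hedged sketch (hypothetical queue URL and ARN), assuming Amazon SQS: a
    # redrive policy moves a message to a dead-letter queue after it has been
    # received `maxReceiveCount` times without being deleted, so a poison
    # message cannot loop indefinitely against its consumer.
    import json

    import boto3

    sqs = boto3.client("sqs")

    sqs.set_queue_attributes(
        QueueUrl="https://sqs.ap-southeast-1.amazonaws.com/123456789012/leads-messages",  # hypothetical
        Attributes={
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": "arn:aws:sqs:ap-southeast-1:123456789012:leads-messages-dlq",  # hypothetical
                "maxReceiveCount": "5",  # after 5 failed attempts, park the message
            })
        },
    )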

The database was under-provisioned: it was still in essentially the same single-shard configuration as when it was first deployed, even though its usage patterns had changed drastically since then. While there was more than enough memory available for data storage, not enough memory was reserved for query overheads, and as ever larger amounts of data accumulated, resulting in more queries, the balance was tipped. This was solved by tuning various parameters and adding an additional shard to the database, ensuring it has enough capacity to handle the growing amount of data that will inevitably come.
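If the database is a managed cluster that supports online resharding (for example ElastiCache for Redis in cluster mode; the report does not name the service, so this is an assumption), adding the extra shard is a single reconfiguration call along the lines of the sketch below, with hypothetical identifiers.

    # Hedged sketch (hypothetical identifiers), assuming an ElastiCache for
    # Redis cluster with cluster mode enabled: online resharding from one node
    # group (shard) to two spreads both the stored keys and the query load.
    import boto3

    elasticache = boto3.client("elasticache")

    elasticache.modify_replication_group_shard_configuration(
        ReplicationGroupId="leads-routing-db",  # hypothetical replication group name
        NodeGroupCount=2,                       # from a single shard to two
        ApplyImmediately=True,
    )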

An observability issue was also noted: database memory use was reported as an average value (a misleadingly low number), so we failed to see the peaks when it used more than its allocated memory and to relate them to other symptoms. Looking at the wrong metrics prevented us from recognising the problem earlier.
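To illustrate the metrics gap (assuming CloudWatch-style monitoring of a managed in-memory database; the namespace, metric name, and cluster ID below are assumptions), querying the Maximum statistic alongside the Average makes the short memory peaks visible instead of averaging them away.

    # Hedged sketch (hypothetical namespace, metric, and cluster ID), assuming
    # CloudWatch metrics: requesting the Maximum statistic alongside the Average
    # exposes short memory peaks that an average-only dashboard smooths into a
    # misleadingly low number.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",                 # assumption about the service
        MetricName="DatabaseMemoryUsagePercentage",  # assumption about the metric
        Dimensions=[{"Name": "CacheClusterId", "Value": "leads-routing-db-0001"}],  # hypothetical
        StartTime=end - timedelta(hours=6),
        EndTime=end,
        Period=300,                                  # 5-minute buckets
        Statistics=["Average", "Maximum"],           # Maximum is what reveals the peaks
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], "avg:", round(point["Average"], 1), "max:", round(point["Maximum"], 1))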

The root cause analysis for this incident was a complex exercise due to the non-obvious nature of the error: the database is highly active with many concurrent clients, any of which could have been the cause of the performance degradation. Through this incident we have learned more about the database’s behaviour under stress and the contributing factors, which will be addressed in future optimisation activities.

Reported by

Reza Rosli, CTO

Posted Oct 14, 2022 - 20:56 GMT+08:00

Resolved
This incident has been resolved.
Posted Oct 14, 2022 - 02:29 GMT+08:00
Monitoring
The routing service is now restarted and we are monitoring for further issues.
Posted Oct 14, 2022 - 01:47 GMT+08:00
Update
The root cause of the issue has been identified and solved. However, the routing service will remain offline until further testing.
Posted Oct 14, 2022 - 01:42 GMT+08:00
Update
We are continuing to work on a fix for this issue.
Posted Oct 14, 2022 - 01:16 GMT+08:00
Identified
We have noticed an issue with the database backing our leads routing system and we have disabled routing until the database issue is resolved. We are working to restore the service as soon as we can.
Posted Oct 14, 2022 - 01:15 GMT+08:00
This incident affected: Lead Routing.