Our routing service comprises many individual worker processes, each dedicated to a tenant, that continuously process incoming leads and route them to that tenant’s users. To keep the workers running around the clock, they are built to restart automatically on detected failures, and they are also supervised by a watchdog service that restarts any worker that is unresponsive or fails to restart itself.
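The supervision model above can be sketched roughly as follows. This is an illustrative simplification, not our production code; the `Worker` class, `ping` method, and `watchdog_pass` function are hypothetical names for the concepts described.

```python
# Illustrative sketch of the supervision model: workers restart themselves
# on failure, and a watchdog restarts any worker that fails a liveness check.

class Worker:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def ping(self):
        # Answers as long as the worker process itself is up.
        return self.alive

    def restart(self):
        self.alive = True


def watchdog_pass(workers):
    """One supervision pass: restart every worker that fails its ping."""
    restarted = []
    for w in workers:
        if not w.ping():
            w.restart()
            restarted.append(w.name)
    return restarted
```

Note that in this sketch the liveness check is a simple ping, which only proves the process is up and responsive; this detail matters for the incident described below.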
Yesterday we experienced a high database load event in a related service that the routing workers depend on. The event automatically triggered an auto-scaling process, which added database instances to handle the extra load. During this event, some of the workers were interrupted mid-processing and became idle -- instead of halting or crashing, which would have triggered the automatic restart script to restore them to an operational state. Compounding the problem, the watchdog continued to see these processes as active and responsive, and therefore did not restart them.
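A minimal sketch of the failure mode, under the assumption that the watchdog's liveness check is ping-based (class and method names here are illustrative, not our actual implementation): a worker interrupted mid-processing still answers pings, so it looks healthy even though it has stopped making progress.

```python
import time

# Illustrative sketch of the gap: a stalled worker passes a ping-based
# liveness check, while a progress-based check would have flagged it.

class StalledWorker:
    def __init__(self):
        # Timestamp of the last lead this worker actually processed.
        self.last_progress = time.monotonic()

    def ping(self):
        return True  # The process is up and responsive.

    def is_making_progress(self, max_idle_seconds):
        return time.monotonic() - self.last_progress < max_idle_seconds


w = StalledWorker()
w.last_progress -= 600  # Simulate 10 minutes with no leads processed.

ping_check_ok = w.ping()                       # True: watchdog sees a healthy worker.
progress_check_ok = w.is_making_progress(300)  # False: a progress check would flag it.
```

A progress-based check like `is_making_progress` would need a tenant-aware idle threshold, since the workers are legitimately idle outside routing hours.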
As a result, the affected routing workers were inactive. This went unnoticed because the event occurred after routing hours, when the workers are normally idle. After today’s routing hours started, our users noticed as the day went by that they were not receiving new leads as usual. They reported the issue to the Customer Success team, which immediately escalated it to the Development team.
Once we received the user reports, we quickly identified the issue and restored the service by manually restarting the affected workers.
We have identified where the automatic restart mechanism and the watchdog failed and will improve them in a future update, as well as improve our ability to detect and respond to such events earlier.