Our routing service comprises many individual worker processes, each dedicated to a tenant, that continuously process incoming leads and route them to that tenant’s users. To keep the workers running around the clock, they are built to restart automatically on detected failures, and they are also supervised by a watchdog service that restarts any worker that is unresponsive or fails to restart itself.
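The supervision model above can be sketched roughly as follows. This is an illustrative simplification, not our production code; the `Worker` class, `ping` method, and `watchdog_pass` function are hypothetical names for the concepts described.

```python
# Illustrative sketch of the supervision model: workers restart themselves
# on failure, and a watchdog restarts any worker that fails a liveness check.

class Worker:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def ping(self):
        # Answers as long as the worker process itself is up.
        return self.alive

    def restart(self):
        self.alive = True


def watchdog_pass(workers):
    """One supervision pass: restart every worker that fails its ping."""
    restarted = []
    for w in workers:
        if not w.ping():
            w.restart()
            restarted.append(w.name)
    return restarted
```

Note that in this sketch the liveness check is a simple ping, which only proves the process is up and responsive; this detail matters for the incident described below.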
Yesterday we experienced a high database load event in a related service that the routing workers depend on. The event automatically triggered an auto-scaling process, which added database instances to handle the extra load. During this event, some of the workers were interrupted mid-processing and became idle -- instead of halting or crashing, which would have triggered the automatic restart script to restore them to an operational state. Compounding the problem, the watchdog continued to see these processes as active and responsive, and therefore did not restart them.
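A minimal sketch of the failure mode, under the assumption that the watchdog's liveness check is ping-based (class and method names here are illustrative, not our actual implementation): a worker interrupted mid-processing still answers pings, so it looks healthy even though it has stopped making progress.

```python
import time

# Illustrative sketch of the gap: a stalled worker passes a ping-based
# liveness check, while a progress-based check would have flagged it.

class StalledWorker:
    def __init__(self):
        # Timestamp of the last lead this worker actually processed.
        self.last_progress = time.monotonic()

    def ping(self):
        return True  # The process is up and responsive.

    def is_making_progress(self, max_idle_seconds):
        return time.monotonic() - self.last_progress < max_idle_seconds


w = StalledWorker()
w.last_progress -= 600  # Simulate 10 minutes with no leads processed.

ping_check_ok = w.ping()                       # True: watchdog sees a healthy worker.
progress_check_ok = w.is_making_progress(300)  # False: a progress check would flag it.
```

A progress-based check like `is_making_progress` would need a tenant-aware idle threshold, since the workers are legitimately idle outside routing hours.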
As a result, the affected routing workers were inactive. This went unnoticed because the event occurred after routing hours, when the workers are normally idle. After today’s routing hours started, our users noticed as the day went by that they were not receiving new leads as usual. They reported the issue to the Customer Success team, which immediately escalated it to the Development team.
Once we received the user reports, we quickly identified the issue and restored the service by manually restarting the affected workers.
We have identified where the automatic restart mechanism and the watchdog failed and will improve them in a future update, as well as improve our ability to detect and respond to such events earlier.