Websocket stability issues
Incident Report for SalesCandy
Postmortem

What happened?

On 25 May at around 9.30AM (MYT), we started receiving reports from users experiencing connection issues with our real-time lead routing websocket service. In effect, the impacted users were unable to receive new leads because a stable connection could not be maintained and their availability status could not be determined.

At around 10AM it became clear that the issue was impacting all our users, and it was escalated. We found that the database cluster backing the websocket and lead routing service had exceeded its memory limits and was suffering severe performance degradation. As a result, user connection information could not be updated, which caused the connection issues that users observed.

We upgraded the database cluster to allocate more memory, and the issue was temporarily resolved at around 11.30AM. However, we noted that the memory usage that caused it was higher than it should have been. We continued to troubleshoot the issue and at 4PM implemented the necessary fixes to prevent further occurrences of this event.

Learnings

This database (AWS ElastiCache Redis) is a high-performance in-memory database that is designed to handle high numbers of connections and is extremely fast. It is the perfect database for our use case, which requires us to support thousands of always-on connections and messages per second. However, the limitation of this database is that all the data it handles must fit within its allocated memory limits, or it fails catastrophically.
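
One way to notice this kind of limit before it is reached is to track memory headroom and alert early. The sketch below is illustrative only and is not our monitoring code: it assumes a dictionary shaped like the output of the Redis `INFO memory` command (the `used_memory` and `maxmemory` fields, both in bytes), and the function names and the 80% threshold are assumptions chosen for the example.

```python
from typing import Mapping, Optional


def memory_usage_ratio(info: Mapping[str, int]) -> Optional[float]:
    """Return used/max memory as a fraction, or None if no limit is reported."""
    max_bytes = int(info.get("maxmemory", 0))
    if max_bytes <= 0:
        # Redis reports maxmemory as 0 when no explicit limit is configured.
        return None
    return int(info["used_memory"]) / max_bytes


def should_alert(info: Mapping[str, int], threshold: float = 0.8) -> bool:
    """True when memory usage has reached the (assumed) alert threshold."""
    ratio = memory_usage_ratio(info)
    return ratio is not None and ratio >= threshold
```

For example, `should_alert({"used_memory": 900, "maxmemory": 1000})` is true, while the same call with `used_memory` at 100 is false.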

On Friday, 22 May, we deployed an update to our websocket service which added new functionality to support a soon-to-be-announced feature. The additions involved storing messages for users while they were offline, to be sent the moment they reconnect to the websocket. We also attached debugging logs to each message so that we could observe the performance of the system and identify the causes of issues such as undelivered messages.
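
To make the store-while-offline idea concrete, here is a minimal in-process sketch. It is not our implementation (which stores messages in Redis); the class name `OfflineMailbox`, its methods, and the per-user cap of 100 messages are all hypothetical. The point it illustrates is that a per-user bound keeps the stored backlog from growing without limit.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

MAX_PENDING_PER_USER = 100  # hypothetical cap, not a value from the incident


class OfflineMailbox:
    """Holds messages for offline users and releases them on reconnect."""

    def __init__(self, max_pending: int = MAX_PENDING_PER_USER) -> None:
        # deque(maxlen=...) silently drops the oldest entry once the cap is hit.
        self._pending: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=max_pending)
        )

    def store(self, user_id: str, message: str) -> None:
        """Queue a message for a user who is currently offline."""
        self._pending[user_id].append(message)

    def drain(self, user_id: str) -> List[str]:
        """Return and clear everything queued for a user who just reconnected."""
        queue = self._pending.pop(user_id, None)
        return list(queue) if queue else []
```

Dropping the oldest message when the cap is reached is only one possible policy; rejecting new messages or expiring entries by age are alternatives with different trade-offs.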

The deployment went well, and the system performed without obvious issues over the weekend and on Monday. However, a bug in the code caused it to continuously retry sending some types of messages, and because every retry event generated further debug logs, the logs started to accumulate exponentially in storage. During Monday night, when many users were offline, the number of stored messages continued to build up. On Tuesday morning, as on every morning, we sent out thousands of messages to update the applications (e.g. routing hours starting, last night’s events). The sudden addition of so many messages on top of the accumulated backlog required more memory than was available to the database and triggered its instability.
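
The failure mode described above, retries that never stop combined with a debug log that grows on every retry, is commonly avoided by capping both. The following is a minimal sketch of that idea and not the fix that was actually deployed; the function name, the `send` callback, and both limits are hypothetical.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3        # hypothetical retry limit
MAX_DEBUG_ENTRIES = 10  # hypothetical cap on debug entries kept per message


def send_with_bounded_retry(
    send: Callable[[], bool],
    max_attempts: int = MAX_ATTEMPTS,
    max_debug_entries: int = MAX_DEBUG_ENTRIES,
) -> Tuple[bool, List[str]]:
    """Try to send a message a limited number of times.

    Returns (delivered, debug_log). The debug log never holds more than
    max_debug_entries lines, regardless of how many attempts are made.
    """
    debug_log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        delivered = send()
        if len(debug_log) < max_debug_entries:
            status = "ok" if delivered else "failed"
            debug_log.append(f"attempt {attempt}: {status}")
        if delivered:
            return True, debug_log
    # Give up after max_attempts instead of retrying forever.
    return False, debug_log
```

With a `send` callback that always fails, the function stops after three attempts and returns three log lines, so neither the retry count nor the log size depends on how long the failure lasts.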

We have since resolved this issue by fixing the bug that caused the message accumulation and, using the data we collected, have revised our sizing estimates for the database. We have now upgraded the database and, from monitoring up to this point, are satisfied that it has enough capacity and will remain stable for the foreseeable future.
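
As a rough illustration of how a sizing estimate of this kind can be put together, the sketch below multiplies a peak message count by an average message size, applies an overhead factor for per-key bookkeeping, and then reserves a fraction of memory as headroom. The formula, the function name, and the default factors are assumptions for the example; they are not the figures or the method we used.

```python
import math


def estimate_required_memory_bytes(
    peak_messages: int,
    avg_message_bytes: int,
    overhead_factor: float = 1.5,    # assumed allowance for per-key overhead
    headroom_fraction: float = 0.3,  # assumed share of memory kept free
) -> int:
    """Estimate the memory to provision for a given peak message backlog."""
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    raw_bytes = peak_messages * avg_message_bytes * overhead_factor
    # Provision enough that raw_bytes fills only (1 - headroom_fraction) of it.
    return math.ceil(raw_bytes / (1 - headroom_fraction))
```

The inputs should come from measured peaks, such as the morning burst on top of the overnight backlog, rather than from averages.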

We are fortunate that no data was lost during this event, as the failure was isolated to the few services that depend on this database, and the error management mechanisms of those services handled the database outage as expected. We would also like to assure our users that this database does not handle any critical data and that the outage did not have any significant impact other than the inconvenience at the time.

Posted May 26, 2021 - 21:58 GMT+08:00

Resolved
This incident has been resolved.
Posted May 25, 2021 - 12:38 GMT+08:00
Update
We are continuing to monitor for any further issues.
Posted May 25, 2021 - 11:33 GMT+08:00
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 25, 2021 - 11:33 GMT+08:00
Identified
We have observed that the Redis database cluster backing the websocket service has been unstable since this morning. We are troubleshooting.
Posted May 25, 2021 - 10:26 GMT+08:00
Investigating
Users are reporting unstable connections to the websocket, affecting their ability to receive new leads.
Posted May 25, 2021 - 10:24 GMT+08:00
This incident affected: Lead Routing.