Service interruption: database performance issues
Incident Report for SalesCandy
Postmortem

What happened?

On the morning of Friday, 2 July 2021, starting at around 9AM, we began receiving reports from users saying they were experiencing intermittent slow performance in their applications.

We were already actively monitoring our Elasticsearch cluster, which houses the leads database responsible for this functionality, because of changes that had been made the previous night.

Initially, the reports did not match our system health monitoring metrics, which showed no issues, and our own testing did not reproduce the errors users reported. We therefore attributed the issue to other causes, as network instability is a more likely and more frequent cause of the kind of problems being reported.

At around 11AM we started receiving more reports from users, and at the same time we observed that the CPU usage of our Elasticsearch cluster had abruptly increased to a critical level.

This left users unable to load their leads data in both the mobile application and the manager portal.

At 11.07AM an incident was declared and reported here. It was clear that the performance issue was linked to the CPU overload on the Elasticsearch cluster, and that the cause had to be related to the changes made the previous night.

By 11.55AM we had increased the CPU available to the cluster and got it working again, though it was still slow. Investigation was underway to understand the extent of the impact on related services and to determine the root cause: we needed to be clear whether it was a misconfiguration, a malfunction resulting from the change despite a correct configuration, or some external factor.

By 1.00PM we had determined that the cause of the issue was a misconfiguration of the Elasticsearch cluster that demanded far more CPU resources than was reasonable. We started an operation to reindex the cluster with a new configuration to solve the issue.

The operation was taking a long time to complete under the circumstances, which we reported at 1.48PM. It completed after almost two hours, which we reported at 2.57PM. By this time performance was back to normal, and we continued monitoring and fine-tuning the configuration, including returning the cluster to its original (lower) CPU allocation.

At 6.24PM, having observed no further issues, we officially declared the incident resolved.

That evening and the following day (3 July 2021), we continued our observations and checked whether our error-handling mechanisms had functioned as expected during the incident.

In particular, it is important that all new and updated leads and lead logs were correctly recorded even during a major issue such as this. Our system is designed to tolerate a complete outage of this Elasticsearch cluster for up to 96 hours, and it continued to function even under the severely degraded performance we experienced during this event.
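
As a rough illustration of the general pattern behind this kind of tolerance (this is a generic sketch, not a description of our actual replication service; the index name and field names are placeholders), writes are committed to the primary write-optimised store first, and replication into Elasticsearch is driven from a retryable queue, so a search-cluster outage delays indexing rather than losing updates:

```python
# Generic pattern sketch, not SalesCandy's actual implementation: replication work
# is queued and retried so that an Elasticsearch outage only delays indexing.
import queue
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
pending: "queue.Queue[dict]" = queue.Queue()


def drain_replication_queue() -> None:
    """Index queued lead updates, requeueing them whenever the cluster is unreachable."""
    while True:
        lead = pending.get()
        try:
            es.index(index="leads", id=lead["id"], document=lead)
        except Exception:  # stands in for the client's transport/connection errors
            pending.put(lead)  # keep the update and try again later
            time.sleep(30)
```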

We are satisfied that our system worked as intended, and we have found no evidence of any data loss. However, please let us know if you notice any issues with your data.

We sincerely apologise for the mistake and the disruption it caused our users. We have learned from this mistake and will work to ensure it does not happen again.

Learnings

The performance of the leads and lead logs listings, search, and reporting functions in SalesCandy is critical to the experience of SalesCandy’s customers: any delays or errors are instantly noticed by our users and impact their daily work.

In the early days of SalesCandy, when we were using a relational database (MySQL), we encountered challenges where these functions could take several seconds, and even several minutes in the case of report generation.

We solved these challenges when we rebuilt SalesCandy by adopting the Elastic Stack (Elasticsearch, Kibana, and Logstash) as a key set of technologies able to deliver the functionality and speed we wanted for our customers.

In our architecture, these functions are implemented as separate microservices which depend on Elasticsearch as a search and analytics engine. These microservices are independent of the data ingestion and update microservices, which use a different database system optimised for write operations but not suited to these read-heavy workloads. In normal operation, newly written data are replicated (indexed) to Elasticsearch within seconds, depending on the overall load on our system.
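
To make that read/write split concrete, here is a minimal sketch using the official Elasticsearch Python client, assuming a hypothetical "leads" index and field names (exact keyword arguments differ between client versions); it illustrates the pattern, not our actual services:

```python
# Illustrative sketch only: hypothetical "leads" index and field names, not our real schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Replication side: a lead already persisted to the write-optimised store is
# indexed into Elasticsearch, where it becomes searchable after the next refresh.
es.index(
    index="leads",
    id="lead-123",
    document={"name": "Jane Doe", "status": "new", "owner": "agent-7"},
)

# Read side: listings, search, and reports query Elasticsearch, never the write store.
results = es.search(index="leads", query={"term": {"owner": "agent-7"}})
print(results["hits"]["total"])
```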

The performance of this replication process is very important in ensuring that users can access newly written data as soon as it is written, as is the performance of listings and report generation, not least because of the impact on our users' experience if their lead lists are too slow to load. As such, these functions and their underlying technology have been the subject of continuous improvement as SalesCandy’s user base grows, and with it, the volume of data we handle.

Over time, we have continuously found ways to improve the reliability and reduce the execution times of each function, while also balancing the resources needed to run them. In most instances these updates, even complex ones, have been executed without any adverse impact. But every day brings new challenges alongside our growth, new requirements, and changes in the technology landscape.

Our most recent update was an especially challenging one to execute. We are at a point where we need to double our capacity and split our analytics functions from the operational (e.g. listings) functions across two different Elasticsearch clusters. These functions were creating resource contention that manifested as intermittent slowdowns for our users, CPU usage spikes, and increased memory usage that was growing close to our allocated resource limits. Furthermore, we are in the process of implementing new features that will place increasing demands on our Elasticsearch architecture. In the meantime, our knowledge of and experience with the Elastic Stack had grown to the point where we knew we had to make changes such as re-sharding and remapping our indices in Elasticsearch to ensure the future scalability and reliability of the system. Finally, the remapping necessitated a raft of minor changes in the affected services, which needed to be modified to adapt to the new mappings.

This was not an issue that could be solved simply by increasing the resources (e.g. CPU and memory) available to our cluster: without making the changes we needed to make, we would not be able to fully utilise those resources, and we would end up increasing costs without any improvement in performance. Technical readers familiar with the Elastic Stack will know the difficulties of managing index mappings, node balancing, and shard allocation, which depend heavily on the individual use-case, the characteristics of the data, and the query patterns; our use of Elasticsearch also seems to be unusual compared with the many users who use it purely for time-series data, meaning we could learn little from others. It is also worth noting that this was our first Elasticsearch cluster in production, and this was about to be the biggest change we had made to it since its deployment.

We have been carefully preparing for these changes over the last quarter, starting with an extensive review of the Elasticsearch documentation to determine the changes we needed to make; in the last month we began implementing them. Our new Elasticsearch cluster dedicated to analytics was deployed in mid-June and served as the testing bed for the changes we were planning for our existing cluster. We redesigned our index mappings, practised the procedures needed to migrate our indices while avoiding downtime and data loss, and then made the required changes to the related functions. We then monitored and compared the performance of both clusters as new data were written to and read from them, and we were happy to see that the changes worked as intended.

The crucial mistake we made was that, in our testing, we worked with a subset of our complete dataset and performed only light testing, i.e. we did not test how the cluster would perform under full load (something that was difficult to organise given our limited resources, and which we decided was unnecessary). Elasticsearch performance is highly sensitive to a large number of factors that we do not yet fully understand; we had misunderstood how Elasticsearch handles querying across multiple shards and underestimated its impact on CPU usage, especially when we are both heavily writing to and reading from those shards at the same time. We also underestimated how quickly we could recover if something went wrong, a risk inherent in the limited testing opportunities available to us.

As it happened, we deployed our new indices on 1 July 2021, with our replication service writing to both the old and the new indices, prepared to stop the dual writes if CPU or memory demands exceeded expectations. They did not, and we considered it safe to perform the migration that night. In our previous practice runs we had determined that we could perform this migration seamlessly (in fact it had been done twice before at a smaller scale), so we did not provide advance notice of any possible interruptions. The migration was done at the quietest possible time, starting at 3.26AM on 2 July, and completed without issues at 3.42AM, including deployments of all the related services affected by the mapping changes. We then tested the change and noted no complications.
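
As a rough sketch of that dual-write step (the index names are placeholders, and the real replication service is more involved than this), each incoming lead update was written to both the old and the new index, so that either could serve reads and the migration could be abandoned without losing writes:

```python
# Sketch of the dual-write idea; "leads-v1"/"leads-v2" are placeholder index names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

OLD_INDEX = "leads-v1"
NEW_INDEX = "leads-v2"


def replicate_lead(lead: dict) -> None:
    """Write the same lead document to both the old and the new index."""
    for index in (OLD_INDEX, NEW_INDEX):
        es.index(index=index, id=lead["id"], document=lead)
```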

This was still the case as the normal daily load began, with users logging in as early as 7AM; our monitoring showed the cluster performing as expected. From around 9AM, when the majority of our users had logged in, we started receiving a trickle of error reports, but these did not appear to be connected with the change. On a normal day this is also when we start receiving a lot of new leads data, and CPU usage climbed steadily, but within expectations. Our own testing at the time showed no issues on our side. It only became clear that we had a problem when we suddenly received many reports at once and began experiencing the reported problems ourselves: the CPU usage of our Elasticsearch cluster had exploded to extreme levels without warning.

Our mistake was that, while we had been focusing on the impact of shard counts on memory usage, we did not consider how they interact with the CPU; this was not something that had been widely written about when we did our research. We had configured our new indices with more shards than our CPU cores could support for simultaneous reading and writing. The symptoms were similar to a hard disk that has become highly fragmented and slows to a crawl. Under the hood, as we understand it now, this caused what another Elasticsearch user who experienced the same issue calls a “segment explosion”: the available CPU could not keep up with the disk processes involved between writing and reading, which amounts to much the same thing.
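
A simple way to spot this kind of imbalance after the fact is to compare the number of primary shards and segments of an index against the processors actually available to the data nodes. The following is a diagnostic sketch with an assumed index name; the exact client parameters vary between versions:

```python
# Diagnostic sketch: compare shard/segment counts against available CPU cores.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "leads-v2"  # placeholder index name

shards = es.cat.shards(index=INDEX, format="json")
primaries = [s for s in shards if s["prirep"] == "p"]
segments = es.cat.segments(index=INDEX, format="json")
print(f"primary shards: {len(primaries)}, segments: {len(segments)}")

# Processors reported by each node; far fewer cores than busy shards is a warning sign.
for node in es.nodes.info(metric="os")["nodes"].values():
    print(node["name"], "allocated processors:", node["os"]["allocated_processors"])
```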

To mitigate the issue, we first increased the index.refresh_interval value of the indices from 1s to 30s, which reduced the number of segments being created by the write operations. We then increased the size of the cluster to allocate more CPU power, allowing the cluster to continue operating, albeit with much slower response times (users would see their updates appear much later, and still experienced intermittent timeout errors). Next, we performed a reindexing operation to move the over-sharded indices to new indices with fewer shards. Unfortunately this took a long time, because not only was it happening during our peak load period, but the reindexing operation itself added to the load; the same operation that took 20 minutes the night before took two hours on the day. Finally, when the reindexing operation was completed, we switched over to the new indices by updating their aliases. The problem was solved, and we were able to resize our cluster back down to its original smaller size.
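
For readers who want to see what these mitigation steps look like against the Elasticsearch API, here is an illustrative sketch using the official Python client. The index and alias names and the shard count are placeholders, and the exact keyword arguments differ between client versions:

```python
# Mitigation sketch with placeholder names ("leads-v2" over-sharded, "leads-v3" new).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Slow down refreshes so far fewer segments are created by the heavy writes.
es.indices.put_settings(index="leads-v2", body={"index": {"refresh_interval": "30s"}})

# 2. Create a destination index with a smaller number of primary shards.
es.indices.create(index="leads-v3", body={"settings": {"number_of_shards": 3}})

# 3. Reindex from the over-sharded index into the new one (runs as a background task).
task = es.reindex(
    body={"source": {"index": "leads-v2"}, "dest": {"index": "leads-v3"}},
    wait_for_completion=False,
)
print("reindex task:", task["task"])

# 4. Once reindexing has finished, atomically switch the alias used by the services.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "leads-v2", "alias": "leads"}},
        {"add": {"index": "leads-v3", "alias": "leads"}},
    ]
})
```

Having the reading services address the data through an alias rather than a concrete index name is what makes the final switchover a single atomic step.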

All things considered, we can attribute this error to our lack of experience with managing Elasticsearch and our limited understanding of its internal workings, although going through this event has greatly improved that understanding. Some things were done right: we had learned enough to anticipate this kind of event and to work out how to recover from it; our cluster was set up so that it could be upgraded and resized without downtime; the migrations did not incur downtime, even though one of them was performed in a highly stressed environment; and our replication service ran correctly throughout the incident, resulting in no data loss for our customers.

A large proportion of users may also not have directly experienced this issue: because our users depend mainly on mobile data connectivity, our mobile application is designed to work offline, so they could still access and update their leads from the application’s offline storage and continue working, although their updates were slow to be reflected in their apps. This was in fact one of the reasons we were slow to recognise that there was a problem: unless a user logged out and logged back in, which would have required a full fresh download of all listings, the listings remained available and updatable in the application.

To conclude, the changes we made are now stable and performing well, but it is now the weekend, when we normally see lower loads overall. We will continue to observe over the next few days how our Elasticsearch cluster behaves under higher loads, and we are now better prepared to handle any further problems. We will be doing this many more times in the future as our dataset grows and we need to re-optimise Elasticsearch, and we hope that next time we will not repeat the same mistakes.

We have only just started to fully explore the power of our Elastic Stack. This is the first step in a series of changes we will be introducing, in the form of new features and improvements, over the upcoming quarter and beyond, and we are excited about what we will be bringing to our customers soon.

Posted Jul 04, 2021 - 19:18 GMT+08:00

Resolved
This incident has been resolved. We will publish the post-mortem report soon.
Posted Jul 02, 2021 - 18:24 GMT+08:00
Monitoring
We have addressed the core performance issue and we are monitoring the outcome.
Posted Jul 02, 2021 - 14:57 GMT+08:00
Update
Our working hypothesis now is that our reconfigured search index from last night's database maintenance activity (https://salescandy.statuspage.io/incidents/fvkdcgdmd43b) is the cause of the current performance issues. The behaviour of the production system is suboptimal compared to simulation environments which use a smaller dataset. We are performing a reindexing and reconfiguration operation to get the system back to optimum performance.

Unfortunately this is taking longer than ideal, because the source index is performing slowly and because the system is currently in normal peak operating hours, which limits the resources available to complete the task faster.

Our main priority is to complete this task as quickly as we can while avoiding any options that risk disruption to our customers or data loss. We apologise for the inconvenience.
Posted Jul 02, 2021 - 13:48 GMT+08:00
Update
We are continuing to work on a fix for this issue.
Posted Jul 02, 2021 - 12:21 GMT+08:00
Update
We are continuing to work on a fix for this issue.
Posted Jul 02, 2021 - 12:00 GMT+08:00
Update
We are continuing to work on the issue. The impacted services are operational; however, performance is degraded.
Posted Jul 02, 2021 - 11:55 GMT+08:00
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 02, 2021 - 11:33 GMT+08:00
Update
We are observing very high CPU usage on our main database cluster and working to isolate the cause.
Posted Jul 02, 2021 - 11:09 GMT+08:00
Investigating
We are currently investigating this issue.
Posted Jul 02, 2021 - 11:07 GMT+08:00
This incident affected: Manager Portal (Application), Mobile Application (Management API), and Graph API.