Users unable to login using mobile app
Incident Report for SalesCandy
Postmortem

What happened?

On the morning of Thursday, 4 November 2021, starting at 9AM, we began receiving reports from users that they could not log in to the mobile application after entering their One Time PIN. Users received a “Forbidden” error response from the server. The issue affected all users of the mobile application, and only the mobile application.

Our investigation found that the error was caused by a standard Web Application Firewall (WAF) rule that limits the size of the request body. The rule itself was working as designed: it had been blocking these requests since 8PM the previous evening, which went unnoticed until users reported the problem today. Until then, login requests passed the firewall because they were within the limit; the request body size has since grown beyond the limit, causing the rule to block the requests.
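To illustrate the kind of check involved, here is a minimal sketch of a body size rule. It is not our firewall configuration: the limit value, the function name and the payload field names are assumptions chosen for the example.

```python
import json

# Hypothetical limit for illustration; the real threshold is not stated in this report.
MAX_BODY_BYTES = 8192


def check_body_size(body: bytes, limit: int = MAX_BODY_BYTES) -> int:
    """Return 403 (Forbidden) if the body exceeds the limit, otherwise 200."""
    return 403 if len(body) > limit else 200


# A small login body passes; the same body with a larger third-party field is blocked.
small = json.dumps({"otp": "123456"}).encode("utf-8")
large = json.dumps({"otp": "123456", "attestation": "x" * 9000}).encode("utf-8")
```

The request content is equally valid in both cases; only its size differs, which is why a block of this kind can be a false positive.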

On further investigation, we determined that this was a false positive. The increase in request body size is normal and acceptable, given the variation in the data collected and sent to the API during login, so we have adjusted the firewall rule to allow these requests to be processed normally.

Learnings

SalesCandy employs a layered security approach, with protections implemented at various levels of our system, including a Web Application Firewall that protects our public-facing API. A rule such as the request body size limit is necessary as one way to guard against requests carrying unexpected data in their payloads, which could be invalid or malicious. It is one of many rules implemented in the firewall.

In most cases, we are able to test the rules as we develop new features that may change the properties of requests, and to adjust the rules during testing, ahead of production, so that they do not affect our users.

In this specific case, there had been no material change to the firewall rules, the authentication service, or the login flow in the mobile application since at least July 2021, so we did not expect any issues. However, there is additional complexity: the login request from the mobile application includes data derived from third parties, such as Google’s SafetyNet Attestation API and our New Relic monitoring solution, as well as metadata from the user’s device (e.g. device model and app version codes). We had not accounted for the possibility that such data can grow in size without notice, and we did not leave sufficient tolerance in the allowed limit to absorb such growth. In this case the data grew, the request exceeded our rule’s limit, and the incident followed.
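As a rough illustration of how a request can grow without any change on our side, the sketch below measures the approximate serialized size of each top-level field in a login payload. The field names and values are hypothetical and do not reflect the actual request format.

```python
import json
from typing import Any, Dict


def field_sizes(payload: Dict[str, Any]) -> Dict[str, int]:
    """Approximate size in bytes of each top-level field when serialized as JSON."""
    return {key: len(json.dumps(value).encode("utf-8")) for key, value in payload.items()}


# Hypothetical login payload: only "otp" is entered by the user; the other
# fields stand in for third-party and device-derived data.
example = {
    "otp": "123456",
    "attestation": "a" * 3000,
    "device": {"model": "Pixel 5", "app_version": "1.2.3"},
}
sizes = field_sizes(example)
largest = max(sizes, key=sizes.get)
```

In a payload shaped like this, the third-party field dominates the total, so a change in its length alone can push the request over a limit.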

It is difficult to predict when such events will happen again. Whatever limit and tolerance we set (both of which should be as small as practically possible), the complexity of the software environment makes it likely that we will encounter the issue again at some point. It is also difficult to determine whether firewall alerts from a system that has historically functioned well are true or false positives unless we receive feedback from users, as we did today.

From this event, we have learned that we should improve our monitoring techniques so that we can detect and respond to such issues earlier, and prevent them where possible. We have already reviewed the payload size of all requests across our API against the firewall rule, and will look into creating automated alerts that fire when payload sizes approach the limits we have set.
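One possible shape for such an alert is sketched below. It is an illustration under assumptions, not an implemented system: the endpoint paths, the 80% warning ratio and the limit value are all hypothetical.

```python
from typing import Dict, List


def headroom_alerts(observed_max: Dict[str, int], limit: int, warn_ratio: float = 0.8) -> List[str]:
    """List endpoints whose largest observed body is at or above warn_ratio of the limit."""
    alerts = []
    for endpoint, max_size in sorted(observed_max.items()):
        if max_size >= limit * warn_ratio:
            alerts.append(f"{endpoint}: {max_size}/{limit} bytes ({max_size / limit:.0%})")
    return alerts


# Hypothetical largest observed request body per endpoint, in bytes.
observed = {"/login": 7000, "/leads": 1200}
alerts = headroom_alerts(observed, limit=8192)
```

Warning at a fraction of the limit, rather than at the limit itself, leaves time to review the rule before requests start being blocked.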

Posted Nov 04, 2021 - 13:35 GMT+08:00

Resolved
This incident has been resolved.
Posted Nov 04, 2021 - 12:01 GMT+08:00
Update
We are continuing to monitor for any further issues.
Posted Nov 04, 2021 - 11:26 GMT+08:00
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 04, 2021 - 11:25 GMT+08:00
Investigating
We have received reports of users not able to login using the mobile app. Login to manager portal (web) is operational. We are investigating the issue.
Posted Nov 04, 2021 - 11:11 GMT+08:00
This incident affected: Mobile Application (Authentication (Logins)).