This was a simulation only: there was no downtime, and we did not perform a restore of the production database.
On the afternoon of Wednesday, 26 January 2021, we hosted a full drill simulation test, a requirement of our Business Continuity Management, in which all Candyholics engaged in a brainstorming discussion around a test scenario to discover and evaluate a given plan.
This Business Continuity full drill simulation test practised disaster recovery of the main SalesCandy database (CandyBase). We tested the design and effectiveness of the full plan with a simulated disaster event and a simulated restoration procedure. Our goals were to evaluate participants' understanding of the plan and their roles, and to let them experience how a test is carried out. We also evaluated the communication and reporting components of the plan, both internally and externally, and tested the assumptions and preparedness of the plan, e.g. ensuring each participant had appropriate access to perform their role. From this event, we were able to evaluate the plan's design and identify the improvements needed.
In the scenario, the system was functioning normally on an ordinary working day at SalesCandy when the CandyBase service failed. As a result, all users were unable to use the SalesCandy system. However, the database server itself was still up and running normally, showing no errors in monitoring.
Around 3.35PM, we started to receive reports from users that they were unable to log in to the platform. The technical team was unaware of the problem until the CST submitted reports that something was wrong. In this scenario, we simulated an event where a developer made a mistake and corrupted the data in the production database.
The restore operation was taking a long time to complete under the circumstances, which the CST reported at 3.42PM. Additionally, the staging database would be switched off during the failover, so we needed the PRP to send leads so we could check whether all of them would arrive in the database once it was back up. The operation completed in about one hour and thirty-three minutes, at 5.15PM. By this time, performance was back to normal, and we continued monitoring and fine-tuning the configuration.
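The lead check described above can be sketched as a simple set comparison between the lead IDs the PRP sent during the failover and the IDs that actually landed in the database afterwards. This is a minimal illustration; the function name and lead ID format are hypothetical, not part of our actual tooling:

```python
def find_missing_leads(sent_lead_ids, stored_lead_ids):
    """Return the IDs of leads sent during the failover that never
    arrived in the database, preserving the order they were sent in."""
    stored = set(stored_lead_ids)
    return [lead_id for lead_id in sent_lead_ids if lead_id not in stored]

# Example: three leads sent while staging was switched off,
# but only two are present after the database comes back up.
sent = ["L-101", "L-102", "L-103"]
stored = ["L-101", "L-103"]
print(find_missing_leads(sent, stored))  # → ['L-102']
```

An empty result from a check like this is what would confirm that no leads were lost while the database was down.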
At 5.11PM, we had observed no further issues and officially declared the incident resolved.
That same evening, we continued our observations and checked whether our database handling had functioned as expected during the incident.
On further investigation, we found that this happened because one of our developers had mistakenly used the production database for his tests, overwriting data with wrong values. To fix the issue, we restored the database to an earlier point, using a backup taken at 4.55PM.
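The restore decision above amounts to picking a recovery point: the most recent backup taken at or before a chosen cutoff time. A minimal sketch of that selection logic (the backup schedule shown is hypothetical, not our real one):

```python
from datetime import datetime

def latest_backup_before(backup_times, cutoff):
    """Given a list of backup timestamps, return the most recent one
    taken at or before the cutoff; None if no backup qualifies."""
    candidates = [b for b in backup_times if b <= cutoff]
    return max(candidates) if candidates else None

# Hypothetical hourly backup schedule on the day of the drill.
backups = [
    datetime(2021, 1, 26, 14, 55),
    datetime(2021, 1, 26, 15, 55),
    datetime(2021, 1, 26, 16, 55),  # the 4.55PM backup
]
print(latest_backup_before(backups, datetime(2021, 1, 26, 17, 15)))
```

In a real restore, the cutoff would be chosen just before the corrupting operation, so that as little good data as possible is rolled back.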
In future, we will be more careful and place stricter restrictions on database access. The necessary action has been taken internally regarding the mistake made by the developer.
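One possible form such a restriction could take is a guard in test tooling that refuses to hand out a production connection string unless it is explicitly allowed. This is only a sketch; the host name and environment variable are assumptions, not our real configuration:

```python
import os

# Hypothetical production host; anything else is treated as non-production.
PRODUCTION_HOSTS = {"candybase.salescandy.com"}

def checked_dsn(dsn):
    """Return the connection string, but raise if it points at a
    production host and ALLOW_PROD=1 is not set in the environment."""
    host = dsn.split("@")[-1].split("/")[0].split(":")[0]
    if host in PRODUCTION_HOSTS and os.environ.get("ALLOW_PROD") != "1":
        raise RuntimeError("refusing to use the production database in tests")
    return dsn

# Staging passes through; production raises unless explicitly allowed.
print(checked_dsn("postgres://app@staging.salescandy.local/candybase"))
```

A guard like this would have stopped the exact mistake simulated in the drill: a developer pointing test code at the production database by accident.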