...
Thursday 03.17.16
Our alerting system identified an issue with the replication of one of our databases. We use MySQL in a master-master-slave configuration, and the secondary master node had fallen too far behind the primary master node, so a full reseed was required to restart replication. Maintenance to repair this issue was scheduled for the weekend.
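As a point of reference (this check is not from the original post, but is the standard way to see lag on a MySQL replica of that era), the lag is visible in the Seconds_Behind_Master field of SHOW SLAVE STATUS on the secondary:

    -- Run on the lagging secondary master (MySQL 5.x era syntax):
    SHOW SLAVE STATUS\G
    -- Seconds_Behind_Master: a large or steadily growing value means the node
    -- cannot catch up from the binlogs alone; NULL means replication has
    -- stopped entirely. Either way, a full reseed may be the only remedy.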
Sunday 03.20.16, 10:30 PM PDT: Disaster Strikes
A system administrator was tasked with reseeding our secondary master. Any replication seed starts with a command to drop all tables from the schema. Unfortunately, they failed to sever the master-master link before executing the restore, and the DROP TABLE commands cascaded to our primary node and, ultimately, to the slaves. Within a matter of seconds, all of our data was deleted. Our engineering team was notified immediately that the application had failed when our alerting system triggered once again. We kicked off a restoration process from our last daily backup, which had run late Saturday evening.
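For illustration only (these exact statements are assumptions, not commands quoted from the incident), the missing precaution amounts to two statements run on the node being reseeded before anything destructive happens: stop applying events from the other master, and keep the restore session out of the binary log so its DROP TABLE statements cannot replicate back.

    -- On the secondary master, before the reseed (MySQL 5.x era):
    STOP SLAVE;                    -- stop applying events from the primary
    SET SESSION sql_log_bin = 0;   -- this session's statements are not written
                                   -- to the binlog, so the DROP TABLEs and the
                                   -- reload cannot cascade to the primary
    -- ... drop tables and load the seed data here ...
    SET SESSION sql_log_bin = 1;   -- re-enable logging once the seed is done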
Because we use replication, MySQL produces binary logs (binlogs) recording every data-modifying statement executed on the master. The combination of our Saturday backup and our binlog retention policy (roughly 2-3 days of data) gave us enough overlap to perform a full and complete restore of all customer data up to the point of disaster. This gave us some level of comfort.
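In outline, that kind of point-in-time recovery looks like the sketch below; the file names, positions, and timestamps are placeholders rather than values from the incident.

    # 1. Reload the Saturday logical backup.
    mysql < saturday_backup.sql
    # 2. Replay binlog events recorded after the backup, stopping just
    #    before the errant DROP TABLE statements reached the primary.
    mysqlbinlog --start-position=<position-recorded-in-backup> \
                --stop-datetime="<timestamp just before the drops>" \
                binlog.000042 binlog.000043 | mysql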
Monday morning to Monday evening, 03.21.16
Every company should have a backup and recovery plan.
We did. The last time we tested it, restoration took 10-12 hours: an inconvenient amount of downtime, but not catastrophic.
We watched the restore process and waited. However, the combination of a recent configuration change to use table-level compression, the sheer amount of data we had built up since our last backup/recovery test, and the fact that the MySQL restore process is single-threaded meant that our restore would take an estimated 4+ days to complete. We were caught off guard by how long this process would run, and it was clearly an unacceptable amount of downtime.
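To make the bottleneck concrete (the backup format and names below are assumptions, since the post does not name the tooling): a logical dump is replayed by a single client connection, one statement at a time, and InnoDB table-level compression adds per-page compression work to every insert on top of that.

    # A plain logical-dump restore runs on one connection, one statement at a
    # time, regardless of how many cores the server has:
    mysql our_database < saturday_backup.sql
    # Parallel tools such as mydumper/myloader can split the reload across
    # threads, but only if the backup was taken in their format to begin with.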
...
Wednesday 03.23.16, 09:45 PM PDT
Full restoration of the system was completed, for a total downtime of seventy-one (71) hours and fifteen (15) minutes.
Ironically, in the end we followed our restoration playbook exactly as it was written.
...
https://www.gliffy.com/blog/2016/03/25/turtle-and-the-hare/