bigWham wrote: In the late evening of Oct 3 (CC Time) one of our core system data tables suffered data loss and could not be recovered.
I find this interesting...
Surely you have (lots of) storage redundancy... you should be able to recover from hardware failure without reverting to a backup.
Was it bad code? Did you implement a change that wasn't properly tested? Is your system documentation lacking and your developer(s) getting overwhelmed?
I'm just curious. Downtime is something that might happen when a disaster occurs. But having to go to a backup? Shit, somebody fucked up badly.
bigWham wrote: The only efficient solution was to roll back our entire database to the most recent backup, which happened to be approximately 24 hours before.
I think the word that is screaming at me in that sentence is "efficient".
Because I've designed a number of systems with 100+ tables... and if one of the "core" tables somehow "suffered data loss", I would expect to be able to recover the vital information in that table from its child and parent tables and other related tables. Whatever table was lost, you should be able to rebuild it with data from other tables.
I know there might be some information loss, like the exact time of turns, but that wouldn't be a big deal. You could look at the physical order of the rows and estimate the time of turns. Obviously I have to speak in general terms because I don't know shit about your design or what table was lost. But you can go to a backup for everything up to the last backup, and then "fix" the data for the time since the last backup.
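To make that concrete, here is a rough sketch of the kind of rebuild I mean, in Python with SQLite. Everything in it is made up for illustration: I don't know the real schema, so the `turns` and `moves` tables, their columns, and the `rebuild_turns` helper are all hypothetical. It assumes the lost table (`turns`) has been restored from the backup and that a surviving child table (`moves`) still carries the keys needed to recreate the missing rows, with the physical row order standing in for the lost timestamps.

```python
import sqlite3


def rebuild_turns(conn):
    """Recreate missing rows of a hypothetical `turns` table from its
    surviving child table `moves`.

    Rows already restored from the backup are left alone; only turns that
    appear in `moves` but not in `turns` are inserted. The lowest rowid of
    each turn's moves is stored as a stand-in for the lost timestamp.
    """
    conn.execute(
        """
        INSERT INTO turns (turn_id, game_id, player_id, first_move_rowid)
        SELECT turn_id, game_id, player_id, MIN(moves.rowid)
        FROM moves
        WHERE turn_id NOT IN (SELECT turn_id FROM turns)
        GROUP BY turn_id, game_id, player_id
        ORDER BY MIN(moves.rowid)
        """
    )
    conn.commit()


def demo():
    # In-memory database with an invented two-table schema.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE turns (
            turn_id INTEGER PRIMARY KEY,
            game_id INTEGER,
            player_id INTEGER,
            first_move_rowid INTEGER
        );
        CREATE TABLE moves (
            move_id INTEGER PRIMARY KEY,
            turn_id INTEGER,
            game_id INTEGER,
            player_id INTEGER,
            action TEXT
        );
        """
    )
    # State right after restoring the backup: only turn 1 is present.
    conn.execute("INSERT INTO turns VALUES (1, 10, 100, 1)")
    # The child table survived and still references turns 2 and 3.
    conn.executemany(
        "INSERT INTO moves (turn_id, game_id, player_id, action) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, 10, 100, "deploy"),
            (2, 10, 101, "deploy"),
            (2, 10, 101, "attack"),
            (3, 11, 100, "attack"),
        ],
    )
    rebuild_turns(conn)
    return conn.execute(
        "SELECT turn_id, game_id, player_id, first_move_rowid "
        "FROM turns ORDER BY turn_id"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Whether anything like this is possible depends entirely on which table was lost and how much of its data is duplicated or referenced elsewhere, which is exactly the part I can't see from the outside.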
I guess without going on and on, I think you chose the word "efficient" because you know that there was a better solution with respect to recovering all the turns, but you were either too fuckin' lazy to do the work to take the site down and fix the problem properly, or because you don't understand the data well enough to fix the problem in an acceptable amount of time.
All these pats on the back that people are giving you shouldn't fool you; reverting to a backup is called "Failing".