London, Friday June 21st 2013, 7:30 PM.
I’m away from Paris on a one-week trip, celebrating my 10th wedding anniversary. After an incredible afternoon tea at Fortnum’s, we’re back at our hotel for a shower before a last meal by the Thames.
My cellphone rings and I know the caller ID all too well. My manager is calling. He knows where I am, and I doubt he wants to make sure we’re having a good time. I pick up the phone, and in a few seconds, every drop of blood has left my face.
The situation is worse than your worst ops nightmare.
That morning, someone mistakenly overwrote a major client’s database with an old backup. The day passed as usual until the client realized the data was missing and the culprit admitted he had fucked something up. The client is very sensitive, and we’re about to lose him.
My fine supper by the Thames disappears, replaced by a blinking droid projecting a low-res hologram repeating endlessly:
Help me, Obi-Wan Kenobi. You’re my only hope.
For a second, I’m tempted to hang up and shut down my iPhone. I resigned a few weeks ago, and nothing forces me to interrupt my vacation anymore every time someone screws up.
I’ll think about balancing my personal and work life later. For now, I start investigating the case. If hell exists, I’ve just found somewhere worse than hell.
The latest backup is 1 week old.
Someone had shut down the backup daemon: it was the only way to stop the monitoring from complaining about the backup server being short on disk space.
The client has lost up to 1 week of data. And he kept adding new data during the 6 hours between the backup reload and the moment he realized something was missing.
And to make things even worse, the database is big. No, it’s bigger than that. It’s BIG! At more than 100GB, that MySQL database makes every manipulation longer, more complicated, and more hazardous.
This is the moment I’m tempted to throw my iPhone in the Thames, book a plane ticket for the other side of the world and disappear.
I call my manager back and tell him the whole story. We both know admitting we lost that data is not an option. Losing a client’s data is the worst thing that can happen to an IT company. We’re facing a major crisis; if the news spreads, we may lose all our clients by Monday.
That’s the moment I get the extra shot of adrenaline that makes my job so fascinating. I can feel every single cell in my brain starting to move, looking for a solution to my problem. There might be one. There must be one.
I swear I saw a light bulb flash above my head as I looked in the mirror. The idea flowed from my brain to my fingers at the speed of light. I started typing before I could even put the words together, knowing exactly what I was looking for.
MySQL binary logs.
The only history I have left is the MySQL binary logs. We use them for replication, and only delete them after 45 days. We run in mixed replication format, which means there must be a way to turn the binary logs into plain SQL. If so, I’ve found my database DeLorean, but I still need to find enough plutonium to run it.
I start loading the backup that was used in the morning onto a test server. It will take some time, but I don’t want to take any risks with the production database, or I’ll have to commit seppuku when I come back.
My DeLorean exists. It’s called mysqlbinlog.
mysqlbinlog decodes MySQL binary logs into plain SQL, so you can extract every write query that was run. Even better, it has a --database option to restrict the output to a single database, so I won’t have to grep through all our clients’ data.
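In practice, the extraction looks something like this. The database name and binlog file name below are hypothetical, not the real ones from that night:

```shell
# Hypothetical names; the real schema and binlog files differ.
DB=client_db
BINLOG=mysql-bin.000142

# mysqlbinlog decodes a binary log to plain SQL on stdout;
# --database keeps only the statements touching one schema.
# Guarded so this sketch is a no-op on machines without MySQL installed.
if command -v mysqlbinlog >/dev/null 2>&1; then
  mysqlbinlog --database="$DB" "$BINLOG" > extract.sql
fi
```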
What I want to do is tricky but I know I’ll succeed. The fear goes as it came, and I’m back to a more normal, less automated mode I know well. Not thinking about failure takes me back to my comfort zone and I’m now fully operational.
First, I have to find the first query that was run after the backup’s last statement. After I reload the backup, I quickly query the 96 tables of my data model. Bingo! Big Ben is slowly ringing 10 PM when I find what I was looking for (because, yes, loading a 100GB database takes time).
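Finding that cut-over point mostly means checking the latest timestamps in the reloaded backup, table by table. A sketch, with a made-up table and column name:

```shell
# Hypothetical table/column names; the real data model has 96 tables to check.
QUERY="SELECT MAX(updated_at) FROM orders"

# Guarded no-op where the mysql client isn't installed.
if command -v mysql >/dev/null 2>&1; then
  mysql client_db -e "$QUERY;"
fi
```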
Now, I can extract every statement that led to this morning’s drop database. I know where it is, more or less, but browsing such a huge extract takes time, and that’s exactly what I lack. A few minutes later, I’m ready to extract every query between the 2 statements I just isolated.
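mysqlbinlog can do that slicing itself with its --start-position and --stop-position options, using the byte offsets printed next to each event in the decoded log. The positions and file names below are made up for the example:

```shell
# Hypothetical byte offsets, read from the "# at <position>" markers
# that mysqlbinlog prints before each event.
START_POS=4512        # first event after the backup's last statement
STOP_POS=98417324     # last event before the accidental restore

# Guarded no-op where mysqlbinlog isn't installed.
if command -v mysqlbinlog >/dev/null 2>&1; then
  mysqlbinlog --database=client_db \
    --start-position="$START_POS" --stop-position="$STOP_POS" \
    mysql-bin.000142 > replay.sql
fi
```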
I save everything in a .sql file and push it to my test server, crossing my fingers as hard as my bones allow. I can almost see the statements running one by one until I get the prompt back.
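Replaying such an extract is then a plain mysql client call. Host, file, and database names here are hypothetical, and credentials are omitted:

```shell
# Feed the extracted SQL back into the test server's copy of the database.
SQL_FILE=replay.sql

# Guarded no-op where the mysql client isn't installed.
if command -v mysql >/dev/null 2>&1; then
  mysql --host=test-server client_db < "$SQL_FILE"
fi
```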
I’m not done yet.
I need to extract every operation that happened between the moment the 1-week-old backup was loaded and the moment the platform was shut down, in case something was updated in between. Once again my SQL DeLorean does its magic. I inject the whole thing into my spare server’s database, and once again, it works like a charm. There’s no way I can check my data integrity, but so far, binary logs have never betrayed me.
The rest of the story is routine operations. I back up my main database, dump the restored one from my test server, load it on my production server, eat the sushi I ordered from room service… Tada! The content the client has been complaining about is back, as well as all the content he had written that day.
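That closing sequence, sketched with hypothetical host and database names:

```shell
# Guarded no-op where the MySQL tools aren't installed.
STEPS=3
if command -v mysqldump >/dev/null 2>&1; then
  # 1. Safety net: snapshot the current (wrong) production state first.
  mysqldump --host=prod-server client_db > safety-backup.sql
  # 2. Dump the rebuilt database from the test server.
  mysqldump --host=test-server client_db > restored.sql
  # 3. Load it into production.
  mysql --host=prod-server client_db < restored.sql
fi
```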
I call my manager, and we can finally leave the room for a walk by the Thames. My 10th wedding anniversary is ruined, but my database is saved.