"I hate Friday and Monday."
Statistically, these 2 dates have demonstrated a much higher chance of system failure while compares to the rest of the other weekdays. Needless to mention that a majority of PMs love to ask for additional enhancement on Friday, in between 3pm and 5pm...
It was started at Thursday night, and I have a mini change to deploy. While I found that I can't remote in with some weird system error, my heart can sense that something had gone wrong. Unfortunately, data center is too far away to me, so there is nothing I can do.
Next day, reading the email from the operation team sent at late 4am. The team had worked around the clock to turn around the situation. They are later told that building maintenance was conducted some work and an "unforeseeable" event took place, and as a result air conditioning's power got cut off. Server room "virtually moved" to Africa with a temperature at high 40°C plus, servers shut itself down to survive the heat wave. Everything comes back online. A few issue here and there, but we can claim that we make it through. Check my SQL servers, most of the servers and databases are fine... except one pair of servers their log shipping is not working on all databases. After all my work is done to recover these log shipping, I think this is rather a rare scenario that a DBA can face. Given that, I really wanted to share here and hope it may help someone else who hates Friday like me.
Back to the topic. In general, log shipping would recover from most of the unexpected outage event very well. Usually, it just needs to catch up. Since we all know that the log backup are done in the order of transaction, and so does the restore job will only restore each file in the same order to ensure proper roll forward, there is really not much room for anything to go wrong. And if restore failed, one would assume that there could be a log backup file gone missing or corrupted. At worst, it could simply fix it by recreating the log shipping from scratch. In my case, I can't. I have a bunch of large SharePoint content databases and it will incur a big deal work and time to reinitialize them one by one.
Through investigation, what I found is:
-
The log backup job and log copy job keep going without an issue, but the log restore job is just stuck and claims that it could not restore a log which is too recent, with a error message similar to this:
- Error: The file 'S:\DEFAULT INSTANCE\LogShip\WHATEVERDB_20131010094501.trn is too recent to apply to the secondary database 'WHATEVERDB'.(Microsoft.SqlServer.Management.LogShipping)
- Error: The log in this backup set begins at LSN 813434000023423100001, which is too recent to apply to the database. An earlier log backup that includes LSN 813226000014464300001 can be restored. Searching for an older log backup file. Secondary Database: 'WHATEVERDB'
- Error: Could not find a log backup file that could be applied to secondary database 'WHATEVERDB'.(Microsoft.SqlServer.Management.LogShipping)
- I checked the list of log backup files and found that they are identical on both the primary and secondary server. I even delete those on the secondary and copy them over again. Therefore, I didn't miss any log backup that need to be restored. Still, restore job stuck at a single log backup file.
- By testing to manually restore the log backups, I can cross out the possibility that if the log backup is corrupted, as I believed that if the failure resulted of a corrupted log backup, the error message would be different.
- I noted that the system time on the server was gone strange for a short while, but it seems normal when I checked again.
I did some research and I found that there are 2 system tables in msdb, dbo.log_shipping_secondary and dbo.log_shipping_secondary_databases, can help to identify the last log backup file that it restored at the secondary database, Put them together in a query like the following:
So, I identified the last restored log backup and start to work from that point. I disabled the copy and restore jobs, then I tried to manually restore each log backup individually to the database. This is the list of files I got at the time:
First, I attempted to restore the trn file#1, which is done successfully. Next, I have a go with the trn file#2, and it complaint that it is too recent to apply to the database, same as reported from the restore job. Thought I have nothing to lose, so I tried to restore the trn file#3. I was expected a "too recent to apply" error again, however it surprises me with successfully done!! WTH? So, I retry on trn file#2, and it reports it's done successfully as well. After this point, I can restore in the normal order. So all I have done in order to fix it is to swap the order of restoring the trn file#2 and trn file#3. Strange, isn't it?
So, what is the "Lesson Learnt" here? I give it a thought on why it could be happened in the first place. Log backup file are named as [dbname_yyyymmddhhmiss.trn] format and log shipping job run based on the order of how the filename sorts in ascend. In our case, something gone wrong with the order when the 2 log backup files were written and named. Sounds non-sense, but it is possible. Remember, I have mentioned that the system time has gone strange for a short time. My guess is that time was incorrect on the server, and the time service doesn't able to correct it when the file was written. Therefore, when the log backup job first kicks in with a wrong system time, it named the file incorrectly. and once the time service fixed the time, you may found that SQL server would miss the right file because of the incorrect file sorting. That would explain why all the database on the server suffers the same symptom.
So, next time before you restart the log shipping from scratch with no choice but in pain, try to check the current situation. Sometimes, you may found that it is not as bad as you thought...
No comments:
Post a Comment