Corruption for the Accidental DBA’S

I apologize to all the readers for the lay-off that I had been in which resulted in me discontinuing my blog posts. I hope that with this post I can catch up to my intended speed of blogging useful information for all my colleagues and enthusiasts in the database domain.

With that I want to shed some light on the different verticals of database domain I have had the privilege of working on. Currently I work on Oracle, MS-SQL Server and Vertica technologies but since I began blogging I feel I have not shared information other than Oracle DB. So with this blog post I want to share about corruption in MS-SQL database and how to handle it, which I think is a major conundrum among professionals in the MS-SQL DB domain. I intend to keep this post and the subsequent follow-up to this post very simple and focused on the DBCC CHECKDB command, how to interpret it and how to resolve corruption.

To begin with let me say that corruption does happen, most of the MS-SQL professionals haven’t seen corruption or tend to ignore them claiming that they have the best hardware and system set-up. But at some point in time you are going to see corruption so it is best to be prepared for such situation to avoid data loss and downtime. Corruption happens mostly due to IO subsystem problems and most people don’t realize that they have corruption until its too late mostly because either they don’t know how to check for corruption or they miss the warning signs. On the other hand, people don’t know what to do when they do have corruption which leads to more data loss and downtime than necessary, monetary and even job loss, overall lowered perception of SQL server which makes it harder to convince the management that SQL server is Enterprise capable.

So, for these DBA’s who have never seen corruption, one of the common reaction when faced with corruption is to panic, and they tend to do things which make matters worse such as dropping a database or dropping a suspect database or running repair or rebuilding log causing more corruption.
Now let us come to the target of the subsequent posts that I will be blogging about. I shall talk about tools and methods to discover corruption and how to handle them. I will focus primarily on the CHECKDB command output, how to interpret the output and find what and where the corruption is. From this interpretation of the CHECKDB command output we shall than be able to choose between a restore or a repair, whatever might be the case. And finally we shall look at recovering from last resort, and that is our backups.
Ok, so how can we know when something has gone wrong? Well, we can assert the corruption from various ways, primarily from application/user connection betting broken frequently which happens when SQL server tries to process data that is corrupt or if the SQL server hits 823 or 824 errors which hints at IO subsystem problem. Another way to catch corruption is if we have regular backup jobs configured with checksum commands, and if these start to fail then we will be seeing error like 3043, which is the backup has found a checksum and it’s not going to finish the backup due to corrupt database. We can also catch corruption if we have alerts configures in our system. We should have alerts for any error with severity >=19 which will catch errors like 823, 824, etc.. And if such alerts start firing, means we have corruption. We have another error 825 which is a bit tedious to catch. As of SQL server 2005, if it gets 823, 824 error in few attempts, it will retry few times and if after few attempts if it succeeds, then everything carries on as far as the user is concerned. All the DBA knows about this is that an error with severity 10 and error message 825 is logged. So we should have alerts for this error 825 as well. Another sure shot way of finding the corruption is that if the regular maintenance jobs start failing. So it should be made customary to regularly check the maintenance jobs.

Now then as we have suspected corruption via the methods mentioned above, what do we do next. The first and foremost thing to do is, to not panic. The ideal thing is to have a certain set of steps or a certain algorithm to be followed which should be designed by the most experienced DBA and tested by the most junior DBA. This is what a proper DR plan should be as no one knows when a corruption is going to happen and who is going to be on support to handle the corruption. Next in process is to determine the extent of the corruption before resolving it. As it turns out in most of the cases the corrupted database might be easily repaired rather than have to be restored. I have seen cases when people have gone ahead and done things randomly to fix small meta-data issues causing even more corruption. One way to identify the extent of corruption is to run the CHECKDB command, and my subsequent posts are going to rely heavily on this command. We shall talk about it sooner. Again, we should also look at the SQL server error logs because the corruption might have been detected today but it might be an accumulated result of a corruption that began several days ago. This way we can easily find the extent of the corruption and deduce the best contingency plan. Similarly, it is also a good practice to check maintenance job history which will tell us the same thing as to from when did the database find corruption.

Another item in the check-list is to check for what backups are available. In some scenarios we might only need to restore in some point in time and as such might not need to do a full restore. So it is advised to have backups as per required and check for what kind of backups we have to ascertain the contingency plan to recover from the corruption. And finally, I will always stress on to wait for the CHECKDB command to finish before doing anything else as you may not need to do anything intrusive/destructive.

That’s it for today’s post. I shall follow-up this post with another one where we will be looking at the DBCC CHECKDB command and look at some scenarios and how to handle them.

P.S. Comments/suggestions/feedbacks are always welcomed

This entry was posted in ms sql, Uncategorized and tagged , , , , , , . Bookmark the permalink.

Leave a comment