SHRINK – The Evil Doctor

Every DBA faces the day-to-day challenge of making the best use of the available resources, and storage is one aspect that is quite tough to handle at times. There are several ways to reclaim free space from the storage allocated to a server, and as DBAs we frequently hear this suggestion from developers and other stakeholders: "Let's just shrink the database, and we can get free space".

There was a time when I'd readily say yes to it, but once you realize how evil that idea is, the first and foremost thing to say is: "That's going to hit our server's performance". The only time a shrink won't affect performance is when we use the WITH TRUNCATEONLY option and there is free space at the end of the file being shrunk. Shrink affects performance while it's running: it moves huge amounts of data around, generating I/O, fully logging everything it does and burning CPU, and because the operation is single-threaded, it is really slow.

Shrink affects performance after it has run as well. All of that log has to be backed up, and if the data file grows again, the new space has to be zeroed out unless instant file initialization is enabled. Worst of all, shrink causes index fragmentation, which badly impacts the performance of range scans.

The shrink option does allow us to regain some amount of free space, but it has its consequences. So let's look at the workarounds we have to gain free space with or without shrinking.

Option 1: Try a TRUNCATEONLY first

Try a DBCC SHRINKFILE with TRUNCATEONLY first. This just removes empty space from the end of the file, without re-organizing the used data pages first. It probably won't free as much space as we'd like, but it's worth a try, especially as it's much faster than a SHRINKFILE without the TRUNCATEONLY option.
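Here's a minimal sketch of what that looks like, assuming a hypothetical data file whose logical name is DataFile1 (take the real logical name from sys.database_files):

USE [YourDatabase];   -- hypothetical database name
GO
-- Release only the unused space at the end of the file; no pages are moved
DBCC SHRINKFILE (N'DataFile1', TRUNCATEONLY);
GO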

Option 2: REORGANIZE/REBUILD indexes after the SHRINKFILE

If a SHRINKFILE must be done, allocate enough downtime for the process to complete, but be aware that we can stop the process at any time without incurring a rollback. The operation is fully logged, but any work done up to the point of cancelling is kept, so we can pick it up again later and the time we just spent isn't totally wasted. The downside is that we have just fragmented a lot of our data.

Once the SHRINKFILE has completed, identify the indexes that need to be rebuilt or reorganized and run that process. This is where it gets tricky to estimate how much downtime to allow: not only do we have to shrink the file, but we also have to estimate the time for the REBUILD/REORGANIZE operations after the SHRINKFILE has completed. This maintenance will also have the negative effect of growing the file again, though hopefully not by as much as we just shrank it.
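A quick way to see which indexes the shrink fragmented is to check sys.dm_db_index_physical_stats. This is only a sketch, assuming you run it in the shrunk database and treat roughly 30% fragmentation as the rebuild threshold:

SELECT  OBJECT_NAME(ips.object_id)      AS table_name,
        i.name                          AS index_name,
        ips.avg_fragmentation_in_percent
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN    sys.indexes i
        ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE   ips.avg_fragmentation_in_percent > 30   -- candidates for REBUILD
        AND ips.page_count > 1000               -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;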

Option 3: Move user data to a new file

Add a new file (or files) to the database in a new filegroup, rebuild all the indexes/data into the new filegroup, then shrink the original file(s). There are several advantages to this method, as listed below:

1. Avoids the fragmentation problem of the SHRINKFILE operation on the original file, as there is no user data to move around and fragment

2. Makes the SHRINKFILE operation on the original file go much faster, as there's no user data to move around

3. We don't need to re-index tables in the original file, so we don't end up re-growing it again

4. We can do it in smaller, more manageable chunks without impacting the data left in the original file

Of course, the new filegroup should be set as the default, otherwise new tables/objects will just keep being added to the old filegroup. This may not be possible for a wide variety of reasons: not enough disk space for the new file(s), heap tables with no indexes to rebuild, management reluctance, etc. However, this option does have the distinct advantage of implementing a best practice: separating the user data from the system data in the database.
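Here is a minimal sketch of that approach. The database name, file path, filegroup name and the existing clustered index CIX_Orders on dbo.Orders are all hypothetical:

-- Add the new filegroup and a file on another drive (path is hypothetical)
ALTER DATABASE [YourDatabase] ADD FILEGROUP [FG_Data];
ALTER DATABASE [YourDatabase]
ADD FILE (NAME = N'YourDatabase_Data2',
          FILENAME = N'E:\SQLData\YourDatabase_Data2.ndf',
          SIZE = 10GB)
TO FILEGROUP [FG_Data];

-- Make it the default so new objects land there
ALTER DATABASE [YourDatabase] MODIFY FILEGROUP [FG_Data] DEFAULT;

-- Move a table by rebuilding its clustered index onto the new filegroup
CREATE CLUSTERED INDEX [CIX_Orders]
ON dbo.Orders (OrderID)
WITH (DROP_EXISTING = ON)
ON [FG_Data];

-- Once everything has been moved, shrink the (now mostly empty) original file
DBCC SHRINKFILE (N'YourDatabase_Data1', TRUNCATEONLY);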

As an end note, just remember: a data file shrink, SHRINKDATABASE or auto-shrink is never going to be fruitful in the long run and should be avoided as much as possible. Let me know your thoughts on shrinking data files in SQL Server; I would love to hear them.


Recipe for Disaster

Happy New Year to everyone. With a new year, I am starting off with a blog post about a common recipe for disaster waiting to happen on your SQL Servers.

By default, SQL Server's databases are installed on the system drive. This is a recipe for disaster in two situations:

  1. If your users do something that requires the system databases to grow (like never purging backup history from MSDB or building giant temp tables), they can grow until the system drive runs out of space, and Windows will crash. It'll be your fault.
  2. If someone else does something to run the system drive out of space (like downloading huge files to their desktop or running big Windows Updates), the system will crash again. It won't be your fault, but in practice it still will be.

An experienced DBA will avoid this situation by configuring SQL Server during installation so that neither the user nor the system databases end up on the system drive. Unfortunately, I have seen cases where SQL Server was not installed by a DBA, or was installed without a proper configuration document, and in those cases it is easy to find user databases, or at least system databases, sitting on the system drive.

So how do you move the databases from the system drive to the other drives allocated to the server? Well, for the most part it is easy to move a database using the steps shown below (a worked example follows the list):

  • Set the user database offline
    • ALTER DATABASE [<database_name>] SET OFFLINE
  • Move the data and log files to the other drive and point the database at the new location
    • ALTER DATABASE [<database_name>] MODIFY FILE ( NAME = <logical_name>, FILENAME = '<new_location>\<physical_file_name>' )
  • Bring the user database back online
    • ALTER DATABASE [<database_name>] SET ONLINE
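As a worked example, here is a minimal sketch for a hypothetical database named SalesDB being moved to D:\SQLData and D:\SQLLogs (the logical and physical file names are assumptions; check sys.master_files for the real ones):

-- Take the database offline
ALTER DATABASE [SalesDB] SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Point SQL Server at the new locations (metadata change only)
ALTER DATABASE [SalesDB] MODIFY FILE (NAME = SalesDB_Data, FILENAME = 'D:\SQLData\SalesDB.mdf');
ALTER DATABASE [SalesDB] MODIFY FILE (NAME = SalesDB_Log,  FILENAME = 'D:\SQLLogs\SalesDB_log.ldf');

-- Physically move the .mdf and .ldf files to the new folders at this point
-- (outside SQL Server, e.g. with File Explorer or robocopy), then bring it back online
ALTER DATABASE [SalesDB] SET ONLINE;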

The above method works for all user databases and also for the MODEL, MSDB and TEMPDB databases. But what if the MASTER database needs to be moved too?

Moving the master database needs additional work, as it requires reconfiguring SQL Server's startup parameters. Here is the procedure to follow:

1) Open SQL Server Configuration Manager
2) On the "SQL Server Services" node, right-click the service for the instance you need to take care of
3) Choose "Properties" from the pop-up menu
4) Go to the advanced settings and edit the startup parameters
5) Edit at least the "-d" (master.mdf) and "-l" (mastlog.ldf) parameters; optionally also edit the "-e" (error log) option
 Example : -dD:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\master.mdf;-eD:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log\ERRORLOG;-lD:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\mastlog.ldf
6) Stop the SQL Server service
7) Move master.mdf and mastlog.ldf to the new location
8) Restart the database engine
9) Check that all is OK (a quick verification query is shown below)
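Once the instance is back up, a quick sanity check is to confirm that SQL Server sees the master files at their new paths; this sketch only reads metadata:

-- Verify the new locations of the master database files
SELECT name, physical_name, state_desc
FROM   sys.master_files
WHERE  database_id = DB_ID('master');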

That wraps up the procedure. It is highly advisable to try this on a development or test server first to avoid any risk. After this change you will have less reliability risk, and your system may even perform faster if the C drive was on slow storage.

Let me know your views on this post. Have you come across such situations? How did you handle them? And were there any hiccups while resolving the issue?

I would love to hear from you guys.


SQL Server Storage Report

When working with Oracle databases, especially with Oracle Cloud Control 12C, it was easy enough for me to generate a daily early-morning report that would be mailed to all the stakeholders, covering storage usage for all the Oracle instances. The mail would contain information like the sample shown below:

[Image: Oracle Storage Report]

One look at this email early in the morning would let me and my team start creating enough space on the servers flagged with the 'LOW SPACE' label, so that storage shortages never became a bottleneck.

While working on this server storage report for Oracle, I wondered how the same could be done for SQL Server. It was then that I stumbled upon a piece of code that allowed me to generate similar storage reports for my SQL Servers. That piece of code is shown below:

declare @svrName varchar(255)
declare @sql varchar(400)

--by default this uses the current server name; we can also set a different server name here
set @svrName = @@SERVERNAME
set @sql = 'powershell.exe -c "Get-WmiObject -ComputerName '+@svrName+' -Class Win32_Volume -Filter ''DriveType = 3'' | select name,capacity,freespace | foreach{$_.name+''|''+$_.capacity/1048576+''%''+$_.freespace/1048576+''*''}"'

--creating a temporary table
CREATE TABLE #output
(line varchar(255))

--insert disk name, total space and free space values into the temporary table
insert #output
EXEC xp_cmdshell @sql

--retrieve the values in GB from the PowerShell output
select rtrim(ltrim(SUBSTRING(line,1,CHARINDEX('|',line) -1))) as drivename
 ,round(cast(rtrim(ltrim(SUBSTRING(line,CHARINDEX('|',line)+1,
 (CHARINDEX('%',line) -1)-CHARINDEX('|',line)) )) as Float)/1024,0) as 'capacity(GB)'
 ,round(cast(rtrim(ltrim(SUBSTRING(line,CHARINDEX('%',line)+1,
 (CHARINDEX('*',line) -1)-CHARINDEX('%',line)) )) as Float) /1024 ,0)as 'freespace(GB)'
from #output
where line like '[A-Z][:]%'
order by drivename

--script to drop the temporary table
drop table #output

The output from the above query gave me results as shown below:

[Image: SQL Server Storage Report]

Beware that the query shown above uses xp_cmdshell and also requires PowerShell on the server.
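xp_cmdshell is disabled by default, so if the query above errors out it may need to be enabled first; weigh the security implications before doing this on a production server. A minimal sketch:

-- Enable xp_cmdshell (requires sysadmin; consider the security trade-off first)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;

Let me know what you think about this method, and also share how you generate such reports for the multiple servers under your administration/monitoring.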


How To Identify The Source Of Page Splits

In my last post I talked about how one could go about troubleshooting an issue in SQL Server, and today I would like to talk about one of the common issues that most DBAs come across in their professional careers: page splits. When a page becomes full and a new page is needed, this is reported as a page split.

Page splits can slow down your queries and increase the amount of I/O needed to retrieve your data. So, how do you monitor and identify the source of page splits, so you can avoid them in your databases?

The short answer is that you can monitor page splits with Extended Events. There's an event called page_split. The problem is that there are many kinds of page splits, and you only need to catch the problematic ones, but the page_split event doesn't distinguish between them. A page split is a regular operation and doesn't necessarily have bad consequences for your queries. The problem occurs with updates and non-sequential inserts, when a row needs to be inserted in the middle of an object's pages and there's no space for it.

In that case SQL Server creates a new page, transfers half of the page's data to the new page and writes the row. This creates page fragmentation, which is very bad for performance, and it is also reported as a page split. You can find these problematic page splits using the event sqlserver.transaction_log. This event monitors all activity in the transaction log, so we need to use it with caution. Specifically, you need to filter the operation field, looking for the value 11, which is LOP_DELETE_SPLIT: the deletion of rows that happens when SQL Server moves rows from one page to another in a 'bad' page split.

This event isn’t visible in the Extended Event interface, so we need to create a session using T-SQL

CREATE EVENT SESSION [BadSplits] ON SERVER

ADD EVENT sqlserver.transaction_log (
    WHERE operation = 11 -- LOP_DELETE_SPLIT 
)
ADD TARGET package0.event_file

-- You need to customize the path
    ( SET filename = N'C:\xelogs\badsplits.xel' )
GO

-- Start the session
ALTER EVENT SESSION [BadSplits]
ON SERVER
STATE = START;
GO

After some time, we can query this session to find the source of the page splits. The first challenge is to find the database where the page splits are happening. To read the information in the files, you need to use the DMF sys.fn_xe_file_target_read_file. The information is returned as XML, so we need to use XML methods to retrieve it:

WITH qry AS
(
    SELECT
        -- Retrieve the database_id from inside the XML document
        theNodes.event_data.value
            ('(data[@name="database_id"]/value)[1]', 'int') AS database_id
    FROM
        (
            SELECT CONVERT(XML, event_data) AS event_data   -- convert the text field to XML
            FROM sys.fn_xe_file_target_read_file            -- reads the information in the event files
                ('c:\xelogs\badsplits*.xel', NULL, NULL, NULL)
        ) theData
        CROSS APPLY theData.event_data.nodes('//event') theNodes (event_data)
)
SELECT  DB_NAME(database_id), COUNT(*) AS total
FROM    qry
GROUP BY DB_NAME(database_id)    -- group the result by database
ORDER BY total DESC

Once you know the database with the most page splits, you need to identify the objects that are causing the problem. This is a bit tricky, but here’s how it works (and there’s a code example at the end).

The event only returns the allocation_unit_id, so you need to convert the allocation_unit_id into the name of the object that owns those pages. The query that does this conversion needs to run in the database that contains the allocation units, so we filter the results for that specific database. We then join the result with the DMV sys.allocation_units on the allocation_unit_id field. This DMV has three kinds of allocation units, but we are interested in only two of them: type 1 and type 3, which represent in-row data pages and row-overflow pages.

Next, we need to add sys.partitions to the query, joining the container_id field of sys.allocation_units with the hobt_id field in sys.partitions. Finally, we add sys.objects to the query, joining on the object_id field from sys.partitions, to get the name from sys.objects:

WITH qry AS
(
    SELECT
        theNodes.event_data.value
            ('(data[@name="database_id"]/value)[1]', 'int') AS database_id,
        theNodes.event_data.value
            ('(data[@name="alloc_unit_id"]/value)[1]', 'varchar(30)') AS alloc_unit_id,
        theNodes.event_data.value
            ('(data[@name="context"]/text)[1]', 'varchar(30)') AS context
    FROM
        (
            SELECT CONVERT(XML, event_data) AS event_data
            FROM sys.fn_xe_file_target_read_file
                ('c:\xelogs\badsplits*.xel', NULL, NULL, NULL)
        ) theData
        CROSS APPLY theData.event_data.nodes('//event') theNodes (event_data)
)
SELECT  ob.name, qry.context, COUNT(*) AS total   -- the count of splits by object
FROM    qry
        JOIN sys.allocation_units au ON qry.alloc_unit_id = au.allocation_unit_id
        JOIN sys.partitions p        ON au.container_id   = p.hobt_id
        JOIN sys.objects ob          ON p.object_id       = ob.object_id
WHERE   (au.type = 1 OR au.type = 3)              -- in-row data and row-overflow pages
        AND DB_NAME(qry.database_id) = 'MDW'      -- filter by the database
GROUP BY ob.name, qry.context                     -- group by object name and context
ORDER BY ob.name

Now that we know which table is causing the most page splits, we can analyze its indexes to solve the problem.

Please note that although this approach can identify page split problems, you shouldn't let the session run for a long time, because you're capturing transaction log activity, which can be too intensive for the server.
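When you're done, stopping and dropping the session is straightforward; a minimal sketch, assuming the session name [BadSplits] used above:

-- Stop capturing and remove the session definition
ALTER EVENT SESSION [BadSplits] ON SERVER STATE = STOP;
GO
DROP EVENT SESSION [BadSplits] ON SERVER;
GO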


Troubleshoot Your SQL Server

Hi Everyone. I wanted to ask a question before I begin my post today. And the question is:

When you're trying to troubleshoot SQL Server, where do you start looking?

This is one of the hardest questions to answer when you have to troubleshoot your SQL Server in terms of performance. And if you don't know where to focus your efforts, it's easy to jump to the wrong conclusion.

So how do you go about troubleshooting your SQL Server? Well, the key is to look at multiple sources of information and not to treat your data as separate, unrelated entities. A troubleshooting framework helps enormously in such a scenario. For me it basically boils down to the following items (a short query sketch follows the list):

1. Wait statistics: A great starting point is to query sys.dm_os_wait_stats. You’re looking for significant system waits, such as an I/O bottleneck. This won’t tell you the root cause, but it’s a great start.

2. Virtual file statistics: Next, look at the virtual file statistics in sys.dm_io_virtual_file_stats. Combined with your wait stats, these provide further evidence of the source of your problem: you'll be able to see the level of I/O activity and where it's happening.

3. Performance counters: sys.dm_os_performance_counters provides you with performance counters that specifically relate to SQL Server. For example, if you’re seeing a problem in the disk I/O subsystem, high Full Scans/Sec may indicate a missing index or inefficient code.

4. Plan cache usage: Finally, query the execution information in sys.dm_exec_query_stats, and look for problems related to your findings in the previous stages of analysis. Find statements that have taken the most CPU, the longest execution time, and more.
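As a minimal sketch of the first and last items, something like the following works; the excluded wait types and the TOP (10) cut-off are just assumptions, so adjust them to your environment:

-- Top waits on the instance (excluding a few common benign wait types)
SELECT TOP (10) wait_type, wait_time_ms, waiting_tasks_count
FROM   sys.dm_os_wait_stats
WHERE  wait_type NOT IN ('SLEEP_TASK', 'BROKER_TASK_STOP', 'LAZYWRITER_SLEEP')
ORDER BY wait_time_ms DESC;

-- Top CPU-consuming statements currently in the plan cache
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
                 (CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset END
                  - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM   sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;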

Combining information from multiple statistics as shown above can help you pinpoint the source of your issue and come up with a solution that can improve the performance of your SQL Server.

Let me know what other statistics you look at when you have to troubleshoot your SQL Servers, and what kind of approach you take while you're at it.


You’re a DBA, you should know these

  • How many databases are you responsible for?
  • How many different versions do you support?
  • How much data have we handled this month?
  • How much capacity will we need next year?
  • What new IT techniques will help you get there?

These are the kinds of questions that business people tend to ask a DBA, and as soon as such questions are thrown at you, what comes to one's mind is metrics. These metrics help you answer a lot of questions for the business and help you get ready for whatever actions need to be taken. For example, when you know the different versions you support, you can easily identify when a server has to be upgraded. Another example: if you know how much data you handled this month, or how much capacity you will need next year, you can ask for a specific amount in next year's budget to handle that capacity. Remember that most business folks know what you do, but they don't necessarily know how you do it. All they understand are numbers that relate to them, and you need to be ready with answers when questions like the ones above come your way. You don't want to be scrambling for a way to get those answers when asked; be ready with the answers.

Is every database being backed up?

The answer to this question must be a definitive yes, and nothing but yes. But saying yes isn't enough; it's best to prove it with information from MSDB.DBO.BACKUPSET, which holds everything you need to answer the question.

Even so, that is not all. For example, if you are backing up to a folder on disk, do you know whether your backups are making it to tape? It's not enough to know that your backups are going fine; you need to know that their backups are going fine as well, and by 'their' I mean the sysadmins'. Are they completing the tape backups properly or not? There is a neat and easy way in Windows to check this, and almost all tape backup software supports the feature. All you have to do is switch to the details view in File Explorer and add the attributes column for the files. You might see files carrying the flag 'A', which means the file is ready to be archived and sent to tape. When the tape backup software picks up these files and moves them to tape, most products can clear this archive flag. So a file that has been moved to tape will no longer show the 'A' flag. If you are not seeing the flags cleared, it's best to talk to the sysadmins and configure the tape backup software to clear the flag once the files have been moved to tape.
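Here is a minimal sketch of proving the "yes" from msdb; the 7-day threshold is just an assumption, so set it to whatever your backup policy dictates:

-- Most recent full backup per database, flagging anything older than 7 days
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_full_backup,
       CASE WHEN MAX(b.backup_finish_date) < DATEADD(DAY, -7, GETDATE())
                 OR MAX(b.backup_finish_date) IS NULL
            THEN 'CHECK ME' ELSE 'OK' END AS status
FROM   sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name AND b.type = 'D'   -- 'D' = full database backup
WHERE  d.name <> 'tempdb'
GROUP BY d.name
ORDER BY last_full_backup;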

Will SQL Email us when something goes horribly wrong?

Again, there should be no other answer to this question than a definite and confident YES. To get there, all one needs to do is set up the following:

  • SQL Server Agent
  • Database Mail
  • Operator
  • Failsafe operator
  • Alerts

If you have all of these configured, then when something goes wrong with the SQL Server, you will get an email. But what if the SQL Server Agent itself goes down? After all, SQL Server Agent is the component that checks for problems and sends the email. For this, you can use a third-party freeware tool, or whatever health-check tool you currently use, to keep an eye on the status of SQL Server Agent. Also, when configuring the above, it's best to use distribution lists rather than individuals, so that in the absence of one person someone else can respond to the issue at hand. And know that these SQL Server Agent alerts are not your best monitoring option, but they are free, so always include them as part of your strategy. A minimal setup sketch follows.
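This sketch covers only the operator/alert part and assumes Database Mail is already configured; the operator name, e-mail address and severity level are hypothetical:

-- Create an operator pointing at a distribution list (not an individual)
EXEC msdb.dbo.sp_add_operator
     @name = N'DBA Team',
     @enabled = 1,
     @email_address = N'dba-team@yourcompany.com';

-- Alert on severity 19 errors (repeat for severities 20-25 as needed)
EXEC msdb.dbo.sp_add_alert
     @name = N'Severity 19 - Fatal Error',
     @message_id = 0,
     @severity = 19,
     @include_event_description_in = 1;

-- Tie the alert to the operator via e-mail notification
EXEC msdb.dbo.sp_add_notification
     @alert_name = N'Severity 19 - Fatal Error',
     @operator_name = N'DBA Team',
     @notification_method = 1;   -- 1 = e-mail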

Find out if any of your databases have a different collation before it’s too late

Collations can be a headache if they are unknown to you, or if you learn about them too late. In SQL Server, collation can be set at the server level, the database level and even down to the column level, so there can be multiple collations on a single server. It makes your work a lot easier if you know about all those different collations before doing any work on the production machines. Imagine a scenario where your database is in collation A and your server is in collation B. You go ahead and create a temporary table and insert data from your database into it. Which collation do you think the character data in the temporary table will use?

The answer is that it will inherit the server's collation, unless the temporary table is explicitly assigned a collation in its DDL. And this mismatch messes up a lot of things, especially sorting and string comparisons in SQL Server.
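A quick way to spot mismatches before they bite is to compare the instance collation against each database, and to use COLLATE DATABASE_DEFAULT in temp-table DDL where needed; a minimal sketch:

-- Instance-level collation
SELECT SERVERPROPERTY('Collation') AS server_collation;

-- Databases whose collation differs from the instance collation
SELECT name, collation_name
FROM   sys.databases
WHERE  collation_name <> CONVERT(sysname, SERVERPROPERTY('Collation'));

-- Example: force a temp table column to follow the current database's collation
CREATE TABLE #t (val varchar(50) COLLATE DATABASE_DEFAULT);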

Are you sure all of our servers are set up right?

There is no second answer to this question; it's always the same and must remain that way, and that answer is a big, overwhelming YES. This is something that you as a DBA must answer with a yes, and to prove it you must be ready to provide supporting answers in a jiffy. Now you must be thinking: I have tons of servers to handle, how can I query all of them at once and provide answers to the stakeholders? Well, there are a lot of ways to do that, but the easiest and most convenient is through registered servers and a Central Management Server in SSMS. You can group your servers into categories like production, DR, development, etc., and even by version, since DMV output often differs between versions when you do a SELECT * against them. Once you have this set up, you can easily fire a multi-server query and get results that tell you whether your servers are all set up right or not.
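For example, a simple multi-server query run against a Central Management Server group (SSMS adds the server name column to the results for you) might look like this minimal sketch:

-- Run against a registered-server group to compare versions and editions at a glance
SELECT SERVERPROPERTY('ProductVersion')  AS product_version,
       SERVERPROPERTY('ProductLevel')    AS product_level,
       SERVERPROPERTY('Edition')         AS edition,
       SERVERPROPERTY('Collation')       AS server_collation;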

Finally, can you get answers to all these questions really fast?

When business people come to you with such questions, they want answers fast. They don't have time to wait while you go off and run queries; when they say now, they mean it. So how do you get all these answers at a moment's notice, ready to splash out the metrics and support your requests for budget, hardware or anything else you need to keep the system working properly?

Here is something I personally use and have been using for a long time, and it has become a real weapon in my code repository for getting my metrics right.

Brent Ozar’s sp_Blitz

This was a life saver for me when I began my career as a DBA, and it gave me information about my servers that I wasn't even aware of. That way I avoided a lot of scenarios that could have gotten my servers into a whole lot of trouble, and it currently helps me provide the business people with the metrics they ask for, as well as fine-tune my servers for better operation.

What other things do you think a DBA should know, other than the ones I listed? Let me know your thoughts in the comments below. Tune in until next time.


Better… Faster… SQL Server Backups

Ever since I replaced the legacy backup system in one of my recent projects with a centralized SQL Server backup solution, many people have asked me about the results of the transition. To put it simply, the backup size dropped from 3.4 TB to about 700 GB (roughly a fivefold reduction, or about 80 percent), and the time spent on backup tasks went from 6 hours to 1.5 hours. Those are good numbers to put up in a case study, but how exactly did I achieve these results?

My answer usually begins by explaining the process using some of the terms that are relevant to the SQL Server BACKUP command, and if people are not using the BACKUP command to back up their databases, they might not know some of them. Today, I want to talk about some of the essential backup options provided by SQL Server that can help any DBA make SQL Server backups better and faster.

COMPRESSION

Backup compression was added in SQL Server 2008 Enterprise Edition, and is available in Standard Edition as well from SQL Server 2008 R2 onwards. It makes your backups smaller and can also speed up both backup and restore operations. The only trade-off is that compressing and decompressing uses more CPU, but this is usually not a problem.
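A minimal sketch, with a hypothetical database name and path:

BACKUP DATABASE [YourDatabase]
TO DISK = N'E:\Backups\YourDatabase_Full.bak'
WITH COMPRESSION, STATS = 10;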

COPY_ONLY

I have come across many situations where people have made a one-off full backup of a database that is already on a backup schedule. In doing so, the one-off full backup becomes the basis for the subsequent differential backups and the backup chain is disrupted, especially if the one-off backup is not recorded properly. During a disaster recovery, that single one-off full backup can derail the whole restore process. To avoid one-off backups affecting the backup chain, use the COPY_ONLY option: for a full backup it prevents the backup from becoming the base for subsequent differential backups, and for a log backup it prevents the backup from becoming part of the log backup chain.
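So an ad-hoc copy for, say, refreshing a test environment might look like this sketch (names and paths are assumptions):

-- One-off full backup that does not disturb the differential base
BACKUP DATABASE [YourDatabase]
TO DISK = N'E:\Backups\YourDatabase_AdHoc.bak'
WITH COPY_ONLY, COMPRESSION;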

CHECKSUM

This option does two things:

  1. As the data file pages are read from the data files, their page checksums are validated. If an invalid checksum is found, by default the backup fails and reports a message indicating which page is corrupt.
  2. A checksum is calculated over the entire backup stream and stored in the backup header.

From the above we can conclude that this option is a way for SQL Server to detect that the I/O subsystem has corrupted a page. So if your backups use this option and they start failing, that could mean there is corruption in your database. That makes this option an absolute requirement; it should be used on all backups, manual or automated.
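In practice it is just one more option on the same command, and you can re-verify an existing backup file the same way; a minimal sketch with hypothetical names:

-- Validate page checksums while backing up, and store a backup checksum
BACKUP DATABASE [YourDatabase]
TO DISK = N'E:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM, COMPRESSION;

-- Re-verify the finished backup file without restoring it
RESTORE VERIFYONLY
FROM DISK = N'E:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM;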

The next and last of the BACKUP options are ones that not many people are used to working with, but they can help speed up the backup process. There are no fixed values to follow for these options; they need a trial-and-measure approach. I suggest enabling the following trace flags before the tests, so you can see which values work best for these BACKUP options on your particular server:

DBCC TRACEON (3605, -1)
DBCC TRACEON (3213, -1)

BLOCKSIZE

This specifies the physical block size, in bytes. The supported sizes are 512, 1024, 2048, 4096, 8192, 16384, 32768 and 65536 (64 KB) bytes. The default is 65536 for tape devices and 512 otherwise. From my tests, specifying the large 64 KB block size instead of the default 512 bytes typically shows a 5-6% improvement in backup throughput.

BUFFERCOUNT

This specifies the total number of I/O buffers to be used for the backup operation. The total space that will be used by the buffers is determined by buffercount * maxtransfersize.

MAXTRANSFERSIZE

This specifies the largest unit of transfer, in bytes, to be used between SQL Server and the backup media. The possible values are multiples of 64 KB, ranging up to 4194304 bytes (4 MB). The default is 1 MB.

Once you have enabled the trace flags, you can play around with these values and find a combination that gives you the maximum buffer space usage; once that point is reached, increasing the values further is not fruitful.
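Here is a sketch of the kind of trial run I mean; the specific values are only a starting point and need to be measured against your own hardware, and the database name and path are hypothetical:

-- Trial run with explicit tuning knobs (trace flags 3605 and 3213, enabled above,
-- write the buffer configuration SQL Server actually used to the error log)
BACKUP DATABASE [YourDatabase]
TO DISK = N'E:\Backups\YourDatabase_Tuned.bak'
WITH COMPRESSION, CHECKSUM,
     BLOCKSIZE = 65536,
     BUFFERCOUNT = 50,
     MAXTRANSFERSIZE = 4194304,
     STATS = 10;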

These are the parameters that I played with in the central backup system I deployed, using the BACKUP options described above to get better and faster SQL Server backups. To summarize: backup compression is a great feature, as it saves disk capacity and reduces the time needed to back up your data. You can also stripe the backup across multiple destination files, which I have not covered here but which is another effective way to increase backup throughput. Then, by tuning the BUFFERCOUNT, BLOCKSIZE and MAXTRANSFERSIZE parameters, you can typically double the backup throughput, which means cutting the backup time in half. So the next time you need a better and faster backup, don't just ask for better hardware; look at these options and see if you can achieve your goal with the BACKUP command itself.

That's all for today, folks. Let me know if you have come across such a backup performance improvement task and what exactly you did to resolve it.
