I had an incident come across my desk the other day that was rather odd. We had an application that was running slow, and my Windows team noticed that SQL Server was using all of the available RAM on the server. Before I gave my Windows guru the Brent Ozar lecture on Task Manager being a filthy liar, I wanted to give it a look myself first. My instinct told me that we had a configuration problem because it’s very rare for my boxes to page to disk. We’re pretty conservative when setting MAX RAM. That’s when I found the culprit on the Memory tab of the server’s configuration.
Yes, we had a configuration problem. I immediately set MIN and MAX RAM to 16 GB, which is our best practice for a server with 32 GB running this particular application. Within a few seconds, my Windows guru asked how I fixed it. It really was that quick of a fix. Great. But how do we prevent this from happening in the future. My Grandma Hillwig used to say that an ounce of prevention is worth a pound of cure. She was a pretty smart woman. The first thing I did was have my team check the build documentation to make sure that this is set during server setup. I also had them check the peer review checklist to make sure this gets checked. Checklists are good and all, but I had this little itch to automate the check. This little script took me less than an hour.
set nocount on DECLARE @v_max_server_memory int DECLARE @v_min_server_memory int CREATE TABLE #config (name varchar(128), minimum int, maximum int, config_value int, run_value int) INSERT #config exec sp_configure SELECT @v_max_server_memory = config_value FROM #config WHERE name = 'max server memory (MB)' SELECT @v_min_server_memory = config_value FROM #config WHERE name = 'min server memory (MB)' drop table #config if @v_max_server_memory > 262144 and @v_min_server_memory < 512 and PATINDEX('%Hypervisor%',@@version) > 0 begin DECLARE @v_recipient varchar(128) DECLARE @v_subject varchar (128) DECLARE @v_body varchar(2000) SELECT @v_recipient = 'firstname.lastname@example.org' SELECT @v_subject = 'SQL Server Best Practices Alert' SELECT @v_body = 'SQL Server instance ' + @@servername + ' has failed a best practices check. This server is a VM and has MIN RAM set to ' + cast(@v_min_server_memory as varchar) + ' and MAX RAM set to ' + cast(@v_max_server_memory as varchar)+'.' + char(10) + char(10) + 'Setting MIN RAM too low will cause the VMWare balloon driver to force SQL Server to give up RAM. Setting MAX RAM too high will allow this server to start to page to disk.' EXEC msdb.dbo.sp_send_dbmail @profile_name = 'dbmail', @recipients=@v_recipient, @body=@v_body, @subject=@v_subject ; end
Now I would argue that one should set MIN and MAX RAM on all instances, not just those in virtual environments. However, it’s absolutely critical to do this in virtual environments. Notice that I’m looking for the word “Hypervisor” when using the @@version variable. And I’m looking for servers that have MAX RAM set higher than 256 GB of RAM. None of the instances in my environment are nearly this big. Your mileage will vary and you’ll need to modify that value.
Now that I have a script, I just need to deploy it. I tested this by running it against all of the instances in my environment with the CMS. After that, we’ll deploy it as a SQL Agent MSX job that runs on every instance once a week. The next step is to add our code segment that will send it to our service desk software and have it parse out the right configuration items.
I’m incredibly excited right now.
SQL PASS has just announced that SQL Saturday #262 Boston 2014 will be held on March 29 at the Microsoft Conference Center in Kendall Square in Cambridge, MA. Once again, I will be leading the team that organizes this event.
We had such an awesome experience with the 2013 event, and I’m really looking forward to 2014!
You can register for the event at the event website.
SQL Saturday season is starting, and I will be at two upcoming events.
On August 17, I will be presenting Recovery and Backup for Beginners in New York City. On September 14, I will be presenting What the VLF? in Orlando. I’ve also submitted to speak at SQL Saturday 213 in Providence.
In October, I will be attending the SQL PASS Summit. This is the superbowl of SQL Server learning events.
At this point, I can’t say too much, but SQL Saturday Boston 2014 is in the works.
I was looking at syntax of DBCC SHRINKDATBASE today and came across a little gem. Yes, I’m shrinking. I’m also doing it to recover close to a terabyte of storage. But I’m staying in control of when we shrink.
Looking at the MSDN page with the syntax, I came across this little nugget of truth.
Let me copy and paste the text so that the search engines pick this up, too.
Unless you have a specific requirement, do not set the AUTO_SHRINK database option to ON.
Countless people have documented this before, myself included. Microsoft is saying it. Please disable AUTO_SHRINK. Now. Please.
And then after you disable AUTO_SHRINK, please look at your indexes. They’re probably a mess.
A few weeks ago, I started putting together a new presentation to give about database backups. The more I thought about it, the more I realized that people don’t need to understand backups as much as they need to understand recovery. That’s where this session came from.
One of the first questions I have for my audience is if they have a backup strategy. And then I ask if they have a recovery strategy. It’s a very pointed question.
A few years ago, I saw Paul Randal do his Mythbusters session at the SQL PASS Summit. He said something that I will never forget: You need restore strategy and not a backup strategy.
I just can’t find a flaw in his reasoning. We can plan our backups, but unless we know how we’re going to recover those backups, we’re asking for failure. That’s when I decided that I wanted to change the name of my presentation. It’s now called Recovery and Backups for Beginners. The emphasis needs to be on the recovery instead of the backups. Once we know what we need in terms of recovery, then we can find the right backup solution to get there.
How often are DBAs required to put in place bad data structures because developers have already written the code. This is the same thing. We’re going to shoe-horn in a recovery to meet our existing backups. It needs to be the other way around. Once we know what our recovery requirements are, only then should we start talking about how to get there.
I have this statement on three slides in the presentation: If you have a backup strategy and not a recovery strategy, you’re doing it wrong!
How do we know what our needs are? This is where my ITIL training comes in. You probably have an SLA or OLA within your organization, formally or informally. And these probably include an RTO and RPO.
Lets break apart the alphabet soup:
Once you have defined these four things, then you can start putting together your strategy. Some people seem to think that they can provide an RTO of ten minutes with an RPO of zero. With simple backup and recovery, that’s pretty unlikely. If you lose an entire server, or the HBA in your needs to be replaced, you’re going to have an RTO much greater than ten minutes. The larger your database, the longer your recovery time. If your RTO is less than the time it takes you to restore your database, you’re simply not going to make it.
Once we know what our needs are, we can then start determining which recovery model works for our databases. I used to work for a company that had an RTO of twelve hours. That means if we lost a server, losing twelve hours of data was acceptable to the business. It would be painful, but all of the data could be recreated. That gave me a lot of flexibility in my recovery strategy. In my current environment, my RTO is 30 minutes. That’s a little tougher.
Recovery models tell us how us the backup methods. If you’re database is in SIMPLE recovery, it means you can only restore from a full backup. With a twelve hour RTO, I was able to keep my databases in SIMPLE mode and back up twice daily. However, with FULL Recovery mode, we still need to do that full backup and then roll forward the transaction log backups. That works better in situations with much shorter RTOs. The other advantage is that FULL recovery mode lets us do a point-in-time recovery. In some applications, that can be a life saver.
Now that we have a better idea for what we need, we can start to talk about backups. Depending on the size of your database and your infrastructure, you can back up as often as your RTO allows. If you’re in SIMPLE recovery mode and have an RTO of twelve hours, it means you need to do a full backup at least once every twelve hours. But if you’re in FULL recovery mode, you can back up as infrequently as you like and then roll forward all of your transaction logs. But be VERY careful doing this. Imagine doing a full backup once a month and then retaining 31 days worth of transaction log files. Recovering that could take days. Sure your backups won’t take long, but your recovery will be hell.
There is a reason I use the graphic of the guy with his pants down. If you don’t plan your recovery right, you WILL be caught with your pants down.
When I interviewed for my current position, the hiring manager asked me if I knew anything about VLFs. My response was “Very large filesystem?” Clearly I missed that question. She was talking about SQL Server’s Virtual Log Files.
Until that day, I had never heard of VLFs, and to this day, a lot of DBAs don’t know what they are. But before I get too far into this, let me state that I’m hardly the expert on this. Kimberly Tripp from SQL Skills has written several really good articles on transaction log throughput where she talks about VLFs.
VLFs are a construct within the SQL Server transaction log file that breaks the transaction log file into chunks. Those are the segements that get marked for being in use or ready to be reused.
The bottom line is that if you have lots of really small VLFs in your log file, you probably have a performance problem and you may not even know it. The best way to avoid this is to avoid the default settings on a new database for the transaction log file size and growth. If you start with a 1 MB transaction log file that grows by 10%, you’re going to have lots of little VLFs before you know what hit you.
My suggestion for a good practice is to start with a larger transaction log file and grow in larger increments. In my environment, for our datamart environments, we start with an 8 GB transaction log file and grow them in 8 GB increments. For smaller databases with smaller throughput, we set a minimum of a 256 MB transaction log file with a growth of 256 MB.
In this session, I talk about how this can be a Goldilocks problem. You might have too many or too few. Or they might be too big or too small. The real way to know is to baseline and test for your environment. The one thing I would suggest is that you don’t want to grow your transaction log file by more then 8 GB at a time. This will lead to 16 VLFs created per growth. And that would mean VLFs greater than 512 MB.
The real value in this session is the demo. The first script creates a database and starts stuffing it with data. It’s absolutely not a model of good design. In fact, it’s far from it. I have multiple columns that contain GUIDs. Worse yet, I have those columns indexed. The reason I’m doing this is because I want to create a ton of transaction log volume. Kick this thing off and let it run. It’ll grow your transaction log to about 12 GB.
The second script will show the VLFs in your database. This script executes
DBCC LOGINFO WITH TABLERESULTS
Notice that as the first script is running, the number of VLFs will continue to grow on each execution.
There are some cool details on this output, including the size of each VLF and the LSN of when it was created. Lets do a little analysis on it. What I’m doing is pumping that data into a temp tble and then grouping on the LSN. I can then see how much the log file was grown each time and the number of VLFs it created. What you see can be quite staggering.
CREATE TABLE #VLFS (recoveryunit int, --remove this column for SQL 2005/2008 environments fileid int, filesize bigint, startoffset bigint, fseqno bigint, status int, parity int, createlsn varchar(1000)) use miketest go INSERT INTO #VLFS EXEC('DBCC LOGINFO WITH TABLERESULTS') select createlsn, sum(cast(filesize as float))/1024/1024 [size], count(startoffset) [count] from #vlfs where createlsn <> '0' group by createlsn order by 3, 2, 1 drop table #vlfs go use master go
Look at all of those growths. And they’re tiny. That can’t be good. You can see that we’re we’re growing our transaction log file by less than 1 MB several times. But look at what happens closer to the end of the output. We’re growing our transaction log file in much larger increments. And the bigger the growth, the more VLFs.
Here is what I’ve been able to prove. If your growth is less than 512 KB, you create one VLF. from 512 KB to 768 KB, you create two VLFs. 768 KB to 1 MB creates three VLFs. 1 MB to 64 MB creates four VLFs. Until you reach 1 GB, it’s eight VLFs. Anything over 1 GB is 16 VLFs.
If you poke around your environment, you probably have databases that look just like this. The question is How do I fix it? The answer is simple. It depends. If you’re not seeing a noticable performance impact, you may want to leave it alone. This is where If it isn’t broken, don’t fix it could apply. If you are facing a performance problem from too many VLFs, I’d suggest doing this during a quiet period for your database.
First, note how big your transaction log file is. Because I know how my files grow, I’m going to put it back to the size it was before. Then I use DBCC SHRINKFILE to shrink the transaction log file and then grow it back to the original size, making sure I only grow it in 8 GB increments.
The real lesson here is something that I talk about in other posts: Defy the Defaults! If you’re creating databases with 1 MB transaction files and allowing them to grow in 10% increments, you’re probably doing it wrong.