Paul Randal

Avoiding Knee-Jerk Performance Troubleshooting

February 3, 2014 by Paul Randal in SQL Performance, Waits

Performance troubleshooting is an art and a science. The art comes from experience (and learning from others’ experiences) and the science comes from well-known guidelines about what to do in which scenarios.

Or at least that’s what I like to think, and teach.

In reality, many DBAs and developers out there practice what I call 'knee-jerk' performance troubleshooting. This commonly happens when a performance problem has reached the critical stage with, for instance, queries timing out, processes running slowly or failing, users disgruntled, and management wanting answers and action fast!

The 'knee-jerk' comes from doing some superficial analysis of the problem and jumping to the conclusion (really, grasping at a straw) that the most prevalent symptom must be the root cause, then trying to address that, to little or no avail, often using misguided or downright incorrect advice found online. This leads to a lot of frustration and wasted time, and often to wasted money when the organization decides to throw hardware at the problem by upgrading the server and/or the I/O subsystem, only to find the problem is still there, or quickly reappears.

Wait statistics analysis is one of the areas where it’s easiest to knee-jerk, and in this post I’m going to talk about a few of the common wait types and the mistakes that people make around them. There isn’t scope in an article like this to go into great depth about what to do in each case, but I’ll give you enough information to point you in the right direction.
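To make that concrete, here's a minimal sketch of the kind of aggregate wait-statistics query I'm talking about, using sys.dm_os_wait_stats. The handful of benign wait types filtered out below is illustrative, not exhaustive.

-- Cumulative waits since the last restart (or since the stats were cleared),
-- highest total wait time first. The excluded wait types are a small,
-- illustrative sample of benign waits, not a complete filter list.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
    (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'CHECKPOINT_QUEUE',
     N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT', N'BROKER_TASK_STOP',
     N'CLR_AUTO_EVENT', N'WAITFOR')
ORDER BY wait_time_ms DESC;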

LCK_M_XX

Most people assume that if locking waits are the most prevalent, then it must be some kind of blocking problem that is the issue. Often it is, such as a lack of a suitable nonclustered index causing a table scan in REPEATABLE_READ or SERIALIZABLE isolation levels that escalates to an S table lock. (And as a quick hint, if you don’t think you ever use SERIALIZABLE, you do if you use distributed transactions – everything is converted to SERIALIZABLE under the covers, which can lead to unexpected blocking and deadlocks.)
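As a quick sanity check for that scenario, a sketch like the following (using just the standard DMVs) shows which sessions are running under REPEATABLE READ or SERIALIZABLE, and whether any table-level locks are currently granted – often the fingerprint of lock escalation.

-- Sessions running at REPEATABLE READ (3) or SERIALIZABLE (4).
SELECT session_id, host_name, program_name, transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE transaction_isolation_level IN (3, 4);

-- Granted object-level (i.e. table-level) locks, often the result of escalation.
SELECT request_session_id,
       DB_NAME(resource_database_id) AS database_name,
       resource_associated_entity_id AS object_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_type = N'OBJECT'
  AND request_status = N'GRANT';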

However, it's often the case that the blocking is being caused by something else. Under the default READ_COMMITTED isolation level, locks covering changes are held until the transaction commits, and will block reads and other updates to the same row(s). If anything prevents a transaction from committing, that can cause blocking to show up.

For instance, if the database is synchronously mirrored, then the transaction cannot commit and release its locks until the log records have been sent across to the mirror and written to the mirror’s log drive. If the network is severely congested, or there’s massive I/O contention on the mirror, this could seriously delay the mirroring operation, and so cause the transaction to take much longer to commit. This would look like blocking but the root cause is resource contention to do with mirroring.
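If synchronous mirroring is a suspect, a quick check along these lines – just a sketch – shows whether the database is mirrored with full safety, and whether the waits associated with sending log to the mirror are piling up.

-- Which databases are mirrored, and in which safety mode?
SELECT DB_NAME(database_id) AS database_name,
       mirroring_state_desc,
       mirroring_safety_level_desc,
       mirroring_partner_name
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Cumulative waits related to database mirroring.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'DBMIRROR_SEND', N'DBMIRROR_DBM_EVENT', N'DBMIRRORING_CMD');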

For locking waits, unless the cause is obvious from looking at the query plan, the lock resource (e.g. table-level, indicating lock escalation), or the isolation level, follow the blocking chain (using a script that walks the blocking_session_id column in sys.dm_exec_requests) and then look at what the thread at the head of the blocking chain is waiting for. That's going to point toward the root cause.
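Here's a minimal sketch of that blocking-chain walk. A production script would handle multiple chains, parallel queries, and idle head blockers (which have an open transaction but no row in sys.dm_exec_requests) more carefully.

-- Every request that is currently blocked, who is blocking it, and what it's waiting on.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.wait_resource,
       t.text AS current_sql
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.blocking_session_id, r.session_id;

-- What the head(s) of the blocking chain(s) are doing and waiting for.
SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time, r.last_wait_type
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id = 0
  AND r.session_id IN
      (SELECT blocking_session_id
       FROM sys.dm_exec_requests
       WHERE blocking_session_id <> 0);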

ASYNC_NETWORK_IO

The name of this one causes lots of confusion. What word do you focus on? NETWORK. The cause of this wait type usually has nothing to do with the network. It should really be called WAITING_FOR_APP_ACK (nowledgment), or something similar, as that's exactly what is happening: SQL Server has sent some data to a client and is waiting for the client to acknowledge that it has consumed the data.

One of my favorite demos to do when teaching about wait statistics is to run a query that returns a large result set in Management Studio and watch the server rack up ASYNC_NETWORK_IO waits. With SSMS and SQL Server on the same machine, there's clearly no network involved – it's just SSMS taking a long time to reply to SQL Server. It's doing what is known as RBAR (Row-By-Agonizing-Row), where only one row at a time is pulled from the results and processed, instead of caching all the results, immediately replying to SQL Server, and then processing the cached rows.

This is the major cause of ASYNC_NETWORK_IO waits: poor application design. If the application code itself is well designed, I'd next look at whether the server running that code has a performance issue that's slowing down its consumption of the results. Occasionally it's the network, but that's rare in my experience.
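A quick way to see which applications and hosts SQL Server is currently waiting on is a sketch like this; the columns chosen are just illustrative.

-- Requests currently waiting for the client to consume results,
-- and the application/host doing the (slow) consuming.
SELECT r.session_id,
       s.host_name,
       s.program_name,
       r.wait_time,
       t.text AS current_sql
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type = N'ASYNC_NETWORK_IO';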

OLEDB

The common knee-jerk reaction here is to equate this wait type with linked servers. However, this wait type became much more common to see when SQL Server 2005 shipped, because 2005 contained a raft of new DMVs, and DMVs mostly use OLE DB under the covers. Before looking for linked server problems, I'd check whether a monitoring tool is constantly running DMV queries against the server.
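A sketch like the following shows who is accumulating OLEDB waits right now: a monitoring tool polling DMVs shows up under its own program_name, while linked server calls show up under the calling application.

-- Programs currently waiting on OLEDB, with total wait time per program.
SELECT s.program_name,
       COUNT(*) AS waiting_requests,
       SUM(r.wait_time) AS total_wait_time_ms
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
WHERE r.wait_type = N'OLEDB'
GROUP BY s.program_name
ORDER BY total_wait_time_ms DESC;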

If you do have linked servers, continue troubleshooting by going to the linked server, looking at the wait statistics there to see what the most prevalent issue is, and then repeating the same analysis.

One other thing that can cause OLEDB waits is DBCC CHECKDB (and related commands). It uses an OLE DB rowset to communicate information between its Query Processor and Storage Engine subsystems.

Other Waits

Some of the other waits that cause knee-jerk reactions are CXPACKET, PAGEIOLATCH_XX, SOS_SCHEDULER_YIELD, and WRITELOG, and I’ll cover those in my post next month.

Summary

When you have a performance problem, take the time to understand the data you're looking at and perform further investigation to narrow down to the root cause of the problem. Don't just grasp at whatever seems to be the top wait statistic and follow the first piece of advice you come across online (unless it's from a well-known and reputable source), or you likely won't solve your problem, and may even make it worse.

As far as general wait statistics are concerned, you can find more information about using them for performance troubleshooting in my wait statistics posts over on SQLskills.com.

This was the first in a series of posts I’ll be doing over the course of this year that talk about knee-jerk (re)actions around SQL Server and why they’re the wrong thing to do. Until next time, happy troubleshooting!