
Saturday, July 30, 2011

Performance counters

Process Object : % Processor Time

This is everyone's standard counter. Maybe that's because it's the default counter selected, but it's also a good gauge of how busy a server is. It's defined as the amount of time the server spends executing non-idle threads, but it's actually calculated by taking the percentage of time the idle thread executes and subtracting that from 100.

I use this counter because it provides a first indication of a bottleneck. If the CPU is pegged, the server probably can't get anything else done. I like to keep this number under 40% unless a large query is being processed. However, if it goes above 40%, I do not set an alert or have my monitoring software notify me immediately; doing that will result in you being notified constantly.

Instead, understand that your server will hit 100% CPU at times, but unless that's sustained for a period of time (for me, more than 15 minutes), I don't usually worry. I might lower this threshold for some systems, but for most of them this seems to work well.
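The "sustained above the threshold" idea can be sketched as a simple window check over recent samples. This is a hypothetical sketch; the one-sample-per-minute interval, the 40% threshold, and the 15-sample window are my own illustrative choices, not prescriptions:

```python
# Alert only when every CPU sample in the window exceeds the threshold.
# Assumes one sample per minute, so 15 samples = 15 minutes (hypothetical).

def sustained_high_cpu(samples, threshold=40.0, window=15):
    """Return True if the last `window` samples are all above `threshold`."""
    if len(samples) < window:
        return False  # not enough history to judge yet
    return all(s > threshold for s in samples[-window:])

# A brief spike to 100% does not fire the alert...
spiky = [20, 25, 100, 30, 22] * 3
# ...but fifteen straight minutes above 40% does.
pegged = [85, 90, 95] * 5
```

The point is that the alert keys off the whole window, so momentary spikes never page anyone.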

However, over time, say a month, if you constantly see your CPUs averaging more than 50%, you might want to start planning for an upgrade, and if they average more than 70%, you might want to hurry that planning along.

System Object : Processor Queue Length

This is an interesting counter, and one whose effect on the system I've had a hard time pinning down. For me it's really a value I watch, mostly over a sustained period, to see if the CPU is a bottleneck. It represents how many threads are waiting for processor time. The count includes all threads that are "ready" to run, not those waiting on something else, like IO, to complete.

The general rule of thumb is that this value should be less than 2 x the number of CPUs in the system. If it is growing and the CPU % (above) is 80%+, then you likely need more or faster processors.
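That rule of thumb combines with the CPU counter above in a simple check. A minimal sketch, with the thresholds taken straight from the text (the helper name is mine):

```python
# Rule of thumb: Processor Queue Length should stay under 2 x CPU count.
# If the queue is at or above that AND % Processor Time is 80%+, the CPU
# is likely the bottleneck.

def cpu_bound(queue_length, cpu_count, cpu_percent):
    """Hypothetical check combining Processor Queue Length with CPU %."""
    return queue_length >= 2 * cpu_count and cpu_percent >= 80.0
```

For example, a 4-CPU box with a queue of 10 at 92% CPU qualifies, while the same queue at 50% CPU does not, because a long queue alone isn't conclusive.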

Memory Object : Pages/Sec

This counter looks at the number of memory pages read from or written to disk; it is the sum of Pages Input/sec and Pages Output/sec.
Memory Object : Available MBytes

This counter is one I am glad got added to the list, because I really got tired of trying to compute this from the Available Bytes counter. If you are worried about tracking memory in bytes as opposed to MBytes, you have other issues. Or you're a lot smarter and more detailed than I am.

This number should be the amount of memory in MB on the computer (from the OS perspective) that is available to be allocated to processes. It's the sum of the Free, Standby, and Zeroed page lists. Again, if you need those individual numbers, you're probably way past this value. I look at this to see if the average number of MB stays fairly consistent from a baseline perspective. This can often clue me in that some other process or service has been added when I think everything on the server is the same and this number is lower.
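Spotting that "this number is lower than my baseline" moment can be automated. A minimal sketch, assuming you've kept a list of baseline samples; the 15% tolerance is my own illustrative choice, not a documented threshold:

```python
# Compare current Available MBytes to a baseline average. A significant
# drop may mean a new process or service was added to the server.
# The 15% tolerance is a hypothetical value for illustration.

def memory_drifted(baseline_samples, current_mb, tolerance=0.15):
    """Return True if current available MB fell below the baseline average
    by more than `tolerance`."""
    baseline = sum(baseline_samples) / len(baseline_samples)
    return current_mb < baseline * (1 - tolerance)
```

So with a baseline averaging 2000 MB, a reading of 1500 MB would flag, while 1900 MB is just normal variation.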
PhysicalDisk Object : Avg. Disk Queue Length

This counter measures whether I/O requests are being queued by the disk while it catches up on outstanding requests. This can be a performance bottleneck for a server: if it grows large, or you consistently see it above 8 for a particular disk, then that disk is spending a good amount of time stacking requests up instead of servicing them.
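Since this needs to be watched per disk, a small helper can scan a set of per-disk samples for the "consistently above 8" condition. A sketch, with the threshold from the text and everything else hypothetical:

```python
# Flag disks whose average queue length exceeds the threshold of 8
# described above. Takes per-disk sample lists, returns offending disks.

def busy_disks(disk_samples, threshold=8.0):
    """disk_samples: dict mapping disk name -> list of
    Avg. Disk Queue Length samples."""
    return [disk for disk, samples in disk_samples.items()
            if sum(samples) / len(samples) > threshold]

# Example: C: averages about 2, E: averages 11, so only E: is flagged.
samples = {"C:": [2, 3, 1], "E:": [10, 12, 11]}
```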
PhysicalDisk Object : % Idle Time

The % Disk Time counter has been analyzed a bit on the Internet, and there are some problems with how its data is collected (there is a reference below for this). I'm not sure how accurate it is, but I have read a few items that present a good case for it not being quite accurate, at least without some math on the part of the person interpreting it. Since I'm looking for things that are simple and easy to read, I've taken the alternate approach (also recommended by others) and look at idle time, which is supposedly more accurate. Subtracting this from 100 gives me an idea of how hard the disks are working.

Be sure that you include this counter for each disk instance and not the whole group.
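The subtraction described above is trivial, but worth writing down since it applies per disk instance. A minimal sketch (function name is mine):

```python
# Derive "how busy is this disk" from % Idle Time by subtracting from 100,
# the approach described above. Apply it to each disk instance separately.

def disk_busy_percent(idle_percent):
    """Convert a % Idle Time sample into an approximate busy percentage."""
    return 100.0 - idle_percent

busy = {disk: disk_busy_percent(idle)
        for disk, idle in {"C:": 92.5, "E:": 40.0}.items()}
```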
Network Interface Object : Bytes Total/Sec
This counter is supposed to be the number of bytes transmitted per second on the NIC. There will be one instance per NIC on the machine, including one for the loopback adapter. This is much better than the Current Bandwidth counter, which is merely the rated limit of the NIC: 10, 100, or perhaps 1,000 Mbps.
From the Steve Jones' Operations Guide: this should be a high number, roughly 60% of the theoretical max for your NIC. If it's not, start asking the network guys. I had an instance where this counter was 10% of what I expected because the gigabit Ethernet card was plugged into a 100 Mbps port. I just happened to notice that this number wasn't much larger than on other servers when I expected it to be. Overall, this is basically a double check, not something I usually watch; having it on the baseline is just a way of double-checking things.
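That "60% of theoretical max" sanity check, including the gigabit-card-on-a-100-Mbps-port case, can be sketched in a few lines. Note the unit conversion: Bytes Total/sec is bytes, link speed is bits, hence the factor of 8. The values below are illustrative, not measured:

```python
# Sanity-check NIC throughput against the "roughly 60% of theoretical max"
# guideline from the text. Bytes Total/sec counts bytes; link speed is in
# bits per second, so multiply by 8 before comparing.

def nic_underperforming(bytes_per_sec, link_speed_bps, expected_fraction=0.60):
    """Return True if throughput is below the expected fraction of link speed."""
    utilization = (bytes_per_sec * 8) / link_speed_bps
    return utilization < expected_fraction

# A "gigabit" NIC actually running through a 100 Mbps port tops out around
# 12.5 million bytes/sec -- only 10% of 1 Gbps, so this flags it.
```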
SQL Server Access Methods Object : Full Scans/Sec

This counter is one I always capture to ensure I know how often indexes are not being used.
I'm not completely sure what an "unrestricted" full scan is, as opposed to a restricted one. This is another relative counter that you have to baseline on your system: until you know what the average number is, it's hard to know whether your tuning efforts are lowering or raising it, which should correspond to improving or worsening performance.
SQL Server Databases Object : Transactions/Sec
This counter is one I always capture so I know what the average utilization of the server may be. After all, transactions are the basis of everything in SQL Server; most queries are implicit transactions, but they are transactions nonetheless. Unfortunately, this counter only counts transactions that change data. So for a particular database, this shows me the long-term expected number of transactions.

This is extremely handy for determining if the load has substantially increased. Having a long term average of a dozen or so transactions/sec and seeing a spot rate of 200/sec shows me that perhaps the application isn't at fault. I may be just outgrowing this hardware.
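The dozen-vs-200 comparison above amounts to checking a spot rate against the long-term average. A minimal sketch; the 10x multiplier mirrors the example in the text and is an illustrative ratio, not a fixed rule:

```python
# Compare a spot Transactions/sec reading against the long-term average.
# The default 10x multiplier is a hypothetical choice echoing the
# dozen-vs-200 example above.

def load_spike(long_term_avg, spot_rate, multiplier=10.0):
    """Return True if the spot rate is far above the long-term average,
    suggesting the workload (not the application) has outgrown the box."""
    return spot_rate > long_term_avg * multiplier
```

With a long-term average of 12 transactions/sec, a spot rate of 200 flags, while 50 does not.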
SQL Server Buffer Manager Object : Cache Hit Ratio
This counter shows the percentage of pages that are found in the buffer as opposed to read from disk. Since there is a read-ahead thread, if it can keep ahead of the read requests, it can keep this counter high even when a lot of data is coming off disk. As far as I know, it does not imply that everything you have requested is in memory as opposed to on disk. I have had systems with issues that showed low IO and this counter above 95% with only 4 GB of RAM and a 200+ GB database, so it must be counting the read-aheads, especially when I've seen it high while reading a couple GB from a table, not all of which could possibly be in memory.
The operations guide gives a hint for this counter. I rarely see it below 90% on any of my servers. Perhaps it's dumb luck, or perhaps I've got things fairly well tuned, but this is one of those counters that I'm not concerned with until it drops below 95%, at which point I am really concerned.
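That threshold makes for a one-line check; a minimal sketch, with the 95% cutoff taken from the text and the function name my own:

```python
# Flag a Buffer Cache Hit Ratio reading below the 95% threshold described
# above, the point at which it becomes a real concern.

def cache_hit_alarming(ratio_percent):
    """Return True when the cache hit ratio warrants attention."""
    return ratio_percent < 95.0
```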

SQL Server General Statistics Object : User Connections

This counter tells you how many connections there are to SQL Server on a spot basis; it's not an average, just the current number of connections. There are a number of system connections depending on how your server is configured, typically 10-20, but if you track this as a baseline over time, you don't really care. What you're trying to do is correlate the performance of the system with an average number of users. If the user count goes up, check other counters (CPU, memory, lock wait time, etc.) to see if there is a corresponding change under the larger load.

As with many counters, you want to use this as a broad instrument to measure the performance of the system, not a finely honed one.
SQL Server Locks Object : Average Wait Time
This counter measures the amount of time in milliseconds that a user is waiting for a lock.
