SQL SERVER – SPID is KILLED/ROLLBACK state. What to Do Next?

I always recall my fun days after talking to my friends and blog readers. Recently killed/rollback discussion came back when my close DBA friend called me for help. Our discussion was worth blogging.

If you are a SQL DBA or developer, I am sure you must have seen something like below:

SQL SERVER – SPID is KILLED/ROLLBACK state. What to Do Next? killed-01

The first question which would come to mind would be “what is killed/rollback command?” The answer is very simple – sometimes, DBA/Developer use “KILL” command to end processes that appear in a hung state. Once that happen, if SQL knows how to rollback that work, it would start “real” rollback process and would undo the work done. There are situations where SQL doesn’t know and has no control of work done of the work done and it has no control to roll back the work. These are the cases where its stays in KILLED/ROLLBACK state forever.

How would you decide which action to take? It completely depends on the action done by the SPID which was killed. There are only a few things, which you can do:

  1. Have patience and wait for rollback to finish. Some large operations may take a long time to rollback. Such SPID should show a change in CPU, Memory and IO column in size. sys.sysprocesses DMV (query in the next step).
  2. If the SPID is not showing any change in CPU, Memory and IO column then it is doing nothing. Here is the query which can tell us more about the SPID and from where the connection might have originated.
1234567891011121314SELECT spid,kpid,login_time,last_batch,status,hostname,nt_username,loginame,hostprocess,cpu,memusage,physical_ioFROM sys.sysprocessesWHERE cmd = 'KILLED/ROLLBACK'

If the process is originated from different machine, then using task manager and doing end process on that machine might help. This decision has to be taken very carefully as it might be a process of a 3rd party application.

After figuring out the SPID we can run the DBCC INPUTBUFFER (<SPID>) to know what is the batch running under this session.

  1. Restart SQL Service if it was killed and doing nothing.

Can you reproduce it? The simplest way is to use xp_cmdshell and open notepad.exe using below command. DON’T DO THIS ON PRODUCTION

1xp_cmdshell 'notepad.exe'

Now, it you kill this SPID it would go to KILLED/ROLLBACK state.

In summary, it is important to understand the cause. Killed/Rollback SPIDs are waiting for an event that will never happen. It could be a real “rollback” situation where SPID is rolling back and trying to get the database into a consistent state. By restarting the SQL Server service, you are postponing that rollback (or roll forward) work to the recovery step that runs before the database becomes available.

Reference: Pinal Dave (https://blog.sqlauthority.com)

Add other servers to the monitoring server

To add other servers to the monitoring server here’s what will need to be done:
  1. Create a Windows account on the server called SQLLogShipping (standard password: let me know if you need that).
  2. Add that SQLLogShipping account to the server’s Administrators group
  3. Make whatever changes are necessary to allow the monitoring server to connect to the server to be monitored.
    Best to make this across all ports, as picking up files uses dynamic ports so it’s at best very tedious to try to open it for specific ports/ranges.
  4. Login in to SQL Server with sysadmin rights and execute the following script:
use master;
create login [servername\SQLLogShipping] from Windows;
alter server role sysadmin add member [servername\SQLLogShipping];
(Replace “servername” in the above script with the name of the server to be added)

Mirroring vs log shipping

Database mirroring is functionality in the SQL Server engine that reads from the transaction log and copies transactions from the principal server instance to the mirror server instance. Database mirroring can operate synchronously or asynchronously. If configured to operate synchronously, the transaction on the principal will not be committed until it is hardened to disk on the mirror. Database mirroring supports only one miror for each principal database. Database mirroring also supports automatic failover if the principal database becomes unavailable. The mirror database is always offline in a recovering state, but you can create snapshots of the mirror database to provide read access for reporting, etc.

Log shipping is based on SQL Server Agent jobs that periodically take log backups of the primary database, copy the backup files to one or more secondary server instances, and restore the backups into the secondary database(s). Log shipping supports an unlimited number of secondaries for each primary database.

Database mirroring is preferable to log shipping in most cases, although log shipping does have the following advantages:

1. it provides backup files as part of the process
2. multiple secondaries are supported
3. it is possible to introduce a fixed delay when applying logs to allow the secondary to be used for recovering from user error

More information about both technologies is available in SQL Server 2005 Books Online in the topics “Understanding Log Shipping” and “Overview of Database Mirroring”.

Thus, there is a trade-off between the speed of retrieving data from a table and the speed of updating the table. For example, if a table is primarily read-only, having more indexes can be useful; but if a table is heavily updated, having fewer indexes could be preferable.


For as many OEMs I’ve setup, the one metric that has always amazed me has been the  “Tablespace Space Used (%)” metric.  This metic is often misunderstood although it “should” be quite simple to understand.  What is so hard to understand about percentage (%) used?

In reviewing the documentation for OEM 11g and OEM 12c, the explanation for this metric has not changed much between releases.  The calculation that is performed to trigger this metic is really simple math:

Tablespace Space Used (%) = (TotalUsedSpace / MaximumSize) * 100

Once this metric has been triggered, most DBAs start scrambling to perform one of the following task:

  • Increase the size of the tablespace
  • Reorganizing the entire tablespace (fragmentation issues)
  • Relocate segements to another tablespace
  • Run Segment Advisor on the tablespace

What I have come to find out is, some times OEM will trigger this metric and the data files may not need any adjustments.  In order to get a clearer understanding of what caused this metric to trigger, we need to look at the “fulTbsp.pl” script.  This script is located in the $AGENT_HOME/sysman/admin/scripts directory.

In reviewing the “fulTbsp.pl” script, Oracle is not only looking at the current size of the data files and the maxsize of the datafile; they are looking at the file system space as well.  The reason for this is to ensure that the data files have enough space to expand if needed.

Now, here is where it can become misleading.  By setting the Tablespace Space Used (%) metric for critical to 95, we are thinking that the metric will trigger when the tablespace reaches 95% used, correct.  Before rushing to perform the tasks above, lets check and see what space is actually used in the tablespace.  In order to do this, Oracle provides us with a DBA view (DBA_TABLESPACE_USAGE_METRICS) to review the percentage of tablespace used.  Below I have provided a sample query for getting the usage of a tablespace:

round(used_percent, 2)
round(used_percent,2) > 90;

Often, I have found that when an alert is triggered for the Tablespace Space Used (%) metric, the data files are less than 90% full.  This is due to the alert being triggered because OEM makes the determination that there is not enough space on the file system to expand the data file if needed.  If you keep this in mind, you’ll be able to keep a firm grasp on what is going on from the OEM and your tablespaces.


source: dbasolved

Why do we need Spinlocks?

It doesn’t make sense to put a latch in front of every shared data structure, and synchronize the access to the data structure across multiple threads with the latch. A latch has a huge overhead associated with it: when you can’t acquire a latch (because someone else has already an incompatible latch acquired), the query is forced to wait, and enters the SUSPENDED state. The query waits in the SUSPENDED state until the latch can be acquired, and afterwards moves on into the RUNNABLE state. The query remains in the RUNNABLE state as long as no CPU is available for query execution. As soon as the CPU is free, the query moves again into the RUNNINGstate and can finally access the shared data structure which is protected with the latch, which was successfully acquired. The following picture shows the state machine that SQLOS implements for the cooperative thread scheduling.

Cooperative Thread Scheduling.png

Because of the associated overhead of latches, it doesn’t make sense to protect “busy” data structures with a latch. For that reason SQL Server also implements so-called Spinlocks. A spinlock is like a latch a lightweight synchronization object used by the storage engine to synchronize thread access to shared data structures. The main difference to a latch is that you actively wait for the spinlock – without leaving the CPU. A “wait” on a spinlock always happens on the CPU, in the RUNNING state. You spin in a tight loop until the spinlock is acquired. It’s a so-called busy wait. The main advantage of a spinlock is that no context switches are involved when a query must wait on a spinlock. On the other hand busy waiting wastes CPU cycles that other queries might be able to use more productively.

To avoid waisting too much CPU cycles, SQL Server 2008 R2 and higher implements a so-called exponential backoff mechanism, where the thread stops spinning after some time and sleeps on the CPU. The interval after which a thread goes to sleep increases over time between the attemps to acquire the spinlock. This behavior can reduce the impact on CPU performance.

Spinlocks & Troubleshooting

The main DMV for troubleshooting spinlocks issys.dm_os_spinlock_stats. Every row that is returned by that DMV represents one specific spinlock in SQL Server. SQL Server 2014 implements 262 different spinlocks. Let’s have a more detailed look at the various columns in this DMV.

  • name: The name of the spinlock
  • collisions: The number of times that threads were blocked by a spinlock when trying to access a protected data structure
  • spins: The number of times threads spinned in a loop trying to obtain the spinlock
  • spins_per_collision: Ratio between spins and collisions
  • sleep_time: The time that threads were sleeping because of a backoff
  • backoffs: The number of times that threads were backed-off to allow other threads to continue on the CPU

The most important column in this DMV is backoffs, because this column tells you how often a backoff event occurred for a specific spinlock type. And very high backoffs yield to high CPU consumption and a so-called Spinlock Contention in SQL Server. I have already seen SQL Server installations where 32 cores were running at 100% without performing any work – a typical symptom for spinlock contention.

To troubleshoot a spinlock contention problem in more detail you can use the XEvent sqlos.spinlock_backoff provided by Extended Events. This event is always raised when a backoff occurs. If you capture this event, you also have to make sure that you use a very good selective predicate, because backoffs will always occur in SQL Server. A good predicate can be a specific spinlock type, where you have already seen high backoffs through the above mentioned DMV. The following code sample shows how you can create such an XEvent session.

As you can see from the listing, I use here the histogram target to bucketize on the callstack. Therefore you can see which code path within SQL Server generated the highest backoffs for the specific spinlock type. You can even symbolize the call stack by enabling trace flag 3656. As a prerequisite you need to install the public symbols of SQL Server. Paul Randal (Blog, Twitter) has written a blog postingabout it. Here you can see an output from this XEvent session.

sqlmin.dll!lockdb+0x4a sqlmin.dll!DBMgr::OpenDB+0x1ec

With the provided call stack, it is not that hard to identify in which area of SQL Server the spinlock contention occurred. In that specific call stack the contention occurred in the LOCK_HASH spinlock type that protects the hashtable of the lock manager. Every time when a lock or unlock operation in the lock manager is executed, a spinlock must be acquired on the corresponding hash bucket. As you can also see from the call stack the spinlock was acquired when calling the functionGetSharedDBLockFromLockManager from the classXactWorkspaceImp. It’s an indication that a shared database lock was tried to be acquired, when connecting to a database. And this finally yielded to a spinlock contention in the LOCK_HASH spinlock with very high backoffs.

If you attend my talk Latches, Spinlocks, and Lock Free Data Structures at SQLbits (Telford, UK) in 2 weeks or at the SQLPASS Summit in Seattle in November, I will also show you how you can reproduce this spinlock contention, how to troubleshoot it, and finally how you can resolve it.


In this blog posting you have learned more about spinlocks in SQL Server. In the first part we have discussed why SQL Server needs to implement spinlocks. As you have seen, with spinlocks it’s just cheaper to protect a “busy” shared data structure from concurrent thread access – like the lock manager. And in the second section we had a more detailed look on how you can troubleshoot spinlock contention in SQL Server, and how you can identify with a symbolized call stack the root cause of the problem.


Source :http://www.sqlpassion.at/archive/2014/06/30/introduction-to-spinlocks-in-sql-server-2/

thus, therefore and hence are different

A simple way of distinguishing and using these words accurately:

1. ‘Thus’ means ‘in this/that way’ – it relates to ‘HOW’ – the manner in which – this or that happens or comes about. It has a practical flavour. eg.Traditionally, you arrange things thus = Traditionally, this is how you arrange things

2 .’Therefore’ means ‘for this reason’, or ‘because of this or that’ – it relates to deductive reasoning, it tells WHY this or that is so, or happened. eg. He was late and therefore missed the bus = he was late and for this reason missed the bus

3. ‘Hence’ means ‘from this/that’ – it relates to WHERE – position, or point in time; it tells from where or what, or to where or what, something comes, derives, or goes eg. -i. Get thee hence! = Get yourself away from here! -ii. Henceforth all entrances will be guarded = From now on all entrances will be guarded -iii. She got the job – hence her good spirits = She got the job and her good spirits derive from that fact. (Note the different slant to ‘therefore’, which would also fit, but would say ” her good spirits are due to (’because of’; ‘for that reason’) that”.

Eg :

Thus: This thing is a balloon, and thus is made of rubber and inflates when you blow into it.

Therefore: This thing inflates when you blow into it and is made of rubber; therefore, it is a balloon.

Hence: This thing is called a balloon, hence it must inflate and be made of rubber.


Data Type Precedence (Transact-SQL)

SQL Server uses the following precedence order for data types:

  1. user-defined data types (highest)
  2. sql_variant
  3. xml
  4. datetimeoffset
  5. datetime2
  6. datetime
  7. smalldatetime
  8. date
  9. time
  10. float
  11. real
  12. decimal
  13. money
  14. smallmoney
  15. bigint
  16. int
  17. smallint
  18. tinyint
  19. bit
  20. ntext
  21. text
  22. image
  23. timestamp
  24. uniqueidentifier
  25. nvarchar (including nvarchar(max) )
  26. nchar
  27. varchar (including varchar(max) )
  28. char
  29. varbinary (including varbinary(max) )
  30. binary (lowest)


Datatype Precedence

Oracle uses datatype precedence to determine implicit datatype conversion, which is discussed in the section that follows. Oracle datatypes take the following precedence:

  • Datetime and interval datatypes
  • Character datatypes
  • All other built-in datatypes

SQL Server : Merge replication fails due to timeout errors  

Solution for : Merge replication fails due to timeout errors


You administer several Microsoft SQL Server 2012 database servers. Merge replication has been configured for an application that is distributed across offices throughout a wide area network (WAN).

Many of the tables involved in replication use the XML and varchar (max) data types.Occasionally, merge replication fails due to timeout errors.

You need to reduce the occurrence of these timeout errors. What should you do?


When you synchronize data rows with a large amount of data, such as rows with LOB columns, Web synchronization can require additional memory allocation and hurt performance. This occurs when the Merge Agent generates an XML message that contains too many data rows with large amounts of data. If the Merge Agent is consuming too many resources during Web synchronization, reduce the number of rows sent in a single message in one of the following ways:
  • Use the slow link agent profile for the Merge Agent.
  • Decrease the -DownloadGenerationsPerBatch and -UploadGenerationsPerBatch parameters for the Merge Agent to a value of 10 or less. The default value of these parameters is 50.
Note: Merge Agent has a “slow link” profile designed for low bandwidth connections.

How to allow merge agent to use “slow link profile” : Change Existing Agents: 

Select a profile (On the General page of the Distributor Properties – <Distributor> dialog box, click Profile Defaults), and then click Change Existing Agents to specify that all existing jobs for an agent of a given type should use the selected profile. For example, if you have created a number of subscriptions to a merge publication, and you want to change the profile to specify that the Merge Agent job for each of these subscriptions should use the Slow link agent profile, select that profile, and then click Change Existing Agents.