Implements IVillage - BizTalk / SQL Blocking Issue Deconstructed & Fixed?
It takes a village to keep up with .Net
 
 Friday, January 11, 2008

I've previously blogged about a big BizTalk issue we've been having at work.  After many months of traversing the Microsoft Support organization, the proper resources have been brought to bear on the problem.  We have hope of a solution on the horizon.  Here's a recap with the answers.

Evolution of the Problem

Over a year ago, we stood up a new BizTalk 2006 server infrastructure to be used as the central integration / orchestration point for all enterprise applications.  The new environment is very nice with lots of memory, 64-bit processors, load balanced servers, etc.  A nice change from the operational BizTalk 2004 environment, which was a single server backed by a questionable SQL Server cluster.  So with our new BT 2006 environment, we started migrating applications off of BT 2004.  We also started creating new applications and expanding operational ones.  Life was good.

We use a single file share server for the majority of our file drop locations.  This file share server is a Windows 2000 server and is not scheduled to be replaced by a Windows 2003 R2 server soon enough.  About halfway through our migration, we noticed receive locations on the BizTalk 2006 Server were shutting down every couple of days.  This then became every day, then every hour, and then every couple of minutes.  The problem went from the simple matter of getting the occasional MOM alert and manually restarting the port to being unmanageable.  Life was not good.

Our quick fix was to implement a relatively simple C# console application to scan the receive locations every x minutes and restart stopped receive locations.  This was accomplished using the BizTalk WMI interface and the Windows task scheduler.  Things weren't perfect, but we were operational.  Then the fun started.  Everything would grind to a halt every couple of weeks, then every couple of days.  The symptoms were:

  • BizTalk Management Interface was unresponsive
  • SQL Server showed a blocking SPID to the SSO DB that would never clear (the system was blocked for 24 hours once before we implemented better alerting)
  • Messages stop processing through BizTalk
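The "better alerting" mentioned in the second symptom can take many forms. Below is a minimal sketch of one approach, not the alerting described in this post: poll SQL Server for blocked sessions and flag any session that sits at the head of a blocking chain for too long. Only the decision logic is shown, in Python for brevity. The row shape (`session_id`, `blocking_session_id`, `wait_ms`) and the five-minute threshold are assumptions, loosely modeled on the columns that `sys.dm_exec_requests` exposes in SQL Server 2005.

```python
from typing import Dict, List, NamedTuple


class Request(NamedTuple):
    session_id: int            # SPID of the waiting session
    blocking_session_id: int   # SPID blocking it, 0 if not blocked
    wait_ms: int               # how long the session has been waiting


def head_blockers(requests: List[Request],
                  threshold_ms: int = 5 * 60 * 1000) -> Dict[int, int]:
    """Return {head blocker SPID: longest wait in ms among sessions behind it}.

    Only blockers whose victims have waited at least threshold_ms are kept.
    """
    blocked_by = {r.session_id: r.blocking_session_id
                  for r in requests if r.blocking_session_id}
    worst: Dict[int, int] = {}
    for r in requests:
        if not r.blocking_session_id:
            continue
        # Walk up the chain until we reach a session that is not itself
        # blocked; `seen` guards against cycles in the data.
        head, seen = r.blocking_session_id, {r.session_id}
        while head in blocked_by and head not in seen:
            seen.add(head)
            head = blocked_by[head]
        worst[head] = max(worst.get(head, 0), r.wait_ms)
    return {spid: ms for spid, ms in worst.items() if ms >= threshold_ms}
```

Reporting the head of the chain rather than every blocked session keeps the alert focused on the one SPID that has to clear before anything else can move.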

The clearing procedure became cycling the Enterprise SSO service on the primary SSO server, which required all BT Hosts to be stopped.  When the database is blocked, this becomes an hour-long task.  Once the ESSO service was recycled, everything was well again.
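The port watcher described earlier in this section boils down to a scan-and-restart pass. The original was a C# console application. The sketch below shows only the control flow, in Python, with the WMI access hidden behind two hypothetical callables (`list_locations` and `enable`). In a real implementation these would presumably wrap the BizTalk WMI provider's `MSBTS_ReceiveLocation` class (its `IsDisabled` property and `Enable` method, in the `root\MicrosoftBizTalkServer` namespace). Check those names against the BizTalk documentation before relying on them.

```python
from typing import Callable, Iterable, List, Tuple

# (name, is_disabled) pairs: a stand-in for rows read from the WMI provider.
Location = Tuple[str, bool]


def restart_stopped(
    list_locations: Callable[[], Iterable[Location]],
    enable: Callable[[str], None],
) -> List[str]:
    """Enable every disabled receive location; return the names restarted."""
    restarted = []
    for name, is_disabled in list_locations():
        if not is_disabled:
            continue
        try:
            enable(name)
        except Exception as exc:
            # One failing location should not stop the rest of the scan.
            print(f"could not enable {name}: {exc}")
            continue
        restarted.append(name)
    return restarted
```

Scheduling the pass every x minutes is left to the Windows task scheduler, as in the original setup. Returning the list of restarted names makes it easy to log how often the watcher has to step in, which matters because each restart adds WMI/DTC load.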

First Try at a Fix

We were never happy with the ports shutting down and spent a lot of time blaming the less than desirable Windows 2000 server hosting the file shares.  We did a lot of Googling and came up with a potential fix for the port shutdown problem: http://support.microsoft.com/default.aspx?scid=kb;en-us;810886.  I had previously discussed this here.  We increased the registry entries from 50 to 200.  The problem didn't resolve.  So we walked away from this KB article and resolved to wait for the file share server to be upgraded.  We still had to deal with the blocking.

Enter Microsoft Tech Support

Microsoft Technical Support was contacted.  We did a health check on the systems and identified many BizTalk housekeeping issues we had.  These items were rectified but the problems persisted.  In the process of exploring the SQL tables, we noticed the BizTalk work queues filling up with orphaned instances.  We had no idea where they were coming from and they were not showing up in the Group Hub.  There were actually thousands of them spread across the various host instances in our implementation.  We worked with MS support to do some cleanup of our production environment.  This seemed to help but then the queues kept filling up again.  After some focused digging, we managed to steer ourselves and MS support to what looked like the solution: a new hotfix just off the presses, KB936536 (still internal as of this post).  We tried it and it did not work.  Back to the drawing board.

After some more in-depth digging, it turned out that there was a bug in BizTalk: receive locations that stopped unexpectedly were not being cleaned up properly.  This bug left the orphaned instances in the work queues.  A patch was created after several weeks and fixed the queue problem.  It did not fix the blocking problem.

After some escalation and shuffling around, we were given a new set of support professionals and had the attention of Escalation Engineering, the Product Team, SQL engineers and finally - DTC engineers.  Many weeks of logging and capturing increasingly deeper levels of data led us to get the DTC folks involved.  It was one of those moments where you send away tons of log files and get an email back that says 'oh yeah, we've seen this before... there's a hotfix'.  There is joy that there is a solution and a sense of frustration that nobody said this sooner.

The Root of it All

  1. File Share Server not tweaked to handle the load.
  2. BizTalk shutting down receive locations.
  3. Custom BizTalk WMI program keeps restarting ports.
  4. High WMI/DTC activity brings about KB 934849: A COM+ application that is running on a Windows Server 2003-based computer stops responding and some work items that are queued in the MTA thread pool are not completed.

First Step

Stop the ports from shutting down.  Apparently KB 810886 was the solution to the port shutdowns after all.  We needed to increase the registry entries on both the client and server to 2048 to see a difference.  Once the port shutdowns stopped, the WMI-based port watcher was starting ports less frequently, which reduced the load on WMI/DTC.  When the ports stopped dropping, we set the port watcher to run every 30 minutes.  We've not had a problem since.  We are now testing KB 934849 on our staging servers.  It will be deployed next week to production if all goes well.
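For reference, the registry change amounts to one DWORD value on each side of the file share connection. The value names and key paths in the sketch below are not stated in this post. They are my reading of KB 810886 (`MaxCmds` under the LanmanWorkstation parameters on the client, `MaxMpxCt` under the LanmanServer parameters on the file server) and should be confirmed against the article before use. The Python snippet only builds `.reg` file text for review; it does not touch the registry.

```python
BASE = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"

# (machine role, key, value name): assumed from KB 810886, not from this post.
SETTINGS = [
    ("BizTalk server (SMB client)", BASE + r"\LanmanWorkstation\Parameters", "MaxCmds"),
    ("file share server", BASE + r"\LanmanServer\Parameters", "MaxMpxCt"),
]


def reg_file(key: str, name: str, limit: int = 2048) -> str:
    """Return .reg file text that sets one DWORD value (hex, 8 digits)."""
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key}]\n"
        f'"{name}"=dword:{limit:08x}\n'
    )


for role, key, name in SETTINGS:
    print(f"; import on the {role}")
    print(reg_file(key, name))
```

Generating the text first makes it easy to see that both machines get the same limit, since raising only one side was not enough in our case.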

SQL Adapter Issue Also?

When we upgraded some SQL Server Adapter-intensive projects from BT 2004 to BT 2006, we experienced a similar level of blocking on the SQL transactions associated with these projects.  It only affected the transactional system database and not the BizTalk database.  We were never able to fix it despite all of the hints and other tweaks we pulled out of our bag of tricks.  Apparently the SQL Adapter in BT 2006 has an increased default isolation level.  We followed all of the new guidelines and still had no success.  We are hoping that this SQL Adapter blocking was the result of KB 934849 as well.  We will be testing this shortly and I will blog it as well.

Friday, January 11, 2008 7:42:29 PM (Eastern Standard Time, UTC-05:00)
Copyright © 2008 Christian M Loris. All rights reserved.