Friday, June 6, 2008

16 thread pile-up...

We have been having issues with one of our servers going down. The offending action seemed to happen some time in the middle of the night. One symptom was a locked SQL table: the site would just spin when trying to access it. Either that, or the CF service was plain dead in the morning. After further review of the logs, I found these scattered throughout:
06/05 23:49:53 Error [jrpp-57] - Error Executing Database Query.[Macromedia][SQLServer JDBC Driver]Connection reset by peer: socket write error The specific sequence of files included or processed is: X:\xxx\xxx\xxx.xxx, line: 348
removeOnExceptions is true for xxx. Closed the physical connection.

followed by a bunch of

java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:116)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
Notice that the connection to SQL Server was being reset while CF was attempting to execute a query. It wasn't the same line of code every time, but one line showed up more often than the others. While researching the error on Google, I came across a couple of possible causes. One was a flaky network dropping the attempted query. While checking the SQL Server box's logs for a record of network issues, I found SQL errors occurring within seconds of our CF woes. Here they are:
Event Type: Error
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17066
Date: 6/5/2008
Time: 11:49:50 PM
User: N/A
Computer: xxxxxx
Description:
SQL Server Assertion: File: <lckmgr.cpp>, line=9421 Failed Assertion = 'NULL == m_lockList.Head ()'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Event Type: Error
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 3624
Date: 6/5/2008
Time: 11:49:50 PM
User: N/A
Computer: xxxxxx
Description:
A system assertion check has failed. Check the SQL Server error log for details

Event Type: Error
Event Source: SQLSERVERAGENT
Event Category: Alert Engine
Event ID: 318
Date: 6/5/2008
Time: 11:49:55 PM
User: N/A
Computer: xxxxxx
Description:
Unable to read local eventlog (reason: The parameter is incorrect).
(Link: http://go.microsoft.com/fwlink/events.asp)
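The timestamp match between the CF log and the Windows event log can also be checked with a script rather than by eye. The following is a minimal Python sketch, not something we actually ran: the function names, the 10-second window, and the assumption that the year is supplied by hand (the CF log line carries none) are all made up for illustration. The sample data is copied from the entries above.

```python
import re
from datetime import datetime, timedelta

# CF log lines start with "MM/DD HH:MM:SS" and carry no year.
CF_STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def parse_cf_stamp(line, year):
    """Return the timestamp of a CF log line, or None if it has none."""
    m = CF_STAMP.match(line)
    if m is None:
        return None
    return datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")


def parse_event_stamp(date_text, time_text):
    """Parse event log 'Date' and 'Time' fields such as 6/5/2008, 11:49:50 PM."""
    return datetime.strptime(f"{date_text} {time_text}", "%m/%d/%Y %I:%M:%S %p")


def correlate(cf_lines, events, year, window_seconds=10):
    """Pair each CF log line with events logged within window_seconds of it."""
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in cf_lines:
        cf_time = parse_cf_stamp(line, year)
        if cf_time is None:
            continue  # continuation lines (stack traces etc.) have no stamp
        for event_id, date_text, time_text in events:
            event_time = parse_event_stamp(date_text, time_text)
            if abs(cf_time - event_time) <= window:
                matches.append((event_id, event_time, cf_time))
    return matches


# Sample data taken from the log excerpts in this post.
cf_lines = [
    "06/05 23:49:53 Error [jrpp-57] - Error Executing Database Query.",
    "removeOnExceptions is true for xxx. Closed the physical connection.",
]
events = [
    (17066, "6/5/2008", "11:49:50 PM"),
    (3624, "6/5/2008", "11:49:50 PM"),
    (318, "6/5/2008", "11:49:55 PM"),
]

for event_id, event_time, cf_time in correlate(cf_lines, events, year=2008):
    offset = (cf_time - event_time).total_seconds()
    print(f"Event {event_id} at {event_time} is {offset:+.0f}s from the CF error")
```

With a window of a few seconds, all three events pair up with the one CF error, which is the pattern that pointed us at SQL Server rather than the network.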
After looking up a couple of these errors on Google, I saw one instance where the problem was solved by moving to SP1 for SQL Server 2005. We recently upgraded to SQL 2K5 and are still on SP0. There may be an issue in SP0 where it doesn't handle locks correctly and may be dumping at that point. Notice this line: SQL Server Assertion: File: <lckmgr.cpp>, line=9421 Failed Assertion = 'NULL == m_lockList.Head ()'. Judging by the file name, the failure is in the lock manager, so the lock isn't being handled correctly for some reason.

We have opted to upgrade to SP1. Think we can trust it? It's only been out since 2006 ;). We'll see what happens.

UPDATE (6/10/08): our server has been up for two straight weekdays with no issues at all. In fact, the site is reportedly responding much faster. A lot of SP1's updates were efficiency related, so things should naturally respond better after upgrading.

1 comment:

Code Fusion, LLC (Kevin Penny) said...

This was an easy win for us to upgrade for sure - hooray for Service Packs.

I've also heard rumors that once we flip the switch and go from SQL 2K compatibility mode to 2K5 compatibility mode, that we'll also get another boost in performance.