Monday, June 16, 2008

No Cookie for You!!



UPDATE:
A better explanation has surfaced. See this post for more info http://jochem.vandieten.net/2008/07/03/reserved-names-for-cookies/
I am leaving this entry as is as it still solved our issue. Call it a "work around".


So we've been noticing a lot of entries in one of our logs that look like this:
06/16 16:27:08 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:27:08 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:27:54 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:27:54 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:28:34 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:28:34 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:29:22 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:29:22 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:30:13 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:30:13 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:31:11 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT
06/16 16:31:11 error Cannot create cookie: expires = Wed 09-Jun-2038 16:17:42 GMT


I kept seeing postings from a couple of years ago that people would run into this issue as a result of a memory shortage in the JVM. This was not the case with us however as we have 400+ megs of memory free on average.

I decided to try and pair up the times of these errors with our IIS logs to see what I could find. I did find a matching entry that looked like this:
2008-06-16 00:00:01 W3SVC762671395 xxx.xxx.xxx.xxx GET /xxx/ xxx=5773E359497D4F1B 80 - 12.129.9.198 HTTP/1.1 Botster.LinkChecker/v.1.0 CFID=xxxxxxxx;+expires=Tue+08-Jun-2038+16:28:07+GMT;+CFTOKEN=xxxxxxx;+expires=Tue+08-Jun-2038+16:28:07+GMT;+JSESSIONID=xxxxxxx - www.xxxxxxxx.com 200 0 0 526 356 453

Not sure what to do about it yet. We could block Botster.LinkChecker at the IIS level as we have an ISAPI tool in place. But I'm wondering if there is something we can do at the code level too. The section of the log entry with +CFID, +expires, etc.. is part of the cs(Cookie) area. I'm assuming this is the area that reveals which cookies were requested to get set. Is it possible to check the user-agent for Botster and then redirect at the code level? I don't remember the order of operations there.

The rest of the story...
After looking at this for another day I finally realized what was going on. The key to this all lies within CF's client management. When the bot is hitting our site, cf wants to establish a session so it tries to write two permanent cookies, cfid and cftoken, to the visiting client. The problem comes in when bots don't allow the setting of permanent cookies. CF was noticing this and writing an error entry a couple times every minute, as often as the bot was hitting us.

CFID=7857398;
+expires=Thu+10-Jun-2038+13:51:18+GMT;
+CFTOKEN=1959af634c5b6ec3-96CB3B7F-17A4-A7F4-705CBC2B0F9052AA;
+expires=Thu+10-Jun-2038+13:51:18+GMT;
+JSESSIONID=2e3040ee771260d33446

The light turned on for me when I broke down the iis log entry cs(Cookie) area like this. It looks like the first item in the ';' delimited list is the cookie that was requested. If it is followed by an expires, that is a modifier to the previous item in the list. So when I saw this I realized that it was CF that was trying to set the cookie, not some renegade bot code. Notice that JSESSIONID doesn't have the expires modifier because it is always set as a session cookie and goes away when the browser is shut down (hence, not an issue for bots). I think its helpful to break the info down to simple lines, otherwise you get lost in the horizontal scrolling.

There are two solutions for this that I looked at. The first I mentioned before was filtering the bot at the ISAPI level. The second was using j2ee session management. We already had this turned on but we must have forgotten about the fact that the setclientcookies in the cfapplication tag defaults to true. Jsessionid replaces the need for cfid/cftoken though cf still seems to put cf still seems to append cfid/cftoken to the urls. Cflocation will append these variables if you have addtoken=true. We can solve our problem by simply adding the setclientcookies=false in the cfapplication tag for the app.

FYI: if you are using cfloginuser, cfid/cftoken perm cookies will still be set if in your cfapplication tag you have loginmanagement set to cookie. This seems to override the setclientcookie setting. I haven't tested it yet, but I bet if you set loginmanagement="session", you would not see the perm cookies being set. This is not a concern for us at the moment since the bot is not trying to log in, its simply hitting publically accessible data.

Additional info on J2EE session management can be found here:
How to write CFID and CFTOKEN as per-session cookies
http://kb.adobe.com/selfservice/viewContent.do?externalId=tn_17915

Hiding / Encrypting ColdFusion CFID And CFTOKEN Values
http://www.bennadel.com/blog/785-Ask-Ben-Hiding-Encrypting-ColdFusion-CFID-And-CFTOKEN-Values.htm

Friday, June 6, 2008

16 thread pile-up...

We have been having issues with one of our servers going down. The offending action seemed to be happening in the middle of the night some time. One of the symptoms was a sql table that was locked and the site would just spin when trying to access it. It was either that or the cf service was plain dead in the morning. After further review in the logs, I found these scattered throughout:
06/05 23:49:53 Error [jrpp-57] - Error Executing Database Query.[Macromedia][SQLServer JDBC Driver]Connection reset by peer: socket write error The specific sequence of files included or processed is: X:\xxx\xxx\xxx.xxx, line: 348
removeOnExceptions is true for xxx. Closed the physical connection.

followed by a bunch of

java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:116)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:425)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)
Notice that the connection to sql was being reset when CF was attempting a sql execution. It didn’t look like the same line of code every time, but there is one line that is showing up more often. While researching the error in google, I came across a couple of possible causes. One was a possible flaky network that was dropping the attempted sql. While I was checking on the sql server box for a record of possible network issues in the sql log, I found that there were some sql errors that were occurring at the exact same time as our CF woes. Here they are:
Event Type: Error
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 17066
Date: 6/5/2008
Time: 11:49:50 PM
User: N/A
Computer: xxxxxx
Description:
SQL Server Assertion: File: , line=9421 Failed Assertion = 'NULL == m_lockList.Head ()'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Event Type: Error
Event Source: MSSQLSERVER
Event Category: (2)
Event ID: 3624
Date: 6/5/2008
Time: 11:49:50 PM
User: N/A
Computer: xxxxxx
Description:
A system assertion check has failed. Check the SQL Server error log for details

Event Type: Error
Event Source: SQLSERVERAGENT
Event Category: Alert Engine
Event ID: 318
Date: 6/5/2008
Time: 11:49:55 PM
User: N/A
Computer: xxxxxx
Description:
Unable to read local eventlog (reason: The parameter is incorrect).
(Link: http://go.microsoft.com/fwlink/events.asp.
)
After looking at a couple of these errors on google, I saw one instance where it was solved by moving to SP1 for SQL2k5. We recently upgraded to SQL 2K5 and are still on SP0. There may be an issue with sp0 where it doesn’t handle locks correctly and it may be dumping there. Notice this line… SQL Server Assertion: File: <lckmgr.cpp>, line=9421 Failed Assertion = 'NULL == m_lockList.Head ()'. The lock isn’t being handled correctly for some reason.

We have opted to try to up to SP1. Think we can trust it? Its only been out since 2006 ;). We'll see what happens.

UPDATE (6/10/08): our server has been up for 2 straight week days with no issues at all. In fact, the site has been reported to be responding much faster. A lot of SP1's updates were efficiency related so things should naturally respond better after upgrading.

Thursday, June 5, 2008

CF Mail Spooler Woes...

We have been having some issues with our mail server recently. Mail hasn't been going out and this is what we noticed in the logs:

"Error","scheduler-4","06/03/08","10:33:24",,"Could not connect to SMTP host: xxx.xxx.xxx.xxx, port: 25; nested exception is: java.net.ConnectException: Connection refused: connect"
javax.mail.MessagingException: Could not connect to SMTP host: xxx.xxx.xxx.xxx, port: 25;
nested exception is:
java.net.ConnectException: Connection refused: connect
and
"Error","scheduler-1","04/02/08","03:30:56",,"A problem occurred when attempting to deliver mail. This exception was caused by: coldfusion.mail.MailSpooler$SpoolLockTimeoutException: A timeout occurred while waiting for the lock on the mail spool directory.."
From what I found, this occurs for two reasons:
  1. Bombage from sending a huge email
  2. Insufficient disk space
Reference: http://www.talkingtree.com/blog/index.cfm?mode=entry&entry=67FD4A34-50DA-0559-A042BCA588B4C15B

We did notice a message the other day claiming that we were running low on disk space.
There is a recent hotfix from Adobe that supposedly helps this that was released in April '08 that supersedes a previous hotfix in this area. That
hotfix can be found here: http://kb.adobe.com/selfservice/viewContent.do?externalId=kb402001&sliceId=1
We applied the hotfix today and cleared some disk space so we'll see happens.

Update: after getting our hands dirty today, KP found that there was a backup of 10,000+ emails in the inetpub/mailroot/queue directory. Something is getting hosed at that level. He found this by telnet'ing to the server's port 25 and sending a message. It told him that the message had been queued for delivery. He never did get the message which told us that the bottleneck was at the queue. We've got a ticket opened up to check it out.