One of Mark Thomas's roles within the Apache Software Foundation (ASF) is managing the issues.apache.org servers, which host the foundation's Bugzilla and JIRA issue trackers. A few months ago, while looking at the JIRA management interface, Mark discovered just how heavily web crawlers were loading these servers:
I noticed that we were seeing around 100,000 concurrent sessions. Given that there are only 60,000 registered users and fewer than 5,000 active users in any month, this number appeared extremely inflated. After a bit of investigation, the access logs revealed that when many of the web crawlers (e.g., googlebot, bingbot, etc.) were crawling the JIRA site, they were creating a new session for every request. For our JIRA instance, this meant that about 95% of the open sessions were left over from a bot making a single request. For instance, a bot requesting 100 pages would open 100 sessions. Each one of these sessions would hang around in memory for about 4 hours, chewing up tremendous memory resources on the server.
Quite understandably, web crawlers, or bots, do not send back cookies or otherwise maintain a session while they crawl web pages and index the results. As a result, for applications such as JIRA that rely on sessions, every request a crawler makes appears to come from a fresh visitor and creates a new session. The number of concurrent sessions can escalate very quickly and consume valuable system resources, especially memory.
To improve performance, Thomas designed a new valve, available only in Tomcat 7, that ensures each series of requests by a crawler consumes only a single session. In a post published today on the appropriately named Crawler Session Manager Valve, he describes how the new valve works:
To do this, Tomcat uses a regular expression to check whether the user agent HTTP request header of an incoming request belongs to a known crawler (by default the pattern is .*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*), and it keeps a note of the IP addresses those requests came from as well as the session ID last used by each.
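As a rough illustration of the user agent check, the following minimal sketch applies the default pattern with java.util.regex. The class and method names are hypothetical and are not taken from Tomcat's source. The pattern is written with a leading `.*`, as the Tomcat valve documentation gives it for the crawlerUserAgents attribute.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tomcat: shows how a user agent header
// can be tested against the valve's documented default pattern.
public class CrawlerUserAgentCheck {

    // Default value documented for the valve's crawlerUserAgents attribute.
    static final Pattern CRAWLER_USER_AGENTS =
            Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    // Returns true when the whole header value matches the crawler pattern.
    static boolean isCrawler(String userAgent) {
        return userAgent != null
                && CRAWLER_USER_AGENTS.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        // A crawler-style header: contains "bot", so it matches.
        System.out.println(isCrawler(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
        // A browser-style header: no alternative in the pattern matches.
        System.out.println(isCrawler("Mozilla/5.0 (Windows NT 6.1) Firefox/4.0"));
    }
}
```

Because `matches()` requires the entire header to match, each alternative in the pattern is wrapped in `.*` on both sides so that it behaves like a "contains" test.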
When a crawler first accesses the site, a new session is created as part of that first request. When it requests a second page, however, the Crawler Session Manager Valve recognizes the crawler from its user agent header, matches it to the IP address and inserts the previous session ID into the request. Thus, the crawler only ever opens a single session.
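The session reuse described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the valve's actual implementation: the class, its methods and the session-N identifiers are all hypothetical, sessions are represented by plain strings, every non-crawler request is treated as cookie-less, and the expiry of stale entries is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical model of the idea: remember one session ID per crawler IP
// address and hand it back on later requests from the same address.
public class CrawlerSessionSketch {

    private static final Pattern CRAWLERS =
            Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");

    // Crawler client IP address -> session ID created for its first request.
    private final Map<String, String> sessionIdByClientIp = new ConcurrentHashMap<>();
    private final AtomicInteger sessionsCreated = new AtomicInteger();

    // Returns the session ID that would serve this (cookie-less) request.
    String handleRequest(String clientIp, String userAgent) {
        boolean crawler = userAgent != null && CRAWLERS.matcher(userAgent).matches();
        if (!crawler) {
            // Simplification: a request without a session cookie gets a new session.
            return createSession();
        }
        // First crawler request from this IP creates a session; later ones reuse it.
        return sessionIdByClientIp.computeIfAbsent(clientIp, ip -> createSession());
    }

    private String createSession() {
        return "session-" + sessionsCreated.incrementAndGet();
    }

    int sessionsCreated() {
        return sessionsCreated.get();
    }

    public static void main(String[] args) {
        CrawlerSessionSketch sketch = new CrawlerSessionSketch();
        for (int i = 0; i < 100; i++) {
            sketch.handleRequest("192.0.2.10", "Googlebot/2.1");
        }
        // 100 crawler requests from one address share one session.
        System.out.println(sketch.sessionsCreated()); // prints 1
    }
}
```

Keying the map on the IP address rather than the user agent keeps separate crawler hosts in separate sessions, while still collapsing each host's requests into one.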
For the ASF's JIRA issue tracker, implementing this valve alone cut the number of concurrent sessions from around 100,000 to 5,000, significantly reducing resource consumption.
The new valve is only available in Tomcat 7, and is not turned on by default. For more insight on turning the valve on and configuring it, see Mark Thomas's full post.
As an interesting little footnote, Thomas also comments on the ASF's experience of running JIRA on Tomcat 7, a combination not yet supported by Atlassian.
Special note: Although JIRA is only certified to run on Tomcat 5 and Tomcat 6, we actually run it on the latest Tomcat 7 release. Running JIRA on Tomcat 7 has not caused any issues which, as an aside, is a testament to how well Tomcat 7 and the Servlet 3.0 specification have been engineered for backwards compatibility.