Discuss this Interview
Can you tell us a little about yourself?
I'm Billy Newport. I'm the lead high availability architect for the WebSphere application server platform and a lead architect for WebSphere Extended Deployment (XD). I joined IBM in 2001 after 13 years of international consulting where I worked on workflow, publishing, telecommunications and investment banking projects. My goal is to make WebSphere XD a superior platform against traditional transaction processing (TP) monitors like IBM Encina, BEA Tuxedo, and normal J2EE servers for high end applications in the Telco, financial and high volume OLTP spaces.
What are some of the features now in WebSphere that you're responsible for having added?
I'm involved in various aspects of the product but have been more heavily involved with the design and/or coding of the following:
- Async beans function or application controlled threading for J2EE applications.
- Startup beans
- Pluggable Staff support framework in WBISF (WebSphere Business Integration Server Foundation) 5.1
- Scheduler / timer beans
- DRS II/Session replication in WebSphere Application Server 6.0
- High Availability Manager (HAManager) and its exploitation by messaging and transaction management among others in XD and WebSphere Application Server 6.0
- WebSphere Partition Facility in WebSphere XD
- Channel Framework in WebSphere Application Server 6.0
- Clustering enhancements in WebSphere Application Server 6.0
Your most recent work is with WebSphere XD, what is XD?
WebSphere Extended Deployment (XD) is an add-on for WebSphere Application Server Network Deployment 5.1.1 or WebSphere WBISF 5.1 (our business integration offering). It provides advanced features for administrators, such as consolidating multiple independent server farms into single farms. It also offers key features for developers, such as asymmetric clustering.
What is asymmetric clustering?
Traditional J2EE application servers work well for a large class of applications. This class can broadly be categorized as applications that run in a stateless cluster in front of a database. I call this a symmetric cluster:
- All the cluster members can perform any task at any time.
- The application is stateless.
- The application is modal which means it only performs work synchronously in response to a client request which can be received using HTTP/IIOP or JMS.
There are other applications that do not work well in such an environment, for example, an electronic trading system in a bank. Such applications typically use tricks that can greatly improve performance such as partitioning, multi-threading and write through caching. These are applications that can exploit asymmetric clustering. An asymmetric cluster is practically the opposite of a symmetric cluster:
- applications can declare named partitions at any point while it's running
- partitions are highly available uniquely named singletons and run on a single cluster member at a time
- incoming work for a partition is routed to the cluster member hosting the partition
- The application is amodal. Partitions have a lifecycle of their own and can start background threads/alarms as well as respond to incoming events whether they are IIOP/HTTP or JMS/foreign messages.
WebSphere XD offers a new set of programming API's called the "Partitioning Facility". These APIs allow applications that require an asymmetric cluster to be deployed on a J2EE server for the first time to my knowledge.
How can partitioning improve application performance?
A stateless cluster will only scale so far once the cluster members start competing for database access. If the workload is a read mostly workload then solutions like caching with optimistic locking work well. However, as the write rate increases this starts to break down because of collisions. Some workloads also require incoming work to be executed in order, for example, buy and sell orders for a stock symbol need to be processed in order of price and then in order of arrival. When the work is spread over a cluster then this is made more complex and while it can be done, it's not easy. Partitioning allows the application to partition itself, split the incoming requests in to streams corresponding to those partitions and then route incoming work for a particular event/request stream exclusively to a single partition. This incoming work needs to be classified in to partitions and the requests are then routed to the cluster member hosting that partition using either IIOP or messaging. Once the work arrives then the cluster member can aggressively cache read/write data specific to that partition as the developer knows this data can only be modified on this cluster member. The cluster member can also order the work and ensure that it's processed in the correct sequence in memory independently of any other cluster members. The database is now offloaded as it just gets writes from the cluster, all reads are satisfied using the cache in the cluster member. So long as there are more partitions than boxes then adding boxes will make this application run faster until the database ultimately becomes a bottleneck.
The other situation where async beans plus partitioning improves performance is that both these features provide low level primitives that can be used to practically build a custom application container that fully leverages the features of the application server. The J2EE specification provides a very high level set of services to an application developer. If this set of services isn’t what the application developer requires then the standard spec does not allow the developer to provide an alternative. The normal option is write a standalone J2SE application that does it or try to have features added to the next release of the commercial application server. The partitioning facility and async beans provide low level primitives that allow this code to run on top of the application server in a fully supported manner. This approach allows advanced customers to have the benefit of using a commercial application server without the normal limitations of the J2EE one size fits all philosophy.
Why might the asymmetric/partitioned model scale better than the symmetric model?
A financial application that matches buyers with sellers is an example of one type of application that we've seen. The application trades a set of financial instruments such as stocks. It tracks the buy and sell orders for each stock and attempts to match buyers with compatible sellers. It would also compare the buyers and sellers with prices from other exchanges and may route orders to an external exchange if that exchange is currently posting a better price. Orders should be processed in price order and if orders have the same price then they are processed in the order they arrived. Such a system might receive between 20,000 and 100,000 orders per day but would also receive between 500 and 2000 remote exchange prices/second (2 million prices/hour) and this volume is growing at about 40% a year. Clearly handling the prices volume is the main performance problem. Partitioned application can also better exploit SQL batching to significantly improve performance as they don't need to worry about optimistic locking collisions. The rows are only being changed on a single server, no collisions.
What kind of applications typically implemented on symmetric clusters might benefit from being refactored into asymmetric ones?
- Applications with smaller data sets that experience high request volumes and a relatively high write to read ratio. This kind of application typically doesn’t scale horizontally because of contention between cluster members even using approaches such as optimistic locking.
- Applications that require sequenced request handling where a subset of the incoming events must be processed using some sequence or order. This can be implemented more efficiently using partitioning than with approaches using database locking.
- Applications that have dynamic messaging requirements. If the set of message queues or topics used by an application changes dynamically during the time the application is deployed then a dynamic POJO message listener framework can be easily constructed with the exact threading model needed by the application rather than the normal cookie cutter approach.
- Applications that have very high incoming message rates and where it makes sense to split the incoming message feed so that a specific subset only goes to a single cluster member. We’re partitioning the incoming topics into groups using hashing or some deterministic approach and each cluster member only receives messages for topics assigned to partitions hosted by that cluster member. This cluster member can then aggressively cache state for this subset and this improve performance as well as offloads the database. This again enables horizontal scale up especially when message order is important.
Can you give an example of a large scale application and how you might re-factor it into a partitioned, asymmetric model?
Let’s take an equity order matching system as an example. The system takes ordered streams of buy/sell commands as well as best price indications from other exchanges. It must match compatible buyers against sellers or route orders to an exchange when the pricing is favorable. The system must accept the buy/sell orders as well as best price messages, and process them in order for a given tradable company. The set of companies being traded is dynamic and can change while the system is running. The application can be factored in to two parts. A static part and a dynamic part.
The static part sets up a dynamic POJO container that provides dynamic topic subscription capabilities. A dynamic topic subscription service can be built as follows. Make a couple of daemon threads that use a JMS session per thread. When a new topic subscription is required then the application provides the topic and message listener callback to this service. The service adds the subscription to one of the JMS sessions. The WebSphere specific async beans APIs allow the application developer to build this kind of service pretty easily. The commonj APIs standardized between BEA and IBM can also but they lack some APIs needed to make this service really robust, they are a subset of what’s possible in async beans. It also provides a specialized event delivery subsystem that guarantees all events for a given stream are processed in order but events from different streams can be processed in parallel. Doug (Lea) told me that he’d seen this event delivery pattern in a TP monitor called Genie before. The number of threads in use by this event delivery subsystem is tunable and specifies an upper bound on the amount of concurrency for the application.
The dynamic part is partition driven. The application tells WebSphere when the application starts that the application wants to use say 512 partitions. WebSphere then starts each of the 512 partitions on exactly one cluster member. If there were 10 cluster members then 51 partitions would activate on each cluster member. The application receives an event from WebSphere when a partition is activated in each JVM. The application then asks the database for all the symbols and state that are grouped together in this partition. This grouping can be done using historical data. The application then subscribes to all message feeds needed for the symbols assigned to that partition. The database state for the symbols (lists of buy/sell orders, best prices for exchanges, etc) is cached using a write through cache implemented using CMP. When an order or exchange price is received then it’s sent to the sequencer component and then delivered to the application logic for handling orders or prices. The developer or administrator can specify policies to decide on which machine a partition can run. For example, partitions [0..3] can be grouped and run on machines A and B, partitions [4..7] can be grouped and run on machines B and then A. The policy mechanism is very flexible and can be modified in real time by an administrator.
We can demonstrate that application running on 6 dual processor blades at around 3k tps where each transaction consists of a database transaction with 12 statements and five outgoing messages. This represents the worst case workload of the application using real time matching when every incoming exchange price indication results in a match. More blades give linear scale up. The database needs to be partitioned also both from an availability point of view and for horizontal scalability. We used a quad CPU Intel box running DB2 that used a Network Appliance F940 iSCSI server for disks. The database was running at 98% CPU load at 3.6k transactions per second and these transactions were modifications, not simple cache hits.
Some of that sounds like a container implementation.
Well, people may stare at these static and dynamic portions and claim: “Hey, we could build a container for those and save the developers from the chore of writing this code”. Clearly, we could and we may but I’d prefer to provide the primitives used above as tools for application developers and let them build these components. I call these types of component or plumbing “micro-containers”. I think the ability for developers to write micro-containers is a key new feature for OLTP middleware in order to meet the performance goals of an application in an economical manner. WebSphere provides APIs like partitioning and threading (async beans) to allow this. Micro-containers that prove them-selves to be successful patterns can be componentized and reused in other applications, be made available as open source for other customers or could be incorporated in to off the shelf middleware once they are proven. Most of the current open source lightweight containers are basically just copying what’s in J2EE but using POJOs instead of EJBs. I think we’ll start to see more interesting, really innovative containers once customers start to exploit these kinds of features in WebSphere.
What is an appropriate level of granularity for a partition?
That’s very application specific. Customers currently typically use less than 512 partitions. Partitions can be either:
- Named partitions for a specific function in the application.
Here the application may need a single thread in the cluster doing database archiving for example or summarizing audit records. This work could be done in a partition and the partitioning facility runtime would make sure it’s always running in one cluster member. These partitions correspond to conventional singletons.
- Named partitions for groups of data sets, hashing is an example. All datasets whose key hashes to X run on partition X. An example here is a stock symbol. A partition could correspond to a stock symbol and be responsible for all order matching for that symbol. This partition can cache all symbol related data and then execute the incoming orders/quotes in sequence and write any state changes through to the database to harden them. The goal here is to give each partition exclusive write rights for the subset of the data corresponding to that partition in the database.
Can partitions be used as a conventional singleton?
Yes, a named partition is simply a conventional singleton. HTTP and EJB client requests can be routed to the JVM in the cluster that hosts a specific partition. The application programmer must write an interceptor that is called on the EJB client when a remote method on a specific stateless session bean is invoked. The interceptor then examines the method name and arguments and then returns the name of the partition that the request should be routed to. So, for example, a client can call an EJB method called ‘acceptOrder’ that takes an Order object as a parameter. The interceptor is provided with the method name and argument(s) and the interceptor then returns the String stored in the Order.getCompanyName() method. The client then routes the request to the partition named with the company name. If the JVM with that company crashes or fails then WebSphere automatically fails it over to another cluster member and new requests are now routed there instead.
So the workload management component lives in the client stubs? What about work arriving using messaging?
Applications can also use JMS or non JMS messaging providers such as Tibco Rendevous or Reuters SFC to receive partition related work. When a partition is activated in a JVM then the application can subscribe using JMS or native APIs to queues or Topics for messages for that partition. When the partition is deactivated then the application can unsubscribe from the topic or queue. The async beans feature of WebSphere is critical in providing this kind of flexibility to the application and such an approach is not possible with the J2EE specification. Async beans provides a fully supported threading capability for J2EE application and allows custom threading models for receiving this work. JSR 236 is intended to standardize these APIs and fill this current gap in the J2EE programming model.
How are the partitions made highly available?
WebSphere has a new component called the High Availability Manager or HAManager. This component is included in both WebSphere XD 5.1 and WebSphere 6.0. This provides high availability services in both products. XD uses the HAManager to manage partitions and ensure that a partition only runs on exactly one cluster member. Administrators can use policies to specify preferred server lists and fail back options for partitions. For example, the administrator could say that the partition for trading IBM stock should primarily run on server A and failover to server B with manual fail back to A if A restarts after a failover. We can demonstrate failover times on the order of seconds using this technology. Policies can also be updated in real time to control/move partitions around a cluster without the need for JVM restarts.
How do you detect server failure?
We use heart beating and we also watch the health of connections between peer JVMs. If a connection to a JVM closes or it doesn’t respond to heart beats then we mark it as dead and initiate recovery.
What if the server isn’t really dead?
This is possible when a JVM pauses temporarily and doesn’t send heart beats or because of network flooding or swapping on a machine. The application can be coded to detect and tolerate split clusters using database locking or a file based database on a shared file system using leased file locks. We do this in WebSphere for transaction logging as well as our built-in messaging engine. Or, if a customer desires it, WebSphere can integrate directly with the hardware to make a very cost effective highly available system. When WebSphere detects a failed server, it can be configured to run a script whose purpose is to guarantee the server is stopped. When the script returns then WebSphere starts recovery. We have scripts that communicate with the service processor on xSeries servers as well as the xSeries blade chassis service processor and tell it to power down the blade with the suspicious JVM. Scripts can also be written to control intelligent power strips, there are power outlets with Ethernet ports and individual outlets can be switched on or off using SNMP traps typically and to take advantage of non IBM server hardware with similar capabilities. This guarantees the JVM has failed before recovery starts. If the script cannot verify the server is powered down then WebSphere doesn’t failover and we notify the administrator.
Why is it best for a partition to run on one server vs. across multiple ones?
The key to horizontal scalability is eliminating cross talk and contention between servers. Stock trading, airline/hotel reservations, batch applications are examples of such applications. The ideal situation is that partitions do not need to interact with each other at all. If this is the case then we'll get linearly scalability. All applications that use a database will experience some cross talk within the database due to index locks or latches within the database (such as those surrounding the transaction log) until the application is using a partitioned database. Application Architects should strive to stay as close as possible to this ideal as possible for best performance.
Won't the database become a bottleneck?
Of course, but this database can do more work than normal because it's just processing updates, it should be experiencing very little contention and practically no queries at all. We have seen even small 4 way boxes perform several thousand update transactions per second on internal benchmarking for customers in the financial sector. XD includes new support for a proxy data source, which enables developers to be able to use multiple databases for a single CMP per deployment. This means the data in the database itself can be partitioned. We could use two independent DB2 databases and put the read/write data from partitions A-M in database A and partitions N-Z in database B. The application can tell XD which DB2 database to use for a given transaction. This also improves availability as losing a database doesn't stop all work, it's a partial outage. Suppose database A goes down then the data from N-Z can still be processed. This provides a linearly scalable pattern for the database that’s superior from an availability and scalability viewpoint when compared with competitive database products.
You mentioned XD supporting server consolidation, how does that work?
A lot of customers tell us that they want to consolidate multiple independent server farms in to a single smaller server farm. They want to do this because most server farms are underutilized or over provisioned. The boxes are only typically running at 20%-30% capacity, if that. This is quite costly and is not flexible. For example, an application in one server farm experiences a surge in requests and maxes out the CPU capacity in that farm, while the server farm in the next room is still basically idle at 10% CPU.
XD allows administrator to define a single cluster (a node group) and then deploy multiple applications to that node group. The administrator does not tell XD how many servers in the node group should run each application. Instead, the administrator tells XD that application A should have a response time of one second, application B 500ms and application C two seconds. The administrator can also tell us that A is more important than B and C and C is more important than B. XD will then monitor the workload and dynamically decide how many instances of each application to run on the node group in order to meet these goals. If application A currently has a response time of 1.5 seconds then XD will add more application A server instances and potentially remove application B and C server instances to bring more resources to application A so it can meet its goal. XD can also predict that A will likely exceed its response time in 10 minutes based on a trend and react in anticipation of the event. This greatly simplifies the life of an administrator and allows the machines to be more efficiently used than a usual independent server set approach. XD also offers options to generate various email alerts when conditions are exceeded; for example it can restart servers when they appear to have a memory leak or after X requests.
Can you describe how this automatic provisioning works within the context of J2EE applications?
Normally, an administrator creates a cluster and then adds specific machines to that cluster. The application is then deployed to this cluster and runs on all those machines. The number of servers in the cluster is static unless the administrator manually adds/removes servers to/from the cluster. XD is different. The administrator creates a node group with all available servers. The administrator then creates a dynamic cluster for each application to be run on the shared environment represented by the node group. The administrator when specifies the goals and which applications/application URIs have which goals. XD's job is then t
- Measure the performance of the applications in the node group
- Analyze it
- Predict and compare the response times with the goals for the requests
- Execute changes to the placement of applications to the node group servers to better meet the goals
This is also called a MAPE loop from the initials. As an example, let’s say there are four machines in the node group. The administrator then adds two applications to the node group by making a dynamic cluster for each application. Application A has a service level agreement of 1 second and high priority. Application B has a service level agreement of .5 seconds and a medium priority. A is initially just running on one server. If the workload for A is small and B gets a spike and breaks the .5 second SLA then XD will automatically start application B on all 4 servers. If application A then because of the load from B breached it's SLA then because it's more important, XD will stop B running on the server with A and XD may over time start A running on all 4 servers and stop B running on 3 of them in order to allow A to meet its goal.
Any parting thoughts?
I think it’s an exciting time now for J2EE and WebSphere. Products like XD, especially with the partitioning facility and async beans (a WBI 5.1 or WebSphere 6.0 feature) and the additional features that we’re planning for XD 6.0, will redefine the types of application that Java application servers can be used for. This technology will apply to the high and low end.
Some customers want 100 transactions per second while others want 100,000 transactions per second. Some of the largest customers want less than ten transactions per second but they are very high value, mission critical transactions. Almost everybody wants very high availability levels, fast recovery times (seconds) and to be able to leverage standards based skill-sets that their programmers already mostly know rather than have to hire a team of rocket scientists and build a complete custom platform from scratch.
A small amount of custom code can make a huge performance difference for an application and WebSpheres support for micro-containers finally allows customers to do this without having to build a complete custom application server from scratch. Use what we provide and enhance it with a micro-container if you need to. Most of your application is vanilla J2EE code and a small fraction of it will likely interact with a micro-container if you needed one.
The trick is making an off the shelf platform that can deliver these kinds of expectations using economical technology both from a development and a deployment point of view. The technology we’re adding to the WebSphere J2EE platform will allow customers to do this and get the best of both worlds and hopefully will redefine the OLTP transaction monitor space.
PRINTER FRIENDLY VERSION
|