Discussions

News: Billy Newport on Asymmetric Clustering & Websphere XD

  1. Large scale systems such as electronic trading floors have performance and scalability requirements that invalidate the usual clustering & distributed caching best practices adopted in most enterprise deployments. In this TSS Tech Talk, WebSphere High Availability Lead Architect Billy Newport introduces the notion of Partitioning & Asymmetric Clustering, and explains how these concepts are supported in WebSphere's Extended Deployment (XD) product.

    Asymmetric clustering proposes an architecture that is almost the opposite of the typical stateless server farm where the entire app is replicated across machines, sometimes using distributed caching products to increase performance. In an asymmetric cluster, business logic is split into partitions, where each partition can be the sole accessor of a set of underlying data. As a result, each node in the cluster can implement its own local cache (and be the sole accessor of that data), resulting in high performance reading and writing without the need to maintain a distributed cache between cluster nodes.
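
    For readers who want a concrete picture of "sole accessor of a set of underlying data", here is a minimal, hypothetical Java sketch (names and partition count are illustrative, not part of WebSphere XD): every security hashes to exactly one partition, and only the node owning that partition ever reads or writes its cache, so no cross-node cache coherence is required.

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        public class PartitionLocalCache {
            private static final int PARTITION_COUNT = 16;   // assumed number of partitions
            private final int ownedPartition;                // the partition this node hosts
            private final Map<String, Object> cache = new ConcurrentHashMap<String, Object>();

            public PartitionLocalCache(int ownedPartition) {
                this.ownedPartition = ownedPartition;
            }

            /** Deterministic key-to-partition mapping shared by every router in the cluster. */
            public static int partitionFor(String securityId) {
                return (securityId.hashCode() & 0x7fffffff) % PARTITION_COUNT;
            }

            /** Only the owning node is ever asked for this key, so the cache stays purely local. */
            public Object get(String securityId) {
                if (partitionFor(securityId) != ownedPartition) {
                    throw new IllegalStateException("Request routed to the wrong partition");
                }
                return cache.get(securityId);
            }

            public void put(String securityId, Object value) {
                cache.put(securityId, value);                // local write, nothing to replicate
            }
        }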

    Read: Billy Newport on Asymmetric Clustering & Websphere XD

    Threaded Messages (34)

  2. Most of what is described sounds like standard OMS techniques. It's fairly easy to partition the OMS (order management system) by either ticker/CUSIP or by exchange, like NYSE, LSE, Nasdaq, etc. The tricky part about using this approach is routing the messages efficiently and having efficient message filters.

    For example, say all the bid/buy transactions are sent through JMS. If the filters are written such that there is one filter class per security, performance will go down the tubes. Very quickly, you end up with a situation where there are 20K filters and every message has to be scanned. Once the message gets to the correct partition, the system has to efficiently match buyers with sellers. It gets even more complex when you consider that the OMS will sort and aggregate bid/buy orders for the same security.

    Finding the right combination of orders to optimize trade execution is rather tricky.
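
    A rough sketch of the usual mitigation, i.e. one coarse JMS selector per partition instead of one filter per security (the partition property, names and counts here are made up for illustration):

        import javax.jms.Connection;
        import javax.jms.ConnectionFactory;
        import javax.jms.JMSException;
        import javax.jms.Message;
        import javax.jms.MessageConsumer;
        import javax.jms.MessageListener;
        import javax.jms.Session;
        import javax.jms.Topic;

        public class PartitionedOrderSubscriber {
            private static final int PARTITION_COUNT = 256;

            /** Producers stamp each order with setIntProperty("partition", partitionFor(cusip)). */
            public static int partitionFor(String cusip) {
                return (cusip.hashCode() & 0x7fffffff) % PARTITION_COUNT;
            }

            /** Each node subscribes only to the partitions it owns: a handful of selectors, not 20K. */
            public void subscribe(ConnectionFactory factory, Topic orders, int[] ownedPartitions)
                    throws JMSException {
                Connection connection = factory.createConnection();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                for (int i = 0; i < ownedPartitions.length; i++) {
                    MessageConsumer consumer =
                            session.createConsumer(orders, "partition = " + ownedPartitions[i]);
                    consumer.setMessageListener(new MessageListener() {
                        public void onMessage(Message message) {
                            // match buyers and sellers for the securities in this partition
                        }
                    });
                }
                connection.start();
            }
        }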

    peter
  3. Why is this new?[ Go to top ]

    People have been doing this for ages. Why is it special? Just because someone calls it a product?

    People have been splitting web apps, e.g. serving static content from one server farm and dynamic content from another. Another example is separating the order entry system cluster from the customer support cluster, so even if you can't take calls, orders are still coming in. With web-based systems it is easier to partition the application to support this deployment.

    I wonder how this is different from what people have been doing for ages. It has nothing to do with Java/J2EE; it is a standard deployment practice for any kind of system.
  4. Because...[ Go to top ]

    You are describing some standard multi-tier setups that can be done with any application server, even one that isn't clustered, i.e. you just split the front end and back end. Clustering just provides load balancing over the live servers in these typical scenarios.
    This isn't what XD partitioning gives you. WebSphere and basically all application server products clearly cover the scenarios you describe with existing technology.
    XD lets a developer declare dynamic named partitions for a single clustered application, i.e. a single tier. Each partition is a named 'live' service residing in one cluster member. If a cluster member fails then all partitions hosted by the now failed cluster member are actively reassigned within seconds on the surviving cluster members as soon as the failure is detected. The application programmer is involved and is informed of these events. The developer can use partitions and the threading APIs supported by WebSphere to develop advanced applications as a result.
    This is very different than the pretty standard scenarios that you've described in your post.

    Thanks
    Billy
  5. Because...[ Go to top ]

    If a cluster member fails then all partitions hosted by the now failed cluster member are actively reassigned within seconds on the surviving cluster members as soon as the failure is detected.

    Supposing the asymmetric cluster is running an OMS, and you have a one-to-one partition-machine mapping (for the sake of simplicity), and every partition manages the order book for one security, I was wondering how you recover the order book that was hosted on the failed cluster member?

    Regards,
    Horia
  6. Recovery[ Go to top ]

    The application has a callback that basically has two methods, load and unload. Each event also receives the partition name.

    Let's say we just started and partition A is on S1 and partition B is on S2.

    The runtime will invoke load("A") on S1 and load("B") on S2. S1 and S2 now start processing work for A and B and, for our purposes here, they use a write-through cache to a database to harden state.

    If S1 fails then the runtime will invoke load("A") on S2. So recovery doesn't involve a JVM restart; it's a hot recovery in this sense. The time from failure to load("A") on S2 depends on the HAManager tuning in WAS, but it's around 6 seconds. The main issue with such fast recovery times is getting the database locks released in time, due to TCP stack issues.

    The application can then recover partition A. The application will preload the write-through cache for A from the database and then continue processing work for A after/while recovering anything pending on A. Normally, a partition satisfies all reads using the cache and writes all modifications through to the database to harden them.
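
    For concreteness, a rough sketch of what such a load/unload callback might look like (the interface and names below are hypothetical, not the actual XD API):

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Hypothetical shape of the callback contract described above; not the XD API.
        interface PartitionLifecycle {
            void load(String partitionName);    // this JVM becomes the active owner
            void unload(String partitionName);  // ownership is being taken away
        }

        public class OrderBookPartitions implements PartitionLifecycle {
            // one write-through cache per partition this JVM currently owns
            private final Map<String, Map<String, Object>> caches =
                    new ConcurrentHashMap<String, Map<String, Object>>();

            public void load(String partitionName) {
                Map<String, Object> cache = new ConcurrentHashMap<String, Object>();
                // Preload the cache for this partition from the database, then start
                // accepting work. After a failover, load("A") on the survivor re-reads
                // whatever was already hardened to the database for partition A.
                caches.put(partitionName, cache);
            }

            public void unload(String partitionName) {
                caches.remove(partitionName);   // state is already hardened in the database
            }
        }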

    The administrator can specify where a partition should run using a preferred server list with an optional failback setting. The policy can be set per app, per group of partitions or per partition.

    Billy
  7. Recovery[ Go to top ]

    Well, the DB involvement makes it clear enough. Distributed (or for that matter replicated) caches used mainly in read-only mode in web app deployments can take some load off the DB.

    An OMS-type app kind of requires asymmetric processing, since the order book of one security is managed serially and is stateful, and can be replicated maybe for read-only purposes. AFAIK OMSs are typically in-memory processes storing state asynchronously and recovering using log replay in case of a crash.

    In your architecture you are talking about write-through caches. Any thoughts about softening the possible DB bottleneck?

    Regards,
    Horia
  8. Recovery[ Go to top ]


    This will sound expensive, but when it matters, businesses will have hot backups. Some of the bigger shops have 2 levels of backup that are exact mirrors of the production environment. Say partition A goes down, the hot backup for A immediately kicks in and starts working. I don't know the exact details of how this is set up at the lowest levels, but it is a common practice. Usually, there are also monitors that will send out fire alarms. If partition A really goes down, a bunch of people scramble to figure out what went wrong and get the system back up as fast as they can :)

    peter
  9. Recovery[ Go to top ]

    XD doesn't mandate a particular state recovery strategy. At the high end, I've seen naked iron systems using 3-way synchronous memory replication as Peter mentions. XD can exploit this kind of system if it's available. Applications using XD partitioning can take advantage of such setups also. It's not coupled directly to database technology.

    If the customer wants to use a database then we've added support for partitioned databases to XD for these scenarios. Here, the application can tell XD which database to use for a transaction when it's using CMP beans as an OR mapper. This allows the database to be scaled by using multiple databases for different slices of the data. It also prevents a database failure from stopping the complete system. If a database fails then the application can continue processing requests for the data stored in the remaining database boxes while the failed database is restarted. This approach turns a DB failure into a partial outage rather than the normal total outage. Even so, we've seen applications drive a 4-way database to over 3k tps, which is a substantial load for the majority of customers. Given that the DB can be partitioned and scaled, the limit is basically how many boxes can be deployed.

    Or XD could work with third-party replicated memory products rather than a database; it's a choice the developer can make.
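
    As a rough illustration of the partitioned-database option (the JNDI names and hash routing below are assumptions for the sketch, not the XD programming model):

        import java.sql.Connection;
        import java.sql.SQLException;
        import javax.naming.InitialContext;
        import javax.naming.NamingException;
        import javax.sql.DataSource;

        public class PartitionedDataSourceRouter {
            private static final int DATABASE_COUNT = 4;   // e.g. a 4-way partitioned database

            /** Map a partition name onto one of the independent database instances. */
            private String jndiNameFor(String partitionName) {
                int db = (partitionName.hashCode() & 0x7fffffff) % DATABASE_COUNT;
                return "jdbc/orders_db_" + db;             // assumed JNDI naming convention
            }

            public Connection connectionFor(String partitionName)
                    throws NamingException, SQLException {
                InitialContext ctx = new InitialContext();
                DataSource ds = (DataSource) ctx.lookup(jndiNameFor(partitionName));
                // If one database box fails, only the partitions mapped to it are affected;
                // the remaining databases keep serving their slices of the data.
                return ds.getConnection();
            }
        }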

    It's important to realise that XD can be used for more than just trading systems. For example, suppose a clustered application needs to consolidate the results from each JVM on a machine and process them in a single file. XD has node-scoped partitions: when there are at least 2 JVMs on a particular node, XD activates the node-scoped partition once per node. The JVM with the active partition can aggregate the data and write it to a node-specific file. The application can also use a cluster-scoped partition to aggregate the files from all the nodes to make a cluster-level result. We'd just need a shared file system and then let XD worry about making sure there is always a per-node partition for node aggregation and a cluster-level partition for cluster aggregation.

    Billy
  10. Because...[ Go to top ]

    My goal is to make WebSphere XD a superior platform to traditional transaction processing (TP) monitors like IBM Encina and BEA Tuxedo, and to normal J2EE servers, for high-end applications in the telco, financial and high-volume OLTP spaces.

    Why integrate these features into J2EE? It sounds like you're just adding a bunch of OLTP features to a container that was designed for interactive and DSS type systems. Are you proposing that both of these types of computing can take place in the same tier?

    -- Les Walker
  11. Yes[ Go to top ]

    J2EE infrastructure like any infrastructure is a significant investment for a customer. The license costs are the tip of the iceberg once you factor in training, staff costs, operations, etc. We're looking at letting customers leverage this investment and use it for applications that normally wouldn't be appropriate.

    An application built using this technology is 95% commodity skills for the business logic and operations side, whereas a bespoke system or one built on a legacy platform is clearly more expensive to develop, deploy and maintain. Plus, you've got two platforms now, not one. This, for me, is what makes it very attractive to customers that already have these skills in house and want to leverage them to build OLTP applications.

    Billy
  12. Some more[ Go to top ]

    Just thinking about this more and what I want to say. We've put a lot of work into WebSphere 6.0 from the high availability side of things. We can hot recover in-doubt transactions in around 10-15 seconds when WebSphere is used with NFS v4 drives for transaction log storage, and the platform messaging (JMS) also recovers in similar timeframes.

    Our session replication technology is substantially more performant (no numbers here but you'll be amazed) than it was in 5.x and has a higher quality of service despite the performance improvement.

    When WAS is used with a shared file system, and the messaging uses a remote database for message persistence, then WAS requires NO EXTRA software to make it very highly available. This means you will likely not need to have SAN LUNs, dual ported disks, or buy Veritas for each WAS box, etc. It's clearly significantly simpler to deploy an HA setup in this topology when compared with before because of the much lower complexity.

    We didn't make all these changes to see WebSphere relegated to just doing interactive/DSS. These changes, plus what's in XD now and in flight for when XD 6.0 ships, show that we're very serious about enabling customers to leverage their investment and get more out of it. I think these features are a big step up from what traditional OLTP middleware offers, and when you add in the standards based programming model etc., I think it's a great value.

    Sorry if this sounds like marketecture, it's not the intention, I just want to articulate the value proposition.

    Billy
  13. Some more[ Go to top ]

    I think these features are a big step up from what traditional OLTP middleware offers and when you add in the standards based programming model etc, I think it's a great value.

    No problem with the first part of this statement. These features are clearly pain-relief for the programmers that are trying to build trading systems in J2EE. My standard way of dealing with this kind of thing is to create huge-grained entity beans crammed with business logic and deployed to a different tier so you can use the app server's call routing. Partitions are definitely a step up.

    Integrating these tools is also a good thing. Under torture I'll even admit that a standard programming model is a good thing.

    I'm going to get myself in trouble here, but where I differ is with the value of the J2EE programming standard when it's applied to high-performance OLTP applications. In my opinion OLTP systems have to be extremely cohesive. If my controller is enlisting a component in a transaction I have to do more than make sure that the API is compatible. Everything that component does has to follow the same design principles that mine does -- things like locking resources in a canonical order and not breaking up the transaction. The concepts of "application assembly", "declarative transaction management", etc. in J2EE are at odds with this cohesiveness.
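
    (To make the "canonical order" point concrete, a toy example not tied to any particular container: every component must acquire locks in the same globally agreed order, here ascending account id, so two transactions can never deadlock on each other.)

        import java.util.Arrays;
        import java.util.concurrent.locks.ReentrantLock;

        public class CanonicalLockOrder {
            private final ReentrantLock[] accountLocks;

            public CanonicalLockOrder(int accountCount) {
                accountLocks = new ReentrantLock[accountCount];
                for (int i = 0; i < accountCount; i++) {
                    accountLocks[i] = new ReentrantLock();
                }
            }

            /** Lock both accounts for a transfer, always in ascending id order. */
            public void transfer(int fromAccount, int toAccount, Runnable body) {
                int[] order = {fromAccount, toAccount};
                Arrays.sort(order);                 // canonical order, regardless of caller
                accountLocks[order[0]].lock();
                accountLocks[order[1]].lock();
                try {
                    body.run();
                } finally {
                    accountLocks[order[1]].unlock();
                    accountLocks[order[0]].unlock();
                }
            }
        }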

    As well, the container doesn't provide the isolation that I believe is necessary to ensure stable performance. Seldom will you find a trading system sharing a database instance with a reporting system; why would you let it share a container?

    I'd like to see someone go back to the first principles of designing and building OLTP systems and come up with a container whose programming model encourages cohesiveness rather than loose-coupling.

    -- Les Walker
  14. Some more[ Go to top ]

    I agree with most of what you say. There is nothing to stop a customer running their app in XD, using partitions and then using a POJO-like model for the app behind the partitions. That can be bootstrapped using Spring etc., which is what I'm looking at as we speak. They can also choose to use a POJO-based OR mapper rather than CMP, again because of path length, although while POJO OR mappers can generate SQL competitive with ours, path-length-wise we're still pretty fast, so it's not as clear cut as you might think w.r.t. path length.

    I'm not someone that advocates EJBs or nothing. Our own starter applications, written for customers (we don't use Spring in them for legal reasons), use EJBs only where it makes sense and not as a component model. I have a session bean for my partition callbacks because that's how it's designed (by me), and I use CMPs for persistence and a lot of POJOs around them for the microcontainers. I could easily see someone using another persister rather than CMP, and I can quite easily see myself using Spring to wire it all together, with my initial session bean with the callbacks bootstrapping the whole thing.

    So, in a way, we're trying to do what you say. We have a standard and we're adding new primitives that allow you, the customer, to use what makes sense in your application with the hybrid programming model.
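
    A hypothetical sketch of that hybrid wiring, reusing the made-up PartitionLifecycle interface from the recovery sketch earlier in the thread (the bean names and the OrderBookService interface are illustrative, not product APIs):

        import org.springframework.context.ApplicationContext;
        import org.springframework.context.support.ClassPathXmlApplicationContext;

        // In practice the callback is a session bean; here a plain class stands in for it,
        // bootstrapping a Spring context and handing each partition's work to POJOs.
        public class SpringBootstrappedCallback implements PartitionLifecycle {
            private final ApplicationContext spring =
                    new ClassPathXmlApplicationContext("partition-services.xml");

            public void load(String partitionName) {
                // The POJO business logic and its persistence strategy (CMP, a POJO OR
                // mapper, or anything else) are wired together in the Spring context.
                OrderBookService service = (OrderBookService) spring.getBean("orderBookService");
                service.activate(partitionName);
            }

            public void unload(String partitionName) {
                OrderBookService service = (OrderBookService) spring.getBean("orderBookService");
                service.passivate(partitionName);
            }
        }

        // Plain interface for the POJO service behind the partition (illustrative).
        interface OrderBookService {
            void activate(String partitionName);
            void passivate(String partitionName);
        }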

    Thanks
    Billy
  15. Some more[ Go to top ]

    ... we don't use Spring in them for legal reasons

    Hmmm, care to elaborate on this? Spring is Apache/BSD, isn't it? Or should the rest of us be careful of any lurking clauses when using Spring?
  16. Spring[ Go to top ]

    Nope,
    It's just company policy. We need to seek permission to use any software sourced outside the company, and it's a lot of hassle getting clearance to ship things we write that use open source software. The Apache 2 license is probably the best one out there. We usually can't touch or even view docs for open source with the other licenses.

    Spring is an excellent piece of software, very useful and I've seen quite a few customers who use it and are happy with it.

    Billy
  17. Some more[ Go to top ]

    I'd like to see someone go back to the first principles of designing and building OLTP systems and come up with a container whose programming model encourages cohesiveness rather than loose-coupling.

    Not so sure about that. Loose-coupling gives you the power to scale. Cohesiveness enforcement on partitions is enough.

    Regards,
    Horia
  18. Sharing containers[ Go to top ]

    Seldom will you find a trading system sharing a database instance with a reporting system, why would you let it share a container? I'd like to see someone go back to the first principles of designing and building OLTP systems and come up with a container whose programming model encourages cohesiveness rather than loose-coupling. -- Les Walker

    I don't think I'm saying share the container or database. If the OLTP application used database partitioning in combination with application-directed transaction routing, then I'd recommend a product like DB2 Information Integrator to provide a query capability across the independent partitioned databases. I'd use independent database instances purely because it avoids a failure in one impacting the remaining databases, a problem with current clustered databases.

    The reporting and OLTP applications can, of course, run in separate clusters on separate hardware. The reporting software could use a large cache subscribed to change events generated by the OLTP system, or it could be more passive and just trawl the info from an async replica of the primary OLTP databases; lots of ways to slice and dice this depending on what's allowed.

    Billy
  19. First cluster and then don't.[ Go to top ]

    First convince your customers that clustering is good and make money. Then convince them it is not and make some more money.
  20. Peter's right, of course. Sounds like the voice of experience :)

    The trick, as he says, is designing the application so that it works taking these limitations into account. You can't just naively subscribe to 20k JMS subscriptions and then wonder why your application is slow. Floyd had asked me a question which we didn't publish, on what I think separates enterprise architects from web developers, and realizing points such as Peter's demonstrates it. Experience is an able teacher and most of us have the scars to prove it.

    Maybe we use 256 partitions and publish prices for a CUSIP onto a topic for the partition hosting the CUSIP. (A CUSIP is a financial term for a tradable security.) Now, even with 9k CUSIPs there are only 256 subscriptions to filter. You need to look at the complete stack in use and optimise the application architecture for the complete stack. You can't write the application naively, ignoring the reality of the capabilities of the middleware around the application server, and expect a reasonable result. This applies equally to vanilla J2EE applications that use just an application server and a database. Applications that are poorly coded can inefficiently use a database on even a very fast machine. You need to know how to use the database efficiently and write the application accordingly. The JMS problems Peter mentions are a different manifestation of the same thing.
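
    A publisher-side sketch of that scheme (the topic naming and hash are assumptions for illustration; the point is 256 subscriptions instead of one per CUSIP):

        import javax.jms.Connection;
        import javax.jms.ConnectionFactory;
        import javax.jms.JMSException;
        import javax.jms.MessageProducer;
        import javax.jms.Session;
        import javax.jms.TextMessage;

        public class PricePublisher {
            private static final int PARTITION_COUNT = 256;
            private final Session session;

            public PricePublisher(ConnectionFactory factory) throws JMSException {
                Connection connection = factory.createConnection();
                session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            }

            public void publishPrice(String cusip, double price) throws JMSException {
                int partition = (cusip.hashCode() & 0x7fffffff) % PARTITION_COUNT;
                // One topic per partition, e.g. "prices.partition.42" (assumed convention),
                // so subscribers deal with at most 256 topics regardless of CUSIP count.
                MessageProducer producer =
                        session.createProducer(session.createTopic("prices.partition." + partition));
                TextMessage message = session.createTextMessage(cusip + "=" + price);
                message.setStringProperty("cusip", cusip);   // still available for finer filtering
                producer.send(message);
                producer.close();
            }
        }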

    We've written an application that is available to customers demonstrating an OMS written using the XD technology and it solves the problems Peter mentions.

    Nice post Peter.
    Billy
  21. More like brain damage[ Go to top ]


    Thanks for the compliment, but it's more a result of banging my head against the problem long enough to realize "doh, the problem is a lot harder than I thought". That and lots and lots of benchmarking, profiling and head scratching. Good coffee helps too, but these are hard problems to solve. I've been focusing on pre-trade compliance the last few years, so I've had to face these problems head on. I still don't know much, but I have learned enough to realize just how hard it is to scale this stuff to massive loads.

    peter
  22. More like a bad headache now[ Go to top ]

    And having banged my head on the same walls pre-IBM (it's only been 3 [long] years since I joined :P), I'm now trying to build advanced middleware to provide a bigger/better toolkit for the customers building these applications. My goal is to make XD the premier platform for building advanced applications using J2EE technology; it's still early days and there is more in the pipeline.

    Billy
  23. lots of patience[ Go to top ]


    I find lots of patience and listening very carefully to domain experts helps a lot in these cases. Still very hard problems, but listening to domain experts about how trading really occurs helps reveal areas where the heuristics lend themselves to partitioning or staging. Once the stages or partitions have been identified, it becomes easier to tackle each piece. Now if only I had time to try out the latest WebSphere. It's good to see these techniques getting standardized and included in WebSphere.

    peter
  24. Nice, insightful interview, Billy. You should head a JSR for asymmetric clustering.
  25. Thanks[ Go to top ]

    Let's get JSR 236/237 out of the way first...
  26. Thanks[ Go to top ]

    Let's get JSR 236/237 out of the way first...

    +1 :-)
  27. JSR 236[ Go to top ]

    Let's get JSR 236/237 out of the way first...

    What is the deal with JSR 236 anyway? Are the new EJB timers a result of this? Is there a non-EJB API?
  28. JSR 236[ Go to top ]

    Not sure what you mean by the new EJB timers. There isn't anything in the JSR that's EJB-specific. The original async beans APIs in WebSphere BI 5.1 or WebSphere 6.0 are POJO based but leverage the J2EE environment of the J2EE components (such as servlets and EJBs) that use them. The timers in the JSR are transient, in-memory timers only. Normal EJB timers are transactional/persistent timers.
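
    A rough illustration of that distinction (this is standard EJB 2.1 code plus a plain java.util.Timer, not the JSR 236 API itself): the container-managed timer is transactional and survives a restart, whereas a transient in-memory timer simply disappears with the JVM.

        import javax.ejb.SessionBean;
        import javax.ejb.SessionContext;
        import javax.ejb.TimedObject;
        import javax.ejb.Timer;

        public abstract class ReminderBean implements SessionBean, TimedObject {
            private SessionContext context;

            public void setSessionContext(SessionContext ctx) {
                this.context = ctx;
            }

            /** Persistent, transactional timer: recreated by the container after a crash. */
            public void schedulePersistentReminder(long delayMillis) {
                context.getTimerService().createTimer(delayMillis, "persistent reminder");
            }

            /** Transient alternative: gone if this JVM dies (closer in spirit to the JSR's timers). */
            public void scheduleTransientReminder(long delayMillis, final Runnable task) {
                new java.util.Timer(true).schedule(new java.util.TimerTask() {
                    public void run() { task.run(); }
                }, delayMillis);
            }

            public void ejbTimeout(Timer timer) {
                // fires for the persistent timer, inside the container-managed context
            }
        }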

    Billy
  29. More on this subject[ Go to top ]

    On www.websphere-world.com is a link to an IBM course about WebSphere XD:

    http://www-128.ibm.com/developerworks/websphere/library/tutorials/dl/sw729/

    kind regards,

    Tom
  30. This has to be one of the most interesting and informative interviews I've read on TSS this year.

    Thanks Billy!

    Thomas
  31. *observation*
    I see the focus of these posts certainly pertains to the most relevant use of WebSphere Extended Deployment application domains. I was hoping to switch gears and generate some feedback that may better reflect the situations of more TSS readers, and more specifically the situations of many TSS readers down the road.

    *scene*
    I have recently come off of a significant J2EE project involving everything including a rich GUI, web client, mobile, and integration needs, into a position leading a large manufacturing company's recent transition from mainframes to a distributed computing architecture.

    *own conclusion of Billy's great interview*
    It seems to me that WebSphere XD is an example of a tool for an application domain that in the past might have been mainframe computing and still required the traditional processing power of a "super-huge-box", but that now takes advantage of an existing toolset within the J2EE definition. The idea being that to make the system scalable, separate systems are used, rather than a bigger and bigger one, and therefore partitioning and asynchronous messaging best take advantage of the infrastructure as a whole.

    *situation*
    Our goal is to double our sales revenue in half a dozen years. It occurred to me that this means if everything stays the same (and it won't), about doubling the incoming request messages. Our request messages come from various systems including B2B integration applications (inhouse legacy mainframe system and J2EE integration systems) as well as a realtime connection to our extremely large CRM system, that is not a mainframe but has enough "modules" to soon warrant integration systems of its own.

    *questions*
    Knowing that there are queues on these systems that are effectively on a single partition has prompted me to realize that these will become bottlenecks long before the new distributed data stores become troublesome; perhaps that is what happened to you guys a couple of years ago?

    How does the community anticipate WebSphere XD and other J2EE servers becoming the platform for these kinds of applications? Anyone else see themselves in a similar situation - bunch of smaller systems that will be taxed by the means that connect them when the requests spike?

    I can already see my recently acquired integration systems becoming large queue systems that route messages through several partition spaces, and I anticipate issues with that; I am asking myself how that differs from what I just read?!

    Anyone already pushing the performance scalability of these kinds of systems?

    Just my thoughts!
    ~tim
  32. Very good information[ Go to top ]

    I like this idea of "Asymmetric Clustering". Great work.
  33. Basically this asymmetric clustering is not new. It does solve certain kinds of problems. I implemented this functionality in the BPM component so that the load can be better managed depending on where it is deployed. The load patterns on end points in a collaborative environment like SRM/marketplaces are different than at the hub. It also had a way to tag the messages/requests using rules so that message processing can be partitioned. This, in conjunction with object caching in the JVM, gave a good scalability solution and reduced a lot of round trips to the DB, as well as cache synchronization, growth and invalidation related spikes.
  34. OMS performance[ Go to top ]

    Hello,
    I read the article with great interest. I am working here in KSA with a foreign bank and we are using a locally built OMS with an internet share trading interface. Lately we are facing a lot of scalability issues with our application. The stress test results show that the application can take 200 concurrent users on HP-UX (Oracle 9i Application Server). The server has 4 CPUs with 8 GB RAM; on the internet interface the application is running IBM WebSphere 5.1 with only 100 users per machine (Intel, 4 processors and 8 GB RAM), and there are 6 servers in the WebSphere cluster. And we are facing problems with only 5000 registered users.
    Finally the backend database is running on HP-UX 8 CPU with 12GB RAM.
    I am interested if someone has any experience with OMS and can provide me the benchmark figures for standard OMS performance.
    Please forgive me if this is out of scope of this forum.
    regards
    ali
    alikhan@himindz.net
  35. scale up or out[ Go to top ]


    From my knowledge, the high-end OMS systems use rich clients written in Java or C++. Trading applications using a Web UI usually have much larger latencies and delays. It is much easier to scale and support a larger number of concurrent users if you're using messaging and filter the traffic in several stages. Depending on the number of data feeds, the message rate will probably range from 2-10K/second. Feeding these results to a web tier is going to be rather challenging. How you filter the messages is also going to be a bottleneck. Keeping in mind that the most popular stocks account for a big percentage of the trading volume, it's desirable to partition the deployment so that one server doesn't get 80% of the traffic while the other servers handle the remaining 20%. To make this approach work, your architecture needs to be designed that way. If it isn't, there's not much you can do besides throw hardware at the problem.
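
    A toy sketch of that kind of hot-symbol-aware partitioning (all names are hypothetical): pin the handful of heavily traded symbols to dedicated partitions and hash the long tail, so one server doesn't end up with 80% of the traffic.

        import java.util.HashMap;
        import java.util.Map;

        public class HotAwarePartitioner {
            private final int partitionCount;
            private final Map<String, Integer> dedicated = new HashMap<String, Integer>();

            public HotAwarePartitioner(int partitionCount, String[] hotSymbols) {
                this.partitionCount = partitionCount;
                // Pin each hot symbol to its own partition (assumes hotSymbols.length < partitionCount).
                for (int i = 0; i < hotSymbols.length; i++) {
                    dedicated.put(hotSymbols[i], Integer.valueOf(i));
                }
            }

            public int partitionFor(String symbol) {
                Integer pinned = dedicated.get(symbol);
                if (pinned != null) {
                    return pinned.intValue();
                }
                // The long tail of thinly traded symbols shares the remaining partitions.
                int remaining = partitionCount - dedicated.size();
                return dedicated.size() + (symbol.hashCode() & 0x7fffffff) % remaining;
            }
        }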

    good luck

    peter