Discussions

News: Announcing Terracotta - Java clustering & caching without APIs

  1. Terracotta is a new VC-funded startup with a new and very different approach to clustering and caching. Current industry approaches either offer a new API to learn in order to cache data, or operate behind programming models that can affect application design, such as JCache or EJB CMP. Terracotta's technology aims to eliminate I/O bottlenecks in Java--both in communication between JVMs (by virtually sharing entire graphs of Java objects, at the field-change level, across a set of JVMs) and in communication with the database (by wrapping the native JDBC driver with a fault-tolerant caching server)--offering a transparent clustering solution for Java without proprietary APIs.
     
     Effectively, Terracotta can make a set of JVMs in a cluster appear as one JVM (even to the database), so a developer could attach an ordinary collection to an ordinary singleton and have it be accessible across the cluster. Even concurrency constructs such as the 'synchronized' keyword keep their meaning across a cluster.
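     To make that concrete, here is a minimal sketch (our illustration, not Terracotta sample code) of attaching an ordinary collection to an ordinary singleton. Note the absence of any clustering API; under Terracotta the 'entries' field would simply be declared a shared root in configuration:

       import java.util.HashMap;
       import java.util.Map;

       public class Registry {
           private static final Registry INSTANCE = new Registry();

           // Would be declared a shared root in Terracotta's configuration;
           // the map and everything reachable from it would then be visible
           // to every JVM in the cluster.
           private final Map entries = new HashMap();

           public static Registry getInstance() { return INSTANCE; }

           public void put(String key, Object value) {
               synchronized (entries) {   // under DSO, a cluster-wide lock
                   entries.put(key, value);
               }
           }

           public Object get(String key) {
               synchronized (entries) {
                   return entries.get(key);
               }
           }
       }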

     Terracotta also wraps JDBC to transparently cache database queries cluster-wide, and includes tools that invalidate the cached results should the underlying database change--for example, if a DBA changes records outside of the application.

    TSS interviewed Terracotta founder Ari Zilka (CEO, and former Chief Architect at Walmart.com) and SVP Bob Griswold (former General Manager of JRockit JVM at BEA) to ask a few more questions about this new system.
     
     The ability to have ordinary Java syntax and semantics be virtually clustered is definitely attractive – how far does that go?

     Pretty far. Our technology uses AOP-like bytecode manipulation to enable the sharing of field values anywhere in the graph of a shared object – all at runtime, without any development impact at all. Clustering, however, is more than just the sharing of data – data sharing is as far as plain caching solutions can go today. Terracotta includes transparent support for distributing Java concurrency operations such as wait(), notify(), synchronized, etc. In addition, we support distributed locks (concurrent or otherwise), distributed GC, and virtual memory management to finely tune the performance of the clustering solution – this is far more than just a distributed cache. One of my favorite features of the Terracotta Virtualization Server is "distributed method calls" which allows an event-based application to fire events cluster-wide – yet another example of the difference between clustering and caching.

    How does the data get propagated from server to server?

     Unlike the distributed cache solutions out there, the Terracotta Virtualization Server (TVS) is a hub-and-spoke solution, with "L1" libraries running in-process as the spokes and one or more "L2" servers serving as the hub. When a field in a shared object is changed in any VM, only the deltas are pushed to the L2 server, and then relayed from that server only to the L1s that need them. Objects are dynamically swapped in and out of client VMs as needed.

    That sounds like a lot of inter-node chatter – what about latencies?

    We do not use Java serialization, and we do not broadcast changes across the entire cluster. We only distribute the deltas to the nodes that need the deltas. This eliminates any unnecessary chattiness.

    You mentioned that this is like AOP – how so?

     We use ASM at class-load time to manipulate applications into this clustering model. Our model is a pure runtime approach, much more like AspectWerkz than the older AspectJ (before the merger). That said, we do not provide a general-purpose AOP development model or framework. In this sense, we are more like an applied AOP system – an application of AOP concepts, meaning our customers do no aspect-oriented programming themselves – we are a purely codeless solution. We call it plug-configure-play.
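     As a rough illustration of the load-time approach (a generic sketch of the technique, not Terracotta's actual weaving code), a java.lang.instrument agent can hand every class to ASM before the application ever sees it:

       import java.lang.instrument.ClassFileTransformer;
       import java.lang.instrument.Instrumentation;
       import java.security.ProtectionDomain;

       import org.objectweb.asm.ClassReader;
       import org.objectweb.asm.ClassVisitor;
       import org.objectweb.asm.ClassWriter;
       import org.objectweb.asm.MethodVisitor;
       import org.objectweb.asm.Opcodes;

       public class LoadTimeWeavingAgent {

           // Invoked by the JVM when started with -javaagent:agent.jar
           public static void premain(String args, Instrumentation inst) {
               inst.addTransformer(new ClassFileTransformer() {
                   public byte[] transform(ClassLoader loader, String className,
                                           Class<?> classBeingRedefined,
                                           ProtectionDomain pd, byte[] classfile) {
                       if (className == null || className.startsWith("java/")) {
                           return null; // leave system classes untouched
                       }
                       ClassReader reader = new ClassReader(classfile);
                       ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_MAXS);
                       reader.accept(new ClassVisitor(Opcodes.ASM9, writer) {
                           @Override
                           public MethodVisitor visitMethod(int access, String name,
                                   String desc, String sig, String[] exceptions) {
                               final MethodVisitor mv =
                                       super.visitMethod(access, name, desc, sig, exceptions);
                               return new MethodVisitor(Opcodes.ASM9, mv) {
                                   @Override
                                   public void visitFieldInsn(int opcode, String owner,
                                           String fname, String fdesc) {
                                       // A real weaver would insert change-tracking
                                       // calls around PUTFIELD instructions here.
                                       super.visitFieldInsn(opcode, owner, fname, fdesc);
                                   }
                               };
                           }
                       }, 0);
                       return writer.toByteArray();
                   }
               });
           }
       }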

    Why does Terracotta optimize database access in addition to clustering JVMs?

     Terracotta views the challenge of designing clustering solutions for Java along two dimensions: the vertical dimension, wherein I/O crosses from the app tier into the data tier, and the horizontal dimension, which remains purely in the app tier. A solution that can help with clustered I/O has to help the app and data tiers play nicely together. We help the app tier stay horizontal in its I/O when it is indeed intending to cluster JVMs. We also help the app tier and data tier decouple from each other, such that the data tier doesn't see the number of app server instances in a clustered deployment and the app doesn't get built to depend on the DB when it doesn't need to. We spent a bit of time explaining our transparent capabilities in the horizontal I/O plane, but in the vertical, I have yet to mention that Terracotta implements seamless API replacements that
    "plug-configure-play" underneath a Java app. In the case of JDBC, we provide a JDBC driver that proxies all I/O through the Terracotta Virtualization Server so that the DB sees one client (the TVS itself). We also cache all responses from the DB and then snoop the DB transaction log for updates from any and all sources. TVS is perceived as a constant-time interface to the DB. And, since TVS knows when the DB is offline, it is free to serve the last known good version of data to any application. This concept can be extended to JNDI, JMS, etc.--pretty much anywhere the VM goes outside the Java context through a published API to access data from external sources.

     What kind of clients are using your infrastructure now? What are some of the use cases they are applying your technology to?

     Our technology is based on technology that powers a top-5 e-commerce site, so the architecture is pressure-tested at workloads of hundreds of millions of transactions per day. Our current customers are a mix of high-profile Wall Street financial services companies, as well as one of the top search engines on the internet today. The use cases are a mix of database acceleration, application availability and geographic clustering (i.e., reducing load on the system of record, or keeping an application available in the event the database is unavailable). Right now our production customers are using all the features we have talked about today, and more. Some specific use cases are a portal-based trading system for a Wall Street bank that is distributed between New York, Tokyo and London, and a distributed event-management system that needs to synchronize geographically dispersed app servers without using JMS – our DSO product replaces an unwieldy JMS implementation. Other use cases include distributed shared GUIs, such as shared calendar or groupware applications where distributed teams work on shared data through a shared GUI without any complex messaging or signaling code. We even have one customer looking at using our technology to power an RFID application, relying on our inherent high-performance signaling infrastructure to seamlessly share information between nodes.

    Terracotta will have a private, invitation only demo event of their technologies in San Francisco during the week of JavaOne – go to www.terracottatech.com/TechPreview.html for an invitation.

    Threaded Messages (93)

  2. How is this different from GemFire?
  3. How is this different from GemFire?

    Good Question.

     I'm not an expert on GemFire, but I'll give this a shot anyway. Terracotta's DSO product is a zero-API mechanism for sharing trees of objects ("Managed Objects") between multiple Java VMs by specifying "Roots". It provides facilities for fine-grained, object-level locking, distributed wait and notify, and distributed method calls, all without requiring code changes. We maintain object identity across VMs even when "Managed Objects" are referenced by unmanaged objects. DSO also has memory management, so parts of object trees can move in and out of client VMs (even when those objects are referenced by unmanaged objects). And DSO has full distributed GC that handles that same referenced-by-unmanaged-objects case properly.

     GemFire appears to be a traditional object cache that requires the use of an API to get and put objects. This suggests that if an object is changed anywhere in its tree, those changes won't be propagated unless the object is re-put into the cache. It is unclear to me whether they manage object identity once you are outside of those "special" caches. As far as I can tell, GemFire allows cache-level locking, but doesn't do fine-grained locks on individual objects. They have what sound like some pretty cool write-through cache capabilities to heterogeneous data sources. DSO currently has hooks that allow one to know what changes occur and react to those changes.

     Hope this helps, and I certainly welcome any corrections to my understanding of GemFire.

    Cheers,
    Steve
  4. A few more details

     A little more detail on how Terracotta's DSO product differs from the get/put model is its ability to handle object identity between the "cache world" and the "non-cache world." When one writes an application, one doesn't always flatten out the model in a way that makes a check-in/check-out approach practical or wise. With DSO that is not a problem, because object identity is maintained throughout and across VMs. One gets this ability without having to give up partial-tree memory management and GC.

     A second feature is that any object can be your cache root. One does not have to force a Map to be the root of one's "Managed Objects." This is nice because it allows users to keep their creative design and not shoehorn it into a get/put paradigm.
  5. A few more details

     A little more detail on how Terracotta's DSO product differs from the get/put model is its ability to handle object identity between the "cache world" and the "non-cache world." When one writes an application, one doesn't always flatten out the model in a way that makes a check-in/check-out approach practical or wise. With DSO that is not a problem, because object identity is maintained throughout and across VMs. One gets this ability without having to give up partial-tree memory management and GC.

     A second feature is that any object can be your cache root. One does not have to force a Map to be the root of one's "Managed Objects." This is nice because it allows users to keep their creative design and not shoehorn it into a get/put paradigm.

    Is this the Steve Harris I used to work with?

    Sorry for the noise...

    Bill
  6. A few more details

     A little more detail on how Terracotta's DSO product differs from the get/put model is its ability to handle object identity between the "cache world" and the "non-cache world." When one writes an application, one doesn't always flatten out the model in a way that makes a check-in/check-out approach practical or wise. With DSO that is not a problem, because object identity is maintained throughout and across VMs. One gets this ability without having to give up partial-tree memory management and GC.

     A second feature is that any object can be your cache root. One does not have to force a Map to be the root of one's "Managed Objects." This is nice because it allows users to keep their creative design and not shoehorn it into a get/put paradigm.

     JBoss Cache AOP does this as well. Object references are maintained within the VM and cluster-wide, so you do not have to use the get/set approach. For instance, if you did the following:

    Person p = new Person();
    Address addr = new Address();
    p.setAddress(addr);

    treeCacheAop.putObject("/stuff", p);
    Person p2 = (Person) treeCacheAop.getObject("/stuff");
    assert p2 == p;                 // object identity is preserved
    assert p2.getAddress() == addr; // nested references too
    assert p2.address == addr;      // even via direct field access

     Also, access to p2 is now transactional. So, after the above code, if you did:

    tx.begin();
    p2.setName("Bill");
    tx.commit();

    Only the name property would be propagated at transaction commit.


     One problem that we have found is how to deal with state that is held in a system class. AFAIK, system classes cannot be instrumented at runtime. So, for things like Maps, Lists, and Sets, we actually have to proxy them and replace the old object instance. I'm just wondering how other people deal with this sort of problem...but I guess you cannot divulge all your secrets ;)
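     For illustration, the instance-replacement trick might look roughly like this (a sketch of the general approach, not our actual JBoss Cache code): since java.util classes can't be rewoven, the original Map is swapped for a dynamic proxy that reports every mutating call:

       import java.lang.reflect.InvocationHandler;
       import java.lang.reflect.Method;
       import java.lang.reflect.Proxy;
       import java.util.Arrays;
       import java.util.HashSet;
       import java.util.Map;
       import java.util.Set;

       public class CollectionProxies {
           private static final Set MUTATORS = new HashSet(
                   Arrays.asList(new String[] { "put", "putAll", "remove", "clear" }));

           public static Map track(final Map target, final Runnable onChange) {
               return (Map) Proxy.newProxyInstance(
                       Map.class.getClassLoader(),
                       new Class[] { Map.class },
                       new InvocationHandler() {
                           public Object invoke(Object proxy, Method method, Object[] args)
                                   throws Throwable {
                               Object result = method.invoke(target, args);
                               if (MUTATORS.contains(method.getName())) {
                                   onChange.run();   // e.g. enqueue a delta for replication
                               }
                               return result;
                           }
                       });
           }
       }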

    Bill
  7. A few more details

     I don't think I'm the same Steve Harris you have worked with (there are many of us floating around), but I could be wrong.

     

     Here is a cool example of transparent caches. (NOTE: "customers" is the root "Managed Object"; transaction/lock boundaries are defined by the synchronization, and the object being synchronized on is the lock.)

     

       private Map customers = new HashMap();

       public void setFavoriteSong(String name, Song song){
          Customer cust = getCustomer(name);

          synchronized(cust){
             cust.setFavoriteSong(song);
          }
       }

       public Customer getCustomer(String name){
         synchronized(customers){
           return (Customer) customers.get(name);
         }
       }


     

     Though I guess I would be more likely to write it like this. (NOTE: The jukebox variable is the "Managed Object"):

     

       private Jukebox jukebox = new Jukebox();

     

       public void setFavoriteSong(String name, Song song){
          Customer cust = getCustomer(name);
          synchronized(cust){
             cust.setFavoriteSong(song);
          }
       }

       public Customer getCustomer(String name){
         synchronized(jukebox){
           return jukebox.getCustomer(name);
         }
       }

     

     Definitely cool to be able to seamlessly handle things like arrays, field access, inner classes, identity, etc. I have to say, I really enjoy this stuff.
  8. How is this different from GemFire?
     Good Question.

     I'm not an expert on GemFire, but I'll give this a shot anyway. Terracotta's DSO product is a zero-API mechanism for sharing trees of objects ("Managed Objects") between multiple Java VMs by specifying "Roots". It provides facilities for fine-grained, object-level locking, distributed wait and notify, and distributed method calls, all without requiring code changes. We maintain object identity across VMs even when "Managed Objects" are referenced by unmanaged objects. DSO also has memory management, so parts of object trees can move in and out of client VMs (even when those objects are referenced by unmanaged objects). And DSO has full distributed GC that handles that same referenced-by-unmanaged-objects case properly.
     This is actually quite similar to one of the options provided by GemFire. GemFire provides a mechanism to extend, in some sense, the memory model of the JVM to spill over into a shared memory segment. Basically, any number of JVMs and/or 'C' processes can attach to a shared memory segment and share objects. Transparency is provided to the application developer through byte-code enhancement of the domain classes chosen for sharing (if byte-code enhancement is not preferred, then objects are copied to shared memory). Applications simply instantiate these enhanced domain classes and modify fields. The updates are instantaneously visible to all connected processes. So, object creation in shared memory occurs before invoking the constructor, and all field accesses are diverted to the shared object. Memory management in shared memory is somewhat similar to the VM's, with object tables, compaction and a variety of GC algorithms, except for a few key differences: object reference management (and lifetime management) has to be global across all connected processes, and there is a provision for spilling over to disk if available memory falls below a set threshold.
     GemFire appears to be a traditional object cache that requires the use of an API to get and put objects. This suggests that if an object is changed anywhere in its tree, those changes won't be propagated unless the object is re-put into the cache. It is unclear to me whether they manage object identity once you are outside of those "special" caches. As far as I can tell, GemFire allows cache-level locking, but doesn't do fine-grained locks on individual objects. They have what sound like some pretty cool write-through cache capabilities to heterogeneous data sources. DSO currently has hooks that allow one to know what changes occur and react to those changes.

     Hope this helps, and I certainly welcome any corrections to my understanding of GemFire.

     Cheers,
     Steve

     GemFire can also be used as a straight distributed cache without any involvement of native code or a shared memory segment. When operating in this mode, there is no explicit object identity management (it is left up to the application). Depending on the application use case, the product can be configured to operate peer-to-peer, to use an explicit set of cache servers, or a combination of the two.
     When using GemFire as a distributed cache, data regions can be configured to reside in Java memory, on disk, in shared memory or out-of-process.
  9. I know you didn't claim drop-in for JDBC, but your website does:
    GemFire JDBC Interceptor is a pluggable extension for GemFire Enterprise that is used to tune database access performance without requiring any modifications to the application code.

     GemFire JDBC Interceptor features a JDBC™ 2.0 driver that allows the user to monitor any JDBC application for SQL queries, configure individual queries or patterns of queries as cacheable, and dynamically cache and distribute database query result sets in memory and/or on disk across multiple nodes. With the JDBC-compliant driver, there is no impact to the application code, and application performance can be tuned in real time. The JDBC driver shields the user from having to use the GemFire Enterprise cache APIs explicitly.

     GemFire JDBC Interceptor exposes all SQL being generated by an application and summarizes this information in a simple, easy-to-understand graphical user interface. The product instantly identifies excessive or long-running SQL statements generated by code in development. This early visibility allows performance problems to be corrected early in the development cycle, when performance tuning is dramatically less time-consuming and costly. Developers can rapidly isolate and correct performance problems with every build by analyzing the SQL generated by unit tests and looking for newly introduced code that may be causing performance problems.

     What concerns me about this is there is NO discussion of how invalidation works or how the driver learns about stored procedures, yet the JDBC product is called drop-in. What's a use case a customer could/should trust this driver for? C-JDBC can invalidate in-band SQL. Terracotta can invalidate SQL changes from ANY source. What does GemStone do, in tactical terms, to drop in and do no harm to the app's semantics?



    As for DSO, a few quotes from your whitepaper on GemFire:
     Cache listeners can be used to provide asynchronous event notifications to any number of applications connected to a GemFire distributed system. Events on regions and region entries are automatically propagated to all members subscribing to the region. For instance, region events like adding, updating, deleting or invalidating an entry will be routed to all listeners registered with the region.

     This notion of a cache listener is very cool. It is, however, not drop-in, as it does not inherit the natural language to do its work. If an application is built with java.util.HashMap, how can it be expected to be listening for GemFire cache events? It doesn't know about them, no?
     The GemFire distributed cache API presents the entire distributed system as if it were just one logical cache (see figure 2), completely abstracting the actual location of the data or the data source from the developer. Each distributed cache system is identified by an IP address and port.

     I believe there is, indeed, an API, as the quote from the GemFire whitepaper mentions. As for what the API signatures are, or what bytecode manipulation capabilities exist, these things are not documented anywhere on your site, so I cannot comment. I think most of us would like access to more user-level detail on how this works and how to use it.

    Thanks,

    --Ari
  10. What concerns me about this is there is NO discussion of how invalidation works or how the driver learns about stored procedures, yet the JDBC product is called drop-in. What's a use case a customer could/should trust this driver for? C-JDBC can invalidate in-band SQL. Terracotta can invalidate SQL changes from ANY source. What does GemStone do, in tactical terms, to drop in and do no harm to the app's semantics?
    Hi Ari,
     The web site is light on technical details today. I believe we will open up the developer portal soon. Customers have access to quite a bit more than what is visible on the web site.
     Anyway, let us keep the discussion on what might interest this mailing list. The JDBC driver parses all DML operations on the fly and detects which query result sets to invalidate. I don't understand how one could process CallableStatements (stored procedures) and know what to do; AFAIK, for this, one has to tap into events on the DB engine. Our experience here has been varied: most customers do not want any disruption to production databases, even if we are talking about sniffing logs. Most often, our users simply want to use the configurable cache-refresh policy (TTL- or idle-time-based). We offer toolkits for filtering logs, such as Oracle's, and applying the invalidations to the cache. The challenge here is not so much capturing and propagating the event as the intelligence to know which events matter to a middle-tier cache. As you know, if one is not careful, the middle-tier cache will be flooded with events that don't matter, consuming CPU cycles and ultimately reducing throughput.
     If real-time invalidations are desired, then we propose the use of scripts to generate row-level triggers that dispatch events to Oracle AQ (a JMS destination), which in turn arrive at the distributed caching system.
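     For illustration only (an assumed sketch, not GemFire's code), the bookkeeping for DML-driven invalidation can be reduced to mapping each cached query to the tables it reads, and dropping entries whose tables appear in an incoming DML statement:

       import java.util.HashMap;
       import java.util.Iterator;
       import java.util.Map;
       import java.util.Set;
       import java.util.regex.Matcher;
       import java.util.regex.Pattern;

       public class QueryInvalidator {

           // cached SQL text -> lower-cased names of the tables the query reads
           private final Map cachedQueries = new HashMap();

           private static final Pattern DML_TARGET =
                   Pattern.compile("(?i)\\b(?:insert\\s+into|update|delete\\s+from)\\s+(\\w+)");

           public synchronized void register(String selectSql, Set tablesRead) {
               cachedQueries.put(selectSql, tablesRead);
           }

           // Called for every DML statement that flows through the driver.
           public synchronized void onDml(String dmlSql) {
               Matcher m = DML_TARGET.matcher(dmlSql);
               while (m.find()) {
                   String table = m.group(1).toLowerCase();
                   for (Iterator i = cachedQueries.entrySet().iterator(); i.hasNext();) {
                       Map.Entry e = (Map.Entry) i.next();
                       if (((Set) e.getValue()).contains(table)) {
                           i.remove();   // invalidate the stale result set
                       }
                   }
               }
           }
       }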

     I believe there is, indeed, an API, as the quote from the GemFire whitepaper mentions. As for what the API signatures are, or what bytecode manipulation capabilities exist, these things are not documented anywhere on your site, so I cannot comment. I think most of us would like access to more user-level detail on how this works and how to use it.

     Thanks,

     --Ari

     The GemFire technical whitepaper does not refer to the JDBC driver cache, but rather to the distributed data cache, which is indeed API-driven. Like I mentioned before, the developer portal should open up soon. I will keep you posted.
    Thanks.
    - Jags Ramnarayan
  11. If real-time invalidations are desired, then we propose the use of scripts to generate row-level triggers that dispatch events to Oracle AQ (a JMS destination), which in turn arrive at the distributed caching system.

    We have also found AQ to work very well as a way for DB transactions to queue change and changeset information. It is definitely one of the hidden gems in the recent Oracle releases.

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  12. congrats Ari

    congrats Ari on release.

    peter
  13. congrats Ari

    Peter,

    Great to hear from you! Thanks for the congratulatory message. How are things out on the opposite coast?

    As for our release, we were very excited to finally be able to share our message out there. With folks in production and extended field trials, we felt the time was right.

    Hopefully people like what they are reading. We welcome the feedback.

    --Ari
  14. congrats Ari

     Peter,

     Great to hear from you! Thanks for the congratulatory message. How are things out on the opposite coast?

     As for our release, we were very excited to finally be able to share our message out there. With folks in production and extended field trials, we felt the time was right.

     Hopefully people like what they are reading. We welcome the feedback.

    --Ari

    Good reading. I've been thinking about using AOP in a similar way in the rule engine space to provide seamless support. It's good to see Terracotta and others pushing forward with these techniques. Hopefully, I'll have enough time to implement seamless rule support for Drools in the next 2 years, so that users can simply deploy their business objects and Drools will modify the bytecode so it will work.

    peter
  15. I still kind of like Azul's approach better. At least they were upfront in saying that you need hardware support for that kind of low-level clustering. Even if you set aside the bigger question of whether such fine-grained data-sharing capabilities (field-based change propagation) are the right approach overall, nobody would seriously think that this system can perform adequately under significant load on plain TCP/IP... I think such an approach is better suited to specialized interconnects like InfiniBand, but then it has very little to do with Java and especially with "no extra development required".
     Terracotta includes transparent support for distributing Java concurrency operations such as wait(), notify(), synchronized, etc.
    Are they going to provide tools to debug it with major Java IDEs for app developers?
    One of my favorite features of the Terracotta Virtualization Server is "distributed method calls" which allows an event-based application to fire events cluster-wide – yet another example of the difference between clustering and caching.
     This is embarrassingly simple to implement with various technologies today. If that's the favorite feature, then I don't see the point.
    We do not use Java serialization, and we do not broadcast changes across the entire cluster. We only distribute the deltas to the nodes that need the deltas. This eliminates any unnecessary chattiness.
     That's like saying - “well, we didn't make this protocol as slow as possible...”. Of course you do it; the point is that performing a distributed action on each field change imposes a very significant I/O load that will almost kill any application unless it was designed with that assumption from the beginning (minimize mutations, etc.) – which contradicts the non-intrusiveness claimed for this product.
    ...data sharing is as far as plain caching solutions can go today.
     Data sharing is just a technicality of today's distributed caching solutions. I'm curious to see how this product supports distributed transactions, for example, which by default require certain programming support. Integration with the J2EE stack, etc., etc. There are more questions than I have time to ask. Look at any caching product that is close to JCache (I'm biased here), such as Tangosol, and you'll see what I'm talking about.

     On a bright note, I think it's refreshing that money is slowly coming back to infrastructure software.

    Best,
    Nikita.
    GridGain Systems.
  16. That's like saying - “well, we didn't make this protocol as slow as possible...”. Of course you do it; the point is that performing a distributed action on each field change imposes a very significant I/O load that will almost kill any application unless it was designed with that assumption from the beginning (minimize mutations, etc.) – which contradicts the non-intrusiveness claimed for this product.

     I certainly could understand why someone would be concerned if every field change were sent across the network as it happened. That is not the case. Field changes to managed objects are batched together and sent to the server on lock boundaries (usually defined by the developer's synchronization). In this way it is very similar to how the Java memory model works.
  17. Field changes to managed objects are batched together and sent to the server on lock boundaries (usually defined by the developer's synchronization). In this way it is very similar to how the Java memory model works.
     ... right; the caveat, though, is that Java memory access is usually 2-3 orders of magnitude faster than the fastest IPC, let alone plain TCP/IP – hence my initial point.

    Nikita.
    GridGain Systems.
  18. Field changes to managed objects are batched together and sent to the server on lock boundaries (usually defined by the developer's synchronization). In this way it is very similar to how the Java memory model works.

     ... right; the caveat, though, is that Java memory access is usually 2-3 orders of magnitude faster than the fastest IPC, let alone plain TCP/IP – hence my initial point.

     Nikita.
     GridGain Systems.

     I suspect that most code is not correctly synchronized in the strict sense of the Java memory model; it works because it is running on VMs with a stronger memory model than the spec requires.
  19. Announcing Terracotta - Java clustering

     I suspect that most code is not correctly synchronized in the strict sense of the Java memory model; it works because it is running on VMs with a stronger memory model than the spec requires.

    Oh, this is almost certainly true. I have seen some truly broken multithreaded code in production apps that have supposedly passed through rigorous QA cycles.

    To that extent, technologies like Terracotta can serve a valuable testing/auditing/code improvement role as well -- it will exercise your code in ways a typical execution environment will not, making it easier to find the concurrency bugs. (In the same way that the Azul technology will, as an unintended side-benefit, help you find hidden scalability issues in your code.)
  20. So, looking at GridGain, I notice that you are in the same space as Terracotta. It is very nice to see competition and we welcome your foray into API-less clustering.

    I am confused by some of your responses, however. Let me provide some context and then I will explain my confusion.

     Terracotta has several demos of our software, and we regularly show those demos to customers. The interesting one here is a Swing demo of a drawing program; Terracotta's Virtualization Server clusters that demo so that changes in any window appear in all windows, simultaneously (BTW, I don't mean that adding a rectangle on one client causes it to be added on all...I mean that as I click-and-drag the rectangle on the screen, you can see the click-and-drag happening on all copies of the drawing tool at once). We do this without touching the application source code in any way...and it took only a few minutes for an engineer to get it configured as a clustered drawing tool (okay, the use case is strange, but you get the idea).

    What confuses me is:
    1. Azul and TC do not compete. Azul would not turn this GUI into a clustered GUI. So, what does it mean to like Azul "more?"
     2. GridGain would require annotating the application and would thus perturb the code, so do GridGain and TC really compete?
    3. Tangosol cannot cluster an arbitrary app, especially not a GUI which requires firing change events against the data model around a cluster, etc. So can you please clarify the point about JCache and Tangosol? Tangosol seems like a valid technology, yet solving a different problem, IMO.
    4. Regarding performance, in order to cluster an app, you must eventually hit the network. Terracotta hits the network in batches, which avoids the TCP overhead as much as possible. And, Terracotta pushes only what changes. Is it even possible to cluster more cheaply than this--logically speaking?
  21. Some thoughts

    Hi Ari,
    I'll skip GridGain to avoid the change of topics.

     Azul and TC do not compete. Azul would not turn this GUI into a clustered GUI. So, what does it mean to like Azul "more?"

     I certainly do not know whether or not a specific GUI demo will be addressed by Azul, but the technology approaches seem similar enough to make a high-level comparison: both use low-level distribution and either a custom VM or code augmentation. But Azul took it to the hardware level, plus a special (modified) VM, which ultimately removes most if not all performance issues (obviously, at a much greater price).
    Tangosol cannot cluster an arbitrary app, especially not a GUI which requires firing change events against the data model around a cluster, etc. So can you please clarify the point about JCache and Tangosol? Tangosol seems like a valid technology, yet solving a different problem, IMO.
     It's not my product and I'll let Cameron speak for it. But it seems to me we are talking about apples and oranges here. JCache-type caching is distributed caching in its classic sense, while your GUI demo is rather something else. See below...
    Regarding performance, in order to cluster an app, you must eventually hit the network. Terracotta hits the network in batches, which avoids the TCP overhead as much as possible. And, Terracotta pushes only what changes. Is it even possible to cluster more cheaply than this--logically speaking?
     I think it is. Some time ago I saw a paper analyzing the differences between AOP-style transaction management (with per-field advice; I think JBoss 4 does it) and a traditional one where the developer is responsible for enlisting – a conceptually similar problem. It turned out that it's more efficient to leave it to the developer, for various reasons.

     In a nutshell, per-field advising is much too simple an approach to gathering changes in the “state”, and it leads to unnecessary load on memory, CPU and I/O (in the case of distribution).

     Case in point: within a distributed cache transaction I can update multiple objects crossing multiple synchronization boundaries, but end up with just one object being enlisted and actually modified in the transaction, and as such one delta to be sent. In fact, most TM and/or caching products will ensure this optimization.

     In the case of TC, I'm not sure how this behavior can even be supported when all individual modifications for all objects have already been distributed (on synch boundaries). Is there a notion of a transaction?

    Regards,
    Nikita.
    GridGain Systems.
  22. Some thoughts

     You make excellent points, Nikita. I don't want to get into a "tit for tat" here, but TC has more locking schemes than exclusive (concurrent, read/write, etc.), and more transaction boundaries than synchronized{} are possible. We are not claiming zero developer involvement...we are claiming zero code. Our product is in fact capable of the usage patterns you suggest, as well as the drop-in auto-locking one we espouse above.

    Thanks so much for your input. Clearly you are on top of the concepts and challenges inherent in our product space. I would welcome your feedback and continued conversation at any time. Please drop me an e-mail and we can talk some more.
  23. Some thoughts

    OK :-)
    I hope my questions didn't come across as rude...
     What would be very helpful is if you could provide some cornerstone use cases for your software that clearly highlight the differences between TC and other products and/or technologies, such as traditional grid computing or Azul-style clustering.

    Best,
    Nikita.
    GridGain Systems.
  24. Some thoughts

    Terracotta's implementation of java object clustering is general purpose, so there are many use cases that apply. Here are a few that we've discussed with customers and prospects:


    1) Group scheduling/event coordination
     One customer is considering DSO for a special-purpose scheduling system (groupware). In this case, DSO is replacing JMS message queues with a clustered java.util.List (see the sketch at the end of this post).

    2) Fault-tolerant corporate portal
    A large financial institution is using Terracotta's HA-JDBC product to make its corporate portal resilient to transient database failures, and to improve response time with our L1/L2 cache architecture.

    3) Drop-in replacement for servlet session state.
    DSO can be used to implement clustered servlet sessions. Because DSO can distribute field-level changes, it is much more efficient than serialization-based session replication schemes. Also, DSO is capable of paging objects to disk, making it possible to manage larger servlet sessions than are supported by many commercial application servers.

    4) Shared GUIs
     Virtually any Java object is fair game for clustering, including Swing data models (in the MVC sense). This makes it trivially easy to write a distributed Swing application. Write it for a single VM; then, through configuration, mark one or more of the models as "shared", thereby clustering the GUI.

    5) Cluster-wide hashmaps
    Yeah, we do that too.

     The beauty of the Terracotta solution is that it is virtually invisible. Use the data structures and programming constructs that are native to the JDK (including synchronization and wait/notify), and control clustering and persistence through configuration. Once you start programming this way, you're hooked. It's the most natural way to write distributed Java applications.
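     To make use case 1 concrete, here is a minimal sketch (illustrative only; it assumes the 'queue' field has been declared a shared root in configuration). This ordinary wait/notify code would then behave the same whether the producer and consumer run in one VM or on different machines:

       import java.util.LinkedList;
       import java.util.List;

       public class ClusteredWorkQueue {

           private final List queue = new LinkedList();   // the shared root

           public void publish(Object event) {
               synchronized (queue) {        // under DSO, a cluster-wide lock
                   queue.add(event);
                   queue.notifyAll();        // wakes consumers on any node
               }
           }

           public Object take() throws InterruptedException {
               synchronized (queue) {
                   while (queue.isEmpty()) {
                       queue.wait();         // blocks across the cluster
                   }
                   return queue.remove(0);
               }
           }
       }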
  25. HA-JDBC

     How does Terracotta HA-JDBC differ from the HA-JDBC project found on SourceForge, or from C-JDBC from ObjectWeb?
  26. I'm glad you mentioned C-JDBC (and SourceForge's HA-JDBC) -- they're beautiful solutions; the concept of a RAID of databases is powerful and simple. You get substantially increased availability and additional performance, just by replicating your main data source as many times as you want. In fact, C-JDBC can sit underneath our own HA-JDBC really nicely, and you get the benefits of both worlds.

    There are some significant differences between C-JDBC and Terracotta's HA-JDBC, too -- the most obvious one being that our HA-JDBC uses a cache atop the database to provide performance and availability, rather than replication of the system-of-record (SoR) itself. Depending on your data size, scalability requirements, TCO position, and ability to use open-source (no license fee) databases, running several copies of the original database can be anywhere from extremely straightforward to pretty cost-prohibitive.

    Being a cache also lets us do a few other things differently. We pass through writes but absorb most reads directly, so the total load on the SoR goes down pretty substantially, and our approach to invalidations means that you can change the data using any tool (command-line console, Java app, whatever) and we can still cache quite aggressively. (For example, we can properly handle all stored procedures, whether they write, read, or both, and cache when -- and only when -- it's safe to do so.) We can even return results when the entire SoR is hung or down, letting us step in when hardware or software faults (e.g., deadlocks) happen in that SoR. Being a cache also lets us avoid a certain set of synchronization challenges that present themselves when you have a replicated SoR. Again, depending on your application, this can be irrelevant or pretty significant.

    I think of it like adding an NVRAM cache in front of your disks vs. using a RAID -- they both have their benefits; it just depends on what problem you're tackling. We've solved problems using C-JDBC, our own HA-JDBC, and a combination of both -- we think they each have their place.
  27. We pass through writes but absorb most reads directly, so the total load on the SoR goes down pretty substantially, and our approach to invalidations means that you can change the data using any tool (command-line console, Java app, whatever) and we can still cache quite aggressively. (For example, we can properly handle all stored procedures, whether they write, read, or both, and cache when -- and only when -- it's safe to do so.)

    It sounds quite interesting. How do you determine the readset and writeset of stored procedures without modifying the database engine? If I do {call someproc()} in JDBC, how do you know which data has to be invalidated in the cache?
    Also how does the cache manage transaction isolation?

    Thanks for your detailed answer.
    Emmanuel
  28. Applied AOP

     It's great to see AOP-applied software reaching the corporate and production stage this way, Bob! This reminds me that last March we had Oracle TopLink at the annual AOP conference to talk about their build-time field get/set interception. You guys should talk next time. I am sure the interception model is roughly the same (even if yours is runtime/load-time), and that only the backend serves a different purpose.

     The way you deal with synchronized blocks seems interesting as well, and is currently missing from the AOP semantics.
     So, are you using AspectWerkz, or parts of it, in there?

     I am wondering how the DB transaction log is replicated, in a transactional way, across the distributed cache images, all the way to the object model in the VM, to handle DB updates from an external source. Does that work for all DBs?

    Alex
  29. Re: Applied AOP

    Thanks, Alex!

    We do use pieces of AspectWerkz, and our configuration files will be very familiar to users of AspectWerkz. Thanks to you and Jonas for creating such a great tool!
  30. Re: Applied AOP

    Congratulations Bob, I'm sure you are having a lot of fun at Terracotta. :-)

    This new product of yours looks really cool and it is always fun to see AOP, loadtime weaving and last but not least AspectWerkz ;-) used in interesting and new ways.

    A possible customer for JRockit VM weaving... ;-)

    /Jonas
  31. Applied AOP/VM weaving

     Congratulations Bob, I'm sure you are having a lot of fun at Terracotta. :-)

     This new product of yours looks really cool and it is always fun to see AOP, loadtime weaving and last but not least AspectWerkz ;-) used in interesting and new ways.

     A possible customer for JRockit VM weaving... ;-)

     /Jonas

    Thanks Jonas - I am having a lot of fun, of course ;-)

     Having led the JRockit team for the last 3 years, I of course have a pretty in-depth understanding of what you are doing with VM weaving, and we are excited about helping with it and using it. I'll see you, Alex, and the rest of the JRockit team at JavaOne and we can discuss it in more depth.

    /Bob
  32. Re: Applied AOP

     I am wondering how the DB transaction log is replicated, in a transactional way, across the distributed cache images, all the way to the object model in the VM, to handle DB updates from an external source.

     A Terracotta process on each database server listens in real time to the database transaction log. When a transaction is committed in the transaction log, information about that transaction is broadcast to any L2 instances that are running. The L2 will invalidate any parts of the cache affected by that transaction locally, then tell any connected L1 instances to invalidate copies they may have cached as well. In this way, all updates to the database, regardless of source, cause an invalidation, thus providing true data consistency even if someone uses a back door to update data.

     When an update to the database is made (via any source), there is a window of time after the transaction has committed but before the relevant portions of the cache have been invalidated. To avoid a race condition where some data may be updated and then immediately re-queried by an application, any updates that go through the HA-JDBC driver cause a synchronous cache invalidation (again, applied only to the parts of the cache actually affected by the update). Only updates to the database that bypass the HA-JDBC driver are processed asynchronously.
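     Schematically (an illustrative sketch with hypothetical names, not our actual implementation), the fan-out described above looks like this:

       import java.util.Iterator;
       import java.util.List;
       import java.util.Set;

       public class L2CacheServer {

           public interface L1Connection {
               void invalidate(Set affectedTables);   // pushed to each connected L1
           }

           public interface QueryCache {
               void invalidateQueriesTouching(Set tables);
           }

           private final QueryCache localCache;
           private final List l1Connections;

           public L2CacheServer(QueryCache localCache, List l1Connections) {
               this.localCache = localCache;
               this.l1Connections = l1Connections;
           }

           // Called by the transaction-log listener for each committed transaction.
           public void onCommittedTransaction(Set affectedTables) {
               localCache.invalidateQueriesTouching(affectedTables);  // local first
               for (Iterator i = l1Connections.iterator(); i.hasNext();) {
                   ((L1Connection) i.next()).invalidate(affectedTables); // then fan out
               }
           }
       }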
     Does that work for all DBs?

    Currently we support Oracle databases, but in the future will support Sybase, UDB, ....
  33. Fine-grained replication

    Ari,

     First of all, congratulations on the product release!

    It is good to see AOP has been used by another product for fine-grained clustering. At JBoss, we also use JBossAop to achieve fine-grained, OO-like distributed caching (JBossCache-Aop) at runtime.

     (Sorry that I can't gather much information on DSO other than what I read from your web site.) But I am just curious: how is your DSO approach different?

    Regards,

    -Ben Wang
    JBossCache
  34. Fine-grained replication

     Ari,

     First of all, congratulations on the product release!

     It is good to see AOP has been used by another product for fine-grained clustering.

    Ben, thank you so much. I have talked a few times to JBoss folks about AOP and our different thinking on the topic. In general I agree it is great to have friends out there trying things and learning together.
     At JBoss, we also use JBossAop to achieve fine-grained, OO-like distributed caching (JBossCache-Aop) at runtime. (Sorry that I can't gather much information on DSO other than what I read from your web site.) But I am just curious: how is your DSO approach different?

     Regards,

     -Ben Wang
     JBossCache

    Good question. We have come across JBossCache at customer bake-offs more than a few times. While I have nowhere near the expertise you folks have on your own product, I would suggest that our runtime-only focus leads to some differences between TVS and JBossCache:

     1. We treat JDBC and object data discretely. JBossCache is a clustered object cache. Our HA-JDBC product can be integrated with our DSO product to do OR mapping without any JDBC calls anywhere in the application code, for example. You could also implement our HA-JDBC (clustered database cache) with our DSO product quite easily. This is because we are solving both the problem of talking to systems of record and the problem of clustering VMs.

     2. Terracotta's scalability is built on a lot of product hardening. We have what we think is a linearly scalable solution that is [clustered] hub & spoke instead of peer-to-peer. Not sure if JBossCache can scale linearly to hundreds of servers.

     3. Synthesized transaction boundaries mean that, with TC, objects are replicated with as much granularity as is needed for any one use case. Some classes can use auto-locking to lock on synchronization. Other classes can use named locks to wrap lots of updates within the graph into a single transaction. Again, I'm not suggesting JBossCache can't, but customers have been really happy with our model for transactions.

     4. Configurable fine-grained locking. There are object-level locks, and they are separate and apart from the objects themselves, so you could have 2 classes lock on the same lock. You could also have read/write locks, or concurrent-write locks that can be entered unfettered but still guarantee stable views of the data, etc. Furthermore, TC has a runtime console in our Virtualization Server. You can see locks being contended for at runtime via our management GUI. You can also record race conditions that play out in the cluster and replay them later for problem resolution, without first trying to recreate the problem. (One of my rants about multi-threaded/clustered programming is that in order to reproduce a concurrency bug, you already need to know what the problem is.)

    5. Distributed signaling (method calls, wait/notify, etc.) as we have mentioned before.

     6. Inherit the class definition from the Java code...clustering an object in Terracotta only requires identifying, by name, the root object of the graph you want shared. For example, if you have a field named FOO in class MyClass, you would identify com.mycompany.MyClass, define a root named FOO, and map it to field FOO--and that's the whole task. Everything that gets added to the graph of field FOO at runtime starts getting shared too...
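     As a concrete illustration of that last point (using the hypothetical names from the example), the class itself stays plain Java; only the configuration names the root:

       package com.mycompany;

       import java.util.HashMap;
       import java.util.Map;

       public class MyClass {
           // Declared as the root named FOO in Terracotta's configuration.
           // No Terracotta imports and no API calls: everything reachable
           // from this field is shared across the cluster at runtime.
           private final Map FOO = new HashMap();
       }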

    I would be very happy to spend time talking about JBossCache and improving our understanding of your product. Please do shoot me your contact info (feel free to sign up for info at http://www.terracottatech.com/company/TellMeMore.html) and we can get together at JavaOne if you will be there, or at another convenient time.
  35. hub & spoke, peer to peer

    Hi Ari,

    Congratulations on the release. As you know, I like the approach you're taking with the JDBC driver, and if it works, I'll be happy to recommend it. I was trying not to jump in to this thread, but I did want to clear up a misconception:
     2. Terracotta's scalability is built on a lot of product hardening. We have what we think is a linearly scalable solution that is [clustered] hub & spoke instead of peer-to-peer. Not sure if JBossCache can scale linearly to hundreds of servers.

     An H&S architecture is the right solution for certain classes of problems. However, an H&S architecture cannot linearly scale. You can visualize this fairly simply, since the H&S architecture is an analogue of the client/server architecture. There may be specific small-scale use cases in which the aggregate throughput of a single hub's spokes will scale close to linearly, but that is not the same thing.

    Regarding Peer-To-Peer as a concept, I think that it has adequately proven its scalability, with some P2P networks scaling up into the millions of machines. P2P as a concept underlies many modern approaches to building mesh architectures, such as data and compute grids, information fabrics, etc. I have witnessed the "scaling linearly to hundreds of servers" on a P2P architecture in Java, and so far I haven't seen any other architecture that can do it. ;-)

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  36. an H&S architecture cannot linearly scale.

    It can, if the hub can scale ;-) Here you are assuming that the hub doesn't scale.

     Real-world example: take 50 machines running Oracle Real Application Clusters, and put stateless app servers in front of them. From the application developer's point of view it seems as if you have a database bottleneck, but you don't.

    The scalability of that type of system is very definitely linear. There is some overhead (say, 15%), but it's linear.

     Since I didn't see any mention of replication, it doesn't appear that Terracotta's hub scales linearly--it's probably just a 1+1 availability kind of deal. Maybe the TC guys can explain this in detail?

    G
  37. hub & spoke, peer to peer

    Cameron,

     Absolutely. And, furthermore, agreed. That is why, in my original post, I emphasized [clustered] in "clustered hub." Our clustered hub affords 2-tier data solutions and can scale replication traffic with more fine-grained control than a classic P2P solution. Our hub knows what objects and caches are where, so it pushes updates only where needed. And our hub itself is designed to replicate data to only a controlled number of nodes in the cluster of hubs. It's sort of like all data is durably SOMEWHERE, but no data is EVERYWHERE.

     It's nothing we, as a community, haven't seen before, but I just wanted to clear up any misconception that our hub is either a bottleneck or less scalable than P2P. It _should_ be more scalable, for 2 reasons:

     1. We push only deltas, and our hub knows which VMs have the data, so we push the data only where it is needed.
     2. Our hub also clusters for scale as well as availability, so it is never a bottleneck.

    I think it looks like P2P but with a factor of 100 less traffic (we have run 100 app servers per Terracotta Virtualization Server).

    --Ari
  38. hub & spoke, peer to peer

    I emphasized [clustered] in "clustered hub."

    I should read more carefully, then ;-)
     Our clustered hub affords 2-tier data solutions and can scale replication traffic with more fine-grained control than a classic P2P solution. Our hub knows what objects and caches are where, so it pushes updates only where needed. And our hub itself is designed to replicate data to only a controlled number of nodes in the cluster of hubs. It's sort of like all data is durably SOMEWHERE, but no data is EVERYWHERE.

    Yes, we do the same, except in a P2P manner.

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  39. I emphasized [clustered] in "clustered hub."
    I should read more carefully, then ;-)
     Our clustered hub affords 2-tier data solutions and can scale replication traffic with more fine-grained control than a classic P2P solution. Our hub knows what objects and caches are where, so it pushes updates only where needed. And our hub itself is designed to replicate data to only a controlled number of nodes in the cluster of hubs. It's sort of like all data is durably SOMEWHERE, but no data is EVERYWHERE.

     Yes, we do the same, except in a P2P manner.

     Peace,

     Cameron Purdy
     Tangosol Coherence: Clustered Shared Memory for Java

    Then you probably get better latency.

    They are trying to keep their technology close to their chest, but it looks like to get a monitor on a Java object you have to do this:

    node1 -> server -> node2

    and back.

    That means that the latency to get a monitor is pretty high.
     
     NFS already tried to do this: hide network access without changing the API. There is a good article by Bill Joy (who wrote NFS, I think) about why this in the end didn't work, and why at the time he thought that Jini was the "right way" to approach distributed applications.
  40. Then you probably get better latency.

    They are trying to keep their technology close to their chest, but it looks like to get a monitor on a Java object you have to do this:

    node1 -> server -> node2

    and back.

    That means that the latency to get a monitor is pretty high.

    Sure, that isn't surprising, but it isn't terribly bad either, assuming you can get good parallelization in the application, i.e.

     throughput = (1/latency) * parallelism
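     (As a rough worked example with assumed numbers: at 2 ms of lock round-trip latency, each thread completes 1/0.002 = 500 lock-protected operations per second, so 100 independent threads across the cluster yield roughly 50,000 operations per second in aggregate.)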

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  41. Then you probably get better latency.

     They are trying to keep their technology close to their chest, but it looks like to get a monitor on a Java object you have to do this:

     node1 -> server -> node2

     and back.

     That means that the latency to get a monitor is pretty high.

     Sure, that isn't surprising, but it isn't terribly bad either, assuming you can get good parallelization in the application, i.e.

     throughput = (1/latency) * parallelism

     Peace,

     Cameron Purdy
     Tangosol Coherence: Clustered Shared Memory for Java

    Do you consider two threads updating the same integer to have a parallelism of two?
  42. They are trying to keep their technology close to their chest, but it looks like to get a monitor on a Java object you have to do this:

     node1 -> server -> node2

     and back.

     That means that the latency to get a monitor is pretty high.

     NFS already tried to do this: hide network access without changing the API. There is a good article by Bill Joy (who wrote NFS, I think) about why this in the end didn't work, and why at the time he thought that Jini was the "right way" to approach distributed applications.

     Interesting set of assumptions. We, strangely enough, used NFS as one of many models we looked at when designing our solutions. Although I do not want to come across as defending our solution (we have lots to learn about the ways people want to use our product), we have seen customer use cases and have solved some of the problems you are talking about:

    1. node 1 -> server -> node 2 doesn't quite make sense to me. When using simple clustered locking Terracotta goes node 1 -> server. No other nodes participate in the lock...

     2. We also have a concurrent write mode in which no locking occurs, yet transactionality is somewhat adhered to (although the last one in wins).

     3. Last, we have greedy locks that stay client-side and, while networked, never hit the network until lock contention across VMs occurs: node 1 -> server to check out the lock, and then node 1 talks to itself for the lock from that point forward.

     Now, I have personally used other products and can get close to these capabilities with most of them. But the natural model of just using synchronized{} (or named locks) and changing the behavior at runtime to support my actual usage is not possible with other products. I could actually switch between models #1-3 above at runtime.
  43. 1. node 1 -> server -> node 2 doesn't quite make sense to me. When using simple clustered locking Terracotta goes node 1 -> server. No other nodes participate in the lock...

     Then that means that all the nodes always have to hit the server? For example, to read a Java field, I have to send a message to the server?
    2. We also have concurrent write mode where no locking occurs, yet transactionality is somewhat adhered to (although last one in wins).

    How do you roll back the threads that "lose"? You must keep around the original copies of the objects?
    3. Last, we have greedy locks that stay client side and, while networked, never hit the network till lock contention across VMs occurs. node 1 -> server to checkout the lock and then node 1 talks to itself for the lock, from that point forward.

    If you could state clearly whether you support ACID transactions across nodes, and with what isolation levels, I think that would be helpful.

    If you ever explore using synchronous network communications instead of UDP, I would definitely consider this type of product.

    It sounds like this product is similar to IBM's Parallel Sysplex. That also has a hub-and-spoke design, with the hub being a partition of one of the mainframe nodes known as the Coupling Facility (CF). They also go out to the CF to manage locks (also shared queues and other data structures), but the communication is over fiber, and it's in user land. They even added an assembly instruction to their chip to support this. This is all from Pfister's book "In Search of Clusters".

    Fortunately, all of this is now available for cheap. Just buy InfiniBand adapters and an InfiniBand switch. Oracle RAC uses it. Mellanox sells on-motherboard adapters for $69, I believe.
  44. Then that means that all the nodes always have to hit the server? For example, for reading a Java field, I have to send a message to the server?

    Guglielmo,

    I suggest we take this offline. You should definitely stop by our event at JavaOne if you are in town that week. You should sign up for our download. And we should speak by phone / chat... I suggest this not because I want to stop this thread--quite the opposite. I do think there is a disconnect here, however, between what you are trying to solve and what we are trying to solve...

    Regarding your questions above: no, nothing hits the server when it has info locally (roughly speaking). A clustered lock is simply that. You have to lock on the cluster, and then you trust that all your peers do the same, and then they can be blocked when you have the lock. And, reading a field is not a network-activity. If you have a non-clustered object in the VM, it is never going to cause a network hit. If you have a clustered object in the VM, it can cause a network hit on read ONLY if it is (a) out of date or (b) paged-out to save memory.
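
    A small sketch of that read path, assuming invented helper names (nothing here is Terracotta's actual API):

    // Hypothetical read path per the description above: only a stale or
    // paged-out field of a clustered object causes a network round-trip.
    interface ManagedObject {
        boolean isClustered();
        boolean isStale(String field);
        boolean isPagedOut(String field);
        Object localValue(String field);
    }

    class ReadPath {
        Object fetchFromL2(ManagedObject obj, String field) {
            return null; // stand-in for a round-trip to the L2 hub
        }

        Object readField(ManagedObject obj, String field) {
            if (!obj.isClustered()) {
                return obj.localValue(field);      // plain object: never a network hit
            }
            if (obj.isStale(field) || obj.isPagedOut(field)) {
                return fetchFromL2(obj, field);    // the only cases that hit the network
            }
            return obj.localValue(field);          // clustered but fresh and resident: local read
        }
    }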
    How do you roll back the threads that "lose"? You must keep around the original copies of the objects?
    You don't roll back. Last one in wins. Rollback would otherwise not be drop-in, as the business logic would have to handle the exception. RMI throws RemoteExceptions and we don't do anything like that. A very important point here is that there are use cases in business apps where this is useful. A simple example is ServletSession. If you have 2 HTTP requests in the cluster at once (user hits refresh after a server appears to crash due to a VM pause), and both requests cause a session update...only 1 update will take. This is actually the expected behavior for ServletSession (no one expects their Servlet container to throw OptimisticConcurrencyException, for example).
    If you could state clearly if you support ACID transactions across nodes, and with what isolation levels I think that would be helpful.
    We do not support ACID transactions.
    If you ever explore using synchronous network communications instead of UDP I would definitely consider this type of product. It sounds like this product is similar to IBM's parallel sysplex.

    Interesting. We will look into it. FWIW, we do believe that there is a class of business apps that just does not need mainframe or MPP types of capacity. There are indicators out in the customer landscape (folks dumping proprietary HW solutions for Linux / Intel) that suggest we are on to something. And, I should say, several companies exist (and pre-date us) that are selling software that works fine all the way down to 10Mbit networks for these customers. They are making money selling software to folks who used to run these proprietary solutions.
  45. Small correction. DSO is ACID.
  46. Small correction. DSO is ACID.

    What isolation levels do you support?
  47. Isolation levels in JDBC

    For HA-JDBC, we support any transaction isolation level that the underlying database supports, and any specified by the application.
  48. Isolation levels in JDBC

    For HA-JDBC, we support any transaction isolation level that the underlying database supports, and any specified by the application.

    That's not what I am talking about.

    Steve Harris said that DSO is ACID. So, what isolation levels do you support for DSO?
  49. Isolation levels in JDBC

    For HA-JDBC, we support any transaction isolation level that the underlying database supports, and any specified by the application.
    That's not what I am talking about. Steve Harris said that DSO is ACID. So, what isolation levels do you support for DSO?

    We have multiple locking levels. They range from Serializable (in database terms, though we are not a database) to a few different variations on Read Committed.

    All of our locks guarantee a stable view:

    Read Locks - Allow multiple readers and a single writer
    Write Locks - Allow 1 entrant at a time (Serializable)
    Concurrent - Allow multiple writers but maintain a stable view (last one wins).

    One can use any of these kinds of locks at the "object" level rather than having to lock a whole table/row (impedance mismatch) or a whole cache.
    We can use the synchronization already in place in an application to define both our lock boundaries and what "Managed Object" we are locking on.

    Hope this helps,
    Cheers,
    Steve
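
    As a rough single-VM reference point for the lock levels Steve lists, the following analogy may help (standard java.util.concurrent constructs, not Terracotta's API; Terracotta's versions are claimed to apply cluster-wide):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class LockLevels {
        private final ReadWriteLock rw = new ReentrantReadWriteLock();
        private int value;

        public int read() {                  // "Read Lock": many concurrent readers
            rw.readLock().lock();
            try { return value; } finally { rw.readLock().unlock(); }
        }

        public void write(int v) {           // "Write Lock": one entrant at a time (Serializable-like)
            rw.writeLock().lock();
            try { value = v; } finally { rw.writeLock().unlock(); }
        }

        public void concurrentWrite(int v) { // "Concurrent": no mutual exclusion; last one in wins
            value = v;
        }
    }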
  50. Isolation levels in JDBC

    Steve Harris said that DSO is ACID. So, what isolation levels do you support for DSO?
    We have multiple locking levels. They range from Serializable (in database terms but we are not a database) to a few different variations on Read Committed.

    Does your system enforce 2-phase locking, which ensures conflict-serializability? Or are you talking about different kinds of _locks_?

    If the user has to decide how to lock things, then the user doesn't get conflict-serializability, because he is not going to enforce 2PL.
    All of our locks guarantee a stable view... We can use the synchronization already in place in an application to define both our lock boundaries and what "Managed Object" we are locking on.

    It doesn't sound like you support ACID. The A in ACID stands for atomicity. That means if your JDBC driver rolls back your transaction, you need to roll back the state of all the DSO objects. Or if a thread encounters a bug and throws a NullPointerException, you need to roll back the state.

    It's okay - you don't have to support transactions. It sounds like you are addressing web applications rather than what are called "enterprise apps".
  51. Isolation levels in JDBC

    It doesn't sound like you support ACID. The A in ACID stands for atomicity. ... It's okay - you don't have to support transactions. It sounds like you are addressing web applications rather than what are called "enterprise apps".

    Jim Gray, in the bible of transaction processing, defines the A in ACID as:
    "For transactions, all or nothing. Either all actions happen or none happen."

    That is true for us in a cluster-wide sense. Either all the work done in a lock is applied to the cluster or none of it is. We do not currently support any transparent 2PL, but I'm not ruling it out in the future.

    I wouldn't say that we are just for web apps, I would agree that since we aren't trying to be a database, using us as a transactional system of record would not be a great idea. But, if one wanted to have some guarantees about how transient data is shared and how memory is managed across a cluster I am quite comfortable with that use case.

    Cheers,
    Steve
  52. Isolation levels in JDBC

    Jim Gray, in the bible of transaction processing, defines the A in ACID as: "For transactions, all or nothing. Either all actions happen or none happen." That is true for us in a cluster-wide sense. Either all the work done in a lock is applied to the cluster or none of it is.

    It's nice that all nodes see the same final states, but that's not what Gray was talking about. "All or nothing" means that if my thread sets three fields on three different objects, then it's guaranteed that if an exception occurs after write 2, say, writes 1 and 2 are rolled back.

    It's the old bank-account example.

    In your system I would have to explicitly grab exclusive locks on accounts 1 and 2 and transfer the money. And I have to make sure that I don't unlock 1 before locking 2, like this:

    synchronized (account1) {
        synchronized (account2) {  // note: every thread must acquire these locks in the same order, or this can deadlock
            int amount = 100;
            account1.setBalance(account1.getBalance() - amount);
            account2.setBalance(account2.getBalance() + amount);
        }
    }

    Come to think of it, how do you lock N objects in your system, where N is known only at runtime? Do you have an API for that?
    We do not currently support any transparent 2pl but I'm not ruling it out in the future.

    I think it shouldn't be too hard. From the bytecode you can tell if a method is read-only, and on those you can get shared locks. Get exclusive locks on all other methods, and release all the locks at the end (although to mark the end you may need a mini-API.)
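
    A minimal sketch of the bytecode check suggested above, written against a recent ASM API (an assumption; a real implementation would also have to analyze the methods a method calls before treating it as read-only):

    import org.objectweb.asm.ClassReader;
    import org.objectweb.asm.ClassVisitor;
    import org.objectweb.asm.MethodVisitor;
    import org.objectweb.asm.Opcodes;

    // Flags methods that write a field; anything not flagged is a candidate
    // for a shared (read) lock rather than an exclusive one.
    public class ReadOnlyScanner extends ClassVisitor {
        public ReadOnlyScanner() { super(Opcodes.ASM9); }

        @Override
        public MethodVisitor visitMethod(int access, final String name, String desc,
                                         String signature, String[] exceptions) {
            return new MethodVisitor(Opcodes.ASM9) {
                @Override
                public void visitFieldInsn(int opcode, String owner, String field, String fdesc) {
                    if (opcode == Opcodes.PUTFIELD || opcode == Opcodes.PUTSTATIC) {
                        System.out.println(name + " writes " + field + "; needs an exclusive lock");
                    }
                }
            };
        }
    }

    // Usage: new ClassReader(classBytes).accept(new ReadOnlyScanner(), 0);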
    I wouldn't say that we are just for web apps... if one wanted to have some guarantees about how transient data is shared and how memory is managed across a cluster, I am quite comfortable with that use case.

    You could use it for a matching engine (as in bids/asks), if you are not already. Maybe they won't mind not having transactions, because the operations are very simple. But they do care about high availability, so the replication will help you there.

    I think that to save people time you might want to mention that you don't need an API because you don't have to support transactions.

    Guglielmo
  53. Isolation levels in JDBC

    FWIW, our DSO and HA-JDBC components of our virtualization server are not invisibly connected today. This means that if you want to update fields of objects that are in DSO, you should, for example, requery the DB through our JDBC driver and reapply the column values onto the fields. (Since JDBC is caching, you wouldn't actually hit the DB unless the data changed, and thus TVS would perform roughly as well as other OR-mappers.)

    In this sense, we are no more and no less transactional than the way the app was written in the first place.

    This is completely orthogonal to the issue of "web app" or "enterprise app" applicability. DSO is an ACID-compliant VM clustering solution. JDBC is a driver that does not perturb your existing definition of transactionality (ACID or otherwise) when communicating w/ a DB. Enterprises can and do get value from these 2 solutions and have no issues w/ the "transactionality" of the product. (It is relatively easy to test our claims once our software is released in July.)

    All this said, in the future, we will implement a DSO <-> JDBC bridge and there, we intend to revisit our default behaviors around 2PC, etc. Our product has the ability to be transactional all the way back to the DB update but today, we don't express such capabilities for customers to "do harm" with.
  54. Isolation levels in JDBC

    Come to think of it, how do you lock N objects in your system, where N is known only at runtime? Do you have an API for that?

    No, we don't have an API. Since the object being synchronized on is a "Managed Object", we are essentially locking on its identity.

    Cheers,
    Steve
  55. BTW, Steve: you never worked for a company called Pencom, right?
  56. BTW, Steve: you never worked for a company called Pencom, right?

    Nope, never heard of it :-)
  57. Small correction. DSO is ACID.

    Right...And JDBC does not disturb the ACID properties of a DB transaction. I just had a brainfart.

    Sorry for the confusion, and thanks for the catch, Steve.

    --Ari
  58. I just had a brainfart.

    Now, I am not going to ask you about the mechanics of that :)
  59. hub and spoke, peer to peer

    Guglielmo, I suggest we take this offline. You should definitely stop by our event at JavaOne if you are in town that week.

    I am probably not going to make it :(
    You should sign up for our download. And, we should speak by phone / chat...I suggest this not because I want to stop this thread--quite the opposite.

    I didn't want to read a lot of material, but maybe I will download.
    I do think there is a disconnect here, however between what you are trying to solve and what we are trying to solve...

    That is often the case; I think Cameron and possibly others here can attest to that.
    Regarding your questions above: no, nothing hits the server when it has info locally (roughly speaking). A clustered lock is simply that. You have to lock on the cluster, and then you trust that all your peers do the same, and then they can be blocked when you have the lock. And, reading a field is not a network-activity. If you have a non-clustered object in the VM, it is never going to cause a network hit.

    I see. I guess you can live with that because you don't support transactions ...
    Interesting. We will look into it. FWIW, we do believe that there are a class of business app that just does not need MainFrame or MPP-types of capacity.

    It's starting to sound like you are mainly targeting "web applications" (in the J2EE sense), because usually those don't worry too much about isolating the effects of one user from another. Also things like 'whiteboard' applications, all non-transactional.
    There are indicators out in the customer landscape (folks dumping proprietary HW solutions for linux / Intel) that suggest we are on to something.

    I don't deny that common off-the-shelf components are the way to go. But InfiniBand is fast becoming commoditized, also. The most practical way to approach this might be to get a cluster of Apple boxes. I am not an Apple user, but I think InfiniBand is well supported there. It should be real inexpensive.

    Just for fun, here is a beautiful article about Parallel Sysplex:

    http://www.research.ibm.com/journal/sj/362/nick.html
  60. If you have 2 HTTP requests in the cluster at once (user hits refresh after a server appears to crash due to a VM pause), and both requests cause a session update...only 1 update will take. This is actually the expected behavior for ServletSession (no one expects their Servlet container to throw OptimisticConcurrencyException, for example).

    They should process in serial, which you could accomplish with synchronization; for example, that's how Coherence*Web does it to avoid the potential for missing updates. Further, you can do clustered locking without any network traffic until either sticky load balancing unsticks or a server fails.

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  61. Greedy locks

    Further, you can do clustered locking without any network traffic until either sticky load balancing unsticks or a server fails.

    You guys have some very smart / cool optimizations. I like it...sounds like a special case for Session of our generic "greedy locks" solution described earlier.

    --Ari
  62. Fine-grained replication

    Ari,

    Thanks for pointing out the differences. Not to steal the limelight, but just some additional points from our side. :-)
    1. We treat JDBC and Object data discretely. JBossCache is a clustered object cache. Our HA-JDBC product can be integrated with our DSO product to do OR-mapping w/o any JDBC calls anywhere in the application code, for example. You could also implement our HA-JDBC (clustered database cache) with our DSO product quite easily. This is because we are solving both the problem of talking to systems of record as well as clustering VMs.
    Yes, JBossCache only deals with distributed stateful cache, as DSO does (I assume). It does not deal with entity cache synchronization directly. To do that, Hibernate can use JBossCache as a second-level cache, for example.
    2. Terracotta's scalability is based off of lots of product hardening. We have what we think is a linearly scalable solution that is [clustered] hub & spoke instead of peer-to-peer. Not sure if JBossCache can scale linearly to hundreds of servers.
    We are currently in the process of benchmarking JBossCache with 100 Linux boxes or so with help from one of our customers. I assume DSO has been tested in a massive array of servers already?
    4. Configurable fine-grained locking. There are object-level locks and they are separate and apart from objects so you could have 2 classes lock on the same lock. You could also have read/write locks, concurrent-write locks that can be entered unfettered but still guarantee stable views of the data, etc.
    JBossCache's locking philosophy uses JDBC isolation level semantics (and this applies to both the plain and the AOP cache). That means the user only has a high-level, coarse-grained selection of the data consistency that the application calls for. Our experience has indicated that, unless deadlock occurs, the cost of replication (most of the time) outweighs that of fine-grained data concurrency (within the same VM).
    5. Distributed signaling (method calls, wait/notify, etc.) as we have mentioned before.
    Yes, wait/notify may be considered part of distributed locking. But I think remote RPC calls should not be part of a distributed cache system, IMHO. As a matter of fact, we have provided an "un-advertised" RPC method call API in JBossCache. But we are not sure if we want to officially expose it or not. :-)
    I would be very happy to spend time talking about JBossCache and improving our understanding of your product. Please do shoot me your contact info (feel free to sign up for info at http://www.terracottatech.com/company/TellMeMore.html) and we can get together at JavaOne if you will be there, or at another convenient time.

    Sure thing! I'd love to know more about TVS capability as well. I will be in J1. So we can look each other up. :-)

    -Ben Wang
    JBossCache
  63. We have what we think is a linearly scalable solution that is [clustered] hub & spoke instead of peer-to-peer.

    I can't help but consider your star topology as less performant and less reliable than WebLogic's peer-to-peer clustering. E.g., I assume that routing through your hub doubles the hop latency, adds a central point of failure, and causes switch contention.
  64. We have what we think is a linearly scalable solution that is [clustered] hub & spoke instead of peer-to-peer.
    I can't help but consider your star topology as less performant and less reliable than WebLogic's peer-to-peer clustering. Eg, I assume that routing through your hub doubles the hop latency, adds a central failure point, and causes switch contention.


    Hi Brian,

    Before joining Terracotta, I worked at WebLogic/BEA for over 7 years as an engineer and manager responsible for JMS, EJB, Web Services, and, most relevant to this discussion, the Core team responsible for WebLogic Server's clustering implementation. From a technical perspective, BEA has nothing comparable to the functionality provided by Terracotta.

    In WebLogic Server, the primary/secondary architecture used for clustering servlet session state is elegant, but embodies in it a simplifying assumption: that there is only one client to the state, i.e. the browser. The same is essentially true for stateful session beans as well. Terracotta's clustering solution makes no such assumption.

    WebLogic also replicates state via the cluster wide JNDI tree, however that was not designed as a cluster-wide object store, and I don't think it would perform well in that capacity. It certainly provides none of the transactional guarantees that DSO provides.

    WebLogic's implementation of EJB does not attempt to share object state except by storing data in a cluster-wide accessible database. It is not a light-weight model, and the locking semantics are only what is provided by the underlying database.

    There are various replication strategies for RMI objects in WebLogic Server, but it is the stub that is replicated, not the state, and there is no failover capability for a stateful impl.

    Terracotta's approach focuses on clustering the state. There may or may not be more hops to access some particular piece of state, depending on whether it is present in the L1 (the spoke), or needs to be refreshed from the L2 (the hub). Even in the latter case, because we perform fine grain replication (field level) we can be more efficient on the wire than serialization-based approaches.
      
    There are many technical differences between the two approaches to clustering, and I'd be happy to dive into more details. But it is hard to compare the two directly when they differ fundamentally in their models. In general, WebLogic's clustering is based on smart stubs and remote method calls, whereas Terracotta's is based on distributed state.


    Don Ferguson
    Vice President Engineering
    Terracotta, Inc
  65. A great thread

    I don't know about others, but this is one of the best threads in the last 12 months. The level of technical details is way above the average thread by an order of magnitude.

    I've been exploring these techniques for the last few years to figure out an effective strategy for distributing state across a massive cluster of systems. I wish I could make it to JavaOne, but since I can't, I'll ask my questions here. How does TC handle dynamic scenarios where a node is removed/added/suspended within the cluster? In the pre-trade compliance space, a partitioning approach is pretty common. Since the first level of compliance evaluation is at the account level, partitioning is straightforward. Once the rules apply to a group of accounts, mutual funds, or the whole firm, can Terracotta handle a case where an application needs to search across the cluster?

    If I understand the Javaspace model correctly, the process would be sent out to each node in the cluster. Each node performs the calculation locally and then sends the result to a central point. Say I need to do something like:

    Sector technology cannot exceed 8% of the firm's total managed assets

    Many of the accounts would be static (i.e. they change once a week or less), but a significant percentage of the accounts would be active and shifting constantly. From T to T+n, the weight of "sector technology" may exceed 8%. Would it be feasible to query across the cluster with minimal latency?

    peter
  66. A great thread

    Peter,

    Thanks for the compliment. We are just here to help the community grow and thrive--like everyone else here. Very encouraging to hear all the questions, feedback, and alternatives from folks thus far.

    Now, I am not sure I understand your question. Picking up on the keywords partition, dynamicness, and search, I will try to answer:

    1. partitioning of data. The entire dataset is available to any and all L1's (JVMs) at all times. However, that dataset only resides in its entirety in L2. L1's keep only the most popular components of the shared object graph in memory.

    2. Dynamicness. You can, therefore, shutdown any and all L1's and upon restart, they will page data back in from L2. Even if your application was written for a single node and new()s up the object you want to cluster, TC can cluster that object w/ no code changes. We see the constructor call and turn it into a call to get the object out of TC instead. We designed the product explicitly to make operating Java apps in a production environment easier; being able to pause or do a rolling restart (restarting one node at a time while workload spills over to other nodes) is an explicit feature. BTW, our L2 clusters will support loss or planned downtime of individual L2 nodes as well.

    3. Search. Since the dataset is dynamically paged into L1 and resides in its entirety in L2, one could get an iterator to a clustered hashmap, for example, and the map will page in fields as you come to them. You could efficiently iterate over a multiple-gigabyte collection with minimal regard to performance. A non-collection example we have run with customers is a 4GB XML doc that gets loaded as a DOM. As the L1 navigates the DOM, Terracotta pages in pieces of the DOM so that the physical heap can be 128MB instead of the full 4GB+. You asked about performance. Performance in the DOM case could be better (it is one of our slower examples), but it was a 15-minute type of PoC. In all other performance tests that our customers have run, our customers report that TC outruns the competition--not just the fine-grained AOP-based guys who are most like TC.

    Now, you mention JavaSpaces and, interestingly enough, we have contemplated a JavaSpaces-like tool. If enough people think it makes sense to use JavaSpaces APIs but without a Spaces implementation (transparently cluster all the data behind the scenes), please let us know and we will get something out to the community ASAP.

    For that matter, we would love feedback on API's that folks out there like, but whose implementation could stand to be simplified down to a clustered collection (an example is JMS implemented via java.util.List and synchronize{}).

    Maybe we should put our distributed testing framework up for download as well as the core product. We used our own product to build a clustered testing tool that can do things like:

    1. Start all JVMs in a test in unison before a clustered workload can be applied to them
    2. Cause explicit race conditions and/or deadlocks and hangs in the cluster
    3. Centralize logging libraries so that all STDOUT, for example funnels to a single control point...

    Again, thanks everyone for all the interest and feedback.

    See you all at JavaOne!
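
    As a sketch of the "JMS implemented via java.util.List and synchronize{}" idea Ari mentions above (an illustration only; whether the List is actually clustered is assumed to be a Terracotta configuration matter, not code):

    import java.util.LinkedList;
    import java.util.List;

    public class ListQueue {
        // With the list configured as a shared object, the claim above is
        // that this single-VM producer/consumer code would coordinate
        // across VMs unchanged.
        private final List<String> list = new LinkedList<String>();

        public synchronized void publish(String msg) {
            list.add(msg);
            notifyAll();                    // wakes waiting consumers
        }

        public synchronized String take() throws InterruptedException {
            while (list.isEmpty()) {
                wait();                     // blocks until publish() signals
            }
            return list.remove(0);
        }
    }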
  67. A great thread

    thanks for responding Ari :)

    I wish I could make it to JavaOne. It's been ages since I've seen you face-to-face. Back to my question, since I wasn't too clear. Sorry about the poor description.

    Say I load 200K accounts into memory out of 20 million. Those 200K are actively trading. Let's assume the average number of positions for the active accounts is 40. Multiply the two and we get roughly 8 million positions in memory during the active trading hours. The data would be in object format and not DOM, so the usual DOM issues shouldn't apply here. In rigorous pre-trade compliance, the system would need to check account rules, group rules, firm-wide rules and government regulations. Let's say I decide to partition the accounts such that all the systems have the same number of active accounts and I set up my JMS to route the transactions correctly. Performing account-level rules is simple, since all the positions are in the same physical system and VM, so doing an aggregate calculation like "the weight of the retail sector for all countries the account invests in" is straightforward.

    At runtime, it means the system has to look at how many countries the account invests in (i.e. select distinct country from positions where account=id). Once the system has the distinct countries, we take that and run the aggregate query, which might look like "select sum total from positions where ..." I'm just paraphrasing the queries, so they're not real.
    So the question is specifically: does TC provide that ability through plain SQL? And what kind of latencies would one expect? I'm guessing it will be much faster than going back to the db. The other part of this specific class of problems is that the number of parallel queries may be pretty high. It looks like the design of TC basically has most, if not all, of the data one would need in memory (L1 or L2), since the user would have loaded the data to create the transactions.

    look forward to your answer :)

    peter
  68. A great thread

    We have no search feature in DSO.

    W/o an explicit searching feature you could do what one customer is planning. They have 8GB-10GB of object data at any one time and they just grab an iterator to the entire collection (well sort of) and search in Java. We are under NDA so I can't detail THEIR solution, but here's the basic idea in generic terms.

    Instead of JMS and JDBC (SELECT * WHERE ID=1234;) you would want to partition the dataset logically as opposed to physically and start many VM's on parts of the list. Each VM adds any object references that meet the pattern (account.id=1234;) to a shared List.

    Since the input and output lists are both shared and transactional, the VM's can divide the work yet collaborate on a shared dataset. You could even get fancy and make it all dynamic. Use a 3rd collection (Array) where each node posts its existence at startup. At runtime, you could MOD the size of the dataset by the number of nodes and divide the workload based on how many servers you can actually provide at any given time. Example: 8MM data elements and 8 servers; each one can decide, based on its position in the shared array, to take that position's percentage of the data. If I am server 1 in array position zero, I work on the first 1MM data elements. If I am server 8 in array position 7, I work on the last 1MM data elements. This way I don't need locks of any kind. I just work on an isolated portion of the dataset. And, when I see the search pattern in my portion of the dataset, I add its reference to the output search results List.

    Then a 9th machine could come online and iterate on this List in a blocking fashion. It sleeps until any of the other 8 servers notify it (literally using wait/notify) when they have published data to the output List. It then wakes up and continues the relevant calculations on the matched objects, then goes back to sleep on the List, waiting for another notify.

    Last, you can make this capable of restarting where it left off simply by having the 9th server "cluster" its intermediate data even though no other node is sharing that data with the 9th server. That way a VM restart will find the already calculated subset of the answer because that info is persisted in L2 and pages back in on restart.
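
    A minimal sketch of the slice arithmetic described above (an illustration; the dataset and results List are assumed to be clustered via configuration):

    import java.util.List;

    public class PartitionedScanner {
        // Each node derives its slice from its position in the shared
        // registration array, so the scan itself needs no locks.
        static void scanMySlice(int position, int nodeCount,
                                List<String> dataset, List<String> results,
                                String pattern) {
            int chunk = dataset.size() / nodeCount;   // e.g. 8MM elements / 8 servers = 1MM each
            int start = position * chunk;
            int end = (position == nodeCount - 1) ? dataset.size() : start + chunk;
            for (int i = start; i < end; i++) {
                if (dataset.get(i).contains(pattern)) {
                    synchronized (results) {          // only the shared output list needs coordination
                        results.add(dataset.get(i));
                        results.notifyAll();          // wake the aggregating "9th" node
                    }
                }
            }
        }
    }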
  69. sounds cool...

    That sounds straightforward. So basically, the application developer would be responsible for creating some indexes. What I was thinking is that it would be nice to have a standard way of querying and indexing the in-memory data. Since reads for data in L1 or L2 (i.e. selects w/o functions) don't need to hit the database, it would be nice if selects with functions didn't have to hit the database either. I think JavaSpaces provides that capability.

    peter
  70. sounds cool...

    What I was thinking is it would be nice to have standard way of querying and indexing the in-memory data. Since reads for data in L1 or L2 (ie selects w/o functions) doesn't need to hit the database, it would be nice if select with functions didn't have to hit the database also.

    We have applications caching several terabytes of data in the application tier (indexed in memory) using parallel query to locate data from those caches.

    Real-time pre-execution compliance systems are typically much lighter weight than that, though. However, most of the work is still in the design, not the implementation. Regardless of what technology you use, the legal requirements are going to be pretty stiff nowadays (Sarbanes-Oxley, etc.).

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  71. sounds cool...

    Real-time pre-execution compliance systems are typically much lighter weight than that, though. However, most of the work is still in the design, not the implementation. Regardless of what technology you use, the legal requirements are going to be pretty stiff nowadays.

    As far as I know, you're totally correct on current implementations of real-time pre-trade. Very few firms go beyond simple restriction rules. Firms that check weight/diversification rules in pre-trade generally don't check all the rules. To do so would be much too heavy and would impact transactional throughput.

    Having said that, with all the fines of the last 3 years and the SOX regs, the compliance requirements are becoming more strict. For a while there, there were a lot of rumors in the industry about hedge funds coming under regulation. Some of the more recent fines have scared a lot of firms.

    I think many of the techniques used by Coherence, JavaSpaces and Terracotta will become important for handling heavier real-time pre-trade compliance requirements. Many small/medium firms still handle compliance manually, but those days won't last very long. The big firms in NY and Boston are either actively researching or already using some of these techniques. Hopefully, as this field advances and matures, performing rigorous real-time compliance will be closer to reality. At least I hope it gets closer :)

    peter
  72. sounds cool...

    What I was thinking is it would be nice to have standard way of querying and indexing the in-memory data.

    That's where Service Data Objects would help, the little gizmo whose utility so many doubt. At least in Globus, there's a standard service for indexing/querying distributed service data.
  73. sounds cool...

    What I was thinking is it would be nice to have standard way of querying and indexing the in-memory data.
    That's where Service Data Objects would help, the little gizmo whose utility so many doubt. At least in Globus, there's a standard service for indexing/querying distributed service data.

    Thanks for reminding me about Globus. It's been a while since I looked at the Globus toolkit, but that's one of the places where I originally came across the idea of querying across a grid space. One of the reasons I asked about dynamic discovery of nodes is that I like the idea of having a lightweight grid client. This way, when everyone is at home and the systems need to do batch/overnight processes, all the free CPUs in the building can take a chunk of the work. Applied to overnight compliance, it could dramatically reduce the time needed to check compliance on millions of accounts. In a large firm, there might be 5K desktop systems with 1GHz+ CPUs.

    Since compliance can easily be divided into parallel tasks, it's easy to spread the work. One of the challenges is packaging the work so it is a self-contained job. In some ways, it's similar to a rendering-farm scenario. To really squeeze every ounce of performance out of distributing the work, it would be good to take the aggregate results and route them to the system doing group- or firm-level aggregations. Ideally, a master (or multiple masters) would divide up the work and send it to the available nodes. If the master needs to aggregate a group of accounts, it "could" do this in a distributed-query fashion.

    peter
  74. sounds cool...

    Hey Peter - There's an example that does exactly this with our product download - if you're interested...

    Cheers - Gad (GigaSpaces)
  75. sounds cool...

    Hey Peter - There's an example that does exactly this with our product download - if you're interested... Cheers - Gad (GigaSpaces)

    I think you mentioned that in the past, but sadly I haven't had time to install it and take a look. If only I had more time, but somehow working with JMeter, Drools, Tomcat and other open source stuff seems to end up taking a higher priority in my free time :)

    peter
  76. A great thread

    Hi Peter

    This is indeed a great thread.
    ..If I understand the Javaspace model correctly, the process would be sent out to each node in the cluster. Each node performs the calculation locally and then sends the result to a central point. ..

    Actually, if you think about it, the subject here is how we manage distributed data structures across a cluster of VMs. There are various ways in which this issue has been dealt with in the past (mostly through database technologies or distributed file system (NFS) models).
    The dramatic increase in memory availability as well as network speed that we have been witnessing during the past few years has led to new emerging approaches to the same old problem. Caching and JavaSpaces would be some of them.

    The concept of a distributed VM is not new either, and there have been several attempts (that I'm aware of) to deal with most of those issues at the VM level in the past. One of the main challenges with the transparent approach is that only a few EXISTING applications can actually be taken and transparently moved to a distributed model without a need for re-architecture. Distributed applications are not just about API; they are mostly about architecture, as Cameron rightly suggested earlier.

    The approach that we have taken with GigaSpaces was actually the opposite way around. We took a distributed model - the JavaSpaces model - and built a virtual middleware framework around it. In other words, we took a set of distributed and non-distributed data models (HashTable, JDBC, JMS to name a few) and built them on top of our JavaSpaces cluster. Another interesting example that we have been playing with recently, together with Rod Johnson, is building a Spring Remoting interface on top of JavaSpaces to provide transparent RPC-like functionality using a content-based network of JavaSpaces.

    One of the immediate benefits that you can get out of this model is that you can mix and match between the L1/L2 approach and the p2p approach, and even combine the two. Based on the API, you can get different levels of control over the consistency, transaction and query semantics.
    I'm not sure if this would be surprising, but we already did that in late 2004, and right now we are demonstrating the ability not just to virtualize the middleware but to provision the middleware and the business logic associated with it dynamically, based on SLA agreements (memory capacity, CPU utilization, etc.).

    There is going to be an interesting panel during JavaOne

    "TS-5471 Jini™ and JavaSpaces™ Technologies on Wall Street" which I believe may of interest to you specifically (given the example you provided earlier) and probably other members on this list.

    Nati S.

    GigaSpaces – Grid Based JavaSpaces Cluster
  77. Too bad for me :(

    There is going to be an interesting panel during JavaOne: "TS-5471 Jini and JavaSpaces Technologies on Wall Street"...

    Man, I really wish I could make it to JavaOne. With Coherence, GigaSpaces and Terracotta, there's some really interesting stuff happening these days. It would appear that both hub/spoke and p2p models of distribution are finding fruitful soil and complement each other quite nicely.

    peter
  78. A great thread

    One of the main challenges with the transparent approach is that only a few EXISTING applications can actually be taken and transparently moved to a distributed model without a need for re-architecture. ... The approach that we have taken with GigaSpaces was actually the opposite way around.

    First off, a distributed VM with OPT-IN/OPT-OUT distribution semantics AND runtime control is new. With Terracotta one can cluster one object and not another. One can cluster one lock and not another. One can cache one DB call and [again] not another. This concept is not an all-or-nothing proposition. If you would, please cite an example of an equivalent past solution.
     
    Second, adding an API is getting ahead of the solution. Terracotta believes that vendors throw the baby out with the bath water by suggesting that an automatically distributed application can't scale properly and thus an API is required. I agree that many apps inadvertently fight clustering, but scale and APIs are not related. Scale is a runtime problem that cannot be solved efficiently exclusively at development time. In fact, trying to do so would violate a basic principle of engineering--performance tuning w/o real performance data.

    An API is, IMO, a contract for conversing amongst separate components of a logical machine. The API for conversing amongst threads in Java is defined in the JDK--wait/notify, synchronize, the memory model itself, etc. And the API for speaking to a DB, for better or worse, is JDBC (or JDO). Jumping to implementation and saying "here is a separate API to cluster with" forces a move to a whole new set of problems involving application complexity and TOTAL COST OF OWNERSHIP. The more we battle as a community over more API's than are necessary to address the problem space, the more the Java community suffers from usability challenges and barriers to adoption.

    FWIW, in the solution I discussed above the customer was rewriting the app. I did not claim that Terracotta was dropping in and somehow [magically] all the problems disappeared. Our point is not just that we make the app scale, but that we can do it with a single-VM implementation that gets clustered and tuned AT RUNTIME.

    I honestly suggest that we not use this forum as a tit-for-tat. Personally, I recognize that Tangosol and GigaSpaces are different from Terracotta and that they are capable solutions...the customer will have to decide if he/she wants another API, or just wants to stick with the JDK as all that is necessary during development.

    --Ari
  79. A great thread

    Ari - not a tit-for-tat... just a clarification:
    With GigaSpaces, clustering (as well as load-balancing, replication and fail-over) does not require an API - it's entirely a configuration / runtime issue - we actually offer a GUI that allows the creation and definition of clusters dynamically.

    Great thread indeed - looking forward to meeting you all at J1.

    Gad Barnea
    GigaSpaces, Inc.
  80. A great thread

    Ari - not a tit-for-tat... just a clarification: With GigaSpaces, clustering (as well as load-balancing, replication and fail-over) does not require an API - it's entirely a configuration / runtime issue - we actually offer a GUI that allows the creation and definition of clusters dynamically.

    Huh? Apples and oranges, my friend. Terracotta has NO imports, period. Looking over GigaSpaces, I see the following in the example code:
    package com.j_spaces.examples.dcachemap;
     
    import java.util.Calendar;
    import java.util.GregorianCalendar;
     
    import com.j_spaces.core.IJSpace;
    import com.j_spaces.core.client.SpaceFinder;
    import com.j_spaces.map.*;
    import com.j_spaces.core.client.DCacheSpaceImpl;
    import com.j_spaces.core.client.CacheException;


    I think what you mean is that once an app architecture includes GigaSpaces, many things are controlled at runtime--definitely a good system. Just different.

    Thanks,

    --Ari
  81. A great thread

    ...the customer will have to decide if he/she wants another API, or just wants to stick with the JDK as all that is necessary during development.

    I hope I didn't give the impression of "tit-for-tat". I definitely see value in both API-driven and API-less clustering approaches. My interest is in figuring out the strengths and weaknesses of each approach and how it might apply to real-time systems for compliance. Many of the current third-party compliance systems scale poorly due to design. If I have an existing application that needs to scale and I don't have time to rewrite 50% of the application, the Terracotta approach would be very attractive. On the other hand, if I've committed to rewriting my application to get closer to real-time, then all the options are viable solutions. My hope from an end-user perspective is that all these different approaches will produce more scalable solutions and reduce development cost. Rather than build my own clustering/load-balancing layer, it's much more attractive to use a mature product that has been tested in different environments.

    peter
  82. A great thread[ Go to top ]

    Peter,

    Very well summarized. All vendors are able to help. All would, IMHO, improve scalability and lower TCO for apps that aren't scaling today.

    I think TC is unique in helping existing apps--we actually helped a customer improve scalability and availability of RSA ClearTrust w/o RSA's intervention.

    The key to Terracotta is that we are targeting a much larger delta in current-state vs. post-purchase TCO. How can TC help more than the other vendors? This is the point of the runtime approach. Imagine your production environment as an organic entity. Workloads, size of dataset, and hotzones / popular data change overnight, so some tuning decisions are impossible to make at code time.

    Last, (everything in this world has trade offs) we are a younger company than the others out there. That's why everyone should download our software in July and help us catch up (just kidding).

    --Ari
  83. A great thread[ Go to top ]

    Last, (everything in this world has trade offs) we are a younger company than the others out there. That's why everyone should download our software in July and help us catch up (just kidding).

    --Ari

    I'd try it, but my home environment only has 4 servers, so it's kinda puny and weak. If I had a bigger/better dev environment at home, I'd gladly try it out on some compliance scenarios and send you the numbers :)

    Now, if Terracotta could offer trials in IBM's grid environment, that would be totally awesome.

    peter
  84. Durability, RDMA[ Go to top ]

    Two things:

    1) One nice thing about database servers is durability. If you promote using main memory to remove the database bottleneck, it seems to me that you also have to replicate your objects so that you don't end up with a really fast application that has sub-standard reliability.

    2) I think it's disappointing that you are not using RDMA to send your updates. Right now your value proposition seems to be that by emphasizing main memory and network access and de-emphasizing the database you get throughput, but this is despite, not because of, your use of UDP (or worse, TCP - I hope you are not using that). I think what you should have done is use Remote Direct Memory Access (InfiniBand!) to propagate your updates. That's nice because the writes are synchronous (no switching into the kernel) and the latency is fabulous (200ns min.)

    Anyhow, maybe those things will be in version 3.0 :)

    Guglielmo
  85. Durability, RDMA[ Go to top ]

    Guglielmo,

    I'll answer in two parts:

    1) If you're talking about the HA-JDBC driver, we use main memory to cache reads, not writes -- those get passed through to the database directly. If you're talking about the Distributed Shared Objects technology, our server can write object changes to disk, too. In either case, information is no less durable than it was without our products.
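
    In rough sketch form, the read-cache / write-through pattern looks like this (hypothetical names, not our actual driver internals -- a real driver wraps Connection/Statement/ResultSet rather than exposing a helper class):

    import java.sql.*;
    import java.util.*;

    public class CachingQueryRunner {
        private final Connection db;              // the native connection
        private final Map cache = new HashMap();  // query text -> cached rows

        public CachingQueryRunner(Connection db) { this.db = db; }

        public synchronized List query(String sql) throws SQLException {
            List rows = (List) cache.get(sql);
            if (rows == null) {                   // miss: go to the database
                rows = new ArrayList();
                Statement s = db.createStatement();
                ResultSet rs = s.executeQuery(sql);
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    Object[] row = new Object[cols];
                    for (int i = 0; i < cols; i++) row[i] = rs.getObject(i + 1);
                    rows.add(row);
                }
                rs.close(); s.close();
                cache.put(sql, rows);
            }
            return rows;                          // hit: served from memory
        }

        public synchronized int update(String sql) throws SQLException {
            Statement s = db.createStatement();
            int n = s.executeUpdate(sql);         // writes pass straight through
            s.close();
            cache.clear();                        // crude invalidation on any write
            return n;
        }
    }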

    2) I actually co-wrote one of the earliest independent implementations of Intel's Virtual Interface Architecture, which has RDMA in it -- perhaps I can speak to this:

    While it wouldn't be hard to use RDMA as our communication channel, our 'sweet spot' is somewhat different from that of InfiniBand and RDMA. We're not attempting to replace MPI, AM/GM, VIA, or RDMA, and we don't typically expect to run in environments where sub-microsecond latency is required (or even desired) -- our target is business applications in the datacenter, not compute clusters or MPPs. In that sense, we are much more of an alternative to RMI, JMS, EJB remoting, or any number of ad-hoc home-grown solutions, where flexibility and adaptability to complex business problems and their structures are paramount. Just as you likely wouldn't choose MPI to communicate in your server farm running web-based Java apps, you wouldn't use DSO (or RMI) in MPP clusters or for your computational-fluid-dynamics code -- it wouldn't perform appropriately for those kinds of applications, just as TCP connections don't. (But there are a lot of exciting things going on in the Java MPP space!)

    On the other hand, in the places where people are currently doing all kinds of RMI, EJB remoting, JMS queuing, and lots of ad-hoc solutions, we believe DSO is a much cleaner, easier-to-use replacement that doesn't require you to learn a new API. In this case, standard IP-based solutions are much easier to implement and manage than something like RDMA.

    That's not to say we couldn't (or wouldn't) run atop RDMA if it seemed customers' needs were moving in that direction -- it just isn't required for the problems our customers are solving right now.
  86. Durability, RDMA[ Go to top ]

    If you're talking about the Distributed Shared Objects technology, our server can write object changes to disk, too.

    It takes time to wait for data to be forced to disk. Take a look at Birman's book on process groups. In the section called "Advanced Replication Techniques" he talks about storing copies in order to get high availability AND increase throughput.
    While it wouldn't be hard to use RDMA as our communication channel, our 'sweet spot' is somewhat different from that of InfiniBand and RDMA. We're not attempting to replace MPI, AM/GM, VIA, or RDMA, and we don't typically expect to run in environments where sub-microsecond latency is required (or even desired)

    I think you ought to use InfiniBand. Surely someone out there is going to wish that getting a lock on an object took a thousand times less time! If the application does a lot of writes, then the difference in lock latency will drastically affect the concurrency level you can achieve.

    You should do it even for economic reasons: with all that communication your machines will spend much of their time in the kernel. With InfiniBand you are not going to be doing that - as you know :) Save money on hardware!
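
    Back-of-the-envelope, with illustrative numbers rather than measurements: if acquiring a contended distributed lock costs roughly 200 microseconds over a kernel TCP path but roughly 2 microseconds over RDMA, then an application that serializes its writes on that one lock tops out at about 5,000 versus 500,000 acquisitions per second -- a 100x difference in the ceiling before any real work is done.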

    Guglielmo
  87. Atomicity[ Go to top ]

    When the database rolls back the transaction because of constraint violations, what happens to the modifications that were made to the objects in memory? Do they get rolled back?

    G
  88. Great thread this - thanks all.

    Reminds me of the early days of J2EE when everyone was making up new usage patterns - I actually learnt something from TSS.

    Thanks again.

    Jonathan
  89. First, I'd like to congratulate Ari and his team on reaching this milestone - the growth in this space (welcome, GridGain...) just shows the potential and the need in the market for the solutions we're developing.

    As you might know, the Jini/JavaSpaces model addresses - through a different approach - most (if not all) of the features you present.

    Except that it's an API - not AOP - though it can be AOP'ized...

    We at GigaSpaces actually offer a clustered JavaSpace model that can be Hub+Spoke and/or P2P - thus allowing the developer to choose which model is most applicable. This model also offers built-in notifications and various types of distributed locking that offer fine-grained control over the state of objects in the cache - a distributed cache.
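
    For those unfamiliar with the model, here is a rough sketch against the standard JavaSpaces API -- the lookup URL is a placeholder, and this assumes a space is already running there:

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;
    import com.j_spaces.core.client.SpaceFinder;

    public class SpaceExample {
        // Entries are plain classes with public fields and a no-arg constructor.
        public static class Task implements Entry {
            public String payload;
            public Task() {}
            public Task(String payload) { this.payload = payload; }
        }

        public static void main(String[] args) throws Exception {
            JavaSpace space = (JavaSpace) SpaceFinder.find("jini://localhost/*/mySpace");
            space.write(new Task("hello"), null, Lease.FOREVER);  // publish
            Task t = (Task) space.take(new Task(), null, 10000);  // blocks up to 10s
            System.out.println(t.payload);
        }
    }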

    WRT the Azul portion of this thread - I just wanted to point you guys to http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=109&STORY=/www/story/06-07-2005/0003821240&EDATE=.

    We just announced an alliance with Azul - for those interested.

    Gad Barnea
    GigaSpaces, Inc.
  90. Azul-Terracotta Partnership[ Go to top ]

    WRT the Azul portion of this thread - I just wanted to point you guys to ... just announced an alliance with Azul - for those interested. Gad Barnea GigaSpaces, Inc.

    Gad:

    Congratulations on the partnership with Azul. We just announced a similar partnership with Azul today: http://www.azulsystems.com/news/press_sub20.php

    Bob Griswold
    Terracotta, Inc.
  91. What is the overhead?[ Go to top ]

    We have applications that need to access a very large number of objects (many GB of RAM, tens of millions of objects). A problem we face is the limitations of the VM. We would love to use 10-20 GB of RAM (the hardware is already there), but the VM is not there yet. We randomly access large portions of RAM. I am wondering what the CPU/network overhead of making these objects transparently distributed would be under these conditions.

    I would have guessed that object-level transparency would not scale when a large number of objects is accessed (by many concurrent clients) on short time scales. Moving so many objects across process boundaries and proxying all calls seems too microscopic an approach. Even small amounts of overhead (per object, per method) can add up to a big, unacceptable overhead.

    Best Regards
  92. What is the overhead?[ Go to top ]

    We have applications that need to access a very large number of objects (many GB of RAM, tens of millions of objects). A problem we face is the limitations of the VM. We would love to use 10-20 GB of RAM (the hardware is already there), but the VM is not there yet. We randomly access large portions of RAM.

    It really depends on what the application is trying to do. One of our first data+compute grid implementations was in financial services with about 50GB of data. They found that it was better to partition the data (keeping their JVM heap sizes down around 0.5 GB I think) and move the processing to where the data was being managed in the grid, instead of trying to move the data to where the processing was occurring.
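
    The pattern, sketched with hypothetical interfaces (deliberately not our actual API): ship a small serializable task to the node that owns the data, and return only the small result over the wire.

    public class GridSketch {
        // Hypothetical task/grid interfaces, for illustration only.
        public interface EntryTask extends java.io.Serializable {
            Object process(Object key, Object value);
        }
        public interface DataGrid {
            // Runs the task on whichever node holds the entry for 'key'.
            Object executeAt(Object key, EntryTask task);
        }

        // Example: aggregate a large series in place; only the sum travels.
        public static class Summer implements EntryTask {
            public Object process(Object key, Object value) {
                double[] series = (double[]) value;    // big data stays put
                double sum = 0;
                for (int i = 0; i < series.length; i++) sum += series[i];
                return new Double(sum);
            }
        }
    }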

    The only thing for certain is that every application is in some way unique, and trying to fit a one-size-fits-all solution onto all apps is a recipe for failure.

    Peace,

    Cameron Purdy
    Tangosol Coherence: Clustered Shared Memory for Java
  93. They found that it was better to ... move the processing to where the data was being managed in the grid, instead of trying to move the data to where the processing was occurring.

    Presumably a grid favors portable executables, such as bytecode or scripts. That's a departure from traditional homogeneous clusters. Indeed, the Global Grid Forum's reference implementation, Globus, was recast mostly in Java. Witness how many pleas for help the globus-general mailing list carries about building or installing Globus's few remaining legacy C libraries.

    With the emerging expectation of logic mobility that Cameron suggests, security is suddenly paramount. The popular grid frameworks are anal about security, despite possibly having a conflicting requirement to avoid centralized administration.