New Open Source Cache System


  1. New Open Source Cache System (69 messages)

    The SHOP.COM Cache System is now available at http://code.google.com/p/sccache/

    The SHOP.COM Cache System is an object cache system that:

    * is an in-process and external, shared cache
    * is horizontally scalable
    * stores cached objects to disk
    * supports associative keys
    * is non-transactional
    * can have any size key and any size data
    * does auto-GC based on TTL
    * is container- and platform-neutral

    It was built in-house at SHOP.COM (by me) and has powered our website for years. We are open-sourcing it in the hope that it will be useful to others and to get some help with its maintenance. This is our first open source attempt and we'd appreciate any help and comments.

    Threaded Messages (69)

  2. Re: New Open Source Cache System

    It's the gadget blowout!
  3. Comparison to other caching products?

    Nice one! How does sccache compare to other caching solutions such as JCS?
  4. How does sccache compare to other caching solutions such as JCS?
    From the FAQ for JCS, they seem very similar. I'm not familiar with JCS so I can't make an informed comparison. When I get a chance, I'll test it out.
  5. Re: How does sccache compare to other caching solutions such as JCS?

    I checked some more of the JCS docs. I'd say sccache is a lot simpler than JCS but with the same (or better) feature set. But, of course, performance is key in a cache system, so I'd have to do a head-to-head comparison to give a better answer.
  6. Memcached

    How is it better than Memcached? Thanks, Vinod http://blog.vinodsingh.com/
  7. Re: Memcached

    How is it better than Memcached?

    Thanks,
    Vinod
    http://blog.vinodsingh.com/
    Memcached has a lot of limitations:

    * Can't iterate over cache keys
    * Keys are restricted to 250 characters
    * Stored data cannot exceed 1 megabyte in size
    * Its failover/availability design isn't optimal; sccache's is much better.
    * I don't believe that memcached writes objects to disk (let me know if I'm wrong).

    One benefit of Memcached is that it has been ported to lots of platforms. sccache is currently Java only. I'm working on a comparison of sccache and other libraries. It's bare-bones at the moment: http://code.google.com/p/sccache/wiki/Comparison
  8. Re: New Open Source Cache System

    When it comes to open source caching/grid computing, Hazelcast rules. Only Coherence is better than it, but it is in a completely different (price) league.
  9. Re: New Open Source Cache System

    I'm not familiar with Hazelcast. But we did try Coherence. The problem is transactions. A caching system should not try to be a database. I don't know if Coherence has changed since we tried it a few years ago, but it did not scale for our purposes.
  10. Re: New Open Source Cache System

    I'm not familiar with Hazelcast. But we did try Coherence. The problem is transactions. A caching system should not try to be a database. I don't know if Coherence has changed since we tried it a few years ago, but it did not scale for our purposes.
    That statement feels a bit too general to me. In the past I've worked on a simple read-only cache for a large yellowpage/whitepage directory. I've also worked on real-time applications that needed a coherent distributed cache. Not all applications need the same type of cache. Writing a read-only cache is far easier than writing a distributed cache. My biased 2 cents. peter
  11. Re: New Open Source Cache System

    Does anyone know what algorithms are used inside Coherence? I've never seen a patents section on their site, so the assumption is that they are using (a combination of) prior art. A comparison with open source solutions would be perfect.
  12. Re: New Open Source Cache System

    From Hazelcast documentation:
    Communication among cluster members is always TCP/IP.
    Ouch.
  13. Re: New Open Source Cache System

    Why is that ouch? I thought that it was a well known fact that TCP/IP is as performant as UDP, but you don't lose packets... Can someone enlighten me?
  14. Re: New Open Source Cache System

    Why is that ouch? I thought that it was a well known fact that TCP/IP is as performant as UDP, but you don't lose packets...

    Can someone enlighten me?
    TCP/IP scales very well. sccache uses TCP/IP.
  15. Re: New Open Source Cache System

    Hazelcast is a really nice product and I am definitely going to use it if there is an opportunity; however, you can't scale it effectively to 100s of nodes as it says on the web site. While I cannot mathematically prove it to you, you can check it yourself. I would be delighted if this were a true alternative to Coherence (and I could use it to make some cash =), and while reading this article at work I couldn't wait to come home and try it. Afterwards, however, I was a little disappointed - but nonetheless it is a great library. Keep up the good work.
  16. Re: New Open Source Cache System

    TCP/IP scales very well
    Yes. But I think perhaps the missing bit is "scales very well to a point". After that one will need to do something else. Where that "point" is and what "scalability" means is application and somewhat business dependent.

    Let's say we have 5 JVMs in a cluster. Let's also say that naively each server holds a connection to each other server (so every member can talk directly with other members for maximum parallel data access etc). Thus for n servers, each server would need to open and maintain (n-1) connections (no need to connect to yourself), for a total of n * (n-1) connections per cluster. Even for a small cluster of 5 servers, you'd need to configure 20 TCP/IP connections (unless you use some type of discovery server - which then becomes a single point of failure / bottleneck, and each member would then need to hold a connection to it, so that'd be n * n connections). For 10 servers, still a small cluster, we'd need to configure 90 TCP/IP connections. Now let's say we have a decent sized data grid, with 100 servers (not clients): we'd need to configure 9,900 TCP/IP connections. Add a few thousand clients connecting to the cluster, and we've now got in excess of 10,000 TCP/IP connections. That doesn't sound that great. In fact it isn't. (NOTE: The above is not how Coherence works ;)

    Of course one would obviously automate all of this, but it's easy to see how just managing the connectivity in a reasonably sized cluster can get out of hand very quickly. Alternatively one could opt for not maintaining connections and simply connect to servers as and when needed, or pool connections, but then we'd need to add connection latency to all of our cache requests. This latency can be very significant on some platforms (beyond 1 second for TCP/IP). It's often significantly worse on cloud and virtualized infrastructure. So if we need to run a query across a 100 server system without maintaining connections, we may need to wait 100 seconds (worst case) just for connections! Even if it was done in parallel, one would still have to wait a few seconds for connections, which is pretty bad. Sounds unrealistic, right? Actually it's not. We see this kind of latency all of the time (not with Coherence ;).

    That's why Coherence was designed from the ground up (around 8 years ago) to avoid this kind of connection scalability and performance problem - and use UDP uni-cast and optionally multi-cast.

    ASIDE: Historically it seems almost all Data Grid / Caching solutions (except Coherence) started with TCP/IP as the underlying networking infrastructure. Over the years it's clear that almost all vendors of caching products have removed their product dependencies on TCP/IP and moved to a reliable UDP-based solution. That is, most solutions have had their network layer retro-fitted. From what I can tell - and I may be completely wrong here - most have done so in the last two years (or are planning to do so).

    UDP, while not reliable out of the box, can be made reliable. This is what Coherence does and always has, but it's multi-way and not point-to-point like TCP/IP. In fact Coherence essentially implements reliable messaging over UDP uni-cast and multi-cast and can intelligently switch between them on the fly (based on demand). It can also work without multi-cast ;) It also provides packet-level bundling (configurable to the micro-second, whereas TCP/IP Nagle is fixed at 250ms).

    A really simple test for Coherence is the out-of-the-box network throughput tests (both uni and multicast). These tests will easily push a 1Gb network to its theoretical maximum throughput... something that's difficult to achieve with TCP/IP out of the box.

    Personally I think the true test of TCP/IP is in its responsiveness. ie: Can it be used to detect remote GC? (so we can, say, re-route traffic to another server). Basically no. This is because all of the packet retransmission is handled by the operating system / hardware - neatly hidden from applications. Our applications can't respond to "micro failures" like dropped packets as it's all hidden from us. In fact with TCP/IP an application has absolutely no clue as to the availability of a remote server until the connection fails, which for TCP/IP may be minutes (depending on the operating system). This essentially means that a GC, or in fact the simple availability of a remote server, may delay a cache operation by seconds or, worse, minutes! It will certainly impact the availability of other servers and an application.

    Don't get me wrong, TCP/IP does work ;) We use and depend on it every day, but when pushed, it can bottleneck and exhibit poor response reasonably quickly. I guess this is why it isn't used in many high-performance or high-throughput messaging systems. I suppose the old adage applies here: use the right tool for the job.

    Just my 20p (probably now 10 euro cents) worth.

    Brian Oliver --------- Global Solutions Architect | Oracle Coherence Oracle Corporation
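    As a quick illustration of the connection arithmetic above, here is a minimal Java sketch (an editorial addition, not Brian's code) that just prints the full-mesh connection count n * (n-1) for the cluster sizes he mentions:

        public class MeshConnections {
            // Full mesh: each of n servers maintains a connection to the other n-1.
            static long fullMesh(int n) {
                return (long) n * (n - 1);
            }

            public static void main(String[] args) {
                for (int n : new int[] {5, 10, 100}) {
                    // prints 5 -> 20, 10 -> 90, 100 -> 9900, matching the figures above
                    System.out.println(n + " servers: " + fullMesh(n) + " connections");
                }
            }
        }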
  17. Re: New Open Source Cache System

    Let's also say that naively each server holds a connection to each other server (so every member can talk directly with other members for maximum parallel data access etc).
    The elephant in the room here is transactions. Coherence (and similar products) are quasi-databases that guarantee ACID for objects. This is not what sccache (and its ilk) does. sccache does not keep a connection to every peer in the farm. The clients have connections to each cache server, but not to the other clients. In practice, there's something like a 10-1 client to server ratio (or less). Further, server connections are re-used in the clients so threads/sockets may be low or high depending on usage. Most modern servers can handle thousands of connections/threads with no trouble at all.
  18. Re: New Open Source Cache System

    The elephant in the room here is transactions. Coherence (and similar products) are quasi-databases that guarantee ACID for objects. This is not what sccache (and its ilk) does.
    the argument feels a bit stretched in my mind. If someone needs a read-only cache, then tools like sccache are fine. I've used this type of cache and it can be very effective. On the other hand, if I need a coherent distributed cache, a simple read-only cache isn't going to cut it. To say there's an elephant in the room is misleading and biased, in my mind. Just because you don't need a coherent distributed cache and use a read-only cache doesn't mean there's something inherently wrong or bad. It would be more productive to give specific examples of when a read-only cache is appropriate. my biased opinion. peter
  19. Re: New Open Source Cache System

    It would be more productive to give specific examples of when a read-only cache is appropriate.
    Well, SHOP.COM is a specific example :)

    HTTP pages:
    Our pages are all dynamically generated from DB queries, etc. Once the HTML is generated, we cache the result keyed on the page URL.

    DB queries:
    The vast majority of data in our system never changes. So, we can cache the results of most of our queries.

    HTTP Sessions:
    Here's a case where a transactional cache would be useful, but it's not necessary. We use the cache in this case as a failover mechanism. If a server goes down, the user's session can be retrieved from the cache.

    POJOs:
    There are lots of use-cases where composed POJOs are expensive to build and can be cached.
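    To make the HTTP-pages case concrete, here is a minimal sketch of URL-keyed page caching (an editorial illustration assuming a plain in-process map, not sccache's actual API; all names are hypothetical):

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Minimal illustration of "cache the generated HTML keyed on the page URL".
        // A real deployment would back this with a shared cache tier instead of
        // a single in-process map.
        public class PageCache {
            private final ConcurrentMap<String, String> htmlByUrl = new ConcurrentHashMap<>();

            public String getPage(String url) {
                // Render the page once per URL and reuse the cached result afterwards.
                return htmlByUrl.computeIfAbsent(url, this::renderPage);
            }

            private String renderPage(String url) {
                // Placeholder for the expensive DB queries + templating step.
                return "<html><body>rendered for " + url + "</body></html>";
            }
        }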
  20. Re: New Open Source Cache System

    that sounds very similar to the cache I worked on at an old job. In my case it was yellowpage/whitepage data, which changed every month, but not daily. We also used the URL as the cache key, and we also saved the data to disk. I know other people have used the same technique, going back to 2000. Every night we flushed the caches and freed up the disk space. In our case, the website got over 20 million unique page views a day. Using the simple caching approach, we were able to scale out quite nicely. In contrast, I wouldn't use this approach for real-time systems like order management (aka trading systems). If someone uses a coherent distributed cache like Coherence, GigaSpaces or Terracotta when they need a read-only cache, that could be considered "user error". The only real way to tell is to look at the specific requirements. I think everyone can agree that no single caching approach will work optimally for all cases. peter
  21. Re: New Open Source Cache System

    In contrast, I wouldn't use this approach for real-time systems like order management (aka trading systems). If someone uses a coherent distributed cache like Coherence, GigaSpaces or Terracotta when they need a read-only cache, that could be considered "user error".

    Peter, that is a fairly accurate observation. I'd like to add the following: if we look at our specific requirements, we can come up with a specific set of assumptions and optimizations that wouldn't apply in a more general-purpose scenario. This approach assumes that we have a fairly good picture of the current and future requirements of our system. In reality we know the current system, but we hardly know what the future will look like. There is a good chance that in the next few months a new requirement will come along which will require better indexing or support for read/write behavior. What then? Do we end up with one solution for one set of problems and another solution for another set of problems? You can easily see how in the long term this sort of thinking can be too short-term.

    I wrote a post on this sort of buy/build question where I referenced Fred Brooks' excellent article "No Silver Bullet". Quoting from the article: "The key issue, of course, is applicability. Can I use an available off-the-shelf package to perform my task? A surprising thing has happened here. During the 1950's and 1960's, study after study showed that users would not use off-the-shelf packages for payroll, inventory control, accounts receivable, and so on. The requirements were too specialized, the case-to-case variation too high. During the 1980's, we find such packages in high demand and widespread use. What has changed? Not the packages, really. They may be somewhat more generalized and somewhat more customizable than before, but not much. Not the applications, either. If anything, the business and scientific needs of today are more diverse and complicated than those of 20 years ago. The big change has been in the hardware/software cost ratio." The full post is available here.

    Now I'm not saying that it was a wrong decision to go with an optimized solution for this sort of specific requirement; I just want to highlight the consequences of such a decision. Nati S natishalom.typepad.com GigaSpaces
  22. Re: New Open Source Cache System

    Jordan, I think you should clearly state that set of assumptions in your implementation. Nati S natishalom.typepad.com GigaSpaces
  23. Re: New Open Source Cache System

    Brian, that was a very good [& open] response.
  24. TCP and scaling

    Note that JGroups (www.jgroups.org) does exactly what was described before, namely add reliability on top of UDP. Reliability includes:

    - Retransmission of dropped packets (datagrams)
    - Once-and-only-once message delivery semantics: a message is delivered only once, and there are no duplicates
    - Fragmentation
    - Encryption, authentication
    - Ordering (FIFO and TOTAL)
    - Message batching/bundling
    - Failure detection: detect crashed nodes and remove them from the cluster view
    - Choice of transport: can be run over UDP or TCP

    Regarding scalability, I found that UDP with IP multicasting scales better for larger clusters (> 6 nodes), but... only if you tune the TCP/IP stack a bit. In my experience the biggest perf increase resulted from increasing the input buffer size (e.g. sysctl -w net.core.rmem_max=50000000). More tips & tricks are at http://www.jboss.org/community/docs/DOC-11595.

    Regarding the N-1 problem for TCP connections, I agree this is an issue, but modern operating systems easily manage thousands of connections. The bigger issue IMO is that we have to handle those connections, and that is usually done with threads. Even with wisely chosen values for thread pool min and max sizes, this can quickly become a problem, especially the added context switching overhead.

    There's research on eliminating the N-1 problem, and some approaches have nodes connecting only to their neighboring node, ie. a logical overlay ring is placed on the cluster and in {A,B,C,D,E}, A connects only to B, which connects to C and so on. E connects back to A. Messages are then sent via the neighbor, which forwards them on your behalf and so on. This can even be combined with a token passing scheme, where a token is rotated around the ring and only the holder can send messages, increment the seqnos and so on. For the interested, this is research by Yair Amir (Johns Hopkins).

    Cheers, Bela
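    As a toy illustration of the overlay-ring idea above (an editorial sketch, not JGroups code): each node only talks to its successor, and a message hops around the ring until everyone has seen it:

        import java.util.List;

        // Logical ring {A,B,C,D,E}: node i forwards to node (i+1) % n,
        // so each node maintains one outgoing connection instead of n-1.
        public class Ring {
            public static void broadcast(List<String> nodes, int origin, String msg) {
                int n = nodes.size();
                for (int hop = 1; hop < n; hop++) {
                    int receiver = (origin + hop) % n;
                    int forwarder = (origin + hop - 1) % n;
                    System.out.println(nodes.get(receiver) + " received \"" + msg
                            + "\" via " + nodes.get(forwarder));
                }
            }

            public static void main(String[] args) {
                broadcast(List.of("A", "B", "C", "D", "E"), 0, "update");
            }
        }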
  25. Re: TCP and scaling

    I failed to say in my last paragraph that this reduces the N * (N-1) connection problem to N-1.
  26. Re: TCP and scaling

    I failed to say in my last paragraph that this reduces the N * (N-1) connection problem to N-1.
    Any plans to make JGroups' API "standards compliant", so that it has a similar API to Coherence and other implementations (distributed Set, List implementations)? This is a very important step in order to push cluster systems into the mainstream and make them a commodity, starting from an open source implementation.
  27. Re: TCP and scaling

    Any plans to make JGroups' API "standards compliant", so it has similar API as Coherence
    Since when is Coherence a standard ? :-) I assume you're not referring to the JSR107 (JCache) work, are you ?
    and other implementations (distributed Set, List implementations).

    This is very important step in order to push cluster systems into main stream and make them commodity, starting from open source implementation.
    Not sure if I understand: JGroups is a reliable message transport, *not* a replicated cache. Coherence and JGroups cannot be compared. There are replicated caches built on JGroups, like JBossCache, but caching itself is outside the scope of JGroups. Well, almost... I wrote a ReplicatedHashtable a long time ago, and it still ships with JGroups, but it's more of a toy application. The focus of JGroups is definitely on reliable messaging and the protocols required to do so. Hope I didn't misread your comment
  28. Re: TCP and scaling

    Since when is Coherence a standard ? :-)
    It's not, that's why I put it in quotes =), but it's a nice API nonetheless.
    Not sure if I understand: JGroups is a reliable message transport, *not* a replicated cache. Coherence and JGroups cannot be compared. There are replicated caches built on JGroups, like JBossCache, but caching itself is outside the scope of JGroups. Well, almost... I wrote a ReplicatedHashtable a long time ago, and it still ships with JGroups, but it's more of a toy application. The focus of JGroups is definitely on reliable messaging and the protocols required to do so.
    Well, my assumption was that Coherence (at the kernel level) has the same feature set as JGroups (ordering, membership...), and that they have built extensions (for, let's say, caching functionality) with an API on top of that foundation, so I was wondering if it is possible to provide a similar lightweight API on top of the kernel. Pardon my ignorance, learning still in progress =).
  29. Re: TCP and scaling

    I see what you mean now. Well, in Coherence's case, their focus is on caching, whereas JGroups' focus is on messaging. There are a few caches built with JGroups used for replication/distribution, and those *could* be compared to Coherence. There once was a jcluster project with the goal of coming up with an API for 'clustering', ie. cluster membership management and communication. That was a few years ago, and the few folks on it never agreed on what the functionality should be...
  30. Re: TCP and scaling

    I don't know why the discussion was dragged to this level of comparing TCP to UDP again. What I wanted was to add a few points to balance the picture:

    1. UDP has an advantage in the way connections are managed; however, once you add reliability to the picture, things are not as they seem:

    - In a replicated cluster, members need to assume that not all nodes are equal and fill in the gaps for individual nodes as they get out of sync. As those gaps are expected to be sent to individual nodes, TCP is normally chosen for that purpose. This leads to the fact that when you add reliability to UDP each member may still need to maintain individual TCP connections with other members. Client connections can easily be load-balanced between the replication nodes, so each server only needs to manage a small group of the entire set of clients.

    - In a partitioned environment, on the server side each partition doesn't need to see the entire cluster, so TCP is not really an issue there. The only issue is a large number of clients that want to access the entire set of cluster members. In reality, however, not all clients are equal either. In many large cluster deployments each client agent accesses a particular partition; only some require access to the entire cluster. So in most real-life scenarios this limitation is fairly manageable in a partitioned scenario.

    - UDP vs TCP for large data packets: UDP packets are normally limited in size, which leads to a need to deal with packet fragmentation on the software side. This makes UDP-based transport fairly inefficient when it comes to large data sets. In many cases you would be required to change network configuration to change the limit on the packet size.

    - UDP vs TCP when things go wrong: when things go wrong (a group of cluster members gets out of sync at the same time), UDP can easily lead to a "perfect storm".

    - UDP vs TCP on the cloud: most cloud environments restrict the use of UDP, as UDP tends to break the isolation level between VMs. If your solution relies entirely on UDP, that means your cluster wouldn't work. I've come across scenarios where organisations had policies against enabling UDP-based clusters, or at least required specific permission to run any software that utilizes UDP.

    Bottom line:
    1. Most data clusters tend to use a combination of TCP and UDP (as far as I know this applies to Coherence as well).
    2. UDP is good for discovery but not for reliable data dissemination.
    3. UDP should be optional and not mandatory, as in some cases it is not even an option. Having a TCP-based cluster should be good enough in the majority of cases, certainly for the scenarios mentioned in this thread. There would be very rare cases in which UDP has an advantage.

    Again, my point was not to dismiss any of the arguments previously stated but to balance the picture a bit. We at GigaSpaces use both TCP and UDP. In most cases we ended up using mostly TCP for data replication and UDP for discovery. M2c

    Nati S - natishalom.typepad.com GigaSpaces
  31. Re: TCP and scaling

    I don't know why the discussion was dragged to this level of comparing TCP to UDP again.
    Probably because folks are interested ? :-)
    1. This leads to the fact that when you add reliability to UDP each member may still need to maintain individual TCP connections with other members
    Not if you use UDP as transport, then TCP is completely out of the picture
    UDP vs TCP for large data packets
    UDP packets are normally limited in size, which leads to a need to deal with packet fragmentation on the software side. This makes UDP-based transport fairly inefficient when it comes to large data sets.
    I don't agree; can you back up your assertion with data? In my tests, I haven't seen this. To send 100MB messages across the wire, UDP is about the same as TCP. Are you talking about messages larger than that?
    In many cases you would be required to change network configuration to change the limit on the packet size.
    Are you referring to UDP datagrams? You *cannot* change the datagram size limit; the length field is 2 bytes == 1 short, so the maximum is ~65 KB. Did you mean the max IP receive buffer size (net.core.rmem_max)?


    - UDP vs TCP when things go wrong
    When things go wrong (a group of cluster members gets out of sync at the same time), UDP can easily lead to a "perfect storm".
    Not if you use (exponential) backoff for retransmissions, like in JGroups. Assuming you're referring to broadcast storms, those can be prevented easily. Also, a lot of broadcast storms in the past were caused by misconfigured networks, e.g. networks with 'cycles'.
    UDP vs TCP on the cloud
    Most cloud environments restrict the use of UDP, as UDP tends to break the isolation level between VMs
    I have no idea what you're talking about. Can you elaborate ?
    I've come across scenarios where organisations had policies against enabling UDP-based clusters, or at least required specific permission to run any software that utilizes UDP
    Correct, that's my experience too. In those cases, clustering software has to fall back to TCP/NIO.
    UDP is good for discovery but not for reliable data dissemination
    I disagree completely
    While TCP does have an advantage there (flow control), UDP these days has almost no issues, because modern switches provide very good isolation of collision domains, ie. very few dropped packets *IF* you configure them correctly. E.g. some of the pricier managed switches have drop setups which drop broadcasts, multicasts, TCP, HTTP, RTP/SIP, in this order. Those can be configured correctly to provide excellent performance over UDP.
    The disadvantage of not having flow control as part of the stack (as TCP does) is made up for by the excellent scaling of multicasts: instead of sending the same message N-1 times, you simply send it once. In my perf measurements, UDP passes TCP in terms of performance with growing cluster size.
    In my experience with JGroups over the last 10 years, UDP is so good performance wise that we use it as the default transport in JBoss Clustering.
    Note that you can also switch to using TCP for delivery and UDP for discovery, as you mentioned, and this is done by simply using a different stack within an XML file, but the only reason to do so is if a customer cannot use UDP, e.g. because it is disabled in their network.
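    For readers unfamiliar with the "send once instead of N-1 times" point above, here is a bare-bones JDK sketch (an editorial illustration, not JGroups code; the group address and port are arbitrary):

        import java.net.DatagramPacket;
        import java.net.InetAddress;
        import java.net.MulticastSocket;
        import java.nio.charset.StandardCharsets;

        // Minimal multicast demo: one send reaches every member that joined
        // the group, instead of N-1 separate unicast sends.
        public class MulticastDemo {
            static final String GROUP = "230.0.0.1"; // arbitrary multicast address
            static final int PORT = 4446;

            static void send(String msg) throws Exception {
                try (MulticastSocket socket = new MulticastSocket()) {
                    byte[] buf = msg.getBytes(StandardCharsets.UTF_8);
                    socket.send(new DatagramPacket(buf, buf.length,
                            InetAddress.getByName(GROUP), PORT));
                }
            }

            static void receiveOne() throws Exception {
                try (MulticastSocket socket = new MulticastSocket(PORT)) {
                    socket.joinGroup(InetAddress.getByName(GROUP));
                    byte[] buf = new byte[1500];
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet); // blocks until a datagram arrives
                    System.out.println(new String(packet.getData(), 0,
                            packet.getLength(), StandardCharsets.UTF_8));
                }
            }
        }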
  32. Re: TCP and scaling


    I've come across scenarios where organisations had policies against enabling UDP-based clusters, or at least required specific permission to run any software that utilizes UDP


    Correct, that's my experience too. In those cases, clustering software has to fall back to TCP/NIO.

    We don't allow cluster communication at all on normal server subnets, except pure heartbeat communication for clusters spanning 2+ sites. Anything else (RACs, host-based mirroring, etc.) goes on private subnets. It shouldn't really be about TCP versus UDP (although it's certainly easier to get it wrong in a way that hurts the network with UDP), but clusters tend to be pretty chatty.
  33. Re: TCP and scaling

    Let me clarify my points. There are more interesting questions to be asked here that were missing from the discussion, such as:

    1. What led a company such as SHOP.COM to come up with its own caching implementation when there are already various pretty mature implementations out there?
    2. What should be expected from this project, i.e. who is going to maintain it and what's the future plan for this project?

    One of the arguments from the Ruby guys is that the number of frameworks and options for doing similar things in Java makes Java programming complex. So do we really benefit as a community from having another caching project, or would it be better if the effort were instead directed at improving one of the existing frameworks? The discussion on UDP is a complete digression from this specific thread.

    Having said that, I'd like to clarify some of my points WRT UDP. You can build a fairly scalable implementation using TCP; the world wide web is the best example of that. The cases in which UDP would provide a significant advantage in a caching context are relatively rare, and will become even more anecdotal as the trend to offload most of the networking to the network card and network switches grows. I expect that solutions that rely on putting too much networking logic in the application stack will have issues benefiting from these changes. Now again, I have nothing for or against TCP or UDP; all I wanted to say is that the choice of TCP for SHOP.COM's cache implementation is fairly reasonable. The complexity involved in supporting reliable UDP doesn't make sense for all scenarios.

    I would also like to point to the following references that clarify my previous points:
    UDP Buffer Sizing
    Another article worth looking at is: Characteristics of UDP Packet Loss: Effect of TCP Traffic
    UDP vs TCP on the cloud: most cloud environments restrict the use of UDP, as UDP tends to break the isolation level between VMs
    I have no idea what you're talking about. Can you elaborate ?
    See Clustering JBoss in the Amazon EC2 Cloud: "The problem with deploying clusters in EC2 is that multicasting is not enabled in the EC2 network. This means that JGroups will not work by default in the Cloud, which in turn means that JBoss Clustering will not work either." HTH Nati S.
  34. Re: TCP and scaling

    Two interesting quotes from the same article:
    This essentially boils down to "TCP is OK for latency-sensitive applications if nothing ever goes wrong."
    Said differently, TCP adds latency whenever necessary to maintain equal bandwidth sharing and network stability. Latency-sensitive applications typically don't want a transport protocol deciding when they can send.
  35. Re: TCP and scaling

    And I almost forgot a quote from the same article:
    It's probably best to avoid using TCP for latency-sensitive applications.
  36. Re: TCP and scaling

    Uuuuu, this is getting even better =). (Pasting it here for those who won't bother to read the entire article.)
    TCP cuts its transmission rate by 1/2 every time it detects loss. This can add additional latency after the original loss. Even small amounts of loss can drastically cut TCP throughput.
  37. Re: TCP and scaling

    Note that this is the typical Reno-based impl of TCP (IIRC). You can configure TCP to not do exponential backoff and use (e.g.) linear backoff instead.
    But the issue of overwhelming the receiver is the same for UDP: you cannot (continually) send at a faster rate than the receiver(s) can process your data, or else the receivers have to either buffer your data (and run out of memory) or drop messages (which leads to costly retransmission).
    In such a case, the senders have to be throttled to adjust their send rates to the receivers' processing rates. This is what JGroups does, for example. It is not unlike TCP's sliding window: JGroups also uses a credit-based system.
    Now having said that, for latency-sensitive applications, e.g. video / audio streams like RTP, it is better to drop packets than to retransmit them. So here a flow-control-based mechanism would be bad. The art here is to buffer just enough data on the receiver and - if the receiver starts getting overwhelmed - to drop just the 'right' packets. It is, for example, better to drop a few packets frequently than to drop nothing and then drop everything for a few seconds. This is not unlike garbage collection...
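    A toy sketch of the credit-based idea described above (an editorial illustration, not JGroups internals): the sender spends one credit per message and blocks when the receiver stops replenishing credits:

        import java.util.concurrent.Semaphore;

        // Credit-based flow control: throttles the sender to the receiver's
        // processing rate, much like the sliding window mentioned above.
        public class CreditFlowControl {
            private final Semaphore credits;

            CreditFlowControl(int initialCredits) {
                credits = new Semaphore(initialCredits);
            }

            void send(byte[] msg) throws InterruptedException {
                credits.acquire();   // blocks if the receiver is overwhelmed
                transmit(msg);
            }

            // Called when the receiver acknowledges processed messages.
            void replenish(int n) {
                credits.release(n);
            }

            private void transmit(byte[] msg) {
                // wire I/O elided
            }
        }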
  38. Re: TCP and scaling

    You can build fairly scalable implementation using TCP. The world wide web is the best example for that.
    We're not talking about the WWW here. Our target environments are local area networks; all cluster nodes are within the same LAN.
    Yes, JGroups can span wide area networks, but my points really apply to LANs. WANs are a totally different beast.
    The cases in which UDP would provide a significant advantage in a caching context are relatively rare, and will become even more anecdotal as the trend to offload most of the networking to the network card and network switches grows.
    Again, you have much better scalability in UDP based clusters of significant size. One customer for example runs a flat cluster with 400+ nodes, and TCP just doesn't scale here. Note that there's a different discussion as to whether it makes sense to have flat clusters of this size, for example with buddy replication or partitioning you can avoid having to communicate with the entire cluster...
    I would also like to point to the following references that clarify my previous points:

    UDP Buffer Sizing
    Ah, I see: you mean sizing of receive buffers
    Another article worth looking at is:
    Characteristics of UDP Packet Loss: Effect of TCP Traffic

    This article does not apply to us at all: if a customer uses clustering, we always recommend having a separate network for cluster traffic, to separate cluster traffic from e.g. client-facing traffic. Of course, if there is not much traffic anyway, this is not needed.
    So we would never see our clustering traffic (e.g. if we used UDP) getting mixed with TCP traffic.
    I agree with the article, but it doesn't apply to us.
    The problem with deploying clusters in EC2 is that multicasting is not enabled in the EC2 network. This means that JGroups will not work by default in the Cloud, which in turn means that JBoss Clustering will not work either

    HTH
    Nati S.
    Yes - Amazon doesn't enable multicasting in EC2. This is what I mentioned earlier in this thread; you can simply use a different XML config to fall back to TCP in this case.
  39. Re: TCP and scaling

    Nati -
    I don't know why the discussion was dragged to this level of comparing TCP to UDP again.
    Probably because it's an interesting topic .. especially since most people don't get to spend a lot of time working at such a low level, which makes it even more interesting.
    UDP has an advantage in the way connections are managed ..
    True. UDP is a "connectionless" protocol. Basically, any port on any IP address can send a unicast UDP packet to any port on any IP address. That means you can use one unicast UDP socket to talk to as many other servers as you'd like. (UDP multicast is also "connectionless".)
    In a replicated cluster, members need to assume that not all nodes are equal and fill in the gaps for individual nodes as they get out of sync. As those gaps are expected to be sent to individual nodes, TCP is normally chosen for that purpose.
    TCP is certainly the simplest solution.
    Client connections can easily be load-balanced between the replication nodes, so each server only needs to manage a small group of the entire set of clients.
    Outside of specific latency-sensitive applications (e.g. telephony, media, other types of streaming such as time-sensitive feeds), client/server connections should use TCP. It's much simpler to build and maintain, relatively robust for that purpose, and easier to manage / load-balance / monitor / etc.
    In a partitioned environment, on the server side each partition doesn't need to see the entire cluster, so TCP is not really an issue there. [..] Most data clusters tend to use a combination of TCP and UDP (as far as I know this applies to Coherence as well)
    Coherence does not use TCP for data communication within the cluster. However, Coherence has a much different architecture than your product, so I understand why you would use TCP and it's probably the best choice in your case. When Coherence partitions information, it does so dynamically, and it not only partitions the primary copies of information, but by default it further partitions the backup copies (and also the backup copies of those backup copies and so on), thus ensuring that the system grows even more resilient to node failures as the size of the system increases.
    The only issue is a large number of clients that want to access the entire set of cluster members. In reality, however, not all clients are equal either. In many large cluster deployments each client agent accesses a particular partition; only some require access to the entire cluster. So in most real-life scenarios this limitation is fairly manageable in a partitioned scenario.
    Again, with clients "inside the cluster", the Coherence clustering protocol provides direct (via UDP) access to all the servers. In a client/server mode (outside the cluster, e.g. desktop applications or WAN-remote sites), the communication is done via TCP/IP. In that case, clients still have access to the entire cluster, but it is no longer direct.
    UDP packets are normally limited in size, which leads to a need to deal with packet fragmentation on the software side. This makes UDP-based transport fairly inefficient when it comes to large data sets.
    TCP obviously has to do the same thing, since TCP and UDP both put data into the same underlying packet structure. With TCP though, it's done for you i.e. it's easier. Also, sometimes TCP can offload these tasks to hardware (called a "TOE").
    When things go wrong (a group of cluster members gets out of sync at the same time), UDP can easily lead to a "perfect storm".
    Any unregulated network use can do that. TCP has built-in flow control algorithms, so again it's easier. With UDP, you have to build your own.
    Most cloud environments restrict the use of UDP, as UDP tends to break the isolation level between VMs - if your solution relies entirely on UDP, that means your cluster wouldn't work.
    This is incorrect, since Coherence runs very well on EC2 for example; I've recently seen Coherence configurations on EC2 running 3000 nodes ;-). I think what you mean is that most cloud environments restrict multicast UDP, which is just one form of UDP traffic.
    I've come across scenarios where organisations had policies against enabling UDP-based clusters, or at least required specific permission to run any software that utilizes UDP.
    Again, I believe that you mean multicast UDP.
    UDP is good for discovery but not for reliable data dissemination.
    Multicast ;-)
    UDP should be optional and not mandatory as in some cases it is not even an option.
    Multicast ;-) Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
  40. Re: TCP and scaling

    Cameron, your emphasis on multicast is correct. I should have mentioned it more explicitly; I assumed it was obvious that most of the references to the benefits of UDP in the context of this discussion related to UDP multicast. It is true that UDP datagrams can be used for unicast as well. Thanks for the clarification. BTW it's quite refreshing to see that we can finally agree on something :) Nati S - natishalom.typepad.com GigaSpaces
  41. Re: TCP and scaling

    Hi Nati -
    BTW its quite refreshing to see that we can finally agree on something :)
    We agree on lots of stuff .. but the interesting parts are always going to be the parts that we don't agree on ;-) Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
  42. Re: TCP and scaling

    I guess it is worth dragging out the TCP vs UDP discussion ;-) It is interesting to compare what is achievable in theory in a well-oiled lab environment to observed behavior in practice. Also, it gets quite interesting when reliability is not just about delivering a message but also about how it gets applied in the context of a cache - paying attention to data consistency. So, here is more opinion (based on my experiences with some very large deployments of GemStone's software).

    What is good with TCP?

    - The best part of TCP is that it is well tuned for flow control. It acts as its own traffic cop, and most networks (routers, switches) deal with this traffic well when faced with congestion issues. UDP, obviously, can be unfair, especially when flow control is not implemented properly. I see that some network administrators simply don't trust your software if it is based on UDP - and for good reason; it makes their life more difficult. The network in most Fortune 1000 companies is a mix of new networking hardware and quite a bit that is old - and, in practice, routers are careless with UDP; it is the first thing that is dropped when a router runs low on memory.

    - When making UDP work in a caching environment, it ain't enough to get the message delivered reliably - often, you have to make sure the message has been processed by the receiver, requiring an explicit application-level ACK back to the sender (the cache, say, makes a "write through" callback and you have to make sure the cache update is synchronously written to the database). So, you have to build all this machinery on top - retransmission buffers, congestion detection, flow control, ordering, switch/router configuration assuming it is possible, TTL for multicast, etc, etc. Meanwhile, most of this is handled by TCP quite well and done by experts who have tuned it at the driver level. Why implement all this in Java when it can be done more efficiently in the kernel layer? In fact, with Nagle OFF, we see little unnecessary buffering, little context switching and often much better relative throughput than UDP.

    - And, finally, there are so many options with TCP; as mentioned, you can use offloading engines, switch to something like Infiniband over SDP, etc much more easily.

    I am not implying UDP has no advantages over TCP, but, in practice (have I used this word enough?) it gets a lot more difficult even when you have a good implementation. Where we find immediate value for UDP is in failure detection. In our implementation, we use a combination of a neighborhood ping protocol and UDP for fast, predictable failure detection and cluster view maintenance. The GemFire product allows the developer to choose the protocol to use for each data type - so, if I am distributing data (say local orders) within my subnet or within a blade center, I can use reliable UDP multicast with unicast ACKs (once I know how the network works), or switch to TCP when the data is distributed to a much larger member count (say a product catalog) with potentially many (likely unknown) hops through routers/switches.

    Cheers! Jags Ramnarayan GemFire - Enterprise Data Fabric
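    For reference, turning Nagle off in Java is a standard socket option; a minimal sketch (the host, port and request format here are hypothetical, for illustration only):

        import java.net.Socket;
        import java.nio.charset.StandardCharsets;

        public class NoNagle {
            public static void main(String[] args) throws Exception {
                try (Socket socket = new Socket("cache-server.example.com", 11222)) {
                    // Disable Nagle's algorithm so small writes go out immediately
                    // instead of being coalesced while waiting for ACKs.
                    socket.setTcpNoDelay(true);
                    socket.getOutputStream()
                          .write("GET key-1\n".getBytes(StandardCharsets.UTF_8));
                }
            }
        }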
  43. Re: TCP and scaling

    Jags -
    So, you have to build all this machinery on top - retransmission buffers, congestion detection, flow control, ordering, switch/router configuration assuming it is possible, TTL for multicast, etc, etc. Meanwhile, most of this is handled by TCP quite well and done by experts who have tuned this at the driver level. Why implement all this in Java when it can be done more efficiently in the kernel layer. [..] GemFire product allows the developer to choose the protocol to use for each data type ..
    Gemfire uses jGroups for network communication, correct? Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
  44. Re: TCP and scaling


    Gemfire uses jGroups for network communication, correct?
    GemFire incorporates a native TCP based communication layer with multiple failure detection protocols. We also leverage the excellent work done by Bela Ban and others in Jgroups for UDP Multicast.
  45. Re: New Open Source Cache System

    Why is that ouch? I thought that it was a well known fact that TCP/IP is as performant as UDP, but you don't lose packets...

    Can someone enlighten me?
    Synchronous systems cannot scale as well as asynchronous ones. Each TCP packet uses additional low-level control packets, while UDP does not. Maybe at small scale performance is comparable; however, according to research papers I have been reading, async wins. Imagine a cluster with 200 nodes where, in the worst case, all nodes are connected to each other with TCP, and imagine how many resources are going to be spent on TCP low-level handshakes and such.
  46. Re: New Open Source Cache System

    Maybe at small scale performance is comparable; however, according to research papers I have been reading, async wins. Imagine a cluster with 200 nodes where, in the worst case, all nodes are connected to each other with TCP, and imagine how many resources are going to be spent on TCP low-level handshakes and such.
    But there's a huge catch - UDP is allowed to drop packets. I imagine you could build a network where you could guarantee that no packet is dropped, but I don't know enough about that. In any event, you'd then have a hardware/software solution. The take-away is that the "handshakes" are unavoidable. With TCP they occur in an extremely optimized way. With UDP you have to replicate this in software. P.S. The main 3-way handshake that TCP/IP does only happens on connection. Any reasonable solution will persist TCP/IP connections.
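    To illustrate that last point, a minimal sketch of a client that pays the TCP 3-way handshake once and then reuses the connection for many requests (the line-based protocol here is hypothetical, not sccache's):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.io.OutputStreamWriter;
        import java.io.PrintWriter;
        import java.net.Socket;
        import java.nio.charset.StandardCharsets;

        // The 3-way handshake happens once, in the constructor; every get()
        // afterwards reuses the established connection.
        public class PersistentCacheClient implements AutoCloseable {
            private final Socket socket;
            private final PrintWriter out;
            private final BufferedReader in;

            public PersistentCacheClient(String host, int port) throws Exception {
                socket = new Socket(host, port); // handshake paid here, once
                out = new PrintWriter(
                        new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
                in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            }

            public String get(String key) throws Exception {
                out.println("GET " + key); // no new connection per request
                return in.readLine();
            }

            @Override
            public void close() throws Exception {
                socket.close();
            }
        }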
  47. Re: New Open Source Cache System

    But there's a huge catch - UDP is allowed to drop packets.
    Imagine you have a cluster of 200 nodes in Hazelcast, and you want to send one message to all members. If you use a master sequencer, that's 200 TCP connections open all the time. If you send all messages from, let's say, the master, that's either 200 threads running concurrently, or you perform the sends (partially) sequentially - both workable, but just not optimal. If you use multicast, you achieve the same result in a better way.
  48. Re: New Open Source Cache System

    With UDP you have to replicate this in software.
    Exactly, and that's why Coherence is so good.
  49. Re: New Open Source Cache System

    With UDP you have to replicate this in software.


    Exactly, and that's why Coherence is so good.
    Coherence is transactional, so I can see why this is needed. sccache is not, so UDP wouldn't be of much use. I tried it a few years back and performance was no better, but the code was much more complicated. As an aside, there was a very interesting talk at last year's SD West on NIO vs thread-per-socket. I tried the NIO libraries and found them to be very poor (for sockets). The speaker made the same point and explained that with NIO you, in effect, have to implement your own thread scheduler.
  50. Re: New Open Source Cache System

    Transactions are not relevant to this talk. It is all about what event ordering you want to provide. Async > sync.
  51. Re: New Open Source Cache System

    Transactions are not relevant to this talk. It is all about what event ordering you want to provide.

    Async > sync.
    Call it whatever you want, but a library that is trying to maintain any type of atomicity of data will perform much differently than a library that doesn't.
  52. Re: New Open Source Cache System

    Why is that ouch? I thought that it was a well known fact that TCP/IP is as performant as UDP, but you don't lose packets...

    Can someone enlighten me?
    TCP/IP and UDP/IP both use the same underlying IP packet structure. UDP adds four bytes of header information (IIRC 2 bytes each for to- and from-port), while TCP adds a whole bunch of header information related to the various QoS that it provides. With both UDP/IP and TCP/IP, packets are lost. The difference is the response to lost packets:

    * With UDP, there is no response to lost packets. In fact there's no hint that a packet was even lost, so you'd have to write your own "reliability layer" on top of UDP to do ACKs and/or NACKs to keep track of what's getting through and what's getting lost.

    * With TCP, the two end-points communicate lots of information about what packets have been received and things like that, so that lost packets can be re-transmitted. However, when packets get lost, the data rate gets scaled back to prevent more packets from being lost; this is called "flow control". TCP flow control dramatically reduces the throughput of heavily loaded TCP-based systems, which is one of the reasons why it's not used for systems that have to get data through in a deterministic period of time.

    In general, TCP/IP is orders of magnitude simpler to use, since it is a reliable delivery implementation with flow control built in, and it makes connecting remotely and "streaming" arbitrarily-sized information very easy. In general, UDP/IP gives you much finer control of the network, but requires that _you_ do all the work. Implemented correctly, you can achieve higher throughput, lower latency, lower resource utilization, reliability, or some specific combination of those.

    There aren't too many cases in which you would use UDP, unless you're building systems software. Nowadays, even TCP is considered systems level, since you can just use HTTP (which is implemented over TCP/IP) as a simple request/response messaging system.

    Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
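    A bare-bones sketch of the kind of "reliability layer" described above (an editorial illustration only, not Coherence's implementation): send a datagram, wait briefly for an ACK, and retransmit if none arrives:

        import java.net.DatagramPacket;
        import java.net.DatagramSocket;
        import java.net.InetSocketAddress;
        import java.net.SocketTimeoutException;
        import java.nio.charset.StandardCharsets;

        // Toy stop-and-wait reliability over UDP: retransmit until the peer ACKs.
        public class ReliableUdpSender {
            public static void sendReliably(DatagramSocket socket, InetSocketAddress peer,
                                            String msg, int maxRetries) throws Exception {
                byte[] data = msg.getBytes(StandardCharsets.UTF_8);
                DatagramPacket out = new DatagramPacket(data, data.length, peer);
                byte[] ackBuf = new byte[16];
                DatagramPacket ack = new DatagramPacket(ackBuf, ackBuf.length);

                socket.setSoTimeout(200); // wait up to 200 ms for an ACK
                for (int attempt = 0; attempt <= maxRetries; attempt++) {
                    socket.send(out);
                    try {
                        socket.receive(ack); // treat any reply from the peer as the ACK
                        return;
                    } catch (SocketTimeoutException lost) {
                        // the datagram (or its ACK) was dropped: retransmit
                    }
                }
                throw new Exception("no ACK after " + maxRetries + " retries");
            }
        }

    A real layer would also need sequence numbers, duplicate suppression and flow control, which is exactly the machinery the posts above are debating.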
  53. Re: New Open Source Cache System

    If multicast is not the preferred way of discovery for your environment, you can configure Hazelcast as a full TCP/IP cluster.
  54. Re: New Open Source Cache System

    Does anyone know what algorithms are used inside Coherence? I've never seen a patents section on their site, so the assumption is that they are using (a combination of) prior art.
    The fundamental Coherence algorithms are trade secrets, and are not based on prior art. Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
  55. Re: New Open Source Cache System

    I'm not familiar with Hazelcast. But we did try Coherence. The problem is transactions. A caching system should not try to be a database. I don't know if Coherence has changed since we tried it a few years ago, but it did not scale for our purposes.
    I'm surprised to hear that, since a single Coherence cluster can easily scale to hundreds of servers, supporting thousands of application servers and millions of concurrent users. When I checked back through my email, I only found one conversation (23 December 2003) that you had with us, when you first became aware that Coherence existed and could potentially address your needs; you indicated at the time that you had already written your own solution, something akin to our "near cache" topology (local caching, backed up by a larger distributed cache, with entry-level invalidation). Nonetheless, kudos on building a successful distributed cache implementation; it's obviously a lot of fun. Peace, Cameron Purdy Oracle Coherence: Data Grid for Java, .NET and C++
  56. Re: New Open Source Cache System

    When I checked back through my email, I only found one conversation (23 December 2003) that you had with us
    We tested it as well. I want to be clear that Coherence is an incredible product.
  57. Does it implement org.hibernate.cache.CacheProvider?

    In other words, can it be used with Hibernate?
  58. Re: Does it implement org.hibernate.cache.CacheProvider?

    In other words, can it be used with Hibernate?
    Not currently. One of my hopes in open-sourcing it is that others can add things such as this. If there are enough requests and I have time, I can do it.
  59. TCP vs. UDP/IP-Multicast

    From our experience... I would add that depending on the QoS settings on your router/switch, UDP traffic can be configured to drop first when overload is detected, and TCP will get priority. To my knowledge this is the default config for most switches, which leads to massive packet loss under load for UDP/IP-multicast communication. I think that with proper switch/router configuration (programming) the difference between UDP and TCP can be somewhat equalized under load. It certainly differs from router to router, and is obviously possible only with high-end equipment. As Cameron mentioned, however, UDP gives fine-grained control over ACK/NACK management when resolving packet loss. Nikita Ivanov. GridGain - Open Cloud Platform
  60. Re: New Open Source Cache System

    I've been thinking about building a Coherence-like partitioned cache on top of JGroups, and I wasn't aware of Hazelcast until I read this thread. So I took a look at the Hazelcast source code for their HashMap, and if my first impressions are correct, there are a couple of things about it that I don't like.

    - There seems to be no concept of storage nodes and client nodes in the cluster. All nodes (perhaps with some exceptions -- see the next point) get a part of the data. I would like to have some stable storage nodes that can run for long periods uninterrupted, while clients can join the cluster and leave again without causing a lot of data migration.

    - It looks like each hashmap is divided into 10 blocks, and these 10 blocks are distributed among the nodes in the cluster. So if my cluster is larger than 10 nodes, the map only gets partitioned across ten of the nodes(?) Or maybe 20 -- each block has a backup copy, so with 20 nodes I suppose ten of them could have one primary block each and the other 10 could have a backup block each. But if my cluster is 100 JVMs, only a minority of them would be used. I guess a remedy for this could be to change the hard-coded constant from 10 to 100 and re-compile.

    - The way the entries get partitioned across the 10 blocks is essentially: block = Math.abs(key.hashCode()) % 10; There seems to be no dynamic re-balancing of the load if there is data skew.

    Does anyone have enough experience with Hazelcast to verify or refute these observations?
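    One side note on that partitioning expression: in Java, Math.abs(Integer.MIN_VALUE) is still negative, so the abs-then-mod form can yield a negative block index for one unlucky hash value; Math.floorMod avoids this (a small editorial sketch, not Hazelcast's code):

        public class BlockIndex {
            static final int BLOCKS = 10;

            static int block(Object key) {
                // Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE, so
                // Math.abs(hash) % BLOCKS can be negative; floorMod always
                // returns a value in [0, BLOCKS).
                return Math.floorMod(key.hashCode(), BLOCKS);
            }
        }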
  61. Re: New Open Source Cache System

    I've been thinking about building a Coherence-like partitioned cache on top of JGroups
    Cool ! Maybe start with this (http://www.jgroups.org/javagroupsnew/docs/memcached.html). It is a partitioned cache, but has no redundancy built in for failover.
  62. Memcache comparison

    It would be nice to produce a performance comparison between sccache and memcached. (It is clear that sccache has more features, but most of them are not necessary for most web applications.) One more question which was not clear from the wiki: are you using a consistent hashing algorithm for deciding which object you put on which server? (That would minimize the number of objects that have to be re-cached when a server dies.)
  63. Re: Memcache comparison

    Hey! A question about sccache! Excellent. By default, it uses the hash of the key. But the hashing algorithm is abstracted into a pluggable class, so you can substitute any algorithm you like.
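    For readers unfamiliar with the consistent hashing idea raised in the question, here is a compact sketch of a hash ring (a hypothetical illustration, not sccache's pluggable class):

        import java.util.SortedMap;
        import java.util.TreeMap;

        // Minimal consistent-hash ring: each server owns the arc of hash space
        // up to its position, so removing a server only remaps that one arc
        // instead of reshuffling every key.
        public class ConsistentHashRing {
            private final TreeMap<Integer, String> ring = new TreeMap<>();

            public void addServer(String server) {
                ring.put(server.hashCode(), server);
            }

            public void removeServer(String server) {
                ring.remove(server.hashCode());
            }

            public String serverFor(String key) {
                if (ring.isEmpty()) throw new IllegalStateException("no servers");
                // First server at or after the key's hash, wrapping around the ring.
                SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
                return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
            }
        }

    A production ring would use a stronger hash than hashCode() and many virtual nodes per server to balance the load evenly.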
  64. Re: Memcache comparison

    It is clear that sccache has more features, but most of them are not necessary for most web applications.
    I think some of the feature differences are very important. sccache supports keys and data of any size, you can iterate over the keys, and it automatically deletes stale objects.
  65. FYI - new version released

    FYI - there was a bug in the storage code. It's now fixed. Issue: http://code.google.com/p/sccache/issues/detail?id=1&can=1 Fix: http://sccache.googlecode.com/files/sccache_0_4.zip
  66. FYI - a bug fix patch has been released
  67. FYI - one more bug fix patch is available
  68. A new patch is available: http://code.google.com/p/sccache/
  69. I am new to open source. At a glance, it seems it cannot flush other local caches if I update an element that exists in multiple JVMs.
  70. Yes - it does flush cached objects in other VMs. It's not transactional, but the other VMs' local caches will periodically check to see if a newer version of the object exists. BTW - I just posted a new patch.
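    A rough sketch of that periodic re-check pattern (an editorial illustration of the idea, not sccache's actual code; SharedCache is a hypothetical interface):

        import java.util.concurrent.ConcurrentHashMap;

        // A local entry is trusted for CHECK_INTERVAL_MS; after that its version
        // is compared against the shared tier before being served again.
        public class RevalidatingLocalCache {
            static final long CHECK_INTERVAL_MS = 5_000;

            static final class Entry {
                final Object value;
                final long version;
                volatile long lastChecked;
                Entry(Object value, long version) {
                    this.value = value;
                    this.version = version;
                    this.lastChecked = System.currentTimeMillis();
                }
            }

            interface SharedCache {
                long latestVersion(String key);
                Entry fetch(String key);
            }

            private final ConcurrentHashMap<String, Entry> local = new ConcurrentHashMap<>();
            private final SharedCache shared;

            RevalidatingLocalCache(SharedCache shared) { this.shared = shared; }

            Object get(String key) {
                Entry e = local.get(key);
                long now = System.currentTimeMillis();
                if (e != null && now - e.lastChecked < CHECK_INTERVAL_MS) {
                    return e.value; // still within the trust window
                }
                if (e != null && shared.latestVersion(key) == e.version) {
                    e.lastChecked = now; // unchanged remotely: keep the local copy
                    return e.value;
                }
                Entry fresh = shared.fetch(key); // stale or absent: reload
                if (fresh != null) local.put(key, fresh); else local.remove(key);
                return fresh != null ? fresh.value : null;
            }
        }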