Cameron Purdy: Defining a Data Grid

  1. Cameron Purdy: Defining a Data Grid (18 messages)

    Cameron Purdy has posted "Defining a Data Grid" on his blog, saying that "most people know what a database is, but not everyone knows what an In-Memory Data Grid is, so we set out to define it in a concise manner..." Of course, this focuses on Oracle Coherence, since Cameron is a VP at Oracle for that product, but it's worth looking into even if you're not using Coherence, for a few reasons. First, maybe you should be using Coherence! Second, it might show you some concepts that your current data caching solution should provide. Third, it may illustrate some uses that your architecture isn't taking advantage of. The first few definitions:
    • A Data Grid is a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment.
    • An In-Memory Data Grid is a Data Grid that stores the information in memory in order to achieve very high performance, and uses redundancy – by keeping copies of that information synchronized across multiple servers – in order to ensure the resiliency of the system and the availability of the data in the event of server failure.
    • The application objects are the actual components of the application that contain information that is shared across multiple servers and that must survive server failure in order for the application to be continuously available. These objects are typically built in an object-oriented language such as Java (e.g. POJOs), C++, C#, VB.NET or Ruby. Unlike a relational schema, application objects are often hierarchical in nature, and may contain information that is pulled from many database tables.
    • The application objects must be shared across multiple servers because a middleware application (such as eBay and Amazon.com) is horizontally scaled by adding servers, with each server running an instance of that application. Since the application instance running on one server may read and write some of the same information that an application instance running on another server reads and writes, that information must be shared. The alternative is to always access that information from a shared resource, such as a database, which will lower performance by requiring both remote coordinated access and Object/Relational Mapping (ORM), and decrease scalability by making that shared resource a bottleneck.
    • Because an application object is not relational, in order to retrieve it from a relational database the information must be mapped from a relational query into the object; this is known as Object/Relational Mapping (ORM). Examples include Java’s EJB 3.0 and JPA, and ADO.NET. The same technology allows that object to be stored in a relational database by deconstructing the object (or changes to the object) into a series of SQL inserts, updates and deletes. Since a single object may be composed of information from many tables, the cost of accessing objects from a database using Object/Relational Mapping can be significant, both in terms of the load on the database and the latency of the data access.
    • An In-Memory Data Grid achieves low response times for data access by keeping the information in-memory and in the application object form, and by sharing that information across multiple servers. In other words, applications may be able to access the information that they require without any network communication and without any data transformation step such as ORM. In cases where network communication is required, Oracle Coherence avoids introducing a Single Point of Bottleneck (SPOB) by partitioning – spreading out – information across the grid, with each server being responsible for managing its own fair share of the total set of information.
    Of course, there's more, but the goal isn't to replicate the blog post in toto. What do you think of Cameron's definition? Is it too focused on Coherence to be useful elsewhere?
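
    To make the partitioning and redundancy definitions above concrete, here is a minimal single-process sketch in Java. It is not Coherence's API or its actual algorithm: the class `PartitionedGridSketch`, its method names, the modulo placement rule, and the single-backup policy are all assumptions made for illustration. Each simulated server is just an in-memory map; a key hashes to a partition, the partition maps to a primary server, and a copy is kept on the next server so that a read still succeeds after the primary fails.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a partitioned in-memory grid with one backup copy per entry (illustrative only). */
class PartitionedGridSketch {
    private final int partitionCount;
    // One in-memory map per simulated server; a null slot marks a failed server.
    private final List<Map<String, Object>> servers = new ArrayList<>();

    PartitionedGridSketch(int serverCount, int partitionCount) {
        if (serverCount < 2) {
            throw new IllegalArgumentException("need at least 2 servers to hold a backup");
        }
        this.partitionCount = partitionCount;
        for (int i = 0; i < serverCount; i++) {
            servers.add(new HashMap<>());
        }
    }

    /** Hashes a key to a partition; floorMod keeps the result non-negative. */
    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Hypothetical placement rule: partitions are spread round-robin over servers. */
    int primaryOf(String key) {
        return partitionOf(key) % servers.size();
    }

    /** The backup lives on the next server, so it never shares a server with the primary. */
    int backupOf(String key) {
        return (primaryOf(key) + 1) % servers.size();
    }

    /** Writes the object to its primary and its backup, keeping the two copies in sync. */
    void put(String key, Object value) {
        write(primaryOf(key), key, value);
        write(backupOf(key), key, value);
    }

    /** Reads from the primary, falling back to the backup if the primary has failed. */
    Object get(String key) {
        Map<String, Object> primary = servers.get(primaryOf(key));
        if (primary != null) {
            return primary.get(key);
        }
        Map<String, Object> backup = servers.get(backupOf(key));
        return backup == null ? null : backup.get(key);
    }

    /** Simulates a server failure by discarding everything it held in memory. */
    void failServer(int index) {
        servers.set(index, null);
    }

    private void write(int server, String key, Object value) {
        Map<String, Object> store = servers.get(server);
        if (store != null) {
            store.put(key, value);
        }
    }

    public static void main(String[] args) {
        PartitionedGridSketch grid = new PartitionedGridSketch(3, 257);
        grid.put("order-42", "shipped");
        grid.failServer(grid.primaryOf("order-42"));
        System.out.println(grid.get("order-42")); // prints "shipped", served from the backup
    }
}
```

    A real product would also promote the backup to primary and create a fresh backup elsewhere after a failure; the sketch leaves that out to keep the two ideas (partitioning and redundancy) visible.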

    Threaded Messages (18)

  2. I think it's a good job of describing most products including ObjectGrid. The vendors need to start educating the market using common concepts etc. Nati and I were talking about this at the Spring conference. A standard set of terms would be very useful and benefit everyone. The best definition of data grids for me is still the hstore paper which describes architecturally what kind of application datagrids can support well. http://www.devwebsphere.com/devwebsphere/2007/12/hstore-at-mit-v.html
  3. Nice post by Cameron and I agree with Billy. All of the data grid products -- ObjectGrid, Coherence, GigaSpaces and others -- would gain from broader awareness and understanding of the notion of in-memory data grids. It's remarkable how a very rudimentary solution such as memcached has caught on so fast because it had some high-profile web sites using it. But there really are much better solutions out there, such as the products I mention above. Billy -- in the "old days" we may have talked about creating a standard, but I think you are right that given things such as Spring, what's really important now is common terminology and just getting the word out. Data Grid Alliance, anyone? :-) Geva Perry GigaSpaces
  4. I'd like to hear what Java-based data grids have over memcached. What (common) features does Coherence/Gigaspaces/etc have that memcached does not? Thanks, Steve
  5. :-) /me sits down comfortably to watch what will happen inevitably...
  6. I'd like to hear what Java-based data grids have over memcached. What (common) features does Coherence/Gigaspaces/etc have that memcached does not?
    After looking at memcached for several days (admittedly only the Java APIs provided), I can say that one of the most distinguishing factors would be Documentation and Flexibility of Deployment and Caching Protocols. But the lack of documentation in memcached is really striking. Best, Dmitriy Setrakyan GridGain - Grid Computing Made Simple
  7. After looking at memcached for several days (admittedly only the Java APIs provided), I can say that one of the most distinguishing factors would be Documentation and Flexibility of Deployment and Caching Protocols.

    But the lack of documentation in memcached is really striking.
    We've had some major sites (MySpace, Ning, ..) switching from Memcached to Coherence recently, and the reasons given always seem to be the same: quality, performance, scalability, dynamic partitioning, ease of configuration, and information reliability. Peace, Cameron Purdy Oracle Coherence: The Java Data Grid
  8. A Data Grid is a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment.
    I don’t think that data grids manage anything beyond data, which is exactly what they are supposed to do. In essence, data grids are distributed caches for RDBMSs. That’s what they evolved from (Coherence, GigaSpaces). Computations are managed by a different type of grid – compute grids (which have little to do with data grids). As for the overall definition, I do like it too. If anything, I think it’s a bit too generic and certainly doesn’t have any bias towards Coherence (which I wouldn’t mind). My 2 cents, Nikita Ivanov. GridGain - Grid Computing Made Simple

  9. A Data Grid is a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment.

    I don’t think that data grids manage anything beyond data, which is exactly what they are supposed to do. In essence, data grids are distributed caches for RDBMSs. That’s what they evolved from (Coherence, GigaSpaces). Computations are managed by a different type of grid – compute grids (which have little to do with data grids).

    As for the overall definition, I do like it too. If anything, I think it’s a bit too generic and certainly doesn’t have any bias towards Coherence (which I wouldn’t mind).

    My 2 cents,
    Nikita Ivanov.
    GridGain - Grid Computing Made Simple
    Mmm, I wonder why you make this statement.... Let me venture a wild guess: to distinguish GridGain, which is just a compute grid, from GigaSpaces, Coherence, and ObjectGrid. It is ridiculous to assume that GigaSpaces and Coherence succeed so much at what they do by just being a data grid. GigaSpaces grew out of the JavaSpaces model, with one of its prime design patterns being Master/Worker, which is all about compute grid. It is very simple to build map reduce on top of the Master/Worker design pattern, and it inherently supports advanced compute grid concepts such as work stealing. Now, the fun part (which is when we watch our application scale) is the combination of compute and data grid, which allows you, for example, to collocate work and data. This is similar to what Hadoop does in its implementation of Map Reduce (though through the narrow view of just Map Reduce and HDFS), and it is the bread and butter for GigaSpaces (as well as Coherence). Cheers, Shay Banon Compass and GigaSpaces
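
    For readers unfamiliar with the Master/Worker pattern mentioned above, a rough sketch using only `java.util.concurrent` follows. It is not the JavaSpaces or GigaSpaces API: a `BlockingQueue` stands in for the shared space, and `MasterWorkerSketch` and `sumOfSquares` are hypothetical names. The master splits a job into chunks and writes them to the task queue; workers take whatever task is available next, so an idle worker naturally picks up remaining work; the master then combines the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Master/Worker sketch: BlockingQueues stand in for the shared space (illustrative only). */
class MasterWorkerSketch {
    // Sentinel task that tells a worker to stop; compared by identity, never by content.
    private static final int[] POISON = new int[0];

    static long sumOfSquares(int[] data, int chunkSize, int workerCount) {
        BlockingQueue<int[]> tasks = new LinkedBlockingQueue<>();
        BlockingQueue<Long> results = new LinkedBlockingQueue<>();

        // Master: split the job into chunks and write them to the shared task queue.
        int taskCount = 0;
        for (int from = 0; from < data.length; from += chunkSize) {
            int to = Math.min(from + chunkSize, data.length);
            tasks.add(Arrays.copyOfRange(data, from, to));
            taskCount++;
        }
        // One stop marker per worker, queued behind the real tasks.
        for (int i = 0; i < workerCount; i++) {
            tasks.add(POISON);
        }

        // Workers: each repeatedly takes the next available task until it sees the marker.
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    for (int[] chunk = tasks.take(); chunk != POISON; chunk = tasks.take()) {
                        long partial = 0;
                        for (int v : chunk) {
                            partial += (long) v * v;
                        }
                        results.add(partial);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }

        // Master: collect exactly one partial result per task and combine them.
        try {
            long total = 0;
            for (int i = 0; i < taskCount; i++) {
                total += results.take();
            }
            for (Thread worker : workers) {
                worker.join();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumOfSquares(data, 3, 4)); // prints 385
    }
}
```

    In a grid the two queues would be a shared, fault-tolerant space and the workers would live on different machines; the split, take, and combine steps stay the same.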
  10. I think Nikita has a point to a degree. We're seeing customers using ObjectGrid to hold data for compute grid stuff built with grid products like DataSynapse, XD Compute Grid or home-grown stuff. They pull the data from the data grid because the data grid makes the data fault tolerant and network attachable, and is more cost-effective to deploy/scale than a DBMS or shared file system. Compute grid stuff is pretty different: jobs are scheduled, decomposed, suspended and resumed, which is very different from what a data grid typically does. This is indeed how we differentiate XD's compute grid and ObjectGrid components. They are complementary. Now, are there scenarios where it could all be done using just the data grid? Yes, but most scenarios are probably better served by the specialist product, using something like a data grid for intermediate storage or for holding data required by all the compute jobs.
  11. I think Nikita has a point to a degree. We're seeing customers using ObjectGrid to hold data for compute grid stuff built with grid products like DataSynapse, XD Compute Grid or home-grown stuff. They pull the data from the data grid because the data grid makes the data fault tolerant and network attachable, and is more cost-effective to deploy/scale than a DBMS or shared file system.

    Compute grid stuff is pretty different: jobs are scheduled, decomposed, suspended and resumed, which is very different from what a data grid typically does. This is indeed how we differentiate XD's compute grid and ObjectGrid components. They are complementary. Now, are there scenarios where it could all be done using just the data grid? Yes, but most scenarios are probably better served by the specialist product, using something like a data grid for intermediate storage or for holding data required by all the compute jobs.
    This is probably why I did not mention ObjectGrid :). As a user, I would like to use the same product for both (and add messaging just to make it more interesting). This is certainly not far-fetched, as it is implemented today with GigaSpaces. It is a similar thing with Hadoop or Google's Map Reduce implementation (just focused on the map reduce aspect). While just distributing jobs is one thing, the combination of it with HDFS or GFS is where it really shines. For example, with GigaSpaces, you can create a job (it can even be a scripting-language-based job) and execute it on your compute grid (with pure compute agents running). You get a Java Future back, with which you can cancel the job or wait for its result. You can also, very simply, split a job into several jobs, execute them on your compute grid, and join them back. The nice additional feature (which stems from the fact that you also work with a data grid) is that you can take a job and tell it to run where its relevant data resides in memory. You can even take a job, run it on all the active nodes, and reduce the results from it (for example, to compute the average object count that you have in your grid). Coherence has a subset of these features (at least to my knowledge) using their InvocableMap. I am sure that ObjectGrid has the basics to support a compute grid (and I agree that there is more work that needs to be done beyond the basics), though it is probably a bit difficult to market it since XD is around ;)
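
    The "run a job on every node and reduce the results" idea can be sketched with the standard `ExecutorService` and `Future` types. This is only an in-process approximation under stated assumptions, not GigaSpaces or Coherence code: each "node" is a plain list of the object keys it holds locally, one thread plays the part of each node, and `BroadcastReduceSketch` and `averageObjectCount` are made-up names. The job runs against each node's local data (so no data moves), and the caller reduces the per-node counts into the average object count from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Broadcast-and-reduce sketch: one job per simulated node, reduced by the caller (illustrative only). */
class BroadcastReduceSketch {

    /** Runs a count job next to each node's local data and reduces the counts to an average. */
    static double averageObjectCount(List<List<String>> nodes) {
        if (nodes.isEmpty()) {
            return 0.0;
        }
        // One thread per simulated node; in a real grid these would be remote processes.
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        try {
            // Map step: submit the same job to every node and keep the Futures.
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> localData : nodes) {
                Callable<Integer> job = localData::size; // the job only sees local data
                futures.add(pool.submit(job));
            }
            // Reduce step: wait for each Future and combine the partial results.
            long total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // future.cancel(true) would abandon the job instead
            }
            return (double) total / nodes.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a job failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<List<String>> nodes = List.of(
                List.of("a", "b", "c"),
                List.of("d"),
                List.of("e", "f"));
        System.out.println(averageObjectCount(nodes)); // prints 2.0
    }
}
```

    The point of the design is that only the small per-node result crosses the network, never the objects themselves.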
  12. To build on Billy's point... One important question to ask is "how do I bring my data and business logic within close proximity of each other?" Another equally important question is "how do I achieve this close proximity while ensuring I have full operational control of my system as well as integration with my existing enterprise assets?" Several customers we're working with view compute grids as the job management mechanism, whose purpose is to provide operational control of the jobs in execution and to integrate with existing infrastructure (enterprise job schedulers, existing auditing/archiving systems, etc.). The data grid side of the architecture, meanwhile, transparently ensures that data is as close to the business logic as possible while ensuring data integrity and availability. The following section goes into more detail on the ways XTP (and grid, HPC, and utility computing) have influenced the compute grid space: http://www.ibm.com/developerworks/websphere/techjournal/0804_antani/0804_antani.html?ca=drs-#xtp
  13. Let me venture a wild guess, to distinguish GridGain, which is just a compute grid from GigaSpaces, Coherence, and ObjectGrid.
    Hi Shay, I was just making the same point as Billy. Compute grids and data grids are very different, although they are almost always used together. Many data grids have some rudimentary capabilities for computational processing and, of course, nothing prevents you from doing everything manually - “Manual Sunset” is the saying we have in Russia for this type of engineering… :) As for GigaSpaces, it has some basic Master/Worker, but you quickly run into limits once you need to control scheduling or load balancing... plus, in lines of code and simplicity of doing computations, it does not even come near to what GridGain provides. Take a look at our screencast (Grid Application in 15 minutes) and see what GigaSpaces can do in 15 minutes... With GridGain we integrate with data grids (natively with JBoss Cache and Coherence, and less natively with GigaSpaces because it is not pluggable at all) so that you have freedom of choice in which data grid product you would like to use. Obviously we support affinity split (collocation of data and processing logic), which is an important point but only one item on a long list of capabilities you need for a full-stack grid computing platform. Best, Nikita Ivanov. GridGain - Grid Computing Made Simple
  14. I too agree that Cameron's definition is a good one, and applies to most clustered caching/data grid products out there. I don't think it's *too* generic though, as Nikita suggests. Anything further would take you into the realm of vendor-specific detail. Regarding the overlap with computational grids, this doesn't belong in a definition of data grids. It could be fair to say that Coherence is a data grid with *some* compute grid capabilities, but then you're describing a specific product. -- Manik Surtani www.jbosscache.org
  15. It could be fair to say that Coherence is a data grid with *some* compute grid capabilities, but then you're describing a specific product.
    That's a fair statement. "Compute grids" provide a lot more than the compute capabilities that we include. Our compute capabilities (which pre-date map/reduce etc.) are in most cases data-centric, and natural off-shoots of the data functionality that we provide. Peace, Cameron Purdy Oracle Coherence: Data Grid for Java and .NET
  16. I would like to second the opinions of others in this thread that the definition of Data Grid provided is very beneficial to the community. I just want to add that GridGain provides the deepest integration with Oracle Coherence, to the extent that GridGain totally reuses Coherence's cluster discovery and communication protocols and practically becomes as reliable as Coherence. With such integration, users can get advanced job scheduling, load balancing, zero deployment, failover of both logic and data, and many other features that GridGain provides, in addition to the state-of-the-art caching features provided by Oracle Coherence. You can check our Coherence integration, with the out-of-the-box SPIs provided, on our Wiki. Best, Dmitriy Setrakyan GridGain - Grid Computing Made Simple
  17. Re: Cameron Purdy: Defining a Data Grid

    I like the definitions and have passed on the link to this blog (which I saw a few days(?) ago). I am wondering if analytical product vendors, such as SAS, are [re]designing their products to work with Data Grids. General reporting tools such as BIRT can already do this through their ability to call any code behind the scenes.
  18. Re: Cameron Purdy: Defining a Data Grid

    Mark -- I can tell you from our experience that, yes, there are several BI/Analytics vendors who are incorporating data grids, as well as many home-grown analytical applications, especially in capital markets. Geva Perry GigaSpaces
  19. Re: Cameron Purdy: Defining a Data Grid

    Mark -- I can tell you from our experience that, yes, there are several BI/Analytics vendors who are incorporating data grids, as well as many home-grown analytical applications, especially in capital markets.

    Geva Perry
    GigaSpaces
    Geva, thanks for your reply. I think one of our developers talked to Nati at the Spring conference. We are looking to do something unique and get away from the standard ETL (move the piles) warehouse/analysis. We use SAS, and since our analysts know it, I am hoping SAS is working on this. If not, we will need to do something "home-grown". We have a "small" set of "data" to prototype with. We are in the health care arena.