Discussions

News: The new IT confusion: Grid and Utility Computing

  1. The new IT confusion: Grid and Utility Computing (29 messages)

    The new IT buzz seems to be around "grid" and "utility" computing. However, do most of the vendors, let alone the IT guys, understand what they mean?

    Philip Brittan, CEO of Droplets, weighs in and tells us what the vendors are doing. IBM and HP are focused on infrastructure tools, hardware and managed services. Oracle is providing grid-capable software and hosted business applications. Sun is focused on "virtualizing" hardware resources.

    Read: The new IT confusion

    How are you guys using, or being asked to look into Grid and Utility computing?

    Threaded Messages (29)

  2. Suitable domains

    I think use of Grid Computing(GC) is more suitable for scientific applications.
    That is why most of the usage scenarios I have encountered on the web belong to academic domains.
    That does not, of course, mean that GC is specific to academia.
    GC is perfect for computer graphics and 3D animation applications, which require heavy CPU power.
    As far as business applications are concerned, the only area where GC seems reasonable to apply, in my opinion, is the financial domain, where compute-intensive tasks such as what-if analysis for derivatives calculations are performed.
  3. Oracle has recently announced version 10g of their database. Along with this they're in the process of rebranding all their products (Application Server and JDeveloper) to take on this new '10g' label. In keynotes Larry Ellison distinguishes Oracle 10g(rid) as being an 'enterprise grid computing' architecture in contrast to the scientific purposes mentioned here.

    Ah well, I think it is all about jumping on the bandwagon of the 'on demand' scene. Although, Oracle's approach is more technical/architectural in nature than the one that has been chosen by IBM.
  4. Suitable domains

    I think use of Grid Computing(GC) is more suitable for scientific applications. That is why most of the usage scenarios I have encountered on the web belong to academic domains.

    I think the leading gridware, Globus, entirely lacks commercial use. Sun's Grid Engine has a diverse list of industrial success stories, ranging from special effects to aerial photography.
  5. Suitable domains

    Platform has a Globus commercial package with pretty serious commercial installs.

    http://www.platform.com/products/build.asp
  6. Suitable domains

    We've seen both Platform and DataSynapse in some of the grid accounts we've been in.

    As for scientific or academic .. nope, grid is already in use big-time for financial modeling. That's where the money spent on grid computing is / will be.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  7. Hey Guys

    As an architect with IBM I would be happy to answer any questions (to the best of my ability) about our On Demand strategy and what the real technical nitty gritty is. I'm not the official evangelist for this stuff, so I'm not out here with a bias to evangelize it, but I'm willing to help out if you have any questions.

    Just my 2 cents: don't forget about Autonomic computing. As a technophile, I think this stuff is really cool! Orchestrations that leverage SOA integration with built-in customisable Business Rules per customer (IBM calls these Policies). It's kind of a bit of SOA AI in a way. Autonomic can bring together Grid, SOAs and Utility computing to create something pretty cool.

    Simple Autonomic example. I have a PO Process with Amazon. My orchestration invokes Amazon's web service, but that particular service is down. I then invoke one of Amazon's Grid Services that launches an internal system process that launches a new process of their initial Web Service (i.e. it's back up), then my orchestration resubmits my original request and I get my response. That is pretty cool.

    Ok, I'm gonna stop before I get too carried away. This is what happens when I drink 2 cups of coffee before eating any breakfast.

    Cheers
    Steve Watt
    Software Architect, IBM
  8. Simple Autonomic example. I have a PO Process with Amazon. My orchestration invokes Amazon's web service, but that particular service is down. I then invoke one of Amazon's Grid Services that launches an internal system process that launches a new process of their initial Web Service (i.e. it's back up), then my orchestration resubmits my original request and I get my response.

    Yeah, I totally dig. Deployment on demand can also be applied to horizontal scaling. When a service becomes overloaded, another host is recruited on the fly. I also like IBM's Blueprint.
  9. Steve: Simple Autonomic example. I have a PO Process with Amazon. My orchestration invokes Amazon's web service, but that particular service is down. I then invoke one of Amazon's Grid Services that launches an internal system process that launches a new process of their initial Web Service (i.e. it's back up), then my orchestration resubmits my original request and I get my response. That is pretty cool.

    Uh, that's just not cool. It shouldn't be down in the first place. No matter what. That's what clustering and grid computing and geographical failover can provide to start with. Not web services that external companies use to manage someone else's internal systems. Ouch.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  10. Hey Cameron

    I totally understand what you're saying. I was just using it as an example to illustrate how autonomic computing can use orchestrations to act on the responses received from both Grid and Web Services and other applications to do a bit of its own thinking.

    For the benefit of the group, can you give us a scenario which includes Grid, Web Services, Clustering and Failover? I think it would spark a cool discussion on some of the better practices available for using this technology.
  11. For the benefit of the group, can you give us a scenario which includes Grid, Web Services, Clustering and Failover?

    Web service and the grid are synonymous in Globus -- all Globus components are deployed as web services. Failover is absent in Globus, and possible only with a smart stub, as per your earlier example. I disagree with Cameron's complaint that a smart stub isn't part of autonomic computing. An autonomic cluster may be self-healing, but some hiccups necessarily percolate to the client. A smart stub is essential to resubmitting an idempotent request, à la WebLogic.
  12. Brian: Web service and the grid are synonymous in Globus -- all Globus components are deployed as web services. Failover is absent in Globus, and possible only with a smart stub, as per your earlier example.

    Isn't Platform somehow based on Globus? (Am I showing my ignorance again? ;-) I haven't had a chance yet to play with these things at the low level, so I'm still more at the buzzword level on some of it.

    Brian: I disagree with Cameron's complaint that a smart stub isn't part of autonomic computing. An autonomic cluster may be self-healing, but some hiccups necessarily percolate to the client. A smart stub is essential to resubmitting an idempotent request, à la WebLogic.

    Well, for EJB stubs, maybe. For Web Services, though, I would expect a stateless load balancer in front of an application server cluster, be it a J2EE cluster or whatever. For geographic load balancing, and more importantly failover, you use global load balancers and an expensive dynamic DNS option from your telco. I don't know all the details, having never personally set one up, but I once stayed at a Holiday Inn Express ;-). What I'm saying is that for all the "inbound receiver" servers for Web Services, it should be "impossible" for them all to be down at once (save an act of God), and in terms of the "stub", it is something written by the client, not by the provider. In other words, you're not able to dictate the client-side code for public Web Services.

    Widespread use of public Web Services is a "relatively new thing (tm)" but with so much money pumped into it in the past five years, I'd expect it to grow rather quickly. Only maybe 1% or so of these services need 100% uptime and can afford the cost of geographic distribution and/or hot site failover (ludicrously expensive to do, from what I have seen.) The question (in my mind, anyway) is whether that is 1% of 10 thousand or 1% of 1 million.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  13. Well, for EJB stubs, maybe. For Web Services, though, I would expect a stateless load balancer in front of an application server cluster, be it a J2EE cluster or whatever.

    Yes, a load balancer guarantees a new request will be routed to a healthy server. But it doesn't help if the server subsequently crashes with the transaction in progress. Only a smart stub helps that.
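
For readers unfamiliar with the term, here is a minimal sketch of what a client-side "smart stub" might look like. It is not WebLogic's or Globus's actual API: the class name `SmartStub` and the idea of modelling each server as a `Function` are assumptions made purely for illustration. The stub simply resubmits the same request to the next endpoint when one fails, which is only safe if the request is idempotent.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical client-side "smart stub": resubmits a request to the next endpoint on failure. */
final class SmartStub<Q, R> {
    // Each function stands in for one remote server (an assumption for this sketch).
    private final List<Function<Q, R>> endpoints;

    SmartStub(List<Function<Q, R>> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    /** Tries each endpoint in order; only safe when the request is idempotent. */
    R invoke(Q request) {
        RuntimeException last = null;
        for (Function<Q, R> endpoint : endpoints) {
            try {
                return endpoint.apply(request);
            } catch (RuntimeException e) {
                last = e; // server crashed or was unreachable: try the next one
            }
        }
        throw new IllegalStateException("all endpoints failed", last);
    }
}

final class SmartStubDemo {
    public static void main(String[] args) {
        Function<String, String> crashed = req -> { throw new RuntimeException("connection reset"); };
        Function<String, String> healthy = req -> "processed " + req;
        SmartStub<String, String> stub = new SmartStub<>(List.of(crashed, healthy));
        System.out.println(stub.invoke("PO-1")); // prints "processed PO-1"
    }
}
```

A load balancer covers the choice of server for a new request; the loop above covers the case raised here, where the chosen server dies after the request was already handed to it.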
  14. Brian: Yes, a load balancer guarantees a new request will be routed to a healthy server. But it doesn't help if the server subsequently crashes with the transaction in progress. Only a smart stub helps that.

    Hmm. That's a good concern to consider. I'm not sure about the technical details though. SOAP requires an HTTP request and a response IIRC. If the response doesn't come back and an I/O error occurs (e.g. the socket is closed), the request should be resent, I would expect. That shouldn't require a smart stub, just a compliant HTTP/SOAP client library. Obviously, with Web Services, you want some way to do idempotent processing. That means that regardless of how many times it is submitted, the request is only processed once. Also, the first response back saying "yes I got it" should only occur after the SOAP request has been made cluster-durable to survive node failure.
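
The "processed only once" idea can be sketched on the receiving side as follows. This is an illustration under stated assumptions, not any SOAP toolkit's API: the class `IdempotentReceiver` and the client-supplied request ID are hypothetical, and the in-memory map stands in for the cluster-durable store that would be needed in practice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical receiver that runs the handler at most once per request ID. */
final class IdempotentReceiver {
    // Assumption: a real cluster would replicate or persist this map before
    // acknowledging; here it is plain in-memory state.
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final Function<String, String> handler;

    IdempotentReceiver(Function<String, String> handler) {
        this.handler = handler;
    }

    /** Runs the handler the first time requestId is seen; replays the stored result on resends. */
    String submit(String requestId, String payload) {
        return results.computeIfAbsent(requestId, id -> handler.apply(payload));
    }
}

final class IdempotentReceiverDemo {
    public static void main(String[] args) {
        AtomicInteger executions = new AtomicInteger();
        IdempotentReceiver receiver = new IdempotentReceiver(payload -> {
            executions.incrementAndGet();
            return "accepted " + payload;
        });
        receiver.submit("req-42", "PO for 3 books");
        receiver.submit("req-42", "PO for 3 books"); // resend after a lost response
        System.out.println(executions.get()); // prints 1
    }
}
```

With something like this on the server, a plain client library can resend blindly after an I/O error, because the duplicate is absorbed rather than processed twice.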

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  15. Can you guys go into a bit of detail about what a smart stub is?
  16. I thought that the whole idea of a Grid Service was transparency. From a client's perspective, connect to the service endpoint and attach it to a task. If the endpoint fails, just reconnect. The next on demand endpoint that meets QoS requirements is provisioned.

    The smart stub that you're speaking of should really be part of the endpoint provisioning service, shouldn't it?

    The thing that I'm having problems grasping is the idempotency issue in a Grid. In a cluster, if you reconnect, the TM has heuristics to assure idempotency, but in a grid, since there may be multiple TMs operating in parallel, I can't see how it's dealt with. Or are we talking about exotic n-phase commit stuff?

    - Frank Bolander
  17. The smart stub that you're speaking of should really be part of the endpoint provisioning service, shouldn't it?

    Resumption is the kind of coordination problem that tends to occur with Message Oriented Middleware (MOM) and is solved by the combination of an integration broker and a durable queue. Indeed I see now, IBM's Autonomic Blueprint specifies an asynchronous "messaging bus". I'm beginning to agree with you that a smart stub might be unnecessary for reliable use of an autonomic cluster. Maybe MOM is the right metaphor for autonomic processing.
  18. The smart stub that you're speaking of should really be part of the endpoint provisioning service, shouldn't it?

    >
    > Resumption is the kind of coordination problem that tends to occur with Message Oriented Middleware (MOM) and is solved by the combination of an integration broker and a durable queue. Indeed I see now, IBM's Autonomic Blueprint specifies an asynchronous "messaging bus". I'm beginning to agree with you that a smart stub might be unnecessary for reliable use of an autonomic cluster. Maybe MOM is the right metaphor for autonomic processing.

    Actually, I like the JXTA idea in your previous post -- sort of an IM for grid services. A service endpoint could create a peer group for the service and then have various nodes "chat" on the channel to fulfill the request. The peer fabric would handle all the QoS, replication and presence functions of the agents.

    - Frank Bolander
  19. A service endpoint could create a peer group for the service and then have various nodes "chat" on the channel to fulfill the request.

    When you say 'channel', presumably you mean a JXTA 'pipe'. Does JXTA have a durable pipe that retains its payload if there is a crash? Or are persistence concerns pushed to a higher level, perhaps some other standard JXTA service, or perhaps all the way out to a smart client? JXTA looks great for shuffling transient messages around, but there's more to reliable workflow than exchanges.

    Anyway, a trivial grid has one machine. Also there's the client's machine, which isn't part of the grid. So traditional client-server is a trivial grid use case. In JXTA the client is called an 'edge peer', and the server would be a 'rendezvous peer'. If the rendezvous peer is rebooted while running a job, does JXTA have any facility for restarting the job? No. Pending jobs need to be restarted from a durable list on the grid. That's how MOM or BPEL work. Or the client needs to manage the restart, which is what I originally insisted.
  20. When you say 'channel', presumably you mean a JXTA 'pipe'.


    Yes, I mixed my JavaGroups nomenclature with JXTA. I was referring to pipes. Sorry it caused you so much confusion.

    > Anyway, a trivial grid has one machine. Also there's the client's machine, which isn't part of the grid. So traditional client-server is a trivial grid use case. In JXTA the client is called an 'edge peer', and the server would be a 'rendezvous peer'. If the rendezvous peer is rebooted while running a job, does JXTA have any facility for restarting the job? No. Pending jobs need to be restarted from a durable list on the grid.

    First, I'm not disputing that MOM has a place in fulfilling a service workflow. The discussion, I believe, is whether a smart stub is required on the client side or whether it could be located on the grid service side to increase transparency of the grid.

    There is nothing to say that a Grid Service can't be a client "edge peer" to the rest of the grid fabric as a "rendezvous peer". In fact this and other issues you raise above are being researched using JXTA (I don't know if it's feasible but it sounds promising) via replication and messaging.

    http://www.jxta.org/project/www/docs/mdejxta-paper.pdf

    > That's how MOM or BPEL work. Or the client needs to manage the restart, which is what I originally insisted.

    Since we're dealing with Web Services over a Grid, I presume you mean "BPEL4WS" not "BPEL". Anyway, it's the grid service's job to make sure it fulfills the client request. Using your MOM insistence, why would a client need to manage the restart? Push a message on the bus, wait for the reply queue. The Grid Service should handle everything after that or at a minimum the MOM server should handle any retries.

    The tone of this thread is getting nasty. I mean no disrespect. In fact, this is one of the better discussions I've seen on this board, and you've contributed a lot of cool food for thought. I'm involved in these same discussions on real projects, so if I'm coming off as confrontational, it's not personal, it's just an extension of the design group discussions (smart stubs were a sticky subject). Everybody is new to Grid and it's up to us as engineers to address the hype.
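
The "push a message on the bus, wait for the reply queue" flow can be sketched with two in-memory queues. Everything here is an assumption for illustration: `MiniBus` is a hypothetical stand-in for a real MOM product, there is no durability, and with a single client no correlation ID is needed. The point it tries to show is only that the retry loop lives on the service side, so the client code stays trivial.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical in-memory stand-in for a message bus with a request queue and a reply queue. */
final class MiniBus {
    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> replies = new LinkedBlockingQueue<>();

    /** Service side: a daemon worker drains the request queue and retries failed work itself. */
    void startWorker(Function<String, String> service, int maxAttempts) {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String request = requests.take();
                    replies.put(attempt(service, request, maxAttempts));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly when interrupted
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private static String attempt(Function<String, String> service, String request, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return service.apply(request);
            } catch (RuntimeException e) {
                // the retry happens here, on the service side; the client never sees it
            }
        }
        return "FAILED " + request;
    }

    /** Client side: push one message, then block on the reply queue (null on timeout). */
    String call(String request, long timeoutMillis) {
        try {
            requests.put(request);
            return replies.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reply", e);
        }
    }
}

final class MiniBusDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> flaky = req -> {
            if (calls.incrementAndGet() < 3) throw new RuntimeException("node died");
            return "done " + req;
        };
        MiniBus bus = new MiniBus();
        bus.startWorker(flaky, 5);
        System.out.println(bus.call("PO-7", 2000)); // prints "done PO-7"
    }
}
```

What the sketch leaves out is exactly what the thread is arguing about: if the queues themselves are lost in a crash, someone still has to resubmit, which is why a durable queue or a client-side restart comes up.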
  21. http://www.jxta.org/project/www/docs/mdejxta-paper.pdf

    That paper is awesome! It convinced me that redundant peers can orchestrate job workflow without the need for a durable queue and MOM broker. And I see that JXTA.org intends to integrate the job architecture described in the paper with Globus, an open source UNIX grid. Globus.org has its own similar effort to merge JXTA and Globus for job submission.
  22. SOAP requires an HTTP request and a response IIRC. If the response doesn't come back and an I/O error occurs (e.g. the socket is closed), the request should be resent, I would expect. That shouldn't require a smart stub, just a compliant HTTP/SOAP client library.

    The SOAP specification doesn't mention retransmission. Does anyone know if Axis stubs retransmit? But even if messaging were reliable, the coordination you're suggesting still seems inadequate. Presumably you intend the server not to fully answer until the transaction is complete. I.e., you seem to want the socket connection kept alive for the life of the transaction, and that might not be viable on the extranet. A client's firewall might forbid HTTP streaming, which precludes long-lasting connections. Polling (such as with JXTA) is the only reliable way to keep client and server communicating. When the server crashes, it wouldn't be the business request that goes unanswered, but rather some subsequent poll request. This most definitely reaches into the realm of smart stubs.
  23. Brian: The SOAP specification doesn't mention retransmission.

    That is correct, but the request has to get through. The client has to write the last byte, then it turns around and reads a response to make sure that there is no HTTP error code, for example a "service has been temporarily moved" error (???), a "not found" error (404) or an "internal server error" (500).

    Brian: But even if messaging were reliable, the coordination you're suggesting still seems inadequate. Presumably you intend the server not to fully answer until the transaction is complete. I.e., you seem to want the socket connection kept alive for the life of the transaction ...

    No, absolutely not! I'm just talking about the SOAP request. There is no such thing as a request without some form of response, IIRC. The response isn't the "answer" (the transaction result) but is rather something that says "yes, I got the request".

    Brian: Polling (such as with JXTA) is the only reliable way to keep client and server communicating. When the server crashes, it wouldn't be the business request that goes unanswered, but rather some subsequent poll request. This most definitely reaches into the realm of smart stubs.

    That is one way to accomplish it. I am not a Web Services expert. I think there are a couple of ways that I have seen Web Services request/response (higher level, not the low-level HTTP stuff) be implemented, such as the "server" calling back an "answer" Web Service on the "client", or the "client" getting a "job id" that it goes back to poll the server with. That is more in the realm of BPI than SOAP itself (AFAIK).
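
The "job id" variant mentioned above can be sketched like this. It is an in-process illustration only: `JobService`, `submit` and `poll` are hypothetical names, and a real deployment would expose them as two separate Web Service operations, with the job table kept somewhere that survives a node failure.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical service: submit() answers at once with a job id, poll() fetches the result later. */
final class JobService {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // do not keep the JVM alive for the sketch
        return t;
    });
    private final Map<String, Future<String>> jobs = new ConcurrentHashMap<>();

    /** The immediate reply is only "yes, I got the request", in the form of a job id. */
    String submit(String payload, Function<String, String> work) {
        String jobId = UUID.randomUUID().toString();
        Callable<String> task = () -> work.apply(payload);
        jobs.put(jobId, pool.submit(task));
        return jobId;
    }

    /** Empty while the job is unknown or still running; the answer (or a failure note) once done. */
    Optional<String> poll(String jobId) {
        Future<String> future = jobs.get(jobId);
        if (future == null || !future.isDone()) {
            return Optional.empty();
        }
        try {
            return Optional.of(future.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.of("FAILED: " + e.getCause());
        }
    }
}

final class JobServiceDemo {
    public static void main(String[] args) throws InterruptedException {
        JobService service = new JobService();
        String jobId = service.submit("rice genome", p -> "result for " + p);
        Optional<String> answer = service.poll(jobId);
        while (!answer.isPresent()) {   // each poll is a short, separate request
            Thread.sleep(50);
            answer = service.poll(jobId);
        }
        System.out.println(answer.get()); // prints "result for rice genome"
    }
}
```

No connection stays open for the life of the transaction here; each call is a short request with an immediate response, which fits the firewall concern raised earlier in the thread.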

    However, like I said, this is an area in which I have little-to-no practical experience.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  24. I forgot to mention that a combination of smart stub and Globus's Monitoring and Discovery Service could give failover entirely in software, without the expense or cluster locality constraint of a load balancer. It should be possible for an autonomic grid to span organizations. A rice genome service could be autonomic without needing any particular hardware topology -- just a scattering of universities around the globe willing to host the replicated service. The smart stub could be especially important since a genomic computation may run for hours. Automatic resubmittal by the client might be the only way to ensure an academic researcher has his results when he returns to his office in the morning. The academic researcher can't afford a load balanced cluster, and he can always manually resubmit a failed request. But the grid could easily accommodate him automatically, without a fee, and without introducing service-reliability uncertainty into his project plan.
  25. Steve: I totally understand what you're saying. I was just using it as an example to illustrate how autonomic computing can use orchestrations to act on the responses received from both Grid and Web Services and other applications to do a bit of its own thinking.

    I apologize; I shouldn't have been so instantly critical.

    Steve: For the benefit of the group, can you give us a scenario which includes Grid, Web Services, Clustering and Failover? I think it would spark a cool discussion on some of the better practices available for using this technology.

    Telcos, Financial Services and "academic research" (e.g. weather modeling done by some universities and utilized by commercial sponsors) are probably about the only places where I would publicly predict seeing something this complex deployed within the next few years.

    Geographic failover is used by most big financial services companies now (banks, trading companies, etc.) to withstand the loss of a data center. It's an obvious requirement for many of our customers (most of which are also big customers of yours. ;-)

    Grid computing, in the sense that I am referring to, is used to provide a large amount of processing power, often a dynamic amount, to do certain types of calculations that you simply cannot perform in a reasonable time (or for a reasonable cost) on a single large computer. I'm talking about hundreds to thousands of processors, typically hung as 2-CPU blades in racks, and connected with redundant wide backplanes (typically custom in-box and fiber between.) In other words, cheap processors with expensive interconnects. ;-) Typically, if you are seeing grid with Web Services, you have to think of the web services as being a "give me the up to date solution" that is being calculated in the grid or "here is some additional data" to be processed by the grid. The latter could be a dispersed feed model, for example, where data is reported asynchronously from many external sources.

    Web Services are simply public APIs over HTTP(S) to request and/or submit data, or to drive processes. As such, if you have to accept and/or expose data among separate companies / organizations / business units / whatever, it is likely that Web Services will be used. I'm not a big gung-ho Web Services proponent, so I don't see it being used quite as much for internal integration, but I won't mind if I'm wrong on that guess either.

    Clustering (and failover) is used so that components, such as Web Services, appear to be available to the "client" regardless of the death (or planned downtime) of specific pieces of hardware.

    I doubt you'll see a whole bunch of applications using these all together, at least in the near future (since so much of grid computing is still custom), but it would look like an external API via Web Services, via a global traffic balancer / router, clustered for failover, pushing data into (or collecting results out of) a grid.

    I would not expect to see Web Services being served directly from a grid; Web Services are better served by a simple farm/cluster model. Grid is primarily useful for massive and scalable computing throughput.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Clustered JCache for Grid Computing!
  26. Java left behind without Infiniband

    Larry Ellison calls it a Grid, some hardware vendors are calling it “stateless server blades”. A cluster cabinet of 16, 32, 48 cpu boards uses 10 Gbits/sec Infiniband connections for all communication within the cabinet, saving the expense and complexity of providing Backplanes, Ethernet, Fibre Channel, and SCSI. The interconnect is so fast disk drives are then removed from the cpu board and all disks are shared.

    Current server blade cpu's must have their private disk pre-loaded, stateless blades don't need this, they use only shared disks and so can add cpu's to an application “on demand”. You put all your corporate Windows servers and production Linux servers in the same cabinet and dynamically use cpu's where needed.

    It's curious that announced plans to ship stateless blades next year don't mention Java. Infiniband could be the physical layer at the bottom of a TCP/IP protocol stack called from Java but the stack adds a huge amount of overhead to recover from dropped or disordered packets. Often a third of cpu time in J2EE applications is spent on TCP/IP overhead. A cabinet using dual star Infiniband connections for all internal communication won't need to deal with soft failures.

    If Java had direct access to Infiniband, connection overhead would improve by 2, maybe 3 orders of magnitude. Imagine JDBC calls and cluster cache synchronization with near zero latency; we wouldn't have to manage expectations about performance like we do now.

    Gary
  27. I don't think Infiniband is going anywhere. I think that 1Gb/10Gb ethernet with RDMA (remote memory-to-memory block transfers) and TOE (TCP-accelerated) cards is the technology that will actually be used in the end, and already is. You can buy 1Gb TOE cards today from Adaptec and Alacritech for 600-1000 bucks. <5% CPU overhead at 50MB/sec. That cost should come down and is already ahead of Infiniband or other such technologies. I see TCP/IP, iSCSI and RDMA technology on top of the Gb ethernet wagon rather than Infiniband.

    If the problem was just TCP/IP then TOE cards would solve that problem today. The main problem with most distributed Java applications is the overhead of serialization which can be around the 30-40% mark CPU wise as well as GC. Why have 10-20uS latency over the remote link when a GC can kill 300ms or more?

    So, the use of TOE cards eliminates your TCP stack overhead, but I think the challenge is to get Java to keep up with the potential throughput and latency offered by the new comm stacks that are either here now or on the near horizon. JSR-1 and NIO/AIO go some way towards that but we're not there yet.

    Billy
  28. I don't think Infiniband is going anywhere. I think that 1Gb/10Gb ethernet with RDMA (remote memory-to-memory block transfers) and TOE (TCP-accelerated) cards is the technology that will actually be used in the end, and already is.


    Walking around Oracle World in September, it seemed that every server blade vendor had a Topspin Infiniband unit on display. Investors got burned speculating on Infiniband stocks but that's their own fault. I want to see someone provide a Java API for Infiniband.

    10 Gb ethernet isn't ready yet. And TOE cards are more appropriate for 2-4 way servers.

    Gary
  29. The main problem with most distributed Java applications is the overhead of serialization which can be around the 30-40% mark CPU wise as well as GC.

    Occasionally SOAP or RMI suffer from graph serialization, but the worst culprit is JavaSpaces, where the goal is to share copies. And I suppose garbage collection would drag on graph serialization.
  30. ObjectWeb has developed a framework called ProActive. It is a Java RMI implementation that interfaces with Globus.

    http://www-sop.inria.fr/oasis/ProActive/

    Question: Is it possible that software development will change substantially to take advantage of GRID hardware and network innovations? Does GRID really mean new frameworks, APIs, app servers or even JVMs and languages?