64-bit Itanium II not much better than Xeons for serverside apps

  1. It turns out that running application servers on the latest 64-bit Itanium II systems may not offer a sufficient performance boost to justify the cost over existing 32-bit Xeon hardware. Just-in-time (JIT) compilers for both Java and .NET do not yet produce machine code that can take advantage of unique features in Itanium processors.

    Read Itanium stumbles on software code

    I found this article in Computer Weekly and thought I'd post it here to see what people think. Does anyone know whether Sun, BEA, etc. are addressing this?

    Threaded Messages (18)

  2. Itanium II rules

    Hi all,

    Itanium II just needs an optimized JVM. BUT nobody has enough interest in making one. Sun obviously has no interest, nor does Microsoft.

    HP could make one.

    MC
  3. Itanium II rules?

    The only benchmarks that really show the Itanium family "ruling" are SSL connection handling benchmarks. Since (for high-connection sites) that is mostly done by the DSP accelerator cards that plug into the hardware load balancers, I'm not sure where Itanium is going to shine.

    There was an interesting article a while back on Ace's about Java performance on various processors, including the new 64-bit AMD processor. Unfortunately, it did not include benchmarks on Itanium. The only Java benchmarks I've seen on Itanium were very poor, but you're right, that could be the early state of the JVMs.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!
  4. JRockit 8.1 supposedly now supports Itanium II. I'm not sure to what degree, but at least they seem to be moving in the right direction.
  5. Itanium II rules

    > Itanium II just needs an optimized JVM. BUT,
    > nobody has enough interest in making it.

    It isn't that simple. The IA64 concept is based on using extreme compile-time optimisation, plus a number of techniques that avoid the worst of the problems that arise when trying to optimise C/C++ and similar languages. The compile-time optimisation is based on computationally-expensive static code analysis, whereas Hotspot presumes computationally-cheap run-time dynamic analysis.

    For a number of years I have presumed, without evidence, that IA64 would turn out to be good for "number-crunching" applications (possibly where the inner loops are hand-coded), and possibly for hand-optimised conventional JIT JVM operations. I have been unconvinced about its benefits for general-purpose "server type" codes.

    Those presumptions are weakly supported by the remarkably few benchmarks that have been released for any of the IA64 family, e.g. SSL and the JVM benchmarks.
  6. Itanium II rules

    It would be nice to see how well the AMD Opteron based servers compare.
  7. EPIC compilers

    To put things into perspective with respect to compilers, consider the following.

    There was a computer named “Elbrus 3”, which had an EPIC architecture somewhat similar to the Itanium. Its highly optimizing Fortran compiler made more than 60 passes over the code representation to produce efficient code.

    Taking that into consideration, I do not think that dynamic compilation would be that much cheaper. What dynamic analysis might bring is information about the most frequently executed code, memory access patterns, and virtual function call substitution. After the dynamic analysis is performed, a JIT would still need to fall back on classical optimization and compilation techniques in order to produce efficient code.
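
    To make "virtual function call substitution" concrete, here is a minimal hand-written Java sketch of the kind of rewrite a profile-guided compiler could apply at a call site. The class and method names (GuardedInlineSketch, Shape, Square, Circle, areaVirtual, areaGuarded) are hypothetical and exist only for illustration; this mimics the idea in source code and is not how HotSpot, JRockit, or any particular JIT implements it.

```java
public class GuardedInlineSketch {
    interface Shape { double area(); }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // What the source code says: a virtual call resolved at run time.
    static double areaVirtual(Shape s) {
        return s.area();
    }

    // What a compiler could emit if (by assumption) the profile showed that
    // this call site almost always sees Square: a cheap type guard, the
    // inlined body on the fast path, and the virtual call as a fallback.
    static double areaGuarded(Shape s) {
        if (s.getClass() == Square.class) {
            Square q = (Square) s;
            return q.side * q.side;   // inlined body of Square.area()
        }
        return s.area();              // slow path keeps the result correct
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2.0), new Square(3.0), new Circle(1.0) };
        for (Shape s : shapes) {
            // Both paths must agree for every receiver type.
            System.out.println(areaVirtual(s) + " " + areaGuarded(s));
        }
    }
}
```

    Once the call is replaced by straight-line code, the classical passes (scheduling, register allocation) still have to run over it, which is the point made above.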


    Artem
  8. One main advantage of using Itanium over Pentium would be the amount of memory the JVM can use. That can enable much bigger caches or even solutions like object prevalence that could have a significant impact on overall application performance - much bigger than any raw processing speed gain.

     -- Krzysztof
  9. > One main advantage of using Itanium over Pentium would be amount of memory JVM can use. That can enable much bigger caches or even solutions like object prevalence...

    I don't understand. What's address space got to do with object prevalence?
  10. > I don't understand. What's address space got to do with object prevalence?


    Applicability of object prevalence is limited by the amount of memory. You can only put so many objects in the 1.6GB you can address on Pentium (at least on Windows). If you run JDK 1.4 on Itanium II and any 64-bit OS you can address (and thus use) much more memory (I'm not sure what the exact limit is, but in theory it could be on the order of 10^10 GB).
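
    The theoretical figures are easy to check. The sketch below, using hypothetical names (AddressSpaceSketch, addressableGiB), computes how many GiB a flat address space of a given width covers and prints the heap ceiling of whatever JVM runs it. It assumes a flat byte-addressed space; real processors and operating systems expose considerably less than the full 2^64 bytes.

```java
public class AddressSpaceSketch {
    // GiB (2^30 bytes) covered by a flat address space of the given width:
    // 2^bits bytes / 2^30 bytes per GiB = 2^(bits - 30) GiB.
    static long addressableGiB(int bits) {
        if (bits < 30 || bits > 92) {
            throw new IllegalArgumentException("bits must be in 30..92");
        }
        return 1L << (bits - 30);
    }

    public static void main(String[] args) {
        System.out.println("32-bit: " + addressableGiB(32) + " GiB");
        System.out.println("64-bit: " + addressableGiB(64) + " GiB");

        // Runtime.maxMemory() reports the most memory this JVM will try to use.
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1L << 20);
        System.out.println("This JVM's max heap: " + maxHeapMiB + " MiB");
    }
}
```

    A 32-bit space gives 4 GiB in total, of which the OS and the JVM itself take a share; a 64-bit space gives 2^34 GiB, roughly 1.7 x 10^10.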

    I personally believe that not very long from now we are going to buy computers with 100s of GB of RAM (at least at the enterprise level), and that this will significantly change many of the architectural patterns we use now.

     -- Krzysztof
  11. BEA JRockit, Itanium

    Has anybody tried BEA's JRockit on Intel's Itanium?

    http://news.com.com/2110-1001-984878.html

    http://edocs.bea.com/wljrockit/docs81/certif.html

    http://www.intel.com/ebusiness/affiliates/bea/index.htm
  12. The comparison is not valid

    The Computer Weekly article asserts that "Intel's latest 64-bit systems will give little or no advantage over 32-bit systems running application server software based on code from Java or Microsoft," citing SPECjAppServer2002 Java benchmark results for Itanium 2-based and Intel Xeon processor-based systems as the basis for this assessment.
     
    The comparison of a multi-node configuration with a dual-node configuration results in a skewed perspective of the two processors' performance and price/performance (in fact, such comparisons violate the SPEC Fair Use Rules; see http://www.spec.org/jAppServer2002/docs/RunRules.html#S3_5 for more information). There are good reasons why these types of comparisons across categories are not allowed. For one, the scaling characteristics of dual node configurations and multiple node configurations are different (scale out versus scale up). In addition, price/performance comparisons across categories are not really meaningful. One of the reasons why results are categorized is to group systems according to their maintenance and support requirement similarities, thus ensuring that price/performance comparisons are as fair as possible.

    Two other aspects make the comparison in the article invalid: First, the Itanium 2-based result the article referenced uses the older Itanium 2 processor 3M @ 1 GHz, not the newest Itanium 2 processor 6M @ 1.5 GHz (introduced last week). Second, the software stacks used by the two publications are completely different: WebLogic versus WebSphere, IBM JVM versus HP JVM and Windows versus HP-UX.

    Unfortunately, because SPECjAppServer2002 is a relatively new benchmark, and it is complex and expensive to run, there are a limited number of results posted as of today to use in fair comparisons. We hope this will change as we move forward.
     
    Alternatively, SPECjbb2000 is another Java benchmark that has been in use longer and has more results available for comparison (http://www.spec.org/jbb2000/results/jbb2000.html). Note that the Itanium 2 processor holds the highest four-processor performance, with a result of 116K ops/s (vs. 76K ops/s for an Intel Xeon processor MP-based system).
     
    Finally, the article also incorrectly asserted that "the EPIC architecture does not lend itself well to just-in-time compilers used for Java because these generate a continuous stream of instructions." In fact, just-in-time (JIT) compilers do have the ability to schedule instructions taking full advantage of the EPIC architecture. Moreover, JIT technology has the advantage of being able to dynamically generate code that is optimized for the actual run-time characteristics of the application, as the JIT has the ability to profile and select the hottest methods to optimize. While there is an added burden to the compilation process as compared with non-EPIC processors, there is nothing in the EPIC architecture that would prevent a JIT from scheduling instructions that take advantage of the exposed processor parallelism.

    Ricardo Morin
    Software and Solutions Group
    Intel Corporation
  13. The comparison is not valid

    Isn't benchmark(et)ing fun! There are just _so_ many holes
    that "the other side" can claim you have fallen into :)

    >In fact, just-in-time (JIT) compilers do have the ability to
    >schedule instructions taking full advantage of the EPIC
    >architecture. Moreover, JIT technology has the advantage
    >of being able to dynamically generate code that is optimized
    >for the actual run-time characteristics of the application,
    >as the JIT has the ability to profile and select the hottest
    >methods to optimize. While there is an added burden to the
    >compilation process as compared with non-EPIC processors,
    >there is nothing in the EPIC architecture that would prevent
    >a JIT from scheduling instructions that take advantage of
    >the exposed processor parallelism.
    > Ricardo Morin

    Are you distinguishing classic JIT technology from
    Sun's HotSpot technology in that respect?

    How would you characterise and quantify the extent
    of the "added burden"?
  14. The comparison is not valid

    Hi Tom:

    > Isn't benchmark(et)ing fun! There are just _so_ many holes
    > that "the other side" can claim you have fallen into :)
    >

    We need to make sure that all claims are supported with valid data.

    > Are you distinguishing classic JIT technology from
    > Sun's HotSpot technology in that respect?

    No, I was not specifically referring to Sun's HotSpot technology. Any modern JIT compiler needs to include some form of profiling to selectively optimize code. There is an interesting article here, which illustrates many of the tasks a JIT needs to accomplish (includes some discussion of JITing for Itanium):

    http://www.intel.com/technology/itj/2003/volume07issue01/art02_starjit/p01_abstract.htm

    (BTW, there are other articles on the same issue that may be of interest to this community: http://www.intel.com/technology/itj/2003/volume07issue01/ )
     
    > How would you characterise and quantify the extent
    > of the "added burden"?

    One of the key characteristics of the EPIC architecture is the shift from the hardware to the compiler (static or JIT) for exploiting instruction level parallelism. Itanium 2 is an in-order machine that allows multiple instructions to be issued in parallel, and it includes a number of explicit features that enable low level compiler optimizations, such as predication, speculation, and a large register set (among others). So there is more work for the compiler to do as compared with other architectures, but at the same time, the opportunities for improving performance over time through more advanced optimization techniques are much greater.

    This is another reason why it is important for the JIT to rely on profiling, as you do not want to waste time optimizing code that would not yield performance benefits (e.g. rarely executed).
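
    A minimal sketch of that selection step, assuming a simple per-method invocation counter: a method is handed to the expensive optimizer only once it has been entered often enough. The names (HotMethodSketch, HOT_THRESHOLD, recordInvocation, isHot) and the threshold value are hypothetical; production JITs use more elaborate heuristics (loop counters, sampling, several compilation tiers).

```java
import java.util.HashMap;
import java.util.Map;

public class HotMethodSketch {
    // Hypothetical cut-off: entries before a method counts as hot.
    static final int HOT_THRESHOLD = 1000;

    private final Map<String, Integer> invocations = new HashMap<>();

    // Called on every method entry. Returns true exactly once per method,
    // at the moment its count reaches the threshold, so the caller can
    // queue it for optimized compilation a single time.
    boolean recordInvocation(String method) {
        int count = invocations.merge(method, 1, Integer::sum);
        return count == HOT_THRESHOLD;
    }

    boolean isHot(String method) {
        return invocations.getOrDefault(method, 0) >= HOT_THRESHOLD;
    }

    public static void main(String[] args) {
        HotMethodSketch profile = new HotMethodSketch();
        for (int i = 0; i < 5000; i++) {
            if (profile.recordInvocation("Order.total")) {
                System.out.println("queue Order.total for optimized compilation");
            }
        }
        profile.recordInvocation("Startup.init"); // runs once, stays cold
        System.out.println("Order.total hot: " + profile.isHot("Order.total"));
        System.out.println("Startup.init hot: " + profile.isHot("Startup.init"));
    }
}
```

    Rarely executed code such as Startup.init never crosses the threshold, so no compile time is spent scheduling it for EPIC.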

    Thanks,

    Ricardo
  15. The comparison is not valid

    So why doesn't Intel publish the "hotspot" for the Itanium? The traditional "JIT" isn't going to do the trick, when the code can be so much more "hyper optimized" for the Itanium. Since the hotspot JVM already does some profiling as it runs, that model should be able to be extended to produce some relatively large bundles for epic (i.e. it should be possible to make relatively good use of the processor.)

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!
  16. The comparison is not valid

    Hi Cameron:

    BEA JRockit is optimized for the Itanium Processor Family. Check out this article: http://cedar.intel.com/media/pdf/Java_64bit_final.pdf

    Thank you,

    Ricardo
  17. The comparison is not valid

    > So why doesn't Intel publish the "hotspot" for the Itanium?


    You may find that "has" to come from HP.

    > The traditional "JIT" isn't going to do the trick, when
    > the code can be so much more "hyper optimized" for the Itanium.

    I've seen nothing to demonstrate that conjecture is true (or
    false); I would *really* like to see a dependable answer.

    I suspect the key issue is how many processor cycles
    have to be expended on optimising down to the bundle level.
    I also expect that the time rises as the square of the
    number of instructions being optimised.
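
    One place where quadratic growth does appear is the naive construction of a dependence graph, where every instruction in a block is compared with every earlier one before anything can be scheduled. The Java sketch below is a toy model under that assumption; the names (DependenceCostSketch, Instr, analyse) are hypothetical, and real schedulers use smarter data structures and do much more work than this.

```java
public class DependenceCostSketch {
    // Toy three-address instruction: dst = src1 op src2 (register numbers).
    static final class Instr {
        final int dst, src1, src2;
        Instr(int dst, int src1, int src2) {
            this.dst = dst; this.src1 = src1; this.src2 = src2;
        }
    }

    // True if 'later' must stay ordered after 'earlier'.
    static boolean dependsOn(Instr later, Instr earlier) {
        boolean readAfterWrite = later.src1 == earlier.dst || later.src2 == earlier.dst;
        boolean writeAfterWrite = later.dst == earlier.dst;
        boolean writeAfterRead = later.dst == earlier.src1 || later.dst == earlier.src2;
        return readAfterWrite || writeAfterWrite || writeAfterRead;
    }

    // Returns {pairwise checks performed, dependence edges found}.
    // The number of checks is n * (n - 1) / 2 for a block of n instructions.
    static long[] analyse(Instr[] block) {
        long checks = 0, edges = 0;
        for (int i = 1; i < block.length; i++) {
            for (int j = 0; j < i; j++) {
                checks++;
                if (dependsOn(block[i], block[j])) edges++;
            }
        }
        return new long[] { checks, edges };
    }

    public static void main(String[] args) {
        Instr[] block = {
            new Instr(1, 2, 3),   // r1 = r2 op r3
            new Instr(4, 1, 2),   // r4 = r1 op r2
            new Instr(5, 6, 7),   // r5 = r6 op r7 (independent of the first two)
            new Instr(1, 5, 4),   // r1 = r5 op r4
        };
        long[] result = analyse(block);
        System.out.println("checks=" + result[0] + " edges=" + result[1]);
    }
}
```

    Doubling the block size roughly quadruples the number of checks, which is why compilers bound the region they schedule at once.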

    Certainly the optimisations depend critically on the
    *details* of the processor's internal micro-organisation
    and on that of the memory controller. I've seen somebody
    discuss how they spent months hand-coding the inner loops
    of vector number-crunching algorithms; they were
    distressed when Intel modified the internal pipeline since
    they had to recode everything from scratch. Maybe the
    optimising compilers have reached the state in which
    the optimisation is fully automated, but it is known to be
    a seriously hard compiler problem.

    > Since the hotspot JVM already does some profiling as
    > it runs, that model should be able to be extended
    > to produce some relatively large bundles for epic
    > (i.e. it should be possible to make relatively good
    > use of the processor.)

    One would think so, but the *practical* ability to
    do such "extension" is not clear in my mind. If it
    isn't practical, then I conjecture that hand-optimisation
    of a classic JIT would be almost as good.
  18. The comparison is not valid

    > Alternatively, SPECjbb2000 is another Java benchmark that has been in use
    > longer and has more results available for comparison
    > (http://www.spec.org/jbb2000/results/jbb2000.html). Note that the
    > Itanium 2 processor holds the highest four-processor performance, with a
    > result of 116K ops/s (vs. 76K ops/s for an Intel Xeon processor MP-based
    > system).

    We have recently been working with a couple of H/W OEMs to publish new SPECjbb2000 results on the following two system configs based on the latest Intel processors running the latest version of BEA WebLogic JRockit:
    (1) 4-way Itanium 2 (1.5 GHz) system running the 64-bit JRockit JVM (optimized for Itanium 2)
    (2) 4-way Xeon (2.8 GHz) system running the 32-bit JRockit JVM (optimized for IA32 Xeon)

    Since they are still under review, I cannot disclose the actual results until they are published (sometime within the next couple of weeks). But, for the sake of this discussion, the latest 64-bit JRockit JVM, which is optimized to take advantage of the EPIC architecture of Itanium and does the necessary code scheduling and optimizations, _outperforms_ the latest 32-bit JRockit JVM on the Xeon system.

    Given that JRockit has already been the fastest 32-bit JVM on IA32 Xeon systems, one can safely conclude that JRockit will be the fastest 64-bit JVM on Itanium systems as well.

    So, aside from realizing the true benefit of 64-bits, which is the ability to address heaps >2GB, Java performance with JRockit on the latest Itanium system is actually better than on the latest Xeon system.

    Arvind Jain
    BEA Systems
  19. impressive either way

    > (1) 4-way Itanium 2 (1.5 GHz) system running the 64-bit JRockit JVM (optimized for Itanium 2)
    > (2) 4-way Xeon (2.8 GHz) system running the 32-bit JRockit JVM (optimized for IA32 Xeon)


    I'm very curious to see 1-way, 2-way, 4-way and 8-way numbers. The Itanium scales up much (much!) better than the Xeon for SMP machines. The Xeon "sweet spot" is a 2-CPU configuration, and even there, it is (with the Intel chipsets) severely hobbled compared to other server systems.

    However, the price/performance of the Xeon is and has been (in the 1x and 2x CPU servers) simply amazing, and the performance in and of itself is amazing. (There are a lot of applications that will run faster on a 1-CPU Xeon than on a "big name" 2-CPU Unix server.)

    So if Itanium can squeeze out more performance than the Xeon, then it's doing pretty well, because that's a hard bar to get over already.

    On a side topic, any plans to support the weird >32-bit memory extensions that IA32 supports? So that you could support >2GB heap on a 32-bit JVM? Any plans for x86/64 support?

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!