WebLogic on Sun T2000 tops SPECjAppServer2004


  1. WebLogic on Sun T2000 tops SPECjAppServer2004 (15 messages)

    Sun Microsystems has published its latest SPECjAppServer2004 benchmark result featuring BEA WebLogic Server 9.0 on a Sun Fire T2000 cluster (the "Niagara" line) for a world record of 3,328.80 JOPS@Standard. This beats the previous high mark of 2,921.48 JOPS@Standard, published just four weeks ago by IBM. Since going GA in July, BEA WebLogic Server 9.0 has held the highest mark on the SPECjAppServer2004 benchmark for all but those four weeks. Sun also submitted two other scores with WLS 9.0, on far smaller setups, to provide performance benchmarks for different deployment environment sizes.

    The result of 3,328.80 JOPS@Standard features WLS 9.0 and Sun's HotSpot JVM on the Solaris 10 64-bit operating system, hosted on six Sun Fire T2000 servers, each with an 8-core UltraSPARC T1 "Niagara" chip and 32 gigabytes of memory. (An interesting point is that the previous high score was achieved on an eight-node deployment.) The database tier ran Oracle Database 10g Enterprise Edition Release 10.1.0.2 on a single Sun Fire E6900 with 16 UltraSPARC IV+ chips, each with 2 cores (total: 32 cores).

    SPEC and SPECjAppServer2004 are trademarks of the Standard Performance Evaluation Corp. (SPEC). Competitive claims reflect results published on www.spec.org as of December 8, 2005. For the latest SPECjAppServer2004 results visit the SPEC results page.

    Threaded Messages (15)

  2. GC threads

    They set ParallelGCThreads=32. This is the same as the number of hardware threads the machine itself has. I guess that makes sense because GC does no I/O. It does spend most of its time waiting for memory, but doing task switches would only make it worse. I wonder if they actually use all those threads or not.

    Interesting how the CPUs on the database server have a 32MB L3 cache (or whatever they call it).
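    To make the flag concrete: ParallelGCThreads is a HotSpot command-line option (e.g. `-XX:ParallelGCThreads=32`). Here's a rough sketch (my own illustration, not from the benchmark submission) of reading back the value a running JVM actually settled on, via HotSpot's diagnostic MBean; note this API lives in `com.sun.management` and is HotSpot-specific, not part of standard Java SE:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: read back the ParallelGCThreads value the JVM is using.
// Launch with e.g. `java -XX:ParallelGCThreads=32 GcThreadCheck`.
// HotSpotDiagnosticMXBean is HotSpot-specific (com.sun.management).
public class GcThreadCheck {
    public static long parallelGcThreads() {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getVMOption returns the flag's current value as a string
        return Long.parseLong(bean.getVMOption("ParallelGCThreads").getValue());
    }

    public static void main(String[] args) {
        System.out.println("ParallelGCThreads = " + parallelGcThreads());
    }
}
```

    If the flag isn't set on the command line, HotSpot computes a default from the processor count, which is why 32 lines up with the T1's hardware thread count.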
  3. GC threads

    They set ParallelGCThreads=32. This is the same as the number of hardware threads the machine itself has. I guess that makes sense because GC does no I/O. It does spend most of its time waiting for memory, but doing task switches would only make it worse. I wonder if they actually use all those threads or not.

    Interesting how the CPUs on the database server have a 32MB L3 cache (or whatever they call it).

    With the hardware threads in the cores, thread swaps are "almost" free for the ones that are loaded into the CPU.

    The truth is, what is happening is you have a single core with multiple hardware threads. The threads on each core are in fact not executing "simultaneously", but they are so well set up that the slightest delay in execution within the core will prompt a thread switch, so the core is always busy actually running code vs waiting on things like memory. Those thread switches happen for basically zero cost.

    For example, let's take a generic multi-processing OS kernel (OS processes, not an SMP CPU system).

    In the beginning, you'd have cooperative multitasking, where each process yields to the scheduler whenever it feels like it. On the old Mac OS systems, the event queue served much of this role.

    Next, you get time slice multi-processing, where you have processes running until a timer interrupts the process, then the scheduler takes over, finds a new process, and restarts that one.

    Now take that kind of kernel and add wait states for things like I/O. So, when a process goes off to wait for a disk buffer to fill, the scheduler "sleeps" the process, resumes another one, and simply monitors the disk buffer. When it's full, it will allow the original process to be scheduled again.

    So, take all of those concepts: the time slicing, the waiting on "slow" resources, etc. That's basically what the Niagara cores are doing for "hardware threads". A hardware thread has all of its context state maintained within the CPU in a bank of special registers. But now, rather than waiting on something REALLY slow, like a disk drive, we can use things like fetching uncached memory as "slow" events.

    Nowadays, for modern CPUs, regular, everyday RAM is very slow, and programs run within CPU-based caches. As users, we'd think "RAM fast, disk slow, Internet real slow". So, we cache Internet content on our disk, then cache our disk in RAM. CPUs go even farther, with cache memory on the chip that's faster than the main memory of the computer.

    So, when a CPU tries to fetch memory that's not in its local cache, it has to do a lot of work, flushing buffers (called pipelines) and refilling its cache so it can continue running.

    What the Niagaras are doing is using events like this to perform thread switches. If one thread stalls for some reason, the core can instantly switch to another ready thread and keep running. While the stalled thread gets itself back into a running state (filling its cache, perhaps), it too can be executed again. Meanwhile, the CPU isn't stalled waiting on it; it's off doing something productive on another thread. The hardware thread switch is cheaper than a pipeline refresh or an out-of-cache memory hit.

    This means that when up and running, the CPU is always doing SOMETHING productive, and rarely just sitting there waiting for slower devices. So, even though it's easy to say "Oh, it's just a bunch of 1GHz CPUs", the overall throughput is quite high because traffic rarely stalls, and the CPU is rarely idle. We just get that much better utilization out of the system.

    With many of today's IT applications more I/O bound than CPU bound (unlike, say, engineering applications), the Niagaras are taking the SMP concept just that much farther and integrating it onto a single chip. Like someone said, the T1000 (I think) chip is like smashing a Sun E6900 down into a single package, saving all that power and hardware to boot.
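    To see the kind of stall the hardware threads are hiding, here's a rough sketch (my own illustration, nothing to do with the benchmark submission): walking a large array in a random pointer-chasing order defeats the caches and prefetchers, while a sequential walk stays cache-friendly. On a conventional CPU the random walk is typically several times slower; a Niagara-style core would hide that latency by switching to another hardware thread instead of stalling.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

// Sketch: pointer-chase an array sequentially vs. in a random cycle.
// The array is sized to be larger than a typical L2 cache of the era,
// so the random chase misses cache on nearly every load.
public class CacheMissSketch {
    static long walk(int[] next) {
        long sum = 0;
        int i = 0;
        for (int n = 0; n < next.length; n++) {
            sum += next[i];
            i = next[i];  // each load depends on the previous one
        }
        return sum;
    }

    public static void main(String[] args) {
        int size = 1 << 21;  // ~8 MB of ints
        int[] sequential = new int[size];
        int[] random = new int[size];
        for (int i = 0; i < size; i++) sequential[i] = (i + 1) % size;

        // Build one big random cycle for the cache-hostile chase.
        Integer[] order = new Integer[size];  // boxed so shuffle works
        for (int i = 0; i < size; i++) order[i] = i;
        Collections.shuffle(Arrays.asList(order), new Random(42));
        for (int i = 0; i < size; i++)
            random[order[i]] = order[(i + 1) % size];

        long t0 = System.nanoTime();
        walk(sequential);
        long tSeq = System.nanoTime() - t0;

        t0 = System.nanoTime();
        walk(random);
        long tRand = System.nanoTime() - t0;

        System.out.println("sequential: " + tSeq / 1000000 + " ms, random: "
                + tRand / 1000000 + " ms");
    }
}
```

    The exact ratio depends entirely on the machine, so I won't quote numbers; the point is only that the "slow event" here is main memory, not disk.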
  4. GC threads

    With the hardware threads in the cores, thread swaps are "almost" free for the ones that are loaded into the CPU. The truth is, what is happening is you have a single core with multiple hardware threads. The threads on each core are in fact not executing "simultaneously"

    I see. So they are not separate cores in the same sense as Xeon, Opteron or Power have separate cores. It's just a way to hide the memory bottleneck in highly multithreaded application. Can you publish here the L2 cache utilization when you ran this benchmark?
    So, even though it's easy to say "Oh, it's just a bunch of 1GHz CPUs", the overall throughput is quite high because traffic rarely stalls, and the CPU is rarely idle. We just get that much better utilization out of the system.

    With many of today's IT applications more I/O bound than CPU bound (unlike, say, engineering applications), the Niagaras are taking the SMP concept just that much farther and integrating it onto a single chip. Like someone said, the T1000 (I think) chip is like smashing a Sun E6900 down into a single package, saving all that power and hardware to boot.

    It looks like an E6900 has a 32MB L3 cache :) Therefore an application that really loves that kind of cache (a relational database doing OLAP) will hit the brakes on the T1000. As they say, you get what you pay for.
  5. GC threads

    I see. So they are not separate cores in the same sense as Xeon, Opteron or Power have separate cores.
    Niagara has 8 real cores, each capable of executing 4 threads. So one Niagara processor can handle 32 'simultaneous' threads.
  6. GC threads

    I see. So they are not separate cores in the same sense as Xeon, Opteron or Power have separate cores.
    Niagara has 8 real cores, each capable of executing 4 threads. So one Niagara processor can handle 32 'simultaneous' threads.

    Oh, right. So the number of truly concurrent threads is the number of cores (8). But on workloads without good cache locality (or I/O-bound workloads...) it's 32.
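    Worth noting that the OS and JVM count each hardware thread as a processor. A trivial check (my own sketch; on a single-chip T1 box I'd expect this to report 32, though I haven't run it on one):

```java
// Sketch: the JVM reports each hardware thread as an available processor,
// so a single 8-core / 4-threads-per-core UltraSPARC T1 shows up as 32.
public class CpuCount {
    public static int count() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("available processors: " + count());
    }
}
```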
  7. hmmmm....

    Since the hardware is all different, how can I know which app server is faster?
  8. RE. ...hmm

    Since the hardware is all different, how can I know which app server is faster?

    You can't and that's what makes these benchmarks so uninteresting. There are too many variables in each and every test to be able to make any decisions from posted results. The even bigger problem - from a customer point of view - is that performance and scalability seem to be opposites; you can't have them both.

    Since DBs are extremely hard (and expensive) to scale, I'm not impressed by benchmarks where there are 8 cores in the app server tier and 32 in the DB tier. But then again... scalability is not what is tested, even though it's what most customers are looking for.
  9. RE. ...hmm

    Since the hardware is all different, how can I know which app server is faster?
    You can't and that's what makes these benchmarks so uninteresting. There are too many variables in each and every test to be able to make any decisions from posted results.

    This type of complex workload doesn't tell you anything accurate about how "good" the app server is, but it does tell an IT manager if he/she can afford a certain platform.
    The even bigger problem - from a customer point of view - is that performance and scalability seem to be opposites; you can't have them both.

    I guess you mean latency and throughput. Actually, there is a third parameter in there, which is cost. You can't have the lowest latency, the highest throughput, and the lowest cost all at the same time.
    Since DB's are extremely hard (and expensive) to scale, I'm not impressed by benchmarks where there are 8 cores in the app server tier and 32 in the DB tier. But then again....scalability is not what is tested, even though what most customers are looking for.

    I don't think the benchmark is supposed to showcase how efficient an architecture is. They are just telling you "if you run Oracle and WebLogic you'll pay _this_ much", or "this other combination will cost this much". It's quite useful from the point of view of someone who has a budget to spend and isn't quite sure how to get enough of what he needs and not get ripped off at the same time.

    Actually, in my comment above I said that in the case of OLAP queries, which contain big joins, a big cpu cache is essential. So I was speaking about a specific, and very simple, workload - just a join of two tables.
  10. RE. ...hmm

    This type of complex workload doesn't tell you anything accurate about how "good" the app server is, but it does tell an IT manager if he/she can afford a certain platform.

    Hmm, I always thought the budget is telling me if I can afford a platform :-)

    Regards,
       Dirk
  11. RE. ...hmm

    This type of complex workload doesn't tell you anything accurate about how "good" the app server is, but it does tell an IT manager if he/she can afford a certain platform.
    Hmm, I always thought the budget is telling me if I can afford a platform :-)

    Regards,
       Dirk

    The budget is fixed. How you use it is not.
  12. not exactly related... but...

    Does anyone know of a good book or web resource where I can find great insight into GC, HEAP, STACK, etc.? I searched a lot but haven't come across something that details these things. Sorry to have taken this thread a bit off track.
  13. not exactly related... but...

    Does anyone know of a good book or web resource where I can find great insight into GC, HEAP, STACK, etc.? I searched a lot but haven't come across something that details these things. Sorry to have taken this thread a bit off track.

    You probably won't find a good book on the use of those terms today, but you can find classics (Aho et al) on data structures including heaps and stacks.

    The use of the term "heap" in Java refers to the fact that the executable module that makes up the runtime environment for most JVMs (including most JITs and Sun's own Hotspot) manages a large amount of memory (usually a big chunk or group of chunks that it has allocated from the OS), from which it allocates space for Java objects. Memory allocation has traditionally been done using a data structure called a "heap", so regardless of the data structure(s) used by JVMs, people refer to objects being allocated "from the heap". Many years ago, in the dark ages, heaps were used because they were efficient for sorting chunks of memory in a way that the desired size chunk could be efficiently found, and for being able to recombine contiguous chunks of free memory into a single larger chunk of free memory.

    The use of the term "stack" refers to three different aspects of the JVM, some of which may be the same. The first is the stack of a process or thread itself, which on x86 is the good old SS:EBP. The second is the logical stack of execution frames, which may actually be managed via the thread stack itself. The third is the stack within a particular execution frame, since the JVM is a "stack machine" (as opposed to a register machine). For example, the Java expression "x=y+z" will compile to two stack push operations ("push y", "push z") followed by an add operation. The add operation will take the top two items off "the stack", add them together, and push the result back onto "the stack". That "the stack" is the stack used by the "stack machine" (as represented by the JVM byte codes).
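    You can see that stack-machine behavior directly with `javap -c`. A tiny sketch of my own (the bytecode comments are roughly what javap prints for a method like this; exact local slot numbers depend on the method):

```java
// Sketch of the JVM as a stack machine. Compile this class and run
// `javap -c StackMachineDemo`; the add method comes out roughly as:
//   iload_0   // push y onto the operand stack
//   iload_1   // push z onto the operand stack
//   iadd      // pop both, add them, push the result
//   istore_2  // pop the result into local variable x
public class StackMachineDemo {
    static int add(int y, int z) {
        int x = y + z;  // "x = y + z": push, push, add, store
        return x;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // prints 5
    }
}
```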

    The use of the term "GC" refers to the objects that were allocated from "the heap". Basically, the JVM is responsible for determining which of those objects are no longer in use, and thus are "garbage" in need of "collection". There are a number of very good articles on Java GC, both in the academic world and from the commercial software companies (e.g. Sun, IBM, BEA) that make JVMs. For example:

    http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
    http://www.petefreitag.com/articles/gctuning/
    http://portal.acm.org/citation.cfm?id=512436
    http://www.memorymanagement.org/
    http://www-128.ibm.com/developerworks/java/library/j-jtp10283/
    http://www.cs.kent.ac.uk/people/staff/rej/gc.html
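    And to make "garbage" concrete, a tiny sketch of my own: once the only strong reference to an object is dropped, the object becomes unreachable and the collector may reclaim it. A WeakReference lets you observe that without keeping the object alive yourself:

```java
import java.lang.ref.WeakReference;

// Sketch: an object becomes "garbage" when no strong reference reaches it.
public class GarbageSketch {
    public static void main(String[] args) {
        Object obj = new Object();            // allocated "from the heap"
        WeakReference<Object> ref = new WeakReference<Object>(obj);

        obj = null;                           // now unreachable: garbage
        System.gc();                          // request (not force) a GC

        // After a collection runs, the weak referent is typically cleared.
        System.out.println("collected? " + (ref.get() == null));
    }
}
```

    (System.gc() is only a hint, so the output isn't guaranteed, which is itself a useful point about GC.)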

    Peace,

    Cameron Purdy
    Tangosol Coherence: The Java Data Grid
  14. not exactly related... but...

    Does anyone know of a good book or web resource where I can find great insight into GC, HEAP, STACK, etc.? I searched a lot but haven't come across something that details these things. Sorry to have taken this thread a bit off track.
    You probably won't find a good book on the use of those terms today, but you can find classics (Aho et al) on data structures including heaps and stacks. [...]
    Thanks a lot for the info.
  15. not exactly related... but...

    Does anyone know of a good book or web resource where I can find great insight into GC, HEAP, STACK, etc.? I searched a lot but haven't come across something that details these things. Sorry to have taken this thread a bit off track.

    GC, I think, is a big subject - you should have your pick of CS books. As far as STACK and HEAP are concerned, I don't think they warrant a whole book.

    Are you just asking because you want to be able to tune a Java application, or are you trying to implement a brand-new language?

    If you just want to be good at tuning, probably a good starting point is the HotSpot tuning article over at Sun. You can't miss it - it's huge :)
  16. not exactly related... but...

    Does anyone know of a good book or web resource where I can find great insight into GC, HEAP, STACK, etc.? I searched a lot but haven't come across something that details these things. Sorry to have taken this thread a bit off track.

    Back in the day, I found Inside the Java Virtual Machine by Bill Venners to be useful. It is a pretty old book by now, but the concepts should remain the same.