Anatomy of a flawed microbenchmark


  1. Anatomy of a flawed microbenchmark (22 messages)

    Brian Goetz has written an article discussing microbenchmarks and why they are usually so flawed. He dissects a benchmark he was sent that tested whether the synchronized primitive or the new ReentrantLock class was "faster."

    Read the article: Anatomy of a flawed microbenchmark

    Threaded Messages (22)

  2. Flawed or not, I would like to know the performance difference between ConcurrentHashMap and plain synchronization around HashMap.

    For example, what is faster:

      ConcurrentHashMap concurrentMap = new ConcurrentHashMap();
      HashMap map = new HashMap();
      Object mutex = new Object();
      ...
      concurrentMap.get("mykey");
      // or ?
      synchronized (mutex) {
        map.get("mykey");
      }

    In my own "flawed" test that I did about a year ago, plain synchronization turned out to be faster.

    Did anybody else play with that?

    --Dmitriy
  3. Anatomy of a flawed microbenchmark

    Flawed or not, I would like to know the performance difference between ConcurrentHashMap and plain synchronization around HashMap. ... In my own "flawed" test that I did about a year ago, plain synchronization turned out to be faster. Did anybody else play with that? --Dmitriy


    ConcurrentHashMap is for concurrency, not for single threaded access.

    Bill
  4. Anatomy of a flawed microbenchmark

    ConcurrentHashMap is for concurrency, not for single threaded access.

    Bill

    Of course.

    That's why when I tested it I started many threads that were contending to access the same map object (unfortunately, I don't have the code anymore).

    Did you happen to explicitly test it or did you just assume that ConcurrentHashMap is better?

    Mind you, concurrent hash map code also must use plain synchronization internally to achieve concurrency.

    --Dmitriy.
  5. Brian Goetz writes here about the internals of ConcurrentHashMap:

    http://www-106.ibm.com/developerworks/java/library/j-jtp08223/
  6. Anatomy of a flawed microbenchmark

    I have looked at the article (albeit briefly). What caught my eye is the method get() of the ConcurrentHashMap class:
    public Object get(Object key) {
        int hash = hash(key);

        // Try first without locking...
        Entry[] tab = table;
        int index = hash & (tab.length - 1);
        Entry first = tab[index];
        Entry e;
            …
    What is interesting is that, since the code is not synchronized, somebody can change the “tab” array in between the “int index = hash & (tab.length - 1);” and “Entry first = tab[index];” lines. Can somebody explain how it does not produce ArrayIndexOutOfBoundsException?

    Thanks,
    Nikita.
  7. I have looked at the article (albeit briefly). What caught my eye is the method get() of the ConcurrentHashMap class: ... What is interesting is that, since the code is not synchronized, somebody can change the “tab” array in between the “int index = hash & (tab.length - 1);” and “Entry first = tab[index];” lines. Can somebody explain how it does not produce ArrayIndexOutOfBoundsException? Thanks, Nikita.
    Methinks he may have paraphrased - the current Sun implementation uses segment tables rather than a monolithic table. In this case the code is operating within a segment of constant size.
  8. Anatomy of a flawed microbenchmark

    Can somebody explain how it does not produce ArrayIndexOutOfBoundsException?

    Sure. 'tab' is a local variable so it points to the same array in both accesses, no matter what happens to the value of 'table' it was copied from. Arrays never change length once instantiated so it's safe. Although the segment table answer might be right too...
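
    To make that concrete, here is a minimal sketch of the idiom (the class and method names are invented for illustration; this is not the JDK source):

      // Copying the shared array reference into a local means the length
      // read and the element read both hit the same array object, even if
      // another thread swaps in a resized copy in between.
      class LocalCopyExample {
          private volatile Object[] table = new Object[16];

          Object getAt(int hash) {
              Object[] tab = table;                 // one read of the shared field
              int index = hash & (tab.length - 1);  // length of that snapshot
              return tab[index];                    // index into the same snapshot
          }

          void grow() {
              Object[] newTab = new Object[table.length * 2];
              System.arraycopy(table, 0, newTab, 0, table.length);
              table = newTab;  // readers still holding the old reference stay consistent
          }
      }
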
  9. Can somebody explain how it does not produce ArrayIndexOutOfBoundsException?
    Sure. 'tab' is a local variable so it points to the same array in both accesses, no matter what happens to the value of 'table' it was copied from. Arrays never change length once instantiated so it's safe. Although the segment table answer might be right too...
    D'oh. Wish I'd noticed that :P
  10. Can somebody explain how it does not produce ArrayIndexOutOfBoundsException?
    Sure. 'tab' is a local variable so it points to the same array in both accesses, no matter what happens to the value of 'table' it was copied from. Arrays never change length once instantiated so it's safe. Although the segment table answer might be right too...
    D'oh. Wish I'd noticed that :P
    Self reply is bad form (I'm on a roll), but I ought to emphasise - I was completely wrong (yes, segments are used, but segment tables are variable in size - so it's all about the local)... I shouldn't have gotten out of bed this morning.
  11. Anatomy of a flawed microbenchmark

    For example, what is faster: ... In my own "flawed" test that I did about a year ago, plain synchronization turned out to be faster.

    Did you even read the article? Because a question like the one you pose is meaningless.

    ConcurrentHashMap deals with what you might call concurrent load conditioning, i.e. groups of buckets have their own lock rather than the whole structure sharing a single lock.

    So, basically, when a plain old synchronized hash map would saturate the collection access, ConcurrentHashMap will happily continue to provide access.

    The results of your (indeed flawed) test would vary dramatically with the number of threads, the number of processors, access pattern, etc, etc.
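
    For readers who haven't met the idea, here is a toy illustration of lock striping - a deliberate simplification, not what ConcurrentHashMap actually does internally:

      import java.util.HashMap;

      class StripedMap<K, V> {
          private static final int STRIPES = 16;
          private final Object[] locks = new Object[STRIPES];
          private final HashMap<K, V>[] segments;

          @SuppressWarnings("unchecked")
          StripedMap() {
              segments = new HashMap[STRIPES];
              for (int i = 0; i < STRIPES; i++) {
                  locks[i] = new Object();
                  segments[i] = new HashMap<K, V>();
              }
          }

          // the key's hash picks a stripe, so threads touching different
          // stripes never contend with each other
          private int stripe(Object key) {
              return (key.hashCode() & 0x7FFFFFFF) % STRIPES;
          }

          public V get(K key) {
              int s = stripe(key);
              synchronized (locks[s]) { return segments[s].get(key); }
          }

          public void put(K key, V value) {
              int s = stripe(key);
              synchronized (locks[s]) { segments[s].put(key, value); }
          }
      }
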
  12. Anatomy of a flawed microbenchmark

    Sure, but mustn't there be a valid question in there somewhere? Like on platform X, using JIT Y, with Z number of processors, spending Q% of their time accessing the hash table, at what point does the overhead of using a concurrent hashmap lead to higher throughput than a synchronized one (maybe solving for Z and Q in terms of X and Y)? Does it maybe involve more insidious factors, such as the hash function/equality test used (which may be executed either inside or outside the lock depending on the implementation)?

    Is the best approach to come up with a rule of thumb using "unflawed" microbenchmarks, to simply engineer for predictable scalability or performance as the situation calls for, or to measure in every application to see which works best?

    Personally, I try to guess which implementation is right given the situation (sometimes using what are probably flawed microbenchmarks), and then go back and tweak if the application fails to perform as desired (after careful analysis of the cause of the performance problem of course). Anybody have a better way?
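
    For what it's worth, the kind of measurement being asked about can be sketched as below. This is a hedged example, not an "unflawed" benchmark - it still ignores warm-up, GC, and timer resolution, exactly the traps the article warns about - but it shows the shape of a thread-scaling comparison:

      import java.util.*;
      import java.util.concurrent.*;

      public class MapScalingBench {
          static long run(final Map<String, String> map, int threads, final int ops)
                  throws InterruptedException {
              final CountDownLatch start = new CountDownLatch(1);
              final CountDownLatch done = new CountDownLatch(threads);
              for (int t = 0; t < threads; t++) {
                  new Thread() {
                      public void run() {
                          try {
                              start.await();
                              Random rnd = new Random();
                              for (int i = 0; i < ops; i++) {
                                  String key = "key" + rnd.nextInt(1000);
                                  if (map.get(key) == null) {
                                      map.put(key, key); // mostly-successful gets, occasional puts
                                  }
                              }
                          } catch (InterruptedException ignored) {
                          } finally {
                              done.countDown();
                          }
                      }
                  }.start();
              }
              long t0 = System.nanoTime();
              start.countDown(); // release all worker threads at once
              done.await();
              return System.nanoTime() - t0;
          }

          public static void main(String[] args) throws InterruptedException {
              for (int n = 1; n <= 16; n *= 2) {
                  long chm = run(new ConcurrentHashMap<String, String>(), n, 1000000);
                  long syn = run(Collections.synchronizedMap(new HashMap<String, String>()), n, 1000000);
                  System.out.println(n + " threads: CHM " + chm / 1000000
                          + " ms, synchronized " + syn / 1000000 + " ms");
              }
          }
      }
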
  13. Anatomy of a flawed microbenchmark

    Did you even read the article? Because question like the one you pose is meaningless.

    That is exactly what I wanted to find out - whether anyone has any meaningful metrics proving that under a certain thread count/load/number of CPUs, ConcurrentHashMap is definitely faster.

    --Dmitriy.
  14. Flawed or not, I would like to know the performance difference between ConcurrentHashMap and plain synchronization around HashMap. ... In my own "flawed" test that I did about a year ago, plain synchronization turned out to be faster. Did anybody else play with that? --Dmitriy

    ConcurrentHashMap allows concurrent iteration over the Map, not just synchronization of gets and puts; i.e. multiple threads can iterate over the map while other threads put and get. If you extended your simple mutex example for iteration, you'd have to synchronize on the mutex to get exclusive access for iteration, blocking all other threads while you iterate. This is the main point of ConcurrentHashMap. (BTW, Collections.synchronizedMap() already implements your mutex approach.)
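
    A small sketch of the contrast (the counting methods are hypothetical, just to give the loops something to do):

      import java.util.*;
      import java.util.concurrent.ConcurrentHashMap;

      class IterationExample {
          // CHM's iterators are weakly consistent: other threads may put()
          // and get() while this loop runs, with no lock held and no
          // ConcurrentModificationException.
          static int countEntries(ConcurrentHashMap<String, String> map) {
              int n = 0;
              for (String key : map.keySet()) {
                  n++;
              }
              return n;
          }

          // With Collections.synchronizedMap(), the Javadoc requires holding
          // the map's lock for the whole traversal, blocking every other
          // thread in the meantime.
          static int countEntriesLocked(Map<String, String> syncMap) {
              int n = 0;
              synchronized (syncMap) {
                  for (String key : syncMap.keySet()) {
                      n++;
                  }
              }
              return n;
          }
      }
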
  15. Flawed or not, I would like to know the performance difference between ConcurrentHashMap and plain synchronization around HashMap. ...
    ConcurrentHashMap allows concurrent iteration over the Map, not just synchronization of gets and puts. ... This is the main point of ConcurrentHashMap. (BTW, Collections.synchronizedMap() already implements your mutex approach.)
    Yup, if you iterate, definitely use CHM; I would also have mentioned the atomic conditional put/replace/remove operations.
    With regard to Dmitriy's question:
    TFA covered this - CHM is highly tuned towards highly concurrent access dominated by mostly successful gets, which is a common bottleneck case (e.g. a cache doing its job while the system is being pounded: misses are rare, throughput is high; otherwise, who cares?).
    Optimisation of non-bottleneck code is almost a complete waste of time (unless it makes for a cleaner design).
    The way I read the benchmarks: if the Map is really being (or will probably be) used concurrently, there is no question - ConcurrentHashMap wins. If concurrency is a rare/freak occurrence, go with synchronization.
    It's only one line of code anyway. Since you are being a good coder and accessing the map only through the Map interface, if your app is having throughput problems you have only to change the class used from a synchronized HashMap wrapper to a concurrent map (or vice versa) and see if the problem stops.
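
    For example, something like this hypothetical factory keeps the choice in one place:

      import java.util.*;
      import java.util.concurrent.ConcurrentHashMap;

      class MapFactory {
          // The rest of the application only ever sees the Map interface,
          // so switching implementations is confined to this one expression.
          static Map<String, Object> newSharedMap(boolean highlyConcurrent) {
              return highlyConcurrent
                      ? new ConcurrentHashMap<String, Object>()
                      : Collections.synchronizedMap(new HashMap<String, Object>());
          }
      }
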
  16. Flawed or not, I would like to know the performance difference between ConcurrentHashMap and plain synchronization around HashMap.

    Interestingly, we did a huge amount of performance and thread-scalability testing with HashMap, Hashtable, a synchronized wrapper around HashMap, our own SafeHashMap, and some other implementations. The amazing thing was how tiny (TINY!) little changes would make massive differences.

    For example, changing the test harness slightly (not the implementations of the classes being tested) would occasionally alter the results significantly. Changing JVM settings could wreak havoc on the results as well.

    In another example, changing the 32-bit modulo to a 64-bit modulo (to treat the hash code as an unsigned 32-bit value, which does result in a better distribution) and also using a guaranteed-prime modulo (even after resizing) ended up performing significantly more poorly on most JVMs, although it dropped the entry collision rate compared to the JDK implementation by well over 90% at the same load factors. Moral: Better engineering, worse performance.
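
    For illustration, the bucket selection described probably looked something like the sketch below (an assumption about its shape, not the actual Tangosol source). Note that the 64-bit remainder costs a long division on every lookup, where the JDK's power-of-two table needs only a single bitwise AND - one plausible source of the slowdown:

      // widening with & 0xFFFFFFFFL zero-extends, so negative hash codes map
      // to large positive longs instead of producing a negative remainder
      static int primeBucket(int hashCode, int primeTableSize) {
          return (int) ((hashCode & 0xFFFFFFFFL) % primeTableSize);
      }

      // what the JDK implementation does instead: one AND, no division
      static int jdkBucket(int hashCode, int powerOfTwoTableSize) {
          return hashCode & (powerOfTwoTableSize - 1);
      }
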

    On the other hand, the multi-threaded results were much more predictable. Predictable==good.

    Also, a side-effect of better distribution is a better tolerance for higher load factors, potentially resulting in a significantly reduced number of bucket array resizes / rehashes.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Shared Memories for J2EE Clusters
  17. Anatomy of a flawed microbenchmark

    In another example, changing the 32-bit modulo to a 64-bit modulo (to treat the hash code as an unsigned 32-bit value, which does result in a better distribution) and also using a guaranteed-prime modulo (even after resizing) ended up performing significantly more poorly on most JVMs, although it dropped the entry collision rate compared to the JDK implementation by well over 90% at the same load factors. Moral: Better engineering, worse performance.
    Very interesting. Do you have any feelings as to whether there is a root cause for those effects? One possibility that springs to mind is that slight changes in data structures might lead to significant changes in the proportion of data that is ready and waiting in the processor's L1,L2 caches.
    On the other hand, the multi-threaded results were much more predictable. Predictable==good.

    Possibly because of flushing dirty L1,L2 caches out to shared memory?

    But then coming from a hard-real-time background, I've always been suspicious of "statistical performance hacks" like caches :)
  18. Anatomy of a flawed microbenchmark

    Very interesting. Do you have any feelings as to whether there is a root cause for those effects?

    I think a lot of it was due to Hotspot. I am sure there is some deterministic behavior to Hotspot; it's just that I haven't been able to prove it yet ;-)

    Also, some of it is no doubt related to poor optimizations for 64-bit math.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Shared Memories for J2EE Clusters
  19. There are alternatives to HashMap/ConcurrentHashMap which might be faster than both (with some limitations). For example, Javolution FastMap is significantly faster than HashMap and supports concurrent access as long as keys are not removed (e.g. a lookup table). If keys are removed infrequently it might be the best option (the whole map can be replaced when keys are removed).
    Having performed several benchmarks for the Javolution (http://javolution.org) open source library, I can tell you that JIT, GC, class initialization, and high-level abstractions (e.g. CharSequence instead of String) can easily fool you.
    I am now opting more and more for "realistic" scenarios with interchangeable pieces (the part you want to measure). This results in relative performance numbers (instead of absolute ones), but most benchmark errors then cancel out.
  20. Regarding the article, I came across replies from the people behind JSR 166, Doug Lea and Brian Goetz himself:

    "Why is ReentrantLock faster than synchronized?
    Hi,

    a recent article by Brian Goetz made me wonder about this again. Brian
    demonstrates how ReentrantLock has a performance and scalability
    advantage over synchronized.

    Did anyone investigate why that would be the case? Is it due to the fact
    that the VM has to lock-enable *any* java.lang.Object through things like
    header displacement, lock inflation/deflation? Or has GC an advantage
    over manual management of lock records?

    Other than that I cannot think of any edge ReentrantLock could have over
    synchronized: both codes are generated inline by the Hotspot compiler,
    both can and probably do use the same suspension/resumption methods, the
    same atomic instruction sequences, ...

    Just wondering
    Matthias "

    ------------------------------------------------------

    "There's some friendly competition among those doing Java-level sync
    vs VM-level sync. The underlying algorithms are increasingly
    pretty similar. So you should expect relative performance differences
    in typical server applications to fluctuate across releases. The main
    goal of ReentrantLock is NOT to uniformly replace builtin sync, but to
    offer greater flexibility and capabilities when you need them, and to
    maintain good performance in those kinds of applications.

    The people doing VM-level sync also have to concentrate on issues that
    we don't with ReentrantLock, like the need to use only a few bits of
    object header space to avoid bloat in objects that are never
    locked. This impacts implementation details, allowing ReentrantLock to
    sometimes require a few cycles less overhead.

    Also, VM-level support must deal with the fact that many
    programs/classes do a lot of sync that is entirely useless because it
    can never be contended. (We assume people hardly ever do this with
    ReentrantLock so don't do anything special about it.) Techniques to
    greatly cheapen this case are probably coming soon in hotspot and
    other VMs. (This case is already pretty cheap on uniprocessors, but
    not multiprocessors.)

    There are currently still a few things that can be done at JVM level
    that we can't do at Java level. For example, adapting spinning to vary
    with load averages. We're working on leveling the playing field here
    though :-)

    -Doug "


    --------------------------------------------------------


    "Performance is a moving target. In the first JVM, performance for
    everything sucked (locking, garbage collection, allocation, you name it)
    because the first JVM was a proof-of-concept and performance wasn't the
    goal. Once the VM concept was proven, engineering resources were then
    allocated to improve performance, and there is no shortage of good ideas
    for making things faster, so performance in these areas improved and is
    improving with each JVM version.

    So, one factor in why ReentrantLock is faster than built-in
    synchronization is that the JSR 166 team spent some effort building a
    better lock -- not because the JVM folks didn't have access to the same
    papers on lock performance, but because they had other priorities of
    where to spend their efforts. But they will get around to it and the
    scalability gap will surely close in future JVM versions.

    Interestingly, the algorithm used under the hood of ReentrantLock is
    easier to implement in Java than in C, because of garbage collection --
    a C version of the same algorithm would be a lot more work and would
    require more bookkeeping in the algorithm. As a result, the approach
    taken by ReentrantLock makes more garbage and uses less locking than the
    obvious C analogue, and it turns out that, given the current relative
    cost between memory management and memory synchronization, an algorithm
    that makes more garbage and uses less coordination is more scalable.
    This week. Might be different next week. Performance is a moving target.

    Brian Goetz"


    --------------------------------------------------------
  21. Atomic variables in java.util.concurrent:
    "Nearly all the classes in the java.util.concurrent package use atomic variables instead of synchronization, either directly or indirectly. Classes like ConcurrentLinkedQueue use atomic variables to directly implement wait-free algorithms, and classes like ConcurrentHashMap use ReentrantLock for locking where needed. ReentrantLock, in turn, uses atomic variables to maintain the queue of threads waiting for the lock.

    These classes could not have been constructed without the JVM improvements in JDK 5.0, which exposed (to the class libraries, but not to user classes) an interface to access hardware-level synchronization primitives. The atomic variable classes, and other classes in java.util.concurrent, in turn expose these features to user classes. "

    Nearly all of the classes in java.util.concurrent are built on top of ReentrantLock, which itself is built on top of the atomic variable classes. So while they may only be used by a few concurrency experts, it is the atomic variable classes that provide much of the scalability improvement of the java.util.concurrent classes.



    When using the JDK 5.0 package java.util.concurrent.atomic, the CAS interaction goes through Unsafe.compareAndSwapInt:
    import sun.misc.Unsafe;

    // set up to use Unsafe.compareAndSwapInt for updates
    private static final Unsafe unsafe = Unsafe.getUnsafe();

    /**
     * Atomically set the value to the given updated value
     * if the current value <tt>==</tt> the expected value.
     *
     * @param expect the expected value
     * @param update the new value
     * @return true if successful. False return indicates that
     * the actual value was not equal to the expected value.
     */
    public final boolean compareAndSet(boolean expect, boolean update) {
        int e = expect ? 1 : 0;
        int u = update ? 1 : 0;
        // valueOffset (declared elsewhere in the class) is the memory offset
        // of the backing int field, obtained from Unsafe at class load time
        return unsafe.compareAndSwapInt(this, valueOffset, e, u);
    }

    As you can see below, the backport of java.util.concurrent uses synchronized on this method!

    /**
     * Atomically set the value to the given updated value
     * if the current value <tt>==</tt> the expected value.
     *
     * @param expect the expected value
     * @param update the new value
     * @return true if successful. False return indicates that
     * the actual value was not equal to the expected value.
     */
    public final synchronized boolean compareAndSet(long expect, long update) {
        boolean success = (expect == value);
        if (success)
            value = update;
        return success;
    }

    The low-level classes in java.util.concurrent -- ReentrantLock and the atomic variable classes -- are far more scalable than the built-in monitor (synchronization) locks. As a result, classes that use ReentrantLock or atomic variables for coordinating shared access to state will likely be more scalable as well.
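
    The canonical pattern built on compareAndSet is the retry loop; this standard textbook sketch of a nonblocking counter shows why no lock is needed:

      import java.util.concurrent.atomic.AtomicInteger;

      class NonblockingCounter {
          private final AtomicInteger value = new AtomicInteger(0);

          public int increment() {
              for (;;) {
                  int current = value.get();
                  int next = current + 1;
                  // succeeds only if no other thread updated the value in
                  // between; on failure we simply re-read and retry
                  if (value.compareAndSet(current, next)) {
                      return next;
                  }
              }
          }
      }
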

    ----------------------------------------

    Doug Lea, in his document regarding backport-util-concurrent, deals with the comparison between Hashtable and ConcurrentHashMap and gives some figures as well:

    "Hashtable vs. ConcurrentHashMap

    As an example of scalability, the ConcurrentHashMap implementation is designed to be far more scalable than its thread-safe uncle, Hashtable. Hashtable only allows a single thread to access the Map at a time; ConcurrentHashMap allows multiple readers to execute concurrently, readers to execute concurrently with writers, and some writers to execute concurrently. As a result, if many threads are accessing a shared map frequently, overall throughput will be better with ConcurrentHashMap than with Hashtable.

    The table below gives a rough idea of the scalability differences between Hashtable and ConcurrentHashMap. In each run, N threads concurrently executed a tight loop where they retrieved random key values from either a Hashtable or a ConcurrentHashMap, with 60 percent of the failed retrievals performing a put() operation and 2 percent of the successful retrievals performing a remove() operation. Tests were performed on a dual-processor Xeon system running Linux. The data shows run time for 10,000,000 iterations, normalized to the 1-thread case for ConcurrentHashMap. You can see that the performance of ConcurrentHashMap remains scalable up to many threads, whereas the performance of Hashtable degrades almost immediately in the presence of lock contention.

    The number of threads in this test may look small compared to typical server applications. However, because each thread is doing nothing but repeatedly hitting on the table, this simulates the contention of a much larger number of threads using the table in the context of doing some amount of real work.

    Threads   ConcurrentHashMap   Hashtable
       1            1.0              1.51
       2            1.44            17.09
       4            1.83            29.9
       8            4.06            54.06
      16            7.5            119.44
      32           15.32           237.2"
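
    The per-thread loop described above can be reconstructed roughly as follows (the original harness is not shown, so the key range and structure here are guesses):

      import java.util.*;

      class MixedWorkload {
          static void hammer(Map<Integer, Integer> map, int iterations, int keyRange) {
              Random rnd = new Random();
              for (int i = 0; i < iterations; i++) {
                  Integer key = Integer.valueOf(rnd.nextInt(keyRange));
                  Integer value = map.get(key);
                  if (value == null) {
                      if (rnd.nextInt(100) < 60) map.put(key, key); // 60% of misses put()
                  } else if (rnd.nextInt(100) < 2) {
                      map.remove(key);                              // 2% of hits remove()
                  }
              }
          }
      }
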


    ----------------------------------

    Regards

    Gershon
    GigaSpaces Technologies
    www.GigaSpaces.com

  22. One last note: since we have also run many benchmarks of classic Java synchronisation vs. the new java.util.concurrent objects, I can contribute another two tips.

    1. In case you are using ConcurrentHashMap, be aware of the extreme cost of the isEmpty() and size() methods, due to the segment partitioning model.

    2. HashMap was rewritten in 1.4 to use a power-of-two sized closed hash table. The performance of this implementation is, generally speaking, better than that of the previous implementation, but the new implementation is much more sensitive to poor hash functions. To mitigate this problem, the new implementation applies a supplemental hash function to the value returned by hashCode().

    The new HashMap is a little slower if the hash codes of the keys are sequential - perhaps this is what was meant by "iteration". E.g. if the keys are Integers with a bunch of sequential values, the 1.3 HashMap puts these all in separate buckets with no collisions, while the 1.4 HashMap does some extra scattering which causes some extra (pseudo-random) collisions. However, if the hash codes all differ by an integer multiple of the Map's current capacity, then the 1.3 HashMap gets *many* more collisions, while 1.4 avoids this thanks to the extra scattering. So the 1.4 code is more robust overall.
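
    The pathological case is easy to demonstrate (illustrative only):

      public class BucketDemo {
          public static void main(String[] args) {
              int capacity = 16; // HashMap capacities are powers of two since 1.4
              for (int k = 0; k < 5; k++) {
                  int hash = k * capacity; // hash codes 0, 16, 32, 48, 64
                  System.out.println(hash + " -> bucket " + (hash & (capacity - 1)));
              }
              // prints bucket 0 five times; the supplemental hash scatters such
              // values before masking, trading a few pseudo-random collisions
              // on sequential keys for robustness against patterns like this
          }
      }
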


    For those of you who know who Joshua Bloch is (he wrote Hashtable, for instance), here is his reply to the discussion at http://www.javaspecialists.co.za/archive/Issue054.html (Feb 24):

    quote:
    --------------------------------------------------------------------------------

    Hi. This is the first I've heard of this. I looked into it briefly and I believe I know what's going on. I don't have time to write it up tonight, but I'll try to do it soon. For starters, the microbenchmark is not good. It should time the test repeatedly so that you can make sure you're getting a repeatable result. Timing the first run gives you startup transients, which are not what you want. Also you have to figure out how you want to deal with garbage collection: either eliminate it from the timings by calling System.gc() outside of the timing loop, or time runs long enough to predictably include GC.
    That said, Hashtable really is faster than HashMap for the data set in this benchmark: strings formed from sequential numbers. The pre 1.4 HashMap implementation, which was retained in Hashtable, does very well on sequential keys because it doesn't scatter them at all: it puts them in sequential buckets. As a result, there are *no* collisions. The price you pay is that certain regular but non-consecutive sequences will all hash to the same bucket, resulting in horrible performance. The 1.4 version (and especially the 1.4.1 version) scatters the keys pseudo-randomly, resulting in occasional collisions when hashing consecutive sequences like the one in the microbenchmark. (The mathematics of this is pretty simple; you can easily calculate roughly how many collisions you'll get.) To test this conjecture, use pseudo-random numbers in the keys instead of consecutive. I did, and found that HashMap is a bit faster than Hashtable, as expected.

    Loosely speaking, then, the 1.4.1 implementation is a better general purpose implementation. If you know that you'll be dealing with a consecutive sequence, the 1.3 implementation (still present in Hashtable) could well be faster. On the other hand, for consecutive keys you could just use an array (or List), which would be *really* fast : )
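
    Following that advice, a minimal timing helper might look like the sketch below (the structure is an assumption, not Bloch's code):

      class RepeatedTimer {
          static void time(String label, Runnable test, int runs) {
              for (int i = 0; i < runs; i++) {
                  System.gc(); // keep collection work out of the timed region
                  long t0 = System.nanoTime();
                  test.run();
                  long elapsedMs = (System.nanoTime() - t0) / 1000000;
                  // the first few runs include startup transients (class
                  // loading, JIT); later runs should converge if the
                  // benchmark is repeatable
                  System.out.println(label + " run " + i + ": " + elapsedMs + " ms");
              }
          }
      }
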


    -------------------------------------------------------


    Regards
    Gershon

    GigaSpaces Technologies
    www.GigaSpaces.com
  23. Micro vs macro benchmark

    Goetz is right to denounce the over-generalization and abuse that surround microbenchmarks. He is also correct when he says that macrobenchmarks usually give more insight, because they apply to a specific situation.

    Simply put, my experience is that macrobenchmarks are also very prone to misinterpretation.

    http://jroller.com/page/design4speed/20050227#micro_vs_macro_benchmark

    A macrobenchmark is more typically done by large companies or organizations. The principle is basically the same as for a microbenchmark:
        * choose one vendor out of two or three
        * make a go/no-go decision on a project based on a performance assessment

    Typically, this happens before or at the early stages of big IS projects costing the company several tens of millions of dollars. Management wants to be sure in advance that their investment will deliver the expected performance, so they are willing to spend a few hundred thousand dollars on a benchmark to find out.

    And indeed, you need that money to rent a realistic infrastructure (even at 1/10 the scale of the final one), muster enough horsepower to simulate hundreds of users, rent the load-testing software, and keep all the network, OS, hardware, and software specialists around the table for the 4 to 8 weeks this will last.

    So what is the difference from a microbenchmark? Despite all this money, you have no chance of blindly covering the whole spectrum of possible hardware/software/architecture/network/application configurations. What you usually do is start with an initial configuration and progressively remove the successive bottlenecks you discover in the system, until time runs out. Then you can prove to the amazed managers that, starting from an initial configuration so poor that someone should have been fired, you divided the response time by one thousand and got within acceptable limits, thanks to the two weeks of 24-hour-a-day work you and your team did at the end of the benchmark.

    As with the microbenchmark, what you did is perfectly valid, right up until the conclusion, because the flaw lies in how you handle the measurements and what you conclude from them. Again, the issue is over-interpretation of the results. For example, things will go wrong because:
        * the benchmark was there to validate the future system, not the one built for the benchmark; there will be differences (different hardware, OS versions, even OS types, etc.), and the result you got will no longer be valid
        * there is no guarantee that the user behavior you simulated is typical of the behavior of future users
        * since you basically struggled to get the best results, all that will be kept is the final and best figure you report, so there is again only one point of measurement, which is a poor basis for building a model of your system; anyway, you will never have the time and money to do that seriously

    When you have paid more than a hundred thousand dollars for an answer, you are not ready to hear that the conclusion is only valid for the system as tested, and that applying proportionality rules to a highly non-linear system is not a good idea. So the error is usually the same as with microbenchmarks: wrong extrapolation from a single point of measurement. Fortunately, this is not so bad, because the team of specialists will have gained a lot of insight into the system and a fairly good idea of what matters to make it work optimally. As Goetz said, the only way forward on the performance road is empirical: you need to struggle with the system and understand its precise mechanisms to have a chance of doing a benchmark right.
    So what?

    Well, OK, this is based on my personal experience. There may be teams out there doing macrobenchmarks scientifically, building systematic models of the systems they study, with no pressure on time and money. Maybe at software vendors? I would really like to meet them.

    But so far, what I have learned is that macrobenchmarks are more or less just bigger microbenchmarks, as Goetz described, and the errors made with them are the same. Instead of being used as raw material for understanding the system better, they are used to draw general conclusions from very limited experience. The only real benefit comes from running the benchmark and learning how the system works, not from the conclusions or the results published to everyone.