Discussions


  1. Article: No Objects Left Behind (19 messages)

    Java performance tuning specialist Kirk Pepperdine attempts to clarify some common preconceptions around Java memory management. He examines how poor coding practices can lead to inefficiencies in garbage collection and explores some techniques for ensuring that objects are managed more effectively by the GC. From the article:
    "In his recent series of talks, Brian (Goetz) makes the claim that memory allocation in Java is far cheaper than it is in C/C++. The justifications for this are based on his in-depth conversations with those that are responsible for developing the JVM. Despite understanding the source of Brian’s information, a number of people have been openly critical of his musings because it simply does not match their experiences in the trenches. It is not the first time theory and real world experience have collided at complete right angles to each other. For example, if you can find the advice originally given by Sun on how to tune the 1.4.1's generational spaces you will see that today’s advice is quite different. The advice given in that document appears to be based on the assumption that memory-to-memory copying of objects from one space to another is expensive. The conclusion one would draw from this is one should configure the JVM to move objects from young to old space as soon as it is practical. In technical parlance this meant reducing the size of the survivor spaces so that objects would be immediately promoted to the old generation. The first time I tried to follow this advice I quickly discovered that I need to ignore the “expert advice” and make survivor spaces large to delay promotion to old space. Now we have some new “expert advice” that is being questioned by other “experts”. The question is: what is it that they are seeing that is leading them to question Brian’s advice. Even more importantly, who should we be listening to?"
    Read No Objects Left Behind
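    For readers who want to experiment with the survivor-space sizing the excerpt mentions, a minimal, purely illustrative HotSpot command line might look like the following; the sizes and thresholds are placeholders, not a recommendation, and MyApp is a hypothetical main class:

        java -Xms512m -Xmx512m -Xmn128m \
             -XX:SurvivorRatio=6 \
             -XX:MaxTenuringThreshold=15 \
             -XX:+PrintGCDetails \
             MyApp

    Roughly speaking, a lower -XX:SurvivorRatio makes the survivor spaces larger relative to eden, and a higher -XX:MaxTenuringThreshold lets objects age longer in the young generation before promotion, which is the "delay promotion to old space" approach described in the excerpt above.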

    Threaded Messages (19)

  2. It will be interesting to readdress this when (if?) stack allocation through escape analysis is included in the J7 VM.
  3. It will be interesting to readdress this when (if?) stack allocation through escape analysis is included in the J7 VM.
    NUMA-aware GC combined with escape analysis is already present in 1.6_02. IBM is already doing escape analysis and I have some side-by-side comparisons that show it does make a measurable difference. I've not yet rerun with the latest 1.6, but now that the question has been posed.... Kind regards, Kirk
  4. It will be interesting to readdress this when (if?) stack allocation through escape analysis is included in the J7 VM.


    NUMA-aware GC combined with escape analysis is already present in 1.6_02. IBM is already doing escape analysis and I have some side-by-side comparisons that show it does make a measurable difference. I've not yet rerun with the latest 1.6, but now that the question has been posed....

    Kind regards,
    Kirk
    As I understand it, EA is in HotSpot for 1.6 (apparently you must enable it?) but the ability to move object allocation to the stack is not yet implemented. I think it's only used to remove unnecessary synchronization and to make some other minor improvements in 1.6. I was really excited about this when I first read about it, but I've had a hard time finding news about it. Do you know if stack allocation is implemented in IBM's VM? I'm not sure if the VM spec had to be modified to allow this. I didn't think it needed to, but maybe that's the hold-up.


  5. As I understand it, EA is in HotSpot for 1.6 (apparently you must enable it?) but the ability to move object allocation to the stack is not yet implemented. I think it's only used to remove unnecessary synchronization and to make some other minor improvements in 1.6.

    I was really excited about this when I first read about it, but I've had a hard time finding news about it. Do you know if stack allocation is implemented in IBM's VM? I'm not sure if the VM spec had to be modified to allow this. I didn't think it needed to, but maybe that's the hold-up.
    This is a funny thing, because when I talked to the JVM guys I asked about what amounted to cross-frame optimizations. I also had a discussion with Cliff Click on the subject. HotSpot builds an execution graph of your running program and then does some path analysis on it. When it finds a reduction that makes sense to apply in the graph, it applies it. When I talked to the JVM guys I was interested in lock elision and elimination. They said that it is difficult to apply across frames. Cliff clarified this by indicating that interactions across threads are a very difficult problem to solve. This is somewhat solved in an Azul box using hardware assist. Without hardware assist it apparently gets to be almost impossible. The bottom line is that this suggests most lock optimizations cannot be realized, which fits with my observations in the field.

    Ok, getting back to EA and stack allocation. It is my understanding that stack allocation in the object sense can never be like stack allocation in the C sense. So data is not allocated on the stack; only the pointer ends up on the stack, but I think it gets marked as being localized to the stack. That helps the GC identify it as dead when the stack frame is marked as dead. That said, I think NUMA (which needs EA, IIRC) is mixed in with stack allocation (i.e. localization).

    The IBM JVM is a different beast altogether. With 6.0 we will see real generational spaces. I don't know what is happening with p and k clusters as well as the wilderness; I've not seen anything that dives to that level of detail yet. I think the difference in GC performance with IBM was due to the coarsening of small objects with the clusters and dumping large objects into the wilderness (to aid with compaction). IBM was focused on size instead of age, so it will be interesting to see 6.0 in action. As for stack allocation, I think I saw it on the table but I wouldn't bet on that.

    I don't think the VM spec needed to be modified for this. I think the only thing that has been touched recently is the memory model. But that is a different topic.

    -- Kirk
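    To make the lock-elision case concrete, here is a minimal, hypothetical Java sketch (class and method names invented) of the shape escape analysis looks for: an object whose monitor is only ever taken by the thread that created it and that never escapes its method, so a JIT may remove the locking entirely.

        // Hypothetical illustration; whether the lock is actually elided
        // depends on the JVM and on escape analysis being enabled.
        public class LockElisionDemo {
            static String join(int n) {
                // 'sb' is never stored in a field or handed to another
                // thread, so its synchronized append() calls are candidates
                // for lock elision (and the object itself for elimination).
                StringBuffer sb = new StringBuffer();
                for (int i = 0; i < n; i++) {
                    sb.append(i).append(',');
                }
                return sb.toString();   // only the resulting String escapes
            }

            public static void main(String[] args) {
                System.out.println(join(5));
            }
        }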
  6. This is a funny thing, because when I talked to the JVM guys I asked about what amounted to cross-frame optimizations. I also had a discussion with Cliff Click on the subject. HotSpot builds an execution graph of your running program and then does some path analysis on it. When it finds a reduction that makes sense to apply in the graph, it applies it. When I talked to the JVM guys I was interested in lock elision and elimination. They said that it is difficult to apply across frames. Cliff clarified this by indicating that interactions across threads are a very difficult problem to solve. This is somewhat solved in an Azul box using hardware assist. Without hardware assist it apparently gets to be almost impossible. The bottom line is that this suggests most lock optimizations cannot be realized, which fits with my observations in the field.
    According to a research paper (see below), thread escape analysis can be done to determine that an object is only referenced by a single thread. According to this paper, this was implemented successfully in a prototype VM. What you are talking about sounds like something else to me.
    Ok, getting back to EA and stack allocation. It is my understanding that stack allocation in the object sense can never be like stack allocation in the C sense. So data is not allocated on the stack; only the pointer ends up on the stack, but I think it gets marked as being localized to the stack.
    I'm not sure how having the pointer on the stack is really different from how Java has always worked, but I'm pretty sure there's more to it anyway.

    http://www.research.ibm.com/people/g/gupta/toplas03.pdf

    "This article presents an escape analysis framework for Java to determine (1) if an object is not reachable after its method of creation returns, allowing the object to be allocated on the stack, and (2) if an object is reachable only from a single thread during its lifetime, allowing unnecessary synchronization operations on that object to be removed."

    This document confirms (and references) the above document: http://developers.sun.com/learning/javaoneonline/2006/coreplatform/TS-3412.pdf?
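    The two determinations the paper describes can be sketched in a few lines of hypothetical Java (names invented for illustration); which of these a given VM actually optimizes is exactly what is in question in this thread.

        // Hypothetical sketch of the two cases from the paper.
        public class EscapeCasesDemo {

            // Case 1: the array 'p' is unreachable once midpoint() returns,
            // so the analysis may allow it to be allocated on the stack
            // (or eliminated outright).
            static int midpoint(int a, int b) {
                int[] p = new int[] { a, b };
                return (p[0] + p[1]) / 2;
            }

            // Case 2: 'v' escapes collect()'s frame (it is passed to fill()),
            // but it is only ever reachable from the creating thread, so the
            // synchronization inside Vector's methods could be removed.
            static int collect(int n) {
                java.util.Vector<Integer> v = new java.util.Vector<Integer>();
                fill(v, n);
                return v.size();
            }

            static void fill(java.util.Vector<Integer> v, int n) {
                for (int i = 0; i < n; i++) {
                    v.add(Integer.valueOf(i));
                }
            }

            public static void main(String[] args) {
                System.out.println(midpoint(2, 8));   // 5
                System.out.println(collect(10));      // 10
            }
        }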
  7. Fantastic, you pulled the information I should have before posting ;-) Sometimes it just doesn't pay to rush...
    According to a research paper (see below), thread escape analysis can be done to determine that an object is only referenced by a single thread. According to this paper, this was implemented successfully in a prototype VM. What you are talking about sounds like something else to me.
    Same thing as what I said, only you did it with fewer words. There is another observation being used here, and that is: the vast majority of objects are only ever touched by the thread that created them. Now, that may be a function of current programming models also, but... it has driven development in favor of single-thread optimizations, which doesn't bode well for things like locks. The best thing that can happen is that the VM figures out you really didn't need them and then gets rid of them in the compiled code. It is a decision you don't want to be overly aggressive about ;)
    I'm not sure how having the pointer on the stack is really different that how Java has always worked but I'm pretty sure there's more to it anyway.
    http://www.ibm.com/developerworks/java/library/j-jtp09275.html would seem to suggest I'm wrong. And, I can't find a reference that supports what I heard. So much for trusting one's memory of a hallway conversation. Kirk
  8. Same thing as what I said, only you did it with fewer words. There is another observation being used here, and that is: the vast majority of objects are only ever touched by the thread that created them. Now, that may be a function of current programming models also, but... it has driven development in favor of single-thread optimizations, which doesn't bode well for things like locks. The best thing that can happen is that the VM figures out you really didn't need them and then gets rid of them in the compiled code.
    To me the lock removal is a small benefit. I can see it helping with the performance of code that uses old classes like Vector or StringBuffer but generally I only use a lock when I need one. Stack allocation, I think, could make a big difference in my applications. I try to limit scope as much as possible and use a lot of purely local objects.
  9. To me the lock removal is a small benefit. I can see it helping with the performance of code that uses old classes like Vector or StringBuffer but generally I only use a lock when I need one.

    Stack allocation, I think, could make a big difference in my applications. I try to limit scope as much as possible and use a lot of purely local objects.
    The GC efficiency afforded by stack allocation may [negatively] surprise most people. Eliminating allocation altogether through in-frame escape analysis (e.g. turning objects into registers) is one thing - it gives you faster programs, regardless of GC issues. However, if you actually have to allocate the thing in memory and in an object structure (which is generally the case for cross-frame situations), placing it in a thread-local area or a stack arrangement turns out to be a very small win, and if the VM is not careful about it, may often not be worth the extra processing overhead compared to bump-a-pointer contiguous heap allocation.

    The reason behind this is that generational collection of short-lived stuff is already super-efficient, and most stack-allocatable objects are exactly the sort of objects that generational GC spends zero time on: short lived enough that they are never visited by the collector and never copied anywhere, not even to a survivor space (read: no cost at all, except for a somewhat earlier trigger of a new-gen collection cycle through increased allocation pressure). So, if you already have to pay for allocation in one form or another, generational collection of most stack-allocatable objects is actually cheaper than dealing with stack "de-allocation" (or alternate forms of per-frame bookkeeping) of object allocation frames.

    Escape analysis’s relative, which we dub "escape detection" (and for which we have hardware assists in Azul machines), allows us to evaluate allocation behaviors and opportunities across frames, and can offer interesting optimizations, but the resulting GC efficiency is not nearly as big a win as you may think.

    At Azul, we simply eliminate or change the question to begin with: our GCs don't stop your program, and use plentiful cores for background execution while your program threads keep executing happily. We eat GC pressure for breakfast, as in tens of GBytes per second of sustainable pauseless allocation with GC happily keeping up in the background. Taking pressure off such a collector is simply not so much an issue or a goal any more, so you can make a really big mess if you want, and focus on making your program do interesting, useful stuff...

    -- Gil, Azul Systems
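    As a small, hypothetical illustration of the allocation pattern being described (the Pair class and names are invented): every Pair below is dead by the end of its loop iteration, so a copying young-generation collector never visits it; its only costs are the bump-the-pointer allocation itself and slightly more frequent new-gen cycles.

        // Hypothetical illustration only. A JIT with escape analysis may
        // scalar-replace these objects and skip the heap allocation
        // entirely; assume that is not happening here.
        public class ShortLivedDemo {
            static final class Pair {
                final int a, b;
                Pair(int a, int b) { this.a = a; this.b = b; }
            }

            public static void main(String[] args) {
                long sum = 0;
                for (int i = 0; i < 10000000; i++) {
                    Pair p = new Pair(i, i + 1);  // dies this iteration:
                    sum += p.a + p.b;             // never reachable at the
                }                                 // next new-gen collection
                System.out.println(sum);
            }
        }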
  10. The GC efficiency afforded by stack allocation may [negatively] surprise most people. Eliminating allocation altogether through in-frame escape analysis (e.g. turning objects into registers) is one thing - it gives you faster programs, regardless of GC issues. However, if you actually have to allocate the thing in memory and in an object structure (which is generally the case for cross-frame situations)
    Can you explain or point me to an explanation of what a 'cross-frame' situation is? I'm just not sure whether this applies to the kind of situation that I am concerned with. thanks.
  11. The GC efficiency afforded by stack allocation may [negatively] surprise most people. Eliminating allocation altogether through in-frame escape analysis (e.g. turning objects into registers) is one thing - it gives you faster programs, regardless of GC issues. However, if you actually have to allocate the thing in memory and in an object structure (which is generally the case for cross-frame situations)

    Can you explain or point me to an explanation of what a 'cross-frame' situation is? I'm just not sure whether this applies to the kind of situation that I am concerned with.

    thanks.
    [You can find plenty of material on this subject by googling for "escape analysis inter procedural".]

    By cross-frame analysis (also referred to as inter-procedural analysis, which mostly but not entirely overlaps with cross-frame analysis) I refer to the practice of trying to prove that objects don't escape the scope of a certain set of stack allocation frames even though they are passed between frames (as parameters to methods, for example). While this gets very complex in the presence of polymorphic method calls, it is still doable to some degree when the compiler can prove things about all possible [currently loaded] implementations of a method and its possible calling graphs, for example. In contrast, frame-local analysis looks for objects that are never visible outside the current allocation frame [could be the current method, or the current loop iteration in some situations], and may enable the compiler to eliminate the allocation of the object altogether (it may just end up having some fields represented in machine registers, for example).

    One common pattern that produces such cross-frame opportunities is allocating an object and passing it to called methods, which may work on it but never store a reference to it in objects that are outside the scope of the original allocator. Another pattern, very common in factory allocation, is one where an object survives the frame it was allocated in, but never leaves the scope of a calling method (or some N-deep caller). Yet another pattern is allocating an object inside a loop body, where the loop-allocated objects survive for the lifetime of the loop (populating a temporary collection of objects, for example).

    The majority of thread-local objects and stack-allocatable objects (all of which are by definition thread-local) do span frames, survive across at least one method call or return, or live for the duration of long repeated allocation loops. All these cases normally require the object structure to be actually allocated in memory somewhere, as it cannot practically be stored in registers for its entire lifetime. It may be stored in a stack pattern (where popping the frame just discards the memory), but it has to be stored in memory.

    As I note above, once you've allocated such an object in memory, you've paid a cost that is practically equal to the cost of allocating it in the heap (it is actually likely to be higher than the cost of allocation in a new-gen allocation block). Since virtually all such objects would be collected by a generational collector with no cost at all, there is basically no win here from a GC perspective. [Most of these objects will die young almost by definition, and dead objects cost a new-gen collector nothing, as they never get visited.] The only caveat to this "no GC-win" argument is that allocating soon-to-die objects in new-gen does make the new-gen collection cycle itself occur more often (compared to allocating some of them in a stack pattern). However, this difference in pressure can be trivially offset by simply increasing the size of the new-gen by some offsetting linear factor (which would be fairly small).

    Don't get me wrong. Escape analysis is great, and stack-based or thread-local allocation has some other potential performance advantages (like eliminating synchronization work, enabling in-frame optimizations, and potentially improving cache locality). However, contrary to popular belief, GC efficiency is actually not one of those benefits.

    -- Gil, Azul Systems
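    The three patterns described above can be sketched in a few lines of hypothetical Java (class and method names invented); in each case the object has to exist in memory somewhere, even though it never becomes visible outside the creating thread.

        // Hypothetical sketches of the cross-frame patterns described above.
        public class CrossFrameDemo {

            // (a) Allocate and pass down: 'buf' is handed to callees that
            //     work on it but never store a reference that outlives render().
            static String render(int n) {
                StringBuilder buf = new StringBuilder();
                appendHeader(buf);
                appendBody(buf, n);
                return buf.toString();
            }
            static void appendHeader(StringBuilder buf) { buf.append("items:\n"); }
            static void appendBody(StringBuilder buf, int n) {
                for (int i = 0; i < n; i++) buf.append(i).append('\n');
            }

            // (b) Factory allocation: the array outlives makePoint()'s frame
            //     but never leaves the scope of its caller.
            static int manhattan(int x, int y) {
                int[] p = makePoint(x, y);
                return Math.abs(p[0]) + Math.abs(p[1]);
            }
            static int[] makePoint(int x, int y) { return new int[] { x, y }; }

            // (c) Loop allocation into a temporary collection that lives only
            //     for the duration of the enclosing method.
            static long sumOfSquares(int n) {
                java.util.List<Integer> squares = new java.util.ArrayList<Integer>();
                for (int i = 0; i < n; i++) squares.add(i * i);
                long sum = 0;
                for (int s : squares) sum += s;
                return sum;
            }

            public static void main(String[] args) {
                System.out.println(render(3));
                System.out.println(manhattan(-3, 4));   // 7
                System.out.println(sumOfSquares(4));    // 14
            }
        }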
  12. In contrast, frame-local analysis looks for objects that are never visible outside the current allocation frame [could be the current method, or the current loop iteration in some situations], and may enable the compiler to eliminate the allocation of the object altogether (it may just end up having some fields represented in machine registers, for example).
    This is the kind of thing that I am hoping will be added to JVMs in general. What I understood is that this kind of optimization is not limited to a single method but applies to any series of method calls that do not allow results to escape, including the parameters to methods. If I'm not mistaken, that this can be done is explicitly stated in the paper I reference above. Given that JVMs can inline method calls, I don't see how calling a method creates a significantly different situation.
  13. diagrams unreadable

    Good to have these preconceptions challenged, but I would have liked to be able to actually read the text in the diagrams to better understand Kirk's observations. Can you please include higher-resolution versions?
  14. Good to have these preconceptions challenged, but I would have liked to be able to actually read the text in the diagrams to better understand Kirk's observations. Can you please include higher-resolution versions?
    If you email me, I can send you the raw data and you can do your own analysis if you like. Kind regards, Kirk Pepperdine
  15. In defense of Goetz

    There are a number of aspects of this article that I'm struggling with.

    First, I don't come to the same conclusion the author comes to (about tenuring objects early) from Brian's work(s). The whole old/new generation design arose from empirical studies of how real applications (i.e. from the "trenches") behave. What they found is that in *many* applications *many* objects die young (note the emphasis). The old/new split is an optimization around commonly observed behaviors.

    Second, "Go ahead - make a mess". I think the message Brian is sending here is that the GC system works better if you don't try to outsmart it. Again, GC design is driven by real-world heuristics. Give the GC more difficult paths to optimize and it's not going to do as well. I don't think Brian is advocating doing really stupid things like hanging onto objects past their natural life of usefulness within your program.

    Sorry to be harsh here, but the conclusion I took away from this article is that the article itself is based on false conclusions and provides no valuable conclusions of its own.

    Ron Pomeroy
    rondotpomeroyatmacdotcom
  16. Re: In defense of Goetz

    I don't think Brian is advocating doing really stupid things like hanging onto objects past their natural life of usefulness within your program.
    Maybe it's me, but I thought KP was agreeing with BG on this. Did I miss something? BG says specifically not to hold onto objects longer than needed; in particular, he is arguing against object pooling.
  17. Re: In defense of Goetz

    There are a number of aspects of this article that I'm struggling with.

    First, I don't come to the same conclusion the author comes to (about tenuring objects early) from Brian's work(s).
    My point is to not tenure objects early. I think this fits into the optimization (based on observation, as you've pointed out) that most objects die within a few hundred clock cycles. Survivor spaces (hemispheric GC model) keep objects that manage to survive a wee bit longer out of old space. So the optimization seems to be: keep things out of old space unless they should be there. So the article is attempting to add a specific piece of evidence to Brian's message (which I totally agree with and you've nicely summarized here): don't try to outsmart the garbage collector. I'd even add, don't try to outsmart HotSpot. Kind regards, Kirk Pepperdine
  18. Re: Article: No Objects Left Behind

    In his recent series of talks, Brian (Goetz) makes the claim that memory allocation in Java is far cheaper than it is in C/C++
    Maybe someone can help me on this one. C/C++ and Java are languages. They do not allocate memory. Memory allocation characteristics in C/C++ apps depend on the libraries used, which differ from one to another. Similarly, the JVM allocates memory, not Java. So memory characteristics are JVM-dependent, not language-dependent. Also, aren't many JVMs written in C? So how could Java be better than C???? And, as a previous post mentioned, there is no object stack allocation available in Java while there is in C++. So not everything in C/C++ is allocated on the heap, and comparisons are not quite so easy.
  19. Maybe someone can help me on this one. C/C++ and Java are languages. They do not allocate memory. Memory allocation characteristics in C/C++ apps depend on the libraries used, which differ from one to another. Similarly, the JVM allocates memory, not Java. So memory characteristics are JVM-dependent, not language-dependent.
    Also, aren't many JVMs written in C? So how could Java be better than C????
    In C++, memory on the heap is (generally) allocated with malloc or new. The heap in a C++ program will become fragmented as the program creates and destroys objects. Objects cannot be moved around safely (as I understand it) because pointers are literal addresses in the heap space. Well, the VM could be written in some other language, but one way that a JVM written in C could deal with this is by allocating the heap as one large contiguous space. Inside this space the JVM can manage objects in any way it likes. Your assumption that a C JVM must be a thin facade over C memory management is incorrect.
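    A toy, hypothetical Java sketch of the "grab one big block and manage it yourself" idea (BumpArena is invented, not real JVM code); real collectors add per-thread allocation buffers, compaction, card tables and much more, but the core bump-the-pointer allocation really is this cheap.

        // Toy illustration: allocate a big contiguous block up front and
        // hand out pieces by bumping an index.
        public class BumpArena {
            private final byte[] heap;
            private int top = 0;                 // next free offset

            BumpArena(int sizeBytes) { heap = new byte[sizeBytes]; }

            /** Returns the offset of a freshly "allocated" block, or -1 if full. */
            int allocate(int sizeBytes) {
                if (top + sizeBytes > heap.length) return -1;  // would trigger a GC
                int result = top;
                top += sizeBytes;                // allocation is just an add
                return result;                   // and a bounds check
            }

            /** A copying collector would evacuate live data and then do this. */
            void resetAfterCollection() { top = 0; }

            public static void main(String[] args) {
                BumpArena arena = new BumpArena(1024);
                System.out.println(arena.allocate(100));  // 0
                System.out.println(arena.allocate(100));  // 100
            }
        }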
  20. C/C++ memory management

    Also, aren't many JVMs written in C? So how could Java be better than C????
    Many sophisticated C and C++ programs essentially allocate big blocks of memory and then manage them themselves. The JVM is among these programs. If you look at the C++ STL containers, you can see that the templates all take an optional allocator as a parameter. You may also notice that the default allocator is not "new" or "malloc," but a custom allocator written for the STL that is meant to be more optimal for containers, and that the objects you put into the containers move around in memory.