The Performance Paradox of the JVM: Why More Hardware Means More Failures
Administrators are seeing their JVMs not crash but pause, and simply stop responding to requests, like a 400-meter sprinter stopping to catch their breath at the end of a race.
The Problem of the Unpredictable Pause
As computer hardware gets cheaper and faster, administrators of Java-based servers are frequently running into serious problems with their runtime environments.
While our servers are getting decked out with faster and faster hardware, the Java Virtual Machines (JVMs) running on them can't effectively leverage the extra hardware without hitting a wall and temporarily freezing. Sometimes it's a ten-second freeze, while other times it's ten minutes, but every time, it's enough to frustrate users, cause retail sites to lose customers, cause business sites to start fielding problem calls at their help desks, and cause server-side administrators to become increasingly frustrated with the Java-based environment they are tasked with managing.
So, what's the problem? What's causing all of these JVMs to pause? Well, it's memory, or more specifically, memory management.
“Memory is the new disk, and disk is the new tape.” – Jim Gray
Memory is cheap. It wasn't always cheap, but it is now, and you don't have to pay tera-bucks to buy a server that has terabytes of RAM. But terabytes or not, if you allocate all of your memory to a handful of virtual machines, even if your server is only hosting a handful of applications, you’re going to start running into problems.
In years past this wasn't an issue, as 32-bit machines could only allocate four gigabytes of memory to any one process. But now people are attempting to allocate five or ten gigabytes of memory to a single JVM, and they're seeing those JVMs not crash but pause, and simply stop responding to requests, like a 400-meter sprinter stopping to catch their breath at the end of a race.
You see, the JVM handles the task of garbage collection for the developer. This means reclaiming the memory occupied by an object once nothing in the running program holds a reference to it any longer. Some garbage collection is done quickly and invisibly. But certain sanitation tasks, which fortunately occur infrequently, take significantly longer, causing the JVM to pause and raising the ire of end users and administrators alike. Of course, to understand the problem, you need to understand a bit about how the JVM manages memory as a whole.
How Garbage Collection Works
"On the JVM, memory is managed in generations, or memory pools holding objects of different ages. Garbage collection occurs in each generation when the generation fills up. Objects are allocated in a generation for younger objects or the young generation, and because of infant mortality most objects die there. When the young generation fills up it causes a minor collection. Minor collections can be optimized assuming a high infant mortality rate. The costs of such collections are, to the first order, proportional to the number of live objects being collected. A young generation full of dead objects is collected very quickly. Some surviving objects are moved to a tenured generation. When the tenured generation needs to be collected there is a major collection that is often much slower because it involves all live objects."
The Performance Paradox of Memory Allocation and Garbage Collection
And this 'much slower' garbage collection is what brings a JVM to its knees. The paradox of JVM memory management is that the more memory you allocate to the JVM, the more space is provisioned to the tenured generation, and the larger the tenured generation, the longer your JVM will be offline when it decides it’s time to plunge the pipes.
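To see how a particular JVM divides its heap into generations, and how much time each collector has spent so far, the standard java.lang.management API can be queried from inside the process. The following is a minimal sketch rather than a tuning tool: the class name GenerationReport and the helper heapPoolNames are made up for illustration, and the pool and collector names it prints depend on which JVM and which collector are in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class GenerationReport {

    /** Returns the names of the heap memory pools this JVM exposes. */
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // One line per heap pool (young and tenured spaces, under JVM-specific names).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            // getMax() returns -1 when the pool has no defined maximum.
            long maxKb = usage.getMax() < 0 ? -1 : usage.getMax() / 1024;
            System.out.printf("pool %-28s used=%d KB max=%d KB%n",
                    pool.getName(), usage.getUsed() / 1024, maxKb);
        }
        // One line per collector: how often it ran and the accumulated time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("collector %-23s runs=%d totalMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running this with different maximum heap sizes shows how the extra memory is distributed among the pools, which is where the size of the tenured generation becomes visible.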
And how bad is the problem?
“You can run everybody’s JVM today, Sun, Hotspot, OpenJDK, on a 300 gigabyte heap if you choose to. The reason nobody deploys anything above two to three, or four or six gigabytes if they are really courageous, is because JVMs will pause and stop periodically, and the size of the stop, and the length of the stop, will depend on the size of the heap. So, with a 2 gig JVM, you should expect a roughly 15 second pause every once in a while. With a 4-gigabyte JVM you need to expect a half a minute pause. With a ten gigabyte JVM you might have a pause of a minute and a half,” says Gil Tene, Chief Technology Officer and co-founder of Azul Systems.
And of course, the most disturbing part of this whole situation is the fact that you can’t accurately predict when a JVM will decide to go off and shovel its driveway; all you know is that eventually, it will.
“It’s like a ticking time bomb in your enterprise architecture. Generally, you can tune the JVM to delay it, but you can’t tune it away. So, you can make it happen half as often, or only once an hour, but it will happen. There’s no way to avoid compacting of memory in Java. There is no way to avoid fragmentation in Java. This is universally accepted as fact in the Java world.”
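The three figures Tene quotes (roughly 15 seconds at 2 GB, 30 seconds at 4 GB, 90 seconds at 10 GB) can be turned into a back-of-the-envelope estimator. The sketch below simply interpolates linearly between those quoted points. That interpolation is an assumption made here for illustration, not a model from the source, and the class and method names are hypothetical. Real pause times depend on the collector, the live data set, and the hardware.

```java
// Hypothetical estimator built only from the three figures quoted above.
class PauseEstimate {
    // Quoted rough data points: heap size in gigabytes -> pause in seconds.
    private static final double[] HEAP_GB = {2.0, 4.0, 10.0};
    private static final double[] PAUSE_SEC = {15.0, 30.0, 90.0};

    /** Estimates a worst-case pause by linear interpolation (an assumption). */
    static double estimatePauseSeconds(double heapGb) {
        if (heapGb <= HEAP_GB[0]) {
            // Below the first data point, scale the first figure proportionally.
            return PAUSE_SEC[0] * heapGb / HEAP_GB[0];
        }
        int last = HEAP_GB.length - 1;
        for (int i = 1; i <= last; i++) {
            if (heapGb <= HEAP_GB[i]) {
                return interpolate(i - 1, i, heapGb);
            }
        }
        // Beyond the last data point, extend the final segment.
        return interpolate(last - 1, last, heapGb);
    }

    private static double interpolate(int a, int b, double heapGb) {
        double slope = (PAUSE_SEC[b] - PAUSE_SEC[a]) / (HEAP_GB[b] - HEAP_GB[a]);
        return PAUSE_SEC[a] + slope * (heapGb - HEAP_GB[a]);
    }

    public static void main(String[] args) {
        for (double gb : new double[] {2, 4, 6, 10}) {
            System.out.printf("%.0f GB heap -> about %.0f s pause%n",
                    gb, estimatePauseSeconds(gb));
        }
    }
}
```

Under these assumptions a 6 GB heap lands at about 50 seconds, which helps explain why six gigabytes is described as the courageous end of the range.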
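Since the stall cannot be predicted, one option is at least to observe it when it happens. The sketch below is one simple way to do that, not something described in the source: a loop asks to sleep for one millisecond at a time and records the worst overshoot. The class and method names are hypothetical, and the number it reports includes any kind of stall (operating system scheduling as well as garbage collection), so it is an upper bound rather than a pure GC measurement.

```java
// Hypothetical stall detector; measures how late a 1 ms sleep wakes up.
class HiccupMeter {

    /** Returns the worst observed overshoot, in milliseconds, over the given number of samples. */
    static long worstStallMillis(int samples) {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Anything beyond the requested 1 ms is time the thread was stalled.
            worst = Math.max(worst, elapsedMs - 1);
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("worst stall over 5000 samples: "
                + worstStallMillis(5000) + " ms");
    }
}
```

Run in a background thread inside a busy application, a loop like this would show a stop-the-world collection as a single very large overshoot.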
So, what are the people who are running mission-critical applications doing? Well, it depends heavily on what the application does. The more work the JVM does, the closer it gets to the point where the stall is going to happen.
“Wall Street guys will size it so they can make it through a trading day, but then they’ll have a busy trading day and it will happen before the closing bell. Nobody can afford an interactive application that stops for a minute every once in a while. So that translates into nobody can afford to have ten gigabytes of memory without solving the garbage collection problem.”
So, the whole garbage collection issue limits the size of our JVMs on production machines. Beyond the pauses themselves, the other big problem is that if we didn’t have to limit our heap sizes, our applications would run much, much faster, even with the occasional JVM brain-freeze.
“There is plenty of evidence to show that the more memory that is allocated to a JVM, the faster will be the performance, and the faster will be the runtime.”
So essentially, we get a nasty performance tug-of-war. Allocating a huge chunk of memory to the JVM would allow your applications to run much faster, taking better advantage of all server resources, including memory, disk and processor cycles. But you’re still stuck with the garbage collection (GC) problem.
So, what’s the solution?
The low-cost solution is simply to slice up your server’s memory into JVMs sized at two or three gigabytes each, and deploy to those. You can do your own arithmetic on those numbers, but that’s a ridiculous number of JVMs floating around, and before you know it, the overhead of managing those five or six hundred JVM processes starts biting into the server’s capacity to actually host your applications, not to mention the administrative headache of troubleshooting misbehaving applications or digging through a multitude of log files.
A GNU GPL Licensed Solution to the JVM's Scalability Problems?
There are commercial solutions available, but perhaps the brightest light to shine on this industry-wide problem is Azul Systems' Open Initiative for Improving Managed Runtimes. This managed runtime initiative, and more specifically, the OpenJDK managed runtime project (“MRI-J”), promises to deliver a GNU GPL licensed solution to this Java runtime problem.
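The arithmetic behind that JVM count is easy to sketch. The example below assumes a hypothetical server with 1 TB (1,024 GB) of RAM, a figure chosen here for illustration since the text only speaks of terabytes, and it ignores the memory each JVM needs beyond its heap. The class and method names are likewise made up.

```java
// Hypothetical calculator for how many small-heap JVMs fit on one server.
class JvmCount {

    /** Whole number of JVMs that fit when each gets heapPerJvmGb of the server's memory. */
    static long jvmsThatFit(long serverMemoryGb, long heapPerJvmGb) {
        if (serverMemoryGb < 0 || heapPerJvmGb <= 0) {
            throw new IllegalArgumentException("memory sizes must be positive");
        }
        return serverMemoryGb / heapPerJvmGb; // integer division drops the remainder
    }

    public static void main(String[] args) {
        long serverGb = 1024; // assumed 1 TB server, not a figure from the article
        System.out.println("2 GB heaps: " + jvmsThatFit(serverGb, 2) + " JVMs"); // 512
        System.out.println("3 GB heaps: " + jvmsThatFit(serverGb, 3) + " JVMs"); // 341
    }
}
```

At 2 GB per JVM the assumed server would need 512 processes to use all of its memory, which is in line with the five or six hundred mentioned above. Each of those processes brings its own logs, ports, and monitoring.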
“The enhanced OpenJDK runtime project demonstrates how highly optimized modules working in concert across enhanced interfaces in the system stack can achieve dramatic performance and scalability improvements. The first release (version 0.9) includes a fully functional Java Virtual Machine (JVM) based on an enhanced version of OpenJDK 6. Core capabilities of JVM include pauseless garbage collection and two orders of magnitude (100x) increase in object allocation rate (and supported heap size) for dramatically improved scalability and consistency.”
It’s a bold and ambitious project, and it’s already grabbed the attention and approval of the currently unemployed Father of Java, James Gosling. “I’m excited about the Managed Runtime Initiative and the contribution Azul is making to the Open Source community,” said Gosling. “This Initiative will serve to bring new functionality to the systems stack, freeing managed runtimes to continue their growth and evolution.”
Of course, the Managed Runtime Initiative is new, having been released only in June 2010. But it does address the real problem of the scalability of Java runtimes and environments, which is essential if we are going to take full advantage of the hardware available to our modern Java-based applications.