Sun's main Java site has an intriguing interview (http://java.sun.com/developer/technicalArticles/Interviews/community/pepperdine_qa.html) with Java performance expert Kirk Pepperdine that covers bottlenecks, pain points, tuning, memory, databases, and more. But what's most interesting are his methods of calming down a team of stressed-out, frustrated developers when companies hire him to step in and solve their performance problems. Among other things, he's "rolled VMs through a cluster; neutered the HTTPSession object; used GC to slow down certain parts of the application to improve throughput; and tuned memory to some very insane configuration so that the application would run for a working day."

Here are the basics from the interview:

"What I'd discovered very early on is that the customers were often frustrated. They had tight deadlines, aggressive plans, and intense pressure to deliver -- and no matter what they did, things just didn't seem to be working. To make things worse, they 'knew' the problem wasn't their fault. I let them rant and rave and in the process vent their frustration. What happened is they ended up explaining exactly what was wrong with the system. So it both allowed them to vent and I could parrot back what they told me in the form of a diagnosis. They got to release all of their stress, and I ended up looking brilliant…

Stress prevents us from learning. The first thing I look for in an SOS engagement is a pressure-relief valve, some hack or trick to reduce the level of stress in the room. In one case, I put in a cron job that ran every 15 minutes, looking for any database transaction that had run for more than some threshold period of time. If it found one, it would kill that session. This is an ugly hack. The user whose transaction was killed certainly wasn't happy. But the hack stabilized the system enough so that most of the users who had customers in their faces got work done. It also took pressure off the developers.

Every time the system went south, which was quite often, the phones would start -- and just think of the rat in the cage being buzzed at random times. You could imagine what a relief it was to get the phones to stop ringing. You could see the stress drain out of the room and the brains turn back on. It set up an environment where we could have a meaningful discussion about a permanent fix.

I've used lots of release valves to calm stressed-out developers: I've rolled VMs through clusters, neutered the HTTPSession object, used GC to slow down certain parts of the application to improve overall throughput, tuned memory to some very insane configuration so that the application would run for a working day, and on and on. This is triage, and my only goal is to keep the patient alive to give developers time to start fixing what is broken."

I'd be curious to know about the experiences of others. What do you do when code goes haywire and it's driving you and everyone around you nuts in high-pressure situations? What hacks and tricks have you tried to create minimal system stability or functionality as a kind of holding action that reduces stress so you can move on to solve the problem?
- Posted by: John Simpson
- Posted on: July 18 2008 16:54 EDT
- Re: Kirk Pepperdine on Release Valves to Calm Developers by Sanjay Dwivedi on July 21 2008 11:10 EDT
- Re: Kirk Pepperdine on Release Valves to Calm Developers by Dave Sims on July 21 2008 11:21 EDT
- Re: Kirk Pepperdine on Release Valves to Calm Developers by Diego Fontdevila on July 27 2008 19:18 EDT
- Re: Kirk Pepperdine on Release Valves to Calm Developers by James Watson on July 21 2008 12:44 EDT
- Re: Kirk Pepperdine on Release Valves to Calm Developers by James Watson on July 21 2008 12:54 EDT
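For readers wondering what the cron-job release valve quoted in the post might look like, here is a minimal Java sketch. It is not Kirk's actual script, which the interview does not show. The names `OpenTransaction`, `SessionAdmin`, and `LongTransactionWatchdog` are hypothetical, and the database-specific part (listing open transactions and killing a session) is deliberately left behind an interface because it differs for every vendor.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of one open database transaction. */
final class OpenTransaction {
    final String sessionId;
    final Instant startedAt;

    OpenTransaction(String sessionId, Instant startedAt) {
        this.sessionId = sessionId;
        this.startedAt = startedAt;
    }
}

/** Hypothetical admin facade; a real one would wrap vendor-specific SQL. */
interface SessionAdmin {
    List<OpenTransaction> openTransactions();

    void kill(String sessionId);
}

/** One sweep kills every session whose transaction has outlived the threshold. */
final class LongTransactionWatchdog {
    private final SessionAdmin admin;
    private final Duration threshold;

    LongTransactionWatchdog(SessionAdmin admin, Duration threshold) {
        this.admin = admin;
        this.threshold = threshold;
    }

    /** Returns the ids of the sessions killed during this sweep. */
    List<String> sweep(Instant now) {
        List<String> killed = new ArrayList<>();
        for (OpenTransaction tx : admin.openTransactions()) {
            Duration age = Duration.between(tx.startedAt, now);
            if (age.compareTo(threshold) > 0) {
                admin.kill(tx.sessionId);
                killed.add(tx.sessionId);
            }
        }
        return killed;
    }
}
```

A scheduler would call `sweep(Instant.now())` every 15 minutes, either from cron as in the interview or from a `ScheduledExecutorService` inside the JVM. Returning the killed ids makes it easy to log who was affected, which matters when the hack itself upsets some users.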
Excellent points here by Kirk. Stop the bleeding, give the team some breathing room, provide some simple tools coupled with insights, and the team usually finds the hotspots based on their system/domain knowledge. But calming the team is always the first step. (Sitting on a conference bridge with 50+ people from across tech (ops, dev, platform engineering, etc.) with some senior execs listening in doesn't solve the problem; it raises the temperature instead of having any positive benefit.) I have used similar hacks, like changing a DB transaction from serializable to read uncommitted in a monitoring application used by the control center/NOC. It displayed a dynamically updated list of alarms, with 50+ users all running the list view over hundreds or thousands of alarms. It was just a read-only notification, so there was no need for it to be serializable. Such things reduce the stress so that the team can start focusing on solving the problem. -- Sanjay
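In JDBC terms, the isolation-level change Sanjay describes might look roughly like the sketch below. The class and method names (`AlarmListConnections`, `relaxForDisplay`) are made up for illustration; only the `java.sql.Connection` calls are standard API. Not every database honors read uncommitted (some silently treat it as read committed, and some drivers reject it), so treat this as an assumption to verify against your own driver.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper for connections that only feed a display-only list view. */
final class AlarmListConnections {

    private AlarmListConnections() {
    }

    /**
     * Marks the connection read-only and drops it to the weakest isolation level.
     * Only appropriate where stale or dirty reads are harmless, such as a
     * notification list that is refreshed continuously anyway.
     */
    static void relaxForDisplay(Connection conn) throws SQLException {
        conn.setReadOnly(true);
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
    }
}
```

The point of the hack is that readers stop competing for locks with the writers that insert new alarms, at the cost of occasionally showing a row that is later rolled back.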
Really interesting solutions from Kirk. I don't have any particular hack to share, only an experience. Two or three years ago, we had a user who scaled up one particular workflow higher than we had anticipated, and that user was up against a production deadline. I called it a "Category 5" problem. It must have been when the big hurricanes were hitting the US a few years ago. To make it more difficult, this user was in Oceania, so the time zones were far off. To calm the whole office down, I explained to everyone what this particular user was doing, why it was so important to them, and why it was so important to us. Everyone in the office was completely in the know and understood why we were doing what we were doing to fix the problem. We put just about every technical resource on the problem. And ordered food and coffee. The team worked nights. As we made progress, we sent pre-releases out to the user. Those pre-releases made incremental progress, which helped calm the users as they saw progress was being made. And of course, that helped calm our own team as we all received the user feedback. So in summary, we made sure everyone in our office knew the complete story, we were completely open with that user, we sent out pre-releases that showed incremental progress until the problem was completely solved, and plenty of food and coffee was made available. Cheers, David Flux - Java Job Scheduler. File Transfer. Workflow.
The author puts a very interesting perspective on the subject. Three years ago, I worked with a client where all the developers were going crazy because they had an unstable system to maintain and had users ringing them up or coming down into the office all the time to complain and try to get new requirements implemented. The first thing we did to release the pressure was to set down formally the process for modifications to the production environment, restricted to two per week (that sounds insane, I know, but it was a great improvement), and one of the programmers proposed that we keep a log of the releases we made. Not only did that get the users off the backs of the programmers, but it also helped make visible the work and rework they did every day. From then on, we started to fix the things that were really wrong with the code, with clearer heads.
"The first thing we did to release the pressure was to set down formally the process for modifications to the production environment"

This is standard IT(IL) management practice: Change and Configuration Management. Application performance management becomes manageable by following our standard methodology/process, and it is much more effective as the knowledge of the software and systems grows and resides in the organization. We recently announced such a process, which has a nice chart showing the activities and data collection techniques applicable to the application phase. http://www.jinspired.com/solutions/xpe/index.html The best release of pressure is knowing what you need to do and how to do it. William
"I'd be curious to know about the experiences of others. What do you do when code goes haywire and it's driving you and everyone around you nuts in high-pressure situations? What hacks and tricks have you tried to create minimal system stability or functionality as a kind of holding action that reduces stress so you can move on to solve the problem?"

My experience is that dysfunctional code normally cannot be fixed because management will not authorize the fixes. A previous employer of mine was the worst offender. Millions had been spent for contractors to come in and build a fancy B2B system. The result was barely usable. It fell down constantly, lost track of important messages, had very little in the way of diagnostic logging, and certain parts of it literally never worked.

To give an example of the types of issues I ran into, I once spent a day helping a coworker with part of the system. I would fix an NPE, and she'd come back with another. This happened several times, and then I just went through and fixed all the potential NPEs in that part of the system. In the process, I got really confused about how this code (which was production code) had ever worked at all. I learned later that it hadn't ever worked and the users had reverted to a manual paper-based process. The code in question, however, was officially considered to be a successful implementation.

Being on-call was a nightmare. Constantly being called at 4AM to fix some stupid issue that would be easy to fix. The only reason we never fixed these issues is that we were forbidden to do so. We could only fix issues that could not be hidden, and six months into the year we would be told not to track any more time against maintenance because the estimate had already been reached. Amazingly, we always hit our maintenance estimates perfectly. So in effect they told us to lie. I figured that while I was lying, I might as well fix some issues I wanted resolved, so I did. That's how I released the pressure. I did what needed to be done by disregarding my supervisors.

The "I could parrot back what they told me in the form of a diagnosis. They got to release all of their stress, and I ended up looking brilliant" quote is classic, and I've read the same thing from a number of consultants. Aside from demonstrating that you can be a consultant without actually knowing anything, what they are saying is that the managers don't believe their employees, but if a consultant says the exact same thing, it's got to be true. This is pretty much my experience from the employee view too. I've got to laugh to keep from crying.
"Being on-call was a nightmare. Constantly being called at 4AM to fix some stupid issue that would be easy to fix."

This should be "...that would be easy to eliminate permanently by fixing the code"