Action and Pitfall Tips Chapter for Review Posted


  1. Java Doctor is a book about diagnosing and troubleshooting enterprise applications, covering the methodologies, techniques and tools needed to successfully identify problems in scalability, performance and availability.

    The book is in the process of being written. Manning Publications has published Chapter 9, Action and Pitfall Tips, here on TheServerSide.com. The authors invite you to read these tips and to contribute your own to the chapter; selected tips will appear in the final publication, with your name credited in the book.

    Download Chapter 9: Action and Pitfall Tips

    Action tips provide information about tool usage or techniques and how a tool can be ingeniously used to identify or solve a production issue. Pitfall tips describe common problems that occur in enterprise systems and discuss the symptoms they display.

    Threaded Messages (21)

  2. Thread dump

    I cannot tell you how often doing a thread dump (kill -3) on the application server process has saved my life.

    This is a passage from my blog from last Christmas, when the system I was the architect on was having load issues on Christmas Day:

    "A few hours later the web is in bad shape again the symptoms are the application server is thread starved and the CPU usage on the box is really low. I got one of the Unix guys to do a thread dump (kill -3). It showed that all the thread were waiting on a monitor lock for an Entity Bean called ContentMapEJB. This Entity bean is a read only EJB that caches the lists for the pull down boxes on the web. Doing some google searches showed an issue with Weblogic 6.1 where it serializes the requests to Read Only EJBs.

    It is now 5pm and I suggest to my boss that we do a code change to remove the read-only EJBs and use a different MBean-oriented caching structure we had used in other areas of the application. He is reluctant at first, but no configuration change works and he agrees that we need a fix. To do a code change I have to go to the office (which is 45 minutes away), so I left my wife with her family and drove in.

    The change took 1 hour to write and 1 hour to unit test (got to love JUnit and Cactus). We ran a JMeter test at very high load which had completely buried the old read-only EJB cache; the new cache took the JMeter test like a good 'un. QA came in and ran the automated tests, and at 10am on Boxing Day we released. The MBean page we have to monitor vital stats on the application server (got to love MBeans) was a sight for sore eyes. We were still amazingly busy (80k+ activations that day) but the web site was holding and we had threads to spare.
    "
  3. Re: Thread dump

    Hear, hear!

    We had essentially this exact issue with our system, and found it in a similar way. It wasn't as glaring as in your case, but essentially it was the same problem and the same cure.

    What was most annoying, however, is that none of the "performance tuning tools" we ever tried was able to point this problem out. None of them ever raised a suspicion that this was a problem. To be fair, at the time we found it, we were evaluating a tool, and even had a team from the vendor in the office to help us. But none of them figured it out either. In the end, it was the Weblogic Console, Thread Dumps, and the fact that we had our group dedicating eyeballs to the issue that revealed the problem.

    While others may have had success with these performance tools, they never bore fruit in our system. While they have nice interfaces, I never found them much more functional or useful than hprof and some shell scripts, and with Java 5.0 the stock tools should be getting even better. To me, they're simply not worth the money and the licensing restrictions (those are the most arduous, IMHO -- having to buy a wrench and have it work on only one of your cars is kind of madness).
  4. Hi Will,

    Performance tuning tools are part of an integral development process - Software Performance Engineering (SPE). SPE tools are used from day one in determining and evaluating possible software execution models, moving eventually on to system execution models and internal benchmarking for specific platforms.

    A software execution model focuses on clearly defining the individual (and in part isolated) steps that will be executed during the various functional operations, which we could refer to as a use case. A system execution model looks at the software in a particular context - the context could include workload levels, transaction mixes, deployment architecture, etc.

    A good performance management solution (or tool) needs to support the discovery and retention of knowledge about the execution models over time (across the project lifecycle) and under different conditions, pinpointing deviations from previous or expected execution models.

    I admit that most performance management tools focus on the blame game, but this is incorrect unless of course you are talking to a senior manager who is more reactive than proactive. The blame-game message is an unfortunate sales pitch which in the long run is unlikely to deliver any real value to the team's engineering efforts and never delivers on its promises. The problem for such tools is that real-world problems cannot be solved so easily - there are many contributing factors and red herrings. Creating an intelligent management tool is tricky if the tool has had no time to be educated about the application.

    Throwing a "wrench" at the problem when the application is in (pre-)production is certainly not the most effective performance management strategy. If this is the case then the tool must employ powerful visualizations that allow the user to discover the actual executed model and match them against the knowledge base of models held within the users own mind. The focus is not on compiliancy of actuals with pre-defined specifications of behavior but on presenting the execution model in a graphical form that closely resembles the users conceptual model of the software's execution - immediate problem resolution rather than engineering.

    I would not give up on all such tools, as there are companies trying very hard to design and deliver solutions that meet the real-world needs of developers, architects, DBAs, administrators and testers.


    Kind regards,


    William Louth
    JXInsight Product Architect
    JInspired

    "J2EE tuning, testing and tracing with Insight"
    http://www.jinspired.com
  5. Hi Will,

    I forgot one very important point I wanted to make. Architects, developers, testers and system administrators need as much education about the software's execution model as the tool itself, especially since current tools lack intelligence (I am using this word very loosely here).

    I have seen on many occasions the surprise on developers' faces when our product is installed and they see the actual transactional patterns occurring during normal system operation. This surprises me because I cannot understand why developers are not interested in understanding the component or sub-system as it is being designed. These same developers routinely compile after the smallest change to a line of code for reassurance. We need to create tools that make the "test" or "verify" command as simple and as fast as the "compile" command.


    Case Study: An important JVM metrics analysis mode we recently added immediately (within one hour) identified an issue at a customer site where, over the course of an hour, thousands of threads were started and stopped by a misconfiguration of the middleware, which was attempting to resolve another performance problem via throughput throttling at the request dispatcher level. This company has its own trace framework which creates an enormous number of log entries, and yet they could not see this issue. It was probably there in the millions of lines of logs, but all they had was data and no information. It is so easy to create data but so hard to determine and present useful information. This application had been in production for months.
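    A check of this kind does not need a product; here is a minimal sketch, assuming Java 5's java.lang.management API running inside (or connected via remote JMX to) the monitored JVM. The sampling interval and threshold are illustrative, not from the case study:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Samples thread churn: a large started-thread delta alongside a flat
    // live count means threads are being created and thrown away rapidly.
    public class ThreadChurnMonitor {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            long lastStarted = threads.getTotalStartedThreadCount();
            while (true) {
                Thread.sleep(60000L); // illustrative one-minute interval
                long started = threads.getTotalStartedThreadCount();
                long startedPerMinute = started - lastStarted;
                System.out.println("live=" + threads.getThreadCount()
                        + " peak=" + threads.getPeakThreadCount()
                        + " started/min=" + startedPerMinute);
                if (startedPerMinute > 1000) { // illustrative threshold
                    System.out.println("WARNING: heavy thread churn");
                }
                lastStarted = started;
            }
        }
    }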



    Kind regards,

    William Louth
    JXInsight Product Architect
    JInspired

    "J2EE tuning, testing and tracing with Insight"
    http://www.jinspired.com
  6. Re: Thread dump

    I would also like to say that JMX rocks. The ability to instrument code is pretty impressive. When building an application, I highly recommend putting time into figuring out how to monitor it in production. I would also suggest looking at the various JMX monitoring and alerting tools that are popping up.
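    As a minimal sketch of the kind of instrumentation meant here (the AppStats/AppStatsMBean names and the ObjectName are hypothetical, and this assumes the Java 5 platform MBean server; on JDK 1.4 you would register with the app server's own MBean server instead):

    // AppStatsMBean.java - the management interface a JMX console sees;
    // it must be public and named <Impl>MBean per the standard convention.
    public interface AppStatsMBean {
        long getActivationCount();
    }

    // AppStats.java - the instrumented implementation.
    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    public class AppStats implements AppStatsMBean {
        private long activations; // simple counter

        public synchronized long getActivationCount() { return activations; }
        public synchronized void recordActivation() { activations++; }

        // Register under a hypothetical ObjectName so consoles can find it.
        public static AppStats register() throws Exception {
            AppStats stats = new AppStats();
            ManagementFactory.getPlatformMBeanServer()
                .registerMBean(stats, new ObjectName("myapp:type=AppStats"));
            return stats;
        }
    }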

    Case Study:

    We had an issue when migrating from WebLogic 6.1 and JDK 1.3 to WebLogic 8.1 and JDK 1.4.1. After the migration the average load on some machines jumped 20%. These machines specifically serviced requests from retailers.

    Looking at the new WebLogic console I saw it now had the ability to show the duration of container-managed (CM) transactions. One of our main transactions from the retailers seemed to be taking between 10 and 45 seconds. This transaction was pretty simple, doing a few simple DB calls and using a vendor product to do an encryption, so 45 seconds seemed very high. I mocked out the DB calls and ran some load tests, and the transaction time for the call was still really high at 40+ seconds under load. The next step was to put logging around all the key steps to report how long they were taking. It then became very obvious what was happening.

    We had installed a new version of the vendor code to support the upgrade to JDK 1.4.1. Our code had originally been initializing the vendor code on every transaction. In the previous version of the product this took less than 1 second, but now it was taking 10+ seconds. This simple tenfold increase was the issue. We fixed our code to initialize the vendor code only on startup; the issue was solved and load was back to pre-migration levels on all machines. :-)
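    The fix boils down to a standard initialize-once pattern; a minimal sketch, where VendorCrypto and its init() method are hypothetical stand-ins for the real vendor product:

    // Initializes the expensive vendor library once, at class load,
    // instead of on every transaction.
    public final class CryptoHolder {
        private static final VendorCrypto INSTANCE = createInstance();

        private static VendorCrypto createInstance() {
            VendorCrypto crypto = new VendorCrypto();
            crypto.init(); // the 10+ second call, now paid exactly once
            return crypto;
        }

        public static VendorCrypto get() { return INSTANCE; }

        private CryptoHolder() {}
    }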
  7. Re: Thread dump

    Looking at the new WebLogic console I saw it now had the ability to show the duration of CM transactions. ... I mocked out the DB calls and ran some load tests, and the transaction time for the call was still really high at 40+ seconds under load. The next step was to put logging around all the key steps to report how long they were taking.


    Actually, that sounds like a lot of effort in comparison to a thread dump. A series of three consecutive thread dumps over 5 seconds would have shown you the offending actions within about a minute.

    The only bad thing about these dumps is that they go to stdout and seem (surprisingly) hard for many people to read. I'm looking forward to building a readable version using Thread#getAllStackTraces in 5.0. That said, I would expect such a page from every self-respecting management console in the future.
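    A minimal sketch of the core of such a page, assuming Java 5.0 (a servlet or JMX operation could then expose the returned string):

    import java.util.Map;

    // Renders a thread dump as a string, so it can go to a web page or
    // a console instead of stdout. Requires Java 5.0.
    public class ThreadDumpPage {
        public static String dump() {
            StringBuilder out = new StringBuilder();
            Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
            for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
                Thread t = entry.getKey();
                out.append('"').append(t.getName()).append("\" state=")
                   .append(t.getState()).append('\n');
                for (StackTraceElement frame : entry.getValue()) {
                    out.append("    at ").append(frame).append('\n');
                }
                out.append('\n');
            }
            return out.toString();
        }
    }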

    Matthias
  8. Re: Thread dump

    To be honest, you are correct; I probably could have got the same results from a series of thread dumps.

    I liked using the console because it showed me, in the real production environment, a nice table (that I showed my manager) of what transactions were in flight and how long they had been in flight. Looking at it, the smell of the bad transaction (my starting place) was very obvious. Of course, this only works for business operations running in CMTs.

    I love thread dumps but getting thread dumps in a production environment of anything but the smallest company can be difficult. It generally requires getting operations management to authorise it and finding a production UNIX engineer to do it.

    As for the coding required, we already had the mock objects for unit testing, so all I needed was to add a few lines of debug profiling code. Then I used the unit test for the transaction, wrapped with JUnitPerf to provide load. This profiling code did not take much time and has been re-used several times since when running load tests on the application. We have also now put profiling code around several of the major transactions, which can be turned on as needed.
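    For anyone who has not used it, a minimal sketch of wrapping an existing test with JUnitPerf's LoadTest; RetailerTransactionTest and the numbers are hypothetical:

    import com.clarkware.junitperf.LoadTest;
    import junit.framework.Test;
    import junit.framework.TestSuite;

    // Re-uses an ordinary JUnit test case to generate concurrent load.
    public class RetailerLoadTest {
        public static Test suite() {
            Test tx = new TestSuite(RetailerTransactionTest.class);
            int users = 50;      // illustrative concurrent users
            int iterations = 10; // runs per simulated user
            return new LoadTest(tx, users, iterations);
        }
    }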
  9. Re: Thread dump

    I love thread dumps but getting thread dumps in a production environment of anything but the smallest company can be difficult. It generally requires getting operations management to authorise it and finding a production UNIX engineer to do it.

    That's true. One more reason to have the application server provide it as a standard operations service. As an alternative at the application level, I find it useful to provide a number of "logical stacks" showing the time already spent within each "frame".
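    A minimal sketch of such a per-thread logical stack, assuming Java 5.0; a real version would also need a shared registry so a monitoring page could render all threads, not just the current one:

    import java.util.LinkedList;

    // Application code pushes a named frame on entry to each logical step
    // and pops it on exit; render() shows where the current request is and
    // how long it has spent in every open frame.
    public final class LogicalStack {
        private static final class Frame {
            final String name;
            final long enteredAt = System.currentTimeMillis();
            Frame(String name) { this.name = name; }
        }

        private static final ThreadLocal<LinkedList<Frame>> STACK =
            new ThreadLocal<LinkedList<Frame>>() {
                protected LinkedList<Frame> initialValue() {
                    return new LinkedList<Frame>();
                }
            };

        public static void enter(String name) {
            STACK.get().addLast(new Frame(name));
        }

        public static void exit() {
            STACK.get().removeLast();
        }

        public static String render() {
            StringBuilder out = new StringBuilder();
            long now = System.currentTimeMillis();
            for (Frame f : STACK.get()) {
                out.append(f.name).append(": ")
                   .append(now - f.enteredAt).append(" ms\n");
            }
            return out.toString();
        }
    }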
  10. Please share your experience here first.

    Thanks.
  11. My customer complained that their app server with 4 CPUs was faster than with 8 CPUs.

    1 "kill -3" showed that half of threads are waiting for the synchronized hash map in the JVM security manager

    After switching off the security manager, the app server was 10 times faster... :)

    The other question is why the customer bought an 8-CPU machine with 32 GB of memory for one app server (a single 32-bit Solaris process)...
  12. Performance Tips Chapter

    Hi,

    I was wondering whether commercial vendors are allowed to contribute to the chapters as long as the addition is relevant and not blatant advertisement.

    I read the first chapter, and one author of a J2EE handbook mentions using an open source JDBC profiler tool. Is it possible to mention other alternatives that provide much more comprehensive contextual recording and information analysis? JXInsight [JDBInsight].

    I think I might have stated previously that looking at individual SQL strings misses the big picture both in terms of testing and performance management. I am currently writing an article that discusses this in more detail (maybe TSS is interested?). Here is a small excerpt that presents the problem somewhat simplified:


    <extract>
    1. Introduction
    This article highlights an important area that has in general not been given its due regard in the testing community – transaction testing. There are many tools, books and methodologies focusing on testing from a functionality perspective but hardly any of them discuss the importance of assessing whether the actual resource transaction execution pattern(s) are correct for the functional test case.

    In this article a resource transaction equates to a SQL database transaction. The transaction execution pattern is the history of SQL statements executed between “START TRANSACTION” and “COMMIT” or “ROLLBACK”.

    2. Black Box Testing Tools
    In testing tools, test cases equate to an entry point into a web or workstation client application, associated with an executed sequence of scripted actions that edit form or page data and click on command or navigation buttons. The test verification consists of recording the response times and throughputs as well as response validation. Black box testing asserts success based on the external communication. The tools are not concerned with how the work was accomplished by the underlying system(s). System efficiency assessment is based solely on the recorded responses under various workloads. Conformity of the actual resource transaction execution with the test case's transactional requirements is non-existent apart from very primitive data validation.

    A very simple functional test case within an application test framework would be

    1. Insert new customer data
    2. Reread customer data
    3. Compare output data against input data

    Steps 2 and 3 are solely related to functional testing. The actual user transaction is “Insert new customer data”. In the prescribed scenario a successful user transaction execution would be the insertion of 2 records in the database. The first record added to the table TBL_CUSTOMER contains basic customer information such as company name, sales category, type of company. The second record added to TBL_CUSTOMER_ADDRESS contains address data for a particular contact type, in this case billing, such as contact name, department name, street name, street number, city, post code and country.

    Current testing tools provide sufficient support for this type of testing viewing the system with its deployed applications components and services as one big black box. The test case will automate the insertion of the customer data verifying the correctness of the execution via inspection of the contents of UI controls and/or the database itself. From our experience inspection of the database by executing SQL commands after the completion of the test case is not as common as one would expect. Little support is provided for in depth analysis of the resource transaction execution patterns as a result of a particular test case execution. The tools do not tell whether the execution of a test case under (a) a high concurrent load involving the same records or (b) unhandled error conditions could result in unexpected failures and/or data corruption (because to do this requires the architect/designer/developer to create a specification of the use case in terms of resource transaction behavior).

    Less experienced architects and developers have a tendency to assume that the concurrency and resource transaction mechanisms in the database are sufficient, ignoring the fact that at higher levels within the application's architectural layers such mechanisms can easily be invalidated via resource transaction chopping, incorrect object-relational mapping and poor optimistic concurrency approaches. Overall system performance can also be impacted because in general there is a high correlation between transactions per test case and response times per test case. High transaction counts per test case are also indicative of a high number of remote procedure calls, both between the client and server and between the server and database.
    </extract>
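    To make the extract concrete, here is a minimal JUnit sketch of the functional test case above, extended with a naive transaction-pattern assertion; Customer, CustomerDao and SqlCounter are hypothetical names, not from the chapter or any product:

    import junit.framework.TestCase;

    // Functional test per the extract, plus a crude "transaction testing"
    // check on the SQL actually issued by the use case.
    public class InsertCustomerTest extends TestCase {
        public void testInsertNewCustomer() {
            Customer in = new Customer("Acme Corp", "Retail", "Ltd");
            SqlCounter.reset();                // start observing SQL traffic

            CustomerDao dao = new CustomerDao();
            dao.insert(in);                    // 1. insert new customer data

            Customer out = dao.findByName("Acme Corp"); // 2. reread
            assertEquals(in, out);             // 3. compare output vs input

            // The use case should have issued exactly two INSERTs (one into
            // TBL_CUSTOMER, one into TBL_CUSTOMER_ADDRESS) in one commit.
            assertEquals(2, SqlCounter.inserts());
            assertEquals(1, SqlCounter.commits());
        }
    }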

    If readers have opinions on or experience in this area, please feel free to post here or email me at william dot louth at jinspired dot com


    Aside: Our product JXInsight (including JDBInsight) provides a solution for the above via our transaction pattern analysis as well as our lightweight (in terms of runtime and console usage) recording capabilities. In February 2005 we will be releasing a much more advanced and intelligent unit testing solution.


    Regards,

    William Louth
    JXInsight Product Architect
    JInspired


    "J2EE tuning, testing and tracing with Insight"
    www.jinspired.com
  13. Constructive

    I am afraid to mention this. The issues with performance that I used to have with 100 users and 32 CPUs *disappeared* years ago... after I moved away from EJB and went back to using SQL-based DAOs. (I consider writing SQL to be making my application tunable. You do want a tunable app!)
    Nothing like showplan or explain SQL to show you the issues. For one, you want to avoid "order by" - you don't want 10,000 concurrent users sorting a terabyte DB on each request. (I think I will keep the solution.) Also use share-nothing load balancing/failover with a sticky bit. I have written large sites with 40,000 concurrent users, and with 10,000 concurrent users on a terabyte DB... and NO ISSUES!
    Of course, I also spend A LOT of resources on stress testing (DBMonster + OpenSTA); my bottleneck is the LAN; one site is being moved to 10-gig cards (from 1-gig cards).
    Also, TPC.org will show you the relative performance (and price/performance) of your hardware - for example, using a HW disk cache makes a HUGE difference due to "elevator seeking". What if you paid a lot of $ for a slow box? What if, according to TPC.org, your $120K box is slower than my $4K PC?
    Even LAMP is much better than EJB for scalability value, IMO. It could be just a short paragraph in the book saying:
      "EJBs don't scale well on large production systems. Vendors that over-sell are financially liable for the damages - the Uniform Commercial Code takes precedence over a software license, so you can sue any vendor. If it hurts - STOP DOING IT. And now, more interesting issues..."
    Now I am moving from JAMon (open source) monitoring to JMX. JMX is 3-4 lines of code and it has a free RI servlet monitor from Sun. It shows you cache hits, soft cache size, ...
    Another issue seems to be that some people don't do MVC: do data caching in the data layer.
    What is important is how much resource you spend to get good-enough performance.
    Ex: A client calls and says: "My system is slow, can you come?"
    - "It's cheaper for you to buy a caching disk controller for $900 than to buy me a plane ticket.
    Or - for $4K you can buy and add on a new 64-bit x86 app server whose JVM can address more than 2 GB of RAM. This is less than paying $1K per day per person to meet."
    KISS is fast. See Apache's jPetStore based on Struts, etc.
    hth,
    .V
  14. Constructive

    Hi Vic,

    I would not paint things so black and white. EJBs (and yes, including entity beans) do have their place in building enterprise systems that are componentized. A common mistake by architects and developers is not reading and understanding the first paragraph of the EJB specification.

    "The Enterprise JavaBeans architecture is a component architecture for the development and deployment of component-based distributed business applications."

    Failure to understand components and to determine whether the technology is an appropriate match for the application will undoubtedly spell disaster. The architect needs to make the appropriate engineering tradeoffs.

    With regard to performance management, I do not agree that performing a showplan for a single SQL statement solves most of the "real-world" issues I see. A showplan cannot detect issues that occur under different transaction mixes and varying concurrency levels. Also, showplans can be invalidated so easily by a minor change in the database management system. Change management is an important area of performance management tools and solutions. When a real production problem occurs, it is the accessibility of global and local execution contexts that makes the difference in how easily, quickly, and accurately the problem is found and resolved.

    Adding some debug lines into a class (as another reader described above) and using wall clock time ignores the execution context:

    1. Did a GC event occur during the execution, prolonging its length?
    2. Did random monitor contention arise during the execution of directly or indirectly called code?
    3. Was the CPU clock time low because of a wait on an I/O response, or because a thread context switch occurred?
    4. Was there external resource contention at the database level due to competing threads and processes?
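    The first three of these can at least be hinted at by sampling more than the wall clock. A minimal sketch, assuming Java 5's ThreadMXBean with thread CPU time supported and enabled on the JVM:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Compares wall-clock time with CPU time for a block of code; a large
    // gap hints at GC pauses, monitor contention, I/O waits, or context
    // switches rather than actual computation.
    public class TimedBlock {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long wallStart = System.currentTimeMillis();
            long cpuStart = mx.getCurrentThreadCpuTime(); // nanoseconds

            doWork(); // the code under measurement

            long wallMs = System.currentTimeMillis() - wallStart;
            long cpuMs = (mx.getCurrentThreadCpuTime() - cpuStart) / 1000000L;
            System.out.println("wall=" + wallMs + "ms cpu=" + cpuMs
                    + "ms (a gap suggests waiting, not working)");
        }

        private static void doWork() { /* placeholder workload */ }
    }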

    <promo>
    It is for this reason that our tool, JXInsight/JDBInsight, records the following counters for any interval event (transaction, trace, or JDBC interception):

    - Wall Clock
    - CPU Time
    - GC
    - Blocking
    - Waiting
    - Object Allocations
    - Clock Adj

    Coupled with this we provide transaction timeline analysis with powerful visualization of database, transaction and SQL concurrency levels.
    </promo>

    I hope you do not find this response an aggressive rebuttal of your opinions. My aim is to show that there are performance management vendors that are working very hard on solving the more complex issues in the most efficient manner, both in terms of runtime and offline analysis. For low-hanging fruit, system printouts can be effective during development and when working on large, isolated issues.


    William Louth
    JXInsight Product Architect
    JInspired

    "J2EE tuning, tracing, and testing with Insight"
    http://www.jinspired.com
  15. Constructive

    Thanks William,
    Yes. I think that replacing EJB is low-hanging fruit.

    Saving 100 nanoseconds or 23 microseconds in a thread... the users can't tell that the performance improved. That is like inlining a function in C++, removing OO, to save a clock tick.

    I would add that, in general, saving a total of less than 1/3 of a second is of no use, because users can't tell the difference. We should only look for big things that users can tell improved.

    Overall, on commercial systems, the slow part is DB access, what you call IO. A good DBA knows many tricks.
    I do not do many departmental systems, and can't imagine that it is possible to write a slow one.
    .V
  16. Hi Vic,

    Two points I would like to make.

    1. If the saving is small in terms of a single execution but the quantity is large and the saving occurs on a shared and limited resource, then the benefits of tuning justify the expense of the change. Saving a few milliseconds per user transaction can make all the difference when the system is heavily loaded in terms of concurrency - it all comes down to service times and queue lengths (see the short queueing sketch after these two points). Agreed, the most expensive and limited resource is the database. For basic inter-JVM component communication it takes a lot of "innovation" to slow down a system. If one were to look at the large container call stack prior to a component request dispatch, one would be somewhat reluctant to spend time profiling one's own little bean. It takes a lot of JVM executions within a component bean for it to appear on the radar. You need to think about the "precious" resources accessed and consumed during the execution: CPU, IO (DB/JMS/XML), memory (object allocations), thread monitors, etc.

    2. Another benefit in tuning a system to perform better is that the organization acquires a better understanding of how to design, develop, and deploy high performance solutions. We should learn from our mistakes and acquire a better understanding. Again a good performance management solution aids the architect and developer in understanding the execution behavior and runtime performance of their systems, applications, modules, and components. The knowledge gained is applicable outside of the performance domain.
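    To put numbers on the service-time argument in point 1, the standard M/M/1 single-queue approximation relates response time R to service time S and utilization \rho (this is textbook queueing theory, not a claim from this thread):

        R = \frac{S}{1 - \rho}, \qquad \rho = \lambda S

    With S = 50 ms and an arrival rate \lambda = 15 requests/s, \rho = 0.75 and R = 200 ms. Trimming just 5 ms off S gives \rho = 0.675 and R \approx 138 ms - a 10% saving in service time buys roughly a 30% drop in response time under load.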


    Kind regards


    William Louth
    JXInsight Product Architect
    JInspired

    "J2EE tuning, testing and tracing with Insight"
    http://www.jinspired.com
  17. Constructive

    "Adding some debug lines into a class (as has been stated above by another reader) and using the wall clock time ignores the execution context:

    1. Did a GC event occur during the execution prolonging the length?
    2. Did a random monitoring contention arise during the execution direct/indirect called code?
    3. Was the CPU clock time low because of a wait on a IO response or because a thread context switch occurred?
    4. Was there external resource contention at the database level due to competiting executing threads and proceses."

    I agree, but often doing simple stuff like I mentioned gets you pretty far, fast, on the problems I described. I don't think a sledgehammer of a tool is often the right place to start for a lot of the problems I have seen. Sure, if doing the basics doesn't give you an idea of the issue, then it may be time to jump into the deep end and look at the strategies for implementing a tool like yours.

    I also agree with your comments about EJBs. People often blame them far too much for their own bad architectural decisions.
  18. Constructive

    after I moved away from EJB and went back to using SQL-based DAOs.

    1) I use DAO for both entity EJBs and JDBC

    2) IMO entity EJBs or any other OR mapping should be used only as a DB cache: a small amount of data accessed many times.

    Let's say there are 2 DB tables: product and product_group. About 100,000 products and about 100 product groups.

    I implement the product_group DAO with entity beans and cache them all in memory.

    For reading all products or for searching I use the JDBC DAO.

    For editing products I use the entity DAO. A user will update only a few products, and probably the same one several times.

    3) I use only local interfaces.

    Comments are welcome!
  19. Constructive

    Damian,

    - That design works for a DB that fits in RAM. I don't work on such systems.
    - Using EJB just for caching? You can drive a car with your feet, but it's not a good idea.

    Take a peek at a good design: Apache's Struts-based jPetStore at ibatis.com.
    It uses externalized SQL strings and an automatic data cache.

    .V
  20. Constructive

    Hi Vic,
    That design works for a DB that fits in RAM.

    No. I use a combination of:
    - OR mapping (in my case EJB 2.1 in Sun's app server, implemented by JDO) for small data sets which are accessed many times
    - JDBC for stuff that does not make sense to cache, e.g. big data sets accessed only once
    - DAO as an encapsulation
    - and I do not use a rich OO domain model. Only ERD -> Java.

    I believe it is the ideal combination for middle/large-size enterprise apps.
    Using EJB just for caching?:
    No, I wrote "entity EJBs". And I do not have any business logic in the entity EJBs.
    Apache's Struts-based jPetStore at ibatis.com.

    The jPetStore PDF doc from 2002 says that it uses:
    - Struts
    - a simple OR mapping framework, the iBATIS Database Layer

    in my case I go for:
    - JSF, which is also MVC and IMO quite similar to Struts
    - sophisticated OR mapping (entity EJB 2.1 implemented by JDO) + a JDBC framework, all encapsulated by DAO interfaces to hide the difference between entity EJBs and JDBC

    Regards,
    Damian.
  21. From The Authors

    Hi TSS Community,

    If you are interested in having your name in print, then read on!

    For a limited period, TSS, Manning and the authors of Java Doctor will be collecting troubleshooting tips from people like you. If selected, your tip will be published in a specially designated chapter within the soon-to-be-released Java Doctor book. You'll get your name and contact information listed alongside your tip.

    If you are interested, take a look at Chapter 9 for more information, examples, and most importantly, the guidelines you will need to follow for your tip. In addition, we suggest taking a quick read of the Introduction (to be posted shortly) to get an idea of what the book is or isn't about! Only the best of the tips will be selected for print!

    Got one? Send your tip to javadoctor at manning dot com

    Good Luck!


    TSS Java Doctor Page

    Manning Java Doctor Page

    Java Doctor Authors,
    Jamiel Sheikh
    Ali Syed
  22. From The Authors

    I've received numerous questions about the deadline for submissions. Unfortunately, a slightly older copy of Chapter 9 was posted earlier (it has now been corrected), which stated a deadline that is already past.

    There are no hard deadlines; however, we will begin our review and selection process on February 1st, and then decide whether we need to close submissions or solicit more.

    So please send us your tips; we'd like to publish them and give you credit for them.

    Thanks

    Jamiel Sheikh