StackProbe - a new Java profiler

  1. StackProbe - a new Java profiler (28 messages)

    StackProbe can connect to local or remote applications through JMX without adding significant overhead, while still giving accurate results. It periodically takes snapshots of thread stack traces and analyzes them in real time (a rough sketch of the technique follows the feature list below). The main advantages of this technique are:
    • You can control overhead by dynamically adjusting the sampling rate.
    • There is no risk of breaking the production system - no code is ever modified.
    • The slight slowdown caused by sampling is distributed evenly among all methods in the code, so the relative values reported by the profiler are very accurate.
    • StackProbe can estimate the statistical error of the reported results. To our knowledge, it is the first profiler to do this.
    • The profiler is ready to work just after establishing a JMX connection.
    • No special agents are required to be installed on the remote side.
    • You can see the first results very early, as soon as the first few samples arrive.
    StackProbe is free for open-source projects. More information and an online demo can be found at http://www.stackprobe.com/.
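    For illustration, here is a minimal sketch of this kind of JMX-based sampler. It is not StackProbe's actual implementation: the connection URL, sample count and interval are placeholders, and it credits each sample only to the topmost stack frame.

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;
      import java.util.HashMap;
      import java.util.Map;
      import javax.management.MBeanServerConnection;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class JmxStackSampler {
          public static void main(String[] args) throws Exception {
              // Placeholder endpoint; any JVM started with remote JMX enabled would do.
              JMXServiceURL url = new JMXServiceURL(
                      "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                  MBeanServerConnection server = connector.getMBeanServerConnection();
                  ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                          server, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

                  Map<String, Long> topFrameHits = new HashMap<>();
                  long totalHits = 0;
                  int samples = 1000;     // the total sample count drives the accuracy
                  long intervalMs = 100;  // the sampling interval drives the overhead
                  for (int i = 0; i < samples; i++) {
                      for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                          StackTraceElement[] stack = (info == null) ? null : info.getStackTrace();
                          if (stack == null || stack.length == 0) continue;
                          // Credit the sample to the topmost frame ("self" time); a real
                          // profiler would also use thread state and the deeper frames.
                          String method = stack[0].getClassName() + "." + stack[0].getMethodName();
                          topFrameHits.merge(method, 1L, Long::sum);
                          totalHits++;
                      }
                      Thread.sleep(intervalMs);
                  }
                  final long total = totalHits;
                  topFrameHits.entrySet().stream()
                          .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                          .limit(20)
                          .forEach(e -> System.out.printf("%6.2f%%  %s%n",
                                  100.0 * e.getValue() / total, e.getKey()));
              }
          }
      }

    Raising or lowering intervalMs is exactly the sampling-rate knob mentioned above: the accuracy of the percentages depends on how many samples you collect in total, not on how quickly you collect them.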

    Threaded Messages (28)

  2. It is NOT free

    StackProbe is not free; only the demo is "free". Actually using it requires you to purchase a license.
  3. The product is free for Open Source projects. It doesn't claim to be free for commercial projects.
  4. From the site:

    " When first started, StackProbe runs in demo mode until you install a valid license key. In the demo mode, you are allowed to profile only StackProbe itself and you cannot attach it to other local or remote Java applications. To get rid of this limitation, obtain a free trial or purchase a commercial license, and then click "Install license key" button to input the license data you have received: " i don't say anything about os or non-os projects, in my book purchase means money.
  5. Check the following link

    Perhaps it is not clear from Webstart, but the information is on the site: http://www.stackprobe.com/pricing.php
  6. VisualVM

    How is it better than VisualVM? (well, for me VisualVM doesn't work well, probably because I have a Mac... but otherwise).
  7. Profiling: Sampling versus Execution (Part 1)
    http://williamlouth.wordpress.com/2009/01/07/profiling-sampling-versus-execution-part-1/
    "Unless we are talking about a “HelloWorld” application with only one main thread of execution being profiled a dynamic strategy based execution profiling (metering) solution can indeed out-perform simplistic sample based profilers whilst collecting much more relevant data, discarding noise, at a much higher degree of accuracy."
    Profiling: Sampling versus Execution (Part 2)
    http://williamlouth.wordpress.com/2009/01/16/profiling-sampling-versus-execution-part-2/
  8. I am not entirely against thread stack analysis, only the way that most standard solutions do it, which is largely inaccurate, single-meter based, full of noise, and terribly expensive, especially when performed via JMX (remote or local). Better Java Thread Stack Trace Dumps: http://williamlouth.wordpress.com/2009/02/08/better-java-thread-stack-trace-dumps/
  9. Yes, we all know that JXInsight does it the only RIGHT WAY ;) Seriously, though, it is funny how the benchmarks on your blog contradict your conclusions. For the first "sampling profiler" you tested, you get less than 20% overhead for profiling EVERY METHOD of a program with a stack trace depth of 400. Your other benchmarks, of the instrumenting profilers, never get even close to that (I'm comparing the profiled runs with the baseline on your graphs).
    What you completely ignore is that an instrumenting profiler can never be as fair as a sampling profiler and will never get you accurate results, because it creates systematic, asymmetric overheads. Sorry, but instrumenting a method with a call to the timer yields enormous overheads (you even mentioned it somewhere - about hundreds of ns). Hundreds of nanoseconds is a huge period of time compared with the time required to execute a small method. You get rid of part of the overhead by instrumenting only SOME methods, but that is just a hack forcing the user to perform many profiling iterations with different profiled sets. Sampling lets you turn the sampling rate down to almost zero, where there is practically no overhead. The only downside is that you must wait longer for the results.
    And if you claim that sampling through JMX is terribly expensive, well... I suspect that is not true unless you give us some evidence. I've just connected hprof in sampling mode to a complex application calculating optimal index sets for an RDBMS using genetic algorithms, and I hardly ever see a 30% slowdown. When I connect an instrumentation-based VisualVM, I get a 5x slowdown and the results are bogus for some short methods, no matter how long I run the profiler. But I could be wrong. One application and two open-source implementations are not enough to draw general conclusions. Give us a demo of the JXInsight profiler and we'll see.
  10. OpenSource

    Yes, there is a bug in the description on our website. Of course, StackProbe is free for open-source projects.
  11. An addendum to my reply to William Louth: when writing about the sampling profiler from your tests, I meant "profiler B" on your graphs. I can see about 20% overhead for stack traces of depth 400. Decrease the sampling rate by a factor of 4 and you get 5% (if the profiler supports a configurable sampling rate). And now try to do that with any instrumenting profiler :P
  12. JXInsight's overhead for an instrumented method that is not a hotspot (determined dynamically at runtime) is as low as 23 nanoseconds. We can drop this down even further, to 10-13 nanoseconds, with global/thread/probe disablement. But let's not stop there: we can even generate a hotspot extension jar on a second run which instruments only those methods previously detected (in test or pre-production) to be hotspots. Instrument->Measure->Refine->Repeat http://williamlouth.wordpress.com/2009/04/06/instrument-measure-refine-repeat/ Now tell us all how long, and at what cost (CPU, clock time, allocated objects/bytes), a call stack sample takes with 50 threads running in a small web container via JMX. Do you know this?
  13. As little as you wish. Take 1 sample per minute and the overhead becomes unmeasurable. For exact data you will have to wait until tomorrow, when I actually measure it. I don't suppose a single sample takes longer than 5 ms, which could mean as little as 0.1 ns per method call for a large number of method calls, or 1000 samples gathered in 10 seconds with less than 50% overhead. 10 ns for EACH instrumented method call is STILL a huge overhead for an inlined getter or setter, which can take less than 0.1 ns.
  14. Sampling cannot handle requests (never mind calls) that are shorter than the actual sampling collection and we know how slow that is.
    Any references / evidence? This is simply not true. It is perfectly possible to profile 10 µs requests with a 1 s sampling interval. No problem here. You just have to get many such 10 µs requests, which is actually very easy in a loaded application. BTW: instrumenting a class causes a larger service lag than taking a stack trace snapshot, so instrumenting is not suitable for hard real-time systems either. But financial systems are not hard real-time, so no problem here.
  15. Instrumenting a class causes larger service lag
    Instrumentation takes time at load time, but we have an instrumentation cache which gets used on the second run of an application (note: you can copy the cache from staging into production), so there is no class-instrumentation overhead then. Couple that with the generation of a true hotspot instrumentation extension and you have accurate service level management reporting and multi-resource metering, all at an unbeatable cost...
  16. Ok, just to clear some things up:
    1. I agree the class-loading problem is not a real problem, as GC often causes larger pauses. However, we are not talking about hard real-time systems, so it's a tie here. Even 0.1-second pauses are tolerable for most server-side systems; otherwise you should never use the default JVM -server GC with large heaps and should switch to CMS or G1 instead (G1 once it is production ready).
    2. Instrumentation, as you stated, gives you other useful information that sampling cannot provide. It is perfect for monitoring allocation / resource usage. I agree with that completely.
    But:
    1. Instrumentation cannot give you line-level resolution. Well, StackProbe does not have it yet, but some expensive commercial profilers do.
    2. Instrumentation is useless for accurately profiling methods shorter than ~20 ns. With an instrumentation overhead of about 23 ns per method call (these are your words, sir), how can you ACCURATELY measure the execution time of a 20 ns method? You cannot even be sure whether for that particular method the actual overhead is 20 ns or maybe 10 ns, as many things affect it (like HotSpot optimizations). And 20 ns is a lot of time, enough for many short method calls.
    3. The hotspot detection hack used by JXInsight to selectively instrument only some methods (those not on a hotspot) is nice, but of little practical use. I want to profile the hotspots, not avoid them. I want to know why a given place is a hotspot, and this often leads to analysing very short methods deep in the call tree, like operations on collections.
    4. Sampling can accurately measure relative execution time of methods as short as a single inlined getter/setter.
    5. Sampling can be as lightweight as you wish by adjusting the sampling rate, and even lighter if you profile just a subset of threads. I do not need to profile all 50 identical worker threads of the servlet container; just one is enough (see the sketch at the end of this post).
    6. Sampling cannot give you absolute timings, but sampling combined with instrumentation that merely counts method calls can - and that is still lighter than reading the wall clock / performance counters from within every instrumented method.
    BTW: Interestingly, JXInsight is not really a competitor to StackProbe, nor vice versa. You are like an employee of an SUV company bashing a Smart. :D
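    As a rough illustration of point 5 (not StackProbe's actual code; the thread IDs and the depth cap are assumptions), a sampler can restrict itself to chosen threads through ThreadMXBean instead of dumping every thread on each tick:

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;

      class SubsetSampler {
          // Sample only the threads the caller cares about, with a capped stack depth.
          static void sampleOnce(long[] workerThreadIds) {
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              ThreadInfo[] infos = threads.getThreadInfo(workerThreadIds, 64); // at most 64 frames each
              for (ThreadInfo info : infos) {
                  if (info == null || info.getStackTrace().length == 0) continue;
                  // Count only samples where the thread is actually running.
                  if (info.getThreadState() == Thread.State.RUNNABLE) {
                      // ... credit this sample to the frames of this thread ...
                  }
              }
          }
      }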
  17. There is no bashing going on. This is how I debate things, in the hope that I will hear an argument that I had not previously factored into my assessment of a particular technology or data collection technique. I do think sampling has its place in the toolset of a performance troubleshooter, and to a lesser degree in performance monitoring. The benefit I see with sampling is that once it stops, no overhead is incurred. But if we are really talking about production monitoring, why would anyone stop monitoring? No one turns off service level management reporting. That said, sampling can be used in cases where the instrumentation applied has not (un)covered a code base or execution path that is presenting a problem in production. Sampling is one of a number of inputs to a review of instrumentation strategies and coverage.
    I wrote the sampling vs execution analysis articles a while back. Any perceived bashing was of the data collection approach, irrespective of a particular technology or product. I do plan on adding call stack sampling capabilities to our OpenCore Metrics runtime, but from a somewhat different perspective in terms of what is actually recorded and stated - I could not bring myself to state clock times, because this would be a lie, but there is something else that is important to note about such sampling... It is not going to be on by default, as I still think the overhead is too high in a Java EE multi-core environment compared to what we offer in our probes technology. The high overhead is not just because of the number of threads or large call stack depths, but because of the delay (processor utilization drops across the board) experienced during collection, very much like a GC stop-the-world event, and the amount of rubbish that is created in doing so. If the JVMs offered a much more efficient mechanism, this would be much more viable.
  18. Instrumentation cannot give you line-level resolution. Well, StackProbe does not have it yet, but some expensive commercial profilers do.
    Instrumentation is useless for accurate profiling methods shorter than ~20 ns.
    No one instruments such methods, though sampling will record them on the stack. We have 700 extension jars targeting different technologies; we do not need to instrument all those 200-400 methods in the call stack that are being sampled. Most users target entry/exit/transition points, removing all the noise of a sampling tool.
    The hotspot detection hack used by JXInsight to selectively instrument only some methods (not on the hotspot) is nice, but of little practical use.
    You seem confused. The whole point of our hotspot strategy based approach is to focus specifically on the hotspots - not to ignore them. Note that we have many other strategies which can be chained together, including sampling, busy, busythread, concurrent, re-entrant, random, warmup, interval, frequency, checkpoint, dynamic, initial, highmemory, burst, delay, ...
    Sampling can accurately measure relative execution time of methods as short as a single inlined getter/setter.
    Now this is interesting. I have never heard such wild claims from any other sampling solution, so please do tell us all how this can be so when you cannot even accurately detect whether the stack has actually changed or remained stalled since the last sample (it's guesswork). The same call stack for a thread does not tell you whether you are looking at the same request being processed or a completely different one (contextually speaking). I would also like you to state the cost of collecting this data ** remotely ** via JMX for a reasonably sized Java EE application with concurrent request processing.
  19. We are speaking of two different applications of profiling. You speak of performance monitoring of production systems. Neither StackProbe, hprof nor the JSamp sampling profiler is targeted at that. I speak of on-demand debugging / profiling, mainly in the test environment, but sometimes also in production. If a service seems to be overloaded, we quickly attach a profiler to it and see what is going on. Sometimes it is not possible to recreate the same conditions in a test environment. You say there is no need to profile short methods. No - what you actually mean is that there is no need to MONITOR short methods in the production system. I agree, and I agree instrumentation is best suited for that. But once a performance problem arises in one of those long methods, you don't know the exact cause. The exact cause may be a chain of very short methods called very often, which you are not able to profile accurately with an instrumentation-based solution.
    Sampling does not require the stack traces to remain unchanged between samples. It does not matter, and it is not even checked. If we find a thread "sitting" in the same method every time we sample, it does not change anything whether this method was called once (and is doing heavy loops) or is a simple one-liner that gets called millions of times per second. In both cases the cumulative time spent in that method is the same - and it can be reported with great accuracy, provided that you get enough samples (1000 is enough to get worst-case 3% accuracy). By sampling the thread's status as given by the JVM, we can know the cumulative time the method was actively executed (runnable), waiting or blocked for the given thread - another thing instrumentation cannot provide. The cumulative time spent in a method is the first and most interesting parameter when optimizing any application. Methods with low cumulative time don't count - they are not a performance problem.
    What we still don't know is how long a SINGLE call of a method takes. Here instrumentation helps a lot - but it is still much better together with sampling: get the cumulative time from sampling and divide it by the number of calls from instrumentation (really lightweight). The gprof and AMD CodeAnalyst C/C++ profilers work this way, and they are very well trusted profilers. And sorry for the word "hack" - I meant it in a positive way. A hack is usually an innovative, clever workaround for some problem - in this case the problem of high method instrumentation overhead. This IS used by some other profilers (e.g. YourKit), but usually in an on-demand manner - not as automated as in JXInsight.
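    For what it's worth, a minimal sketch of that combination (purely illustrative - not how StackProbe or gprof actually implement it; the class and method names are made up): the instrumented code only bumps a counter on method entry, and the per-call estimate divides the sampled cumulative time by that count.

      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.atomic.LongAdder;

      public class CallStats {
          private static final ConcurrentHashMap<String, LongAdder> CALL_COUNTS =
                  new ConcurrentHashMap<>();

          // The only thing instrumentation injects at each method entry: one counter
          // increment, no clock reads, so the per-call overhead stays tiny.
          public static void enter(String method) {
              CALL_COUNTS.computeIfAbsent(method, k -> new LongAdder()).increment();
          }

          // cumulativeTimeNanos comes from the sampler: (samples hitting the method /
          // total samples) multiplied by the elapsed wall-clock time of the session.
          public static double estimatedTimePerCallNanos(String method, double cumulativeTimeNanos) {
              long calls = CALL_COUNTS.getOrDefault(method, new LongAdder()).sum();
              return calls == 0 ? Double.NaN : cumulativeTimeNanos / calls;
          }
      }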
  20. Quick benchmark

    Ok, I connected StackProbe to the application server serving mobile pages for one of our clients (in a test environment). This is quite a standard Jetty + Spring + Hibernate stack. First, some performance results for the unprofiled server.
    First run:
      pkolaczk@darkstar:~/tmp$ http_load -parallel 5 -fetches 1000 urlfile
      1000 fetches, 5 max parallel, 625000 bytes, in 6.50617 seconds
      625 mean bytes/connection
      153.7 fetches/sec, 96062.7 bytes/sec
      msecs/connect: 0.058479 mean, 0.189 max, 0.014 min
      msecs/first-response: 14.5792 mean, 179.396 max, 1.982 min
      HTTP response codes: code 200 -- 1000
    Second run:
      pkolaczk@darkstar:~/tmp$ http_load -parallel 5 -fetches 1000 urlfile
      1000 fetches, 5 max parallel, 625000 bytes, in 6.86647 seconds
      625 mean bytes/connection
      145.635 fetches/sec, 91022.1 bytes/sec
      msecs/connect: 0.058318 mean, 0.162 max, 0.014 min
      msecs/first-response: 15.271 mean, 369.553 max, 1.934 min
      HTTP response codes: code 200 -- 1000
    Now with the profiler attached and the sampling mode set to "medium" (which means about 3 samples per second):
      pkolaczk@darkstar:~/tmp$ http_load -parallel 5 -fetches 1000 urlfile
      1000 fetches, 5 max parallel, 625000 bytes, in 6.98356 seconds
      625 mean bytes/connection
      143.193 fetches/sec, 89495.8 bytes/sec
      msecs/connect: 0.058079 mean, 0.149 max, 0.014 min
      msecs/first-response: 16.397 mean, 209.892 max, 1.924 min
      HTTP response codes: code 200 -- 1000
      pkolaczk@darkstar:~/tmp$ http_load -parallel 5 -fetches 1000 urlfile
      1000 fetches, 5 max parallel, 625000 bytes, in 6.64917 seconds
      625 mean bytes/connection
      150.395 fetches/sec, 93996.6 bytes/sec
      msecs/connect: 0.057182 mean, 0.254 max, 0.014 min
      msecs/first-response: 17.303 mean, 390.209 max, 1.943 min
      HTTP response codes: code 200 -- 1000
    And now a test stressing the server more with our profiler - sampling rate set to 20 samples/sec:
      pkolaczk@darkstar:~/tmp$ http_load -parallel 5 -fetches 1000 urlfile
      1000 fetches, 5 max parallel, 625000 bytes, in 7.63574 seconds
      625 mean bytes/connection
      130.963 fetches/sec, 81852 bytes/sec
      msecs/connect: 0.067142 mean, 0.285 max, 0.014 min
      msecs/first-response: 18.6212 mean, 412.277 max, 1.848 min
      HTTP response codes: code 200 -- 1000
    From the test I can also see that the running application slows down the profiler more than the profiler slows down the application - the sampling rates are higher for an unloaded app server. The server used 41 threads and it seems the bottleneck is the database ;) Well, we will have to tune it - it is not ready yet. All tests were performed on an AMD Athlon 3700+ computer; the profiler was using a local JMX connection.
  21. Re: Quick benchmark

    Well, if it is all IO bound, why would you expect the application to be slowed down, especially with only 5 threads running for such a short window? This surely is not representative of real-world performance stress/load testing. I am even surprised you managed to slow it down at all, which begs the question... I am not sure what 3 samples a second (a 333 ms interval) tells anyone from the resulting data. Most sample-based profilers start in the 5-10 ms range. Your "stress" test of the profiling stack has this pegged at 50 ms - 10 times the usual coarse-grained interval. Why so high? At such a resolution you are effectively making every method look like a "short" method, resulting in large overstatements and missed functions. Have I misread your data and approach? The server had 40 threads. How many were actually involved in the work being performed? 5? Please try a little bit harder. SPECjAppServer? SPECjvm2008? Now the million dollar question is "why is the profiler slowing down?" Why don't you report the container's CPU/heap/GC stats during the testing? Why no report of the time it actually takes to collect and process a single dump remotely? Would it be close to 50 ms?
  22. Re: Quick benchmark

    This was just a quick, unscientific benchmark, just like the unscientific benchmarks on your blog. There is no slowdown here (the differences between these runs are statistically insignificant). The application was, like many enterprise applications, IO bound. The interesting conclusion is that for IO-bound applications you can use the "fastest" mode without hurting your target application's performance. Quite nice :) However, checking this with a complex CPU-bound numerical application gives similar results. With 3 samples/s there is almost no slowdown. With the maximum possible sampling rate there is, unsurprisingly, about a 50% slowdown when running on a single-core CPU: the OS tries to give a fair amount of CPU time to both the profiler and the profiled application. So, nothing strange here. 3 samples per second is enough. It takes just a little over 5 minutes of testing to get 3 p.p. accuracy of relative times, and just about 10 seconds to spot the large bottlenecks. I've written this at least two times already: the sampling rate HAS NOTHING TO DO with the final accuracy you can get. It is the total number of samples that does. It seems you don't understand the basics of sampling. Before you post a reply, please read thoroughly: http://en.wikipedia.org/wiki/Performance_analysis
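    A back-of-the-envelope check of those accuracy figures, assuming independent samples and a simple binomial model (my assumption - the estimator StackProbe actually uses is not described here):

      class SamplingError {
          // Worst-case 95% margin of error for a method's reported share of time,
          // treating each sample as an independent Bernoulli trial.
          static double worstCaseError95(int samples) {
              // standard error of a proportion is sqrt(p * (1 - p) / n), largest at p = 0.5
              return 1.96 * 0.5 / Math.sqrt(samples);
          }
      }
      // worstCaseError95(1000) ~= 0.031 -> roughly +/- 3 percentage points
      // worstCaseError95(100)  ~= 0.098 -> roughly +/- 10 percentage points
      // The margin depends only on the total sample count; the sampling rate just
      // determines how long collection takes (1000 samples at 3/s is ~333 s, a little
      // over 5 minutes; at 20/s it is ~50 s).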
  23. The hotspot detection hack used by JXInsight
    A hack? You really have completely lost the argument and with that wording all credibility. What you call a hack is what others call an innovative multi-resource multi-strategy based (chaining) metering runtime.
  24. I am sure you will be delighted to know that we have a new update coming out this week that reduces the overhead further for non-hotspots. Down to 13 ns! To put that in context, the cost of a System.nanoTime() call is 55-60 ns on the same (somewhat old) MacBook Pro. The nearest profiling solution had a cost of 660 ns, which makes us at least 50 times faster than the second best, which itself has a huge lead on the solution in third place. I am sure we will eventually beat that number down further, especially when we release an alternative to our AspectJ-based extension library set.
  25. We will have a release out later today which drops the overhead down to 1 nanosecond for non-hotspot methods allowing us to process 1 billion operations per second. Thanks for the extra push.
  26. There is nothing accurate about sampling. Call stack sampling is only useful for identifying low-hanging fruit (surely never in a production application). Instrumentation is required to validate/filter/quantify the results of sampling. Sampling is not at all suitable for production, at least not in a managed environment where accurate response times are needed for service level management reporting (variances, max, trends, baselines). Sampling is a single-meter-based solution and ** cannot assign ** other resource/system costs to call stacks or methods (including aggregation at package/class/method levels). Those days are over - very much so in the cloud. Sampling cannot handle requests (never mind calls) that are shorter than the actual sampling collection, and we know how slow that is. This rules out most financial applications or well-tuned applications. A call stack dump can be a useful starting point for investigation on an application that is a dog (or a snail), but nothing beats having real data with statistical measures.
  27. Yes, we all know, that JXInsight does it the only RIGHT WAY ;)
    Thank god for that. I would hate to find out that I wasted so many years going in the wrong direction. Or maybe JXInsight does it the right way because I spent all those years refining (refactoring) my approaches to performance engineering and application performance management, starting with resource transaction path analysis -> distributed (contextual) tracing -> runtime state/flow diagnostics -> activity resource metering -> integrated software + system metrics -> ... Each year I try hard to prove the old William wrong. ;-)
  28. Usefulness of StackProbe

    First of all - wow! What a p***ing match (in some of those previous comments). I have to say that I have actually tested StackProbe on a production system using a trial license and did find it quite useful. The overhead from profiling at the default sampling rate was not noticeable at all. Obviously, the longer you sample, the better the results you get. I did a number of sampling sessions, the longest being ~6200 samples. Note that this was on a box serving about 1000 HTTP req/s at peak.
    One of the topics mentioned in a thread above was profiling really short-running methods. I have to say that I did find an instance of HashSet rehashing occurring because a HashSet was not sized appropriately. StackProbe identified the percentage of time that was being spent rehashing and was extremely useful.
    On the downside, I now want to buy it, but the procedure to buy seemed a bit strange. My co-worker doing the purchase told me the response he received was something like "We'll release the license once we verify the funds are in our bank account." Apparently we can't use a credit card? He thought this was a bit fishy and it may take me some time to actually get the license :( Hopefully the purchase process will improve. Cheers, John
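    For what it's worth, the rehashing John describes is usually avoided by pre-sizing the collection for the expected element count; a minimal sketch (the sizing formula assumes HashSet's default 0.75 load factor, and the expected count is whatever your workload dictates):

      import java.util.HashSet;
      import java.util.Set;

      class PresizedSetExample {
          // Capacity chosen so 'expected' elements fit under the default 0.75 load
          // factor, which avoids repeated rehashing as the set grows.
          static Set<String> newKeySet(int expected) {
              return new HashSet<>((int) (expected / 0.75f) + 1);
          }
      }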
  29. Re: Usefulness of StackProbe

    It took a little longer than I had hoped but the license did arrive.
    On the downside, I now want to buy it, but the procedure to buy seemed a bit strange. My co-worker doing the purchase told me the response he received was something like "We'll release the license once we verify the funds are in our bank account." Apparently we can't use a credit card? He thought this was a bit fishy and it may take me some time to actually get the license :(

    Hopefully the purchase process will improve.

    Cheers,
    John