This suggests that you have hardly experienced performance intensive situations in any of your projects :-)
ROTFLMAO! Perhaps you should read my blog sometime to see what sort of projects I've been involved in and what the performance aspects were.
Let alone 200ms, even a 20ms round trip to the Db from the middle tier has a cascading cost when the system is loaded. The cost is typically a multiple of the base additional latency (in this case, the 20ms). These costs can come from any number of factors, including at the very least context switches (and the ensuing increased CPU stress) and cascading concurrency aggravation.
So you're saying that latency adds up serially? Sorry - that isn't necessarily the case at all.
As you increase load on a system, latency tends to go up only slightly for quite a while, with per-thread latency staying in a relatively tight band. For example, in a pure comms system w/ no persistent store access, the latency for 1-100 threads might stay in a band of 1-10 milliseconds, and could be indistinguishable between, say, 40 threads and 60. This happy situation exists because you haven't saturated any resources yet. Latency starts to pile up only when a resource gets saturated and queueing starts happening somewhere in the system - be it network, CPU, or something else.
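You can actually watch the band-then-pile-up behavior with a toy simulation. Here's a hedged sketch in Java - the semaphore stands in for any capacity-limited resource, the 5ms sleep for its service time, and every name and number is mine, purely for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    // Toy demo of the band-then-saturation behavior: a resource that can
    // serve 64 callers at once (the Semaphore) and takes ~5ms per call.
    // Below 64 threads, per-request latency stays near 5ms; above that,
    // requests queue on the semaphore and latency climbs with thread count.
    public class SaturationDemo {
        static final Semaphore RESOURCE = new Semaphore(64); // capacity of the shared resource

        static long timeOneRequest() throws InterruptedException {
            long start = System.nanoTime();
            RESOURCE.acquire();              // queue here once the resource is saturated
            try {
                Thread.sleep(5);             // fixed ~5ms service time
            } finally {
                RESOURCE.release();
            }
            return (System.nanoTime() - start) / 1_000_000; // latency in ms
        }

        static void runAt(int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            AtomicLong total = new AtomicLong();
            List<Future<?>> inFlight = new ArrayList<>();
            for (int i = 0; i < threads * 20; i++) {   // 20 requests per thread
                inFlight.add(pool.submit(() -> {
                    try { total.addAndGet(timeOneRequest()); }
                    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                }));
            }
            for (Future<?> f : inFlight) f.get();
            pool.shutdown();
            System.out.printf("%d threads: avg latency %d ms%n",
                    threads, total.get() / (threads * 20));
        }

        public static void main(String[] args) throws Exception {
            for (int t : new int[] {10, 40, 60, 100, 200}) runAt(t);
        }
    }

Run it and the average latency sits near 5ms right up until the thread count passes the resource's capacity, then climbs roughly linearly - that's queueing kicking in.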
If you throw in something with complex locking, like a database, then you've added another factor into the equation - lock contention. A well-designed RDBMS and well-designed app code will avoid lock contention as much as possible, and will resemble the comms example above. If there's excessive locking, then you _can_ start getting latency pile-ups due to queueing - in this case, you're queueing up on lock contention.
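As a toy illustration of designing the contention away (class and method names are mine):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // The first counter funnels every caller through one lock, so under
    // load threads queue on the monitor exactly like they'd queue on a hot
    // database lock. The second shards state per key (via ConcurrentHashMap's
    // internal striping), so unrelated requests never contend.
    public class ContentionSketch {
        // Coarse-grained: one lock guards everything - a contention hot spot.
        static class GlobalCounter {
            private final Map<String, Long> counts = new HashMap<>();
            public synchronized void increment(String key) {
                counts.merge(key, 1L, Long::sum);
            }
        }

        // Fine-grained: contention only when two threads hit the same key.
        static class StripedCounter {
            private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
            public void increment(String key) {
                counts.merge(key, 1L, Long::sum);
            }
        }
    }

The first version queues every caller up on one monitor; the second only contends when two threads hit the same key at the same time.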
In XA situations, it's possible, in pure Java with a fast disk, to support >5000 transactions/second from a large number of threads (say, 800 or so) while holding latency down to <100 milliseconds.
In comms situations, with small messages (1K or less) it's possible to do >15,000 msgs per second with similar latencies in a suitably fast network.
All of the above of course relies on parallelism, and in many cases on piggybacking multiple requests together in one form or another - in XA, you might have transactions sharing disk forces; in networking, you might combine multiple requests into a single network packet or otherwise coalesce results.
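For the "sharing disk forces" bit, here's a hedged group-commit sketch in modern Java - the file handling and batching policy are my own invention, but the shape is the classic one: many transactions, one force():

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.*;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Writers hand records to a queue; one log thread drains whatever has
    // accumulated, writes it, and pays for a single force() on behalf of the
    // whole batch. Ten transactions thus share one physical disk sync
    // instead of paying for ten.
    public class GroupCommitLog {
        private record Pending(ByteBuffer data, CompletableFuture<Void> done) {}

        private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<>();
        private final FileChannel log;

        public GroupCommitLog(Path path) throws IOException {
            log = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
            Thread writer = new Thread(this::writeLoop, "log-writer");
            writer.setDaemon(true);
            writer.start();
        }

        // Called by transaction threads; completes once the record is on disk.
        public CompletableFuture<Void> append(byte[] record) {
            Pending p = new Pending(ByteBuffer.wrap(record), new CompletableFuture<>());
            queue.add(p);
            return p.done;
        }

        private void writeLoop() {
            List<Pending> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.clear();
                    batch.add(queue.take());      // block for the first record...
                    queue.drainTo(batch);         // ...then grab everything queued behind it
                    for (Pending p : batch) log.write(p.data);
                    log.force(false);             // ONE disk sync covers the whole batch
                    for (Pending p : batch) p.done.complete(null);
                }
            } catch (IOException | InterruptedException e) {
                for (Pending p : batch) p.done.completeExceptionally(e);
            }
        }
    }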
And when a given resource gets saturated, you get around it by adding "more" of the saturated resource - more machines, more memory, faster disks, more disks, a faster network. Hell, even multiple networks. Perhaps you've heard of the term "scaling up" software? Well, this is what it's all about - software that lets you add "more" of whatever you need hardware-wise, and that can take advantage of it and distribute costs over multiple components.
Now from the perspective of a single thread, things are a bit different and closer to what you describe. Traditionally in Java land, people do access resources serially and in a synchronous manner. So if they access 10 "external" resources, they access them one at a time, and the total "request" time becomes the sum of all of those accesses. This goes hand in hand with people talking about not using fine-grained network interfaces - to avoid summing up the cost of many network round trips. In this situation you will hit a wall at some point.
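To make that concrete, here's a toy version of the serial style - the sleep stands in for any synchronous remote call, and all the numbers are illustrative:

    // Ten simulated 20ms "external" calls done one at a time, so the
    // request time is the sum of the legs (~200ms total).
    public class SerialRequest {
        static String callExternal(int i) throws InterruptedException {
            Thread.sleep(20);           // pretend network round trip + service time
            return "result-" + i;
        }

        public static void main(String[] args) throws InterruptedException {
            long start = System.nanoTime();
            for (int i = 0; i < 10; i++) {
                callExternal(i);        // each call blocks until the previous finishes
            }
            System.out.printf("serial total: %d ms%n",
                    (System.nanoTime() - start) / 1_000_000);  // ~200 ms
        }
    }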
And this is the point where people learn what people in the financial services arena learned decades ago - asynchronicity, non-blocking I/O, and parallelism are your friends :-) Some aspects of a workflow may be serial in nature and force you to code them that way, but a surprising amount of the work in a "request" coming into a system can be done asynchronously in a non-blocking manner. The model here is often (sketched in code after the list):
- Do some up front work
- Fire off external requests asynchronously to "N" resources
- Reap results of async requests as they come in
- Process & return results to caller
Such an approach can take 200 milliseconds of work and accomplish it in 40 or 50 milliseconds - and adding a few extra asynchronous firings doesn't measurably change the time perceived by the user.
This sort of design tends to be very fast and very resilient in the face of errors. Its primary drawback is that it is a very different way of doing things, and one that some people find more "complex". To Fowler's credit, he mentions a growing liking for asynchronous protocols and approaches to this sort of problem. He recognizes that asynchronicity and non-blocking semantics solve a number of his performance critiques of distributed computing (he's a couple of decades late in that observation, but at least he finally got there).
-Mike