The Investment Banking Technology Stack with John Davies


  1. John Davies’ shortened keynote at TSSJS Barcelona was an overview of the past, present and future state of wholesale (rather than retail) banking. It focused on the technology stack (saving the coding details for his Friday talk) and it gave a terrific insight into a sphere where languages and hardware really are at the bleeding edge.

    His role as CTO of C24, 20 years in banking within 25 years in the IT sector, and the co-authoring of several books on Java clearly make John Davies the man to deliver this information.

    Stating the obvious: Speed is everything. The activity of the trading room floor (described as the front office) demands CPUs make up to 10,000 calculations per second, or one every 5 millisecs. This grunt is handled by up to 6,000 CPU’s housed in massive (heavily air conditioned) rooms, sometimes powered by their own power grids...stats that raised 250 sets of eyebrows very quickly.

    Running short on time, some of the key points were the preference of banks for Linux over Unix, and that banking systems currently run mainly on Java 1.4; 1.5 is justified on a 20% performance gain - certain issues aside.

    How do banks choose a company? Well, it seems banks face a difficult consumer choice: big companies lack innovation - at least in cycles small enough to be helpful - small companies might not be around for very long, and with mid-sized companies the worry is that they will be bought out by the big guys. The solution for many banks seems to be in-house Open Source, with the core benefits being control and long-term security. As part of providing a balanced picture it’s worth noting that John stated that Open Source is riddled with politics; whether you choose JBoss, Apache or any other solution, Open Source in practice is not necessarily as free as the blue-sky ideal might suggest (a point that would undoubtedly have benefited from more time for input).

    Hot off the Open Source press: as close to a TSSJS exclusive as you can get, John announced the official launch yesterday (Tuesday 20th June) of AMQP - the Advanced Message Queuing Protocol - and (if you actually needed a hook) went on to state “It’s going to commoditise the messaging world”. In a presentation made up largely of sound bites, here is a quote to ponder over on the flight home:
    …Databases are on their way out. You can brute force an ‘in memory’ search of a million lines of code quicker than you can query a database.
    John Davies’ recommended site: www.openadaptor.org

    Threaded Messages (49)

  2. I hate to be pedantic here but are you sure you are using the word "grunt" properly? "This grunt is handled..." Do you mean "The brunt of this is handled..."
  3. This looks like a summary of the keynote. Is the actual keynote available somewhere? I didn't get to go to Spain. :(
  4. Typos?

    Stating the obvious: Speed is everything. The activity of the trading room floor (described as the front office) demands CPUs make up to 10,000 calculations per second, or one every 5 millisecs. This grunt is handled by up to 6,000 CPU’s housed in massive (heavily air conditioned) rooms, sometimes powered by their own power grids...stats that raised 250 sets of eyebrows very quickly.
    I hope these are just typos but 10000 calculations per second does not equal 1 every 5 milliseconds, it's 10 every millisecond. And frankly that's really not very many for even just a single modern computer, let alone a grid of 6000. Maybe calculation is not really the right term; transactions possibly...? I enjoyed this quote too:
    …Databases are on their way out. You can brute force an ‘in memory’ search of a million lines of code quicker than you can query a database.
    Why would you want to search through a million lines of code in a financial application? In an IDE, sure. On the other hand, why would you be storing your code in a database? Maybe that's why he needs a 6000 computer grid to achieve 10000 calculations a second? ;-)
  5. Re: Typos?

    I hope these are just typos but 10000 calculations per second does not equal 1 every 5 milliseconds, it's 10 every millisecond.
    Ha! I didn't even notice that but I did think 10000 calculations per second was pretty unimpressive.
    Why would you want to search through a million lines of code in a financial application? In an IDE, sure. On the other hand, why would you be storing your code in a database? Maybe that's why he needs a 6000 computer grid to achieve 10000 calculations a second? ;-)
    Perhaps you've struck upon why these stats "raised 250 sets of eyebrows very quickly".
  6. Please enter a subject.

    Why would you want to search through a million lines of code in a financial application?
    I assume that the real presentation was talking about a million records ... otherwise the statement makes no sense. If so, though, does anyone think that speed of search is the only essential value provided by a RDBMS ? I also note that it is 'up to' 6000 CPUs, blah, blah, blah. 2 is 'up to' 6000. This is the kind of one-upmanship that sometimes makes it hard to discuss real issues : 'my volume is bigger than yours ...'. Few organizations in the world support 6000 CPUs dedicated to one task. (Granted, many have 6000 desktops ... :-D) I'm sure he sees this in his line of work, but to how many situations does it apply ? Finally it seems to be a power grid not a computer 'grid' that he's talking about. That is obviously not true, and 6000 CPUs certainly do not need their own power grid in any event, but whatever ...
  7. Re: Please enter a subject.

    If so, though, does anyone think that speed of search is the only essential value provided by a RDBMS ?
    Yeah, I was thinking the same thing. Where does the global data storage go? You've got 6000 CPUs, right? What do they do, each maintain an internal set of records and issue broadcast messages when something changes? There's also this thing called 'caching'...


    I also note that it is 'up to' 6000 CPUs, blah, blah, blah. 2 is 'up to' 6000. This is the kind of one-upmanship that sometimes makes it hard to discuss real issues : 'my volume is bigger than yours ...'.
    Yes this does seem to be a case of 'bigus dickus' syndrome. High volume is just one of many technical issues one may or may not address in a system.
    Few organizations in the world support 6000 CPUs dedicated to one task. (Granted, many have 6000 desktops ... :-D) I'm sure he sees this in his line of work, but to how many situations does it apply ?

    Finally it seems to be a power grid not a computer 'grid' that he's talking about. That is obviously not true, and 6000 CPUs certainly do not need their own power grid in any event, but whatever ...
    No no no, they have their own nuclear reactor and everything. But seriously, I think what is meant is that the system can or does run off an auxiliary power station that can be separated from the grid. This is nothing special. Any mission critical computing environment needs something to this effect. Every building I've worked in has its own generator for thousands of computers, lights, etc.
  8. In some cases

    A financial firm will have its own dedicated generator at the NOC to power the servers for at least 2 weeks if needed. It's all part of normal operation for big financial firms. Peter
  9. Re: Please enter a subject.

    High volume is just one of many technical issues one may or may not address in a system.
    No no no no no !!! You are wrong. If you have not worked in the financial industry, you do not know how to handle high volume. :-)
  10. Re: Please enter a subject.

    Where does the global data storage go? You've got 6000 CPUs, right? What do they do, each maintain an internal set of records and issue broadcast messages when something changes?
    James, the 6,000 servers are the global storage, and no, you wouldn't do any broadcasting across 6,000 servers. That's why we introduced data partitioning in Coherence, so that servers could work together (and using scalable point-to-point communication) to provide a reliable and consistent single system image. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  11. Re: Please enter a subject.

    James, the 6,000 servers are the global storage, and no, you wouldn't do any broadcasting across 6,000 servers. That's why we introduced data partitioning in Coherence, so that servers could work together (and using scalable point-to-point communication) to provide a reliable and consistent single system image.

    Peace,

    Cameron Purdy
    Tangosol Coherence: The Java Data Grid
    Exactly! Thanks Cameron, -John-
  12. Re: Please enter a subject.

    Where does the global data storage go? You've got 6000 CPUs, right? What do they do, each maintain an internal set of records and issue broadcast messages when something changes?


    James, the 6,000 servers are the global storage, and no, you wouldn't do any broadcasting across 6,000 servers. That's why we introduced data partitioning in Coherence, so that servers could work together (and using scalable point-to-point communication) to provide a reliable and consistent single system image.

    Peace,

    Cameron Purdy
    Tangosol Coherence: The Java Data Grid
    This is something it would be useful to learn more about. Thanks for the clarification.
  13. Re: Please enter a subject.

    James, the 6,000 servers are the global storage, and no, you wouldn't do any broadcasting across 6,000 servers. That's why we introduced data partitioning in Coherence, so that servers could work together (and using scalable point-to-point communication) to provide a reliable and consistent single system image
    Actually, I have a question about this. If the image is stored across servers and the servers must talk to each other to keep things consistent, how much can you really save by eliminating the DBMS? Just to be clear Cameron, I don't doubt that you can, I would just like to understand more. If this is a trade secret, please just say so. Thanks.
  14. Re: Please enter a subject.

    James, the 6,000 servers are the global storage, and no, you wouldn't do any broadcasting across 6,000 servers. That's why we introduced data partitioning in Coherence, so that servers could work together (and using scalable point-to-point communication) to provide a reliable and consistent single system image
    Actually, I have a question about this. If the image is stored across servers and the servers must talk to each other to keep things consistent, how much can you really save by eliminating the DBMS?
    It's partitioned, which means they don't have to talk to each other to maintain the data, because only one server will have the master copy of each data (and probably only one or two servers will have a backup copy for redundancy). You do have network traffic for data access, but it is far better than a DBMS for scalability because there are 6,000 servers, each serving up 1/6000 of the data. (BTW - 6,000 is extreme; most large-scale applications need only a few hundred at the most.) Furthermore, you can near cache data, i.e. keep local copies for read purposes, using time-outs and/or events to manage synchronization, with explicit locking, optimistic transactions, pessimistic transactions and mobile agents all available to guarantee consistency. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
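    A toy sketch of the partitioning idea described above, under simplified assumptions (a fixed member count, one backup copy, hash-modulo ownership); it is not Coherence's implementation, just the general shape of "each key has exactly one owner, so no broadcast is needed to find it":

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Toy illustration of key partitioning: each key hashes to exactly one owning
        // member (plus one backup), so data can be located without any broadcast.
        public class PartitionedMapSketch {
            private final int memberCount;
            private final List<Map<String, Object>> primaries = new ArrayList<Map<String, Object>>();
            private final List<Map<String, Object>> backups = new ArrayList<Map<String, Object>>();

            public PartitionedMapSketch(int memberCount) {
                this.memberCount = memberCount;
                for (int i = 0; i < memberCount; i++) {
                    primaries.add(new HashMap<String, Object>());
                    backups.add(new HashMap<String, Object>());
                }
            }

            // The owner is a pure function of the key, so any member can compute it locally.
            private int ownerOf(String key)  { return (key.hashCode() & 0x7fffffff) % memberCount; }
            private int backupOf(String key) { return (ownerOf(key) + 1) % memberCount; }

            public void put(String key, Object value) {
                primaries.get(ownerOf(key)).put(key, value);  // master copy on one member
                backups.get(backupOf(key)).put(key, value);   // one redundant copy elsewhere
            }

            public Object get(String key) {
                return primaries.get(ownerOf(key)).get(key);  // a single point-to-point hop
            }
        }

    With this layout a read is one point-to-point call to the owning member, which is why read latency stays flat as members are added.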
  15. Re: Please enter a subject.

    It's partitioned, which means they don't have to talk to each other to maintain the data, because only one server will have the master copy of each data (and probably only one or two servers will have a backup copy for redundancy
    If you'll continue to humor me... I guess what I'm missing here is how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.) thanks, -James
  16. It's partitioned, which means they don't have to talk to each other to maintain the data, because only one server will have the master copy of each data (and probably only one or two servers will have a backup copy for redundancy


    If you'll continue to humor me...

    I guess what I'm missing here is how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.)

    thanks,

    -James
    The whole model works well for certain workloads, and fails miserably for others, e.g. big joins.
  17. Re: Please enter a subject.

    I guess what I'm missing here is how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.)

    thanks,

    -James
    i will wait for Cameron's official answer; but essentially - yes (and some more messages for redundancy). these are similar questions to those the industry struggled with wrt RDBMS replication in the early 90s. these discussions are now taking place in the context of the very good 'caching' technology available off-the-shelf today. it's possible to do this as LAN/WAN network latencies (at least within a controlled environment) are much lower than disk access latencies, memory is cheap, and electronics and electric power supply are much more reliable compared to 15 years ago.

    the general statement "databases are going away" isn't quite right. the "database" isn't going away - only the need for the disk round-trip is, as some of these 'caching' technologies give you ACID compliance too, and at a license cost much lower than Oracle/DB2 (compare several hundred CPUs * 40K USD v/s 2K USD). i am quoting 'cache' all this while as i have a feeling some of these may already be satisfying several of Codd's rules for a DBMS. *these are the new DBMS*. Oracle/DB2/SQL Server do not have the design today to provide 25,000 TPS cheaply - and Coherence/GigaSpaces etc. can do so today. eventually the big RDBMS vendors will have that capacity (i think in 7-8 years' time - you already see Oracle buying out some of the technology companies in that arena) - but by that time the need of the financial industry will be to process transactions at 10x that rate again.

    i already hear FPGA based computing being recommended in this space. and finally (sorry, can't avoid taking this dig - despite being employed by the financial services sector) Buffett still beats the investment banks without ever feeling the need to process transactions at that rate!
  18. Re: Please enter a subject.

    I guess what I'm missing here is how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.)

    thanks,

    -James


    i will wait for Cameron's official answer; but essentially - yes (and some more messages for redundancy). these are similar questions to those the industry struggled with wrt RDBMS replication in the early 90s. these discussions are now taking place in the context of the very good 'caching' technology available off-the-shelf today. it's possible to do this as LAN/WAN network latencies
    The reason you don't want to do a distributed join is not the network latency today - although that can also be a limiting factor - it's the context switching which is built into socket-based communications. The reality is that the general problem of scaling a database horizontally is solved today best probably by Oracle RAC: shared disk, zero context switching, and infiniband. On the other hand most applications don't need the full brute-force solution, or they may wish to formulate their application requirements so that they don't need it.
  19. Re: Please enter a subject.

    first let me take back my comment of Coherence/Gigaspace/'Cache' satisfying Codd's rules - they don't seem to have sufficient built-in support for logical schema and query language processing yet. they are still a programmer's tool - and definitely fit "applications that formulate their requirements so that they don't need full-brute-force Oracle".
    it's the context switching which is built into socket-based communications.
    i am trying to understand the effect of context switching on distributed join processing (other than latency). will be grateful if you can explain with an example.
    The reality is that the general problem of scaling a database horizontally is solved today best probably by Oracle RAC: shared disk, zero context switching, and infiniband.
    i don't yet understand the effect of zero context switching on query processing (i am not clear what 'context' you are pointing to - will await your clarification). i assume other factors you mention will affect Oracle/RDBMS and cache in equal measure. wrt your statement above, i have a different question: are you saying that today's Oracle will scale as much as today's commercial ACID caches. (in other words if oracle licenses were as cheap as the cache licenses - the cache product market will disappear). if Oracle itself provided a 'raw/low-level API' at the abstraction level of the caches (i.e. take away the query language support requirement from Oracle engine) - will today's Oracle as ACID compliant datastore compare in performance to the caches?
  20. Re: Please enter a subject.

    first let me take back my comment of ...
    sorry, i meant back-out / retract.
  21. Re: Please enter a subject.

    i am trying to understand the effect of context switching on distributed join processing (other than latency). will be grateful if you can explain with an example.
    When you send a UDP packet you make a system call. The kernel takes the packet and copies it into the ip stack and sends it out. Receiving works the same way but backwards. After the kernel is done sending it then gives control back to your process. But it doesn't have to be that way. It is possible to operate the network card directly from user space. And with very low-latency, reliable delivery networks, it makes sense to do this synchronously. You send the packet out and wait for it to be placed in the memory of the receiving machine. This is also known as RDMA, Remote Direct Memory Access. It is essentially the same as having a NUMA machine (Non-Uniform Memory Access) such as SGI makes (or used to make). Essentially it's like having some memory banks which are "close" and some which are a little "farther" away. With infiniband you can buy $4000 blade servers that will do the RDMA copy in 1.4us (microsecs.)
    are you saying that today's Oracle will scale as much as today's commercial ACID caches. (in other words if oracle licenses were as cheap as the cache licenses - the cache product market will disappear). if Oracle itself provided a 'raw/low-level API' at the abstraction level of the caches (i.e. take away the query language support requirement from Oracle engine) - will today's Oracle as ACID compliant datastore compare in performance to the caches?
    Infiniband-based Oracle RAC (Real Application Clusters) effectively solves the problem of scaling the database by using infiniband-equipped servers instead of buying bigger and bigger symmetric multiprocessors (what can you get today, 128?). I think I read that Oracle RAC scales up to 500 servers. By using RAC you get the same database functionality, a _real_ single system image, essentially the same performance, and all your data is stored in one place (your network-attached raid array). I guess the minus side is cost. Cameron once complained that Oracle RAC costs too much, and that's probably true ($60K/cpu, perhaps?) But then you get what you pay for. Another advantage of databases is that they use memory more efficiently than objects. In java there is a high minimum cost for all objects. For example, I think once I measured that an empty string requires about 40 bytes. For storing nothing. I want to repeat that for automatic trade generation you need to use a real in-memory database, not Oracle - too slow. The main difference is that the database is organized in memory so as to maximize cpu cache utilization (for example, by columns). Oracle doesn't do this because since eventually the disk is involved the cpu cache never becomes the bottleneck.
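    The per-object overhead point is easy to check informally. A rough sketch, with the caveat that the result depends heavily on the JVM, word size and GC timing, so it is an estimate rather than a measurement:

        // A rough, JVM-dependent way to see the per-object cost being discussed above.
        // The figure varies with VM, heap layout and GC timing; treat the output as an
        // estimate only, not a precise measurement.
        public class StringOverheadEstimate {
            public static void main(String[] args) {
                final int count = 1000000;
                Runtime rt = Runtime.getRuntime();
                String[] holder = new String[count];

                System.gc();
                long before = rt.totalMemory() - rt.freeMemory();

                for (int i = 0; i < count; i++) {
                    holder[i] = new String();  // distinct instances, no interning
                }

                System.gc();
                long after = rt.totalMemory() - rt.freeMemory();

                System.out.println("approx bytes per empty String: "
                        + (after - before) / (double) count);
                System.out.println("instances retained: " + holder.length);  // keep the array reachable
            }
        }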
  22. Clarifications

    Guglielmo -
    Infiniband-based Oracle RAC (Real Application Clusters) effectively solves the problem of scaling the database by using infiniband-equipped servers instead of buying bigger and bigger symmetric multiprocessors (what can you get today, 128?). I think I read that Oracle RAC scales up to 500 servers.
    Oracle RAC can run on up to 64 servers. For read-only applications, it can actually scale to that level. On the other hand, if it is a write-intensive application, it can scale negatively (i.e. one server can be the optimal deployment).
    By using RAC you get the same database functionality, a _real_ single system image, essentially the same performance, and all your data is stored in one place (your network-attached raid array).
    I happen to be a fan of Oracle RAC, and I don't understand why you keep comparing it to Coherence. They solve two completely different problems. As far as the _real_ this and the _real_ that, feel free to drop me an email (cameron at tangosol) to explain your apparent personal disgust with the concepts of managing objects in memory across a grid infrastructure. Regarding performance, having to issue database queries and map relational data into objects pretty clearly gives a significant advantage to a system that already has the data in object form, even if the database has zero latency and scales linearly (neither of which Oracle can claim). However, the reverse is also true, which is that there are still many tasks that we counsel architects to use a database for, and not a data grid.
    I guess the minus side is cost. Cameron once complained that Oracle RAC costs too much, and that's probably true ($60K/cpu, perhaps?) But then you get what you pay for.
    Technically it runs about $80k/CPU, but companies that run RAC don't tend to pay per CPU. It's a good product, and yes, with good products you tend to get what you pay for. I've never counseled anyone to avoid RAC; I like the idea of a reliable, scalable database sitting behind a Coherence data grid. What I do counsel people to do is to avoid architecting applications to rely heavily on a single point of bottleneck, and that means avoiding synchronous reliance on shared resources (e.g. a database shared by many application servers in a scale-out architecture).
    Another advantage of databases is that they use memory more efficiently than objects. In java there is a high minimum cost for all objects. For example, I think once I measured that an empty string requires about 40 bytes. For storing nothing.
    This is definitely true. Oracle comes from the 1970s, where every byte (every bit?) mattered. As a result, it is much more efficient with memory than Java is. At the same time, it is much less likely to make efficient use of large memory, which is why Oracle bought TimesTen (an in-memory relational db).
    I want to repeat that for automatic trade generation you need to use a real in-memory database, not Oracle - too slow.
    Coherence is a very popular solution for building algorithmic trading and electronic execution systems, so you obviously do not need an in-memory "database", just in-memory data. Oracle or Sybase or DB2 are plenty efficient for managing the persistent parts of such applications. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  23. Re: Clarifications

    Oracle RAC can run on up to 64 servers. For read-only applications, it can actually scale to that level. On the other hand, if it is a write-intensive application, it can scale negatively (i.e. one server can be the optimal deployment)
    A RAC database can have up to 100 instances. At least, as claimed by Oracle. Check out http://download-west.oracle.com/docs/cd/B19306_01/rac.102/b14197/admcon.htm#sthref59
    Casual conversations on RAC with our customers provide a different picture though - it ain't easy to get the product to scale beyond 4/5 cluster members for high throughput applications. RAC is a shared-everything database where each instance has equal access to all disks used to manage the tablespaces. Here are a few important points worth noting:
    - The unit of data movement in the shared buffer cache across the cluster happens to be a logical data block, which is typically a multiple of 8KB. The design is primarily optimized to make disk IO very efficient. AFAIK, even when requesting or updating a single row the complete block has to be transferred. In a distributed main memory object management system, the typical unit of data transfer is an object entry. In an update-heavy and latency sensitive environment, the unit could actually be just the "delta" - the exact change.

    - In RAC, if accessing a database block of any class does not locate a buffered copy in the local cache, a global cache operation is initiated. Before reading a block from disk, an attempt is made to find the block in the buffer cache of another instance. If the block is in another instance, a version of the block may be shipped. Again, there is no notion of a node being responsible for a block; there could be many copies of the block depending on how the query engine parallelized the data processing. This will be a problem if the update rate is high - imagine distributed locks on coarse-grained data blocks in a cluster of 100 nodes, for every single update?

    - At the end of the day, RAC is designed for everything to be durable to disk, and requires careful thought and planning around a high-speed private interconnect for buffer cache management and a very efficient cluster file system.
    Technically it runs about $80k/CPU, but companies that run RAC don't tend to pay per CPU. It's a good product, and yes, with good products you tend to get what you pay for.
    I guess you are implying that by the time you are deployed, it could cost you $80K/CPU? The Oracle price list is available here -> http://www.oracle.com/corporate/pricing/eplext.pdf
    Oracle comes from the 1970s, where every byte (every bit?) mattered. As a result, it is much more efficient with memory than Java is. ..
    Good point. This is one of the reasons why we also provide a native C/C++ implementation for managing the data structures. Having said that, there are a variety of ways to reduce the object overhead. For instance, when managing objects in cache memory on a server, just manage the object in a serialized form. Or, if the distributed cache supports native serialization, this can further compress the data storage requirements. GemStone's GemFire provides native XML data management where strings are canonicalized, potentially resulting in more than 10X savings in space needs. Finally, but most importantly, what I see is that with objects, applications naturally tend to manage object graphs - related data stored together. In an RDB, every foreign key relationship translates to heavy-duty index maintenance. Take a complex "star" schema with many FK relationships, and, I think, what is more efficient could be a coin toss. Cheers -- Jags Ramnarayan GemStone Systems GemFire, the Enterprise data fabric (http://www.gemstone.com)
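    A tiny sketch of the canonicalization idea mentioned above - keep one shared instance of each distinct string value before caching it. The class and method names are invented for illustration; this is not GemFire's API:

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Illustrative canonicalizer: repeated values (currency codes, counterparty names,
        // and other reference data) share a single String instance before being cached,
        // shrinking the footprint of highly repetitive data. Names are invented; this is
        // not the GemFire API.
        public class StringCanonicalizer {
            private final ConcurrentMap<String, String> pool =
                    new ConcurrentHashMap<String, String>();

            public String canonicalize(String value) {
                String existing = pool.putIfAbsent(value, value);
                return existing != null ? existing : value;  // reuse the first copy seen
            }
        }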
  • Re: Clarifications

    Oracle RAC can run on up to 64 servers.
    I read somewhere that it was tested with 500 nodes.
    For read-only applications, it can actually scale to that level. On the other hand, if it is a write-intensive application, it can scale negatively (i.e. one server can be the optimal deployment).
    That's just the nature of the workload - it's 100% serial. That's not a limitation in the scalability of the product, but in the scalability of the workload.
    I happen to be a fan of Oracle RAC, and I don't understand why you keep comparing it to Coherence.
    They are not comparable, but I read comparable claims, specifically solving the database bottleneck. If someone is going to sell me a clustered cache I want to hear its limitations clearly spelled out, because there is no such thing as a free lunch. Clearly any cache clustered on top of IP has severe limitations when it comes to processing the general SQL query that you can process in a centralized database. I know that from first principles, not because I know anything about Coherence. I am sure Coherence is an excellent product.
    They solve two completely different problems. As far as the _real_ this and the _real_ that, feel free to drop me an email (cameron at tangosol) to explain your apparent personal disgust with the concepts of managing objects in memory across a grid infrastructure.
    I have no such disgust.
    Regarding performance, having to issue database queries and map relational data into objects pretty clearly gives a significant advantage to a system that already has the data in object form, even if the database has zero latency and scales linearly (neither of which Oracle can claim)
    That's true. I think the reason it doesn't matter for most OLTP applications is that the latency is acceptable whereas throughput is always a problem. I think that the point of having a database clustered with special hardware is that the latency doesn't take much of a hit over the single-node case, but you can add throughput just by adding nodes. That's nice.
    Coherence is a very popular solution for building algorithmic trading and electronic execution systems, so you obviously do not need an in-memory "database", just in-memory data.
    I would be curious to see how the latency of such algorithms implemented in Java compare to programs written in K (www.kx.com). Perhaps for the algorithms in question shaving a few extra milliseconds doesn't matter. Guglielmo Enjoy the Fastest Known Reliable Multicast Protocol with Total Ordering .. or the World's First Pure-Java Terminal Driver
  • Re: Clarifications

    Clearly any cache clustered on top of IP has severe limitations when it comes to processing the general SQL query that you can process in a centralized database. I know that from first principles, not because I know anything about Coherence. I am sure Coherence is an excellent product.
    Coherence is designed to provide high throughput and low latency for data, and is typically used for identity-based data management (as opposed to general SQL queries), which means that the application is accessing and manipulating data by primary key. Regarding the "on top of IP" question, I'm curious what we'll see when Java supports Infiniband "natively", which is supposed to be in Java 6 (I thought). Technically, there is no reason (on platforms with user mode IP) for IP traffic to go through kernel mode either. Either way, we've already made significant architectural investments to be able to support IB as another option in Coherence.
    Coherence is a very popular solution for building algorithmic trading and electronic execution systems, so you obviously do not need an in-memory "database", just in-memory data.
    I would be curious to see how the latency of such algorithms implemented in Java compare to programs written in K (www.kx.com). Perhaps for the algorithms in question shaving a few extra milliseconds doesn't matter.
    I don't know K. Are you saying that K would introduce additional latency because of the need to query, etc.? Or are you saying that Java has additional latency because of some particular reason? From a brief look at their web site, we have a good number of joint customers. I could ask around to see what projects are using them. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  • Re: Clarifications

    Coherence is designed to provide high throughput and low latency for data, and is typically used for identity-based data management (as opposed to general SQL queries), which means that the application is accessing and manipulating data by primary key.
    I figured.
    Regarding the "on top of IP" question, I'm curious what we'll see when Java supports Infiniband "natively", which is supposed to be in Java 6 (I thought). Technically, there is no reason (on platforms with user mode IP) for IP traffic to go through kernel mode either. Either way, we've already made significant architectural investments to be able to support IB as another option in Coherence.
    Good for you. You can support transparently via Socket Direct Protocol (Mellanox has a zero-copy SDP implementation, I think, even though zero copy is not a goal of SDP.) But I think much more interesting would be to implement a Java binding for UDAPL.
    I don't know K. Are you saying that K would introduce additional latency because of the need to query, etc.? Or are you saying that Java has additional latency because of some particular reason?

    From a brief look at their web site, we have a good number of joint customers. I could ask around to see what projects are using them.
    K would remove latency. K is an array language (like APL, A+, and a few others, apparently). KDB is an in-memory database based on K. The tables are stored as sets of lists (one list per column). KDB is single-threaded - it does not support concurrent transactions. It's designed to have the absolute rock-bottom response time. The fascinating thing that makes it work well is that the columnar storage and the array generate query plans that have stellar cache utilization - unlike your run-of-the-mill relational database. Separately I came upon MonetDB, which also has columnar storage, and I think the latest version of the MonetDB storage engine, called X100, essentially should reproduce the performance of K. There is an old article on K and KDB by Arthur Whitney et al. called "High Volume Transaction Processing". Next time we can talk about those other guys who are using graphics cards to do database joins ... Guglielmo Enjoy the Fastest Known Reliable Multicast Protocol with Total Ordering .. or the World's First Pure-Java Terminal Driver
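    A small sketch of why the columnar layout helps the scans described above: summing one field over a contiguous primitive array walks memory sequentially, while a row-of-objects layout chases a pointer per row. The trade fields and sizes are invented for illustration:

        // Illustrative comparison of row-oriented vs column-oriented storage for a scan.
        // Summing one field over a column (a contiguous double[]) is far more cache-friendly
        // than walking an array of row objects; the data here is purely synthetic.
        public class ColumnarScanSketch {
            static class TradeRow {            // row-oriented: one object per trade
                long id;
                double price;
                double quantity;
            }

            public static void main(String[] args) {
                int n = 1000000;

                TradeRow[] rows = new TradeRow[n];     // row store
                double[] priceColumn = new double[n];  // column store (prices only)
                for (int i = 0; i < n; i++) {
                    rows[i] = new TradeRow();
                    rows[i].price = i * 0.01;
                    priceColumn[i] = i * 0.01;
                }

                double rowSum = 0;
                for (int i = 0; i < n; i++) {
                    rowSum += rows[i].price;           // pointer chase per row
                }

                double colSum = 0;
                for (int i = 0; i < n; i++) {
                    colSum += priceColumn[i];          // sequential, cache-friendly access
                }

                System.out.println(rowSum + " " + colSum);
            }
        }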
  • Re: Clarifications

    Either way, we've already made significant architectural investments to be able to support IB as another option in Coherence.
    Good for you. You can support transparently via Socket Direct Protocol (Mellanox has a zero-copy SDP implementation, I think, even though zero copy is not a goal of SDP.) But I think much more interesting would be to implement a Java binding for UDAPL.
    So far, I've only seen it used on the IPonIB layer (which I assume was SDP). I've downloaded the items you referenced, and will take a look. I still haven't seen any UDAPL support in Java. Do you have any links on that?
    Next time we can talk about those other guys who are using graphics cards to do database joins ...
    IIRC, I heard of it in a New York finserv company, using the graphics chip (i.e. COTS hardware) as a floating point vector engine to do loads of relatively simple risk calculations. It was just a POC, and I have no idea what the outcome was. A vendor that I'm aware of is Detica in the UK (the company that bought Evolution, which Simon Brown works for), which repurposed some of its specialized chips to target the financial industry. Again, I've only heard of POCs so far. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  • Re: Clarifications

    So far, I've only seen it used on the IPonIB layer (which I assume was SDP).

    I've downloaded the items you referenced, and will take a look. I still haven't seen any UDAPL support in Java.
    I don't think there is such a thing as UDAPL for Java, but it can be added. The essence of udapl is that it supports RDMA semantics, as opposed to stream semantics.
    Do you have any links on that?
    Just www.datcollaborative.org, but I am sure you already found it.
    IIRC, I heard of it in a New York finserv company, using the graphics chip (i.e. COTS hardware) as a floating point vector engine to do loads of relatively simple risk calculations. It was just a POC, and I have no idea what the outcome was.

    A vendor that I'm aware of is Detica in the UK (the company that bought Evolution, which Simon Brown works for), which repurposed some of its specialized chips to target the financial industry. Again, I've only heard of POCs so far.
    I have only read some articles about it. I doubt it's being used anywhere for real, although you never know. Once I sent an email to Kx to ask them if they know of anyone using gpus for program trading, and they said no. Guglielmo Enjoy the Fastest Known Reliable Multicast Protocol with Total Ordering .. or the World's First Pure-Java Terminal Driver
  • Re: Please enter a subject.

    i already hear FPGA based computing being recommended in this space.
    We have seen some interest in this area, and while we are working with a company on this, the future is not clear. There are probably 10 different approaches today, including PCIx/PCIe mounted custom FPGAs, custom servers (think Azul), chips that drop into an AMD Opteron socket, video accelerators used for calculation acceleration (no kidding!), etc. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  • I guess what I'm missing here is how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.)

    thanks,

    -James
    James,

    There are many techniques available to reduce the need to service requests by joining or locking data across distributed nodes. Thoughts that come to mind:

    (1) Relationship driven Data Placement : Try to colocate data that is related and commonly accessed together in a single node. For instance, many OLTP applications tend to join parent-child tables in a very predictable fashion - when I access Orders, I typically need the parent Customer data and the corresponding LineItems. This is similar to how you cluster tables (nothing to do with distributed clustering across machines) in a RDB. In a distributed data management system that offers relational database semantics, these relationships are well defined and the underlying machinery could capitalize on this knowledge automatically. In the object world, you have the following choices: (a) a framework that requires the application to programmatically identify the relationships, (b) byte code enhancements to make relationship management somewhat transparent, or (c) a persistence layer like the new Java Persistence API (EJB3) where the relationships are captured in a deployment descriptor or through annotations.

    (2) Combining partitioning with replication: When it comes to many high performance applications, like the ones in investment banking, it is seldom the case that the operational data required by the app ever goes beyond 50GB or so. What you often find is the equivalent of the "star" schema - one or two fact tables with millions of records (transactions for the month, trades, etc) and several code/reference data tables that the fact tables reference. The interesting point is that these related tables are generally much smaller. So, the trick here is to just partition the big table (or Object collection) and replicate the rest of the tables. Of course, I am overly simplifying the problem - I guess the point is, you can use redundant copies along with partitioning to achieve scale. Also, in practice, what we observe when dealing with apps that are query intensive is that the problem typically is not how much data is moving on the network, but rather where the query is executed. The queries are quite selective on the data set (the returned dataset is very small) - the challenge here is to reduce disk contention (which ain't a problem with just memory storage), contention points in the software, context switching, the plan generation, using indexes and so on. The goal should be - reduce CPU utilization and you've got scale.

    (3) Dynamic Data Placement: Similar to the first approach with one key difference: keep statistics about the query workload and automatically move data and establish copies of data. Of course, the usual tradeoffs of maintaining data consistency with high update rates apply. Then, there could be smarter indexing schemes where indexes could be made available locally or through a small set of nodes in your distributed system. Though not a single-hop query execution scheme, you get the huge benefit of query parallelization.

    And, of course, for many object oriented applications that manage closely related data in the object entry itself, key based access is more than sufficient. Check out this wiki page on Distributed Hashtable technology - consistent hashing for the keys, overlay networks, etc (http://en.wikipedia.org/wiki/Distributed_hash_table) If you have access to ACM, I find this paper relevant -> http://portal.acm.org/citation.cfm?id=371598 Cheers!
--- Jags Ramnarayan GemStone Systems GemFire - The Enterprise Data Fabric http://www.gemstone.com
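    A minimal sketch of the colocation idea in point (1) of the post above: route an Order by its customer id rather than by its own key, so a customer and its orders land on the same node and the parent-child lookup never crosses the network. The names and the hash-modulo scheme are illustrative assumptions, not any vendor's API:

        // Hypothetical affinity routing: partition Orders by their customer id rather than
        // by order id, so a customer's orders live on the same node as the customer record.
        public class AffinityRouter {
            private final int nodeCount;

            public AffinityRouter(int nodeCount) {
                this.nodeCount = nodeCount;
            }

            public int nodeForCustomer(String customerId) {
                return (customerId.hashCode() & 0x7fffffff) % nodeCount;
            }

            // An order is routed by the customer it belongs to, not by its own key,
            // so Customer + Orders + LineItems can be read and joined on one node.
            public int nodeForOrder(String orderId, String customerId) {
                return nodeForCustomer(customerId);
            }
        }

    The thread above describes both Coherence's affinity and GemFire's relationship-driven placement achieving this effect; the underlying principle is simply to derive the partition from the parent key.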

  • Thanks to everyone that took time to respond to my questions.
    (1) Relationship driven Data Placement : Try to colocate data that is related and commonly accessed together in a single node. For instance, many OLTP applications tend to join parent-child tables in a very predictable fashion - when I access Orders, I typically need the parent Customer data and the corresponding LineItems...
    So the strategy here would be to get the request to the server that has the order's data, right?
  • (1) Relationship driven Data Placement : Try to colocate data that is related and commonly accessed together in a single node. For instance, many OLTP applications tend to join parent-child tables in a very predictable fashion - when I access Orders, I typically need the parent Customer data and the corresponding LineItems...


    So the strategy here would be to get the request to the server that has the order's data, right?
    Yes, that is correct.
  • (1) Relationship driven Data Placement : Try to colocate data that is related and commonly accessed together in a single node. For instance, many OLTP applications tend to join parent-child tables in a very predictable fashion - when I access Orders, I typically need the parent Customer data and the corresponding LineItems...
    So the strategy here would be to get the request to the server that has the order's data, right?
    Yes, that is correct.
    With Coherence, you simply execute the request as an agent against the Order object itself: result = orders.invoke(orderId, requestProcessor); In a grid environment, the once-and-only-once processing would occur where the Order object is located, and with affinity the Orders for a particular customer would be located on the same grid node that contains the customer data. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
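    For readers who have not seen the agent style shown above, here is a simplified stand-in for what a requestProcessor might look like. The interfaces below are not Coherence's classes, and the Order fields and fill logic are invented; only the pattern of invoking an agent against a key comes from the post:

        import java.io.Serializable;

        // Simplified stand-ins for the "send the work to the data" pattern shown above.
        // These are NOT Coherence's interfaces; they only mirror the shape of the idea.
        interface GridEntry {
            Object getValue();
            void setValue(Object value);
        }

        interface GridAgent extends Serializable {
            // Runs on the member that owns the entry; the return value goes back to the caller.
            Object process(GridEntry entry);
        }

        class Order implements Serializable {
            double filledQuantity;
        }

        // Example agent: apply a fill to an Order in place, on the owning member.
        class ApplyFillProcessor implements GridAgent {
            private final double quantity;

            ApplyFillProcessor(double quantity) {
                this.quantity = quantity;
            }

            public Object process(GridEntry entry) {
                Order order = (Order) entry.getValue();
                order.filledQuantity += quantity;
                entry.setValue(order);  // the update happens where the data lives
                return Double.valueOf(order.filledQuantity);
            }
        }

    The caller's side then looks like the line quoted above, e.g. result = orders.invoke(orderId, new ApplyFillProcessor(1000)), with the processing happening once, on the member that owns the entry.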
  • Re: Please enter a subject.

    It's partitioned, which means they don't have to talk to each other to maintain the data, because only one server will have the master copy of each data (and probably only one or two servers will have a backup copy for redundancy
    [..] how does a single server service a request that requires (and/or requires updating) data that is spread across two or more of the partition servers? The servers must then speak to each other or have caches (which obviously need to be flushed in the case of an update.)
    Yes, of course. It is very easy to understand distributed systems simply by imagining each server to be a person, and trying to determine the least amount of conversation required to achieve a certain goal. For the simple case you mentioned, if a server running Coherence needs data from (for example) three other servers, it will send a request in parallel to those three servers (typically all point-to-point communication), which will then process and respond in parallel, so the latency does not tend to increase for read operations regardless of the number of servers involved. There will almost always be network traffic when there are mutating operations, since resiliency of the data modifications requires synchronous backups to another server or servers. As you suggested, additional communication is often present since other servers may be listening for events resulting from the modifications. Coordinating changes to data managed on different servers will also obviously require communication. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
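    A sketch of the parallel fan-out described above, where a hypothetical fetchFromServer() stands in for the real point-to-point call: requests to the owning servers go out concurrently, so the total read latency tracks the slowest single hop rather than the number of servers involved:

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutionException;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        // Illustrative parallel fan-out: ask each owning server for its keys concurrently,
        // so total latency is roughly one round trip, not one round trip per server.
        // fetchFromServer() is a placeholder for the real point-to-point network call.
        public class ParallelFanOut {
            private final ExecutorService pool = Executors.newCachedThreadPool();

            public Map<String, Object> readAll(Map<Integer, List<String>> keysByServer)
                    throws InterruptedException, ExecutionException {
                List<Future<Map<String, Object>>> futures =
                        new ArrayList<Future<Map<String, Object>>>();

                for (Map.Entry<Integer, List<String>> entry : keysByServer.entrySet()) {
                    final int serverId = entry.getKey();
                    final List<String> keys = entry.getValue();
                    futures.add(pool.submit(new Callable<Map<String, Object>>() {
                        public Map<String, Object> call() {
                            return fetchFromServer(serverId, keys);
                        }
                    }));
                }

                Map<String, Object> results = new HashMap<String, Object>();
                for (Future<Map<String, Object>> f : futures) {
                    results.putAll(f.get());  // all servers were queried in parallel
                }
                return results;
            }

            private Map<String, Object> fetchFromServer(int serverId, List<String> keys) {
                return new HashMap<String, Object>();  // placeholder for the network call
            }
        }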
  • So, if each machine contributed, say, 1GB to the "global in-memory storage", we have 6000 GB, which is 6 Terabytes. Hypothetical question, purely out of curiosity: what happens if 5999 of the 6000 servers fail at the same time? Obviously, the 1 surviving machine does not have 6 TB of RAM to substitute for the failed ones. Does Tangosol begin to backup to disk or something when the "data grid" begins to realize that there's not sufficient RAM around? I suppose this would be an uneconomical choice, so perhaps the grid panics and shuts down?
  • So, if each machine contributed, say, 1GB to the "global in-memory storage", we have 6000 GB, which is 6 Terabytes.
    Actually, it would be 3TB (with 1 level of redundancy).
    Hypothetical question, purely out of curiosity: what happens if 5999 of the 6000 servers fail at the same time?
    You lose data.
    Does Tangosol begin to backup to disk or something when the "data grid" begins to realize that there's not sufficient RAM around?
    Yes. It's called "overflow", and we can do that to disk (including with BerkeleyDB).
    I suppose this would be an uneconomical choice, so perhaps the grid panics and shuts down?
    Yes, the application (failing to meet quorum requirements) should stop operating. The problem is that when you lose many servers *at the same time* you will lose data, because chances are that one of those servers that died was a backup for at least a little of the data from one of the other servers that died. However, if one dies every five seconds or something like that, then it can use the overflow model that you described without data loss. Peace, Cameron Purdy Tangosol Coherence: The Java Data Grid
  • Re: Please enter a subject.

    Come on guys, enough of Bigus Dickus (in this context at least). The actual text on the slide was...

    Most large banks have 2-5,000+ CPU grids
    - Often a mixture of blades, Egenera, V40s etc.
    - Recently Azul has appeared with interesting technology

    I was just pointing out that we have to perform a lot of complex calculations and by preference we like grids. Most of the largest grids are in banks; Platform, DataSynapse, GigaSpaces, Tangosol and Gemstone all work in this area. I wasn't referring to any one bank or any one implementation, it was a generalisation based on having consulted on projects with at least three of these vendors and worked on well over half a dozen grid projects in several banks. Where the heck you guys got power-grids from I just don't know, this is High Performance Compute grids guys!!! I suggest you guys come to the conference next time, you might actually understand what we're talking about here rather than going off in a tangent.

    -John-
  • Where the heck you guys got power-grids from I just don't know, this is High Performance Compute grids guys!!!
    I can shed light on this :-) Middle of the third line (on 1280x1024) of the third paragraph of the original post :-) Actually, I think the original was quite intelligible too, although obviously the stats could not be correctly learned from it. Best regards, Robert Varga
  • Re: Please enter a subject.

    Where the heck you guys got power-grids from I just don't know, this is High Performance Compute grids guys!!!
    Someone already pointed it out but here's the text from the entry: "This grunt is handled by up to 6,000 CPU’s housed in massive (heavily air conditioned) rooms, sometimes powered by their own power grids...
    I suggest you guys come to the conference next time, you might actually understand what we're talking about here rather than going off in a tangent.

    -John-
    If you feel that way then tell TSS not to post articles on your discussions.
  • Re: Please enter a subject.

    Finally it seems to be a power grid not a computer 'grid' that he's talking about. That is obviously not true, and 6000 CPUs certainly do not need their own power grid in any event, but whatever ...
    In a "true" enterprise setting it is not atypical for a data center to have it's own power generation capabilities for disaster mitigation purposes. This has nothing to do with the actual power needs of the data center (often not even a blip to the public grid). Here in the Philadelphia suburbs I can point to a couple corporate buildings that have their own power production facilities and that actually resell excess power to the grid in order to defray the costs. In fact, I've worked for a company whose trading room and data center was in a remodeled mill building and whose backup power kicked in every other hour as a "hot test". The data center there could handle three days of uptime without any outside assistance.
  • Re: Typos?

    I hope these are just typos but 10000 calculations per second does not equal 1 every 5 milliseconds, it's 10 every millisecond. And frankly that's really not very many for even just a single modern computer, let alone a grid of 6000. Maybe calculation is not really the right term; transactions possibly...?
    This are definately typos, bare in mind that that most of us have been up partying until 6am in the morning so you have to accept a few typos, they don't start eating here until midnight. I would have proof-read these for Cameron if he could have found me but I was sleeping most of the day. :-)

    This was the original text on the slide...
    Automated trading systems use rules to trade on the exchanges
    Volumes often exceed 10,000 messages a second
    Reaction times are typically in the low milliseconds

    As for searching through a million lines of code, this too was a little typo: what I said was that you can often brute-force search through 1,000,000 rows of data (Objects) faster than executing a query on a database. -John-
  • Re: Typos?

    This are definately typos, bare in mind that that most of us have been up partying until 6am in the morning so you have to accept a few typos...
    So me let get this straight... 'This are definately typos' but we are to be blamed for 'no being there' or (I suppose) not being able to know what was really meant. I think you should 'bare' in mind that not everyone has the time or money to hop on a flight to Europe and those without such luxuries can only assess what was written here.
  • Re: Typos?

    So me let get this straight... 'This are definately typos' but we are to be blamed for 'no being there' or (I suppose) not being able to know what was really meant.

    I think you should 'bare' in mind that not everyone has the time or money to hop on a flight to Europe and those without such luxuries can only assess what was written here.
    This really isn't worth going into and I apologise if you think I'm blaming you. There are a few typos in this and it's obviously confused things a little. Some people seem to have understood the review and others didn't. The slides should be posted somewhere soon and if anyone's interested I can either correct the overview or repost something. -John-
  • Re: Typos?

    This really isn't worth going into and I apologise if you think I'm blaming you.
    That's not really an apology but I wasn't really looking for one anyway.
    There are a few typos in this and it's obviously confused things a little. Some people seem to have understood the review and others didn't. The slides should be posted somewhere soon and if anyone's interested I can either correct the overview or repost something.
    From your comments, it seems that it was more than just a few typos. It seems that the author of this may not have really understood or followed the discussion. Saying that the CPUs are "sometimes powered by their own power grids" can't really be explained away as a typo. You apparently never mentioned power grids and 'sometimes powered by their own computer grids' doesn't really make sense and even if you assume 'powered by' in the technology marketing sense (e.g. 'powered by google') it still seems to miss the whole point. I imagine those who 'got it' already had a lot of knowledge in this area.
  • Re: Typos?

    I think if you really take the content of the site and the value of the discussion seriously then you would probably (1) take the quality of the postings more seriously (this is generally a problem here on TSS) and (2) take steps to correct the original article (which is still a mess as I write this) as soon as you see general confusion over its (obviously) confusing and incorrect content rather than berating your readers for not being telepathic. But it's good to know that the "information" read here comes from half-drunk conference attendees (and the JBoss marketing dept.).
  • The activity of the trading room floor (described as the front office) demands CPUs make up to 10,000 calculations per second
    From the wording I would guess he was talking about program trading. You read a constant flood of 'ticks' and generate trades and fire them off to the exchange automatically (everyone else is doing it also.) Much of this stuff is written in APL, A+, J, etc. I think about 50% of NYSE trading may be program trading. While RDBMSs are definitely out (who has time to write to disk or even copy data?) in-memory dbs are very much in, for example KDB or, very interestingly, MonetDB, especially the new X100 engine. I did some investigating on why KDB is so fantastically fast and came upon MonetDB and convinced myself that X100 should be as fast as KDB. Except that KDB is also really really simple, which MonetDB is not (but it's instructive.) This is also where the totem protocol may come in handy: replicate your data in memory by using totem instead of saving it to disk, and you can take the disk out of the critical path. Although don't use it for program trading because other companies are already using primary-backup which has lower latency, so you would be at a disadvantage. But there are many other applications. Guglielmo Enjoy the Fastest Known Reliable Multicast Protocol with Total Ordering .. or the World's First Pure-Java Terminal Driver
  • Spot on, I'm glad someone can see through the typos! -John-
  • Nope...a 6000 CPU grid

    I just left the company for which Davies works. The 6000 CPU setup he is talking about is a compute grid. They are all (I believe) in a single data center, or perhaps they are divided across two data centers. This 6000 CPU grid is a general purpose grid. IIRC, it was set up initially to handle risk calculations, although the grid is now used for a variety of purposes. These problems are incredibly difficult to handle, as they pull in data from 1000s (millions?) of sources, performing complex equations, with each calculation potentially affecting subsequent calculations. I'm not as familiar with the program trading end of the business, but I don't believe the program trading was handled by the grid.

    Program trading isn't what I thought before I got into the investment banking business. I originally thought these guys were doing massive calculations to automatically buy & sell positions across a variety of investment vehicles, all based on some empirical calculations. As it turns out, the program trading that I am aware Davies' company is doing really boils down to someone (say Vanguard, the mutual fund) attempting to purchase $1bn of MSFT (for instance). Well, if they were to buy $1bn of the stock all at once, the price of the stock would skyrocket immediately. Instead, this trade would be broken into tons of trades, all of a volume of, say, 1000 shares at a time. The trade will occur over several days, or even weeks. The program trading algorithm decides how best to buy or sell the stock so as to minimize or maximize, respectively, the average price of the stock for the $1bn trade. This turns out to be a much simpler problem, since there are far fewer variables involved in the calculation, and far fewer dependencies than are required for calculating risk.

    As I understand it, there are some groups that do more complex program trading, where the underlying financial instrument is bought or sold in concert with derivatives, such as options. So, for instance, over an hour you might purchase 100,000 shares of a stock while simultaneously buying call options for 10,000 shares. The stock purchase might end up moving the price of the stock a few basis points, for which the profit from the call options would offset the higher average cost of the stock. In the end, the net effective cost for the shares would be closer to the price of the stock prior to the initiation of the programmed trade. This might involve more complex computations, but I really doubt this sort of task would be well suited to a massively parallel grid.

    John Davies might want to correct me on this point, since he has far more information on what exactly the grid is used for than I. But, rest assured, the problems that the investment banking industry sees on a daily basis are typically an order of magnitude more complex, and transactionally intensive, than the most difficult problems seen in most industries. It is a fascinating industry, although it by no means has a monopoly on high-volume transactionally-intensive and CPU-intensive computing. Given all of that, you might be surprised how much of the business is still run on AS/400 systems running RPG. For new work, though, Java appears to be the language of choice for most investment banking firms, at least for new development.
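    A toy sketch of the order-slicing described in the post above: break one large parent order into small child orders released over time. The clip size, interval and numbers are invented and bear no relation to any real trading algorithm:

        import java.util.ArrayList;
        import java.util.List;

        // Toy illustration of slicing a large order into small child orders over time so
        // the market impact is spread out. All numbers and rules here are invented.
        public class OrderSlicer {
            static class ChildOrder {
                final String symbol;
                final int shares;
                final long releaseOffsetSeconds;
                ChildOrder(String symbol, int shares, long releaseOffsetSeconds) {
                    this.symbol = symbol;
                    this.shares = shares;
                    this.releaseOffsetSeconds = releaseOffsetSeconds;
                }
            }

            // Split totalShares into clips of at most clipSize, one clip every intervalSeconds.
            public static List<ChildOrder> slice(String symbol, int totalShares,
                                                 int clipSize, long intervalSeconds) {
                List<ChildOrder> children = new ArrayList<ChildOrder>();
                int remaining = totalShares;
                long offset = 0;
                while (remaining > 0) {
                    int clip = Math.min(clipSize, remaining);
                    children.add(new ChildOrder(symbol, clip, offset));
                    remaining -= clip;
                    offset += intervalSeconds;
                }
                return children;
            }

            public static void main(String[] args) {
                // e.g. work a 1,000,000 share order in 1,000 share clips, one per minute
                System.out.println(slice("MSFT", 1000000, 1000, 60).size() + " child orders");
            }
        }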
  • Program trading isn't what I thought before I got into the investment banking business. I originally thought these guys were doing massive calculations to automatically buy & sell positions across a variety of investment vehicles, all based on some empirical calculations.
    There are such people, and I think this kind of work generates a massive amount of trading. For example all arbitrage is likely handled this way. I think what you saw in your company is usually called "portfolio trading", where you manage a client's very large positions, but in a semi-automated way. For the automated trading I know that at Morgan Stanley they used to use APL. Arthur Whitney wrote an implementation called A+, and later founded Kx Systems and wrote K. That kind of trading is completely automated - it does exactly what you described above. Guglielmo Enjoy the Fastest Known Reliable Multicast Protocol with Total Ordering .. or the World's First Pure-Java Terminal Driver
  • Re: Nope...a 6000 CPU grid

    Dan,

    You're not far off, there's little to correct you on. I must point out though that the number "6,000" was picked out of the air by Cameron, who wrote this up, so it's not a direct quote from me; having said that, 6,000 is well within the norm. It would be unwise for either of us to go into any more detail about what any particular bank does with their grid, I assume if you worked at the same place as me you will still be under some form of NDA. Anyway, if you take the top, say, 20 banks, you will probably not find a single one with a dedicated grid of less than 2,000 CPUs; many are much larger.

    Typically these are calculating things like VaRs (Value at Risk). The mathematics behind VaR are far too complex to go into here; it is complex enough to have a book dedicated to this single calculation (http://www.value-at-risk.net/), which includes sleep-invoking topics like:
    * quadratic ("delta-gamma") methods for nonlinear portfolios,
    * variance reduction (control variates and stratified sampling) for Monte Carlo VaR measures,
    * principal component remappings,
    * techniques to "fix" estimated covariance matrices that are not positive-definite, and
    * the Cornish-Fisher expansion.

    Typically a single VaR can take up to 20 minutes to calculate on an average laptop; we need 10s of 1000s of these calculated in seconds, the quicker the better. With multiple calculations occurring at once, certain optimisations can be made to the curves so it can be made to run a little quicker.

    It's quite simple really, it's like buying and selling on eBay: if you take too long to decide what price you're going to quote you will lose the deal. Banks are the same, only the winner makes money, second place loses the deal; in a dealing room though the price is only valid for seconds and you need to decide in milliseconds. If your competitor decides in 800ms then you need <800ms and therefore more CPUs; the business traded in a day will easily justify another few thousand CPUs.

    -John-
    CTO, C24
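    To make the "fan the calculation out across many CPUs" point concrete, a toy Monte Carlo sketch parallelised across a local thread pool. The single-position, normally-distributed P&L model is a deliberate oversimplification of the methods listed above (no delta-gamma terms, no covariance matrix); it only illustrates how the work splits into independent chunks:

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;
        import java.util.Random;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.Future;

        // Toy parallel Monte Carlo VaR: simulate one-day P&L of a single position under a
        // normal return model and take the loss quantile. Real VaR engines handle whole
        // portfolios, correlations and far richer models; this only shows the fan-out.
        public class MonteCarloVarSketch {
            public static void main(String[] args) throws Exception {
                final double positionValue = 10000000.0;  // $10m position (assumed)
                final double dailyVol = 0.02;             // 2% daily volatility (assumed)
                final int scenarios = 1000000;
                final int workers = Runtime.getRuntime().availableProcessors();
                final int chunk = scenarios / workers;

                ExecutorService pool = Executors.newFixedThreadPool(workers);
                List<Future<double[]>> parts = new ArrayList<Future<double[]>>();

                for (int w = 0; w < workers; w++) {
                    parts.add(pool.submit(new Callable<double[]>() {
                        public double[] call() {
                            Random rnd = new Random();
                            double[] losses = new double[chunk];
                            for (int i = 0; i < chunk; i++) {
                                double ret = rnd.nextGaussian() * dailyVol;
                                losses[i] = -positionValue * ret;  // positive value = loss
                            }
                            return losses;
                        }
                    }));
                }

                double[] all = new double[chunk * workers];
                int pos = 0;
                for (Future<double[]> f : parts) {
                    double[] part = f.get();
                    System.arraycopy(part, 0, all, pos, part.length);
                    pos += part.length;
                }
                pool.shutdown();

                Arrays.sort(all);
                double var99 = all[(int) (all.length * 0.99)];  // 99% one-day VaR estimate
                System.out.printf("99%% 1-day VaR ~ %.0f%n", var99);
            }
        }

    On a real grid each chunk would be dispatched to a remote compute engine rather than a local thread, but the decomposition is the same, which is why adding CPUs cuts the wall-clock time almost linearly.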