Discussions

News: Clustered JDBC (C-JDBC) Performance Report

  1. Clustered JDBC (C-JDBC) Performance Report (39 messages)

    C-JDBC is an open source Java middleware that provides fault tolerance and performance scalability to databases using clustering techniques at the JDBC level.

    C-JDBC works with any Java application and does not require application changes. It also works transparently with any database that provides a JDBC driver (MySQL, PostgreSQL, Oracle, DB2, Sybase, SAP DB, HSQLDB, ...).
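    The transparency is at the driver level: an application keeps writing plain JDBC code and only swaps the driver class and URL. A minimal sketch of what this looks like (the driver class name, URL scheme, and default port below are taken from the C-JDBC documentation; treat them as illustrative and verify against your version):

        // Hypothetical connection example: only the driver class and JDBC URL
        // differ from a direct MySQL or PostgreSQL connection.
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class CjdbcClient {
            public static void main(String[] args) throws Exception {
                // Load the C-JDBC driver instead of the database vendor's driver.
                Class.forName("org.objectweb.cjdbc.driver.Driver");
                // Connect to a virtual database exposed by a C-JDBC controller.
                Connection conn = DriverManager.getConnection(
                        "jdbc:cjdbc://controller-host:25322/myVirtualDb", "user", "password");
                try {
                    Statement stmt = conn.createStatement();
                    ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM items");
                    if (rs.next()) {
                        System.out.println("items: " + rs.getInt(1));
                    }
                    rs.close();
                    stmt.close();
                } finally {
                    conn.close();
                }
            }
        }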

    A research report on C-JDBC and its underlying concept RAIDb (Redundant Array of Inexpensive Databases) has been published.

    A performance evaluation using a servlet version of the TPC-W benchmark shows that scalable performance can be achieved using open-source databases and commodity hardware.

    Project home page: http://c-jdbc.objectweb.org

    Threaded Messages (39)

  2. Transactions?

    I admit that I haven't read the report, but right off the bat I know that if you want database integrity you will need to perform two-phase commits, which will add to the latency. Perhaps that's okay performance-wise, because with higher throughput you always get higher latency, but the issue does not inspire confidence.
  3. Transactions?

    No, you don't need two-phase commit. C-JDBC supports transactions and uses parallel transactions. A specific recovery log is used to recover or roll forward a database that went out of sync.
    The assumption is that it usually goes well (which is in fact the case), so you don't have to pay the price of a two-phase commit for each transaction. However, when a database fails, you need to start a recovery process that is not as efficient as if the transaction had been handled with 2PC. In the targeted cluster environment, the behavior of all database backends is the same unless a node crashes. In any case, you have to pay the price of a recovery when a node crashes.
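    A minimal sketch of the common-path idea (hypothetical names, not the actual C-JDBC code): the write is logged first, then executed on every enabled backend in parallel, and a backend that fails is simply disabled and resynchronized later from the log, so no two-phase commit is needed on the common path:

        import java.sql.Connection;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.List;
        import java.util.concurrent.CountDownLatch;

        class WriteBroadcaster {
            interface RecoveryLog { void append(String sql); }

            void executeWrite(String sql, List<Connection> backends, RecoveryLog log)
                    throws InterruptedException {
                log.append(sql);                  // log first so the write can be replayed
                CountDownLatch done = new CountDownLatch(backends.size());
                for (Connection backend : backends) {
                    new Thread(() -> {
                        try (Statement s = backend.createStatement()) {
                            s.executeUpdate(sql); // same write, all backends, in parallel
                        } catch (SQLException e) {
                            disable(backend);     // out-of-sync node: recover it later
                        } finally {
                            done.countDown();
                        }
                    }).start();
                }
                done.await();                     // wait for all backends to finish
            }

            void disable(Connection backend) { /* mark the backend as failed */ }
        }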
  4. Transactions?

    Could you clarify how you are rolling forward the database?

    > No, you don't need two-phase commit. C-JDBC supports transactions and uses parallel transactions. A specific recovery log is used to recover or roll forward a database that went out of sync.
    > The assumption is that it usually goes well (which is in fact the case), so you don't have to pay the price of a two-phase commit for each transaction. However, when a database fails, you need to start a recovery process that is not as efficient as if the transaction had been handled with 2PC. In the targeted cluster environment, the behavior of all database backends is the same unless a node crashes. In any case, you have to pay the price of a recovery when a node crashes.
  5. Transactions?

    > Could you clarify how you are rolling forward the database?

    A recovery log records all SQL statements that update the database (including transaction markers). Checkpoints correspond to indexes in the recovery log and database dumps can be associated with a checkpoint.
    If you want to add a database on the fly, we first restore the dump corresponding to a given checkpoint and replay (roll forward) all the updates since that checkpoint. When the database is in sync with the other nodes, we enable it to serve user queries.
    If a database failed to commit a transaction, we can replay this transaction and resynchronize the node from the failure point. However, if something more serious happened (transaction log full, node failure, ...), we have to completely restore the database from a known checkpoint and replay the updates, the same way we add a new backend.
    Note that the recovery log itself is stored using JDBC and thus can be reinjected into a C-JDBC cluster, providing fault tolerance for the log as well.
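    A sketch of the roll-forward procedure described above (hypothetical names, not the real C-JDBC recovery code): restore the dump associated with a checkpoint, then replay every update logged after that checkpoint:

        import java.sql.Connection;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.List;

        class RollForward {
            // The recovery log maps a checkpoint name to an index and stores all
            // update statements (and transaction markers) in order.
            interface RecoveryLog {
                long indexOf(String checkpoint);
                List<String> updatesSince(long index);
            }

            void resynchronize(Connection newBackend, String checkpoint, RecoveryLog log)
                    throws SQLException {
                restoreDump(newBackend, checkpoint);  // db-specific dump restore
                try (Statement s = newBackend.createStatement()) {
                    for (String sql : log.updatesSince(log.indexOf(checkpoint))) {
                        s.executeUpdate(sql);         // replay logged writes in order
                    }
                }
                enable(newBackend);  // backend is in sync: let it serve user queries
            }

            void restoreDump(Connection c, String checkpoint) { /* restore the dump */ }
            void enable(Connection c) { /* add the backend to the load balancer */ }
        }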
  6. Transactions?

    > A recovery log records all SQL statements that update the database (including transaction markers). Checkpoints correspond to indexes in the recovery log and database dumps can be associated with a checkpoint.

    > If you want to add a database on the fly, we first restore the dump corresponding to a given checkpoint and replay (roll forward) all the updates since that checkpoint. When the database is in sync with the other nodes, we enable it to serve user queries.
    > If a database failed to commit a transaction, we can replay this transaction and resynchronize the node from the failure point. However, if something more serious happened (transaction log full, node failure, ...), we have to completely restore the database from a known checkpoint and replay the updates, the same way we add a new backend.
    > Note that the recovery log itself is stored using JDBC and thus can be reinjected into a C-JDBC cluster, providing fault tolerance for the log as well.

    So does this mean that you have N+1 databases, where N hold the user database and 1 holds the transaction (recovery) log?
  7. Transactions?

    > So does this mean that you have N+1 databases, where N hold the user database and 1 holds the transaction (recovery) log?


    You could, but you can also store the transaction log on the same backends that are hosting your user database (in a separate database or just as extra tables of existing databases). If you are completely paranoid about losing your log, you can even replicate it on each of your N database backends.
  8. Transactions?

    > > So does this mean that you have N+1 databases, where N hold the user database and 1 holds the transaction (recovery) log?

    >
    > You could, but you can also store the transaction log on the same backends that are hosting your user database (in a separate database or just as extra tables of existing databases). If you are completely paranoid about losing your log, you can even replicate it on each of your N database backends.
     
    OK, so when a replica fails and recovers where does it get the most up-to-date state from? Presumably each database must have some way of telling whether it is up-to-date or not?

    Does a recovering database replica have to talk to a majority of the other databases to find the "current" state? Do you trickle updates to that database while it is initialising itself (you could imagine that as a recovering replica becomes up-to-date this will take some time and the "current" state may be moving on as users continue to manipulate the state - or do you block access while replicas continue?)

    Can you tolerate network partitions? Split brain scenario occurs in reality in these kinds of situations and I'd like to know how (or if) you tackle it.

    How do you detect failures? Actually "suspect" failures is a better term, since you can't ever detect a failure with 100% accuracy until (or unless) a failed machine recovers.

    Mark.
  9. Transactions?

    Mark, I cannot find any documentation that answers any of your questions. Have you had any luck?
  10. Transactions?

    > Mark, I cannot find any documentation that answers any of your questions. Have you had any luck?


    Sorry, I've been travelling and only just got back online. I haven't found any documentation that answers my questions, so I'd be interested in any pointers.
  11. Transactions?

    > OK, so when a replica fails and recovers where does it get the most up-to-date

    > state from? Presumably each database must have some way of telling whether it
    > is up-to-date or not?

    When a replica fails, recovery is always handled by the controller using the information stored in the recovery log.
    Every enabled database is always up-to-date unless some writes are still in the controller queue and have not been completed. In any case, the controller has complete knowledge of each backend state, and a query depending on a previous write will wait for the completion of the write before executing.
    Maybe one thing that was not clear in my previous explanation is that the controller never guesses the state of a failed or newly added backend. The controller does not look at the database content to know which state or checkpoint it corresponds to. If we don't know the state of a backend (because of a crash for example), we restore a database dump corresponding to a known checkpoint (also known from the recovery log) and we replay the updates to get this backend synchronized with the others.

    >
    > Does a recovering database replica have to talk to a majority of the other databases to find the "current" state? Do you trickle updates to that database while it is initialising itself (you could imagine that as a recovering replica becomes up-to-date this will take some time and the "current" state may be moving on as users continue to manipulate the state - or do you block access while replicas continue?)

    In general, database replicas do not talk to each other. When a replica recovers, a separate thread in the controller is in charge of recovering the database, and once the recovery log has been replayed, we do the appropriate tricks so that ongoing updates that were not yet logged are played by the new replica before it is enabled.
    >
    > Can you tolerate network partitions? Split brain scenario occurs in reality in these kinds of situations and I'd like to know how (or if) you tackle it.

    This question was asked recently on the mailing list. If you have multiple controllers that use group communication and for some reason the link between the controllers fails, then your controllers will work in different partitions. At the moment, we do nothing about it; users want to be able to plug in their own policy to take the appropriate decision for their needs (blocking writes, authorizing only one controller to perform updates, ...).

    >
    > How do you detect failures? Actually "suspect" failures is a better term, since you can't ever detect a failure with 100% accuracy until (or unless) a failed machine recovers.
    >
    The main problem is that SQLExceptions are not typed. Therefore when we get such an exception, we are not sure whether it is a backend failure or not. What we do is execute the query on a second backend (maybe it's just the query that is bad). If the query fails on the second backend as well, the query is considered "bad" and the exception stack is sent back to the application.
    In general, if one backend fails and others succeed, we try to close and re-open the connection to the failed backend (it is maybe just a stale connection) and retry the query. If it still fails, the node is declared as failed.
    For network partitions between replicated controllers, we rely on JavaGroups' GMS (Group Membership Service) to let us know if a node is suspected or has completely failed.
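    A sketch of the failure-suspicion logic described above (hypothetical names; the real code is more involved): an SQLException alone does not identify a backend failure, so the query is retried on a second backend, and a suspected backend gets one reconnect-and-retry before being declared failed:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;

        class FailureDetector {
            ResultSet execute(String sql, Connection first, Connection second,
                              String firstBackendUrl) throws SQLException {
                try {
                    return first.createStatement().executeQuery(sql);
                } catch (SQLException firstFailure) {
                    ResultSet ok;
                    try {
                        ok = second.createStatement().executeQuery(sql);
                    } catch (SQLException e) {
                        throw e;  // both backends rejected it: the query is "bad"
                    }
                    // Only the first backend failed: maybe just a stale connection,
                    // so close it, reconnect, and retry once before giving up.
                    try {
                        first.close();
                        Connection reopened = DriverManager.getConnection(firstBackendUrl);
                        reopened.createStatement().executeQuery(sql);
                    } catch (SQLException stillFailing) {
                        declareFailed(first);  // node is declared as failed
                    }
                    return ok;
                }
            }

            void declareFailed(Connection backend) { /* disable node, schedule recovery */ }
        }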
  12. Transactions?

    > > OK, so when a replica fails and recovers where does it get the most up-to-date

    > > state from? Presumably each database must have some way of telling whether it
    > > is up-to-date or not?
    >
    > When a replica fails, recovery is always handled by the controller using the information stored in the recovery log.

    But how available is the log? What are the failure assumptions that you make (there have to be some, e.g., no network partitions, no Byzantine failures, etc.)? At some point you can only be as available as the log.

    > Every enabled database is always up-to-date unless some writes are still in the controller queue and have not been completed. In any case, the controller has complete knowledge of each backend state, and a query depending on a previous write will wait for the completion of the write before executing.

    So the controller is the group view manager in effect? Group joins and leaves are communicated to the controller (which must be highly available too)?

    > Maybe one thing that was not clear in my previous explanation is that the controller never guesses the state of a failed or newly added backend. The controller does not look at the database content to know which state or checkpoint it corresponds to. If we don't know the state of a backend (because of a crash for example), we restore a database dump corresponding to a known checkpoint (also known from the recovery log) and we replay the updates to get this backend synchronized with the others.

    I get that, but this then comes back to one of my other questions: restoring an entire database image from one machine to another is not a zero-time occurrence. How do you stream updates that may be occurring simultaneously to this database recovery, or do you freeze updates while recovery occurs?

    >
    > >
    > > Does a recovering database replica have to talk to a majority of the other databases to find the "current" state? Do you trickle updates to that database while it is initialising itself (you could imagine that as a recovering replica becomes up-to-date this will take some time and the "current" state may be moving on as users continue to manipulate the state - or do you block access while replicas continue?)
    >
    > In general, database replicas do not talk to each other. When a replica recovers, a separate thread in the controller is in charge of recovering the database, and once the recovery log has been replayed, we do the appropriate tricks so that ongoing updates that were not yet logged are played by the new replica before it is enabled.
    > >

    OK, that last bit explains my "streaming" question. Is the group membership static then? You say that the controller effectively notices when a recovery occurs, so it must know a machine has recovered. This could be done in a group comms. scheme via, for example, a virtually synchronous group protocol, but there are other means, so I'm interested in what mechanism you employ. Do you support completely new members (e.g., increasing the availability)?

    > > Can you tolerate network partitions? Split brain scenario occurs in reality in these kinds of situations and I'd like to know how (or if) you tackle it.
    >
    > This question was asked recently on the mailing list. If you have multiple controllers that use group communication and for some reason the link between the controllers fails, then your controllers will work in different partitions. At the moment, we do nothing about it; users want to be able to plug in their own policy to take the appropriate decision for their needs (blocking writes, authorizing only one controller to perform updates, ...).
    >

    OK, so the answer is really no, you don't tolerate partitions.

    > >
    > > How do you detect failures? Actually "suspect" failures is a better term, since you can't ever detect a failure with 100% accuracy until (or unless) a failed machine recovers.
    > >
    > The main problem is that SQLExceptions are not typed. Therefore when we get such an exception, we are not sure whether it is a backend failure or not. What we do is execute the query on a second backend (maybe it's just the query that is bad). If the query fails on the second backend as well, the query is considered "bad" and the exception stack is sent back to the application.

    Oh, so you rely on the database connection. For some reason I assumed you were also working at a lower level than this. Does this mean that two SQLExceptions on the same database mean (in your system) that you assume the backend has failed? Seems pretty coarse grained to me.

    > In general, if one backend fails and others succeed, we try to close and re-open the connection to the failed backend (it is maybe just a stale connection) and retry the query. If it still fails, the node is declared as failed.
    > For network partitions between replicated controllers, we rely on JavaGroups' GMS (Group Membership Service) to let us know if a node is suspected or has completely failed.

    So your tolerance of failures is better at the controller level than at the db level. Any particular reason you couldn't have unified these?
  13. Transactions?

    Mark, I think the extensive talk of clustering, RAID levels, and the like is leading you mentally in one direction, when the reality is quite simpler (this is what happened to me).

    Forget about clustering, RAID, and various other theoretical concerns when thinking about C-JDBC. What you've got is a JDBC driver your app uses, which talks to an RMI server process. This server process talks to other databases - using only JDBC. C-JDBC has no special awareness of any database particulars, has no special hooks - just plain old JDBC. It can't even deal with database objects like views.

    So you have a JDBC proxy which can auto-direct queries against certain tables to a specific server (partitioning); a proxy which can load-balance read-only queries across servers (load balancing); and which can replicate writes (replication). But it's all at the JDBC level - there isn't any special knowledge of the DB architecture or transaction log system, no DB heartbeat mechanism, etc.

    In addition, despite the docs, you can't have multiple controllers acting in cooperation to avoid having a single point of failure (not at this time - it's apparently in the works). So, again, it comes down to a JDBC proxy, nothing more, nothing less.

         -Mike
  14. Transactions?

    I think I will stop this useless and endless chat here.
    Oracle is also nothing more and nothing less than a proprietary database ...
  15. Transactions?

    \Emmanuel Cecchet\
    Oracle is also nothing more and nothing less than a proprietary database ...
    \Emmanuel Cecchet\

    As I'm sure you know, Oracle is much more than that. At the same time, your current actual implementation is much _less_ than you have claimed. You may want to build a Redundant Array of Inexpensive Databases, but what you actually have right now is a JDBC driver and an RMI server which proxies JDBC requests. There are interesting bits in there, such as automatically distributing writes, load balancing, and partitioning, but fundamentally this is not a clustering technology; it's a proxy with no special hooks into the resources being proxied to.

    I may seem overly pedantic here, but many people (most?) care about _what the code actually does today_. And are far less interested in what it will do, or what the grand vision is, or what the impressive paper says will come about some day. We've all heard promises and seen lots of cool diagrams, and been burned by the deadly "oh, yeah, that'll be available Real Soon Now". If you want developers to take your effort seriously, it's in your best interests to carefully distinguish between what is now and what may be some day.

        -Mike
  16. Vision

    <quote>
    I may seem overly pedantic here, but many people (most?) care about _what the code actually does today_. And are far less interested in what it will do, or what the grand vision is, or what the impressive paper says will come about some day.
    </quote>

    Vision in an Open Source project is important. A project like Apache Geronimo won't survive without vision, and neither will any other Open Source project. C-JDBC has a very good idea and vision. It can help many Open Source databases become "production" ready in high traffic sites with many cheap servers. Until now I haven't seen a project comparable to C-JDBC (and also Open Source!).

    Instead of complaining, join them and try to make the product better.

    Peace,
    Lofi.
    http://www.openuss.org
  17. Vision

    It's also important to separate the Vision from the Reality.

    So far, the Geronimo guys have done a decent job IMHO of doing that. They've articulated what they're shooting for, and at the same time stated the roadmap for getting there - with all the attendant bootstrapping, compromises, and little hacks that that involves.

    And, at the same time, it's tough to criticize anything about the Geronimo architecture, since it's been done many times before (by the same exact people who have done it before!).

    C-JDBC does appear to be a good idea, but the architecture isn't proven, and they're mixing up Vision with Reality (at least in their pitch). Frankly, it may be impossible to achieve what they're shooting for with a JDBC-only solution, which would put a helluva dent into their architecture.

    As for the dangers, you've demonstrated them in your own post. You say:

     "It can help many Open Source databases to become "production" ready in high traffic sites with many cheap servers. Until now I haven't seen such a comparable project like C-JDBC (and also Open Source!)."

    You're confusing "can" with "may possibly someday". In case you missed it, C-JDBC will not help you get Open Sources become production ready today, because C-JDBC is not production ready. And, interestingly enough, you still haven't seen a comparable product like C-JDBC, because C-JDBC the Reality is _not_ C-JDBC the Vision.

    See also the AOP Alliance stuff from a few months ago. Many people were clapping and applauding and giving all sorts of laurels for people trying something new and daring and fascinating in the Open Source world. The only problem is - they didn't have anything then, and a couple of months later activity has all but ceased on the sourceforge site and mailing list (and there's still nothing there). All the excitement and cheerleading were for naught. I can't say that the same will happen here, but there are many of the same earmarks here. If the C-JDBC folks sustain their momentum, and adopt a more intellectually honest approach as the Geronimo folks have, then maybe it will come to something. But speaking in the present tense about features that aren't even implemented yet isn't a great sign.

         -Mike
  18. Vision

    <quote>
    You're confusing "can" with "may possibly someday". In case you missed it, C-JDBC will not help you get Open Sources become production ready today, because C-JDBC is not production ready. And, interestingly enough, you still haven't seen a comparable product like C-JDBC, because C-JDBC the Reality is _not_ C-JDBC the Vision.
    </quote>

    Software development is a process, and sometimes a long-term process. I still remember in 1999-2000 when I wrote an article about Open Source EJB servers, JOnAS and JBoss, for the German JavaMagazin. Those products were also not production ready. But look at them now (almost 3 years later): they have become a real solution for real-world applications.

    C-JDBC is just a new project with a good idea and intelligent people. Surely they need more time to become a real-world solution (with all the features you would like to have), but this is just the process that every application has to go through.

    By writing such a scientific report, Emmanuel tries to spread and share his knowledge with other people (DB designers, developers, etc.). This is just how scientific work happens. Everyone else is invited to read, to agree, or to disagree with his work. If you are not satisfied with his work, you can extend it or write a new report which shows some of its weak points. By doing this, the work will become better and better. And this activity should also happen with the application itself: Open Source <==> Scientific Work.

    Lofi.
  19. Vision

    I think you have a fundamentally flawed assumption here - that every great idea actually comes to fruition and becomes a production-quality product.

    The reality, of course, is that for every Apache and JBoss and Linux and log4j out there, there are several hundred open source projects that never amount to very much (the ratio could even be much worse). Now - most of us do not have the time to monitor all 700 projects or so and get excited about each and every one of them, and while such projects are embryonic it's impossible to tell if they're destined for greatness or for the dust bin.

    In C-JDBC's case, it's still embryonic and in the "can't tell" phase. More power to the people involved in trying to make it work, but you're overreaching by a mile to call the work involved "scientific".

    \Lofi Dewanto\
    If you are not satisfied with his work, you can extend it or write a new report which shows some of its weak points. By doing this, the work will become better and better.
    \Lofi Dewanto\

    The above is a thoroughly ridiculous statement, through and through. Someone comes up with an (IMHO) flawed premise with no implementation pieces to back up some of the more aggressive claims, and now the onus is on me to make it better and functional? Why on earth would I spend my valuable time doing that? The onus is on _them_ to make it work, not me.

    With that in mind, several people have offered criticisms and suggestions in this thread, but they appear to have fallen on deaf ears.

        -Mike
  20. Vision

    Just some concluding remarks:
    - C-JDBC is still in beta phase and we never claimed that it is comparable to commercial products.
    - Read the research report carefully; its goal is to classify replication strategies and to show the viability of database replication at the middleware level.
    - Many of your assumptions are just wrong because you didn't read carefully what was written. I understand that you do not share our vision and you are probably working on a similar product.
    - We don't try to sell anything; we are just doing open source software to help the open source community.
    - C-JDBC does not make coffee and does not handle earthquakes at the moment, but we are working on it ...
  21. Transactions?

    > Mark, I think the extensive talk of clustering, RAID levels, and the like is leading you mentally in one direction, when the reality is quite simpler (this is what happened to me).

    >
    > Forget about clustering, RAID, and various other theoretical concerns when thinking about C-JDBC. What you've got is a JDBC driver your app uses, which talks to an RMI server process. This server process talks to other databases - using only JDBC. C-JDBC has no special awareness of any database particulars, has no special hooks - just plain old JDBC. It can't even deal with database objects like views.
    >
    > So you have a JDBC proxy which can auto-direct queries against certain tables to a specific server (partitioning); a proxy which can load-balance read-only queries across servers (load balancing); and which can replicate writes (replication). But it's all at the JDBC level - there isn't any special knowledge of the DB architecture or transaction log system, no DB heartbeat mechanism, etc.
    >
    > In addition, despite the docs, you can't have multiple controllers acting in cooperation to avoid having a single point of failure (not at this time - it's apparently in the works). So, again, it comes down to a JDBC proxy, nothing more, nothing less.
    >

    Mike, thanks for the info. I'm coming at this from having been involved with something similar (replicated object stores) many years ago and all of the issues I mentioned are ones we had to solve. I thought this work was going to be more than it turned out to be.

    Mark.
  22. Joins

    According to the report, about 50% of the queries require joins. Table 2 in the report shows most of the tables being replicated on each machine to support these joins. If I'm understanding this right, you might as well use standard RAID under C-JDBC. Presumably you would then get the benefit of increased CPUs from horizontally scaling the database server on top of the filesystem. However, most of the time when a database slows down, it has nothing to do with CPU, but table/row lock contention. Therefore the benefit of C-JDBC seems mostly to be in providing higher availability. The 25% performance improvements cited are for read-mostly workloads.
  23. Joins

    > If I'm understanding this right, you might as well use standard RAID under C-JDBC.


    Yes, you are absolutely right.

    > However, most of the time when a database slows down, it has nothing to do with CPU, but table/row lock contention.

    With a better load distribution, you reduce the table/row lock contention on each database. That is also why we can obtain superlinear speedups when compared to the performance of a single database which spends much more time in synchronization.

    > Therefore the benefit of C-JDBC seems mostly to be in providing higher availability. The 25% performance improvements cited are for read-mostly workloads.

    It is true that performance improvement (compared to a centralized database) is only obtained with read queries, which are the only ones that can be balanced. But it is possible to approximate the write performance of a single database by limiting the replication of tables that are heavily written. If you look at the ordering mix numbers (which is a 50% read / 50% write workload), you will notice that speedups are still very good. Most eCommerce applications have read-mostly workloads, which makes them an ideal target for C-JDBC.
  24. cross-database joins?

    Is it possible to join across tables in different database servers? For example, if table employees is on database 1 and table departments is on database 2, and databases 1 and 2 are on separate machines, can I join between employees and departments?

    Thanks
  25. cross-database joins?

    No, in the current implementation, C-JDBC does not support cross-database joins. People have been talking on the mailing list about embedding an in-memory HSQLDB to perform such joins but nothing is ready yet.
  26. cross-database joins?

    This would be a killer feature, if available. We currently have a gigantic database and a few enormous, rogue tables cause outages due to corruption. We would like to split out the heavy tables into a separate database and would then need to perform cross-db joins. C-JDBC would have been a PERFECT fit if this feature was available.
  27. cross-database joins?

    > This would be a killer feature, if available. We currently have a gigantic database and a few enormous, rogue tables cause outages due to corruption. We would like to split out the heavy tables into a separate database and would then need to perform cross-db joins. C-JDBC would have been a PERFECT fit if this feature was available.


    The main problem with the current plans is that this join would have to fit in memory. Do you think this is realistic?
  28. cross-database joins?

    In my case, yes, because the joins are very restrictive in nature.
  29. cross-database joins?

    The results of the query have an average of about 500 to 1000 rows but the base data over which the query runs before compiling the result could be big.

    How would the in-memory solution work? Would you somehow split the query to run as separate queries on the separate databases, gather the results and then do the join?
  30. cross-database joins?

    > How would the in-memory solution work? Would you somehow split the query to run as separate queries on the separate databases, gather the results and then do the join?


    Yes, you would just issue the appropriate sub-selects on each database and then perform an in-memory join in the controller.
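    A sketch of how such an in-memory join could work (hypothetical; C-JDBC does not implement this yet), using the employees/departments example from above:

        import java.sql.Connection;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        class InMemoryJoin {
            // SELECT e.name, d.name FROM employees e JOIN departments d ON e.dept_id = d.id
            List<String[]> join(Connection db1, Connection db2) throws SQLException {
                // Sub-select on database 2: hash the departments by id.
                Map<Integer, String> departments = new HashMap<>();
                try (Statement s = db2.createStatement();
                     ResultSet rs = s.executeQuery("SELECT id, name FROM departments")) {
                    while (rs.next()) {
                        departments.put(rs.getInt("id"), rs.getString("name"));
                    }
                }
                // Sub-select on database 1: probe the hash table for each employee.
                List<String[]> result = new ArrayList<>();
                try (Statement s = db1.createStatement();
                     ResultSet rs = s.executeQuery("SELECT name, dept_id FROM employees")) {
                    while (rs.next()) {
                        String dept = departments.get(rs.getInt("dept_id"));
                        if (dept != null) {
                            result.add(new String[] { rs.getString("name"), dept });
                        }
                    }
                }
                return result;  // note: the whole join has to fit in memory
            }
        }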
  31. cross-database joins?

    We found this product did a good job with distributed joins (it sprang from the IBM Almaden "Garlic" project).

    http://www-3.ibm.com/solutions/lifesciences/solutions/discoverylink.html
  32. Great stuff!

    Emmanuel,

    great stuff! I hope to re-start my c-jdbc integration in OpenUSS at the beginning of the next semester (October), as we will get a lot more users (our students) then (hope to get 2000 more ;-) --> 13,000).

    Greets,
    Lofi.
    http://www.openuss.org
  33. Kudos on the interesting report and research.

    Some questions and points...

    First, I'd be very curious in seeing the comparison of C-JDBC to an approach using the vendor's (or otherwise) JDBC implementation alone to get a better feel for what overhead C-JDBC imposes.

    Second, (relatedly) I'm a bit wary of the overhead imposed by multiple schedulers, caches, etc. Witness the trend in the Linux (and Solaris) multi-threading implementations away from the complexity & overhead of multiple schedulers as a quick example of how less can be more. I'm also concerned that the need to parse requests and interpret DB schemas will add to overhead.

    Thirdly, I'm wondering about the impact of the (commonplace) use of stored procedures. The controller would obviously need to know which db the stored proc resides on for partitioned database scenarios... How would this be done? Also, are there hidden issues in parameter passing and stored procedure invocation that would make use of different database backends difficult? e.g. Is a timestamp var passed to DB2 the same as one for Postgres, SQL Server, etc.?

    Finally, I'm a bit confused by the suggestion (in section 3.5, page 7) in the report that 4 RDBMS backends would be needed to support RAIDb-2ec. Could you kindly enlighten me?
  34.

    > First, I'd be very curious in seeing the comparison of C-JDBC to an approach using the vendor's (or otherwise) JDBC implementation alone to get a better feel for what overhead C-JDBC imposes.


    In fact, in all experiments, the singleDB configuration does not use C-JDBC at all but directly uses the MySQL driver to access the database. We did not compare the single-node version against a single node using C-JDBC. It could be interesting to evaluate the basic overhead that way, but I would expect people to use caching in that case.
    >
    > Second, (relatedly) I'm a bit wary of the overhead imposed by multiple schedulers, caches, etc. Witness the trend in the Linux (and Solaris) multi-threading implementations away from the complexity & overhead of multiple schedulers as a quick example of how less can be more. I'm also concerned that the need to parse requests and interpret DB schemas will add to overhead.

    I am not sure I exactly understand your question, but you only use one scheduler and one cache per database. If you have controller replication, each controller will use the exact same scheduler for a given virtual database, and they only synchronize on writes and transaction markers.
    In fact, C-JDBC comes with many caches, including caches for parsing. When using a PreparedStatement, we just parse the skeleton (the query with the ?), which never changes whatever values are inserted. The result of the parsing is stored in a cache, and for most applications that always issue the same set of queries (very often the case in eCommerce applications), the parsing is done just once, when the first query is executed.
    It is true that the more options you want (controller replication, caching, recovery log, ...), the more overhead you have, but it is up to you to choose the right tradeoff for your environment. Maybe later we will find a way to automatically configure the whole thing, but you will have to wait several years before the first release of this feature!
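    A sketch of the parsing cache described above (hypothetical names): the query skeleton, i.e. the SQL text with its ? placeholders, is the cache key, so a PreparedStatement that is issued repeatedly with different values is parsed only once:

        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        class ParsingCache {
            private final Map<String, ParsedQuery> cache = new ConcurrentHashMap<>();

            // e.g. skeleton = "SELECT * FROM items WHERE id = ?"
            ParsedQuery getParsed(String skeleton) {
                return cache.computeIfAbsent(skeleton, this::parse);
            }

            private ParsedQuery parse(String sql) {
                // Expensive work done once per distinct skeleton: extracting the
                // tables touched, whether the query is read-only, and so on.
                return new ParsedQuery(sql);
            }

            static class ParsedQuery {
                final String sql;
                ParsedQuery(String sql) { this.sql = sql; }
            }
        }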
    >
    > Thirdly, I'm wondering about the impact of the (commonplace) use of stored procedures. The controller would obviously need to know which db the stored proc resides on for partitioned database scenarios... How would this be done? Also, are there hidden issues in parameter passing and stored procedure invocation that would make use of different database backends difficult? e.g. Is a timestamp var passed to DB2 the same as one for Postgres, SQL Server, etc.?
    >
    The problem with stored procedures is that we don't know anything about their side effects. For example, on each stored procedure call we have to completely flush the cache (unless you set the connection to read-only), since we can't know which entries will be affected by the stored procedure.
    The JDBC API provides a way to retrieve the stored procedures in each database, but as we can't know whether the content is read-only or not, we always send the stored procedure to all backends. The JDBC API is very limited in what can be done with stored procedures, and it is easy to break the system with them. At the moment we haven't had much demand for stored procedures, but they are supported.
    Incompatible typing in heterogeneous databases usually requires request rewriting. We are working on exposing a simple API to the user so that he can express rewriting rules to make requests (or stored procedure calls) compatible across all backends.

    > Finally, I'm a bit confused by the suggestion (in section 3.5, page 7) in the report that 4 RDBMS backends would be needed to support RAIDb-2ec. Could you kindly enlighten me?

    When I have to do error checking, I need 3 copies of a table to have a majority to decide on the final result. If I have only 3 databases, it means that each of them must have a copy of the full content, which corresponds to a RAIDb-1ec configuration with 3 nodes. I need at least 4 backends if I want a solution without full replication (with 4 nodes, each table can be replicated on 3 of the 4 backends, so no single backend needs to host the full database).

    Let me know if you have further questions and don't hesitate to post on c-jdbc at objectweb dot org
    Emmanuel
  35.

    Hi Emmanuel,

    I read the report. It didn't clearly state how many controllers were being used for the tests (i.e. was it just one centralized controller, or were you adding a controller per database server). Also, I found it concerning that the controller was run on an additional server that was 4x the speed of the database server; doesn't that strike you as odd? What would the performance have been, for example, if you had to run one controller on each database host? (I may have misunderstood some of this, so please clarify the setup.)

    All in all, the report was very impressive in its scope, but I still had a lot of questions remaining after reading it.

    Peace,

    Cameron Purdy
    Tangosol, Inc.
    Coherence: Easily share live data across a cluster!
  36.

    Cameron,

    > I read the report. It didn't clearly state how many controllers were being used for the tests (i.e. was it just one centralized controller, or were you adding a controller per database server).

    We just used a single centralized controller.

    > Also, I found it concerning that the controller was run on an additional server that was 4x the speed of the database server; doesn't that strike you as odd? What would the performance have been, for example, if you had to run one controller on each database host? (I may have misunderstood some of this, so please clarify the setup.)

    The main problem for these experiments was the number of machines available. I only had 6 machines with the same hardware config, and then some extra machines for the controller and the client. Running the databases on the slow nodes allowed us to saturate them with a single node generating the load. With faster machines, I would need a cluster just to generate the workload used as input to the C-JDBC cluster.
    I just received new machines, so now I'll be able to play with larger setups. In any case, on the controller machine we used, we measured a throughput of 110,000 req/min (the maximum generated by the load generator) with a peak CPU usage of 15% on the controller node (about 10% on average). So, we can expect to handle roughly 1 million req/min on the machine we used.
    So, the setup was 1 client -> 1 Controller -> up to 6 dbs
    >
    > All in all, the report was very impressive in its scope, but I still had a lot of questions remaining after reading it.

    Don't hesitate to contact us if you have further questions.

    Emmanuel
  37.

    This looks like some interesting work, and overall the report looks better written than some I've come across in the industry :-) But I have a few questions/observations:

     - Responsibility for recovery is a combination of individual RDBMS recovery (by the admin from the logs) and C-JDBC recovery on top of that. This is, effectively, a manual recovery process. It appears to an extent you're punting on the failure/recovery part - and that is where much of your performance gains are coming from. When you have auto-recovery (the more the better!), you'll find that this is where the biggest performance losses occur.

     - On not using XA....this ties into failure/recovery comments above. If you circumvent XA, it's _very_ easy for your databases to get out of sync if something like a network failure occurs. This can result in chaos in your operations room (I know - I've seen this scenario first hand). The majority of cases where I've seen people looking into fairly serious recovery (and your paper seems to indicate that you're fairly serious), people are definitely looking for speed, but a corrupt and out of sync data center (RAIDb in your parlance) is their worst nightmare, and they're willing to sacrifice some speed to get some guarantees towards auto-recovery.

     - You're only using MySQL in your tests. This seems, ah, rather dicey to me. People concerned with clustering (speed and reliability) typically use something a bit heavier in weight.

     - You stated in your comments to Cameron that you used rather underpowered RDBMS nodes, and a more powerful "controller" node, and that this made it easy for one client to saturate a given RDBMS node. Frankly, and I don't mean to be rude, but this is known as skewing the test towards your architecture. It could very well be that a single RDBMS node running on reasonable hardware (and something a bit more souped up than MySQL) would perform better than your C-JDBC cluster. People don't want to know C-JDBC performance on a rather weird and esoteric setup. They want to know total costs of "traditional" RDBMS architectures on a hardware setup suited to it, vs. the total costs of a reasonable C-JDBC setup. If you show tests where a single client can saturate an RDBMS node, most grizzled IT managers will laugh at you. I know, I know, the point is "Redundant Array of Inexpensive DBs", with emphasis on inexpensive, but total cost does have to be calculated on traditional solutions vs. this one.

     - It's unclear to me how C-JDBC checkpoints are synchronized with RDBMS checkpoints. It seems to me that C-JDBC and each database need to be checkpointed together to have a hope of making sense of things in case of a failure. If each DB is not checkpointed simultaneously with C-JDBC, then when you replay from the C-JDBC logs you could get a slew of errors, with no clue as to which are spurious and which are real problems. This, again, is a nightmare for operations people - you try to recover and get a boatload of errors, and have no way to tell if you should worry about them or not.

    --------

    Frankly, when looking at tests like this, the most value I see is showing the performance numbers, and then going ahead and showing the recovery scenarios right after. This proves out not only performance, but that the system works in the face of problems.

    I once told a project manager "I can ignore error recovery, and give you a 500% performance boost [this was conservative]. Or I can plain do the wrong thing and give you a nearly infinite performance boost - and cross my fingers that you don't validate the results :-)". Your report scares me a bit because it goes to great lengths to describe failure-recovery and balancing scenarios, then jumps right into performance numbers without proving that the various failure/recovery options work.

        -Mike
  38.

    > - Responsibility for recovery is a combination of individual RDBMS recovery (by the admin from the logs) and C-JDBC recovery on top of that.


    No, you never have to deal with the database admin logs. The only thing that is manual at the moment is restoring a dump using the database-specific dump tool. We are working on automating this process using Octopus (http://octopus.enhydra.org).

    > - On not using XA....this ties into failure/recovery comments above. If you circumvent XA, it's _very_ easy for your databases to get out of sync if something like a network failure occurs. This can result in chaos in your operations room (I know - I've seen this scenario first hand). The majority of cases where I've seen people looking into fairly serious recovery (and your paper seems to indicate that you're fairly serious), people are definitely looking for speed, but a corrupt and out of sync data center (RAIDb in your parlance) is their worst nightmare, and they're willing to sacrifice some speed to get some guarantees towards auto-recovery.

    As soon as a network failure occurs, nodes are automatically disabled because C-JDBC does not allow replicas to be out of sync. Remember that this is an ongoing volunteer effort that primarily targets cluster environments where the kind of failure you are talking about never happens (either everything fails or only a few nodes fail).

    > - You're only using MySQL in your tests. This seems, ah, rather dicey to me. People concerned with clustering (speed and reliability) typically use something a bit heavier in weight.

    If you have the time (and money for licenses) to redo the performance evaluation with other databases, we would be more than happy to look at the numbers. Also, if Oracle lets you publish the results, that would be great. Once again, remember that this is an open source project, and I don't think that Oracle admins will switch one day to a solution using C-JDBC and open source databases. DB admins using high-end commercial databases will never look at these open source solutions because they want to have someone to sue if the system does not work. Unfortunately, this is not possible with open source solutions.
    If you don't think that MySQL or Postgres are serious solutions, then I think that C-JDBC is likely to be just as valuable as these products for you.
    >
    > - You stated in your comments to Cameron that you used rather underpowered RDBMS nodes, and a more powerful "controller" node, and that this made it easy for one client to saturate a given RDBMS node. Frankly, and I don't mean to be rude, but this is known as skewing the test towards your architecture. It could very well be that a single RDBMS node running on reasonable hardware (and something a bit more souped up than MySQL) would perform better than your C-JDBC cluster. People don't want to know C-JDBC performance on a rather weird and esoteric setup. They want to know total costs of "traditional" RDBMS architectures on a hardware setup suited to it, vs. the total costs of a reasonable C-JDBC setup. If you show tests where a single client can saturate an RDBMS node, most grizzled IT managers will laugh at you. I know, I know, the point is "Redundant Array of Inexpensive DBs", with emphasis on inexpensive, but total cost does have to be calculated on traditional solutions vs. this one.
    >
    FYI, on our testbed, Oracle performs in the best case as well as MySQL (I remind you that we are using uniprocessor machines). Even on dual-processor machines, Oracle and MySQL perform the same. The goal of these experiments (as stated in the paper) was not to get an absolute performance number but to see if the approach was viable and could lead to substantial speedups. If you are willing to offer us a suitable testbed to reproduce the experiments, we would be happy to redo the measurements using these machines (note that you need at least 16 machines ...).

    > - It's unclear to me how C-JDBC checkpoints are synchronized with RDBMS checkpoints.

    They are not. We are not using RDBMS checkpoints; we are managing our own checkpoints, since all checkpointing systems are DB specific and our tool is generic.
     
    > Frankly, when looking at tests like this, the most value I see is showing the performance numbers, and then going ahead and showing the recovery scenarios right after. This proves out not only performance, but that the system works in the face of problems.

    We are working on automating the recovery process and we are preparing another article on it. If you want to contribute to the project, you are welcome; we are always looking for contributors.
    >
    > I once told a project manager "I can ignore error recovery, and give you a 500% performance boost [this was conservative]. Or I can plain do the wrong thing and give you a nearly infinite performance boost - and cross my fingers that you don't validate the results :-)". Your report scares me a bit because it goes to great lengths to describe failure-recovery and balancing scenarios, then jumps right into performance numbers without proving that the various failure/recovery options work.

    I should apologize for the scary report, but once again, don't be scared by open source! ;-)

    Emmanuel
  39.

    \Emmanuel Cecchet\
    No, you never have to deal with the database admin logs. The only thing that is manual at the moment is restoring a dump using the database-specific dump tool. We are working on automating this process using Octopus (http://octopus.enhydra.org).
    \Emmanuel Cecchet\

    This was part of my point - there's a manual step. In addition, I don't believe that your current version "auto-discovers" when an RDBMS instance is back up; you need to manually add it back into the cluster, correct? That's two manual steps - I imagine there may be more.

    \Emmanuel Cecchet\
    As soon as a network failure occurs, nodes are automatically disabled because C-JDBC does not allow replicas to be out of sync. Remember that this is an ongoing volunteer effort that primarily targets cluster environments where the kind of failure you are talking about never happens (either everything fails or only a few nodes fail).
    \Emmanuel Cecchet\

    The nature of your work force does not change the nature of the problem you're addressing.

    As for various failure types - it's more common than you may believe for NICs to fail, or a router to go flaky (and often pilot error is involved). Time and again I've seen people come along saying, in effect, "Oh, ignore those things, they never happen, and look - we give you an XX% performance boost". Meanwhile, people are buying into the solution because it purports to give them safety and speed - when in fact the safety is not as complete as they realize.

    In the case of your product and how I understand the actual implementation right now, it works very well when everything else works very well. Beyond this, I'd characterize it as an accident waiting to happen.

    \Emmanuel Cecchet\
    If you have the time (and money for licenses) to redo the performance evaluation with other databases, we would be more than happy to look at the numbers. Also, if Oracle lets you publish the results, that would be great. Once again, remember that this is an open source project, and I don't think that Oracle admins will switch one day to a solution using C-JDBC and open source databases. DB admins using high-end commercial databases will never look at these open source solutions because they want to have someone to sue if the system does not work. Unfortunately, this is not possible with open source solutions.
    If you don't think that MySQL or Postgres are serious solutions, then I think that C-JDBC is likely to be just as valuable as these products for you.
    \Emmanuel Cecchet\

    The problem I see here is one of perceptions. Your web site, performance paper, docs, etc. are all written aggressively in terms of taking on the big boys, of getting Oracle RAC-like reliability and even better performance with a "Redundant Array of Inexpensive DBs". You then go and publish performance results against MySQL only (not even Postgres), and on underpowered machines no less.

    You're doing a repeated (and classic) bait and switch here. You make extensive comparisons to production-strength solutions, claim you do as well with your solution, then back away and say "we're only volunteer open source developers". You go to extensive lengths to document N-teen "RAIDb levels", then we find that many/most aren't implemented. Worst of all, despite extensive documentation on the subject, C-JDBC still doesn't support horizontal grouping/replication of controllers - meaning that your solution boils down to a single point of failure right now.

    Again, I regret the very negative tone, but you make very aggressive claims in your literature and web site, and it's a giant let-down to dive in deeper and find that most of those claims aren't yet implemented. And being open source and staffed by volunteers isn't an excuse for that - just don't make announcements until things actually work, be intellectually honest, and be frank in your documentation, documenting only the things that actually work today. In a very similar way to what happened to the AOPAlliance and JBoss AOP, and others, you announced a grand project in its infancy with grand claims and little working code, and in doing so you've mostly tarnished your reputation and lost a lot of potential users.

    Earth to open source developers: stuff your ego in the closet, do the work that you love in the manner and timing of your choosing, and above all don't go shouting about it from the rooftops until the code is actually ready for prime time.
    \Emmanuel Cecchet\
    I should apologize for the scary report, but once again, don't be scared by open source! ;-)
    \Emmanuel Cecchet\

    I'm not afraid of Open Source. I am afraid of widely published projects that make bold claims that aren't matched by the code - for the simple reason that some egghead may actually try to use it where I work, and _I_ then have to help undo the mess.

        -Mike
  40. Never mind....

    Well, I suppose my previous comments are irrelevant. I checked out the mailing list, and it appears that the most important aspects cited in the white paper aren't necessarily there yet - like horizontal scalability, among other things. My panties always get in a bunch when a paper goes into great detail about how something's supposed to work, releases some performance numbers, and then you read elsewhere that those critical features (which often greatly impact performance) are "coming soon to a product near you!".

    Sorry for the negative vibes, but releasing this sort of "report" when critical features are missing just wastes a lot of people's time. A cluster solution with manual recovery and a single point of failure is IMHO just a waste of time.

        -Mike