[Excerpted from Abadi's blog. For the full version, go to http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html ] Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source. That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
- Posted by: Peter Varhol
- Posted on: July 21 2009 09:17 EDT
- Re: Yale Researchers Create Hadoop-database Cross by Nikita Ivanov on July 21 2009 13:10 EDT
- Re: Yale Researchers Create Hadoop-database Cross by Ilya Sterin on July 21 2009 14:55 EDT
- Re: Yale Researchers Create Hadoop-database Cross by shawn spencer on July 21 2009 19:17 EDT
How is it different (better or worse) than in-memory data grids such as Coherence, for example? Thanks, -- Nikita Ivanov GridGain - Cloud Development Platform
So my understanding is that this is for asynchronous jobs. Hadoop has access to PostgreSQL for retrieving and storing data during the map/reduce phase? I haven't read the paper yet (I plan to), but am I misunderstanding it? If this is so, how is that any different from using Hadoop with the RDBMS data provider plugin? If I remember correctly, there is one. I understand the distributed properties of it. But you still have to partition your data manually, and you still have to implement app-specific transactional semantics. Does this even support serialized transactions for data integrity? I didn't see anything mentioned about it in the paper while doing a quick scan. Either way, I hope I'm misunderstanding this, but if the above is true, Nikita is right in terms of it being pretty much a distributed in-memory grid with a relational store to mirror some data on the distributed disk. Ilya
I did a quick scan of the paper... from what I can gather, they enable the use of postgresql (or mysql) as a backing store for structured data. So each Hadoop slave node would have its own instance of a postgresql DB. A 'table' would be stored across multiple nodes and I assume that it's partitioned on some column value(s). They've also extended Hive so that it can execute a query that breaks down not only into M/R jobs but also SQL queries. This means operations such as JOINs could be done natively in the RDBMS and would theoretically perform better, provided that the data sets are partitioned well... I'm skeptical but I'd love for someone to convince me otherwise.
Yeah, same here. Sounds the same as GigaSpaces (and Coherence?), but using Hadoop for distributing data. How can SQL query joins run natively when data is partitioned? This can only happen if the data being joined is all partitioned to the same physical RDBMS node. Do they transparently handle situations where data might reside on multiple physical nodes? If not, then this is nothing more than a sharded database that handles some concerns like partitioning and job management through a proxy. We built this ourselves in a few weeks' time, and it would probably take that long to figure this out and set it up. Ilya
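To make the co-location point in the two comments above concrete, here is a minimal sketch of a join that runs entirely inside each node's local database because both tables are hash-partitioned on the join key. It is an illustration only, not HadoopDB's actual code: sqlite3 in-memory databases stand in for the per-node PostgreSQL instances, and the table names, `NUM_NODES`, `node_for`, and the sample rows are all hypothetical.

```python
import sqlite3
import zlib

NUM_NODES = 3  # hypothetical cluster size

def node_for(key):
    # Stable hash, so equal join keys from either table land on the same node.
    return zlib.crc32(str(key).encode()) % NUM_NODES

def make_node():
    # sqlite3 is only a stand-in here for one node's local PostgreSQL instance.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE visits (user_id INTEGER, url TEXT)")
    return conn

def load(nodes, users, visits):
    # Both tables are partitioned on user_id, the join key.
    for row in users:
        nodes[node_for(row[0])].execute("INSERT INTO users VALUES (?, ?)", row)
    for row in visits:
        nodes[node_for(row[0])].execute("INSERT INTO visits VALUES (?, ?)", row)

LOCAL_JOIN = (
    "SELECT u.name, v.url FROM users u "
    "JOIN visits v ON u.user_id = v.user_id"
)

def distributed_join(nodes):
    # Each node joins only its own partitions; the results are concatenated.
    results = []
    for conn in nodes:
        results.extend(conn.execute(LOCAL_JOIN).fetchall())
    return sorted(results)

users = [(1, "ann"), (2, "bob"), (3, "cy")]
visits = [(1, "/a"), (1, "/b"), (3, "/c"), (4, "/d")]

nodes = [make_node() for _ in range(NUM_NODES)]
load(nodes, users, visits)
joined = distributed_join(nodes)
print(joined)
```

If `visits` were instead partitioned on `url`, matching rows could sit on different nodes and the purely local join would silently drop them. In that case the rows have to be repartitioned on the join key first, which is presumably the kind of step the post means when it says plans are processed partly in Hadoop and partly in PostgreSQL.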
How is it different (better or worse) than in-memory data grids such as Coherence, for example? Or Terracotta ...
I'm pretty clear about Terracotta - it's like apples and oranges. Coherence or GigaSpaces is more of a grey area. Wish those researchers would read TSS and provide some answers... It appears that TSS just grabbed their blog and we can't get any answers, which is sad. -- Nikita Ivanov. GridGain - Cloud Development Platform