-
Yale Researchers Create Hadoop-database Cross (6 messages)
- Posted by: Peter Varhol
- Posted on: July 21 2009 09:17 EDT
[Excerpted from Abadi's blog. For the full version, go to http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html] Ideally there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source. That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
Threaded Messages (6)
- Re: Yale Researchers Create Hadoop-database Cross by Nikita Ivanov on July 21 2009 13:10 EDT
- Re: Yale Researchers Create Hadoop-database Cross by Ilya Sterin on July 21 2009 14:55 EDT
- Re: Yale Researchers Create Hadoop-database Cross by Patrick Angeles on July 22 2009 10:04 EDT
- Re: Yale Researchers Create Hadoop-database Cross by Ilya Sterin on July 22 2009 12:48 EDT
- Re: Yale Researchers Create Hadoop-database Cross by shawn spencer on July 21 2009 19:17 EDT
- Re: Terracotta by Nikita Ivanov on July 22 2009 01:02 EDT
-
Re: Yale Researchers Create Hadoop-database Cross
- Posted by: Nikita Ivanov
- Posted on: July 21 2009 13:10 EDT
- in response to Peter Varhol
How is it different (better or worse) than in-memory data grids such as Coherence, for example? Thanks,
-- Nikita Ivanov, GridGain - Cloud Development Platform
-
Re: Yale Researchers Create Hadoop-database Cross
- Posted by: Ilya Sterin
- Posted on: July 21 2009 14:55 EDT
- in response to Nikita Ivanov
So my understanding is that this is for asynchronous jobs. Hadoop has access to PostgreSQL for retrieving and storing data during the map/reduce phase? I haven't read the paper yet (I plan to), but am I misunderstanding it? If this is so, how is that any different from using Hadoop with the RDBMS data provider plugin? If I remember correctly, there is one. I understand the distributed properties of it. But you still have to partition your data manually, and you still have to implement app-specific transactional semantics. Does this even support serializable transactions for data integrity? I didn't see anything about it in the paper during a quick scan. Either way, I hope I'm misunderstanding this, but if the above is true, Nikita is right that it is pretty much a distributed in-memory grid with a relational store to mirror some data on the distributed disk.
Ilya
-
Re: Yale Researchers Create Hadoop-database Cross
- Posted by: Patrick Angeles
- Posted on: July 22 2009 10:04 EDT
- in response to Ilya Sterin
I did a quick scan of the paper. From what I can gather, they enable the use of PostgreSQL (or MySQL) as a backing store for structured data. So each Hadoop slave node would have its own instance of a PostgreSQL DB. A 'table' would be stored across multiple nodes, and I assume that it's partitioned on some column value(s). They've also extended Hive so that it can execute a query that breaks down not only into M/R jobs but also into SQL queries. This means operations such as JOINs could be done natively in the RDBMS and would theoretically perform better, provided that the data sets are partitioned well. I'm skeptical, but I'd love for someone to convince me otherwise.
-
Re: Yale Researchers Create Hadoop-database Cross
- Posted by: Ilya Sterin
- Posted on: July 22 2009 12:48 EDT
- in response to Patrick Angeles
Yeah, same here. Sounds the same as GigaSpaces (and Coherence?), but using Hadoop for distributing data. How can SQL query joins run natively when data is partitioned? This can only happen if the data being joined is all partitioned to the same physical RDBMS node. Do they transparently handle situations where data might reside on multiple physical nodes? If not, then this is nothing more than a sharded database that handles some concerns, like partitioning and job management, through a proxy. We built this ourselves in a few weeks' time, and it would probably take that long to figure this out and set it up.
Ilya
-
Re: Yale Researchers Create Hadoop-database Cross
- Posted by: shawn spencer
- Posted on: July 21 2009 19:17 EDT
- in response to Nikita Ivanov
How is it different (better or worse) than in-memory data grids such as Coherence, for example?
or Terracotta ...
Thanks,
--
Nikita Ivanov
GridGain - Cloud Development Platform
-
Re: Terracotta
- Posted by: Nikita Ivanov
- Posted on: July 22 2009 01:02 EDT
- in response to shawn spencer
I'm pretty clear about Terracotta: it's like apples and oranges. Coherence or GigaSpaces is more of a grey area. I wish those researchers would read TSS and provide some answers... It appears that TSS just grabbed their blog and we can't get any answers, which is sad.
-- Nikita Ivanov, GridGain - Cloud Development Platform
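To make the partitioning question raised by Patrick Angeles and Ilya Sterin concrete, here is a minimal sketch of a join pushed down into per-node databases. It is not HadoopDB code and does not claim to show how HadoopDB handles cross-node joins. It assumes in-memory SQLite databases as stand-ins for the PostgreSQL instances, a modulo hash as the partitioning function, and hypothetical table names (`users`, `orders`) and helper names. Each "node" runs the same SQL join locally (the map side), and the partial sums are added together afterwards (the reduce side).

```python
import sqlite3
from collections import Counter

NUM_NODES = 3  # hypothetical cluster size


def node_for(key: int) -> int:
    """Pick a node by hashing the partitioning key (plain modulo here)."""
    return key % NUM_NODES


def make_nodes():
    """One in-memory SQLite database per 'node' (stand-in for PostgreSQL)."""
    nodes = [sqlite3.connect(":memory:") for _ in range(NUM_NODES)]
    for db in nodes:
        db.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
        db.execute(
            "CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount INTEGER)"
        )
    return nodes


def load(nodes, users, orders, orders_key):
    """Route each row to a node; orders_key names the orders partition column."""
    for user_id, country in users:
        nodes[node_for(user_id)].execute(
            "INSERT INTO users VALUES (?, ?)", (user_id, country)
        )
    for order_id, user_id, amount in orders:
        key = user_id if orders_key == "user_id" else order_id
        nodes[node_for(key)].execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount)
        )


# The join and partial aggregation run inside each node's database.
LOCAL_SQL = """
    SELECT u.country, SUM(o.amount)
    FROM users u JOIN orders o ON u.user_id = o.user_id
    GROUP BY u.country
"""


def revenue_by_country(nodes):
    """'Map': run the join locally on every node. 'Reduce': add partial sums."""
    totals = Counter()
    for db in nodes:
        for country, partial in db.execute(LOCAL_SQL):
            totals[country] += partial
    return dict(totals)


USERS = [(1, "US"), (2, "PL"), (3, "US"), (4, "EG")]
ORDERS = [(10, 1, 5), (11, 2, 7), (12, 3, 1), (13, 4, 2), (14, 1, 3)]

# Both tables partitioned on the join key: every match is node-local.
co_partitioned = make_nodes()
load(co_partitioned, USERS, ORDERS, orders_key="user_id")
good = revenue_by_country(co_partitioned)

# Orders partitioned on a different column: some matches span nodes.
mis_partitioned = make_nodes()
load(mis_partitioned, USERS, ORDERS, orders_key="order_id")
bad = revenue_by_country(mis_partitioned)
```

With these sample rows, the co-partitioned run produces the full totals, while the run partitioned on `order_id` silently drops order 14, because user 1 lives on a different node than that order. This is the situation Ilya describes: a node-local join is only correct when both sides are partitioned on the join key, and anything else needs a repartitioning or shuffle step before the join.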