Feature-wise comparison of Apache Hadoop vs Spark vs Flink
This Apache Hadoop vs Spark vs Flink tutorial is a comprehensive,
feature-wise comparison of Apache Hadoop, Apache Spark and Apache Flink.
Learn the differences between Spark and Flink, and which new features in
Flink lead some to call it the 4G of Big Data. See why the industry has
moved from Hadoop to Spark and is now considering a move to Flink, and
how Hadoop and Flink differ. In this guide we discuss the features of
Apache Hadoop, Apache Spark and Apache Flink and the differences between
them, and address questions such as: will Flink replace Spark? Which
will be the successor of Hadoop, Spark or Flink?
Comparison of Apache Hadoop vs Apache Spark vs Apache Flink
Features | Apache Hadoop | Apache Spark | Apache Flink |
---|---|---|---|
Data Processing Engine | At its core, Hadoop’s MapReduce is a batch processing engine | At its core, Apache Spark is a batch processing engine | At its core, Apache Flink is a stream processing engine |
Language Support | Primarily Java, but other languages such as C, C++, Ruby, Groovy, Perl and Python are also supported through Hadoop Streaming | Supports Java, Scala, Python and R | Supports Java as well as Scala |
Language Developed | Hadoop is developed in Java | Spark is developed in Scala | Flink is developed in Java & Scala |
Processing Speed | MapReduce processes data much more slowly than Spark and Flink. | Spark is reported to run up to 100 times faster than MapReduce because it processes data in memory. | Flink processes data faster than Spark because of its underlying streaming engine. |
Iterative Processing | Does not support iterative processing natively. | Spark iterates over its data in batches. For iterative processing in Spark, each iteration has to be scheduled and executed separately. | Flink provides native support for iterative processing and iterates over data using its streaming architecture. Flink can be instructed to process only the parts of the data that have actually changed, which significantly improves job performance. |
Stream Processing | MapReduce is a purely batch-oriented data processing tool. It does not support stream processing. | Spark uses micro-batches for all streaming workloads. This is not sufficient for use cases that need to process large streams of live data and return results in real time with low latency. | Apache Flink is a true streaming engine. It uses streams for all workloads: streaming, SQL, micro-batch and batch. A batch is treated as a finite set of streamed data. |
Computation Model | MapReduce adopts a batch-oriented model. Batch processing essentially works on data at rest: it takes a large amount of data at once, processes it and then writes the output. | Spark’s core also follows the batch model but has adopted micro-batching, which is essentially used to handle near-real-time processing. | Flink has adopted a continuous-flow, operator-based streaming model. A continuous-flow operator processes data as soon as it arrives, with no delay between collecting the data and processing it. |
Memory Management | Hadoop provides configurable memory management. An admin can set it up through configuration files. | Spark provides configurable memory management, although as of Spark 1.6 it has moved towards automating memory management as well. | Flink provides automatic memory management. It has its own memory management system, separate from Java’s garbage collector. |
Window criteria | NA | Spark has time-based window criteria. | Flink has record-based, time-based or any custom user-defined window criteria. |
Optimization | In MapReduce, jobs have to be optimized manually. | In Apache Spark, jobs have to be optimized manually. | Flink jobs are optimized automatically. Flink comes with an optimizer that is independent of the actual programming interface. |
Latency | Apache Hadoop has higher latency than both Spark and Flink. | Apache Spark has higher latency than Apache Flink. | With minimal configuration effort, Apache Flink’s data streaming runtime achieves low latency and high throughput. Flink can process high-velocity, high-volume data in milliseconds. |
Fault tolerance | MapReduce is highly fault tolerant; there is no need to restart the application from scratch after a failure. | Spark Streaming recovers lost work and delivers exactly-once semantics out of the box with no extra code or configuration. | The fault tolerance mechanism in Apache Flink is based on Chandy-Lamport distributed snapshots. This lightweight mechanism maintains high throughput rates while also providing strong consistency guarantees. |
Performance | Hadoop’s performance is slower than that of Spark and Flink. | Apache Spark has an excellent community background and is now considered the most mature community, but its stream processing is not as efficient as Apache Flink’s because it uses micro-batch processing. | The performance of Apache Flink is excellent compared to other data processing engines. Flink uses closed-loop iterations, which makes machine learning and graph processing faster. |
Duplicate elimination | NA | Spark processes every record exactly once and hence eliminates duplication. | Apache Flink processes every record exactly once and hence eliminates duplication. |
Compatibility | MapReduce and Spark are compatible with each other. | MapReduce and Spark are compatible with each other, and Spark shares all of MapReduce’s compatibility for data sources, file formats and business intelligence tools via JDBC and ODBC. | Flink is also fully compatible with Hadoop; it can process data stored in Hadoop and supports all the file formats and input formats. |
Security | Hadoop supports Kerberos authentication, which is somewhat painful to manage. HDFS supports access control lists (ACLs) and a traditional file permissions model. In addition, third-party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication. | Spark’s security is currently a bit sparse, supporting only authentication via a shared secret (password authentication). If you run Spark on Hadoop, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, which gives it the capability of using Kerberos authentication. | Flink supports user authentication via the Hadoop / Kerberos infrastructure. If you run Flink on YARN, it should work seamlessly: Flink acquires the Kerberos tokens of the user who submits programs and uses them to authenticate itself at YARN, HDFS and HBase. With Flink’s upcoming connector, streaming programs can authenticate themselves to stream brokers via SSL. |
Iterative Data Flow | The MapReduce computation data flow does not have any loops; it is a chain of stages, and each stage moves forward using the output of the previous stage and producing input for the next stage. | Although an ML algorithm is a cyclic data flow, it is represented as a directed acyclic graph inside Spark. | Flink takes a slightly different approach from the others. It supports a controlled cyclic dependency graph at run time, which lets it represent ML algorithms very efficiently. |
Visualization | All the BI tools, such as JasperSoft, SAP Business Objects, QlikView, Tableau, Zoomdata, etc., provide connectivity with Hadoop and its ecosystem. | All the BI tools, such as JasperSoft, SAP Business Objects, QlikView, Tableau, Zoomdata, etc., provide connectivity with Spark. Spark can also be integrated to Apache. It provides data analytics, ingestion, as well as discovery, visualization and collaboration. Apart from this, Spark offers a web interface for submitting and executing jobs. The resulting execution plan can be visualized on this interface. | All the BI tools, such as JasperSoft, SAP Business Objects, QlikView, Tableau, Zoomdata, Zeppelin, etc., provide connectivity with Hadoop and its ecosystem. Flink also offers a web interface for submitting and executing jobs. The resulting execution plan can be visualized on this interface. |
Cost | MapReduce can typically run on less expensive hardware because it does not attempt to store everything in memory. | Spark requires a lot of RAM to run in memory, so adding RAM across the cluster gradually increases its cost. | Flink also requires a lot of RAM to run in memory, so adding RAM across the cluster gradually increases its cost. |
Scalability | Hadoop has incredible scalability potential and has been used in production on tens of thousands of nodes. | Spark is also highly scalable; we can keep adding nodes to the cluster, and it has been used in production on thousands of nodes. | Flink is also highly scalable; we can keep adding nodes to the cluster, and it has been used in production on thousands of nodes. |
Ease of use | MapReduce developers need to hand-code each and every operation, which makes it very difficult to work with. | Spark is easy to program because it has many high-level operators. | Flink is also very easy to program because it has many high-level operators. |
Interactive Mode | MapReduce does not have an interactive mode for processing data. | Spark can process data interactively. | Flink can also process data interactively. |
Real-time Analysis | MapReduce does not support real-time data processing (stream processing), as it was developed to perform batch processing on voluminous amounts of data. | Spark can process data in near real time, i.e. data coming from real-time event streams at a rate of many events per second, but with high latency, because it uses the micro-batch model. | Flink natively supports real-time data analysis, although it also provides fast batch data processing. |
Scheduler | MapReduce needs an external job scheduler such as Oozie to schedule complex flows. | Because of its in-memory computation, Spark has its own flow scheduler. | Flink can use the YARN scheduler, but it also has its own scheduler. |
SQL support | Hadoop enables users to run SQL queries using Apache Hive and Impala. | Spark enables users to run SQL queries using Spark SQL, which is tightly integrated with the Spark core. | In Flink, the Table API and SQL are used to run SQL-like expressions and process structured data. |
Caching | MapReduce cannot cache data in memory for future requirements. | Spark can cache data in memory for further iterations, which enhances its performance. | Flink can also cache data in memory for further iterations, which enhances its performance. |
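
As a rough illustration of the window criteria compared in the table above, the following plain-Python sketch contrasts a time-based window with a record-based (count) window. It does not use the Spark or Flink APIs; the event list, the function names `tumbling_time_windows` and `count_windows`, and the window sizes are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (timestamp_in_seconds, value) events, made up for this sketch.
EVENTS = [(0, 1), (1, 2), (4, 3), (5, 4), (6, 5), (11, 6)]


def tumbling_time_windows(events, size_seconds):
    """Time-based criterion: group events into fixed, non-overlapping
    windows keyed by the start time of the window they fall into."""
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // size_seconds) * size_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


def count_windows(events, size):
    """Record-based criterion: group events into windows of `size`
    records each, regardless of when the records arrived."""
    values = [value for _, value in events]
    return [values[i:i + size] for i in range(0, len(values), size)]


time_result = tumbling_time_windows(EVENTS, size_seconds=5)
count_result = count_windows(EVENTS, size=4)

print(time_result)   # {0: [1, 2, 3], 5: [4, 5], 10: [6]}
print(count_result)  # [[1, 2, 3, 4], [5, 6]]
```

The same six events end up grouped differently depending on the criterion: the time-based windows follow the timestamps, while the count windows depend only on how many records have arrived. A real engine would also have to deal with late or out-of-order events, which this sketch ignores.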