This tutorial covers the differences between Apache Storm and Apache Spark Streaming. Apache Storm is a stream processing engine for real-time streaming data, while Apache Spark is a general-purpose computing engine whose Spark Streaming module processes streaming data in near real time. Let’s compare Storm and Spark Streaming feature by feature to see which is better.
|Feature||Apache Storm||Apache Spark Streaming|
|Processing Model||Supports a true stream processing model through the core Storm layer||Spark Streaming is a wrapper over Spark batch processing|
|Primitives||Storm provides a rich set of primitives for tuple-level
processing over a stream (filters, functions). Aggregations over
messages in a stream can be achieved through group-by semantics. Join
operations can be performed across streams; it supports left, right,
|Spark Streaming provides two broad categories of operators: stream
transformation operators that transform one DStream into another
DStream, and output operators that write information to external
systems. The former includes stateless operators (filter, map,
mapPartitions, union, distinct, and so on) as well as stateful window
operators (countByWindow, reduceByWindow, and so on).
|State Management||Core Storm by default doesn’t offer any framework-level support to
store intermediate bolt output (the result of a user operation) as state.
Hence, any application has to explicitly create and update its own state as
and when required.
|The underlying Spark by default treats the output of every RDD operation
(stateless or stateful) as intermediate state and stores it as an RDD. Spark
Streaming permits maintaining and changing state via the updateStateByKey
API. A pluggable method couldn’t be found to implement state within the
|Message delivery guarantees (Handling message level failures)||Storm supports three message processing guarantees: at least once,
at most once, and exactly once. Storm’s reliability mechanisms are purely
distributed, scalable, and fault-tolerant.
|Spark Streaming defines its fault tolerance semantics in terms of the
guarantees provided by the receiver and output operators. As per the
Apache Spark architecture, the incoming data is read and replicated across
different Spark executor nodes. This creates failure scenarios in which
data has been received but may not be reflected. Fault tolerance is
handled differently for worker failure and driver failure.
|Fault Tolerance (Handling process/node level failures)||Storm is designed with fault tolerance at its core. Storm daemons
(Nimbus and Supervisor) are made to be fail-fast (meaning that the process
self-destructs whenever any unexpected scenario is encountered) and
stateless (all state is kept in ZooKeeper or on disk).
|The driver node (equivalent of the JobTracker, JT) is a single point of
failure (SPOF). If the driver node fails, all executors, along with their
received and replicated in-memory data, are lost. To recover reliably from
driver failure, data checkpointing is needed.
|Debuggability and monitoring||The Storm UI supports visualization of every topology, with a complete
break-up of internal spouts and bolts. The UI additionally provides
information on any errors occurring in tasks and fine-grained stats on the
throughput and latency of every part of the running topology. It helps in
debugging problems at a high level.
Metric-based monitoring: Storm’s built-in metrics feature provides
framework-level support for applications to emit any metrics, which can
then be easily integrated with external metrics/monitoring systems.
|The Spark web UI displays an additional Streaming tab that shows
statistics about running receivers (whether receivers are active,
number of records received, receiver errors, and so on) and completed
batches (batch processing times, queueing delays, and so on). This can be
used to monitor the execution of the application.
The following two metrics in the web UI are particularly important for tuning the batch size:
Processing Time: the time to process each batch of data.
Scheduling Delay: the time a batch waits in a queue for the processing of previous batches to complete.
|Auto Scaling||Storm allows configuring initial parallelism at various levels per
topology: number of worker processes, number of executors, and number
of tasks. Additionally, it supports dynamic rebalancing, which allows
increasing or decreasing the number of worker processes and executors
without restarting the cluster or the topology. However, the number of
initial tasks configured stays constant throughout the life of the topology.
Once all supervisor nodes are fully saturated with worker processes and
there is a need to scale out, one simply has to start a new
supervisor node and point it to the cluster-wide ZooKeeper.
It is possible to automate the logic of monitoring the current resource
consumption on each node in a Storm cluster and dynamically adding
more resources. STORM-594 describes such an auto-scaling mechanism
using a feedback system.
|The community is currently working on dynamic scaling for streaming applications.
At the moment, elastic scaling of Spark Streaming applications isn’t supported.
Essentially, dynamic allocation isn’t meant to be used in Spark
Streaming at the moment (1.4 or earlier). The reason is that currently
the receiving topology is static: the number of receivers is fixed. One
receiver is allocated for each DStream instantiated, and it will use one
core in the cluster. Once the StreamingContext is started, this
topology cannot be modified. Killing receivers leads to stopping the
|YARN Integration||Storm integration with YARN is recommended through Apache
Slider. Slider is a YARN application that deploys non-YARN distributed
applications over a YARN cluster. It interacts with the YARN RM to spawn
containers for the distributed application and then manages the lifecycle of
these containers. Slider provides out-of-the-box application packages for
|The Spark framework provides native integration with YARN. Spark
Streaming, as a layer above Spark, simply leverages that integration.
Every Spark Streaming application runs as an individual YARN
application. The ApplicationMaster container runs the Spark driver and
initializes the SparkContext. All executors and receivers run in
containers managed by the ApplicationMaster. The ApplicationMaster then
periodically submits one job per micro-batch on the YARN containers.
|Isolation||Each worker process runs executors for a particular topology;
that is, mixing tasks of different topologies isn’t allowed at the worker
process level, which provides topology-level runtime isolation. Further, each
executor thread runs one or more tasks of the same component (spout or
bolt); that is, there is no mixing of tasks across components.
|Each Spark application is a separate application run on the YARN cluster,
where each executor runs in a separate YARN container. Therefore
JVM-level isolation is provided by YARN, since two different
topologies can’t execute in the same JVM. In addition, YARN provides
resource-level isolation so that container-level resource constraints (CPU,
memory limits) can be configured.
|Open Source Apache community||The Storm Powered-By page lists a healthy number of companies that are
running Storm in production for many use cases: real-time analytics, NLP,
data normalization and ETL; scalable low-latency and high-performance
processing, in varied domains (ad tech and so on).
Many of them are large-scale web deployments that are really pushing
the boundaries in terms of performance and scale. For instance, the Yahoo
deployment consists of 2,300 nodes running Storm for near-real-time
event processing, with the largest topology spanning 400 nodes.
|Spark Streaming is still emerging and has limited experience in production clusters.
However, the overall umbrella Spark community is easily one of the
biggest and most active open source communities out there
today. In the latest 1.4.1 release, over 210 contributors from
70 different organizations contributed over 1,000
patches. The overall charter is rapidly evolving given the large
developer base. This should lead to maturity of Spark Streaming in the
near future.
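
The Processing Model, Primitives, and State Management rows above can be made more concrete with a small sketch. The following is plain Python, not Spark or Storm code: it only imitates the idea of splitting a stream into micro-batches, carrying per-key state from batch to batch in the spirit of updateStateByKey, and reducing over a sliding window of batches in the spirit of reduceByWindow. All function names here (`micro_batches`, `update_state_by_key`, `reduce_by_window`, `run_demo`) are hypothetical helpers invented for illustration, and batching by record count rather than by time interval is a simplifying assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, int]  # (key, value) pair, e.g. ("storm", 1)


def micro_batches(records: List[Record], batch_size: int) -> List[List[Record]]:
    """Split a list of records into consecutive fixed-size micro-batches.

    Assumption: Spark Streaming batches by time interval; a fixed record
    count is used here only to keep the sketch deterministic.
    """
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def update_state_by_key(
    state: Dict[str, int],
    batch: Iterable[Record],
    update: Callable[[List[int], Optional[int]], int],
) -> Dict[str, int]:
    """Return a new state built by calling update(new_values, previous) per key.

    Keys that have state but no new values in this batch are still passed
    to `update`, with an empty list of new values.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    new_state: Dict[str, int] = {}
    for key in set(state) | set(grouped):
        new_state[key] = update(grouped.get(key, []), state.get(key))
    return new_state


def running_count(new_values: List[int], previous: Optional[int]) -> int:
    """Example update function: add this batch's values to the running total."""
    return (previous or 0) + sum(new_values)


def reduce_by_window(batch_totals: List[int], window: int) -> List[int]:
    """Sum each batch total with up to `window - 1` preceding batch totals."""
    return [
        sum(batch_totals[max(0, i - window + 1): i + 1])
        for i in range(len(batch_totals))
    ]


def run_demo() -> Tuple[Dict[str, int], List[int]]:
    """Process five events in micro-batches of two and return (state, windows)."""
    events = [("storm", 1), ("spark", 1), ("storm", 1), ("spark", 1), ("spark", 1)]
    state: Dict[str, int] = {}
    totals: List[int] = []
    for batch in micro_batches(events, batch_size=2):
        state = update_state_by_key(state, batch, running_count)
        totals.append(sum(value for _, value in batch))
    return state, reduce_by_window(totals, window=2)


if __name__ == "__main__":
    final_state, windowed = run_demo()
    print(final_state, windowed)
```

In this sketch the framework-like helper owns the state and hands it back to the update function on every batch, which mirrors the Spark Streaming column. In the core Storm model described above, a bolt would instead see each tuple individually and would have to create and update such a dictionary itself.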