Introduction to Fault Tolerance in Apache Spark
Before we learn what fault tolerance means in Spark, let us briefly revise the basic concepts of Apache Spark for beginners.
Now let's understand what a fault is and how Spark handles fault tolerance.
A fault is a failure, so fault tolerance is the capability to keep
operating and to recover lost data after a failure occurs. A
fault-tolerant system must be redundant, because a redundant component
is needed to recover the lost data: the faulty data is recovered from
the redundant data.
Spark RDD fault tolerance
Let us first recall how RDDs are created in Spark.
Spark operates on data in fault-tolerant file systems like HDFS or S3,
so all the RDDs generated from fault-tolerant data are fault tolerant.
But this does not hold true for streaming/live data (data received over
the network), so the key need for fault tolerance in Spark is for this
kind of data. The basic fault-tolerance semantics of Spark are:
- Since an Apache Spark RDD is an immutable dataset, each RDD remembers the lineage of deterministic operations that were applied to a fault-tolerant input dataset to create it.
- If due to a worker node failure any partition of an RDD is lost,
then that partition can be re-computed from the original fault-tolerant
dataset using the lineage of operations.
- Assuming that all of the RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.
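The lineage-based recovery described above can be illustrated with a small, self-contained Python sketch. This is not the actual Spark API; the `Partition` class and its methods are invented for illustration. The idea is that a partition stores only its fault-tolerant source data and the deterministic transformations used to derive it, so a lost partition can always be rebuilt by replaying them:

```python
# Conceptual sketch of lineage-based recovery (not the Spark API).
# A "partition" keeps its fault-tolerant input and its lineage of
# deterministic transformations, so it can always be recomputed.

class Partition:
    def __init__(self, source_data, lineage):
        self.source_data = list(source_data)   # fault-tolerant input (e.g. an HDFS block)
        self.lineage = list(lineage)           # ordered deterministic transformations
        self.data = self._compute()

    def _compute(self):
        data = self.source_data
        for transform in self.lineage:
            data = [transform(x) for x in data]
        return data

    def lose(self):
        """Simulate a worker-node failure: the in-memory data is gone."""
        self.data = None

    def recover(self):
        """Replay the lineage against the fault-tolerant source."""
        self.data = self._compute()

# A partition with two map-like transformations in its lineage.
p = Partition(range(5), [lambda x: x * 2, lambda x: x + 1])
before = p.data
p.lose()        # worker failure: in-memory partition lost
p.recover()     # deterministic lineage => identical result
assert p.data == before == [1, 3, 5, 7, 9]
```

Because every transformation is deterministic, the recomputed partition is byte-for-byte identical to the lost one, which is exactly why the final RDD is the same irrespective of cluster failures.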
To achieve fault tolerance for all the generated RDDs, the received
data is replicated among multiple Spark executors on worker nodes in
the cluster. This results in two types of data that need to be
recovered in the event of a failure: 1) data received and replicated, and 2) data received but buffered for replication.
- Data received and replicated: Here, the data has been replicated to one of the other nodes, so it can be retrieved from the replica when a failure occurs.
- Data received but buffered for replication: The data has not been replicated, so the only way to recover it after a failure is to fetch it again from the source.
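These two recovery cases can be sketched in a few lines of Python. This is an illustrative model, not Spark's internals; `Source`, `replay`, and `recover` are hypothetical names. Replicated records are read back from the replica, while records that were only buffered must be replayed from a source that supports it (such as Kafka):

```python
# Sketch of the two recovery cases for received streaming data
# (illustrative model, not the Spark API).

class Source:
    """Upstream source that can replay data on request (e.g. Kafka)."""
    def __init__(self, records):
        self.records = list(records)

    def replay(self):
        return list(self.records)

def recover(replicated, buffered, source):
    """Recover after a node failure.

    replicated: records already copied to another executor -> read the replica.
    buffered:   records received but not yet replicated    -> re-fetch from source.
    """
    recovered = list(replicated)           # case 1: read from the replica node
    if buffered:
        recovered += source.replay()       # case 2: the source is the only copy
    return recovered

replicated = [1, 2]          # safely copied to a second executor
buffered = [3, 4]            # still waiting for replication when the node died
assert recover(replicated, buffered, Source([3, 4])) == [1, 2, 3, 4]
```

Note that the second case only works if the source can resend data; with a source that cannot replay, the buffered records are simply lost.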
Failures can occur in worker nodes as well as the driver node.
- Failure of worker node: The nodes that run the application code on the Spark cluster are the Spark worker nodes. These are the slave nodes. Any worker node running an executor can fail, resulting in the loss of that executor's in-memory data. If any receivers were running on the failed nodes, their buffered data will be lost.
- Failure of driver node: If the driver node that is running the Spark
Streaming application fails, then the SparkContext is lost, and all
executors along with their in-memory data are lost as well.
Apache Mesos helps in making the Spark master fault tolerant by
maintaining backup masters. It is open-source software residing between
the application layer and the operating system, and it makes it easier
to deploy and manage applications in large-scale clustered environments.
Executors are relaunched automatically after a failure, and Spark
Streaming performs parallel recovery by recomputing the lost RDDs on
the input data. Receivers are restarted by the workers when they fail.
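The parallel recovery mentioned above can be sketched as follows. The function and variable names are illustrative; in reality Spark's scheduler assigns the recomputation tasks to the relaunched executors internally. The point is that lost partitions are recomputed concurrently rather than one at a time:

```python
# Sketch of parallel recovery: relaunched executors recompute the
# lost partitions concurrently (names are illustrative, not Spark's API).
from concurrent.futures import ThreadPoolExecutor

def recompute_partition(partition_id, source_blocks, transform):
    """Replay the deterministic transformation on the fault-tolerant input."""
    return [transform(x) for x in source_blocks[partition_id]]

source_blocks = {0: [1, 2], 1: [3, 4], 2: [5, 6]}   # fault-tolerant input data
transform = lambda x: x * 10
lost = [0, 2]                       # partitions held by the failed executors

# The relaunched executors pick up the lost partitions in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    recovered = dict(zip(lost, pool.map(
        lambda pid: recompute_partition(pid, source_blocks, transform), lost)))

assert recovered == {0: [10, 20], 2: [50, 60]}
```

Because each lost partition depends only on its own fault-tolerant input blocks, the recomputations are independent and can safely run in parallel, which is what makes recovery fast.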
Fault Tolerance with Receiver-Based Sources
For input sources based on receivers, fault tolerance depends on both the failure scenario and the type of receiver. There are two types of receivers:
- Reliable receiver: A reliable source is acknowledged only once it is ensured that the received data has been replicated. If the receiver fails, the source will not receive an acknowledgment for the buffered data. So, the next time the receiver is restarted, the source will resend that data, and no data will be lost due to the failure.
- Unreliable receiver: Data can be lost due to a worker or driver failure, since such a receiver does not send an acknowledgment.
If a worker node fails and the receiver is reliable, there will be no
data loss. But in the case of an unreliable receiver, data loss will
occur: data received but not yet replicated can be lost.
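The acknowledgment protocol that makes a reliable receiver lose no data can be sketched in Python. This is a conceptual model, not Spark's receiver API; `ReliableSource`, `receive`, and the record IDs are invented for illustration. The two key rules are: the receiver acknowledges a record only after it has been replicated, and on restart the source resends everything still unacknowledged:

```python
# Sketch of the reliable-receiver protocol (illustrative, not Spark's API):
# acknowledge only after replication; on restart, the source resends
# everything that was never acknowledged.

class ReliableSource:
    def __init__(self, records):
        self.pending = {i: r for i, r in enumerate(records)}  # unacknowledged

    def ack(self, record_id):
        self.pending.pop(record_id, None)

    def resend(self):
        """On receiver restart, resend everything still unacknowledged."""
        return dict(self.pending)

def receive(source, records, crash_after):
    """Replicate, then acknowledge, each record in turn.
    Returns early after `crash_after` records to simulate a failure."""
    replicated = {}
    for n, (rid, rec) in enumerate(records.items()):
        if n == crash_after:
            return replicated            # failure: remaining records stay un-acked
        replicated[rid] = rec            # replicate first ...
        source.ack(rid)                  # ... then acknowledge
    return replicated

source = ReliableSource(["a", "b", "c"])
stored = receive(source, source.resend(), crash_after=1)   # crash after "a"
# The restarted receiver gets only the un-acked records resent.
stored.update(receive(source, source.resend(), crash_after=None))
assert stored == {0: "a", 1: "b", 2: "c"}
```

An unreliable receiver corresponds to acknowledging (or never tracking) records before replication: after the same crash, the source would have nothing left to resend, and the unreplicated records would be gone.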