A gentle introduction to Apache Spark



1. What is Apache Spark?

  • A powerful open-source data processing engine.
  • It may complement (or even replace) its pioneering counterpart, Hadoop, in the future thanks to much better performance.

    Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.

Why is it better? Keep reading…

2. What are the most important criteria for a data processing engine? And then what are the differences between Hadoop and Spark?

  • Scalability (work distribution): can work with large data
  • Fault tolerance: can self-recover

Spark leverages distributed memory, which the previous engine did not. That’s the first big difference!

Hadoop targets an acyclic data flow model: load data -> process data -> write output -> finished. However, many applications now reuse a working set of data across multiple parallel operations. Examples are iterative algorithms that work on the same dataset to optimize parameters (e.g. PageRank, K-means, logistic regression) and interactive data analysis tools (with Pig or Hive, every query reads the data from disk, not from memory). These applications reuse intermediate results across multiple computations, and pipelining multiple Hadoop jobs to do that is painful: we have to deal with data replication, disk I/O and serialization, which dominate the application's execution time.


Fortunately, Spark came up with the idea of leveraging distributed memory (the RAM of the worker machines). As the price of DRAM keeps decreasing, and processing data in memory is extremely fast compared with processing data stored on hard disk, this idea shows more and more potential.

[Figure: interactive data analysis] [1]

By keeping intermediate results in memory, we can overcome the difficulties mentioned above. Furthermore, we can cache the data for multiple queries, which is really effective for interactive data analysis tools.

3. How to implement that novel idea?

Well, that’s a bigger question than you might imagine! :D Just kidding! The Resilient Distributed Dataset (RDD) plays the most important role in realizing the idea. First, we should understand what it is.

3.1. RDD is …

  • An immutable collection of objects. (Immutable -> safe for parallel processing)
  • Partitioned and distributed across a set of machines. (Scalability)
  • Can be rebuilt if a partition is lost. (In-memory -> fast, rebuildable -> fault tolerant)
  • Can be cached in memory for reuse. (Fast and interactive)

3.2. How to create an RDD?

  • Read from data sources (e.g. HDFS, JSON files, text files, … any kind of file).
  • Transform other RDDs using transformations (actions do not produce RDDs; they return results). In that case, the new RDD keeps information about how it was derived from other RDDs.
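Both ways of creating an RDD can be sketched as follows. This is only a minimal sketch, assuming Spark 2.x or later is on the classpath and a local master is acceptable. The object name CreateRddSketch and the temporary file (standing in for HDFS) are made up for illustration, and sc.parallelize is used only as a convenient way to get a first RDD from a local collection.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of Spark.
object CreateRddSketch {
  // Returns (number of lines read from the file, squares computed from a derived RDD).
  def run(): (Long, Seq[Int]) = {
    val conf = new SparkConf().setAppName("create-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 1) Read from a data source: a small temporary text file stands in for HDFS.
      val path = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(path, "spark\nhadoop\nspark".getBytes("UTF-8"))
      val lines = sc.textFile(path.toUri.toString)

      // 2) Derive a new RDD from an existing one with a transformation.
      val numbers = sc.parallelize(Seq(1, 2, 3, 4))
      val squares = numbers.map(n => n * n)

      // count() and collect() are actions: they return plain values to the driver.
      (lines.count(), squares.collect().toSeq)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (lineCount, squares) = run()
    println(s"lines = $lineCount, squares = ${squares.mkString(",")}")
  }
}
```

On a real cluster the path would typically be an hdfs:// URI and the master would be supplied by the launcher instead of being hard-coded.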


So, what should an RDD have?

  • A set of partitions.
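  • A set of dependencies on parent RDDs:
    • Narrow dependency if each partition of the parent is used by at most one partition of the child (e.g. map, filter)
    • Wide dependency if several partitions of the child depend on the same partition of the parent, which requires a shuffle (e.g. groupByKey, or a join whose parents are not partitioned the same way)
  • A function to compute its partitions from its parents
  • Metadata about its partitioning scheme and data placement (e.g. preferred location to compute each partition)
  • Partitioner (optional; defines how the keys of a key-value RDD are assigned to partitions)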
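Most of these pieces can be inspected from the public RDD API. The sketch below assumes Spark 2.x or later on the classpath and a local master. RddAnatomySketch and dependencyKind are hypothetical names of mine, while partitions, dependencies, partitioner and toDebugString are the RDD members being demonstrated. In it, mapValues produces a narrow dependency and reduceByKey a wide (shuffle) one.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of Spark.
object RddAnatomySketch {
  // Classifies the first dependency of an RDD.
  def dependencyKind(rdd: RDD[_]): String =
    rdd.dependencies.headOption match {
      case Some(_: ShuffleDependency[_, _, _]) => "wide"
      case Some(_: NarrowDependency[_])        => "narrow"
      case Some(_)                             => "other"
      case None                                => "none"
    }

  // Returns (partition count, kind of mapValues dep, kind of reduceByKey dep, has partitioner).
  def run(): (Int, String, String, Boolean) = {
    val conf = new SparkConf().setAppName("rdd-anatomy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
      val doubled = pairs.mapValues(_ * 2)      // each child partition reads one parent partition
      val summed  = doubled.reduceByKey(_ + _)  // values for a key may come from any parent partition

      // toDebugString prints the lineage Spark would use to rebuild lost partitions.
      println(summed.toDebugString)

      (pairs.partitions.length,
       dependencyKind(doubled),
       dependencyKind(summed),
       summed.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```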

You may be interested in having a look at the RDD class:

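// Abridged excerpt of the RDD class; most members are omitted.
abstract class RDD[T: ClassTag](
    @transient private var sc: SparkContext,        // the SparkContext that created this RDD
    @transient private var deps: Seq[Dependency[_]] // dependencies on parent RDDs
  ) extends Serializable with Logging {

  // Common fields
  val id: Int = sc.newRddId()                       // unique id within the SparkContext
  val partitioner: Option[Partitioner] = None       // overridden by partitioned key-value RDDs
  private var storageLevel: StorageLevel = StorageLevel.NONE
  private var dependencies_ : Seq[Dependency[_]] = null
  private var partitions_ : Array[Partition] = null

  // Common methods (bodies omitted where none is shown)
  def compute(split: Partition, context: TaskContext): Iterator[T] // computes one partition
  protected def getPartitions: Array[Partition]
  protected def getDependencies: Seq[Dependency[_]] = deps
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil
  def persist(newLevel: StorageLevel): this.type    // keep the RDD at the given storage level
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
  def cache(): this.type = persist()                // shorthand for persist()
}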

3.3. Transformations and actions, what are they?

Each RDD has two sets of parallel operations: transformations and actions.

  • Transformations are similar to the lazy loading technique in modern object-relational mappers (e.g. Entity Framework in .NET). They are the operations of an RDD that define how its data should be transformed. The result is another RDD, but remember that nothing is actually computed until we call an action on that RDD. Why lazy execution? Because it lets Spark optimize the whole series of transformations on the RDD before running it.

  • Actions: apply all pending transformations on the RDD (if any) and then perform the action to obtain a result. The result is returned to the driver program or written to the storage system. We can easily recognize whether an operation is a transformation or an action by looking at its return type: another RDD (transformation) versus primitive and built-in types such as int, long, List<Object>, Array<Object>, … (action).
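The laziness of transformations can be observed directly. The following is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master. LazyEvaluationSketch is a hypothetical name, and an accumulator is used here only as a probe to count how many times the map function has run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of Spark.
object LazyEvaluationSketch {
  // Returns (map calls before the action, map calls after the action, result of the action).
  def run(): (Long, Long, Int) = {
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Probe only: counts invocations of the function passed to map.
      // (Accumulator updates inside transformations can be repeated if tasks are re-run.)
      val mapCalls = sc.longAccumulator("mapCalls")

      val numbers = sc.parallelize(1 to 5)
      val doubled = numbers.map { n => mapCalls.add(1); n * 2 } // transformation: returns an RDD

      val before = mapCalls.value.longValue() // nothing has been computed yet

      val total = doubled.reduce(_ + _)       // action: returns an Int and triggers the job

      val after = mapCalls.value.longValue()
      (before, after, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The return types tell the two kinds apart: map gives back an RDD[Int], while reduce gives back a plain Int to the driver.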


Read more at my other post RDD operations: transformations and actions

Additionally, we can tell Spark to keep computed RDDs around for reuse with the RDD persistence feature:

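// Pick one of the following (an RDD's storage level cannot be changed once assigned):
rdd.persist() // cache in memory using the default storage level (MEMORY_ONLY)
rdd.persist(StorageLevel.MEMORY_AND_DISK) // cache at a specific level (import org.apache.spark.storage.StorageLevel)
rdd.cache() // same as calling persist()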

Read more about Storage Level here

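Remember that caching behaves like a transformation: it leaves the dataset lazy, but hints that it should be kept in memory after the first time it is computed. If the dataset does not fit in memory, what happens depends on the storage level: with MEMORY_ONLY, the partitions that do not fit are not cached and are recomputed when they are needed; with MEMORY_AND_DISK, they are spilled to disk.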
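Here is a small sketch of that behaviour, assuming Spark 2.x or later on the classpath and a local master. PersistSketch is a hypothetical name, and the accumulator again only serves as a probe to count how often the "expensive" function runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example object, not part of Spark.
object PersistSketch {
  // Returns (computations after the 1st action, computations after the 2nd action, level uses disk).
  def run(): (Long, Long, Boolean) = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val computed = sc.longAccumulator("computed") // probe: counts calls of the map function
      val expensive = sc.parallelize(1 to 4).map { n => computed.add(1); n * n }

      // Only marks the RDD for persistence; nothing is computed or stored yet.
      expensive.persist(StorageLevel.MEMORY_AND_DISK)

      expensive.count()                           // first action: computes and stores the partitions
      val afterFirst = computed.value.longValue()

      expensive.count()                           // second action: served from the persisted partitions
      val afterSecond = computed.value.longValue()

      (afterFirst, afterSecond, expensive.getStorageLevel.useDisk)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the persist call is removed, the second count() recomputes the map and the second number doubles.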

Each type of RDD may offer a different set of operations. Some frequently used RDDs that we may encounter are: SchemaRDD (an RDD with rows and column metadata, which we can query using Spark SQL features), HadoopRDD (the dataset was read from HDFS), MappedRDD, FilteredRDD, …

Besides the dataset abstraction (RDDs and the parallel operations on them), Spark also supports a mechanism for sharing variables: broadcast variables and accumulators.

  • Broadcast variables: read-only variables (e.g. a look-up table) that are copied to each worker only once (similar to DistributedCache in MapReduce). The technique works as follows: suppose I have a look-up table X. The values of X are compressed and saved in a shared file system (actually in the Tachyon distributed file system, which Spark is using). Then Spark creates an object Y whose serialized form is the path to the file containing the values of X. Y is sent to all workers, so that a worker can check whether X is already cached locally and, if it is not, read it from the file.
    // sc: SparkContext, lookupTable: a read-only local value such as a Map (not an RDD)
    val broadcastVal = sc.broadcast(lookupTable) // create a broadcast variable
    broadcastVal.value // access the broadcast value on the workers
  • Accumulators (similar to Counters in MapReduce): workers can only add to them using an associative operation. Accumulators are usually used for parallel sums. Only the driver can read an accumulator's value.
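The two kinds of shared variables can be combined in one small job. This is a sketch under assumptions: Spark 2.x or later on the classpath, a local master, and a made-up look-up table of country codes. SharedVariablesSketch and the data are hypothetical, while sc.broadcast and sc.longAccumulator are the SparkContext methods being illustrated.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of Spark.
object SharedVariablesSketch {
  // Returns (resolved names, number of codes missing from the look-up table).
  def run(): (Seq[String], Long) = {
    val conf = new SparkConf().setAppName("shared-variables-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Read-only look-up table, shipped to each worker once instead of with every task.
      val countryNames = Map("vn" -> "Vietnam", "fr" -> "France")
      val lookup = sc.broadcast(countryNames)

      // Workers may only add to the accumulator; the driver reads the total.
      val unknownCodes = sc.longAccumulator("unknownCodes")

      val codes = sc.parallelize(Seq("vn", "fr", "xx", "vn"))
      val names = codes.map { code =>
        // The default branch runs only when the code is missing from the table.
        lookup.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
      }.collect().toSeq

      (names, unknownCodes.value.longValue()) // read on the driver, after the action has run
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```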

3.4. Spark execution stack

The two ways of manipulating data in Spark are:

  • Use the interactive shell (supports Scala and Python)
  • Create a standalone application (write a driver program)
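For the second option, a driver program is just an ordinary program that creates a SparkContext and uses it. Below is a minimal word-count sketch, assuming Spark 2.x or later on the classpath. WordCountApp and its input lines are made up for illustration, and the local master is only a fallback for when no master has been configured by the launcher.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program, not part of Spark.
object WordCountApp {
  // Counts words across the given lines using RDD operations.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // sum the counts per word
      .collect()
      .toMap

  // Creates the context, runs the job on sample input, and always stops the context.
  def run(): Map[String, Int] = {
    val conf = new SparkConf()
      .setAppName("word-count-sketch")
      .setIfMissing("spark.master", "local[2]") // fallback when no master is configured
    val sc = new SparkContext(conf)
    try {
      countWords(sc, Seq("spark and hadoop", "spark in memory"))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().toSeq.sortBy(_._1).foreach { case (word, count) => println(s"$word: $count") }
}
```

The same statements inside countWords can be typed line by line into the interactive shell, where a SparkContext named sc is already provided.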


3.5. Component Stack

[Figure: Spark component stack]

Read more at Modules in Apache Spark project

This is just a quick introduction to Spark. Please refer to my future posts to understand how Spark has become so beloved by the community. Thank you!

  1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Technical Report UCB/EECS-2011-82. July 2011 
