[Code Reading] Narrow Dependency and Wide Dependency Implementation in Spark

Files: Dependency.scala As mentioned about different types of dependencies of RDDs in my previous post, today I’m going to dive more about its implementation.

alt Types of Dependency

As you can see from the class diagram, dependency is divided into two types: narrow and shuffle (wide).

For NarrowDependency, we have 3 concrete classes: OneToOneDependency, RangeDependency and PruneDependency:

  • OneToOneDependency: one partition of the parent RDD is used by one partition of the child RDD. So, method getParents(partitionID: Int) which gets the parent partitions for a child partition just returns that partitionID.
  • RangeDependency: is useful for Union operation where a set of child partitions depends on a set of parent partitions.
  • PruneDependency: used by PartitionPruningRDD: the child RDD contains a subset of partitions of the parents.

Note that NarrowDependency provides the method getParents(partitionID: Int) so that the DAGScheduler can put into good use in determining the preferred location to compute the task: if the getParents method that being called on RDDs returns the same value (same partition) then the transformations on RDDs should be computed locally.

In Spark, here are the list of transformations that are narrow dependencies:

  • Filter, map, mapValues, flatMap, flatMapValues
  • Glom, pipe, zipWithIndex, cartesian, union
  • MapPartitionsWithInputSplit, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext
  • Sample, randomSplit

ShuffleDependency :only PairRDDs with <Key, Value> have this kind of dependency and they involve the shuffle step. Therefore, we need a partitioner (to decide where to put intermediate values), a serializer (to serialize object so that object can be sent through network), a shuffleHandler to handle shuffle step, …

Here is the list of transformations that are wide dependencies:

  • SortByKey, combineByKey, partitionBy
  • RepartitionAndSortWithinPartitions

Special case:

  • Coalesce: could perform shuffle or not depending on given parameters.
  • SubtractByKey, cogroup: depend on its partitions’ location, could be narrow or wide dependency.

Leave a Comment