Files: Dependency.scala

As mentioned in my previous post about the different types of RDD dependencies, today I'm going to dive deeper into their implementation.
Types of Dependency
As you can see from the class diagram, dependency is divided into two types: narrow and shuffle (wide).
For NarrowDependency, we have three concrete classes:
OneToOneDependency: one partition of the parent RDD is used by exactly one partition of the child RDD. So its method getParents(partitionId: Int), which gets the parent partitions for a child partition, just returns that partitionId.
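A minimal sketch of this, modeled on Spark's Dependency.scala (RDD is stubbed here so the snippet is self-contained; the real classes live in org.apache.spark):

```scala
// Stand-in for Spark's RDD so the sketch compiles on its own.
class RDD[T]

// A narrow dependency: each child partition depends on a small,
// known set of parent partitions.
abstract class NarrowDependency[T](val rdd: RDD[T]) {
  def getParents(partitionId: Int): Seq[Int]
}

// One-to-one: child partition i reads exactly parent partition i.
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
```

Because the mapping is the identity, no data from other partitions is ever needed, which is what makes transformations like map and filter narrow.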
RangeDependency: useful for the union operation, where a range of child partitions depends on a corresponding range of parent partitions.
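A sketch modeled on Spark's RangeDependency (again with RDD stubbed so the snippet is self-contained): child partitions [outStart, outStart + length) map one-to-one onto parent partitions [inStart, inStart + length).

```scala
// Stand-in for Spark's RDD so the sketch compiles on its own.
class RDD[T]

abstract class NarrowDependency[T](val rdd: RDD[T]) {
  def getParents(partitionId: Int): Seq[Int]
}

// Range dependency: shift a child partition index back into the
// parent's index space; partitions outside the range have no parent
// in this particular parent RDD.
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil
}
```

For example, in a union of rddA (3 partitions) with rddB (2 partitions), rddB would get a RangeDependency with inStart = 0, outStart = 3, length = 2, so child partition 4 maps to rddB's partition 1.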
PruneDependency: used by
PartitionPruningRDD: the child RDD contains a subset of partitions of the parents.
NarrowDependency provides the method getParents(partitionId: Int) so that the DAGScheduler can put it to good use when determining the preferred location to compute a task: if getParents, called along a chain of RDDs, keeps returning the same value (the same partition), then the transformations on those RDDs can be computed locally.
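To illustrate that locality argument (this is my own sketch, not DAGScheduler code): if every dependency in a lineage chain is one-to-one, getParents keeps returning the same partition index, so all the transformations on partition i can be pipelined into a single, locally executed task.

```scala
// Model a one-to-one dependency as a function from a child partition
// id to its parent partition ids.
def oneToOne(partitionId: Int): Seq[Int] = List(partitionId)

// A chain is pipelinable for a partition when every step maps that
// partition straight back onto the same parent partition index.
def canPipeline(chain: Seq[Int => Seq[Int]], partitionId: Int): Boolean =
  chain.forall(dep => dep(partitionId) == List(partitionId))
```

So for a lineage like map -> filter -> map, canPipeline(Seq(oneToOne _, oneToOne _, oneToOne _), i) holds for every partition i, and the scheduler can run the whole chain where partition i already lives.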
In Spark, here is the list of transformations that create narrow dependencies:
Transformations on <Key, Value> RDDs have this kind of dependency (ShuffleDependency), and they involve the shuffle step. Therefore, we need a partitioner (to decide where to put intermediate values), a serializer (to serialize objects so that they can be sent through the network), a shuffleHandle to handle the shuffle step, …
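To make the partitioner's role concrete, here is a minimal hash-based sketch (my own simplified version of the idea behind Spark's HashPartitioner, not its actual code): it decides which reduce-side partition each intermediate key is routed to during the shuffle.

```scala
// Route each key to one of numPartitions reduce-side partitions.
class SimpleHashPartitioner(val numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // hashCode can be negative
  }
}
```

The key property is determinism: the same key always lands in the same partition, so all values for one key end up on one reducer.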
Here is the list of transformations that create wide dependencies:
coalesce: may or may not perform a shuffle, depending on the given parameters.
cogroup: depending on how its parents are partitioned, it can create a narrow or a wide dependency.
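The shuffle-free case of coalesce can be sketched like this (my assumption: a simple contiguous grouping; Spark's real coalesce uses a more elaborate, locality-aware assignment): each child partition depends on a whole range of parent partitions, which is still a narrow dependency because the set of parents is known without shuffling data.

```scala
// Assign numParent parent partitions to numChild child partitions by
// contiguous ranges: parent i goes to child i * numChild / numParent.
def coalesceGroups(numParent: Int, numChild: Int): Seq[Seq[Int]] =
  (0 until numParent)
    .groupBy(i => i * numChild / numParent) // child index for parent i
    .toSeq.sortBy(_._1)
    .map(_._2.toSeq)
```

For example, coalescing 8 partitions down to 3 yields the groups (0,1,2), (3,4,5), (6,7): every parent partition feeds exactly one child, so no shuffle is required.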