The storage module in Spark provides the data access service for the application, including:
- Reading and storing data from various sources: HDFS, local disk, RAM, or even fetching blocks of data from other nodes over the network.
- Caching data (RDDs) at different levels and in different formats: in memory, on disk, …
The most striking difference between Spark and Hadoop is Spark's use of the cluster's RAM to produce fast results, so it is worth going a bit deeper to understand the caching API.
How the RDD caching system works
- The application first calls one of two methods on the RDD object: persist() or cache() (which is simply persist() with the default storage level).
- The SparkContext object keeps persistentRDDs as the list of cached RDDs. Through it, SparkContext can hold a reference to an RDD being cached, or remove that reference and then trigger garbage collection; it can even periodically remove stale or no-longer-referenced cached RDDs, thanks to the time-stamped weak references it keeps.
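The steps above map directly onto the public caching API. A minimal sketch, assuming a local-mode SparkContext and the classic RDD API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000000)

    // Mark the RDD for caching; nothing is materialized yet (persist is lazy).
    // rdd.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY).
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    rdd.count() // first action computes the partitions and caches them
    rdd.count() // subsequent actions are served from the cache

    // Drop the RDD from the cache (and from SparkContext's persistentRDDs).
    rdd.unpersist()

    sc.stop()
  }
}
```

Note that an RDD's storage level cannot be changed once assigned; pick the level before the first action materializes the data.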
Cache at nodes
The CacheManager stores computed RDD partitions in the BlockManager, so that on a later request it can either return the cached partition or recompute it.
BlockManager
- A “write-once” key-value store on each worker
- Serves shuffle data (fetched locally or remotely) as well as cached RDDs
- Tracks StorageLevel (RAM, disk) for each block
- Spills data to disk if memory is insufficient
- Can replicate data across nodes
ConnectionManager
- Asynchronous-IO-based networking library
- Allows fetching blocks from BlockManagers
- Allows prioritization/chunking across connections
- Fetch logic tries to optimize for block sizes
MapOutputTracker
- Tracks where each map task in a shuffle ran
- Tells reduce tasks the map locations
- Each worker caches the locations to avoid re-fetching
- A “generation ID” passed with each Task allows invalidating the cache when map outputs are lost.
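The generation-ID mechanism can be illustrated with a toy model (a hypothetical class for illustration only, not Spark's actual MapOutputTracker code): a worker keeps cached map-output locations only as long as its generation is current, and a task arriving with a newer generation ID flushes the stale cache.

```scala
import scala.collection.mutable

// Toy model of generation-based cache invalidation.
class MapOutputCache {
  private var generation = 0L
  private val locations = mutable.Map[Int, Seq[String]]()

  // `taskGeneration` travels with each task; `fetch` asks the master
  // for the map locations of a shuffle when they are not cached.
  def lookup(taskGeneration: Long, fetch: Int => Seq[String])(shuffleId: Int): Seq[String] = {
    if (taskGeneration > generation) { // map outputs were lost: drop stale locations
      locations.clear()
      generation = taskGeneration
    }
    locations.getOrElseUpdate(shuffleId, fetch(shuffleId))
  }
}
```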
Different levels of caching
The class and companion object org.apache.spark.storage.StorageLevel act as flags for controlling the storage level of an RDD:
Where to keep the data:
- RAM (by default)
- Tachyon (OFF_HEAP)
- Hard disk (when the data does not fit in memory)
And in which form:
- Serialized format
- Replicated partitions on multiple nodes
|Level|Use Disk|Use Memory|Use Off Heap|Object (Deserialized)|#Replication|
|---|---|---|---|---|---|
|MEMORY_ONLY|no|yes|no|yes|1|
|MEMORY_ONLY_2|no|yes|no|yes|2|
|MEMORY_ONLY_SER|no|yes|no|no|1|
|MEMORY_AND_DISK|yes|yes|no|yes|1|
|MEMORY_AND_DISK_SER|yes|yes|no|no|1|
|DISK_ONLY|yes|no|no|no|1|
|OFF_HEAP|no|no|yes|no|1|
At the moment, the OFF_HEAP storage level cannot be combined with the other flags: no disk or on-heap memory usage, no deserialized objects, and no multiple replicas.
MEMORY_ONLY: if the RDD doesn’t fit in memory, some partitions will not be cached (they will be re-computed when needed).
MEMORY_AND_DISK: if the RDD doesn’t fit in memory, store the partitions that don’t fit on disk (and re-read them from there when needed).
OFF_HEAP: stores the RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Refer to the Tachyon project for more information. Because the cached data lives in Tachyon rather than on the executor’s heap, the crash of an executor no longer leads to losing the in-memory cache.
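The flag combinations in the table are visible on the StorageLevel values themselves. A small sketch, assuming the classic org.apache.spark.storage.StorageLevel API:

```scala
import org.apache.spark.storage.StorageLevel

// Each predefined level is just a combination of the flags from the table above.
val level = StorageLevel.MEMORY_AND_DISK
assert(level.useDisk && level.useMemory && !level.useOffHeap)
assert(level.deserialized && level.replication == 1)

// A custom combination: serialized format, memory + disk, replicated on two nodes.
val custom = StorageLevel(
  useDisk = true, useMemory = true, deserialized = false, replication = 2)
```

Custom levels like this are occasionally useful when none of the predefined constants matches the desired trade-off between speed, memory use, and fault tolerance.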