Here are the essential materials that I’ve found to facilitate my learning of Spark:
1. Learning Spark
- “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” (2011): Apache Spark started as a research project in the AMPLab at UC Berkeley, so it is worth starting with their first paper. The slides and a recorded presentation are also available.
- Spark screencast: [Outdated] Get your hands dirty by following the official Spark screencasts, which show how to install Spark and run some examples.
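To get a feel for the programming model the RDD paper describes, here is a sketch of its log-mining example in Scala. This is not runnable on its own: it assumes a live `SparkContext` named `sc` (as provided by the Spark shell), and the HDFS path is a placeholder.

```scala
// Sketch of the log-mining example from the RDD paper.
// Assumes a running SparkContext `sc`; the HDFS path is a placeholder.
val lines  = sc.textFile("hdfs://...")                   // RDD backed by stable storage
val errors = lines.filter(_.startsWith("ERROR"))         // lazy transformation: nothing runs yet
errors.persist()                                         // ask Spark to keep this RDD in memory
val count  = errors.filter(_.contains("MySQL")).count()  // action: triggers the actual computation
```

The key idea from the paper is visible here: transformations like `filter` only build a lineage graph, and work happens only when an action such as `count()` is called, with `persist()` keeping intermediate results in memory across queries.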
I’ve always found books one of the most effective ways to learn, thanks to their well-structured content and the ease of getting an overview and following along step by step. Spark is still a young and promising project (first released on October 15, 2012), so as of this writing (November 2014) only a handful of books are available:
- Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, O’Reilly - highly recommended
- Fast Data Processing with Spark - not recommended
- Advanced Analytics with Spark - recommended
- Spark in Action
There are also several online courses:
- Cloud Computing & Distributed Systems - Eurecom Institute
- Introduction to Big Data with Apache Spark - Databricks & UC Berkeley
- Scalable Machine Learning - Databricks & UC Berkeley
2. Spark internals for developers
If you want to read Spark’s source code, understand the system, or tweak its internals:
- Spark’s Wiki: explains how contributions work and lists the developer tools.
- AMP Camp 4 labs: a gentle introduction to the Spark ecosystem, with hands-on exercises covering the features Spark supports.
- Programming in Scala: Spark is written in Scala, so getting familiar with Scala is a must.
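Since Spark’s RDD API leans heavily on Scala’s functional collection idioms, a quick way to check your Scala footing is to try the same `filter`/`map`/`reduce` style on plain collections, with no Spark installation required. A small sketch (the object name and sample data are mine):

```scala
// Plain Scala: the same functional style Spark's RDD API uses.
object ScalaWarmup {
  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "scala", "rdd", "shuffle")
    // filter/map chain, analogous to RDD transformations
    val lengths = words.filter(_.length > 3).map(_.length)
    // reduce, analogous to an RDD action
    val total = lengths.reduce(_ + _)
    println(total) // 5 + 5 + 7 = 17
  }
}
```

If expressions like `_.length > 3` and `reduce(_ + _)` read naturally to you, the Spark API will feel familiar; if not, that is exactly the gap Programming in Scala fills.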