Application isolation in Spark's current architecture makes it impossible to share data (mostly RDDs) between different applications without writing it to external storage. Refer to my post on Spark deployment modes to understand how an application acquires resources in a cluster. There are two main workarounds:
- Externalize RDDs: save the serialized RDDs to a distributed file system so that another application can read them back, e.g. HDFS (disk speed) or Tachyon (in-memory speed).
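A minimal sketch of this first approach, using Spark's built-in `saveAsObjectFile`/`objectFile` pair (which Java-serializes partitions to files). The HDFS path and application names here are made up; any shared file system path works:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Application A: persist an RDD so a different application can read it later.
object WriterApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("writer"))
    val rdd = sc.parallelize(Seq(1, 2, 3))
    // Writes Java-serialized blocks; hypothetical shared path.
    rdd.saveAsObjectFile("hdfs://namenode:8020/shared/my-rdd")
    sc.stop()
  }
}

// Application B: a completely separate SparkContext reloads the RDD.
object ReaderApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reader"))
    val rdd = sc.objectFile[Int]("hdfs://namenode:8020/shared/my-rdd")
    println(rdd.sum())
    sc.stop()
  }
}
```

Note that this shares a *snapshot* of the data, not the live RDD: the reader pays the cost of deserialization and re-reading from storage, which is exactly the overhead Tachyon (in-memory) reduces relative to HDFS (disk).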
- Use SparkJobServer: the key idea is a central server that runs jobs through either a shared SparkContext or an independent one. The application no longer creates the SparkContext itself; the SparkJobServer does, so RDDs created under a shared context become sharable across jobs. Features of SparkJobServer:
- “Spark as a Service”: Simple REST interface for all aspects of job, context management
- Supports sub-second low-latency jobs via long-running job contexts
- Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
- Kill running jobs via stop context
- Separate jar uploading step for faster job startup
- Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
- Works with standalone Spark as well as Mesos and yarn-client mode
- Job and jar info is persisted via a pluggable DAO interface
- Named RDDs: cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs
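The named-RDD feature above can be sketched as a pair of job-server jobs. This is an illustrative sketch, not the definitive API: the exact trait and method names (`SparkJob`, `NamedRddSupport`, `namedRdds.update`/`get`) follow the spark-jobserver job API but may differ across versions, and the RDD name `"shared-numbers"` is made up:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Job 1: builds an RDD and publishes it under a name so later jobs
// running in the same long-lived context can find it.
object CacheJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val rdd = sc.parallelize(1 to 100).cache()
    namedRdds.update("shared-numbers", rdd) // register for other jobs
    rdd.count()
  }
}

// Job 2: submitted later to the SAME shared context; retrieves the
// cached RDD by name instead of recomputing or re-reading it.
object SumJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any =
    namedRdds.get[Int]("shared-numbers").map(_.sum()).getOrElse(0.0)
}
```

Both jobs would be packaged into a jar, uploaded once via the REST interface, and then submitted against a named long-running context; because that context owns the SparkContext, the second job sees the first job's cached RDD directly in memory, with none of the serialize-to-storage round trip of the externalization approach.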