How to install Apache Spark in Standalone cluster mode


By following this tutorial, you will be able to:

  • Install the latest version of Apache Spark (1.2.1) in Standalone mode on Ubuntu from scratch.
  • Configure a small cluster comprising 1 Master node and 1 Worker node (you can easily add a 2nd Worker node if needed).

Get a fresh copy of Ubuntu 14.04 and install it on VMware first.

Live instructions:

1. Install Java SDK 6

# optional - remove OpenJDK if you installed it
sudo apt-get purge openjdk*

# install Oracle Java SDK 6
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java6-installer

# specify the JAVA_HOME environment variable in /etc/environment
sudo nano /etc/environment
JAVA_HOME=/usr/lib/jvm/java-6-oracle/

# reload the /etc/environment file in the current shell (log out and back in to apply it system-wide)
source /etc/environment

Now you can verify that Java is installed correctly by running:

java -version
echo $JAVA_HOME


2. Install Scala 2.10.4

Download Scala 2.10.4 here (scroll down and choose the .deb package). After that, just double-click it and the Ubuntu Software Center will do the rest.
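If you prefer the terminal, here is a minimal sketch of the same install (the archive URL is an assumption based on the scala-lang.org download archive):

# download the .deb package and install it with dpkg
wget http://www.scala-lang.org/files/archive/scala-2.10.4.deb
sudo dpkg -i scala-2.10.4.deb

Either way, verify the installation afterwards: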

scala -version

3. Install SSH Remote Access

Spark uses SSH to access its nodes, which would normally require the user to enter a password. This requirement can be eliminated by creating and distributing SSH keys:

# On Worker nodes, install the SSH server so that the Master node can access them
sudo apt-get install openssh-server
# On the Master node, generate an RSA key pair for remote access
ssh-keygen
# To access Worker nodes via SSH without a password (using just our RSA key), copy our public key to each Worker node
ssh-copy-id -i ~/.ssh/id_rsa.pub <username_on_remote_machine>@<IP_address_of_that_remote_machine>
# Example:
ssh-copy-id -i ~/.ssh/id_rsa.pub ntkhoa@192.168.85.136
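Before moving on, you can confirm that passwordless login works (the username and IP follow the example above; adjust them to your setup):

# this should print the Worker node's hostname without asking for a password
ssh ntkhoa@192.168.85.136 hostname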

4. Install Hadoop 2.6 (optional)

  • Get a copy of Hadoop 2.6 (Link) and allow passwordless SSH to localhost, which Hadoop's start-up scripts rely on:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # add our own public key to the list of authorized keys
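This tutorial keeps Hadoop in /usr/local/hadoop (see HADOOP_INSTALL below). Here is a sketch of unpacking the downloaded archive there, assuming the standard hadoop-2.6.0.tar.gz tarball name:

# extract the archive and move it into place
tar xzf hadoop-2.6.0.tar.gz
sudo mv hadoop-2.6.0 /usr/local/hadoop
sudo chown -R $USER /usr/local/hadoop # make it writable by the current user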
  • Configure the Hadoop environment variables
nano ~/.bashrc
# add the following lines
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-6-oracle/
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
#HADOOP VARIABLES END
source ~/.bashrc
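After reloading ~/.bashrc, a quick sanity check (this assumes the archive was unpacked to /usr/local/hadoop as sketched above):

hadoop version # prints the installed Hadoop version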
  • Configure the Hadoop settings. The file paths below are relative to /usr/local/hadoop, and each snippet goes inside the <configuration> tags of its file:

etc/hadoop/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
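Note that Hadoop 2.6 ships only a template for this file; create etc/hadoop/mapred-site.xml from it before adding the property above (run from /usr/local/hadoop):

cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml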

etc/hadoop/hdfs-site.xml

# first create the local folders that will hold the NameNode and DataNode data
mkdir -p ~/hadoop_store/hdfs/namenode
mkdir -p ~/hadoop_store/hdfs/datanode

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/ntkhoa/hadoop_store/hdfs/namenode</value> <!-- Path to store NameNode data in your local folder -->
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/ntkhoa/hadoop_store/hdfs/datanode</value> <!-- Path to store DataNode data in your local folder -->
</property>
  • Finally, run the following commands:
hdfs namenode -format # format the HDFS
start-dfs.sh
start-yarn.sh
jps
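If everything started correctly, jps should list the HDFS and YARN daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager), and a quick HDFS round trip should work:

hdfs dfs -mkdir -p /user/$USER # create a home directory in HDFS
hdfs dfs -ls / # list the HDFS root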

The default HTTP ports of the Hadoop web interfaces:

50070    NameNode
50090    Secondary NameNode
50075    DataNode
8088     ResourceManager (cluster and applications)

5. Download a pre-built version of Spark and install it

  • You can download Spark 1.2.1 pre-built for Hadoop 2.4 here.
  • Save and extract it under your home folder (see the sketch after this list).
  • Specify the Worker Nodes for the Master Node: on the Master Node, make a copy of the file ./conf/slaves.template named ./conf/slaves, remove the “localhost” line, and add the IP addresses of your Worker Nodes, one per line.
  • Similarly, make a copy of ./conf/spark-env.sh.template named ./conf/spark-env.sh and add the following settings:
export SPARK_MASTER_IP=192.168.85.135 # the IP address of the Master Node so that the Worker Nodes know where to connect to
export SPARK_WORKER_CORES=1 # number of cores each Worker is allowed to use
export SPARK_WORKER_MEMORY=800m # amount of memory each Worker can allocate to executors
export SPARK_WORKER_INSTANCES=2 # number of Worker processes to run on each Worker Node

In my setup, the Master Node has IP 192.168.85.135 and the Worker Node has IP 192.168.85.136.
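Putting the Spark setup together, here is a sketch of the commands on the Master Node. The archive name matches the 1.2.1 pre-built package for Hadoop 2.4; adjust it if your download differs:

# extract Spark under the home folder
tar xzf spark-1.2.1-bin-hadoop2.4.tgz
cd spark-1.2.1-bin-hadoop2.4

# list the Worker Nodes, one IP per line (this replaces the default "localhost" entry)
echo "192.168.85.136" > conf/slaves

# create spark-env.sh from its template, then append the settings shown above
cp conf/spark-env.sh.template conf/spark-env.sh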

6. Test

Test the local version

./bin/spark-shell #Launch the Spark shell
scala> sc.parallelize(1 to 1000).count() // it should return 1000
scala> exit // exit the Spark shell
./bin/run-example SparkPi # run the Pi example

Launch the cluster and test it

We’ve installed all the prerequisites and set up Spark. Now it’s finally time to start the cluster and play with it.

./sbin/start-all.sh # start our cluster
./sbin/stop-all.sh # if you want to stop our cluster

Then go to http://192.168.85.135:8080/ to see the Master’s web UI. You should see 2 workers, since SPARK_WORKER_INSTANCES=2 launches two Worker processes on our single Worker Node.


Now it’s time to launch the Spark shell and connect it to the Master:

MASTER=spark://192.168.85.135:7077 ./bin/spark-shell
scala> sc.parallelize(1 to 1000).count() // it should return 1000
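You can also submit the bundled SparkPi example to the cluster with spark-submit. The examples jar below is referenced with a wildcard because its exact name depends on the package you downloaded:

./bin/spark-submit --master spark://192.168.85.135:7077 \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-*.jar 100 # the last argument is the number of tasks used to estimate Pi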

Enjoy! :)
