Installing Spark on EC2

This is an account of setting up Spark on my small EC2 cluster of two m3.medium spot instances. Spot instances are a good way of saving over on-demand prices, and your instances keep running as long as the spot price stays below your chosen maximum bid. There are many well-written guides on setting up Spark on an EC2 cluster, but I still got stuck in a few places. I will describe those here, along with the reasons I got stuck, which should help anyone who runs into similar problems.

I will not go into the details of each step; instead, I will focus on the troubleshooting parts.

Step 1: Create an IAM role with EC2 as the service role. This step is not required to set up Spark itself; it is needed only when accessing other AWS services from the instances.
Step 2: Create a security group that allows SSH access from your local work machine. This step is crucial: without it we cannot SSH into the EC2 instances.
Step 3: Launch the EC2 instances with the IAM role and security group. Note that the IAM role and security group can also be attached after the instances have been created, so these steps need not be followed in this exact order.
Step 4: Download and install Spark
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
sudo tar zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt
sudo ln -fs spark-2.2.0-bin-hadoop2.7 /opt/spark
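The symlink gives you a stable /opt/spark path while keeping the versioned directory, so a future upgrade only needs the link repointed. The same layout can be sketched in a temporary directory (the directory names are the ones from the commands above):

```shell
# Sketch: versioned install directory plus a stable symlink, in a temp dir.
root=$(mktemp -d)
mkdir "$root/spark-2.2.0-bin-hadoop2.7"
# The relative link target resolves inside $root, same as /opt/spark above.
ln -fs spark-2.2.0-bin-hadoop2.7 "$root/spark"
target=$(readlink "$root/spark")
echo "$target"    # prints: spark-2.2.0-bin-hadoop2.7
```

To upgrade later, untar the new version alongside and rerun the `ln -fs`; everything that references /opt/spark keeps working.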
Add the following to ~/.bash_profile and source it.
export SPARK_HOME=/opt/spark
PATH=$PATH:$SPARK_HOME/bin
export PATH
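One small refinement: .bash_profile can get sourced more than once, which appends duplicate entries to PATH. A guard like the following (my own sketch, not part of the original setup) keeps the entry unique:

```shell
export SPARK_HOME=/opt/spark
# Append $SPARK_HOME/bin only if it is not already on PATH.
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) ;;                  # already present, do nothing
  *) PATH="$PATH:$SPARK_HOME/bin" ;;
esac
export PATH
```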
Step 5: Smoke test Spark
spark-submit --version
You should see output similar to the following:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_131
Branch
Compiled by user jenkins on 2017-06-30T22:58:04Z
Revision
Url
Type --help for more information.
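If you script the smoke test, the version can be pulled out of the banner. Note that spark-submit writes this banner to stderr, so capture it with `2>&1`; the snippet below demonstrates the parsing on one captured banner line (the grep/awk pipeline is my own sketch):

```shell
# One line of the banner, as captured from `spark-submit --version 2>&1`.
banner='   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0'
# Extract the version number from the "version X.Y.Z" token.
version=$(printf '%s\n' "$banner" | grep -o 'version [0-9.]*' | awk '{print $2}')
echo "$version"    # prints: 2.2.0
```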
Step 6: Repeat the above steps on all instances. The AWS console provides 'Run Command', which can be used to run commands on all or selected instances.
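Without Run Command, a plain SSH loop does the same job. The host names and script name below are hypothetical placeholders; the echo stands in for the actual ssh call:

```shell
# Hypothetical list of cluster hosts; replace with your instances' public DNS names.
hosts="ec2-host-1 ec2-host-2"
for h in $hosts; do
  # In practice: ssh -i key.pem ec2-user@"$h" 'bash -s' < install_spark.sh
  echo "installing Spark on $h"
done
```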

Troubleshooting: During the smoke test I ran into the error below.
$ spark-submit --version
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/spark/launcher/Main : 
Unsupported major.minor version 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
The reason for this error was that the JRE installed on the system was 1.7, while class file version 52.0 corresponds to Java 8; the Spark 2.2.0 binaries were compiled for a newer JRE than the one installed. Updating the JRE to 1.8 resolved the issue.
java -version
sudo yum remove java-1.7.0-openjdk
sudo yum install java-1.8.0
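The '52.0' in the error is not arbitrary: Java class file major versions start at 45 for Java 1.1 and go up by one per release, so the required Java release is the major version minus 44. A quick sanity check:

```shell
# Class file major versions: 45 = Java 1.1, 46 = 1.2, ..., 51 = Java 7, 52 = Java 8.
major=52
java_release=$((major - 44))
echo "class file major version $major needs Java $java_release"    # prints: ... Java 8
```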
