Running a Spark job on an EC2 cluster

In a previous blog we saw how to install Spark on EC2. I am doing this to save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances; even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark ships with its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN.

For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1G), which acts as the master, and two m3.medium instances (1 vCPU, 3.7G), which are the workers.

Before setting up the cluster, make sure that the cluster security group has sufficient permissions and that the master and slaves can communicate with each other. In the security group for this cluster, add three inbound rules that allow all TCP, UDP and ICMP traffic from within the cluster security group itself, as in the sketch below.
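If you prefer to script this step, here is a minimal sketch using Boto3; the security group ID and the region are placeholders for your own values.

# Sketch: add self-referencing inbound rules to the cluster security group.
# The group ID and region below are placeholders; replace them with your own.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
cluster_sg = 'sg-0123456789abcdef0'

ec2.authorize_security_group_ingress(
    GroupId=cluster_sg,
    IpPermissions=[
        # One rule each for TCP, UDP and ICMP, allowing traffic only from
        # members of the same security group.
        {'IpProtocol': proto, 'FromPort': low, 'ToPort': high,
         'UserIdGroupPairs': [{'GroupId': cluster_sg}]}
        for proto, low, high in [('tcp', 0, 65535), ('udp', 0, 65535), ('icmp', -1, -1)]
    ],
)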

Next we have to start the master. The master's IP address and Spark port (default: 7077) are required to start the slaves. If the cluster is in a single VPC, as in my case, internal IP addresses will work; if the cluster nodes are not in a single VPC, you will have to use public IP addresses. To view the master and slave UIs from your local machine, the public IPs are required anyway. The master UI is available at http://master-ip:8080, and the connection string can be obtained from it.
# On Master
[ec2-user@master ~]$ $SPARK_HOME/sbin/start-master.sh

# On Slave 1
[ec2-user@slave1 ~]$ $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-14-55:7077

# On Slave 2
[ec2-user@slave2 ~]$ $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-14-55:7077
In the previous blog we ran a Spark job on WoW auction data locally; now we will run it on the EC2 cluster. To do this, a few changes to the setup part of the code were needed. I moved the input file from local storage to S3 and used the Boto3 library to access the S3 content. To access S3 programmatically, one has to first create an IAM user with programmatic access and then use the provided access key and secret key. The setup ends up looking roughly like the sketch below.
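In this sketch the bucket name wow-auctions and the key auctions.json are placeholders; it also assumes the dump keeps the auction API layout with a top-level auctions list, and that Boto3 can pick up the IAM user's access key and secret key from its usual configuration (for example ~/.aws/credentials).

# Sketch of the setup: pull the auction dump from S3 with Boto3 on the driver
# and hand the records to Spark. The bucket, key and 'auctions' field are
# assumptions about the data layout.
import json
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wow-auction-job').getOrCreate()

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='wow-auctions', Key='auctions.json')
dump = json.loads(obj['Body'].read())

# Distribute the auction records across the cluster.
auctions = spark.sparkContext.parallelize(dump['auctions'])
print(auctions.count())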
When submitting the Spark job, we have to tell it that a cluster is available; otherwise the job will run locally and not use the cluster. This can be done from the command line using
spark-submit --master spark://ip-172-31-14-55:7077 app.py
or by adding the following line to the spark-defaults.conf file.
spark.master                     spark://ip-172-31-14-55:7077
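A third option is to set the master when building the SparkSession in the application itself; a value set in code takes precedence over both the spark-submit flag and spark-defaults.conf. A minimal sketch:

# Sketch: point the application at the Standalone master from code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('spark://ip-172-31-14-55:7077')
         .appName('wow-auction-job')
         .getOrCreate())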
I also changed some settings in the spark-env.sh file, but they did not make much of a difference. Increasing the worker instances per node from 1 to 2 slowed the job down slightly, so I went back to 1, and increasing the executor memory from the default 1G to 2G did not speed up the job either.
SPARK_WORKER_INSTANCES=1
SPARK_EXECUTOR_MEMORY=2g
The progress of the job can be tracked on the master UI, which shows the number of workers and the CPU and memory in use. One can use this information to tune the configuration parameters. The UI also shows the duration of the current job and of past jobs, although the information on past jobs is lost if you restart the cluster master. Also note that only 2.7G of memory is free and available for Spark on a 3.7G instance.
That's my first job running successfully on an EC2 cluster. I will carry on with this series; next I will focus on setting up a web server that runs this job every hour and displays the results on a web page. If you have any queries, please leave a comment and I will respond soon.
