Running a Spark job on an EC2 cluster

In a previous blog we saw how to install Spark on EC2. I am doing this to save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances; even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark ships with its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN.

For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1G), which acts as the master, and two m3.medium instances (1 vCPU, 3.7G), which are the workers.

Before setting up the cluster, make sure that the cluster security group has sufficient permissions and that the master and slaves can communicate with each other. In the security group for this cluster, add three inbound rules that allow all TCP, UDP and ICMP traffic from within the cluster security group itself, as in the sketch below.
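If you prefer to script this step, here is a minimal sketch using Boto3; the security group ID and the region are placeholders for your own values.

# Sketch: add self-referencing inbound rules to the cluster security group.
# The group ID and region below are placeholders; replace them with your own.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
cluster_sg = 'sg-0123456789abcdef0'

ec2.authorize_security_group_ingress(
    GroupId=cluster_sg,
    IpPermissions=[
        # One rule each for TCP, UDP and ICMP, allowing traffic only from
        # members of the same security group.
        {'IpProtocol': proto, 'FromPort': low, 'ToPort': high,
         'UserIdGroupPairs': [{'GroupId': cluster_sg}]}
        for proto, low, high in [('tcp', 0, 65535), ('udp', 0, 65535), ('icmp', -1, -1)]
    ],
)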

Next we have to start the master. The master's IP address and Spark port (default: 7077) are required to start the slaves. If the cluster is in a single VPC, as in my case, internal IP addresses will work; if the cluster nodes are not in a single VPC, you will have to use public IP addresses. To view the master and slave UIs from your local machine, the public IPs are required anyway. The master UI is available at http://master-ip:8080, and the connection string can be obtained from it.
# On Master
[ec2-user@master ~]$ $SPARK_HOME/sbin/start-master.sh

# On Slave 1
[ec2-user@slave1 ~]$ $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-14-55:7077

# On Slave 2
[ec2-user@slave2 ~]$ $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-14-55:7077
In the previous blog we ran a Spark job on WoW auction data locally; now we will run it on the EC2 cluster. To do this, a few changes to the setup part of the code were needed. I moved the input file from local storage to S3 and used the Boto3 library to access the S3 content. To access S3 programmatically, one has to first create an IAM user with programmatic access and then use the provided access key and secret key. The setup ends up looking roughly like the sketch below.
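In this sketch the bucket name wow-auctions and the key auctions.json are placeholders; it also assumes the dump keeps the auction API layout with a top-level auctions list, and that Boto3 can pick up the IAM user's access key and secret key from its usual configuration (for example ~/.aws/credentials).

# Sketch of the setup: pull the auction dump from S3 with Boto3 on the driver
# and hand the records to Spark. The bucket, key and 'auctions' field are
# assumptions about the data layout.
import json
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wow-auction-job').getOrCreate()

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='wow-auctions', Key='auctions.json')
dump = json.loads(obj['Body'].read())

# Distribute the auction records across the cluster.
auctions = spark.sparkContext.parallelize(dump['auctions'])
print(auctions.count())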
When submitting the Spark job, we have to tell it that a cluster is available; otherwise the job will run locally and not use the cluster. This can be done from the command line using
spark-submit --master spark://ip-172-31-14-55:7077 app.py
or by adding the following line to the spark-defaults.conf file.
spark.master                     spark://ip-172-31-14-55:7077
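A third option is to set the master when building the SparkSession in the application itself; a value set in code takes precedence over both the spark-submit flag and spark-defaults.conf. A minimal sketch:

# Sketch: point the application at the Standalone master from code.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('spark://ip-172-31-14-55:7077')
         .appName('wow-auction-job')
         .getOrCreate())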
I also changed some settings in the spark-env.sh file, but they did not make much of a difference. Increasing the worker instances per node from 1 to 2 slowed the job down slightly, so I went back to 1, and increasing the executor memory from the default 1G to 2G did not speed up the job either.
SPARK_WORKER_INSTANCES=1
SPARK_EXECUTOR_MEMORY=2g
The progress of the job can be tracked on the master UI, which shows the number of workers and the CPU and memory in use. One can use this information to tune the configuration parameters. The UI also shows the duration of the current job and of past jobs, although the information on past jobs is lost if you restart the cluster master. Also note that only 2.7G of memory is free and available for Spark on a 3.7G instance.
That's my first job running successfully on an EC2 cluster. I will carry on with this series; next I will focus on setting up a web server that runs this job every hour and displays the results on a web page. If you have any queries, please leave a comment and I will respond soon.
