Posts

Showing posts with the label python

Performance improvement of MySQL inserts in Python using batching

In this blog entry we will see how much performance improvement we can get by using batching for inserts into MySQL 5.7 from a Python 3 client. In this example the number of rows to be inserted is around 140K, and the JSON blob for this data is around 25MB. First we will insert the rows one by one. Note that even in this case I insert all of them and do a single commit at the end. The default behavior of the mysql-connector library is that autocommit is false, which is good for performance; if we committed after each insert, the performance would be even worse. Below are the code and performance details. Number of auctions: 139770. Number of rows affected: 139770. Time taken for insert: 34.17 s. For my use case I was okay with around 30 seconds of insert time for the bulk update, but I thought of trying out the batch insert. I ran into a problem because of the default maximum packet size for MySQL execute statements. I had to change it in my.cnf file by adding max packet siz...
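
A minimal sketch of the two approaches, assuming a hypothetical auctions table with columns (auc, item, bid, buyout, quantity); the connection details and data below are placeholders, not the code from the post. The batched path may also require a larger max_allowed_packet, as described above.

import mysql.connector

# Placeholder connection details and data; autocommit is off by default.
conn = mysql.connector.connect(user='user', password='secret', database='wow')
cursor = conn.cursor()
rows = [(i, 41114, 507084, 1000000, 1) for i in range(140000)]

insert_sql = ("INSERT INTO auctions (auc, item, bid, buyout, quantity) "
              "VALUES (%s, %s, %s, %s, %s)")

# One-by-one: one round trip per row, single commit at the end.
for row in rows:
    cursor.execute(insert_sql, row)
conn.commit()

# Batched: a single executemany() call, which mysql-connector can rewrite
# into multi-row INSERTs; this is where max_allowed_packet matters.
cursor.executemany(insert_sql, rows)
conn.commit()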

Connect to MySQL 5.7 from Python using SSL

In the previous blog we saw how to connect to MySQL Server 5.7 from PHP using SSL. In this entry I will describe the steps I took to connect to MySQL Server 5.7 from Python code. I used mysql-connector as the Python library. The default install (version 2.2.3) failed, so I had to install the previous version (2.1.6): sudo pip install mysql-connector==2.1.6. The code to connect from Python is similar to the PHP code, with similar parameters that have to be set. I did run into a few issues with my query not getting parsed when I passed query parameters separately, so I had to prepare the whole query and send it as a single complete query with no parameters to the 'execute' function. Below is my piece of working code. As I went further and tried to store data fetched from the WoW API, I faced further problems with text encoding. I also had problems with creating multi-line queries, and could not find much discussion material on Stack Overflow. I found out though that Python 2...
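
A minimal sketch of such an SSL connection with mysql-connector; the host, credentials and certificate paths are placeholders, while ssl_ca, ssl_cert and ssl_key are standard connect() arguments.

import mysql.connector

conn = mysql.connector.connect(
    host='mysql.example.com',
    user='appuser',
    password='secret',
    database='wow',
    ssl_ca='/etc/mysql/certs/ca.pem',
    ssl_cert='/etc/mysql/certs/client-cert.pem',
    ssl_key='/etc/mysql/certs/client-key.pem',
)
cursor = conn.cursor()

# Parameter binding gave me parsing trouble, so the query is sent as one
# complete string with no bind parameters.
cursor.execute("SELECT item, buyout FROM auctions WHERE quantity = 1")
for item, buyout in cursor:
    print(item, buyout)
conn.close()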

Running a Spark job on EC2 cluster

In a previous blog we saw how to install Spark on EC2. I am doing this so that I can save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances; even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs locally on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark ships with its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN. For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1G), which will act as the master, and two m3.medium instances (1 vCPU, 3.7G), which will be the workers. Before setting up the cluster, make sure that the cluster security group has sufficient permissions and the master and slaves can communicate with each...
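
A minimal PySpark sketch of pointing a job at the Standalone cluster manager; the master hostname and executor memory are placeholders chosen to fit the instance sizes above, not values from the post. The same job could also be submitted with spark-submit against the same spark:// master URL.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-demo")
        .setMaster("spark://ec2-master-hostname:7077")  # standalone master URL
        .set("spark.executor.memory", "2g"))            # fits an m3.medium worker

sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())  # trivial job to verify the workers are used
sc.stop()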

Experimenting with WoW data - Part 2

In the last part we went through how to get the WoW auction data using the developer APIs. The auction data dump (auctions.json) is updated once every hour. I noticed that this dump is updated just before the hour in UTC, so a scheduled job that fetches the updated auction dump every hour should work fine. In this section we will use Spark to do some basic analysis on the auction data. Simple items have a row like the one shown below; items like legendary equipment and pets have additional fields such as bonusLists, petSpeciesId etc. Let's take a look at a row of auction data. {"auc":1018440074,"item":41114,"owner":"Lougincath", "ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1, "timeLeft":"LONG","rand":0,"seed":0,"context":0} Next we will put the auction JSON data into a dataframe. As the data dump has some additional me...
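
A minimal sketch of loading the dump into a DataFrame, assuming Spark 2.2+ (for the multiLine option) and assuming the dump wraps the rows in an "auctions" array; that wrapper field name is an assumption, while the per-row fields match the sample above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("wow-auctions").getOrCreate()

# The dump is one large JSON document, not line-delimited JSON.
raw = spark.read.json("auctions.json", multiLine=True)
auctions = raw.select(explode("auctions").alias("a")).select("a.*")

auctions.printSchema()
auctions.select("item", "owner", "buyout", "quantity").show(5)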

Experimenting with WoW data - Part 1

I will now delve into real data, and the dataset I have chosen is the auction data for World of Warcraft. Each realm has its own auction house, and I will start with the auction house of a single realm. There are more than two hundred realms in the NA region alone, and there are three more regions: Europe, Asia-Pacific and China. To fetch the data, one will need an API key, which can be easily obtained by registering on Blizzard Dev. Below is the simple code to get the data and print it. Please note that this API returns the metadata; the response contains the location of the complete auction data, which has to be fetched next. This response also contains the last-modified time, which can be persisted and used to find out when the auction data dump has been updated. Auction metadata The data dump takes quite a while to retrieve (a few minutes on my broadband connection), so it would be nice to add a progress status message. Below is a nice utility function to do that. The print...
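
A minimal sketch of the metadata fetch and a streamed download with a rough progress message, using requests; the realm, key, endpoint and response field names (files, url, lastModified) follow the old community API and are assumptions that may not match the current Blizzard API or the code in the post.

import requests

API_KEY = "your-api-key"   # placeholder
realm = "dalaran"

meta_url = "https://us.api.battle.net/wow/auction/data/{}".format(realm)
meta = requests.get(meta_url, params={"apikey": API_KEY}).json()
dump = meta["files"][0]
print("Last modified:", dump["lastModified"])  # persist this to detect a new dump

# Stream the large dump to disk, printing a rough progress message.
resp = requests.get(dump["url"], stream=True)
with open("auctions.json", "wb") as f:
    for i, chunk in enumerate(resp.iter_content(chunk_size=1 << 20), start=1):
        f.write(chunk)
        print("\rDownloaded ~{} MB".format(i), end="")
print()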

Aggregate using Python Spark (pyspark)

Finally I am getting hands-on with data processing, and here I am posting a simple aggregation task using Python Spark. The task is to calculate the aggregate spend per customer and display the data in sorted order. The aggregation is a simple reduce job on key-value pairs of customer ID and each individual spend. Spark provides sorting by key [ sortByKey() ] out of the box, but to sort by value one needs to provide a lambda to the more generic sortBy() function.
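
A minimal sketch of this aggregation, assuming a hypothetical "customer_id,amount" CSV; the file name and layout are placeholders, the reduceByKey/sortBy pattern is the one described above.

from pyspark import SparkContext

sc = SparkContext(appName="customer-spend")

totals = (sc.textFile("customer-orders.csv")
            .map(lambda line: line.split(","))
            .map(lambda parts: (parts[0], float(parts[1])))    # (customer, spend)
            .reduceByKey(lambda a, b: a + b)                   # aggregate spend per customer
            .sortBy(lambda pair: pair[1], ascending=False))    # sort by value, not key

for customer, spend in totals.collect():
    print(customer, round(spend, 2))

sc.stop()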