Posts

Showing posts from July, 2017

Connect to MySQL 5.7 from Python using SSL

In the previous blog we saw how to connect to MySQL Server 5.7 from PHP using SSL. In this entry I will describe the steps I took to connect to MySQL Server 5.7 from Python code. I used mysql-connector as the Python library. The default install (version 2.2.3) failed, so I had to install the previous version (2.1.6):

sudo pip install mysql-connector==2.1.6

The Python code to connect is similar to the PHP code, with similar parameters that have to be set. I did run into a few issues with my query not getting parsed when I passed the query parameters separately, so I had to prepare the whole query and send it as a single complete string with no parameters in the 'execute' function. Below is my piece of working code. As I went further and tried to insert data fetched from the WoW API, I faced further problems with text encoding, and I also had problems with creating multi-line queries. I could not find much discussion material on Stack Overflow. I found out though that Python 2
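A minimal sketch of what such a connection looks like with mysql-connector 2.1.x (not the post's actual code; the host, credentials, certificate paths, table and query are placeholders):

import mysql.connector

conn = mysql.connector.connect(
    host="mysql-host",
    user="appuser",
    password="secret",
    database="appdb",
    ssl_ca="/etc/mysql/ssl/ca.pem",
    ssl_cert="/etc/mysql/ssl/client-cert.pem",
    ssl_key="/etc/mysql/ssl/client-key.pem",
)

cursor = conn.cursor()
# As noted above, the whole statement is built as one string; no separate parameter list is passed to execute()
cursor.execute("SELECT id, name FROM items WHERE item_class = 2")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()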

Connect to MySQL Server 5.7 from PHP 7.0 using SSL

It took me from late morning to evening to get a connection to MySQL Server 5.7 from PHP 7.0 over SSL on an EC2 machine running the Amazon Linux OS image. There were many steps and each had its own challenge. Below I describe the steps and how I worked around the problems. Step 1: Upgrade PHP from 5.6 to 7.0. First one has to remove the existing version of PHP and then install the newer version. I tried installing version 7.1 first, but I ran into dependency issues like libpng15, for which the easiest way to install is to build from source. To avoid falling into this dependency cycle, I tried version 7.0 instead, and that one installed smoothly.

# Remove PHP 5.6
[ec2-user@ ~]$ sudo yum remove php5*
# Install PHP 7.0
[ec2-user@ ~]$ sudo yum install php70 php70-mysqlnd php70-imap php70-gd \
    php70-pecl-memcache php70-pecl-apcu

Step 2: Upgrade the MySQL server from 5.6 to 5.7. The default repos did not have MySQL 5.7. I also could not install it by adding repos, as ev

Running a Spark job on EC2 cluster

In a previous blog we saw how to install Spark on EC2. I am doing this so that I can save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances; even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs locally on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark has its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN. For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1G), which will act as the master, and two m3.medium instances (1 vCPU, 3.7G) which will be the workers. Before setting up the cluster, make sure that the cluster security group has sufficient permissions and the master and slaves can communicate with each other.
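As a minimal sketch of what this means for the job itself (the master host name and port below are placeholders for your master's spark:// URL), the only difference from local mode is the master URL handed to the SparkContext:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-smoke-test")
        .setMaster("spark://ec2-master-host:7077"))   # Standalone master URL instead of "local"
sc = SparkContext(conf=conf)

# Trivial job just to confirm the workers are reachable
print(sc.parallelize(range(1000)).sum())
sc.stop()

Equivalently, the same spark:// URL can be passed to spark-submit via its --master option.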

Experimenting with WoW data - Part 2

In the last part we went through how to get the WoW auction data using the developer APIs. The auction data dump (auctions.json) is updated once every hour; I noticed that the update lands just before the hour in UTC, so a scheduled job that fetches the dump every hour should work fine. In this section we will use Spark to do some basic analysis on the auction data. Simple items have a row like the one shown below; items like legendary equipment and pets have additional fields like bonusLists, petSpeciesId etc. Let's take a look at a row of auction data.

{"auc":1018440074,"item":41114,"owner":"Lougincath",
 "ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,
 "timeLeft":"LONG","rand":0,"seed":0,"context":0}

Next we will put the auction JSON data into a dataframe. As the data dump has some additional metadata
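As a rough sketch of that step (not the post's exact code; it assumes the dump is a single JSON document with a top-level auctions array matching the row shown above), the rows can be re-serialized and handed to Spark's JSON reader so that optional fields such as bonusLists simply come through as null columns:

import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wow-auctions").getOrCreate()

# Parse the hourly dump with plain Python first (the file name is a placeholder)
with open("auctions.json") as f:
    dump = json.load(f)

# One JSON string per auction row; Spark merges the schema across rows
rows = spark.sparkContext.parallelize([json.dumps(a) for a in dump["auctions"]])
auctions = spark.read.json(rows)

# Example aggregation: number of listings and cheapest buyout per item
auctions.groupBy("item") \
        .agg(F.count("*").alias("listings"), F.min("buyout").alias("min_buyout")) \
        .orderBy(F.desc("listings")) \
        .show(10)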

Installing Spark on EC2

This is an account of setting up Spark on my small EC2 cluster of two m3.medium spot instances. Spot instances are a good way of saving on on-demand prices, and you also get the option of retaining your instances for as long as the spot price stays below your chosen maximum bid. There are many well-written guides about setting up Spark on an EC2 cluster, but I still got stuck in a few places. I will describe those here, along with the reasons for getting stuck, which should be helpful for anyone who faces similar problems. I will not go into the details of each step, but delve into the details of only the troubleshooting parts. Step 1: Create an IAM role for the EC2 service. This step is not required for the Spark setup itself; it is needed only when accessing other AWS services. Step 2: Create a security group with SSH access from your local work machine. This step is crucial, as without it we cannot SSH into the EC2 machines. Step 3: Launch the EC2 instances with the IAM role and security group

Experimenting with WoW data - Part 1

I will now delve into real data, and the dataset I have chosen is the auction data for World of Warcraft. Each realm has its own auction house, and I will start with the auction house of a single realm. There are more than two hundred realms in the NA region alone, and there are three more regions: Europe, Asia-Pacific and China. To fetch the data, one will need an API key, which can be easily obtained by registering on Blizzard Dev. Below is the simple code to get the data and print it. Please note that this API returns only the metadata; the response contains the location of the complete auction data, which has to be fetched next. This response also contains the last-modified time, which can be persisted and used to find out when the auction data dump has been updated.

Auction metadata

The data dump takes quite a while to retrieve (a few minutes on my broadband connection), so it would be nice to add a progress status message. Below is a nice utility function to do that. The print com
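As a rough sketch of that metadata call (not the post's exact code; the host, endpoint, realm slug and response fields reflect the 2017-era community API and should be treated as assumptions):

import requests

API_KEY = "your-api-key"          # obtained by registering on Blizzard Dev
REALM = "dalaran"                 # realm slug is a placeholder

url = ("https://us.api.battle.net/wow/auction/data/{realm}"
       "?locale=en_US&apikey={key}").format(realm=REALM, key=API_KEY)

meta = requests.get(url).json()

# The metadata points at the full dump and carries its last-modified timestamp
dump_info = meta["files"][0]
print(dump_info["url"], dump_info["lastModified"])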

Aggregate using Python Spark (pyspark)

Finally I am getting hands-on with data processing, and here I am posting a simple aggregate task using Python Spark. The task is to calculate the aggregate spend per customer and display the data in sorted order. The aggregation is a simple reduce job on the key-value pairs of customer ID and each individual spend. Spark provides sorting by key [ sortByKey() ] out of the box, but to sort by value one needs to provide a lambda to the more generic sortBy() function.
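A minimal sketch of that job (not the post's actual code; the input path and the "customer_id,item,amount" line layout are assumptions):

from pyspark import SparkContext

sc = SparkContext("local", "customer-spend")

lines = sc.textFile("orders.csv")

# (customer_id, spend) pairs, summed per customer, then sorted by total spend descending
totals = (lines.map(lambda line: line.split(","))
               .map(lambda fields: (fields[0], float(fields[2])))
               .reduceByKey(lambda a, b: a + b)
               .sortBy(lambda kv: kv[1], ascending=False))

for customer, spend in totals.collect():
    print(customer, spend)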

Correlation for feature selection

One of the initial tasks in the creation of an ML model is to figure out which are the most important features in the feature set. This is useful for reducing the number of features in data sets that have thousands of features. In the case of data sets with fewer than a hundred features, feature selection is helpful in reducing the final model size, which matters if the model is to be served in a real-time, in-memory scenario. There are many approaches to feature selection, some of them fairly involved, like using information gain. A simpler approach is to use correlation, where good feature subsets contain features highly correlated with the classification, yet uncorrelated with each other. One of the statistics used to measure correlation is the Pearson correlation coefficient. It has a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.
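As a small illustration of that idea (not from the original post; the CSV file and the "label" column name are assumptions), one can rank features by their absolute Pearson correlation with the class label and inspect the feature-feature correlation matrix for redundant pairs:

import pandas as pd

# Hypothetical data set with numeric feature columns and one "label" column
df = pd.read_csv("train.csv")
features = df.drop("label", axis=1)

# Correlation of each feature with the target, strongest first
target_corr = features.corrwith(df["label"]).abs().sort_values(ascending=False)
print(target_corr.head(10))

# Pairwise feature correlations; highly correlated pairs are candidates for pruning
print(features.corr(method="pearson"))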