Data Drudgery

Posts

An example application of algorithms and data structures in the software industry

June 11, 2019

A very common question among people who are new to, or about to enter, the software industry is where and how algorithms and data structures are used. Software development is mostly about applying known algorithms to a particular problem domain and pairing them with a data structure chosen either to fit the input space or to exploit some specific property of the domain for better performance. Below is one such example from workplace analytics, a new domain in which organizations use data to find out which perks work and which attributes contribute most to employee satisfaction and performance. One example is Google finding out, based on data, that new mothers had a very high attrition rate; Google increased maternity leave by two months, which reduced attrition among new mothers by 50%. Similarly, some sales-focused organizations have found, using workplace analytics tools, that employees not the ones wor...

Performance improvement of MySQL inserts in Python using batching

August 01, 2017

In this blog entry we will see how much performance improvement we can get by batching inserts into MySQL 5.7 from a Python 3 client. In this example the number of rows to be inserted is around 140K, and the JSON blob for this data is around 25 MB. First we will insert the rows one by one. Note that even in this case I insert all of them and then do a single commit at the end. By default, autocommit is off in the mysql-connector library, which is good for performance: committing after each insert would be even slower. Below are the code and performance details.

Number of auctions: 139770
Number of rows affected: 139770
Time taken for insert: 34.17 s

For my use case I was okay with around 30 seconds of insert time for the bulk update, but I thought of trying out the batch insert anyway. I ran into a problem because of the default max packet size for MySQL execute statements. I had to change it in my.cnf file by adding max packet siz...
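The batching idea described in this excerpt can be sketched roughly as follows. This is not the code from the post: it is a minimal illustration that uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server, and the table name, columns, and helper names (`chunked`, `insert_in_batches`) are hypothetical. With mysql-connector the same pattern should apply, with `%s` placeholders instead of `?`.

```python
import sqlite3
from itertools import islice


def chunked(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(connection, sql, rows, batch_size=1000):
    """Send rows to the database one batch at a time, then commit once.

    Returns the number of batches sent.
    """
    cursor = connection.cursor()
    batches = 0
    for batch in chunked(rows, batch_size):
        cursor.executemany(sql, batch)
        batches += 1
    # A single commit at the end, as in the one-by-one version of the post.
    connection.commit()
    cursor.close()
    return batches


# Stand-in demo: an in-memory SQLite table with hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auctions (auc INTEGER PRIMARY KEY, item INTEGER, buyout INTEGER)"
)
rows = [(i, 41114, 1000000) for i in range(2500)]
batches_sent = insert_in_batches(
    conn, "INSERT INTO auctions VALUES (?, ?, ?)", rows, batch_size=1000
)
row_count = conn.execute("SELECT COUNT(*) FROM auctions").fetchone()[0]
print(batches_sent, row_count)
```

The batch size is a tuning knob: smaller batches keep each statement small, which matters when the server limits the packet size, while larger batches reduce the number of round trips.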

Connect to MySQL 5.7 from Python using SSL

July 30, 2017

In the previous blog entry we saw how to connect to MySQL Server 5.7 from PHP using SSL. In this entry I will describe the steps I took to connect to MySQL Server 5.7 from Python code. I used mysql-connector as the Python library. The default install (version 2.2.3) failed, so I had to install the previous version (2.1.6):

    sudo pip install mysql-connector==2.1.6

The code to connect from Python is similar to the PHP code, and similar parameters have to be set. I did run into a few issues with my query not getting parsed when I passed the query values as separate parameters, so I had to build the whole query myself and send it as a single complete query, with no parameters, to the 'execute' function. Below is my piece of working code. As I went further and tried to insert data fetched from the WoW API, I faced further problems with text encoding. I also had problems with creating multi-line queries. I could not find a lot of discussion material on Stack Overflow. I found out though that Python 2...
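As a rough illustration of the parameters involved, and not the working code from the post, an SSL connection with mysql-connector might be configured as sketched below. The `ssl_ca`, `ssl_cert`, and `ssl_key` arguments are connection options of `mysql.connector.connect()`; the host, credentials, file paths, and helper names here are hypothetical placeholders.

```python
def build_ssl_config(host, user, password, database, ca_path, cert_path, key_path):
    """Return keyword arguments for mysql.connector.connect() with SSL enabled."""
    return {
        "host": host,
        "user": user,
        "password": password,
        "database": database,
        # Certificate authority file used to verify the server.
        "ssl_ca": ca_path,
        # Client certificate and private key presented to the server.
        "ssl_cert": cert_path,
        "ssl_key": key_path,
    }


def connect_with_ssl(config):
    """Open a connection; requires the mysql-connector package and a reachable server."""
    # Imported lazily so the config can be built without the driver installed.
    import mysql.connector

    return mysql.connector.connect(**config)


# Hypothetical values for illustration only.
config = build_ssl_config(
    host="db.example.com",
    user="app_user",
    password="change-me",
    database="wow",
    ca_path="/path/to/ca.pem",
    cert_path="/path/to/client-cert.pem",
    key_path="/path/to/client-key.pem",
)
print(sorted(config))
```

Calling `connect_with_ssl(config)` is left out of the demo because it needs a running MySQL server with SSL configured.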

Connect to MySQL Server 5.7 from PHP 7.0 using SSL

July 27, 2017

It took me from late morning to evening to connect to MySQL Server 5.7 from PHP 7.0 over SSL on an EC2 machine running the Amazon Linux OS image. There were many steps, and each had its own challenge. Below I list the steps and how I worked around the problems.

Step 1: Upgrading PHP from 5.6 to 7.0. First one has to remove the existing version of PHP and then install a newer version. I tried installing version 7.1 first, but I ran into dependency issues such as libpng15, for which the easiest way to install is to build from source. To avoid falling into this dependency cycle, I tried installing version 7.0, and this one installed smoothly.

    # Remove PHP 5.6
    [ec2-user@ ~]$ sudo yum remove php5*
    # Install PHP 7.0
    [ec2-user@ ~]$ sudo yum install php70 php70-mysqlnd php70-imap php70-gd \
        php70-pecl-memcache php70-pecl-apcu

Step 2: Upgrading MySQL server from 5.6 to 5.7. The default repos did not have MySQL 5.7. I also could not install it by adding repos, as ev...

Running a Spark job on EC2 cluster

July 26, 2017

In a previous blog entry we saw how to install Spark on EC2. I am doing this so that I can save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances. Even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs locally on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark has its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN. For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1 GB), which will act as the master, and two m3.medium instances (1 vCPU, 3.7 GB), which will be the workers. Before setting up the cluster, make sure that the cluster security group has sufficient permissions and the master and slaves can communicate with each...

Experimenting with WoW data - Part 2

July 20, 2017

In the last part we went through how to get the WoW auction data using the developer APIs. The auction data dump (auctions.json) is updated once every hour. I noticed that this dump is updated just before the hour (UTC), so a scheduled job that fetches the updated auction dump every hour should work fine. In this section we will use Spark to do some basic analysis on the auction data. Simple items have a row like the one shown below. Items like legendary equipment and pets will have additional fields such as bonusLists and petSpeciesId. Let's take a look at a row of auction data.

    {"auc":1018440074,"item":41114,"owner":"Lougincath",
     "ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,
     "timeLeft":"LONG","rand":0,"seed":0,"context":0}

Next we will put the auction JSON data into a dataframe. As the datadump has some additional me...
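Before involving Spark, the sample row quoted in this excerpt can be inspected with Python's standard json module. The sketch below is only an illustration: the helper name `summarize_auction` and the derived `buyout_per_unit` value are assumptions added for the example, not part of the original analysis.

```python
import json

# The sample auction row quoted in the excerpt.
row_json = (
    '{"auc":1018440074,"item":41114,"owner":"Lougincath",'
    '"ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,'
    '"timeLeft":"LONG","rand":0,"seed":0,"context":0}'
)

# Fields the excerpt says appear only on some items (legendary equipment, pets).
OPTIONAL_FIELDS = ("bonusLists", "petSpeciesId")


def summarize_auction(raw):
    """Parse one auction row and return a few fields of interest."""
    auction = json.loads(raw)
    return {
        "auc": auction["auc"],
        "item": auction["item"],
        "owner": auction["owner"],
        # Illustrative derived value: buyout price divided by stack size.
        "buyout_per_unit": auction["buyout"] / auction["quantity"],
        # Which optional fields are present on this row, if any.
        "extra_fields": [name for name in OPTIONAL_FIELDS if name in auction],
    }


summary = summarize_auction(row_json)
print(summary)
```

Checking for the optional fields up front is useful because rows in the same dump do not all share one shape, which affects how a schema is inferred once the data is loaded into a dataframe.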
