Posts

An example application of design patterns in the software industry

A common mechanism used to solve problems in the software industry is the workflow, at least when described very broadly. A service has to do a particular job, but to do that job it has to query a couple of other services, process all the information, persist it, and then update a downstream service with the processed information. I'll give an example from the advertising domain. Any advertising platform must find a way to filter out bad ads, for example because of vulgar text or a photo that does not follow resolution guidelines. The seller interface registers the new ads in the main database. The ad quality service periodically gets the new ads that are marked as under processing from the registration service. The ad quality service has to put the ad through multiple local checks with local persistence, and then finally it has to update the ad server with the final verdict on whether the ad follows all guidelines and is good to...
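
The workflow described in this excerpt (run several local checks on an ad, persist each result, then produce a final verdict) can be sketched roughly as follows. This is only a minimal illustration, not necessarily the design pattern the post goes on to use. The `Ad` fields, the two check functions, the banned-word list, the minimum resolution, and the in-memory `results` dictionary standing in for local persistence are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ad:
    # Hypothetical ad record; the real schema is not given in the post.
    ad_id: int
    text: str
    image_width: int
    image_height: int


# A check takes an ad and returns True when the ad passes.
Check = Callable[[Ad], bool]

BANNED_WORDS = {"vulgar"}      # assumed placeholder word list
MIN_WIDTH, MIN_HEIGHT = 300, 250  # assumed resolution guideline


def text_is_clean(ad: Ad) -> bool:
    """Pass when no word of the ad text is in the banned list."""
    return not any(word in BANNED_WORDS for word in ad.text.lower().split())


def resolution_ok(ad: Ad) -> bool:
    """Pass when the image meets the assumed minimum resolution."""
    return ad.image_width >= MIN_WIDTH and ad.image_height >= MIN_HEIGHT


@dataclass
class AdQualityWorkflow:
    checks: List[Check]
    # In-memory stand-in for the local persistence mentioned in the post.
    results: Dict[int, Dict[str, bool]] = field(default_factory=dict)

    def run(self, ad: Ad) -> bool:
        """Run every check, record each outcome, and return the final verdict."""
        outcome = {check.__name__: check(ad) for check in self.checks}
        self.results[ad.ad_id] = outcome
        return all(outcome.values())
```

In a real service, the verdict returned by `run` would be what gets sent to the ad server, and new checks could be added to the list without touching the workflow itself.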

An example application of algorithms and data structures in the software industry

A very common question among people new to, or about to enter, the software industry is where and how algorithms and data structures are used. Software development is mostly about applying known algorithms to a particular problem domain and tweaking them with some data structure, either to fit the input space or to improve performance by exploiting some specificity of the problem domain. Below is one such example from workplace analytics, a new domain where organizations use data to find out which perks work and which attributes contribute more to employee satisfaction and performance. One example is Google finding out, based on data, that new mothers had a very high attrition rate; Google increased maternity leave by two months, which reduced attrition among new mothers by 50%. Similarly, some sales-focused organizations have found, using workplace analytics tools, that employees not the ones wor...

Performance improvement of MySQL inserts in Python using batching

In this blog entry we will see how much performance improvement we can get by batching inserts into MySQL 5.7 from a Python 3 client. In this example the number of rows to be inserted is around 140K. The JSON blob for this data is around 25 MB. First we will insert the rows one by one. Note that even in this case, I am inserting all of them and then doing a single commit. By default, the mysql-connector library has autocommit turned off. This is good for performance: if we committed after each insert, the performance would be even worse. Below are the code and performance details. Number of auctions: 139770. Number of rows affected: 139770. Time taken for insert: 34.17 s. For my use case, I was okay with around 30 seconds of insert time for the bulk update, but I thought of trying out the batch insert. I ran into a problem because of the default max packet size for MySQL execute statements. I had to change it in my.cnf file by adding max packet siz...
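
As a rough illustration of the batching idea in this excerpt, the sketch below splits the rows into fixed-size chunks and sends each chunk with one `executemany` call, which is part of the standard Python DB-API cursor interface. It is not the post's own code: the `Row` shape, the helper names, and the default batch size of 1000 are assumptions. Keeping batches bounded is one way to stay under a server-side packet size limit instead of sending all rows in a single statement.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape (auc, item, buyout); the real table is not shown.
Row = Tuple[int, int, int]


def chunked(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(cursor, sql: str, rows: Iterable[Row],
                      batch_size: int = 1000) -> int:
    """Call cursor.executemany once per batch; return the number of rows sent.

    The caller is expected to commit once after all batches are sent.
    """
    sent = 0
    for batch in chunked(rows, batch_size):
        cursor.executemany(sql, batch)
        sent += len(batch)
    return sent
```

With a real connection, `cursor` would come from the database library's `connection.cursor()`, followed by a single `connection.commit()` after `insert_in_batches` returns.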

Connect to MySQL 5.7 from Python using SSL

In the previous blog we saw how to connect to MySQL Server 5.7 from PHP using SSL. In this entry I will describe the steps I took to connect to MySQL Server 5.7 from Python code. I used mysql-connector as the Python library. The default install (version 2.2.3) failed, so I had to install the previous version (v. 2.1.6). sudo pip install mysql-connector==2.1.6 The code to connect using Python is similar to the PHP code, with similar parameters that have to be set. I did run into a few issues with my query not getting parsed when I passed the query parameters as a separate argument. So I had to prepare the whole query and send it as a single complete query with no parameters in the 'execute' function. Below is my piece of working code. As I went further and tried to store data fetched from the WoW API, I faced further problems with text encoding. I also had problems with creating multi-line queries. I could not find a lot of discussion material on Stack Overflow. I found out though that Python 2...

Connect to MySQL Server 5.7 from PHP 7.0 using SSL

It took me from late morning to evening to connect to MySQL Server 5.7 from PHP 7.0 over SSL on an EC2 machine running the Amazon Linux OS image. There were many steps and each had its own challenge. Below I list the steps and how I worked around the problems. Step 1: Upgrading PHP from 5.6 to 7.0. First one has to remove the existing version of PHP and then install a newer version. I tried installing version 7.1 first, but I ran into dependency issues such as libpng15, for which the easiest way to install is to build from source. To avoid falling into this dependency cycle, I tried installing version 7.0, and this one installed smoothly. # Remove PHP 5.6 [ec2-user@ ~]$ sudo yum remove php5* # Install PHP 7.0 [ec2-user@ ~]$ sudo yum install php70 php70-mysqlnd php70-imap php70-gd \ php70-pecl-memcache php70-pecl-apcu Step 2: Upgrading MySQL server from 5.6 to 5.7. The default repos did not have MySQL 5.7. I also could not install it by adding repos, as ev...

Running a Spark job on EC2 cluster

In a previous blog we saw how to install Spark on EC2. I am doing this so that I can save on the cost of EMR on top of EC2, which can be over two thousand USD per year for large instances. Even for smaller instances the savings can be up to 30%. In this blog entry we will see how to run a Spark job on a cluster. You can run Spark jobs in local mode, where the job runs locally on a single machine. To run Spark jobs on a cluster, a cluster manager is required. Spark has its own simple cluster manager, called the Standalone cluster manager. Industry applications usually swap the Standalone cluster manager for either Apache Mesos or Hadoop YARN. For this example I have set up a small cluster with one t2.micro instance (1 vCPU, 1G), which will act as the master, and two m3.medium instances (1 vCPU, 3.7G), which will be the workers. Before setting up the cluster make sure that the cluster security group has sufficient permissions and the master and slaves can communicate with each...

Experimenting with WoW data - Part 2

In the last part we went through how to get the WoW auction data using the developer APIs. The auction data dump (auctions.json) is updated once every hour. I noticed that this dump is updated just before the hour in UTC, so a scheduled job that gets the updated auction dump every hour should work fine. In this section we will use Spark to do some basic analysis on the auction data. Simple items have a row like the one shown below. Items like legendary equipment and pets will have additional fields like bonusLists, petSpeciesId etc. Let's take a look at a row of auction data. {"auc":1018440074,"item":41114,"owner":"Lougincath", "ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1, "timeLeft":"LONG","rand":0,"seed":0,"context":0} Next we will put the auction JSON data into a dataframe. As the datadump has some additional me...
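
To make the shape of an auction row concrete, here is a small sketch that parses the sample row from this excerpt with Python's standard `json` module. It uses plain Python rather than the Spark dataframe the post works with, and the `buyout_per_unit` helper is a hypothetical example of a derived value, not something taken from the post.

```python
import json

# The sample auction row quoted in the excerpt above.
SAMPLE_ROW = (
    '{"auc":1018440074,"item":41114,"owner":"Lougincath",'
    '"ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,'
    '"timeLeft":"LONG","rand":0,"seed":0,"context":0}'
)


def buyout_per_unit(row: dict) -> float:
    """Hypothetical helper: buyout price divided by the stack quantity."""
    return row["buyout"] / row["quantity"]


auction = json.loads(SAMPLE_ROW)

# Optional fields such as bonusLists or petSpeciesId are absent on simple
# items, so dict.get avoids a KeyError when reading them.
bonus_lists = auction.get("bonusLists", [])
```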