Experimenting with WoW data - Part 2

In the last part we went through how to get the WoW auction data using the developer APIs. The auction data dump (auctions.json) is updated once every hour. As I noticed, that this dump is updated just before the hour in UTC. So scheduled job to get the updated auction dump every hour should work fine. In this section we will use Spark to do some basic analysis on the auction data. Simple items have a row like the one shown below. Items like legendary equipment, pets will have additional fields like bonusLists, petSpeciesId etc. Let's take a look at a row of auction data.
{"auc":1018440074,"item":41114,"owner":"Lougincath",
"ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,
"timeLeft":"LONG","rand":0,"seed":0,"context":0}
Next we will put the auction json data into a dataframe. As the datadump has some additional metadata, I had to do some parsing to get just the auction data, which can be converted into a dataframe. I am also providing a schema to the createDataFrame method, so that the dataframe only has fields required for our initial analysis. Below is the python code for that. The output will look similar to the one I got. My dataset had 140819 rows.
Number of auctions:  140819
+----------+------+------------+----------+--------+--------+--------+
|       auc|  item|       owner|ownerRealm|     bid|  buyout|quantity|
+----------+------+------------+----------+--------+--------+--------+
|1017396389|  2982|Walkwalkwalk|   Dalaran|  314999|  314999|       1|
|1017396390| 36146|Walkwalkwalk|   Dalaran|  481939|  481939|       1|
|1017396384|  2981|Walkwalkwalk|   Dalaran| 1342809| 1342809|       1|
Let's go over a few basic examples of analyzing the auction data. First, lets find out the number of distinct items on the auction house. That comes to around thirteen thousand. I will also find the top items based on quantity on auction house, as well top items based on total market value (sum of all bids for a particular item). I am using UDF to get the maximum of two columns, but inbuilt function for max is also available in pyspark.sql.functions. Note that directly using python max function will not work. I got the below output.
Number of auctions:  140819
Number of distinct items on auction:  13364
+------+-------------+-----------------+
|  item|sum(quantity)|sum(effectiveBid)|
+------+-------------+-----------------+
|124113|       108507|      11678439349|
|123918|        59878|      18362965110|
|109119|        50460|       3197544964|
+------+-------------+-----------------+
only showing top 3 rows

+------+-------------+-----------------+
|  item|sum(quantity)|sum(effectiveBid)|
+------+-------------+-----------------+
| 82800|         3024|     524016606350|
|141641|            3|     124121362260|
| 55403|            4|      89003250767|
+------+-------------+-----------------+
only showing top 3 rows
In the next section I will try to decorate the analyzed data by fetching the decorations using the developer API. The item ID 124113 corresponds to 'Stonehide Leather', 123918 to 'Leystone Ore' and so on.

Comments

Popular posts from this blog

Connect to MySQL 5.7 from Python using SSL

Performance improvement of MySQL inserts in Python using batching

Connect to MySQL Server 5.7 from PHP 7.0 using SSL