Experimenting with WoW data - Part 2

In the last part we went through how to fetch the WoW auction data using the developer APIs. The auction data dump (auctions.json) is refreshed once every hour, and from what I have seen it lands just before the top of the hour in UTC, so a scheduled job that pulls the dump every hour works fine. In this part we will use Spark to do some basic analysis on the auction data. A simple item has a row like the one shown after the fetch sketch below; items such as legendary equipment and pets carry additional fields like bonusLists, petSpeciesId and so on.
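Here is a minimal sketch of such a fetch script, meant to run from an hourly cron job a few minutes past the hour. The endpoint, realm and API key below are placeholders following the old community API from the last part; adjust them to your own setup.

import requests

# Placeholders: realm and API key as set up in the last part
API_KEY = "your-api-key"
STATUS_URL = "https://us.api.battle.net/wow/auction/data/dalaran"

# The status endpoint points at the latest hourly dump
status = requests.get(STATUS_URL, params={"apikey": API_KEY}).json()
dump_url = status["files"][0]["url"]

with open("auctions.json", "wb") as f:
    f.write(requests.get(dump_url).content)

Now let's take a look at a row of auction data.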
{"auc":1018440074,"item":41114,"owner":"Lougincath",
"ownerRealm":"Dalaran","bid":507084,"buyout":1000000,"quantity":1,
"timeLeft":"LONG","rand":0,"seed":0,"context":0}
Next we will load the auction JSON into a dataframe. The dump wraps the auctions in some additional metadata, so a little parsing is needed to pull out just the list of auction rows before converting it into a dataframe. I am also passing a schema to the createDataFrame method so that the dataframe only keeps the fields required for our initial analysis. The Python code is below, and the output should look similar to the one I got; my dataset had 140819 rows.
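Here is a sketch of the loading code. I am assuming the auction rows sit under a top-level "auctions" key in the dump; adjust the lookup if your dump nests them differently.

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, LongType,
                               IntegerType, StringType)

spark = SparkSession.builder.appName("wow-auctions").getOrCreate()

# Keep only the fields needed for the initial analysis
schema = StructType([
    StructField("auc", LongType()),
    StructField("item", IntegerType()),
    StructField("owner", StringType()),
    StructField("ownerRealm", StringType()),
    StructField("bid", LongType()),
    StructField("buyout", LongType()),
    StructField("quantity", IntegerType()),
])

# The dump wraps the rows in extra metadata; keep just the auction list
with open("auctions.json") as f:
    auctions = json.load(f)["auctions"]

# Convert each dict to a tuple in schema order before building the frame
rows = [tuple(a.get(f.name) for f in schema.fields) for a in auctions]
df = spark.createDataFrame(rows, schema)

print("Number of auctions: ", df.count())
df.show(3)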
Number of auctions:  140819
+----------+------+------------+----------+--------+--------+--------+
|       auc|  item|       owner|ownerRealm|     bid|  buyout|quantity|
+----------+------+------------+----------+--------+--------+--------+
|1017396389|  2982|Walkwalkwalk|   Dalaran|  314999|  314999|       1|
|1017396390| 36146|Walkwalkwalk|   Dalaran|  481939|  481939|       1|
|1017396384|  2981|Walkwalkwalk|   Dalaran| 1342809| 1342809|       1|
+----------+------+------------+----------+--------+--------+--------+
only showing top 3 rows
Let's go over a few basic examples of analyzing the auction data. First, let's find the number of distinct items on the auction house; that comes to around thirteen thousand. I will also find the top items by quantity listed on the auction house, as well as the top items by total market value (the sum of all bids for a particular item). I am using a UDF to get the maximum of two columns, though pyspark.sql.functions also has a built-in for the element-wise maximum (greatest). Note that Python's own max function will not work directly on columns. I got the output below.
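Here is a sketch of that analysis. I am assuming the effectiveBid column in the output is the larger of bid and buyout for each auction, which is what the UDF computes.

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

print("Number of distinct items on auction: ",
      df.select("item").distinct().count())

# UDF returning the larger of two column values; F.greatest(df.bid,
# df.buyout) would do the same, while plain Python max() cannot be
# applied to Column objects
max_udf = F.udf(lambda a, b: max(a, b), LongType())

# Assumption: effectiveBid = max(bid, buyout) for each auction
bids = df.withColumn("effectiveBid", max_udf(df.bid, df.buyout))
totals = bids.groupBy("item").agg(F.sum("quantity"), F.sum("effectiveBid"))

# Top items by quantity listed, then by total market value
totals.orderBy("sum(quantity)", ascending=False).show(3)
totals.orderBy("sum(effectiveBid)", ascending=False).show(3)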
Number of auctions:  140819
Number of distinct items on auction:  13364
+------+-------------+-----------------+
|  item|sum(quantity)|sum(effectiveBid)|
+------+-------------+-----------------+
|124113|       108507|      11678439349|
|123918|        59878|      18362965110|
|109119|        50460|       3197544964|
+------+-------------+-----------------+
only showing top 3 rows

+------+-------------+-----------------+
|  item|sum(quantity)|sum(effectiveBid)|
+------+-------------+-----------------+
| 82800|         3024|     524016606350|
|141641|            3|     124121362260|
| 55403|            4|      89003250767|
+------+-------------+-----------------+
only showing top 3 rows
In the next part I will try to decorate the analyzed data with item details fetched using the developer API. For example, item ID 124113 corresponds to 'Stonehide Leather', 123918 to 'Leystone Ore' and so on.
