Aggregate using Python Spark (pyspark)
Finally I am getting hands-on with data processing, and here I am posting a simple aggregation task using Python Spark (PySpark). The task is to calculate the aggregate spend per customer and display the data in sorted order. The aggregation itself is a simple reduce job on the key-value pairs of customer ID and each individual spend.
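Conceptually, the reduce step just folds each customer's individual spends into a running total. A plain-Python sketch of what reduceByKey does, using made-up sample pairs:

```python
# Made-up (customerID, spend) pairs, as parseLine would produce them
pairs = [(44, 37.25), (12, 5.50), (44, 12.50), (12, 20.00)]

# Fold every spend into a per-customer running total,
# mirroring reduceByKey(lambda x, y: x + y)
totals = {}
for cust_id, spend in pairs:
    totals[cust_id] = totals.get(cust_id, 0.0) + spend

print(totals)  # {44: 49.75, 12: 25.5}
```

The difference in Spark is that this fold happens in parallel across partitions, with partial totals combined at the end.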
Spark provides sorting by key (sortByKey()) out of the box, but to sort by value one needs to pass a lambda to the more generic sortBy() function.
from pyspark import SparkConf, SparkContext
import collections  # used only by the commented-out alternative below

conf = SparkConf().setMaster("local").setAppName("AggregateSpend")
sc = SparkContext(conf = conf)

# Parse a CSV line into a (customerID, spend) pair
def parseLine(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))

lines = sc.textFile("file:///SparkCourse/customer-orders.csv")
parsedLines = lines.map(parseLine)

# Sum the individual spends per customer, then sort by total spend
aggregateSpends = parsedLines.reduceByKey(lambda x, y: x + y)
aggregateSpendsSorted = aggregateSpends.sortBy(lambda x: x[1])
results = aggregateSpendsSorted.collect()

# Another way of sorting by value
# results = dict(aggregateSpends.collect())
# results = collections.OrderedDict(sorted(results.items(), key=lambda t: t[1]))
# results = results.items()

for key, value in results:
    print("%i %f" % (key, value))
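The commented-out alternative above sorts on the driver with collections.OrderedDict. Since Python 3.7 plain dicts preserve insertion order, so the same driver-side sort can be written with sorted() alone. A minimal sketch on made-up (customerID, totalSpend) pairs:

```python
# Made-up sample of what aggregateSpends.collect() might return
results = [(44, 4756.89), (12, 1053.21), (68, 6375.45)]

# Sort by value (total spend) ascending on the driver.
# For large datasets, prefer sortBy() so the sort runs on the cluster
# before collect() brings the results back.
results_sorted = sorted(results, key=lambda t: t[1])

for key, value in results_sorted:
    print("%i %.2f" % (key, value))
```

Note that sortBy() also takes an ascending flag, so a descending sort by spend is just sortBy(lambda x: x[1], ascending=False).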