Aggregate using Python Spark (pyspark)
Finally I am getting hands-on with data processing, and here I am posting a simple aggregation task using Python Spark (PySpark). The task is to calculate the aggregate spend per customer and display the data in sorted order. The aggregation itself is a simple reduce job on the key-value pairs of customer ID and each individual spend.
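Conceptually, the reduce step just folds each customer's individual spends into a running total. A plain-Python sketch of what reduceByKey does, using made-up sample pairs:

```python
# Made-up (customerID, spend) pairs, as parseLine would produce them
pairs = [(44, 37.25), (12, 5.50), (44, 12.50), (12, 20.00)]

# Fold every spend into a per-customer running total,
# mirroring reduceByKey(lambda x, y: x + y)
totals = {}
for cust_id, spend in pairs:
    totals[cust_id] = totals.get(cust_id, 0.0) + spend

print(totals)  # {44: 49.75, 12: 25.5}
```

The difference in Spark is that this fold happens in parallel across partitions, with partial totals combined at the end.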
Spark provides sorting by key (sortByKey()) out of the box, but to sort by value one needs to pass a lambda to the more generic sortBy() function.
from pyspark import SparkConf, SparkContext
import collections  # used only by the commented-out alternative below

conf = SparkConf().setMaster("local").setAppName("AggregateSpend")
sc = SparkContext(conf = conf)

# Parse a CSV line into a (customerID, spend) pair
def parseLine(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))

lines = sc.textFile("file:///SparkCourse/customer-orders.csv")
parsedLines = lines.map(parseLine)

# Sum the individual spends per customer, then sort by total spend
aggregateSpends = parsedLines.reduceByKey(lambda x, y: x + y)
aggregateSpendsSorted = aggregateSpends.sortBy(lambda x: x[1])
results = aggregateSpendsSorted.collect()

# Another way of sorting by value
# results = dict(aggregateSpends.collect())
# results = collections.OrderedDict(sorted(results.items(), key=lambda t: t[1]))
# results = results.items()

for key, value in results:
    print("%i %f" % (key, value))
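The commented-out alternative above sorts on the driver with collections.OrderedDict. Since Python 3.7 plain dicts preserve insertion order, so the same driver-side sort can be written with sorted() alone. A minimal sketch on made-up (customerID, totalSpend) pairs:

```python
# Made-up sample of what aggregateSpends.collect() might return
results = [(44, 4756.89), (12, 1053.21), (68, 6375.45)]

# Sort by value (total spend) ascending on the driver.
# For large datasets, prefer sortBy() so the sort runs on the cluster
# before collect() brings the results back.
results_sorted = sorted(results, key=lambda t: t[1])

for key, value in results_sorted:
    print("%i %.2f" % (key, value))
```

Note that sortBy() also takes an ascending flag, so a descending sort by spend is just sortBy(lambda x: x[1], ascending=False).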