Aggregate using Python Spark (pyspark)

I am finally getting hands-on with data processing, so here is a simple aggregation task using Python Spark (PySpark). The task is to calculate the total spend per customer and display the results in sorted order. The aggregation is a simple reduce job over key-value pairs of customer ID and individual spend amount.
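A minimal sketch of the reduce step might look like the following. The input layout (CSV lines of customer ID, item ID, amount), the file name customer-orders.csv, and the helper names parse_line, total_spend_by_customer and run_local are all assumptions made for illustration, not details from the original task.

```python
from operator import add


def parse_line(line):
    """Turn one CSV line into a (customer_id, amount) pair.

    Assumed layout: customer_id,item_id,amount (hypothetical).
    """
    fields = line.split(",")
    return int(fields[0]), float(fields[2])


def total_spend_by_customer(lines_rdd):
    """Sum the spend per customer.

    lines_rdd is an RDD of CSV strings; reduceByKey adds up all
    amounts that share the same customer ID.
    """
    return lines_rdd.map(parse_line).reduceByKey(add)


def run_local(path="customer-orders.csv"):
    """Run the aggregation on a local Spark master (path is hypothetical)."""
    # Imported here so the helpers above can be reused without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("SpendByCustomer")
    sc = SparkContext(conf=conf)
    try:
        return total_spend_by_customer(sc.textFile(path)).collect()
    finally:
        sc.stop()
```

Calling run_local() with a real file path returns a list of (customer_id, total) pairs in no particular order, since reduceByKey makes no ordering guarantee.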

Spark provides sorting by key out of the box with sortByKey(), but to sort by value you need to pass a key function (typically a lambda) to the more generic sortBy() function.
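As a rough sketch of the two sorting options, assume totals_rdd is an RDD of (customer_id, total_spend) pairs. The names spend_of, sort_by_customer and sort_by_spend are hypothetical helpers introduced here for illustration.

```python
def spend_of(pair):
    """Key function: pick the value out of a (customer_id, total) pair.

    Equivalent to the lambda `lambda pair: pair[1]`.
    """
    return pair[1]


def sort_by_customer(totals_rdd):
    """Sort by key (customer ID) using the built-in sortByKey()."""
    return totals_rdd.sortByKey()


def sort_by_spend(totals_rdd, ascending=True):
    """Sort by value (total spend) by handing a key function to sortBy()."""
    return totals_rdd.sortBy(spend_of, ascending=ascending)
```

Passing ascending=False to sort_by_spend would list the biggest spenders first. The same key function also works with Python's built-in sorted() on a collected list, which is handy for checking the logic without a Spark context.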

