
732A54 and TDDE31 Big Data Analytics

BDA2 - Spark SQL - Assignments


Note: please make sure you have gone through the BDA Lab Introduction.

PySpark 3.5.1 API reference

For BDA2 - Spark SQL, refer to the PySpark 3.5.1 API documentation.

BDA2 - Spark SQL - Assignments

Redo the assignments from BDA1 using Spark SQL wherever possible. The initial processing of the CSV files (such as splitting on ;) can be done with Spark's map.

There are two ways to write queries in Spark SQL: using the built-in API functions or running SQL-like queries. To pass this lab, you must use the built-in API functions for all five assignments.

For each assignment, include the following data in your report, sorted as shown below.

Important: You may use either the local or the yarn option to submit your jobs. However, to pass BDA1 and BDA2, you must demonstrate that you have used both options and submit the updated run_yarn*.q file (for instance, pick one question and show that you used a different option to submit the job).

Assignment 1

year, station with the max, maxValue ORDER BY maxValue DESC
year, station with the min, minValue ORDER BY minValue DESC

Assignment 2

year, month, value ORDER BY value DESC
year, month, value ORDER BY value DESC

Assignment 3

year, month, station, avgMonthlyTemperature ORDER BY avgMonthlyTemperature DESC

Assignment 4

station, maxTemp, maxDailyPrecipitation ORDER BY station DESC

Assignment 5

year, month, avgMonthlyPrecipitation ORDER BY year DESC, month DESC


Page responsible: Huanyu Li
Last updated: 2026-04-08