
732A54 and TDDE31 Big Data Analytics

BDA1 - Spark - Assignments


Note: please make sure you have gone through the BDA Lab Introduction.

PySpark 3.5.1 API reference

For the BDA Spark labs, refer to the PySpark 3.5.1 API documentation.

BDA1 - Spark - Assignments

In this set of exercises you will work exclusively with Spark. This means that in your programs, you only need to create the SparkContext. In a number of exercises you will be asked to calculate temperature averages (daily and monthly). These are not always computed according to the standard definition of "average".

Please note that in this domain, the daily average temperature is calculated by averaging the daily measured maximum and the daily measured minimum temperatures. The monthly average is calculated by averaging the daily maximums and minimums for that month. For example, to get the monthly average for October, take the maximum and minimum for each of its 31 days, sum them up and divide by 62 (which is the same as taking the daily averages, summing them up and dividing by the number of days). See note 1 at the end of this page.
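As a worked illustration of this definition, here is a minimal sketch in plain Python. The function name and the sample values are made up for illustration and are not part of the lab material.

```python
def monthly_average(daily_min_max):
    """Average all daily minimums and maximums of one month.

    daily_min_max: list of (min_temp, max_temp) tuples, one per day.
    """
    total = sum(low + high for low, high in daily_min_max)
    # Two values per day, so divide by twice the number of days
    # (62 for a 31-day month such as October).
    return total / (2 * len(daily_min_max))


# Hypothetical three-day "month", just to check the equivalence stated above.
days = [(2.0, 10.0), (4.0, 12.0), (0.0, 8.0)]
daily_averages = [(low + high) / 2 for low, high in days]

# Averaging the daily averages gives the same result.
assert monthly_average(days) == sum(daily_averages) / len(days)
```

In a Spark program the same idea would typically be expressed by first reducing to one minimum and one maximum per station and day, and then aggregating those per month.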

Important: You may use either the local or yarn option to submit your jobs. However, to pass BDA1 and BDA2, you must demonstrate that you have used both options and submit the updated run_yarn*.q file (for instance, pick one question and show you use a different option to submit the job).

Assignment 1

What are the lowest and highest temperatures measured each year for the period 1950-2014? Provide the lists sorted in descending order with respect to the maximum temperature. In this exercise you will use the temperature-readings.csv file.

The output should at least contain the following information (you can also include a Station column so that you may find multiple stations that record the highest or lowest temperature):
Year, temperature

Note: Filtering before the reduce step will save time and resources when running your program.
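The following sketch shows what "filtering before the reduce step" could look like. It assumes that each line of temperature-readings.csv has the semicolon-separated layout `station;date;time;temperature;quality` with dates written as `YYYY-MM-DD`. That layout is an assumption and is not stated on this page, so check it against the lab introduction. The helper names are hypothetical, and the functions are shown in plain Python so they can be tried without a cluster.

```python
def parse_reading(line):
    """Turn one CSV line into (year, (station, temperature)).

    Assumed layout: station;date;time;temperature;quality
    """
    fields = line.split(";")
    year = int(fields[1][0:4])
    return (year, (fields[0], float(fields[3])))


def in_period(record, first=1950, last=2014):
    """Keep only records whose year lies in the requested period."""
    return first <= record[0] <= last


# With an RDD of text lines, these helpers might be used roughly as
#   lines.map(parse_reading).filter(in_period)
# followed by the reduce step, so that records outside the period
# are dropped before any shuffle takes place.
sample = [
    "102170;2013-11-01;06:00:00;6.8;G",   # made-up sample lines
    "102170;1949-01-01;06:00:00;-3.1;G",
]
kept = [r for r in map(parse_reading, sample) if in_period(r)]
```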

Assignment 2

Count the number of readings for each month in the period 1950-2014 which are higher than 10 degrees. Repeat the exercise, this time taking only distinct readings from each station. That is, if a station reported a reading above 10 degrees in some month, then it appears only once in the count for that month.

In this exercise you will use the temperature-readings.csv file.

The output should contain the following information:
Year, month, count
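The difference between the two counts can be illustrated with a small plain-Python sketch. The records below are made up, and the tuple layout `(year, month, station, temperature)` is an assumption about how you might parse the file, not something prescribed by the lab.

```python
from collections import Counter

# Hypothetical, already parsed records: (year, month, station, temperature)
records = [
    (2000, 7, "A", 12.5),
    (2000, 7, "A", 15.0),
    (2000, 7, "B", 11.0),
    (2000, 7, "B", 9.0),
    (2000, 8, "A", 10.0),  # exactly 10 degrees is not "higher than 10"
]

above = [r for r in records if r[3] > 10]

# First variant: every reading above 10 degrees counts.
all_counts = Counter((year, month) for year, month, _, _ in above)

# Second variant: each station counts at most once per month.
# In Spark this roughly corresponds to mapping to (year, month, station),
# calling distinct(), and then counting per (year, month).
distinct_stations = {(year, month, station) for year, month, station, _ in above}
distinct_counts = Counter((year, month) for year, month, _ in distinct_stations)
```

Here July 2000 has three qualifying readings but only two distinct stations, and August 2000 has none.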

Assignment 3

Find the average monthly temperature for each available station in Sweden. Your result should include the average temperature for each station for each month in the period 1960-2014. Bear in mind that not every station has readings for each month in this timeframe.

In this exercise you will use the temperature-readings.csv file.

The output should contain the following information:
Year, month, station number, average monthly temperature

Assignment 4

Provide a list of stations with their associated maximum measured temperatures and maximum measured daily precipitation. Show only those stations where the maximum temperature is between 25 and 30 degrees and the maximum daily precipitation is between 100 mm and 200 mm.

In this exercise you will use the temperature-readings.csv and precipitation-readings.csv files.

The output should contain the following information:
Station number, maximum measured temperature, maximum daily precipitation
How many records are found in the result? Please justify your answer.

Assignment 5

Calculate the average monthly precipitation for the Östergötland region (the list of stations is provided in a separate file) for the period 1993-2016. To do this, you will first need to calculate the total monthly precipitation for each station before calculating the monthly average (by averaging over stations).

In this exercise you will use the precipitation-readings.csv and stations-Ostergotland.csv files.

Hint (not for the SparkSQL lab): Avoid using joins here. stations-Ostergotland.csv is small and if distributed it will cause a number of unnecessary shuffles when joined with the precipitation RDD. If you distribute precipitation-readings.csv, either repartition your stations RDD to 1 partition or use the collect function to get a Python list and the broadcast function to broadcast the list to all nodes.

The output should contain the following information:
Year, month, average monthly precipitation
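The two-stage aggregation and the membership test suggested in the hint can be sketched in plain Python as follows. The station numbers, the readings, and the tuple layout `(station, year, month, precipitation)` are all made up for illustration. In a Spark program the set of stations would typically come from `sc.broadcast(...)` and be read through `.value` inside the filter function.

```python
from collections import defaultdict

# Hypothetical station ids for the region; stands in for a broadcast value.
region_stations = {"85250", "85130"}

# Hypothetical, already parsed readings: (station, year, month, precipitation_mm)
readings = [
    ("85250", 2000, 1, 1.0),
    ("85250", 2000, 1, 2.0),
    ("85130", 2000, 1, 5.0),
    ("99999", 2000, 1, 50.0),  # station outside the region, filtered out
]

# Stage 1: total monthly precipitation per station.
# The membership test replaces a join with the stations data.
station_totals = defaultdict(float)
for station, year, month, amount in readings:
    if station in region_stations:
        station_totals[(year, month, station)] += amount

# Stage 2: average the station totals for each (year, month).
totals_by_month = defaultdict(list)
for (year, month, _station), total in station_totals.items():
    totals_by_month[(year, month)].append(total)

regional_average = {
    key: sum(totals) / len(totals) for key, totals in totals_by_month.items()
}
```

Averaging the raw readings directly would give a different number, which is why the per-station totals are computed first.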

Questions

  • How do you modify a *.q script to submit a job?
  • For which assignment did you use the "local" option, and for which the "yarn" option? (You need to pick one assignment and show that you used a different option to submit the job.)
  • Given an RDD in the format ((timestamp, city), (precipitation, temperature)), what is the key? How can you reduce the data (i.e., remove timestamp from the key) to keep only the maximum temperature per city while retaining the associated precipitation value? For the second question, you should describe how to transform and operate on the data (e.g., how to write the necessary lambda functions).



1 Note: In many countries, averages are calculated as described above. However, in Sweden, daily and monthly averages are calculated using Ekholm-Modén's formula, which in addition to the minimum and maximum daily temperature also takes into account readings at specific time points, the month, and the longitude of the station. For more information (in Swedish): https://www.smhi.se/kunskapsbanken/klimat/normaler/hur-beraknas-medeltemperatur


Page responsible: Huanyu Li
Last updated: 2026-04-08