
732A54 and TDDE31 Big Data Analytics

BDA Lab Introduction



In the BDA labs, you will work in the Sigma environment, an HPC cluster from the National Supercomputer Center (NSC). There are three lab topics: BDA1-Spark, BDA2-Spark SQL, and BDA3-Machine Learning with Spark. For each topic, you will solve a set of questions over data from the Swedish Meteorological and Hydrological Institute (SMHI).

Sigma connection

There are two ways to connect to Sigma: (1) Thinlinc connection, or (2) VS Code with Remote-SSH.

Option 1: Thinlinc connection
  1. If you have already logged in to a computer in one of the SU rooms, you can run the following command to start the Thinlinc application.
    $ module load courses/732A54
    or,
    $ module load courses/TDDE31
    then run:
    $ tlclient
    If you are using your own computer, you can download Thinlinc directly.
  2. A graphical user interface will appear. Enter the server address (sigma.nsc.liu.se), your account name, and password.
    You will be asked for a verification code from your authenticator app.
  3. If your login is successful, you will see the graphical user interface of Sigma.
    Open a terminal window to prepare a working folder, copy the demo code, and check the files as shown in the steps below.
  4. Check what exists in the default folder and create a new folder to use as your working folder for the labs.
    [x_antli@sigma ~]$ ls
    [x_antli@sigma ~]$ mkdir BDA-lab
  5. We provide some demo code. Copy it to your newly created working folder and run the demo code to see how to submit jobs to the cluster.
    Run the following commands one by one. Note: the content after $ is the command to run.
    [x_antli@sigma ~]$ cp -r /software/sse2/tetralith_el9/manual/spark/course-examples/BDA_demo/ ./BDA-lab/
    [x_antli@sigma ~]$ cd BDA-lab
    [x_antli@sigma BDA-lab]$ cd BDA_demo
    [x_antli@sigma BDA_demo]$ ls -l
  6. You will see the folder contents, including the Python scripts (BDA1_demo.py, BDA2_demo.py), a folder containing the lab data (input_data), and several job submission scripts with the extension *.q.
    total 4
    -rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA1_demo.py
    -rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA2_demo.py
    drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 input_data
    -rwxrwxr-x 1 x_antli x_antli 1352 Mar 31 20:03 run_local.q
    -rwxrwxr-x 1 x_antli x_antli 2578 Mar 31 20:03 run_local_with_historyserver.q
    -rwxrwxr-x 1 x_antli x_antli 1460 Mar 31 20:03 run_yarn.q
    -rwxrwxr-x 1 x_antli x_antli 2565 Mar 31 20:03 run_yarn_with_historyserver.q
  7. Check what is inside the input data folder.
    [x_antli@sigma BDA_demo]$ cd input_data
    [x_antli@sigma input_data]$ ls -l
    total 2742401
    -rw-rw-r-- 1 x_antli x_antli 661060231 Mar 31 20:03 precipitation-readings.csv
    -rw-rw-r-- 1 x_antli x_antli 2858 Mar 31 20:03 stations-Ostergotland.csv
    -rw-rw-r-- 1 x_antli x_antli 67699 Mar 31 20:03 stations.csv
    -rw-rw-r-- 1 x_antli x_antli 28606 Mar 31 20:03 temperature-readings-small.csv
    -rw-rw-r-- 1 x_antli x_antli 2146781235 Mar 31 20:03 temperature-readings.csv
    You can also use the tail command to preview a CSV file's contents. A more detailed description of the data is given in Data Description.
    [x_antli@sigma input_data]$ tail precipitation-readings.csv
    99280;2016-06-30;21:00:00;0.0;G
    99280;2016-06-30;22:00:00;0.0;G
    99280;2016-06-30;23:00:00;0.0;G
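If you prefer to stay in Python, the same kind of preview can be done with a short standard-library snippet (a sketch; the function name is illustrative):

```python
# Python equivalent of `tail`: return the last lines of a large text file.
# deque with maxlen streams the file line by line, so memory use stays
# bounded even for the multi-GB readings files.
from collections import deque

def tail_lines(path, n=10):
    """Return the last n lines of the file at `path`."""
    with open(path) as f:
        return list(deque(f, maxlen=n))

# e.g. tail_lines("precipitation-readings.csv", 3) from inside input_data
```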
Option 2: Remote-SSH connection via VS Code

You can find the instructions for connecting to the NSC server via VS Code here. The key steps are:

  • Install the Remote - SSH extension in VS Code.
  • Add a new SSH host: sigma.nsc.liu.se using your Sigma account credentials.
  • Once connected, copy the demo code folder to your own folder at Sigma and edit your scripts via VS Code.
A screen recording demonstrating how to connect to Sigma using VS Code and Remote-SSH

Run (demo) code

In the BDA labs, you need to submit a PySpark job to the cluster so that your PySpark code runs in the distributed environment using Hadoop, HDFS, and Spark.
This is done using the spark-submit command, which takes several input arguments. Think of spark-submit as the Spark counterpart of python3: it is the command you use to run your PySpark script from a terminal.
An example of using spark-submit is as follows:
spark-submit --deploy-mode cluster --master yarn --num-executors 9 --driver-memory 2g --executor-memory 2g --executor-cores 4 CODE.py
This example specifies that the job will run on the cluster, using yarn as the cluster manager, along with resource allocation options including the number of executors, driver memory, executor memory, and executor cores.
Because of how NSC configures Sigma, jobs are submitted non-interactively. This means that instead of running commands live, you write all instructions into a script file and submit it to the queue. Specifically, we use the sbatch command to submit a job, specifying the project, the resource reservation, and one of the .q scripts (which already contains the spark-submit command).

  1. Submit a job using sbatch:
    [x_antli@sigma BDA_demo]$ sbatch -A liu-compute-2026-6 --reservation devel run_yarn_with_historyserver.q
    Running this command will submit a job and return a job ID.
    Submitted batch job 4111139
  2. Use the squeue command to check the status of your submitted jobs:
    [x_antli@sigma BDA_demo]$ squeue -u x_antli
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    4111139 sigma run_yarn x_antli CG 00:30 2 n[1235-1236]
  3. Once the job is finished, check the folder. You will see a log file named slurm-4111139.out, which contains the final output if the job succeeds, or an error message if the job fails (for example, if your code contains syntax errors).
    Two additional folders are generated: spark and output. The spark folder contains detailed logs. The output folder is a copy of the HDFS output, as specified in run_yarn_with_historyserver.q.
    This output folder usually contains results split across multiple partition files, if the PySpark code saves results as text files.
    [x_antli@sigma BDA_demo]$ ls -l
    total 4
    -rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA1_demo.py
    -rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA2_demo.py
    drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 input_data
    -rwxrwxr-x 1 x_antli x_antli 1352 Mar 31 20:03 run_local.q
    -rwxrwxr-x 1 x_antli x_antli 2578 Mar 31 20:03 run_local_with_historyserver.q
    -rwxrwxr-x 1 x_antli x_antli 1460 Mar 31 20:03 run_yarn.q
    -rwxrwxr-x 1 x_antli x_antli 2565 Mar 31 20:03 run_yarn_with_historyserver.q
    drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 output
    drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 spark
    -rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 slurm-4111139.out
    [x_antli@sigma BDA_demo]$ cd output
    [x_antli@sigma output]$ ls
    _SUCCESS part-00000 part-00001
  4. The --reservation devel option specifies the default reservation (devel) for your job. During lab sessions, dedicated resources are reserved for you. Replace devel with the specific reservation name for a lab session (e.g., liu-bda-2026-04-14).
    Run the listreservations command to see the available reservation names. Reservation names appear in the first column (e.g., liu-bda-2026-04-14).
    [x_antli@sigma BDA_demo]$ listreservations
    Reservations available to user:x_antli / project(s):liu-compute-2026-6
    devel from 2026-04-03T08:00:00 to 2026-04-03T22:00:00 ALL USERS
    liu-bda-2026-04-14 from 2026-04-14T15:15:00 to 2026-04-14T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-16 from 2026-04-16T08:15:00 to 2026-04-16T10:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-17 from 2026-04-17T15:15:00 to 2026-04-17T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-21 from 2026-04-21T13:15:00 to 2026-04-21T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-23 from 2026-04-23T08:15:00 to 2026-04-23T10:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-24 from 2026-04-24T15:15:00 to 2026-04-24T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-28 from 2026-04-28T13:15:00 to 2026-04-28T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-04-30 from 2026-04-30T08:15:00 to 2026-04-30T10:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-04 from 2026-05-04T10:15:00 to 2026-05-04T12:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-05 from 2026-05-05T13:15:00 to 2026-05-05T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-07 from 2026-05-07T08:15:00 to 2026-05-07T10:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-12 from 2026-05-12T13:15:00 to 2026-05-12T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-18 from 2026-05-18T10:15:00 to 2026-05-18T12:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-19 from 2026-05-19T13:15:00 to 2026-05-19T17:15:00 (project:liu-compute-2026-6)
    liu-bda-2026-05-22 from 2026-05-22T15:15:00 to 2026-05-22T17:15:00 (project:liu-compute-2026-6)
  5. Important: Please avoid submitting multiple jobs at the same time. A job usually takes one to several minutes to run. You can cancel a job using the scancel command by specifying the job ID as shown below (replace JOBID with the actual ID shown by squeue -u USERNAME):
    [x_antli@sigma BDA_demo]$ scancel JOBID
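The functions you pass to RDD operations such as map and reduceByKey are plain Python, so you can sanity-check the per-record logic locally before queueing a job. A minimal dry run (parse_reading and reduce_by_key are illustrative names, not part of the demo code, assuming the semicolon-separated format described under Data Description):

```python
# Local dry run of the per-record logic a PySpark job would use.
# parse_reading and max are exactly what you would pass to rdd.map
# and rdd.reduceByKey on the cluster.

def parse_reading(line):
    """station;date;time;value;quality -> (year, temperature)"""
    station, date, time, value, quality = line.split(";")
    return (date[:4], float(value))

def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey for a local list of (key, value) pairs."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

sample = [
    "102170;2015-12-31;23:00:00;3.4;G",
    "102170;2016-06-30;12:00:00;21.5;G",
    "103100;2016-06-30;13:00:00;19.0;Y",
]
pairs = map(parse_reading, sample)      # corresponds to rdd.map(parse_reading)
print(reduce_by_key(pairs, max))        # corresponds to rdd.reduceByKey(max)
# -> {'2015': 3.4, '2016': 21.5}
```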

Data Description

The data includes air temperature and precipitation readings from 812 stations in Sweden.
The stations include both currently active stations and historical stations that have been closed down.
The latest readings available for active stations are from October 10, 2016.
The air temperature and precipitation records are hourly readings (this is important to keep in mind, since you will need to perform aggregations for some exercises).
However, some stations provide only one reading every three hours.
The provided data files are CSV files with the header rows removed.
Values are separated by semicolons. Some files are too large to open in a standard text editor; use bash commands such as tail and more to preview a file's contents. The provided files are:

  • temperature-readings.csv, approx. 2 GB
  • temperature-readings-small.csv, use this file to test your code locally; for final submissions, run your code on the full temperature readings file.
  • precipitation-readings.csv, approx. 660 MB
  • stations.csv
  • stations-Ostergotland.csv
The column headers for temperature-readings-small.csv and temperature-readings.csv are:
Station number; Date; Time; Air temperature (in °C); Quality
The column headers for precipitation-readings.csv are:
Station number; Date; Time; Precipitation (in mm); Quality

Quality values: G = controlled and confirmed values; Y = suspected or aggregated values.

The column headers for stations.csv and stations-Ostergotland.csv are:
Station number; Station name; Measurement height; Latitude; Longitude; Readings from (date and time); Readings to (date and time); Elevation
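Because the header rows are removed, you have to attach the column names yourself when parsing. A small sketch for the temperature files (the helper name and the sample line are illustrative):

```python
# Attach the documented column names to a semicolon-separated reading.
TEMPERATURE_COLUMNS = ["station", "date", "time", "temperature", "quality"]

def parse_temperature_line(line):
    """Map one line of temperature-readings*.csv to a dict keyed by column name."""
    fields = line.strip().split(";")
    record = dict(zip(TEMPERATURE_COLUMNS, fields))
    record["temperature"] = float(record["temperature"])  # numeric value in deg C
    return record

print(parse_temperature_line("102170;2016-06-30;12:00:00;21.5;G"))
# -> {'station': '102170', 'date': '2016-06-30', 'time': '12:00:00',
#     'temperature': 21.5, 'quality': 'G'}
```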

Scripts for submitting jobs

In the demo code folder, there are four scripts for submitting jobs. Here is an explanation of their differences:

  • The history server option saves all event logs to the spark folder, allowing you to review them after a job finishes. For BDA labs, you can choose whether or not to use the history server option.
  • run_local*.q and run_yarn*.q differ in how the job is deployed. The local option runs the job in client mode, without HDFS, using the Sigma network file system instead. The yarn option runs the job in cluster mode, using HDFS.
Note: You need to use the correct input and output paths depending on which option you choose.
  • For the local option, use a path on the Sigma file system, for example:
    temperature_file = sc.textFile("file:///home/x_antli/Desktop/2026/BDA_demo/input_data/temperature-readings-small.csv")
    where /home/x_antli/ is your default home folder on Sigma.
  • For the yarn option, use a path on HDFS, for example:
    temperature_file = sc.textFile("hdfs:///user/x_antli/BDA/input/temperature-readings-small.csv")
    where /user/x_antli/ is the default HDFS home folder. Subfolders such as /BDA/input/ are created using HDFS commands included in the run_yarn*.q scripts. The scripts contain more HDFS commands, which are based on the following:
    • hadoop fs -mkdir <FOLDER_NAME> , make a folder on HDFS
    • hadoop fs -mkdir -p <FOLDER_NAME> <FOLDER_NAME> , create folders together with any missing parent folders (several paths can be given at once)
    • hadoop fs -test -d <FOLDER_NAME> , if the path is a directory, return 0
    • hadoop fs -rm -r <FOLDER_NAME> , delete the directory and any content under it recursively
    • hadoop fs -cat <FILE_ON_HDFS> , print the contents of an HDFS file to stdout
    • hadoop fs -copyFromLocal <localsrc> ... <dst> , copy one or more files from local Sigma to HDFS
    • hadoop fs -copyToLocal <src> ... <localdst> , copy one or more files from HDFS to local Sigma
Important: You may use either the local or yarn option to submit your jobs. However, to pass BDA1 and BDA2, you must demonstrate that you have used both options and submit the updated run_yarn*.q file (for instance, pick one question and show that you submitted it with the other option as well).
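One way to keep a script runnable under both options is to derive the input path from a single mode switch. A sketch (the function name and folder layout are illustrative; adjust the paths to your own account and working folder):

```python
# Build the input path for either submission option from one switch.
# The folder layout follows the examples above; adjust to your own username.

def input_path(mode, filename, user="x_antli"):
    """Return a Spark-readable URI for the given file, by deployment mode."""
    if mode == "local":   # client mode: Sigma network file system
        return f"file:///home/{user}/BDA-lab/BDA_demo/input_data/{filename}"
    if mode == "yarn":    # cluster mode: HDFS
        return f"hdfs:///user/{user}/BDA/input/{filename}"
    raise ValueError(f"unknown mode: {mode}")

print(input_path("yarn", "temperature-readings-small.csv"))
# -> hdfs:///user/x_antli/BDA/input/temperature-readings-small.csv
```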

Write your own code

Now you are ready to write your own code to solve the exercises in BDA1.
If you have used the Thinlinc option to connect to Sigma, you can use VS Code to write your code by running the following commands in a terminal:
[x_antli@sigma BDA_demo]$ module load VSCode/latest-bdist
[x_antli@sigma BDA_demo]$ code &
If the above commands do not work, you can for now use the following command as an alternative to start VS Code.
[x_antli@sigma BDA_demo]$ /proj/liu-compute-2026-6/shared/VSCode-linux-x64/code --no-sandbox &
Note: When checking the results of your PySpark job, open the SLURM log file generated after the job completes. To locate the final output quickly, search for the keyword "FINAL OUTPUT", which is echoed explicitly by the .q submission script. Searching for the name of your script (e.g., BDA1_demo.py) can also help you spot errors reported during execution, such as syntax errors or runtime exceptions, which are typically printed alongside the script name in the log output.

Some common errors (to be added):

  • HDFS file paths stated in the Python code are not prepared correctly inside the run_yarn*.q files.
  • Calling functions (e.g., RDD operations) on the wrong RDD variables.
  • A lambda function's input arguments do not match the RDD's (key, value) pairs.
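The last point deserves a concrete illustration. In Python 3, a function passed to map receives one argument per record, so the (key, value) tuple of a pair RDD arrives as a single tuple. A plain-Python sketch of the mistake and the fix (the variable names are hypothetical):

```python
# Each record of a pair RDD is ONE tuple, so the function passed to
# map must accept a single argument.
pairs = [("2015", 3.4), ("2016", 21.5)]   # stand-in for a pair RDD's records

# Wrong: `lambda year, temp: temp` declares two parameters, but map hands
# over one tuple per record, so Spark raises a TypeError at runtime.

# Right: accept the tuple as one argument, then index or unpack it inside.
temps = list(map(lambda kv: kv[1], pairs))
print(temps)
# -> [3.4, 21.5]
```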


Page responsible: Huanyu Li
Last updated: 2026-04-14