732A54 and TDDE31 Big Data Analytics
BDA Lab Introduction
In the BDA labs, you will work on the Sigma environment, an HPC cluster from the National Supercomputer Centre (NSC). There are three lab topics: BDA1-Spark, BDA2-Spark SQL, and BDA3-Machine Learning with Spark. For each topic, there is a set of questions that you need to solve using data from the Swedish Meteorological and Hydrological Institute (SMHI).
Sigma connection
There are two ways to connect to Sigma: (1) Thinlinc connection, or (2) VS Code with Remote-SSH.
Option 1: Thinlinc connection
- If you have already logged in to a computer in one of the SU rooms, run one of the following commands to load the course module:
$ module load courses/732A54
or
$ module load courses/TDDE31
then run:
$ tlclient
If you are using your own computer, you can download Thinlinc directly.
- A graphical user interface will appear. Enter the server address (sigma.nsc.liu.se), your account name, and password. You will be asked for a verification code from your authenticator app.
- If your login is successful, you will see the graphical user interface of Sigma. Open a terminal window to prepare a working folder, copy the demo code, and check the files as shown in the steps below.
- Check what exists in the default folder and create a new folder to use as your working folder for the labs.
[x_antli@sigma ~]$ ls
[x_antli@sigma ~]$ mkdir BDA-lab
- We provide some demo code. Copy it to your newly created working folder and run the demo code to see how to submit jobs to the cluster. Run the following commands one by one. Note: the content after $ is the command to run.
[x_antli@sigma ~]$ cp -r /software/sse2/tetralith_el9/manual/spark/course-examples/BDA_demo/ ./BDA-lab/
[x_antli@sigma ~]$ cd BDA-lab
[x_antli@sigma BDA-lab]$ cd BDA_demo
[x_antli@sigma BDA_demo]$ ls -l
You will see the folder contents, including the Python scripts (BDA1_demo.py, BDA2_demo.py), a folder containing the lab data (input_data), and several job submission scripts with the extension *.q.
total 4
-rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA1_demo.py
-rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA2_demo.py
drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 input_data
-rwxrwxr-x 1 x_antli x_antli 1352 Mar 31 20:03 run_local.q
-rwxrwxr-x 1 x_antli x_antli 2578 Mar 31 20:03 run_local_with_historyserver.q
-rwxrwxr-x 1 x_antli x_antli 1460 Mar 31 20:03 run_yarn.q
-rwxrwxr-x 1 x_antli x_antli 2565 Mar 31 20:03 run_yarn_with_historyserver.q
- Check what is inside the input data folder.
[x_antli@sigma BDA_demo]$ cd input_data
[x_antli@sigma input_data]$ ls -l
total 2742401
-rw-rw-r-- 1 x_antli x_antli 661060231 Mar 31 20:03 precipitation-readings.csv
-rw-rw-r-- 1 x_antli x_antli 2858 Mar 31 20:03 stations-Ostergotland.csv
-rw-rw-r-- 1 x_antli x_antli 67699 Mar 31 20:03 stations.csv
-rw-rw-r-- 1 x_antli x_antli 28606 Mar 31 20:03 temperature-readings-small.csv
-rw-rw-r-- 1 x_antli x_antli 2146781235 Mar 31 20:03 temperature-readings.csv
You can also use the tail command to preview a CSV file's contents. A more detailed description of the data is given in Data Description.
[x_antli@sigma input_data]$ tail precipitation-readings.csv
99280;2016-06-30;21:00:00;0.0;G
99280;2016-06-30;22:00:00;0.0;G
99280;2016-06-30;23:00:00;0.0;G
Option 2: VS Code with Remote-SSH
You can find the instructions for connecting to the NSC server via VS Code here. The key steps are:
- Install the Remote-SSH extension in VS Code.
- Add a new SSH host: sigma.nsc.liu.se, using your Sigma account credentials.
- Once connected, copy the demo code folder to your own folder at Sigma and edit your scripts via VS Code.
Run (demo) code
In the BDA labs, you need to submit a PySpark job to the cluster so that your PySpark code runs in the distributed environment using Hadoop, HDFS, and Spark.
This is done using the spark-submit command, which takes several input arguments. Think of spark-submit as the PySpark counterpart of python3: it is the command you use to run your PySpark script from a terminal.
An example of using spark-submit is as follows:
spark-submit --deploy-mode cluster --master yarn --num-executors 9 --driver-memory 2g --executor-memory 2g --executor-cores 4 CODE.py
This example specifies that the job will run on the cluster, using yarn as the cluster manager, along with resource allocation options including the number of executors, driver memory, executor memory, and executor cores.
Based on the settings of NSC and Sigma, we use a non-interactive way to submit jobs on Sigma. This means that instead of running commands live, you write all instructions into a script file and submit it to the queue. Specifically, we use the sbatch command to submit a job by specifying the project, resource reservation, and one of the .q scripts (which already contains the spark-submit command).
- Submit a job using sbatch:
[x_antli@sigma BDA_demo]$ sbatch -A liu-compute-2026-6 --reservation devel run_yarn_with_historyserver.q
Running this command will submit a job and return a job ID:
Submitted batch job 4111139
- Use the squeue command to check the status of your submitted jobs:
[x_antli@sigma BDA_demo]$ squeue -u x_antli
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4111139 sigma run_yarn x_antli CG 00:30 2 n[1235-1236]
- Once the job is finished, check the folder. You will see a log file named slurm-4111139.out, which contains the final output if the job succeeds, or an error message if the job fails (for example, if your code contains syntax errors). Two additional folders are generated: spark and output. The spark folder contains detailed logs. The output folder is a copy of the HDFS output, as specified in run_yarn_with_historyserver.q. This output folder usually contains results split across multiple partition files, if the PySpark code saves results as text files.
[x_antli@sigma BDA_demo]$ ls
total 4
-rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA1_demo.py
-rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 BDA2_demo.py
drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 input_data
-rwxrwxr-x 1 x_antli x_antli 1352 Mar 31 20:03 run_local.q
-rwxrwxr-x 1 x_antli x_antli 2578 Mar 31 20:03 run_local_with_historyserver.q
-rwxrwxr-x 1 x_antli x_antli 1460 Mar 31 20:03 run_yarn.q
-rwxrwxr-x 1 x_antli x_antli 2565 Mar 31 20:03 run_yarn_with_historyserver.q
drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 output
drwxrwxr-x 2 x_antli x_antli 4096 Mar 31 20:03 spark
-rw-rw-r-- 1 x_antli x_antli 820 Mar 31 20:03 slurm-4111139.out
[x_antli@sigma BDA_demo]$ cd output
[x_antli@sigma output]$ ls
_SUCCESS  part-00000  part-00001
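If you want to inspect the copied output on the Sigma file system with Python, the partition files can be stitched back together. This is our own convenience sketch, not part of the demo code; the folder name output follows the listing above:

```python
import glob

def read_spark_output(folder):
    """Concatenate text output that Spark split across part-00000, part-00001, ...
    The _SUCCESS marker file is empty and is skipped by the part-* pattern."""
    lines = []
    for path in sorted(glob.glob(f"{folder}/part-*")):
        with open(path) as f:
            lines.extend(f.read().splitlines())
    return lines
```

Sorting the paths preserves the partition order in which Spark wrote the results.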
- The --reservation devel option specifies the default reservation (devel) for your job. During lab sessions, dedicated resources are reserved for you. Replace devel with the specific reservation name for a lab session (e.g., liu-bda-2026-04-14). Run the listreservations command to see the available reservation names, which appear in the first column.
[x_antli@sigma BDA_demo]$ listreservations
Reservations available to user:x_antli / project(s):liu-compute-2026-6
devel from 2026-04-03T08:00:00 to 2026-04-03T22:00:00 ALL USERS
liu-bda-2026-04-14 from 2026-04-14T15:15:00 to 2026-04-14T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-16 from 2026-04-16T08:15:00 to 2026-04-16T10:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-17 from 2026-04-17T15:15:00 to 2026-04-17T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-21 from 2026-04-21T13:15:00 to 2026-04-21T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-23 from 2026-04-23T08:15:00 to 2026-04-23T10:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-24 from 2026-04-24T15:15:00 to 2026-04-24T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-28 from 2026-04-28T13:15:00 to 2026-04-28T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-04-30 from 2026-04-30T08:15:00 to 2026-04-30T10:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-04 from 2026-05-04T10:15:00 to 2026-05-04T12:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-05 from 2026-05-05T13:15:00 to 2026-05-05T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-07 from 2026-05-07T08:15:00 to 2026-05-07T10:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-12 from 2026-05-12T13:15:00 to 2026-05-12T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-18 from 2026-05-18T10:15:00 to 2026-05-18T12:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-19 from 2026-05-19T13:15:00 to 2026-05-19T17:15:00 (project:liu-compute-2026-6)
liu-bda-2026-05-22 from 2026-05-22T15:15:00 to 2026-05-22T17:15:00 (project:liu-compute-2026-6)
- Important: Please avoid submitting multiple jobs at the same time. A job usually takes one to several minutes to run. You can cancel a job using the scancel command by specifying the job ID as shown below (replace JOBID with the actual ID shown by squeue -u USERNAME):
[x_antli@sigma BDA_demo]$ scancel JOBID
Data Description
The data includes air temperature and precipitation readings from 812 stations in Sweden.
The stations include both currently active stations and historical stations that have been closed down.
The latest readings available for active stations are from October 10, 2016.
The air temperature and precipitation records are hourly readings (this is important to keep in mind, since you will need to perform aggregations for some exercises).
However, some stations provide only one reading every three hours.
The provided files (data) are CSV files with headers removed.
Values are separated by semicolons. Some files are too large to open in a standard text editor.
Use bash commands such as tail and more to preview a file's contents. The provided files are:
- temperature-readings.csv, approx. 2 GB
- temperature-readings-small.csv, use this file to test your code locally. For final submissions, run your code using the full temperature readings file.
- precipitation-readings.csv, approx. 660 MB
- stations.csv
- stations-Ostergotland.csv
The columns in temperature-readings-small.csv and temperature-readings.csv are:

| Station number | Date | Time | Air temperature (in °C) | Quality |
|---|---|---|---|---|

The columns in precipitation-readings.csv are:

| Station number | Date | Time | Precipitation (in mm) | Quality |
|---|---|---|---|---|

Quality values: G = controlled and confirmed values; Y = suspected or aggregated values.

The columns in stations.csv and stations-Ostergotland.csv are:

| Station number | Station name | Measurement height | Latitude | Longitude | Readings from (date and time) | Readings to (date and time) | Elevation |
|---|---|---|---|---|---|---|---|
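Since the files have no header row, your code typically parses each semicolon-separated line itself. As an illustrative sketch (the helper name is ours, not from the lab materials), in plain Python:

```python
# Sketch: parsing one semicolon-separated precipitation reading (no header row).
# Field order follows the precipitation table above; the helper name is ours.
def parse_precipitation_line(line):
    station, date, time, precipitation, quality = line.strip().split(";")
    return station, date, time, float(precipitation), quality

# Example line taken from the `tail` output shown earlier:
row = parse_precipitation_line("99280;2016-06-30;23:00:00;0.0;G")
```

The same pattern works for the temperature files, whose lines have the same five-field layout with air temperature in place of precipitation.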
Scripts for submitting jobs
In the demo code folder, there are four scripts for submitting jobs. Here is an explanation of their differences:
- The history server option saves all event logs to the spark folder, allowing you to review them after a job finishes. For the BDA labs, you can choose whether or not to use the history server option.
- run_local*.q and run_yarn*.q differ in how the job is deployed. The local option runs the job in client mode, without HDFS, using the Sigma network file system instead. The yarn option runs the job in cluster mode, using HDFS.
- For the local option, use a path on the Sigma file system, for example:
temperature_file = sc.textFile("file:///home/x_antli/Desktop/2026/BDA_demo/input_data/temperature-readings-small.csv")
where /home/x_antli/ is your default home folder on Sigma.
- For the yarn option, use a path on HDFS, for example:
temperature_file = sc.textFile("hdfs:///user/x_antli/BDA/input/temperature-readings-small.csv")
where /user/x_antli/ is the default HDFS home folder. Subfolders such as /BDA/input/ are created using HDFS commands included in the run_yarn*.q scripts. The scripts contain more HDFS commands, which are based on the following:
- hadoop fs -mkdir <FOLDER_NAME>: make a folder on HDFS
- hadoop fs -mkdir -p <FOLDER_NAME> <FOLDER_NAME>: make multiple folders, creating parent directories as needed
- hadoop fs -test -d <FOLDER_NAME>: return 0 if the path is a directory
- hadoop fs -rm -r <FOLDER_NAME>: delete the directory and any content under it recursively
- hadoop fs -cat <PATH_ON_HDFS>: copy the contents of an HDFS file to stdout
- hadoop fs -copyFromLocal <localsrc> ... <dst>: copy one or more files from local Sigma to HDFS
- hadoop fs -copyToLocal <src> ... <localdst>: copy one or more files from HDFS to local Sigma
You are encouraged to try more than one run_yarn*.q file (for instance, pick one question and show that you used a different option to submit the job).
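Putting the pieces together, here is a minimal sketch of what a PySpark driver script could look like. This is our own illustration, not the provided BDA1_demo.py; the max-temperature task and helper names are assumptions:

```python
# Illustrative PySpark driver sketch (not the provided demo code).
# The parsing helper is plain Python so it can be tested without a cluster.

def parse_line(line):
    """Split one semicolon-separated reading into a (station, temperature) pair."""
    fields = line.split(";")          # Station number; Date; Time; Temperature; Quality
    return fields[0], float(fields[3])

def max_temperature_per_station(sc, input_path, output_path):
    """Find each station's maximum temperature; runs on the cluster."""
    readings = sc.textFile(input_path)   # file:///... for local mode, hdfs:///... for yarn
    pairs = readings.map(parse_line)     # (station, temperature) pairs
    maxima = pairs.reduceByKey(max)      # keep the larger temperature per station
    maxima.saveAsTextFile(output_path)   # written as part-00000, part-00001, ...

# When submitted via one of the .q scripts, the driver would do roughly:
#   from pyspark import SparkContext
#   sc = SparkContext(appName="max temperature sketch")
#   max_temperature_per_station(sc,
#       "hdfs:///user/x_antli/BDA/input/temperature-readings-small.csv",
#       "hdfs:///user/x_antli/BDA/output")
#   sc.stop()
```

The Spark-specific calls are kept inside one function so that the line-parsing logic can be checked locally before you submit a job.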
Write your own code
Now you are ready to write your own code to solve the exercises in BDA1.
If you have used the Thinlinc option to connect to Sigma, you can use VS Code to write your code by running the following commands in a terminal:
[x_antli@sigma BDA_demo]$ module load VSCode/latest-bdist
[x_antli@sigma BDA_demo]$ code &
For the time being, if the above commands do not work, use the following command as an alternative to start VS Code:
[x_antli@sigma BDA_demo]$ /proj/liu-compute-2026-6/shared/VSCode-linux-x64/code --no-sandbox &
Note: When checking the results of your PySpark job, open the SLURM log file generated after your job completes.
To locate the final output quickly, search for the keyword "FINAL OUTPUT". This keyword appears because it is explicitly echoed in the .q submission script.
Additionally, searching for the name of your script (e.g., BDA1_demo.py) can help you identify any errors that were reported during execution, such as syntax errors or runtime exceptions, which are typically printed alongside the script name in the log output.
Some common errors:
- HDFS file paths stated in the Python code are not prepared correctly inside the run_yarn*.q files.
- Functions (e.g., RDD operations) are called on the wrong RDD variables.
- A lambda function's input arguments do not match the RDD's key-value pairs.
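To illustrate the last point with our own example (not from the lab materials): a lambda written for bare values will fail when the RDD actually holds (key, value) tuples. The mistake can be reproduced with plain Python lists standing in for RDDs:

```python
# Reproducing the lambda/argument mismatch with plain Python lists standing in
# for RDDs (illustrative; RDD map behaves the same way for this mistake).
pairs = [("102170", 14.5), ("99280", 7.0)]  # (station, temperature) tuples

# Wrong: treats each element as a bare number, but each element is a tuple,
# so this raises a TypeError when evaluated:
#   list(map(lambda t: t + 1, pairs))

# Right: the lambda must index (or unpack) the (key, value) structure.
shifted = list(map(lambda kv: (kv[0], kv[1] + 1), pairs))
```

When debugging, print one element of the RDD (e.g., with take(1)) to confirm whether your lambda receives bare values or tuples.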
Page responsible: Huanyu Li
Last updated: 2026-04-14
