Software and Systems Research Seminar Series
The SaS Seminars are a permanent series of open seminars of the Division of Software and Systems (SaS) at the Department of Computer and Information Science (IDA), Linköping University. The objective of the seminars is to present outstanding research and ideas or problems relevant to SaS's present and future activities. In particular, the seminars cover the SaS research areas: software engineering, programming models and environments, software and system modeling and simulation, system software, embedded SW/HW systems, computer systems engineering, parallel and distributed computing, real-time systems, system dependability, and software and system verification and testing.
Two kinds of seminars are planned:
talks by invited speakers not affiliated with SaS,
internal seminars presenting lab research to the whole of SaS (and other interested colleagues).
The speakers are expected to give a broad perspective on the presented research, addressing an audience with a general computer science background but possibly no specific knowledge of the research domain. The normal length of a presentation is 60 minutes, including discussion.
The SaS seminars are coordinated by Christoph Kessler.
SaS seminars 2024
Machine Learning for Anomaly Detection in Edge Clouds
Javad Forough, Umeå University, Sweden
Tuesday, 27 Feb. 2024, 15:15, room John von Neumann, IDA
Edge clouds have emerged as a crucial architectural paradigm, revolutionizing data processing and analysis by decentralizing computational capabilities closer to data sources and end-users at the edge of the network. Anomaly detection is a vital task in these environments, ensuring the reliability and security of edge-based systems, particularly in critical applications like autonomous vehicles and healthcare. However, integrating anomaly detection into edge clouds presents several challenges, including resource limitations, scarcity of labeled data specific to edge environments, and the need for precise anomaly detection algorithms. This talk explores how machine learning techniques, including transfer learning, knowledge distillation, reinforcement learning, deep sequential models, and deep ensemble learning, enhance anomaly detection in edge clouds.
Javad Forough is a WASP Academic PhD Student at Umeå University, with expertise in anomaly detection for edge clouds. He has also collaborated as a visiting researcher at Imperial College London. Javad's research is centered on elevating the reliability and security of edge-based systems. His work is dedicated to addressing challenges related to resource limitations, data scarcity, and the development of precise anomaly detection algorithms within edge cloud environments.
On Inter-dataset Code Duplication and Data Leakage in Large Language Models
Dr. Jose Antonio Hernandez Lopez, PELAB, IDA, Linköping University
Thursday, 22 Feb. 2024, 10:15, room Alan Turing, IDA
Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks, such as code summarization, code translation, and code search. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase.
Data leakage, i.e., using information from the test set during model training, is a well-known issue in the training of machine learning models. One manifestation of this issue is an overlap between the training and testing splits. While intra-dataset code duplication examines this overlap within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations, because fine-tuning test samples that were already encountered during pre-training would be included, resulting in inflated performance metrics.
This work explores the phenomenon of inter-dataset code duplication and its impact on evaluating LLMs across diverse SE tasks. We conduct an empirical study using the CodeSearchNet dataset (CSN), a widely adopted pre-training dataset, and five fine-tuning datasets used for various SE tasks. We first identify the intersection between the pre-training and fine-tuning datasets using a deduplication process. Then, we fine-tune four models pre-trained on CSN (CodeT5, CodeBERT, GraphCodeBERT, and UniXcoder) and compare their performance on samples encountered during pre-training with their performance on samples unseen during that phase.
Our findings reveal a potential threat to the evaluation of various LLMs across multiple SE tasks, stemming from the inter-dataset code duplication phenomenon. Moreover, we demonstrate that this threat is accentuated by factors like the LLM's size and the chosen fine-tuning technique. Based on our findings, we delve into prior research that may be susceptible to this threat. Additionally, we offer guidance to SE researchers on strategies to prevent inter-dataset code duplication.
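As a rough illustration of the overlap check mentioned in the abstract, the following Python sketch splits a fine-tuning test set into samples that also occur in a pre-training corpus and samples that do not. The abstract does not specify the deduplication process used in the study, so this is an assumption: it uses exact matching after whitespace normalization, which is a simplification of near-duplicate detection. The function names and the toy snippets are hypothetical and for illustration only.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse all whitespace runs so formatting differences do not matter."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash of the normalized snippet, used as a cheap equality key."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def split_seen_unseen(pretrain_snippets, test_snippets):
    """Partition test snippets by whether their fingerprint occurs in the pre-training set."""
    pretrain_keys = {fingerprint(s) for s in pretrain_snippets}
    seen, unseen = [], []
    for snippet in test_snippets:
        if fingerprint(snippet) in pretrain_keys:
            seen.append(snippet)
        else:
            unseen.append(snippet)
    return seen, unseen


# Toy data (hypothetical): the first test snippet differs from a
# pre-training snippet only in whitespace, the second is new.
pretrain = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
test = ["def add(a, b):  return a + b", "def mul(a, b):\n    return a * b"]

seen, unseen = split_seen_unseen(pretrain, test)
print(len(seen), len(unseen))
```

Reporting evaluation metrics separately on the two partitions is one way to see whether scores on already-seen samples are inflated relative to unseen ones.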
Jose Antonio Hernandez Lopez is a WASP postdoctoral researcher at Linköping University under the supervision of Daniel Varro at PELAB, IDA. He holds a PhD from the University of Murcia, Spain, specializing in the application of machine learning to model-driven engineering. During his PhD, he received several awards at top modeling conferences, including the Best Foundation Paper Award at MODELS23 and the Distinguished Paper Award at MODELS21. His current research focuses on large language models for code. He has also published in leading software engineering venues such as the ASE conference and the TSE journal.
Previous SaS Seminars
For previous SaS seminars in 2001 - 2023 see below.
Page responsible: Christoph Kessler
Last updated: 2024-02-16