Semantic Parsing for Text Analytics
Research project funded by the Centre for Industrial Information Technology (CENIIT), project no. 15.02
Much of what we know and think is expressed as written text, and more and more of it is available in digital form: on personal computers, on corporate networks, and on the Internet. This project will develop new techniques for transforming textual data into structured information, and eventually, into actionable intelligence. The process, which involves information retrieval, natural language processing, and machine learning, is known as text analytics.
Text analytics is an interdisciplinary and international field of research with a large number of industrial applications. In business intelligence, it helps to make optimal use of enterprise data, much of which (such as technical documentation and corporate email) only exists in unstructured or semi-structured form. In product management and marketing, it helps monitor customer satisfaction, analyse markets, and identify trends, increasingly through the analysis of social media. Other applications include knowledge discovery in clinical text, patents, and legal documents. All of these applications are of significant commercial interest: a recent report (MarketsAndMarkets, 2014) forecasts the global text analytics market to grow from $1.64 billion in 2014 to $4.90 billion by 2019.
One of the key component technologies in text analytics is semantic parsing, the automatic mapping of a sentence into a formal representation of its meaning. This project will initially focus on meaning representations in the form of dependency graphs; an example is shown below (Oepen et al., 2014; DM #20021001, simplified).
Dependency graphs spell out information about the core semantic content of the analysed sentence, the ‘who did what to whom’. For instance, in the example graph the verb halve is identified as a semantic predicate with two arguments: banks (the answer to the question ‘who will halve?’) and debt (the answer to the question ‘what will be halved?’). Predicate–argument information such as this has been shown to be useful for many downstream applications, including information extraction, question answering, and machine translation. The goal of this project is to develop parsers that can extract dependency graphs from large volumes of text with high accuracy and speed.
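As a minimal sketch of what such predicate–argument information looks like as data, the arcs for the verb halve can be written as labelled triples. The type name `Arc`, the helper `arguments_of`, and the role labels `ARG1` and `ARG2` are illustrative assumptions, not part of the project's software.

```python
from collections import namedtuple

# A labelled arc from a semantic predicate to one of its arguments.
# The field names and the labels "ARG1"/"ARG2" are illustrative assumptions.
Arc = namedtuple("Arc", ["predicate", "label", "argument"])

arcs = [
    Arc("halve", "ARG1", "banks"),  # who will halve?
    Arc("halve", "ARG2", "debt"),   # what will be halved?
]

def arguments_of(predicate, arcs):
    """Return a dict mapping each role label to the argument of `predicate`."""
    return {a.label: a.argument for a in arcs if a.predicate == predicate}

print(arguments_of("halve", arcs))
```

A downstream application such as information extraction could read the 'who did what to whom' directly off this mapping.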
Parsing to dependency graphs has been a very active research area in the past few years. In spite of this, several important research questions remain unresolved. This project will specifically address the following:
- How can we develop semantic parsers that strike a good balance between accuracy and efficiency?
- How can we learn semantic parsers from large amounts of data without human intervention, and how can we handle the dynamicity of that data?
- How can we integrate semantic parsers into systems with existing forms of structured data, such as relational databases or models of software systems?
These questions define the main areas of activity for this project:
Most algorithmic techniques for dependency parsing are restricted to tree-shaped graphs in which a word can be the argument of at most one semantic predicate. Thus they cannot be used to obtain graphs such as the one shown above, where banks and debt are linked to three predicates. The limitation to tree-shaped graphs means that current parsers are necessarily inaccurate, and that important information is lost. This project will extend dependency parsing to cases where the target structures are not necessarily tree-shaped.
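The restriction described here can be sketched as a simple check on the arcs of a graph. The code below only tests the condition named in the text (each word is the argument of at most one predicate); it does not test connectedness or acyclicity, which a full tree check would also need. The function names and the placeholder predicates `predicate_2` and `predicate_3` are hypothetical; only halve, banks, and debt come from the example.

```python
from collections import defaultdict

def shared_arguments(arcs):
    """Map each word with more than one predicate to the sorted list of those predicates."""
    predicates = defaultdict(set)
    for predicate, argument in arcs:
        predicates[argument].add(predicate)
    return {w: sorted(p) for w, p in predicates.items() if len(p) > 1}

def satisfies_single_head_constraint(arcs):
    """True if no word is the argument of more than one predicate."""
    return not shared_arguments(arcs)

# "predicate_2" and "predicate_3" are placeholders for the other two predicates.
graph = [
    ("halve", "banks"), ("halve", "debt"),
    ("predicate_2", "banks"), ("predicate_2", "debt"),
    ("predicate_3", "banks"), ("predicate_3", "debt"),
]
tree = [("halve", "banks"), ("halve", "debt")]

print(shared_arguments(graph))
print(satisfies_single_head_constraint(tree))
```

A parser limited to tree-shaped output can keep only one of the three arcs into banks and one of the three into debt, which is the information loss the paragraph refers to.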
Unfortunately, it is well known that many computational problems on general graphs, including parsing, cannot be solved efficiently. However, this project can build on the observation, obtained in preliminary work (Kuhlmann, 2014), that the dependency graphs that are relevant in the context of semantic parsing are nearly tree-shaped, differing only minimally from the closest tree approximation. The project will be the first to develop efficient parsing algorithms that exploit this property.
Text analytics makes heavy use of machine learning techniques. Most of these techniques rely on human experts to specify which features of the data are relevant for the intended analysis, and to produce training data that is then used to score the actual impact of each feature. A major limitation of this approach is that it requires extensive and expensive human intervention whenever the data changes, such as when targeting a new type of text or a new language. This is a serious problem in many practical applications, where textual data is of great variety and becomes available in large volumes and at high speed.
Recent advances in computer science and machine learning have led to the development of new machine learning techniques that are able to learn the relevant features themselves; these techniques are known as feature learning or deep learning.
This project will be the first to apply these techniques in the context of semantic parsing to dependency graphs. The goal is to provide parsers that can be learned from data without extensive human intervention.
Semantic representations in the form of dependency graphs have been motivated primarily on linguistic grounds. While they have been shown to be useful for many applications, there is a lack of principled work on how they can be integrated with other forms of structured information that often exist in industrial contexts, such as the concepts stored in a relational database or the relations between software components as formalized in a UML diagram. This integration is essential for turning textual data into actionable intelligence.
This project aims to develop a general framework for the integration of dependency graphs and other forms of structured information, based on the view that the integration problem is essentially a problem of mapping between various graph-shaped representations. In attacking this problem, the project will build on concepts and techniques from graph transformation (Rozenberg, 1997), a long-established sub-field of theoretical computer science which has recently attracted significant interest in text analytics. The integration framework will be developed in close collaboration with the project's industrial partners, which will help ensure that it is able to provide support for practically relevant problems.
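To make the idea of mapping between representations concrete, the sketch below stores the arcs of a dependency graph as rows in a relational table and recovers a 'who did what' pair with a join. It is an illustration only and does not describe the project's integration framework: the table `arc`, its columns, the role labels, and the helper `who_did_what` are all assumptions. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3

# Arcs of a dependency graph as (predicate, label, argument) triples.
# Role labels and the table layout below are illustrative assumptions.
arcs = [("halve", "ARG1", "banks"), ("halve", "ARG2", "debt")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arc (predicate TEXT, label TEXT, argument TEXT)")
conn.executemany("INSERT INTO arc VALUES (?, ?, ?)", arcs)

def who_did_what(conn, predicate):
    """Join the two argument arcs of `predicate` into one (agent, object) row."""
    query = """
        SELECT a1.argument, a2.argument
        FROM arc AS a1 JOIN arc AS a2 ON a1.predicate = a2.predicate
        WHERE a1.predicate = ? AND a1.label = 'ARG1' AND a2.label = 'ARG2'
    """
    return conn.execute(query, (predicate,)).fetchall()

print(who_did_what(conn, "halve"))
```

Once the arcs are rows in a table, they can be joined against existing enterprise tables in the same way, which is one simple reading of integration as a mapping between graph-shaped representations.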
In the first three years, the project will develop and implement new algorithms and machine learning techniques for semantic parsing, and evaluate these techniques in industrial applications. The scientific results of the project will be published in high-quality journals and conference proceedings, initially in the areas of natural language processing and machine learning. Implementations will be evaluated in the context of in-house applications at the industrial partners, with the aim of extending them into fully functional software components that can be integrated into existing systems. There will be regular meetings with the industrial partners to align the research with actual needs and practice. The project will also be used as an incubator for new research ideas; these will be turned into funding proposals for follow-up work. At an organizational level, the plan is to start building a research group around the project, initially consisting of the PI and a doctoral student.
The project is being carried out at the Department of Computer and Information Science (IDA). The department has a long-standing research group on natural language processing and is currently building up a new, interdisciplinary group on machine learning. The present project strengthens the competence of both of these groups and increases their international visibility.
- Marco Kuhlmann, project leader
- Riley Capshaw, PhD student; associated
- Jenny Kunz, PhD student
- Robin Kurtz, PhD student; associated
 L. Jonsson. Increasing anomaly handling efficiency in large organizations using applied machine learning. In Proceedings of the 35th International Conference on Software Engineering (ICSE) Doctoral Symposium, pages 1361–1364, San Francisco, USA, 2013.
 M. Kuhlmann. Cubic-time graph parsing with a simple scoring scheme. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 395–399, Dublin, Republic of Ireland, 2014.
 MarketsAndMarkets. Text analytics market by applications (enterprise, web-based and software, data analysis, search-based, others), users (SMBs, enterprises) and deployment model (cloud, on-premise) - market forecasts and analysis (2014–2019). Market Report TC 2391, April 2014.
 S. Oepen, M. Kuhlmann, Y. Miyao, D. Zeman, D. Flickinger, J. Hajič, A. Ivanova, and Y. Zhang. SemEval 2014 Task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72, Dublin, Republic of Ireland, 2014.
 G. Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph Transformation. World Scientific, Singapore, 1997.
[P18] Robin Kurtz, Stephan Oepen, and Marco Kuhlmann.
End-to-End Negation Resolution as Graph Parsing.
In Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies, pages 14–24, Online, 2020.
[P17] Riley Capshaw, Marco Kuhlmann, and Eva Blomqvist.
Probing a Semantic Dependency Parser for Translational Relation Embeddings.
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs, 2020.
[P16] Fredrik Sand Aronsson, Marco Kuhlmann, Vesna Jelić, and Per Östberg.
Is Cognitive Impairment Associated with Reduced Syntactic Complexity in Writing? Evidence from Automated Text Analysis.
[P15] Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, and Zdeňka Urešová.
MRP 2019: Cross-Framework Meaning Representation Parsing.
In Proceedings of the CoNLL 2019 Shared Task: Cross-Framework Meaning Representation Parsing, pages 1–27, Hong Kong, China, 2019.
[P14] Marco Kuhlmann, Andreas Maletti, and Lena Katharina Schiffer.
The Tree-Generative Capacity of Combinatory Categorial Grammars.
In Proceedings of the IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 44:1–44:14, Mumbai, India, 2019.
[P13] Robin Kurtz and Marco Kuhlmann.
The Interplay Between Loss Functions and Structural Constraints in Dependency Parsing.
Northern European Journal of Language Technology, 6:43–66, 2019.
[P12] Robin Kurtz, Daniel Roxbo, and Marco Kuhlmann.
Improving Semantic Dependency Parsing with Syntactic Features.
In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 12–21, Turku, Finland, 2019.
[P11] Marco Kuhlmann, Giorgio Satta, and Peter Jonsson.
On the Complexity of CCG Parsing.
Computational Linguistics, 44(3):447–482, 2018.
[P10] Robin Kurtz and Marco Kuhlmann.
Exploiting Structure in Parsing to 1-Endpoint-Crossing Graphs.
In Proceedings of the 15th International Conference on Parsing Technologies (IWPT), pages 78–87, Pisa, Italy, 2017.
[P09] Marco Kuhlmann, Giorgio Satta, and Peter Jonsson.
On the Complexity of CCG Parsing.
CoRR, abs/1702.06594, 2017.
[P08] Per Fallgren, Jesper Segeblad, and Marco Kuhlmann.
Towards a Standard Dataset of Swedish Word Vectors.
In Proceedings of the Sixth Swedish Language Technology Conference (SLTC), Umeå, Sweden, 2016.
[P07] Marco Kuhlmann and Stephan Oepen.
Towards a Catalogue of Linguistic Graph Banks.
Computational Linguistics, 42(4):819–827, 2016.
[P06] Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Zdeňka Urešová.
Towards Comparability of Linguistic Graph Banks for Semantic Parsing.
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), pages 3991–3995, Portorož, Slovenia, 2016.
[P05] Marco Kuhlmann and Peter Jonsson.
Parsing to Noncrossing Dependency Graphs.
Transactions of the Association for Computational Linguistics, 3(Nov):559–570, 2015.
[P05] Frank Drewes, Kevin Knight, and Marco Kuhlmann.
Formal Models of Graph Transformation in Natural Language Processing (Dagstuhl Seminar 15122).
Dagstuhl Reports, 5(3):143–161, 2015.
[P04] Marco Kuhlmann, Alexander Koller, and Giorgio Satta.
Lexicalization and Generative Power in CCG.
Computational Linguistics, 41(2):187–219, 2015.
[P03] Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová.
SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing.
In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915–926, Denver, CO, USA, 2015.
[P02] Peter Jonsson and Marco Kuhlmann.
Maximum Pagenumber-k Subgraph Is NP-Complete.
CoRR, abs/1504.05908, 2015.
[P01] Marco Kuhlmann.
Tabulation of Noncrossing Acyclic Digraphs.
CoRR, abs/1504.04993, 2015.
[T35] Marc Pàmies Massip.
Multilingual Identification of Offensive Content in Social Media.
MSc in Computer Science, 2020. External project at Helsinki University.
[T34] Alexander Häger.
Contextualizing Music Recommendations.
MSc in Computer Science and Engineering, 2020. External project at Spotify, Boston, USA.
[T33] Min-Chun Shih.
Exploring Cross-lingual Sublanguage Classification with Multi-lingual Word Embeddings.
MSc in Statistics and Machine Learning, 2020.
[T32] Robin Ellgren.
Exploring Emerging Entities and Named Entity Disambiguation in News Articles.
MSc in Information Technology, 2020. External project at iMatrics, Linköping.
[T31] Ludvig Westerdahl.
Predicting the Financial Impact of the CEO’s Comments in Quarterly Reports.
MSc in Computer Science, 2020. External project at Redeye, Stockholm.
[T30] Jesper Hedlund and Emma Nilsson Tengstrand.
A Comparison between Different Recommender System Approaches for a Book and an Author Recommender System.
MSc in Computer Science and Engineering, 2020. External project at Storytel Sweden AB, Stockholm.
[T29] Pontus Svensson.
Automated Image Suggestions for News Articles: An Evaluation of Text and Image Representations in an Image Retrieval System.
MSc in Computer Science and Engineering, 2020. External project at Consid, Linköping.
[T28] Rebecca Lindblom.
News Value Prediction with Textual Features and Machine Learning.
MSc in Computer Science and Engineering, 2020. External project at iMatrics, Linköping.
[T27] Ludvig Noring.
Predicting Swedish News Article Popularity.
MSc in Computer Science and Engineering, 2020. External project at Schibsted Sverige AB, Stockholm.
[T26] Simon Keisala.
Using a Character-Based Language Model for Caption Generation.
MSc in Computer Science, 2020.
[T25] Harald Pettersson.
Sentiment Analysis and Transfer Learning Using Recurrent Neural Networks: An Investigation of the Power of Transfer Learning.
MSc in Computer Science and Engineering, 2019. External project at Findwise AB, Stockholm.
[T24] Milda Pocevičiūtė.
Machine Learning Framework for Automated Case Assignment of Radiology Report Requests.
MSc in Statistics and Machine Learning, 2019. External project at Sectra AB, Linköping.
[T23] Anna-Katharina Fürgut.
Mining Symptom Phrases within Free-Text Answers to Anamnesis Questionnaires.
MSc in Statistics and Machine Learning, 2019. External project at Doctrin AB, Stockholm.
[T22] Harald Grant.
Extractive Multi-Document Summarization of News Articles.
MSc in Computer Science, 2019. External project at Schibsted Sverige AB, Stockholm.
[T21] Max Lund.
Duplicate Detection and Text Classification on Simplified Technical English.
MSc in Computer Science, 2019. External project at Etteplan, Linköping.
[T20] Johannes Palm Myllylä.
Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data.
MSc in Computer Science and Engineering, 2019. External project at Fodina Language Technology AB, Linköping.
[T19] Gustav Gränsbo.
Word Clustering in an Interactive Text Analysis Tool.
MSc in Computer Science and Engineering, 2019. External project at Gavagai AB, Stockholm.
[T18] Daniel Roxbo.
A Detailed Analysis of Semantic Dependency Parsing with Deep Neural Networks.
MSc in Computer Science, 2019.
[T17] Sanne Ingvarsson.
Using Machine Learning to Learn from Bug Reports: Towards Improved Testing Efficiency.
MSc in Electrical Engineering, 2019. External project at Sectra AB, Linköping.
[T16] Sijin Cheng.
Relevance Feedback-based Optimization of Search Queries for Patents.
MSc in Computer Science, 2019. External project at IamIP Sverige AB, Sundbyberg.
[T15] Alice Reinaudo.
Hierarchical Text Classification of Fiction Books.
MSc in Computer Science, 2019. External project at Storytel Sweden AB, Stockholm.
[T14] Fredrik Öhrström.
Cluster Analysis with Meaning: Detecting Texts that Convey the Same Message.
MSc in Computer Science, 2019. External project at Etteplan, Linköping.
[T13] Jesper Bäck.
Domain Similarity Metrics for Predicting Transfer Learning Performance.
MSc in Computer Science, 2018. External project at Consid, Linköping.
[T12] Lina Gunnarsson.
Semiautomatic De-Identification of Patient Data.
MSc in Biomedical Engineering, 2018. External project at Sectra AB, Linköping.
[T11] Simon Lindblad.
Labeling Clinical Reports with Active Learning and Topic Modeling.
MSc in Computer Science, 2018. External project at Sectra AB, Linköping.
[T10] Justus Johansson Lindkvist.
Automatic De-Identification of Personally Identifiable Information.
MSc in Electrical Engineering, 2018. External project at Sectra AB, Linköping.
[T09] Riley Capshaw.
Relation Classification using Semantically-Enhanced Syntactic Dependency Paths: Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks.
MSc in Computer Science, 2018.
[T08] Francesco Cucari.
Development of an Artificial Intelligence System for Localizing Bugs in Large Industrial Software Projects.
MSc in Artificial Intelligence and Robotics, 2017. External project at Ericsson AB, Linköping.
[T07] Nils Axelsson.
Dynamic Programming Algorithms for Semantic Dependency Parsing.
MSc in Computer Science and Engineering, 2017.
[T06] Jesper Segeblad.
Putting a Spin on SPINN: Representations of Syntactic Structure in Neural Network Sentence Encoders for Natural Language Inference.
MSc in Cognitive Science, 2017.
[T05] Joakim Gylling.
Transition-Based Dependency Parsing with Neural Networks.
BSc in Software Engineering, 2017.
[T04] Benjamin Helmersson.
Definition Extraction from Swedish Technical Documentation.
BA in Cognitive Science, 2016. External project at Fodina Language Technology AB, Linköping.
[T03] Zonghan Wu.
Neural Networks for Dependency Parsing.
MSc in Statistics and Data Mining, 2016.
[T02] Martina Nyberg.
Kommunikativa funktioner hos emotikoner i svenska twitterinlägg.
BA in Cognitive Science, 2015. External project at Gavagai, Stockholm.
[T01] Sarah Hantosi Albertsson.
Textuella särdrag som kvalitet. En studie om att automatiskt mäta kvalitet i teknisk dokumentation.
BA in Cognitive Science, 2015. External project at Saab, Linköping.
Page responsible: Marco Kuhlmann
Last updated: 2020-10-09