
Semantic Parsing for Text Analytics

Research project funded by the Centre for Industrial Information Technology (CENIIT), project no. 15.02

Background

Much of what we know and think is expressed as written text, and more and more of it is available in digital form: on personal computers, on corporate networks, and on the Internet. This project will develop new techniques for transforming textual data into structured information, and eventually into actionable intelligence. The process, which involves information retrieval, natural language processing, and machine learning, is known as text analytics.

Text analytics is an interdisciplinary and international field of research with a large number of industrial applications. In business intelligence, it helps to make optimal use of enterprise data, a lot of which (such as technical documentation and corporate email) only exists in unstructured or semi-structured form. In product management and marketing, it helps monitor customer satisfaction, analyse markets, and identify trends, increasingly by the analysis of social media. Other applications include knowledge discovery in clinical text, patents, and legal documents. All of these applications are of significant commercial interest: A recent report [3] forecasts the global text analytics market to grow from $1.64 billion in 2014 to $4.90 billion by 2019.

Project description

One of the key component technologies in text analytics is semantic parsing, the automatic mapping of a sentence into a formal representation of its meaning. This project will initially focus on meaning representations in the form of dependency graphs; an example is shown below ([4], DM #20021001, simplified).

Dependency graphs spell out information about the core semantic content of the analysed sentence, the ‘who did what to whom’. For instance, in the example graph the verb halve is identified as a semantic predicate with two arguments: banks (the answer to the question ‘who will halve?’) and debt (the answer to the question ‘what will be halved?’). Predicate–argument information such as this has been shown to be useful for many downstream applications, including information extraction, question answering, and machine translation. The goal of this project is to develop parsers that can extract dependency graphs from large volumes of text with high accuracy and speed.
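As a minimal sketch of how such predicate–argument information might be held in a program, the Python fragment below stores a dependency graph as a list of labelled arcs and groups them by predicate. Only the two arcs for halve are taken from the text; the role labels ARG1 and ARG2 and the function name arguments_by_predicate are assumptions made for illustration.

```python
from collections import defaultdict

# Each arc is (predicate, label, argument).
# The two arcs for "halve" come from the text; the labels ARG1/ARG2
# are assumed here purely for illustration.
arcs = [
    ("halve", "ARG1", "banks"),  # who will halve?
    ("halve", "ARG2", "debt"),   # what will be halved?
]


def arguments_by_predicate(arcs):
    """Map each predicate to a dictionary of its labelled arguments."""
    frames = defaultdict(dict)
    for predicate, label, argument in arcs:
        frames[predicate][label] = argument
    return dict(frames)


frames = arguments_by_predicate(arcs)
print(frames["halve"])  # {'ARG1': 'banks', 'ARG2': 'debt'}
```

A downstream application such as information extraction could then read the 'who did what to whom' directly from the resulting dictionary.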

Parsing to dependency graphs has been a very active research area in the past few years. In spite of this, several important research questions remain unresolved. This project will specifically address the following:

  • How can we develop semantic parsers that strike a good balance between accuracy and efficiency?
  • How can we learn semantic parsers from large amounts of data without human intervention, and how can we handle the dynamicity of that data?
  • How can we integrate semantic parsers into systems with existing forms of structured data, such as relational databases or models of software systems?

These questions define the main areas of activity for this project:

Algorithm development

Most algorithmic techniques for dependency parsing are restricted to tree-shaped graphs in which a word can be the argument of at most one semantic predicate. Thus they cannot be used to obtain graphs such as the one shown above, where banks and debt are linked to three predicates. The limitation to tree-shaped graphs means that current parsers are necessarily inaccurate, and that important information is lost. This project will extend dependency parsing to cases where the target structures are not necessarily tree-shaped.
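The tree restriction can be illustrated with a short check, sketched here in Python under simplifying assumptions. A graph is given as (predicate, argument) pairs, and the check only tests the condition named above, namely that no word is the argument of more than one predicate; it does not test connectivity or acyclicity. The predicates pred_a and pred_b are hypothetical stand-ins, not taken from any real annotation.

```python
from collections import Counter


def head_counts(arcs):
    """Count, for every word, how many predicates take it as an argument."""
    return Counter(argument for _predicate, argument in arcs)


def is_tree_shaped(arcs):
    """Return True if no word is the argument of more than one predicate.

    This is only the in-degree condition; connectivity and acyclicity
    are deliberately not checked in this sketch.
    """
    return all(count <= 1 for count in head_counts(arcs).values())


# Arcs for "halve" follow the text; pred_a and pred_b are hypothetical.
tree_like = [("halve", "banks"), ("halve", "debt")]
reentrant = tree_like + [("pred_a", "banks"), ("pred_b", "banks")]

print(is_tree_shaped(tree_like))   # True
print(is_tree_shaped(reentrant))   # False
```

A parser restricted to tree-shaped output could return at most one of the three arcs that point to banks in the second graph, which is the loss of information described above.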

Unfortunately, it is well known that many computational problems on general graphs, including parsing, cannot be solved efficiently. However, this project can build on the observation, obtained in preliminary work [2], that the dependency graphs that are relevant in the context of semantic parsing are nearly tree-shaped, differing only minimally from the closest tree approximation. The project will be the first to develop efficient parsing algorithms that exploit this property.
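One rough way to make 'nearly tree-shaped' concrete is to count how many arcs have to be set aside before every word has at most one head. The Python sketch below does this greedily and is an illustrative proxy only: it is not the measure or the algorithm used in [2], it ignores connectivity, acyclicity, and arc scores, and the predicates pred_a and pred_b are hypothetical.

```python
def split_tree_approximation(arcs):
    """Split arcs into a single-head part and the remaining arcs.

    For each word, the first arc pointing to it is kept and any later
    arc is dropped. Which arc is kept is arbitrary, but the number of
    dropped arcs is the minimum needed to give every word one head.
    """
    kept, dropped = [], []
    seen_arguments = set()
    for predicate, argument in arcs:
        if argument in seen_arguments:
            dropped.append((predicate, argument))
        else:
            seen_arguments.add(argument)
            kept.append((predicate, argument))
    return kept, dropped


# Arcs for "halve" follow the text; pred_a and pred_b are hypothetical.
arcs = [
    ("halve", "banks"), ("halve", "debt"),
    ("pred_a", "banks"), ("pred_a", "debt"),
    ("pred_b", "banks"), ("pred_b", "debt"),
]

kept, dropped = split_tree_approximation(arcs)
print(len(kept), len(dropped))  # 2 4
```

On graphs over whole sentences, a small number of dropped arcs relative to the total would indicate a structure close to a tree.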

Machine learning

Text analytics makes heavy use of machine learning techniques. Most of these techniques rely on human experts to specify which features of the data are relevant for the intended analysis, and to produce training data that is then used to score the actual impact of each feature. A major limitation of this approach is that it requires extensive and expensive human intervention whenever the data changes, such as when targeting a new type of texts or a new language. This is a serious problem in many practical applications, where textual data is of great variety and becomes available in large volumes and with high speed.

Recent advances in computer science and machine learning have led to the development of new machine learning techniques that are able to learn the relevant features themselves; these techniques are known as feature learning or deep learning.

This project will be the first to apply these techniques in the context of semantic parsing to dependency graphs. The goal is to provide parsers that can be learned from data without extensive human intervention.

System integration

Semantic representations in the form of dependency graphs have been motivated primarily on linguistic grounds. While they have been shown to be useful for many applications, there is a lack of principled work on how they can be integrated with other forms of structured information that often exist in industrial contexts, such as the concepts stored in a relational database or the relations between software components as formalized in a UML diagram. This integration is essential for turning textual data into actionable intelligence.

This project aims to develop a general framework for the integration of dependency graphs and other forms of structured information, based on the view that the integration problem is essentially a problem of mapping between various graph-shaped representations. In attacking this problem, the project will build on concepts and techniques from graph transformation [5], a long-established sub-field of theoretical computer science which has recently attracted significant interest in text analytics. The integration framework will be developed in close collaboration with the project’s industrial partners, which will help ensure that it is able to provide support for practically relevant problems.
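To give a feel for the simplest possible mapping of this kind, the Python sketch below writes the arcs of a dependency graph into a relational table and reads them back with a query, using the sqlite3 module from the standard library. It is a toy illustration under stated assumptions, not the framework the project proposes: the table name arc, its columns, the role labels ARG1 and ARG2, and the helper functions are all hypothetical.

```python
import sqlite3


def store_arcs(connection, arcs):
    """Store (predicate, label, argument) arcs as rows of a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS arc "
        "(predicate TEXT, label TEXT, argument TEXT)"
    )
    connection.executemany("INSERT INTO arc VALUES (?, ?, ?)", arcs)
    connection.commit()


def arguments_of(connection, predicate):
    """Return the (label, argument) pairs of one predicate, sorted by label."""
    cursor = connection.execute(
        "SELECT label, argument FROM arc WHERE predicate = ? ORDER BY label",
        (predicate,),
    )
    return cursor.fetchall()


# In-memory database; the arcs for "halve" follow the example in the text,
# while the labels ARG1/ARG2 are assumed for illustration.
connection = sqlite3.connect(":memory:")
store_arcs(connection, [("halve", "ARG1", "banks"), ("halve", "ARG2", "debt")])
print(arguments_of(connection, "halve"))  # [('ARG1', 'banks'), ('ARG2', 'debt')]
```

Once the arcs are rows, they can be joined with existing enterprise tables. Mappings between richer graph-shaped representations would need more than a flat table of this kind.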

Visions and plans

In the first three years, the project will develop and implement new algorithms and machine learning techniques for semantic parsing, and evaluate these techniques in industrial applications. The scientific results of the project will be published in high-quality journals and conference proceedings, initially in the areas of natural language processing and machine learning. Implementations will be evaluated in the context of in-house applications at the industrial partners, with the aim of extending them into fully functional software components that can be integrated into existing systems. There will be regular meetings with the industrial partners to align the research with actual needs and practice. The project will also be used as an incubator for new research ideas; these will be turned into funding proposals for follow-up work. At an organizational level, the plan is to start building a research group around the project, initially consisting of the PI and a doctoral student.

In its second half, the project will shift focus to the question of system integration. At an organizational level, the project vision is to establish a strong, internationally competitive research group on text analytics at Linköping University. The group would be externally financed by grants from funding agencies such as Vetenskapsrådet or the European Research Council.

Research environment

The project is being carried out at the Department of Computer and Information Science (IDA). The department has a long-standing research group on natural language processing and is currently building up a new, interdisciplinary group on machine learning. The present project strengthens the competence of both of these groups and increases their international visibility.

Project staff:

References

[1] L. Jonsson. Increasing anomaly handling efficiency in large organizations using applied machine learning. In Proceedings of the 35th International Conference on Software Engineering (ICSE) Doctoral Symposium, pages 1361–1364, San Francisco, USA, 2013.

[2] M. Kuhlmann. Cubic-time graph parsing with a simple scoring scheme. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 395–399, Dublin, Republic of Ireland, 2014.

[3] MarketsAndMarkets. Text analytics market by applications (enterprise, web-based and software, data analysis, search-based, others), users (SMBs, enterprises) and deployment model (cloud, on-premise) - market forecasts and analysis (2014–2019). Market Report TC 2391, April 2014.

[4] S. Oepen, M. Kuhlmann, Y. Miyao, D. Zeman, D. Flickinger, J. Hajič, A. Ivanova, and Y. Zhang. SemEval 2014 Task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72, Dublin, Republic of Ireland, 2014.

[5] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph Transformation. World Scientific, Singapore, 1997.

Publications

Robin Kurtz and Marco Kuhlmann.
Exploiting Structure in Parsing to 1-Endpoint-Crossing Graphs.
In Proceedings of the 15th International Conference on Parsing Technologies (IWPT), Pisa, Italy, 2017.
Accepted for publication.

Marco Kuhlmann, Giorgio Satta, and Peter Jonsson.
On the Complexity of CCG Parsing.
CoRR, abs/1702.06594, 2017.

Per Fallgren, Jesper Segeblad, and Marco Kuhlmann.
Towards a Standard Dataset of Swedish Word Vectors.
In Proceedings of the Sixth Swedish Language Technology Conference (SLTC), Umeå, Sweden, 2016.

Marco Kuhlmann and Stephan Oepen.
Towards a Catalogue of Linguistic Graph Banks.
Computational Linguistics, 42(4):819–827, 2016.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Zdeňka Urešová.
Towards Comparability of Linguistic Graph Banks for Semantic Parsing.
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC), pages 3991–3995, Portorož, Slovenia, 2016.

Marco Kuhlmann and Peter Jonsson.
Parsing to Noncrossing Dependency Graphs.
Transactions of the Association for Computational Linguistics, 3(Nov):559–570, 2015.

Frank Drewes, Kevin Knight, and Marco Kuhlmann.
Formal Models of Graph Transformation in Natural Language Processing (Dagstuhl Seminar 15122).
Dagstuhl Reports, 5(3):143–161, 2015.

Marco Kuhlmann, Alexander Koller, and Giorgio Satta.
Lexicalization and Generative Power in CCG.
Computational Linguistics, 41(2):187–219, 2015.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová.
SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing.
In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915–926, Denver, CO, USA, 2015.

Peter Jonsson and Marco Kuhlmann.
Maximum Pagenumber-k Subgraph Is NP-Complete.
CoRR, abs/1504.05908, 2015.

Marco Kuhlmann.
Tabulation of Noncrossing Acyclic Digraphs.
CoRR, abs/1504.04993, 2015.

Student theses

Jesper Segeblad.
Putting a Spin on SPINN: Representations of Syntactic Structure in Neural Network Sentence Encoders for Natural Language Inference.
Master’s thesis, 2017.

Joakim Gylling.
Transition-Based Dependency Parsing with Neural Networks.
Bachelor’s thesis, 2017. Co-supervision with Rita Kovordanyi.

Nils Axelsson.
Dynamic Programming Algorithms for Semantic Dependency Parsing.
Master’s thesis, 2017. Main supervisor.

Wiktor Strandqvist.
Neural Networks for Part-of-Speech Tagging.
Bachelor’s thesis, 2016.

Sarah Hantosi Albertsson.
Textuella särdrag som kvalitet. En studie om att automatiskt mäta kvalitet i teknisk dokumentation.
Bachelor’s thesis, 2015. Co-supervision with Erik H. Karlsson (Saab AB).

Martina Nyberg.
Kommunikativa funktioner hos emotikoner i svenska twitterinlägg.
Bachelor’s thesis, 2015. Co-supervision with Magnus Sahlgren (Gavagai AB).


Page responsible: Marco Kuhlmann
Last updated: 2017-09-09