729G17 Language Technology


Course introduction

Welcome to the course! This introductory module consists of a short lecture that introduces you to language technology as an application area, as well as to the content and organisation of the course.

Materials

Detailed information about the course organisation and examination is available on this webpage.

Concepts

  • ambiguity, contextuality, multilinguality
  • combinatorial explosion

Topic 1: Text classification

Text classification is the task of categorising text documents into predefined classes. In this module you will be introduced to text classification and its applications, learn how to evaluate text classifiers using standard validation methods, and get to know the Naive Bayes classifier, a simple but effective probabilistic model for text classification.

Basic materials

Basic concepts
  • accuracy, precision, recall
  • Naive Bayes classifier
  • maximum likelihood estimation, additive smoothing
Basic procedures
  • evaluate a text classifier based on accuracy, precision, and recall
  • apply the classification rule of the Naive Bayes classifier to a text
  • learn the probabilities of a Naive Bayes classifier using maximum likelihood estimation and additive smoothing
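
To make these procedures concrete, here is a minimal Python sketch of a Naive Bayes classifier whose probabilities are learned with maximum likelihood estimation and additive smoothing. The toy corpus, the smoothing constant, and all names are illustrative, not part of the course materials.

    import math
    from collections import Counter, defaultdict

    # Toy training data: (document as a list of tokens, class label).
    train = [
        (["good", "great", "fun"], "pos"),
        (["boring", "bad"], "neg"),
        (["great", "good"], "pos"),
        (["bad", "awful", "boring"], "neg"),
    ]

    k = 1.0  # additive smoothing constant (k = 1 is Laplace smoothing)

    # Maximum likelihood counts: class frequencies and per-class word frequencies.
    class_counts = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for tokens, label in train:
        word_counts[label].update(tokens)

    vocab = {w for counts in word_counts.values() for w in counts}

    def log_prob(tokens, label):
        """Log of P(label) times the product of P(word | label), smoothed."""
        lp = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in tokens:
            lp += math.log((word_counts[label][w] + k) / (total + k * len(vocab)))
        return lp

    def classify(tokens):
        """Classification rule: pick the class with the highest log probability."""
        return max(class_counts, key=lambda label: log_prob(tokens, label))

    print(classify(["good", "fun"]))      # pos
    print(classify(["awful", "boring"]))  # neg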

Advanced materials

In the advanced section you will be introduced to the averaged multi-class perceptron classifier and learn how to implement it.

Advanced concepts
  • averaged perceptron classifier (advanced)
Advanced procedures
  • apply the classification rule of the averaged perceptron classifier to a text (advanced)
  • learn an averaged perceptron classifier from data (advanced)
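
As an illustration of the advanced procedures, the following sketch shows one common way to learn and apply an averaged multi-class perceptron. It assumes documents represented as feature dictionaries; the toy data and all names are made up for the example.

    from collections import defaultdict

    def train_averaged_perceptron(data, classes, epochs=10):
        """Learn an averaged perceptron from (feature dict, label) pairs.

        The returned weights are the average of the weight vectors over
        all updates, which tends to generalise better than the final ones.
        """
        weights = {c: defaultdict(float) for c in classes}
        totals = {c: defaultdict(float) for c in classes}  # sums for averaging
        steps = 0
        for _ in range(epochs):
            for features, gold in data:
                steps += 1
                guess = predict(weights, features)
                if guess != gold:  # mistake-driven update
                    for f, v in features.items():
                        weights[gold][f] += v
                        weights[guess][f] -= v
                for c in classes:  # accumulate the current weights
                    for f, v in weights[c].items():
                        totals[c][f] += v
        return {c: {f: v / steps for f, v in totals[c].items()} for c in classes}

    def predict(weights, features):
        """Classification rule: return the class with the highest score."""
        return max(weights, key=lambda c: sum(weights[c].get(f, 0.0) * v
                                              for f, v in features.items()))

    data = [({"good": 1, "great": 1}, "pos"), ({"bad": 1, "boring": 1}, "neg")]
    averaged = train_averaged_perceptron(data, classes=["pos", "neg"])
    print(predict(averaged, {"bad": 1}))  # neg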

Topic 2: Language modelling

Language modelling is about building models of which words are more or less likely to occur in a given language. This section focuses on n-gram models, which have a wide range of applications, such as predictive text input, automatic spelling correction, and machine translation. You will be introduced to techniques for training n-gram models on data and for evaluating them. The final part of the section introduces edit distance, which in combination with a language model can be used for automatic spelling correction.

Basic materials

Basic concepts
  • n-gram model
  • additive smoothing
  • perplexity, entropy

Basic procedures

  • learn an n-gram model using additive smoothing
  • evaluate an n-gram model using entropy
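
The following minimal Python sketch illustrates both basic procedures on a toy corpus: it estimates a bigram model with additive smoothing and evaluates it by computing entropy (and the corresponding perplexity) on a held-out sentence. All data and names are illustrative.

    import math
    from collections import Counter

    def train_bigram_model(sentences, k=0.1):
        """Estimate smoothed bigram probabilities P(w2 | w1)."""
        bigrams, unigrams, vocab = Counter(), Counter(), set()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            vocab.update(tokens)
            for w1, w2 in zip(tokens, tokens[1:]):
                bigrams[w1, w2] += 1
                unigrams[w1] += 1
        V = len(vocab)

        def prob(w1, w2):
            # Additive smoothing: add k to every bigram count.
            return (bigrams[w1, w2] + k) / (unigrams[w1] + k * V)

        return prob

    def entropy(prob, sentences):
        """Average negative log2 probability per bigram; lower is better."""
        log_sum, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for w1, w2 in zip(tokens, tokens[1:]):
                log_sum -= math.log2(prob(w1, w2))
                n += 1
        return log_sum / n

    prob = train_bigram_model([["the", "cat", "sleeps"], ["the", "dog", "sleeps"]])
    H = entropy(prob, [["the", "cat", "sleeps"]])
    print(f"entropy: {H:.2f} bits, perplexity: {2 ** H:.2f}")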

Advanced materials

The advanced material for this section is the Wagner–Fischer algorithm for computing the Levenshtein distance between two words.

Advanced concepts
  • Levenshtein distance
  • Wagner–Fischer algorithm
Advanced procedures
  • simulate the Wagner–Fischer algorithm
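
For reference, here is a compact Python version of the Wagner–Fischer algorithm; simulating the algorithm by hand amounts to filling in the table d row by row.

    def levenshtein(source, target):
        """Wagner-Fischer: dynamic programming over an (m+1) x (n+1) table.

        Cell d[i][j] holds the minimum number of insertions, deletions,
        and substitutions needed to turn source[:i] into target[:j].
        """
        m, n = len(source), len(target)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i  # delete all of source[:i]
        for j in range(n + 1):
            d[0][j] = j  # insert all of target[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[m][n]

    print(levenshtein("kitten", "sitting"))  # 3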

Topic 3: Part-of-speech tagging

A part-of-speech tagger is a computer program that tags each word in a sentence with its part of speech, such as noun, adjective, or verb. In this section you will learn how to evaluate part-of-speech taggers, and be introduced to two methods for part-of-speech tagging: hidden Markov models (which generalise the Markov models that you encountered in the section on language modelling) and the multi-class perceptron.

Materials

Basic concepts
  • part of speech, part-of-speech tagger
  • accuracy, precision, recall
  • hidden Markov model
  • multi-class perceptron, feature window
Basic procedures
  • evaluate a part-of-speech tagger based on accuracy, precision, and recall
  • compute the probability of a tagged sentence in a hidden Markov model
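
As an illustration of the last procedure, the sketch below computes the joint probability of a tagged sentence in a hidden Markov model. The transition and emission tables are toy values chosen for the example; a real model would estimate them from a tagged corpus.

    # P(tag | previous tag), with <s> as the start-of-sentence state.
    transition = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.9, ("NN", "VB"): 0.5}
    # P(word | tag).
    emission = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.2}

    def tagged_sentence_prob(words, tags):
        """P(words, tags) as a product of transition and emission probabilities."""
        p, prev = 1.0, "<s>"
        for word, tag in zip(words, tags):
            p *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
            prev = tag
        return p

    print(tagged_sentence_prob(["the", "dog", "barks"], ["DT", "NN", "VB"]))
    # 0.6*0.7 * 0.9*0.1 * 0.5*0.2 = 0.00378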

Advanced materials

The advanced material for this section is the Viterbi algorithm for computing the most probable tag sequence for a sentence under a hidden Markov model.

Advanced concepts
  • Viterbi algorithm
Advanced procedures
  • simulate the Viterbi algorithm
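
The following Python sketch of the Viterbi algorithm reuses the toy transition and emission tables from the sketch above; simulating the algorithm by hand means filling in the best table position by position and then following the backpointers.

    # Same toy HMM as in the previous sketch.
    transition = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.9, ("NN", "VB"): 0.5}
    emission = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.2}

    def viterbi(words, tags, transition, emission, start="<s>"):
        """Most probable tag sequence for words under the HMM."""
        # best[i][t]: probability of the best tagging of words[:i+1] ending in t.
        best = [{} for _ in words]
        back = [{} for _ in words]
        for t in tags:
            best[0][t] = (transition.get((start, t), 0.0)
                          * emission.get((t, words[0]), 0.0))
        for i in range(1, len(words)):
            for t in tags:
                prev = max(tags, key=lambda s: best[i - 1][s]
                           * transition.get((s, t), 0.0))
                back[i][t] = prev
                best[i][t] = (best[i - 1][prev] * transition.get((prev, t), 0.0)
                              * emission.get((t, words[i]), 0.0))
        # Follow the backpointers from the best final tag.
        path = [max(tags, key=lambda t: best[-1][t])]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))

    print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VB"], transition, emission))
    # ['DT', 'NN', 'VB']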

Topic 4: Syntactic analysis

Syntactic analysis, also called syntactic parsing, is the task of mapping a sentence to a formal representation of its syntactic structure. In this lecture you will learn about parsing to two target representations: phrase structure trees and dependency trees. The central model for parsing to phrase structure trees is the probabilistic context-free grammar. For parsing to dependency trees, you will learn about transition-based dependency parsing, an approach that is also used by Google.

Materials

Concepts

  • phrase structure tree, dependency tree
  • probabilistic context-free grammar
  • transition-based dependency parser

Procedures

  • learn a probabilistic context-free grammar from a treebank
  • simulate a transition-based dependency parser
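
As an illustration of the first procedure, here is a minimal Python sketch that reads off a probabilistic context-free grammar from a toy treebank using maximum likelihood estimation; the tree encoding and all data are made up for the example.

    from collections import Counter

    # A toy treebank: trees as nested tuples (label, child, ...), with
    # words as plain strings under their part-of-speech tags.
    treebank = [
        ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VB", "barks"))),
        ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VB", "sleeps"))),
    ]

    def rules(tree):
        """Yield the rules used in a tree, e.g. ('NP', ('DT', 'NN'))."""
        label, *children = tree
        if len(children) == 1 and isinstance(children[0], str):
            yield label, (children[0],)  # lexical rule, e.g. DT -> 'the'
            return
        yield label, tuple(child[0] for child in children)
        for child in children:
            yield from rules(child)

    # Maximum likelihood estimation: P(A -> alpha) = count(A -> alpha) / count(A).
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), c in rule_counts.items():
        lhs_counts[lhs] += c

    pcfg = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
    for (lhs, rhs), p in sorted(pcfg.items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")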

Topic 5: Semantic analysis

In this lecture you will learn about word senses and the problems they pose for language technology, as well as about two important tasks in semantic analysis: word sense disambiguation and word similarity. For each task you will learn about both knowledge-based and data-driven methods, including the popular continuous bag-of-words model used in Google’s word2vec software.

Materials

Concepts

  • word sense, homonymy, polysemy
  • synonymy, antonymy, hyponymy, hypernymy, WordNet
  • Simplified Lesk algorithm
  • word similarity, distributional hypothesis, co-occurrence matrix

Procedures

  • simulate the Simplified Lesk algorithm
  • compute the path-length-based similarity of two words
  • derive a co-occurrence matrix from a document collection
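
To illustrate the last procedure, here is a minimal Python sketch that derives a co-occurrence matrix from a toy document collection; under the distributional hypothesis, the rows of such a matrix can serve as simple word vectors for measuring word similarity. The window size and all data are illustrative.

    from collections import Counter, defaultdict

    documents = [
        "the cat chased the mouse".split(),
        "the dog chased the cat".split(),
        "the mouse ate the cheese".split(),
    ]

    def cooccurrence_matrix(documents, window=2):
        """Count how often each pair of words occurs within `window`
        tokens of each other."""
        matrix = defaultdict(Counter)
        for tokens in documents:
            for i, word in enumerate(tokens):
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if i != j:
                        matrix[word][tokens[j]] += 1
        return matrix

    matrix = cooccurrence_matrix(documents)
    print(matrix["cat"])  # context counts for 'cat'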

Page responsible: Marco Kuhlmann
Last updated: 2017-09-22