{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# L5: Semantic analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A **word space model** of word meanings represents words as vectors in a high-dimensional vector space. In this lab you will experiment with a word space model that was trained on the Swedish Wikipedia using [word2vec](https://code.google.com/archive/p/word2vec/). In order to use word2vec in Python, we use the [gensim](https://radimrehurek.com/gensim/) library.\n", "\n", "The library and some more essentials for this lab are contained in the module we load in the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lt5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the lab system" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the next cell to load the pre-trained word space model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = lt5.load_model(\"/courses/729G17/labs/l5/data/wikipedia-sv.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model consists of word vectors. In Python a word vector is represented as an [*array*](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For the purposes of this lab, you can treat arrays as lists. The next line of code prints the vector for the word *student*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model['student']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All vectors in the model have the same dimensionality $n$; this value is a parameter that is fixed when training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Write code to print the following:
\n", "Write code that computes the model’s accuracy on the task of predicting the fourth word in every analogy pair, given the other three. Feel free to use the complete() function that you implemented for Problem 3.
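
The accuracy described above is the share of analogies for which the predicted fourth word equals the expected one. The following is a minimal sketch of that bookkeeping, not a reference solution. It assumes that `complete()` takes the first three words as arguments and returns a single word; the actual signature from Problem 3 may differ. The names `analogy_accuracy` and `toy_complete` are hypothetical, and the second toy analogy is made up for illustration.

```python
from typing import Callable, Iterable, Sequence


def analogy_accuracy(
    analogies: Iterable[Sequence[str]],
    complete: Callable[[str, str, str], str],
) -> float:
    """Return the fraction of analogies whose fourth word is predicted correctly."""
    correct = 0
    total = 0
    for a, b, c, expected in analogies:
        total += 1
        # Assumption: complete(a, b, c) returns the single predicted word.
        if complete(a, b, c) == expected:
            correct += 1
    # Return 0.0 for an empty list instead of dividing by zero.
    return correct / total if total else 0.0


def toy_complete(a: str, b: str, c: str) -> str:
    """Hypothetical stand-in for the real complete(); knows one analogy only."""
    table = {("man", "kvinna", "kung"): "drottning"}
    return table.get((a, b, c), "")


# Toy data: the first analogy is answered correctly, the second is not.
examples = [
    ("man", "kvinna", "kung", "drottning"),
    ("man", "kvinna", "pojke", "flicka"),
]
print(analogy_accuracy(examples, toy_complete))
```

To use the sketch in the lab, pass your own `complete()` in place of `toy_complete`, wrapping it if it needs the model as an extra argument, and pass the analogies read from the example file.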
The analogies in the example file have been picked from ten different categories. Invent names for these categories. Which categories would you call semantic (related to the meaning of the words), and which would you call syntactic (related to the form and the grammatical behaviour of the words)?
\n", "Select four categories, and find one new example for each of them. Of the four examples, two should be examples where the model succeeds in reproducing the intended analogy, and two should be examples where the model fails to do so.
\n", "Part 1: Naming the categories
\n", "\n", "Example | Category |
---|---|
1 | your name for the category here |
2 | your name for the category here |
3 | your name for the category here |
4 | your name for the category here |
5 | your name for the category here |
6 | your name for the category here |
7 | your name for the category here |
8 | your name for the category here |
9 | your name for the category here |
10 | your name for the category here |
Part 2: New examples for four of the categories
\n", "\n", "Category | Example | Model’s completion |
---|---|---|
0 | man kvinna kung drottning | man kvinna kung drottning |
1 | your new example here | model’s completion here |
2 | your new example here | model’s completion here |
3 | your new example here | model’s completion here |
4 | your new example here | model’s completion here |
The lecture on word embeddings mentioned several shortcomings of the model. Design an experiment to find concrete examples that illustrate these shortcomings. Write a short reflection piece about your experience. Use the following prompts:
\n", "Once you have finished all problems, submit this notebook according to the instructions on the course web site.
\n", "