# L5: Semantic analysis

## Introduction

A **word space model** represents word meanings as vectors in a high-dimensional vector space. In this lab you will experiment with a word space model that was trained on the Swedish Wikipedia using [word2vec](https://code.google.com/archive/p/word2vec/). To work with word2vec models in Python, we use the [gensim](https://radimrehurek.com/gensim/) library.

The module loaded in the following cell contains the library and some other essentials for this lab.

In [None]:
import lt5

## Explore the lab system

Run the next cell to load the pre-trained word space model:

In [None]:
model = lt5.load_model("/courses/729G17/labs/l5/data/wikipedia-sv.bin")

The model consists of word vectors. In Python, a word vector is represented as a NumPy [*array*](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For the purposes of this lab, you can treat arrays as lists. The next line of code prints the vector for the word *student*:

In [None]:
model['student']

All vectors in the model have the same dimensionality $n$; this value is a parameter that is fixed when training the model.

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Write some code that prints $n$ for the model we loaded earlier.
</div>
</div>

In [None]:
# TODO: Insert code here to solve Problem 1.

Given a word space model, we can measure the semantic similarity between two words as the cosine similarity between their word vectors, that is, the cosine of the angle between them. The next line of code shows how to compute the cosine similarity:

In [None]:
print(model.similarity('student', 'lärare'))
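To make the computation concrete, here is a minimal sketch of the cosine of the angle between two vectors, written in plain Python. The function name `cosine_similarity` and the three toy vectors are made up for this illustration and are not part of the lab module or the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # points in the same direction as u
w = [0.0, 0.0, 3.0]   # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

The value depends only on the directions of the vectors, not on their lengths, and always lies between $-1$ and $1$.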

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
<p>Write code to print the following:</p>
<ul>
<li>the cosine similarity between a word of your choice and the word itself</li>
<li>the cosine similarity between two words that are, in your judgement, semantically related</li>
<li>the cosine similarity between two words that are, in your judgement, semantically unrelated</li>
</ul>
</div>
</div>

In [None]:
# TODO: Insert code here to solve Problem 2.

## Word analogies

In a word analogy task you are given two pairs of words that share a common semantic relation. A well-known example is *man/woman* and *king/queen*, where the semantic relation could be dubbed &lsquo;female&rsquo;. The task is to predict one of the words, e.g. *queen*, given the other three. By doing that we answer the question: &lsquo;*man* is to *woman* as *king* is to —?&rsquo;.

### Predict the fourth word

[Mikolov et al. (2013)](http://www.aclweb.org/anthology/N13-1090) have shown that the word analogy task can be solved by adding and subtracting word vectors in a word2vec model: the vector for *queen* is close (in terms of cosine similarity) to the vector *king* $-$ *man* $+$ *woman*. In the next problem you will implement this idea.

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
    
Write a function `complete()` that takes the first three words of a word analogy quadruple as input and predicts the fourth word.
</div>
</div>

To solve the problem you should complete the following code cell:

In [None]:
def complete(model, a, b, c):
    """Returns the fourth word in the analogy quadruple"""
    # TODO: Replace the next line with your own code to solve Problem 3.
    return lt5.complete(model, a, b, c)

The function is supposed to be called like this:

In [None]:
complete(model, "man", "kvinna", "kung")

To solve Problem&nbsp;3 you can use the following method of the model:

`model.most_similar(pos, neg, n)`

The method takes as its inputs two lists of words (strings), `pos` and `neg`, and a number `n`. It returns the `n` words whose vectors are closest to the vector obtained by adding the vectors of all words in `pos` and subtracting the vectors of all words in `neg`. Each result is a pair consisting of a word and its similarity score. Here is an example:

In [None]:
print(model.most_similar(['kung', 'kvinna'], ['man'], 3))
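The idea behind this kind of query can be sketched on a tiny, hand-made vocabulary. Everything in the following block is an assumption for illustration: the 3-dimensional vectors are invented, and the function `toy_most_similar` is a simplified stand-in that may differ from the library's actual implementation in details such as normalisation.

```python
import math

# Hypothetical vectors, invented for this illustration only.
toy_vectors = {
    "kung":      [1.0, 0.0, 1.0],
    "drottning": [1.0, 1.0, 1.0],
    "man":       [0.0, 0.0, 1.0],
    "kvinna":    [0.0, 1.0, 1.0],
    "prinsessa": [0.7, 1.0, 1.0],
    "äpple":     [0.0, 0.1, -1.0],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def toy_most_similar(vectors, pos, neg, n):
    """Rank words by cosine similarity to sum(pos) - sum(neg)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in pos:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in neg:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave out the query words themselves.
    exclude = set(pos) | set(neg)
    scored = [(word, cosine(target, vec))
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

result = toy_most_similar(toy_vectors, ["kung", "kvinna"], ["man"], 2)
print(result)
```

With these toy vectors, the best-ranked word is *drottning*, followed by *prinsessa*. On the real model the ranking depends on the trained vectors and will not always be this clean.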

### Categories of analogies

Word vectors are computed based on co-occurrence counts: words that frequently co-occur with the same other words will have similar vectors. To get a better understanding of the model&rsquo;s possibilities and limitations, we load a list of ten analogies:

In [None]:
analogies = lt5.load_data("/courses/729G17/labs/l5/data/analogies.txt")

Each element of `analogies` is a string consisting of four space-separated words. Here is an example:

In [None]:
print(analogies[0])
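Because the four words are separated by spaces, such a string can be taken apart with `str.split()`. The following sketch uses a hypothetical line in the same format; the actual contents of the data file may differ.

```python
# Hypothetical line in the four-word format described above.
line = "man kvinna kung drottning"

# Split on whitespace and unpack the four words.
a, b, c, d = line.split()

print(a, b, c)  # the three given words
print(d)        # the word to be predicted
```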

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
<p>Write code that computes the model&rsquo;s accuracy on the task of predicting the fourth word of every analogy quadruple, given the other three. Feel free to use the <code>complete()</code> function that you implemented for Problem&nbsp;3.</p>
</div>
</div>

Use the next code cell to solve the problem:

In [None]:
def evaluate(model, analogies):
    """Computes the accuracy of the specified model on the specified list of analogy quadruples"""
    # TODO: Replace the following line with your own code
    return lt5.evaluate(model, analogies)

print(evaluate(model, analogies))
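As a reminder of what accuracy means, the following sketch computes the fraction of predictions that exactly match the expected answers. It assumes that a prediction only counts as correct if it is identical to the expected word. The function name `accuracy` and both word lists are made up for illustration and do not come from the model or the data file.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the prediction equals the gold answer."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Hypothetical gold answers and predictions, invented for illustration.
gold = ["drottning", "Stockholm", "större", "gick"]
predicted = ["drottning", "Stockholm", "stor", "gick"]

print(accuracy(predicted, gold))  # 3 of 4 correct: 0.75
```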

<div class="panel panel-primary">
<div class="panel-heading">Problem 5</div>
<div class="panel-body">
    <p>The analogies in the example file have been picked from ten different categories. Invent names for these categories. Which categories would you call semantic (related to the <em>meaning</em> of the words), and which would you call syntactic (related to the <em>form</em> and the <em>grammatical behaviour</em> of the words)?</p>
    <p>Select four categories, and find one new example for each of them. Of the four examples, two should be examples where the model succeeds in reproducing the intended analogy, and two should be examples where the model fails to do so.</p>
</div>
</div>

*TODO: Answer Problem 5 by completing the following tables.*

<p><strong>Part 1: Naming the categories</strong></p>

<table>
    <tr><th>Example</th><th>Category</th></tr>
    <tr><td>1</td><td>your name for the category here</td></tr>
    <tr><td>2</td><td>your name for the category here</td></tr>
    <tr><td>3</td><td>your name for the category here</td></tr>
    <tr><td>4</td><td>your name for the category here</td></tr>
    <tr><td>5</td><td>your name for the category here</td></tr>
    <tr><td>6</td><td>your name for the category here</td></tr>
    <tr><td>7</td><td>your name for the category here</td></tr>
    <tr><td>8</td><td>your name for the category here</td></tr>
    <tr><td>9</td><td>your name for the category here</td></tr>
    <tr><td>10</td><td>your name for the category here</td></tr>
</table>

<p><strong>Part 2: New examples for four of the categories</strong></p>

<table>
    <tr><th>Category</th><th>Example</th><th>Model&rsquo;s completion</th></tr>
    <tr><td>0</td><td>man kvinna kung <em>drottning</em></td><td>man kvinna kung <em>drottning</em></td></tr>
    <tr><td>1</td><td>your new example here</td><td>model&rsquo;s completion here</td></tr>
    <tr><td>2</td><td>your new example here</td><td>model&rsquo;s completion here</td></tr>
    <tr><td>3</td><td>your new example here</td><td>model&rsquo;s completion here</td></tr>
    <tr><td>4</td><td>your new example here</td><td>model&rsquo;s completion here</td></tr>
</table>

In [None]:
# TODO: Enter code here to solve Problem 5

## Limitations of word embeddings

In the last problem of this lab, you will reflect on shortcomings of word embeddings.

<div class="panel panel-primary">
<div class="panel-heading">Problem 6 (Reflection)</div>
<div class="panel-body">
    <p>The lecture on word embeddings mentioned several shortcomings of the model. Design an experiment to find concrete examples that illustrate these shortcomings. Write a short reflection piece about your experience. Use the following prompts:</p>
    <ul>
        <li>How did you set up the experiment? What were the results?</li>
        <li>Based on your previous knowledge, did you expect the results? How do you explain them?</li>
        <li>What did you learn from this experiment? How, exactly, did you learn it? Why does this learning matter?</li>
    </ul>
</div>
</div>

In [None]:
# TODO: Enter code here to solve Problem 6

*TODO: Insert your text for Problem 6 here*

<div class="alert alert-info">
    <p>Once you have finished all problems, submit this notebook according to the instructions on the course web site.</p>
</div>