{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# L5: Semantic analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A **word space model** of word meanings represents words as vectors in a high-dimensional vector space. In this lab you will experiment with a word space model which was trained on the Swedish Wikipedia using [word2vec](https://code.google.com/archive/p/word2vec/). In order to use word2vec in Python, we use the [gensim](https://radimrehurek.com/gensim/) library.\n", "\n", "The library and some more essentials for this lab are contained in the module we load in the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import lt5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the lab system" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the next cell to load the pre-trained word space model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = lt5.load_model(\"/courses/729G17/labs/l5/data/wikipedia-sv.bin\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model consists of word vectors. In Python a word vector is represented as an [*array*](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html). For the purposes of this lab, you can treat arrays as lists. The next line of code prints the vector for the word *student*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model['student']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All vectors in the model have the same dimensionality \$n\$; this value is a parameter that is fixed when training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
Problem 1
\n", "
\n", "Write some code that prints \$n\$ for the model we loaded earlier.\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Insert code here to solve Problem 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a word space model, we can compute the semantic similarity between words using the cosine similarity between their respective word vectors, that is, the cosine of the angle between them; higher values mean more similar words. The next line of code showcases how to compute the cosine similarity:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model.similarity('student', 'lärare'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
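As a sketch of what the similarity score measures: assuming `model.similarity` behaves like the gensim method of that name, it returns the cosine of the angle between the two word vectors. The NumPy function below illustrates that computation on made-up toy vectors; the function name `cosine_similarity` and the vectors are assumptions for illustration and are not part of the `lt5` module.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration only.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # points in the same direction as u
w = np.array([3.0, 0.0, -1.0])  # orthogonal to u (dot product is 0)

print(cosine_similarity(u, v))  # close to 1: same direction
print(cosine_similarity(u, w))  # close to 0: orthogonal
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score close to 0.
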
\n", "
Problem 2
\n", "
\n", "

Write code to print the following:

\n", "
\n", "
• the cosine similarity between a word of your choice and the word itself
• \n", "
• the cosine similarity between two words that are, in your judgement, semantically related
• \n", "
• the cosine similarity between two words that are, in your judgement, semantically unrelated
• \n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Insert code here to solve Problem 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word analogies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a word analogy task you are given two pairs of words that share a common semantic relation. A well-known example is *man/woman* and *king/queen*, where the semantic relation could be dubbed ‘female’. The task is to predict one of the words, e.g. *queen*, given the other three. By doing that we answer the question: ‘*man* is to *woman* as *king* is to what?’.\n", "\n", "### Predict the fourth word\n", "\n", "[Mikolov et al. (2013)](http://www.aclweb.org/anthology/N13-1090) have shown that the word analogy task can be solved by adding and subtracting word vectors in a word2vec model: the vector for *queen* is close (in terms of cosine similarity) to the vector *king* \$-\$ *man* \$+\$ *woman*. In the next problem you will implement this idea." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
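The vector arithmetic behind this idea can be sketched on a tiny hand-made vocabulary. Everything in the block below is an assumption made for illustration: the two-dimensional vectors are invented, and `toy_complete` is a hypothetical helper, not a function of gensim or `lt5`. It only shows the principle of forming *king* - *man* + *woman* and picking the nearest remaining word.

```python
import numpy as np

# Hypothetical 2-dimensional toy vectors, invented for illustration only.
toy = {
    'man': np.array([1.0, 0.0]),
    'kvinna': np.array([1.0, 1.0]),
    'kung': np.array([3.0, 0.0]),
    'drottning': np.array([3.0, 1.0]),
    'student': np.array([0.0, 2.0]),
}

def cosine(u, v):
    # Cosine of the angle between u and v.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_complete(vectors, a, b, c):
    # Build the target vector c - a + b ...
    target = vectors[c] - vectors[a] + vectors[b]
    # ... and return the most similar word that is not one of the inputs.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(toy_complete(toy, 'man', 'kvinna', 'kung'))
```

Excluding the three input words from the candidates is a design choice in this sketch; without it, one of the inputs is often the nearest neighbour of the target vector.
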
\n", "
Problem 3
\n", "
\n", " \n", "Write a function `complete()` that takes the first three words of a word analogy quadruple as input and predicts the fourth word.\n", "
\n", "
\n", "\n", "To solve the problem you should complete the following code cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def complete(model, a, b, c):\n", "    \"\"\"Returns the fourth word in the analogy quadruple\"\"\"\n", "    # TODO: Replace the next line with your own code to solve Problem 3.\n", "    return lt5.complete(model, a, b, c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function is supposed to be called like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "complete(model, \"man\", \"kvinna\", \"kung\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To solve Problem 3 you can use the following method of the model:\n", "\n", "`model.most_similar(pos, neg, n)`\n", "\n", "The method takes as its inputs two lists with words (strings), `pos` and `neg`, and a number `n`. It returns the `n` words whose vectors are closest to the vector that one gets by adding all the vectors in the `pos` list and subtracting all the vectors in the `neg` list, each paired with its similarity score. Here is an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model.most_similar(['kung', 'kvinna'], ['man'], 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categories of analogies\n", "\n", "Word vectors are computed based on co-occurrence counts: words that co-occur frequently with certain other words are going to have similar vectors. In order to get a better understanding of the model’s possibilities and limitations, we load a list of ten analogy quadruples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analogies = lt5.load_data(\"/courses/729G17/labs/l5/data/analogies.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each element of `analogies` is a string consisting of four space-separated words. 
Here is an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(analogies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
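Since each quadruple is a single string of four space-separated words, it can be taken apart with `str.split()`. The sketch below uses a hand-written example string rather than data read from `analogies.txt`; the variable names are chosen for illustration only.

```python
# A hand-written quadruple in the same format as the elements of the list.
line = 'man kvinna kung drottning'

# split() without arguments splits on any run of whitespace.
a, b, c, d = line.split()

print(a, b, c)  # the three words given as input
print(d)        # the word to be predicted
```
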
\n", "
Problem 4
\n", "
\n", "

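Write code that computes the model’s accuracy on the task of predicting the fourth word of every analogy quadruple, given the other three. Accuracy here means the proportion of quadruples for which the predicted word equals the actual fourth word. Feel free to use the `complete()` function that you implemented for Problem 3.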

\n", "
\n", "
\n", "\n", "Use the next code cell to solve the problem:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def evaluate(model, analogies):\n", " \"\"\"Computes the accuracy of the specified model on the specified list of analogy quadruples\"\"\"\n", " # TODO: Replace the following line with your own code\n", " return lt5.evaluate(model, analogies)\n", "\n", "print(evaluate(model, analogies))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
Problem 5
\n", "
\n", "

The analogies in the example file have been picked from ten different categories. Invent names for these categories. Which categories would you call semantic (related to the meaning of the words), and which would you call syntactic (related to the form and the grammatical behaviour of the words)?

\n", "

Select four categories, and find one new example for each of them. Of the four examples, two should be examples where the model succeeds in reproducing the intended analogy, and two should be examples where the model fails to do so.

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*TODO: Answer Problem 5 by completing the following tables*\n", "\n", "

Part 1: Naming the categories

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Example | Category |
|---|---|
| 1 | your name for the category here |
| 2 | your name for the category here |
| 3 | your name for the category here |
| 4 | your name for the category here |
| 5 | your name for the category here |
| 6 | your name for the category here |
| 7 | your name for the category here |
| 8 | your name for the category here |
| 9 | your name for the category here |
| 10 | your name for the category here |
\n", "\n", "

Part 2: New examples for four of the categories

\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Category | Example | Model’s completion |
|---|---|---|
| 0 | man kvinna kung drottning | man kvinna kung drottning |
| 1 | your new example here | model’s completion here |
| 2 | your new example here | model’s completion here |
| 3 | your new example here | model’s completion here |
| 4 | your new example here | model’s completion here |
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Enter code here to solve Problem 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Limitations of word embeddings\n", "\n", "In the last problem of this lab, you will reflect on shortcomings of word embeddings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
Problem 6 (Reflection)
\n", "
\n", "

The lecture on word embeddings mentioned several shortcomings of the model. Design an experiment to find concrete examples that illustrate these shortcomings. Write a short reflection piece about your experience. Use the following prompts:

\n", "
\n", "
• How did you set up the experiment? What were the results?
• \n", "
• Based on your previous knowledge, did you expect the results? How do you explain them?
• \n", "
• What did you learn from this experiment? How, exactly, did you learn it? Why does this learning matter?
• \n", "
\n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Enter code here to solve Problem 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*TODO: Insert your text for Problem 6 here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Once you have finished all problems, submit this notebook according to the instructions on the course web site.

\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 1 }