Index of /divisions/hcs/nlplab/swectors

Name	Last modified	Size

create_vectors.py	2016-10-25 10:07	1.5K
sentences/	2020-06-22 14:43	-
suc.saldo	2016-06-06 15:26	398K
swectors-300dim.txt.bz2	2016-07-16 04:30	187M

Swectors

This is the code used for the paper "Towards a Standard Dataset of Swedish Word Vectors"; the PDF can be found here.

Creating Swedish word vectors

The script takes four parameters: method (cbow or sgns), dimensionality, window size, and number of iterations.

For example:

python3 create_vectors.py cbow 300 10 40

python3 create_vectors.py sgns 50 10 5
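
The contents of create_vectors.py are not reproduced on this page. As a rough sketch only, a command line of this shape (four positional parameters, with the method restricted to cbow or sgns) could be parsed as follows. The function name `parse_args` and the argument names are hypothetical and may not match the actual script.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse four positional parameters: method, dimensionality, window, iterations.

    Hypothetical stand-in for the real script's argument handling.
    """
    parser = argparse.ArgumentParser(
        description="Sketch of a command line with four positional parameters."
    )
    # Only the two methods named in the text are accepted.
    parser.add_argument("method", choices=["cbow", "sgns"])
    parser.add_argument("dimensionality", type=int)
    parser.add_argument("window", type=int)
    parser.add_argument("iterations", type=int)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.method, args.dimensionality, args.window, args.iterations)
```

With this sketch, an unknown method or a non-integer value makes argparse exit with a usage message instead of starting a training run.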

A text file is created in which the first field of each row is a unique word and the remaining fields are the elements of its vector, separated by spaces.
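
A file in this format can be read back with a few lines of Python. The following is a minimal sketch, assuming UTF-8 text, words that contain no spaces, and no header row. If a file starts with a word2vec-style "count dimension" header, that row would need to be skipped first. The function names `parse_vectors` and `load_vectors` are illustrative and not part of this repository.

```python
import bz2
from typing import Dict, Iterable, List


def parse_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse rows of the form 'word v1 v2 ... vn' into a word -> vector dict."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank rows and rows without any vector elements
        word, values = fields[0], fields[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


def load_vectors(path: str) -> Dict[str, List[float]]:
    """Open a plain or bz2-compressed vector file and parse it."""
    # Assumption: compressed files are recognised by their .bz2 suffix.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return parse_vectors(handle)
```

For example, `load_vectors("swectors-300dim.txt.bz2")` would return a dictionary keyed by word, provided the file follows the layout described above.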

Data

The training set is located in 'sentences'; each row corresponds to one sentence. A sample of Göteborgsposten-2013 (100k rows) is included.
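
To illustrate the one-sentence-per-row layout, here is a minimal sketch that walks a directory and yields each row as a list of tokens. It assumes the files are uncompressed UTF-8 text and that tokens are separated by whitespace; neither is confirmed by this page. The name `iter_sentences` is hypothetical.

```python
from pathlib import Path
from typing import Iterator, List


def iter_sentences(directory: str) -> Iterator[List[str]]:
    """Yield one whitespace-tokenised sentence per non-empty row."""
    # Sorting makes the iteration order reproducible across runs.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rt", encoding="utf-8") as handle:
            for row in handle:
                tokens = row.split()
                if tokens:  # ignore blank rows
                    yield tokens
```

Reading the rows lazily like this avoids holding the full training set in memory, which matters more for a complete corpus than for the 100k-row sample.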

Evaluation

Tools and instructions for how to use QVEC-CCA can be found here. For a quick start, simply download the file and add 'suc.saldo', then use the following line to evaluate a set of vectors.

./qvec_cca.py --in_vectors /path/to/vecs --in_oracle suc.saldo

Contact

Per Fallgren

Jesper Segeblad

Marco Kuhlmann