Commit 60cdeeee authored by Valentin Pelloin

datasets and others

parent bd2b7ee0
%% Cell type:markdown id: tags:
# Using Gensim with `svd2vec` output
[Gensim](https://pypi.org/project/gensim/) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Gensim can use `word2vec` vectors to compute similarity (and more!) between words. `svd2vec` can save its vectors in a `word2vec` format that Gensim can process.
This notebook shows how to use Gensim with vectors learned by `svd2vec`. We also compare our results with the pure `word2vec` model.
%% Cell type:markdown id: tags:
---
## I - Preparation
%% Cell type:code id: tags:
``` python
from svd2vec import svd2vec, FilesIO
from gensim.models import Word2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors
```
%% Cell type:code id: tags:
``` python
# Gensim does not have any implementation of an analogy method, so we add one here (3CosAdd)
def analogy_keyed(self, a, b, c, topn=10):
    return self.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2VecKeyedVectors.analogy = analogy_keyed

def analogy_w2v(self, a, b, c, topn=10):
    return self.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2Vec.analogy = analogy_w2v
```
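%% Cell type:markdown id: tags:
The `most_similar(positive=[b, c], negative=[a])` call above relies on Gensim to do the ranking. To make the 3CosAdd idea itself concrete, here is a minimal pure-Python sketch that scores each candidate word `x` with `cos(x, b) - cos(x, a) + cos(x, c)`. The `toy_vectors` table and the `analogy_3cosadd` helper are made up for illustration and are not part of Gensim or `svd2vec`; the scores Gensim reports are scaled differently, so only the ranking is meant to be comparable.
%% Cell type:code id: tags:
```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def analogy_3cosadd(vectors, a, b, c, topn=3):
    # "a is to b as c is to ?": rank every other word by the 3CosAdd score
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are never returned
        scores[word] = (cosine(vec, vectors[b])
                        - cosine(vec, vectors[a])
                        + cosine(vec, vectors[c]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# hypothetical 3-dimensional embeddings (axes: "French", "German", "country")
toy_vectors = {
    "paris":   [1.0, 0.0, 0.0],
    "france":  [1.0, 0.0, 1.0],
    "berlin":  [0.0, 1.0, 0.0],
    "germany": [0.0, 1.0, 1.0],
    "munich":  [0.0, 1.0, 0.1],
    "poland":  [0.2, 0.2, 1.0],
}

result = analogy_3cosadd(toy_vectors, "paris", "france", "berlin")
# with these toy vectors the top-ranked word is "germany"
print(result)
```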
%% Cell type:code id: tags:
``` python
# we load our previously made text8 document list
documents = [open("text8", "r").read().split(" ")[1:]]
```
%% Cell type:code id: tags:
``` python
from svd2vec import Utils
documents = Utils.split(documents[0], 1701)
documents = FilesIO.load_corpus("text8")
```
%% Cell type:markdown id: tags:
---
## II - Models construction
%% Cell type:markdown id: tags:
### SVD with svd2vec
%% Cell type:code id: tags:
``` python
#svd2vec_svd = svd2vec(documents, size=100, window=5, min_count=100, verbose=False)
svd2vec_svd = svd2vec.load("svd.svd2vec")
svd2vec_svd = svd2vec(documents, size=300, window=5, min_count=100, verbose=False)
```
%% Cell type:markdown id: tags:
### SVD with Gensim from svd2vec
%% Cell type:code id: tags:
``` python
# we first need to export svd2vec_svd to the word2vec format
svd2vec_svd.save_word2vec_format("svd.word2vec")
# we then load the model using Gensim
gensim_svd = Word2VecKeyedVectors.load_word2vec_format("svd.word2vec")
```
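%% Cell type:markdown id: tags:
For readers unfamiliar with the exchange file used above, the plain-text `word2vec` format is simple: a header line with the vocabulary size and the vector dimension, followed by one line per word with its components. The sketch below writes and re-reads such a file with hypothetical helpers (`write_word2vec_text`, `read_word2vec_text`) and made-up vectors. It assumes the non-binary variant of the format and does not show what `svd2vec` or Gensim do internally.
%% Cell type:code id: tags:
```python
import os
import tempfile

def write_word2vec_text(path, vectors):
    # header: "<vocab_size> <dimension>", then "<word> <v1> <v2> ..." per line
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write("{} {}\n".format(len(vectors), dim))
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def read_word2vec_text(path):
    with open(path, "r", encoding="utf-8") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    # sanity checks against the header
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    if any(len(v) != dim for v in vectors.values()):
        raise ValueError("vector dimension does not match the header")
    return vectors

# made-up vectors, only to exercise the round trip
sample = {"good": [0.25, -0.5, 1.0], "bad": [0.5, 0.125, -0.75]}
path = os.path.join(tempfile.mkdtemp(), "toy.word2vec")
write_word2vec_text(path, sample)
restored = read_word2vec_text(path)
print(restored == sample)
```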
%% Cell type:markdown id: tags:
### word2vec
%% Cell type:code id: tags:
``` python
import os
# NOTE: the trailing `or True` makes this condition always true, so the model is retrained on every run
if not os.path.isfile("w2v.word2vec") or True:
    # we train the model using word2vec (needs to be installed)
    !word2vec -min-count 100 -size 300 -window 5 -train text8 -output w2v.word2vec
# we load it
word2vec_w2v = Word2VecKeyedVectors.load_word2vec_format("w2v.word2vec")
```
%% Output
Starting training using file text8
Vocab size: 11816
Words in train file: 15471434
Alpha: 0.000005 Progress: 100.04% Words/thread/sec: 208.82k
%% Cell type:markdown id: tags:
### word2vec with Gensim
%% Cell type:code id: tags:
``` python
gensim_w2v = Word2Vec(documents, size=300, window=5, min_count=100, workers=16)
```
%% Cell type:code id: tags:
``` python
len(list(gensim_w2v.wv.vocab.keys()))
```
%% Output
11815
%% Cell type:markdown id: tags:
---
## III - Cosine similarity comparison
%% Cell type:code id: tags:
``` python
def compare_similarity(w1, w2):
    print("cosine similarity between", w1, "and", w2, ":")
    print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
    print("\tgensim_svd ", gensim_svd.similarity(w1, w2))
    print("\tgensim_w2v ", gensim_w2v.wv.similarity(w1, w2))
    print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))

def compare_analogy(w1, w2, w3, topn=3):
    def analogy_str(model):
        a = model.analogy(w1, w2, w3, topn=topn)
        s = "\n\t\t".join(["{: <20}".format(w) + str(c) for w, c in a])
        return "\n\t\t" + s
    print("analogy similaties :", w1, "is to", w2, "as", w3, "is to?")
    print("\tsvd2vec_svd", analogy_str(svd2vec_svd))
    print("\tgensim_svd", analogy_str(gensim_svd))
    print("\tgensim_w2v", analogy_str(gensim_w2v))
    print("\tword2vec_w2v", analogy_str(word2vec_w2v))
```
%% Cell type:code id: tags:
``` python
compare_similarity("good", "bad")
```
%% Output
cosine similarity between good and bad :
svd2vec_svd 0.4951483093832256
gensim_svd 0.4951475
gensim_w2v 0.7723463
word2vec_w2v 0.728928
svd2vec_svd 0.5542564783462338
gensim_svd 0.55425656
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-734433f8ebf5> in <module>()
----> 1 compare_similarity("good", "bad")
<ipython-input-8-20c164662123> in compare_similarity(w1, w2)
3 print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
4 print("\tgensim_svd ", gensim_svd.similarity(w1, w2))
----> 5 print("\tgensim_w2v ", gensim_w2v.wv.similarity(w1, w2))
6 print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))
7
NameError: name 'gensim_w2v' is not defined
%% Cell type:code id: tags:
``` python
compare_similarity("truck", "car")
```
%% Output
cosine similarity between truck and car :
svd2vec_svd 0.8725645794464922
gensim_svd 0.8725649
gensim_w2v 0.71462846
word2vec_w2v 0.6936528
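%% Cell type:markdown id: tags:
In the output above, `svd2vec_svd` and `gensim_svd` report almost the same value, which is expected since `gensim_svd` is loaded from the vectors exported by `svd2vec_svd`. The small remaining gap is consistent with rounding or lower-precision storage, although the notebook does not state the cause. A tolerance check such as the sketch below can confirm that two scores agree; the `scores_agree` helper and its `abs_tol` default are illustrative choices, not part of either library.
%% Cell type:code id: tags:
```python
import math

# scores copied from the "truck" / "car" output above
svd2vec_score = 0.8725645794464922
gensim_svd_score = 0.8725649
gensim_w2v_score = 0.71462846

def scores_agree(x, y, abs_tol=1e-5):
    # absolute tolerance only: cosine similarities all live in [-1, 1]
    return math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol)

same_vectors = scores_agree(svd2vec_score, gensim_svd_score)
different_models = scores_agree(gensim_svd_score, gensim_w2v_score)
print(same_vectors, different_models)
```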
%% Cell type:code id: tags:
``` python
compare_analogy("january", "month", "monday")
```
%% Output
analogy similaties : january is to month as monday is to?
svd2vec_svd
friday 0.7990049263196153
holiday 0.7774813849657727
day 0.7696653269345999
gensim_svd
friday 0.7990041971206665
holiday 0.7774807810783386
day 0.7696648836135864
gensim_w2v
week 0.7143122553825378
evening 0.6310715675354004
weekend 0.6066169142723083
word2vec_w2v
week 0.7236202359199524
evening 0.5867935419082642
weekend 0.5843297839164734
%% Cell type:code id: tags:
``` python
compare_analogy("paris", "france", "berlin")
```
%% Output
analogy similaties : paris is to france as berlin is to?
svd2vec_svd
germany 0.7687125088187668
reich 0.7243489014216623
sch 0.7123675101373064
gensim_svd
germany 0.7687125205993652
reich 0.7243496179580688
sch 0.712367594242096
gensim_w2v
germany 0.8262317180633545
finland 0.7536041140556335
austria 0.7173164486885071
word2vec_w2v
germany 0.840154767036438
austria 0.6982203722000122
poland 0.6571524143218994
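%% Cell type:markdown id: tags:
The four models can also be compared by how much their top answers overlap rather than by the raw scores. The sketch below computes the Jaccard overlap of the top-3 word lists copied from the output above. The `jaccard` helper is a hypothetical convenience function, and Jaccard overlap is only one possible choice of agreement measure.
%% Cell type:code id: tags:
```python
def jaccard(words_a, words_b):
    # size of the intersection divided by size of the union
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# top-3 answers for "paris is to france as berlin is to?", copied from the output above
top3 = {
    "svd2vec_svd":  ["germany", "reich", "sch"],
    "gensim_svd":   ["germany", "reich", "sch"],
    "gensim_w2v":   ["germany", "finland", "austria"],
    "word2vec_w2v": ["germany", "austria", "poland"],
}

svd_pair = jaccard(top3["svd2vec_svd"], top3["gensim_svd"])
w2v_pair = jaccard(top3["gensim_w2v"], top3["word2vec_w2v"])
cross_pair = jaccard(top3["svd2vec_svd"], top3["gensim_w2v"])
print(svd_pair, w2v_pair, cross_pair)
```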
%% Cell type:code id: tags:
``` python
compare_analogy("man", "king", "woman")
```
%% Output
analogy similaties : man is to king as woman is to?
svd2vec_svd
crowned 0.623713716342001
isabella 0.6024687219275104
consort 0.6019050828977524
princess 0.5237731172162106
isabella 0.5202350726282744
vii 0.49219104719485585