prashant.goojar2346 asked . 21/05/2019

How can I use a word2vec model to train a machine learning classifier using MATLAB?

Matlab

Expert Answer

Kshitij Singh ( Staff ) answered . 22/05/2019


I have a very rudimentary knowledge of MATLAB, having had to use it for a few Coursera classes I attended. But given that it is a language with libraries to do matrix manipulation, I am guessing that MATLAB machine learning algorithms (both built-in and ones you would create from scratch) use matrix input and output.

Word2vec (like other word embeddings) is essentially a dictionary with words as keys and fixed-dimensional vectors (300-d for the Google News model) as values. Suppose you are building a text classifier that takes in sentences (sequences of tokens) and predicts a sentiment (positive, neutral, or negative). You can look up each word in the dictionary and extract its vector, so each sentence becomes a sequence of 300-d vectors. To get a fixed-size input for a standard classifier, you then combine the vectors of each sentence into a single vector, for example by averaging them, so that your input is a matrix of shape (number-of-sentences, 300).
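A minimal sketch of that lookup, assuming a toy 3-dimensional embedding table in place of the real 300-d model, and assuming the word vectors of each sentence are averaged into one row (one common choice, not the only one). The names `toy_embeddings` and `sentence_to_vector` are made up for illustration.

```python
# Toy embedding table: 3-d vectors stand in for the real 300-d ones (assumption).
toy_embeddings = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 0.0],
    "bad": [-1.0, 0.0, -2.0],
}
DIM = 3


def sentence_to_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values by dimension, so each column is averaged.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


sentences = [["good", "movie"], ["bad", "movie"], ["unknown"]]
# One row per sentence: shape (number-of-sentences, DIM).
features = [sentence_to_vector(s, toy_embeddings, DIM) for s in sentences]
```

Each row of `features` can then be paired with its sentiment label and handed to whatever classifier you choose.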

If you are asking about the mechanics of converting the binary word2vec model into something usable in MATLAB, I would recommend gensim (a Python library), which provides a very simple API for downloading the word2vec model. You can then iterate through the model and write its contents to a text file, which you can import into MATLAB. Here is some (untested) Python code that might be a helpful starting point.

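import gensim.downloader as api

# Download (on first use) and load the pretrained Google News vectors.
w2v = api.load("word2vec-google-news-300")

# Write one line per word: the word, a TAB, then the comma-separated vector.
with open("/path/to/your/textfile.tsv", "w", encoding="utf-8") as fout:
    # key_to_index is the gensim 4.x vocabulary; gensim 3.x used w2v.vocab instead.
    for word in w2v.key_to_index:
        vector = w2v[word]
        vector_str = ",".join(["{:.7e}".format(x) for x in vector.tolist()])
        fout.write("{}\t{}\n".format(word, vector_str))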

This will write out the contents of the model into a file where each line is the word followed by a TAB character, followed by a comma-separated list of 300 numbers for the word vector.
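To sanity-check that format before importing the file into MATLAB, a line can be parsed back as in the sketch below. The sample line and the helper `parse_line` are illustrative assumptions, and the vector is shortened to 3 numbers instead of 300.

```python
def parse_line(line):
    """Split 'word<TAB>v1,v2,...' into (word, list of floats)."""
    word, vector_str = line.rstrip("\n").split("\t")
    return word, [float(x) for x in vector_str.split(",")]


# Hypothetical sample line; real lines would carry 300 numbers.
sample = "hello\t1.0000000e-01,-2.5000000e+00,3.0000000e+00\n"
word, vector = parse_line(sample)
```

If the parse works on a few lines of the real file, the same split (TAB first, then commas) should carry over to whichever import routine you use on the MATLAB side.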