NLP word vector - word2vec

word vector

In recent years, word vectors have become basic and very important background knowledge in NLP.
Words or phrases from a vocabulary are mapped to vectors of real numbers. This translates human language symbols into numbers that machines can compute with, which can improve the quality of downstream tasks such as machine translation.


For example, wherever word_1 appears in articles and sentences, it can be represented by the vector [0,0,1]. There are many models and tools for turning words into vectors, e.g. one-hot encoding, n-gram, and word2vec.
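As a minimal sketch of the one-hot case, the snippet below builds the [0,0,1] vector for word_1. The three-word vocabulary and the `one_hot` helper are assumptions made up for illustration; they are not part of any library.

```python
import numpy as np

def one_hot(word, vocab):
    """Return a one-hot vector for `word` over the ordered list `vocab`."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

# Hypothetical three-word vocabulary; word_1 is placed last so that it
# maps to [0, 0, 1] as in the example above.
vocab = ["word_3", "word_2", "word_1"]
word_1_vec = one_hot("word_1", vocab)
print(word_1_vec)  # [0 0 1]
```

Every one-hot vector has length V (the vocabulary size), and any two different words are orthogonal, so this representation by itself says nothing about how similar two words are.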

word2vec

Word2vec is a tool for learning word vectors from text.


Word2vec contains two models, skip-gram and CBOW, as well as two efficient training methods, negative sampling and hierarchical softmax. Why introduce word2vec separately? Because its vectors express the similarity and analogy between different words well.
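To make "similarity and analogy" concrete, here is a small sketch using cosine similarity and vector arithmetic. The 2-dimensional vectors are made-up numbers chosen by hand for illustration, not the output of a trained word2vec model, and the word list is hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors (NOT trained embeddings).
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, -0.5]),
}

# Analogy by arithmetic: king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(vectors[w], target))
print(best)
```

With real embeddings the same two operations apply, only the vectors have a few hundred dimensions and come from training rather than from hand-picked values.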


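A quick word about these two language models. A statistical language model is given a few words and computes the (posterior) probability of some word appearing, conditioned on those words having appeared.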


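CBOW (continuous bag-of-words) is one kind of statistical language model. As the name suggests, it computes the probability of a word given the C words before it, or the C consecutive words around it. The Skip-Gram model does the opposite: given a word, it computes the probability of each of several words appearing before and after it.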
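The difference between the two models shows up in how training examples are cut from a sentence. The sketch below is an illustration only: the `context_pairs` helper, the English toy sentence, and the choice of `window` words on each side are assumptions, not the exact windowing of any particular word2vec implementation.

```python
def context_pairs(tokens, window):
    """For each position, return (center, context) with up to `window` words per side."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((center, left + right))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = context_pairs(tokens, window=1)

# CBOW: the context words are the input, the center word is the target.
cbow_examples = [(context, center) for center, context in pairs]

# Skip-gram: the center word is the input, each context word is a target.
skipgram_examples = [(center, c) for center, context in pairs for c in context]

print(cbow_examples[1])       # (['the', 'sat'], 'cat')
print(skipgram_examples[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW produces one example per position, while skip-gram produces one example per (center, context word) pair, so skip-gram sees more training examples from the same text.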

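Take the sentence “我爱北京天安门” ("I love Beijing Tiananmen") as an example. Suppose the word we are focusing on is “爱” ("love"); with C=2, its context is “我” ("I") and “北京天安门” ("Beijing Tiananmen"). The CBOW model takes the one-hot representations of “我” and “北京天安门” as input, that is, C vectors of size 1xV. Each is multiplied by the same VxN weight matrix W1 to give C hidden-layer vectors of size 1xN, and these C vectors are then averaged, so there is effectively only one hidden layer.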
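The input-to-hidden step can be sketched in NumPy as follows. The three-token vocabulary, the hidden size N=4, and the random W1 are assumptions for illustration; a real model has a much larger V and a trained W1.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["我", "爱", "北京天安门"]   # toy vocabulary, V = 3
V, N = len(vocab), 4                # N is an arbitrary hidden size
W1 = rng.normal(size=(V, N))        # input-to-hidden weights (random, untrained)

def one_hot(word):
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

context = ["我", "北京天安门"]                   # C = 2 context tokens
X = np.stack([one_hot(w) for w in context])     # shape (C, V)
H = X @ W1                                      # shape (C, N): one 1xN vector per context word
h = H.mean(axis=0)                              # shape (N,): the averaged hidden layer

# Multiplying a one-hot vector by W1 just selects a row of W1,
# so h equals the mean of the rows for the context words.
rows = [vocab.index(w) for w in context]
print(np.allclose(h, W1[rows].mean(axis=0)))    # True
```

Because of this row-selection property, implementations typically look up rows of W1 directly instead of performing the matrix multiplication.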


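This step is sometimes called a linear activation function (does that even count as an activation function? It plainly means there is no activation function at all). The hidden layer is then multiplied by another weight matrix W2 of size NxV to give a 1xV output layer; after a softmax, each element of this output layer is the posterior probability of the corresponding word in the vocabulary. The output layer is compared with the ground truth, the one-hot form of “爱”, to compute the loss.

Note that V is usually very large, for example several million, so this computation is quite expensive: besides the element at the position of “爱”, which must enter the loss, a full softmax also has to evaluate every other word. Word2vec offers two ways to cut this cost. Hierarchical softmax, based on Huffman coding, replaces the flat output layer with a binary tree, so the cost per prediction drops from O(V) to O(log V). Negative sampling instead updates only the target word plus a small number of sampled negative words, and ignores the rest. Skip-gram training is similar, except that the input and output are swapped.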
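The hidden-to-output step and the two kinds of loss can be sketched as below. All sizes, the random W2 and h, the target index, and the uniform choice of negatives are assumptions for illustration; word2vec itself draws negatives from a smoothed word-frequency distribution rather than uniformly.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 10, 4                     # toy sizes; real vocabularies are far larger
W2 = rng.normal(size=(N, V))     # hidden-to-output weights (random, untrained)
h = rng.normal(size=N)           # stand-in for the averaged hidden layer
target = 3                       # hypothetical index of the center word

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Full softmax: uses all V columns of W2.
probs = softmax(h @ W2)                 # shape (V,), sums to 1
full_loss = -np.log(probs[target])      # cross-entropy against the one-hot target

# Negative sampling: uses only the target column plus k sampled columns.
k = 3
candidates = np.array([i for i in range(V) if i != target])
negatives = rng.choice(candidates, size=k, replace=False)  # uniform: a simplification
pos_score = h @ W2[:, target]           # scalar score for the true word
neg_scores = h @ W2[:, negatives]       # shape (k,), scores for the sampled words
ns_loss = -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))

print(full_loss, ns_loss)
```

The full softmax loss needs V dot products per example, while the negative-sampling loss needs only k + 1, which is where the saving comes from when V is in the millions.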