Everything you know about word2vec is not true

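The classic explanation of word2vec as a skip-gram architecture with negative sampling, found in the original paper and in countless blog posts, goes like this: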

    while(1) {
        1. vf = vector of focus word
        2. vc = vector of context word
        3. train such that (vc . vf = 1)
        4. for(0 <= i < negative samples):
               vneg = vector of word *not* in context
               train such that (vf . vneg = 0)
    }

Indeed, if you google [word2vec skipgram], the results you find describe exactly this.

But all of these implementations are wrong.

The original C implementation of word2vec works differently, and fundamentally so. People who build production systems on word2vec embeddings do one of the following:

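Call the original C implementation directly.
Use the gensim implementation, which is transliterated from the C source down to matching variable names.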

In fact, gensim is the only implementation I know of that is true to the C code.

The C implementation

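The C implementation actually maintains two vectors for each word: one for the word as the focus word, and one for the word as a context word. (Sound familiar? Right, it looks as though GloVe borrowed this idea from word2vec, where it was never spelled out!)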

The setup in the C code is exceptionally well done:

The syn0 array holds the vector embedding of a word when it occurs as the focus word. It is initialized randomly.

https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L369

    for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
        next_random = next_random * (unsigned long long)25214903917 + 11;
        syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
    }

A second array, syn1neg, holds the vector of a word when it occurs as a context word. It is initialized to zero.
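The zero initialization is not part of the excerpt above, so here is a minimal sketch of both initializations side by side. The function name `init_embeddings`, its signature, and the seed value are assumptions made for illustration; only the random formula for syn0 and the zero fill for syn1neg come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

typedef float real; /* single-precision weights, as in word2vec.c */

/* Hypothetical helper, not part of word2vec.c: allocates both tables and
   fills them the way the text describes. Returns 0 on success, -1 if an
   allocation fails. The caller owns both arrays and must free them. */
int init_embeddings(long long vocab_size, long long layer1_size,
                    real **syn0_out, real **syn1neg_out) {
    assert(vocab_size > 0 && layer1_size > 0);
    size_t n = (size_t)vocab_size * (size_t)layer1_size;
    real *syn0 = (real *)malloc(n * sizeof(real));
    real *syn1neg = (real *)malloc(n * sizeof(real));
    if (syn0 == NULL || syn1neg == NULL) {
        free(syn0);
        free(syn1neg);
        return -1;
    }
    unsigned long long next_random = 1; /* assumed seed for this sketch */
    for (size_t i = 0; i < n; i++) {
        /* context vectors start at exactly zero */
        syn1neg[i] = 0;
        /* focus vectors start as small noise in [-0.5, 0.5) / layer1_size */
        next_random = next_random * 25214903917ULL + 11;
        syn0[i] = (real)((((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size);
    }
    *syn0_out = syn0;
    *syn1neg_out = syn1neg;
    return 0;
}
```

Calling `init_embeddings(vocab_size, layer1_size, &syn0, &syn1neg)` yields one table of small random focus vectors and one table of all-zero context vectors, each indexed as `word * layer1_size + component`.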

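During training (skip-gram with negative sampling, though the other cases are roughly the same), we first pick a focus word. It is held fixed while we train on both the positive and the negative samples. The gradients for the focus vector are accumulated in a buffer and applied to the focus word only after it has been trained on all the positive and negative samples.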

    if (negative > 0) for (d = 0; d < negative + 1; d++) {
        // if we are performing negative sampling, in the 1st iteration,
        // pick a word from the context and set the dot product target to 1
        if (d == 0) {
            target = word;
            label = 1;
        } else {
            // for all other iterations, pick a word randomly and set the dot
            // product target to 0
            next_random = next_random * (unsigned long long)25214903917 + 11;
            target = table[(next_random >> 16) % table_size];
            if (target == 0) target = next_random % (vocab_size - 1) + 1;
            if (target == word) continue;
            label = 0;
        }
        l2 = target * layer1_size;
        f = 0;
        // find the dot product of the focus vector with the target
        // (positive or negative sample) vector and store it in f
        for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
        // set g = (label - sigmoid(f)) * alpha; sigmoid(f) is read from a
        // precomputed table and saturates outside [-MAX_EXP, MAX_EXP]
        if (f > MAX_EXP) g = (label - 1) * alpha;
        else if (f < -MAX_EXP) g = (label - 0) * alpha;
        else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
        // 1. STORE THE syn0 gradient in a temporary buffer neu1e
        // 2. update the vector syn1neg
        // 3. DO NOT UPDATE syn0 yet
        for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
        for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
    }
    // Finally, after all samples, update syn0 from neu1e
    // https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
    // Learn weights input -> hidden
    for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
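The excerpt depends on globals from word2vec.c, so it cannot be compiled on its own. The following is a self-contained sketch of the same update order, under several assumptions: the names `sgns_step` and `approx_exp` are hypothetical, the targets are passed in by the caller instead of being drawn from the sampling table, and the sigmoid is computed with a simple approximation rather than the precomputed expTable.

```c
#include <assert.h>
#include <stdlib.h>

typedef float real;
#define MAX_EXP 6

/* Assumption for this sketch: exp(x) approximated as (1 + x/1024)^1024 by
   ten squarings, so no math library is needed. word2vec.c uses a table. */
static real approx_exp(real x) {
    real y = 1 + x / 1024;
    for (int i = 0; i < 10; i++) y *= y;
    return y;
}

/* Hypothetical helper: one training step for a single focus word.
   targets[0] is the positive sample (label 1); the rest are negatives
   (label 0). Vectors are rows of length dim in syn0 and syn1neg. */
void sgns_step(real *syn0, real *syn1neg, int dim, int focus,
               const int *targets, int n_targets, real alpha) {
    assert(dim > 0 && n_targets > 0);
    real *neu1e = (real *)calloc((size_t)dim, sizeof(real));
    if (neu1e == NULL) return;
    real *vf = syn0 + (size_t)focus * dim;
    for (int d = 0; d < n_targets; d++) {
        real label = (d == 0) ? 1 : 0;
        real *vt = syn1neg + (size_t)targets[d] * dim;
        real f = 0, g;
        for (int c = 0; c < dim; c++) f += vf[c] * vt[c];
        /* g = (label - sigmoid(f)) * alpha, saturated like the original */
        if (f > MAX_EXP) g = (label - 1) * alpha;
        else if (f < -MAX_EXP) g = label * alpha;
        else g = (label - 1 / (1 + approx_exp(-f))) * alpha;
        /* buffer the focus gradient first, then move the context vector */
        for (int c = 0; c < dim; c++) neu1e[c] += g * vt[c];
        for (int c = 0; c < dim; c++) vt[c] += g * vf[c];
    }
    /* the focus vector is touched only once, after all samples */
    for (int c = 0; c < dim; c++) vf[c] += neu1e[c];
    free(neu1e);
}
```

The point of the buffer is that every sample in the step sees the same, unmodified focus vector; only the context vectors change inside the loop.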

Why random and zero initialization?

Once again, since none of this is explained in the original paper or anywhere on the internet, I can only speculate.

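The hypothesis is that, since negative samples are drawn from all over the text and are only loosely weighted by frequency, you can end up picking any word, and quite often a word whose vector has barely been trained at all. If that vector carried a value, it would push the focus word, the one that actually matters, in a random direction.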

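The solution is to set all negative-sample vectors to zero, so that only vectors that have occurred reasonably often affect the representation of another vector.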
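To make the speculation concrete, here is what the update rules from the C excerpt give for a focus vector $v_f$ and a negative sample whose context vector $v_{neg}$ is still at its initial value of zero. The symbols $v_f$, $v_{neg}$, $\sigma$ and $\alpha$ (the learning rate, alpha in the code) are notation introduced here, not taken from the original.

$$
f = v_f \cdot v_{neg} = 0, \qquad \sigma(f) = \tfrac{1}{2}, \qquad g = \left(0 - \tfrac{1}{2}\right)\alpha = -\tfrac{\alpha}{2}
$$

$$
\Delta\,\mathrm{neu1e} = g\, v_{neg} = 0, \qquad \Delta v_{neg} = g\, v_f = -\tfrac{\alpha}{2}\, v_f
$$

So an untrained negative contributes nothing to the focus word's buffered gradient, while the negative's own vector is nudged away from the focus vector. If $v_{neg}$ had instead started as random noise, the first term would be $-\sigma(f)\,\alpha\, v_{neg}$, a small push of the focus word in an arbitrary direction.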

This is really quite clever, and until now I had never thought about how much initialization strategies matter.

Why I'm writing this

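I spent two months trying to reproduce word2vec as described in the original paper and in countless articles on the internet, and failed. However hard I tried, I could not match the results of word2vec.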

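I could not imagine that the authors of the paper had literally made up an algorithm that does not work, while the implementation does something completely different.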

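Eventually, I decided to read the source. For three whole days I was convinced I was misreading the code, because literally everyone on the internet described a different implementation.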

I don't know why the original paper and the articles on the internet say nothing about how word2vec actually works, so I decided to publish this myself.

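This also explains GloVe's radical choice of keeping a separate vector for the negative context: they simply did what word2vec does, but told people about it :).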

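Is this academic dishonesty? I don't know; that is a hard question. But frankly, I am very angry. I will probably never again take the explanation of an algorithm in a machine learning paper seriously: next time, I will go straight to the source code.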
