Mathematical Background

This section summarises the mathematical background behind embedding models such as StarSpace [WFC+17] and item2vec [BK16].

The Pair Paradigm

Item pairs are at the center of word2vec [MCCD13] and its derivatives. Instead of processing a whole sequence at once, only two items are considered at each step. This section discusses how these pairs are selected and what they represent.

Input-Output

The most straightforward way to define an item pair is in the supervised case. The left-hand side is the input (a.k.a. feature item) and the right-hand side is the output (a.k.a. label item).
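
For illustration only (the item names below are invented), such pairs can be represented as plain tuples:

    # Hypothetical supervised pairs: (input item, label item).
    # Each tuple is one training example for the pair-based model.
    pairs = [
        ("user_42", "genre_scifi"),
        ("user_42", "genre_thriller"),
        ("user_17", "genre_romance"),
    ]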

Skip-Gram

In the skip-gram setting of word2vec [MCCD13], pairs are extracted from a sequence: one side is a center item and the other is an item appearing within a fixed-size window around it, so a single sequence yields many (center, context) pairs.
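
A minimal sketch of this extraction (the function name and window size are illustrative, not taken from any particular library):

    def skipgram_pairs(sequence, window=2):
        """Return (center, context) pairs from a sequence of items."""
        pairs = []
        for i, center in enumerate(sequence):
            lo = max(0, i - window)
            hi = min(len(sequence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, sequence[j]))
        return pairs

    # A short toy sequence produces one pair per (center, in-window neighbour) combination.
    print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))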

Why Negative Sampling?

Softmax Formulation

Let \((a, b)\) be a pair of items, where \(a \in A\) is the source and \(b \in B\) is the target. Their actual meaning depends on the use case, as discussed above.

The conditional probability of observing \(b\) given \(a\) is defined by a softmax over all possible targets, since this is a regular multi-class classification task:

\[P(b \mid a ; \mathbf{u}, \mathbf{v}) = \frac{e^{\mathbf{u}_a^T \mathbf{v}_b}}{\sum_{b'} e^{\mathbf{u}_a^T \mathbf{v}_{b'}}}\]

The loss is the negative log-likelihood:

\[\mathcal{L} (a, b ; \mathbf{u}, \mathbf{v}) = -\log P(b \mid a ; \mathbf{u}, \mathbf{v}) = -\mathbf{u}_a^T \mathbf{v}_b + \log \sum_{b'} e^{\mathbf{u}_a^T \mathbf{v}_{b'}}\]

and its gradient with respect to \(\mathbf{u}_a\) is:

\[\frac{\partial}{\partial \mathbf{u}_a} \mathcal{L} (a, b ; \mathbf{u}, \mathbf{v}) = -\mathbf{v}_b + \sum_{b'} P(b' \mid a ; \mathbf{u}, \mathbf{v}) \mathbf{v}_{b'}\]

However, this implies a summation over every \(b' \in B\), which is computationally expensive for large vocabularies.
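
A minimal NumPy sketch of this loss and gradient, which makes the \(O(\vert B \vert)\) cost explicit (the embedding matrices, sizes and initialisation below are arbitrary assumptions, not part of any specific implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, d = 1000, 50000, 64                    # number of sources, targets, embedding size
    U = rng.normal(scale=0.1, size=(A, d))       # source embeddings u_a
    V = rng.normal(scale=0.1, size=(B, d))       # target embeddings v_b

    def softmax_loss_and_grad(a, b):
        """Negative log-likelihood and its gradient w.r.t. u_a (sums over all of B)."""
        scores = V @ U[a]                        # u_a^T v_b' for every b': O(|B| d)
        scores -= scores.max()                   # numerical stability; softmax is shift-invariant
        probs = np.exp(scores) / np.exp(scores).sum()
        loss = -np.log(probs[b])
        grad_u = -V[b] + probs @ V               # -v_b + sum_b' P(b' | a) v_b'
        return loss, grad_u

    loss, grad_u = softmax_loss_and_grad(a=3, b=7)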

Noise Contrastive Estimation Formulation

Noise Contrastive Estimation (Gutmann and Hyvärinen [GH10]) was applied by Mnih and Teh [MT12] as a stable sampling method that reduces the cost of the softmax computation. In a nutshell, the model is trained to distinguish observed (positive) samples from random noise. Logistic regression is used to minimize the negative log-likelihood, i.e. the cross-entropy of our training example against the \(k\) noise samples:

\[\mathcal{L} (a, b) = - \log P(y = 1 \mid a, b) + k \mathbb{E}_{b' \sim Q}\left[ - \log P(y = 0 \mid a, b') \right]\]

To avoid computing this expectation over the whole vocabulary, a Monte Carlo approximation is applied. \(B^* \subseteq B\), with \(\vert B^* \vert = k\), is the set of random samples drawn from \(Q\) used to estimate it; note that the factor \(k\) cancels against the \(1/k\) of the sample mean:

\[\mathcal{L} (a, b) = - \log P(y = 1 \mid a, b) - \sum_{b' \in B^* \subseteq B} \log P(y = 0 \mid a, b')\]

We are effectively generating samples from two different distributions: positive pairs are sampled from the empirical training distribution, while negative pairs come from the noise distribution \(Q\). Since one positive sample is drawn for every \(k\) noise samples, the joint probabilities are:

\[P(y = 1, b \mid a) = \frac{1}{k + 1} P(b \mid a) \qquad P(y = 0, b \mid a) = \frac{k}{k + 1} Q(b)\]

Hence, by Bayes' rule, the probability that a given sample came from the training distribution is:

\[P(y = 1 \mid a, b) = \frac{P(b \mid a)}{P(b \mid a) + k Q(b)}\]
\[P(y = 0 \mid a, b) = 1 - P(y = 1 \mid a, b)\]
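
As a tiny sketch of this ratio (the probability values and \(k\) below are made up):

    def posterior_positive(p_b_given_a, q_b, k):
        """P(y = 1 | a, b): probability that the pair came from the data rather than the noise."""
        return p_b_given_a / (p_b_given_a + k * q_b)

    # Made-up numbers: model probability 0.02, noise probability 1e-4, k = 25 noise samples.
    print(posterior_positive(0.02, 1e-4, 25))    # ~0.89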

However, \(P(b \mid a)\) is still defined as a softmax:

\[P(b \mid a ; \mathbf{u}, \mathbf{v}) = \frac{e^{\mathbf{u}_a^T \mathbf{v}_b}}{\sum_{b'} e^{\mathbf{u}_a^T \mathbf{v}_{b'}}}\]

Both Mnih and Teh [MT12] and Vaswani et al. [VZFC13] arbitrarily set the denominator to 1. The underlying idea is that, instead of explicitly computing this value, it could be defined as a trainable parameter. Zoph et al. [ZVMK16] actually report that even when trained, it stays close to 1 with a low variance.

Hence:

\[P(b \mid a ; \mathbf{u}, \mathbf{v}) = e^{\mathbf{u}_a^T \mathbf{v}_b}\]

Plugging this unnormalized probability into the posterior above gives:

\[P(y = 1 \mid a, b ; \mathbf{u}, \mathbf{v}) = \frac{e^{\mathbf{u}_a^T \mathbf{v}_b}}{e^{\mathbf{u}_a^T \mathbf{v}_b} + k Q(b)}\]

and the negative log-likelihood can then be computed as usual:

\[\mathcal{L} (a, b ; \mathbf{u}, \mathbf{v}) = -\log P(y = 1 \mid a, b ; \mathbf{u}, \mathbf{v}) - \sum_{b' \in B^* \subseteq B} \log P(y = 0 \mid a, b' ; \mathbf{u}, \mathbf{v})\]

Mnih and Teh [MT12] report that using \(k = 25\) is sufficient to match the performance of the regular softmax.
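
A minimal NumPy sketch of this objective (embedding matrices as in the softmax sketch above; \(Q\) is taken uniform here purely for simplicity, whereas language models typically use a unigram-based noise distribution):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, d, k = 1000, 50000, 64, 25
    U = rng.normal(scale=0.1, size=(A, d))
    V = rng.normal(scale=0.1, size=(B, d))
    Q = np.full(B, 1.0 / B)                      # noise distribution (uniform for simplicity)

    def nce_loss(a, b):
        """NCE loss for one positive pair (a, b) and k noise samples drawn from Q."""
        negatives = rng.choice(B, size=k, p=Q)

        def p_pos(target):                       # P(y = 1 | a, target) with unnormalized e^{u.v}
            s = np.exp(V[target] @ U[a])
            return s / (s + k * Q[target])

        positive_term = -np.log(p_pos(b))
        noise_term = -np.sum(np.log(1.0 - p_pos(negatives)))   # P(y = 0) = 1 - P(y = 1)
        return positive_term + noise_term

    print(nce_loss(a=3, b=7))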

Negative Sampling Formulation

Negative Sampling, popularised by Mikolov et al. [MSC+13], can be seen as an approximation of NCE. As defined previously, NCE is based on the following:

\[P(y = 1 \mid a, b ; \mathbf{u}, \mathbf{v}) = \frac{e^{\mathbf{u}_a^T \mathbf{v}_b}}{e^{\mathbf{u}_a^T \mathbf{v}_b} + k Q(b)}\]

Negative Sampling simplifies this computation by replacing \(k Q(b)\) with 1. Note that \(k Q(b) = 1\) holds exactly when \(k = \vert B \vert\) (i.e. \(B^* = B\)) and \(Q\) is the uniform distribution.

\[P(y = 1 \mid a, b ; \mathbf{u}, \mathbf{v}) = \frac{e^{\mathbf{u}_a^T \mathbf{v}_b}}{e^{\mathbf{u}_a^T \mathbf{v}_b} + 1} = \sigma \left( \mathbf{u}_a^T \mathbf{v}_b \right)\]

Hence:

\[P(a, b ; \mathbf{u}, \mathbf{v}) = \sigma \left( \mathbf{u}_a^T \mathbf{v}_b \right) \prod_{b' \in B^* \subseteq B} \left( 1 - \sigma \left( \mathbf{u}_a^T \mathbf{v}_{b'} \right) \right)\]
\[\mathcal{L} (a, b ; \mathbf{u}, \mathbf{v}) = -\log \sigma \left( \mathbf{u}_a^T \mathbf{v}_b \right) - \sum_{b' \in B^* \subseteq B} \log \left( 1 - \sigma \left( \mathbf{u}_a^T \mathbf{v}_{b'} \right) \right)\]

For more details, see Goldberg and Levy’s notes [GL14].
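
A minimal NumPy sketch of this loss (negatives are drawn uniformly here for brevity; word2vec itself draws them from a smoothed unigram distribution):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, d, k = 1000, 50000, 64, 5
    U = rng.normal(scale=0.1, size=(A, d))
    V = rng.normal(scale=0.1, size=(B, d))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def negative_sampling_loss(a, b):
        """-log sigma(u_a . v_b) - sum over negatives of log(1 - sigma(u_a . v_b'))."""
        negatives = rng.integers(0, B, size=k)            # uniform negatives (illustrative)
        positive_term = -np.log(sigmoid(V[b] @ U[a]))
        noise_term = -np.sum(np.log(1.0 - sigmoid(V[negatives] @ U[a])))
        return positive_term + noise_term

    print(negative_sampling_loss(a=3, b=7))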

To compute the gradient, let us rewrite the loss as:

\[\mathcal{L} (a, b ; \mathbf{u}, \mathbf{v}) = -\ell_{a, b, 1} - \sum_{b' \in B^* \subseteq B} \ell_{a, b', 0}\]

where \(y \in \{0, 1\}\) indicates whether the pair is treated as positive or negative and

\[\ell_{a, b, y} = y \log \sigma \left( \mathbf{u}_a^T \mathbf{v}_b \right) + \left( 1 - y \right) \log \left( 1 - \sigma \left( \mathbf{u}_a^T \mathbf{v}_b \right) \right)\]

Then, using \(\sigma'(x) = \sigma(x) \left( 1 - \sigma(x) \right)\) and the chain rule:

\[\begin{split}\begin{array}{lll} \frac{\partial}{\partial \mathbf{u}_a} \ell_{a, b, y} & = & \left( y \left( 1 - \sigma \left(\mathbf{u}_a^T \mathbf{v}_b \right) \right) - \left( 1 - y \right) \sigma \left(\mathbf{u}_a^T \mathbf{v}_b \right) \right) \mathbf{v}_b \\ & = & \left( y - \sigma \left(\mathbf{u}_a^T \mathbf{v}_b \right) \right) \mathbf{v}_b \end{array}\end{split}\]

And similarly:

\[\frac{\partial}{\partial \mathbf{v}_b} \ell_{a, b, y} = \left( y - \sigma \left(\mathbf{u}_a^T \mathbf{v}_b \right) \right) \mathbf{u}_a\]
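
These two expressions are all that is needed for a stochastic gradient step. A minimal sketch (learning rate, sampling scheme and variable names are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, d, k, lr = 1000, 50000, 64, 5, 0.05
    U = rng.normal(scale=0.1, size=(A, d))
    V = rng.normal(scale=0.1, size=(B, d))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgd_step(a, b):
        """One negative-sampling update for the pair (a, b)."""
        targets = np.concatenate(([b], rng.integers(0, B, size=k)))   # positive + k negatives
        labels = np.zeros(k + 1)
        labels[0] = 1.0                                               # y = 1 only for the positive pair
        errors = labels - sigmoid(V[targets] @ U[a])                  # (y - sigma(u_a . v)) per target
        grad_u = errors @ V[targets]                                  # sum of (y - sigma) * v over targets
        V[targets] += lr * errors[:, None] * U[a]                     # ascend ell, i.e. descend the loss
        U[a] += lr * grad_u
        return errors

    sgd_step(a=3, b=7)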

Additional Considerations

Normalization

By setting the denominator to 1, as proposed above, the model essentially learns to self-normalize. Devlin et al. [DZH+14] go further and add a squared penalty on the log of the normalization term to explicitly push it towards zero (i.e. the denominator towards 1). Andreas and Klein [AK15] suggest that normalizing only a fraction of the training examples is sufficient to obtain approximately self-normalising behaviour.
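
A sketch of the penalty idea, with an arbitrary weight alpha and the same illustrative matrices as before (the full denominator is only paid at training time; once the model self-normalizes it can be skipped at inference):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, d, alpha = 1000, 50000, 64, 0.1
    U = rng.normal(scale=0.1, size=(A, d))
    V = rng.normal(scale=0.1, size=(B, d))

    def self_normalized_loss(a, b):
        """Softmax NLL plus a squared penalty pushing the log-partition towards 0."""
        scores = V @ U[a]
        log_z = np.log(np.sum(np.exp(scores)))   # log of the softmax denominator
        nll = -(scores[b] - log_z)
        return nll + alpha * log_z ** 2          # encourages the denominator to stay close to 1

    print(self_normalized_loss(a=3, b=7))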

Item distribution balancing

In word2vec, Mikolov et al. [MSC+13] use a subsampling approach to reduce the impact of frequent words. Each word has a probability

\[P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}\]

of being discarded, where \(f(w_i)\) is its frequency and \(t\) is a chosen threshold, typically around \(10^{-5}\).
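
A small sketch of this rule (the toy corpus is invented and the threshold is set artificially high so the effect is visible on such a short sequence):

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
    t = 0.05                                     # toy threshold; ~1e-5 is typical on real corpora

    counts = Counter(corpus)
    freq = {w: c / len(corpus) for w, c in counts.items()}

    def keep(word):
        """Keep the word with probability 1 - P(w_i), i.e. sqrt(t / f(w_i)) capped at 1."""
        p_discard = max(0.0, 1.0 - np.sqrt(t / freq[word]))
        return rng.random() >= p_discard

    print([w for w in corpus if keep(w)])        # frequent words like "the" are dropped most often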

References

AK15

Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In Rada Mihalcea, Joyce Yue Chai, and Anoop Sarkar, editors, HLT-NAACL, 244–249. The Association for Computational Linguistics, 2015. URL: http://dblp.uni-trier.de/db/conf/naacl/naacl2015.html#AndreasK15.

BK16

Oren Barkan and Noam Koenigstein. Item2vec: neural item embedding for collaborative filtering. 2016. arXiv:1603.04259. URL: http://arxiv.org/abs/1603.04259.

DZH+14

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1370–1380. Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL: http://www.aclweb.org/anthology/P14-1129.

GL14

Yoav Goldberg and Omer Levy. Word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. 2014. arXiv:1402.3722. URL: http://arxiv.org/abs/1402.3722.

GH10

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Yee Whye Teh and D. Mike Titterington, editors, AISTATS, volume 9 of JMLR Proceedings, 297–304. JMLR.org, 2010. URL: http://dblp.uni-trier.de/db/journals/jmlr/jmlrp9.html#GutmannH10.

MCCD13

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, 2013. URL: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781.

MSC+13

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013. URL: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.

MT12

Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML. icml.cc / Omnipress, 2012. URL: http://dblp.uni-trier.de/db/conf/icml/icml2012.html#MnihT12.

VZFC13

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. Decoding with large-scale neural language models improves translation. In EMNLP, 1387–1392. ACL, 2013. URL: http://dblp.uni-trier.de/db/conf/emnlp/emnlp2013.html#VaswaniZFC13.

WFC+17

Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. StarSpace: embed all the things! 2017. arXiv:1709.03856. URL: http://arxiv.org/abs/1709.03856.

ZVMK16

Barret Zoph, Ashish Vaswani, Jonathan May, and Kevin Knight. Simple, fast noise-contrastive estimation for large rnn vocabularies. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, HLT-NAACL, 1217–1222. The Association for Computational Linguistics, 2016. URL: http://dblp.uni-trier.de/db/conf/naacl/naacl2016.html#ZophVMK16.