Linvis Blog

word2vec_paper_reading

2019-07-31

Continuous Bag-of-Words Model (CBOW)

One-word context

The vocabulary size is $V$, and the hidden layer size is $N$.

CBOW: the input is the context words, and the output is the target word.

First, let's look at a simple model with one input word and one output word.

For example, with the two words ['I', 'like'], the input is 'I' and the output is 'like'.

Here we use one-hot encoding, $\{x_1, x_2, \dots, x_V\}$, of the form $[0, 1, \dots, 0]$ (only one element is 1).

There are two weight matrices, $W$ ($V \times N$) and $W'$ ($N \times V$).

Forward propagation

To get the hidden layer:
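Writing $\mathbf v_{w_I}$ for the row of $W$ that corresponds to the input word $w_I$ (standard notation, introduced here for convenience):

$$\mathbf h = W^T \mathbf x = \mathbf v_{w_I}^T$$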

$\mathbf h$ is an $N \times 1$ vector.

Then, using the second weight matrix $W'$, we get an output vector $\mathbf u$:
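$$\mathbf u = W'^T \mathbf h$$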

$\mathbf u$ is $V \times 1$, the same size as $\mathbf x$.

Usually, we use softmax to turn $\mathbf u$ into probabilities:
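$$\mathbf y = \mathrm{softmax}(\mathbf u), \qquad y_j = p(w_j \mid w_I)$$

where $y_j$ is the probability that the output word is $w_j$ given the input word $w_I$.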

The above is the vector form; if we look at each element of these vectors, we get the following.

Here, $\mathbf x$ is one-hot, so $x_k = 1$ and $x_{k'} = 0$ for $k' \neq k$; multiplying it with $W^T$ simply selects the $k$-th row of $W$.

For example:
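As an illustration (with made-up sizes $V = 3$, $N = 2$, and $w_{ki}$ denoting the entries of $W$), suppose the input word is the second word in the vocabulary:

$$
\mathbf h = W^T \mathbf x =
\begin{bmatrix} w_{11} & w_{21} & w_{31} \\ w_{12} & w_{22} & w_{32} \end{bmatrix}
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} w_{21} \\ w_{22} \end{bmatrix}
= \mathbf v_{w_2}^T
$$

So multiplying by the one-hot vector just copies the second row of $W$.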

And for $\mathbf u$, each element should be:
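Writing $\mathbf v'_{w_j}$ for the $j$-th column of $W'$ (the output vector of word $w_j$):

$$u_j = \mathbf v'^{T}_{w_j} \mathbf h$$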

For $\mathbf y$:
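$$y_j = p(w_j \mid w_I) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$$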

Backward propagation

For the hidden-output vectors

According to maximum likelihood estimation, we want:
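$$\max p(w_O \mid w_I) = \max y_{j^*} = \max \log y_{j^*} = \max \left( u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) \right) := -E$$

where $j^*$ is the index of the actual output (target) word $w_O$ in the output layer.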

For example, with $y_{true} = [0, 1, 0, 0]$ and $y_{pred} = [0.1, 0.8, 0.01, 0.09]$, we have $j^* = 1$ and $y_{j^*} = 0.8$, and we want to maximize $y_{j^*}$.

where $E = -\log p\left(w_{O} \mid w_{I}\right)$ is our loss function, and we want to minimize it.

Let's take its derivative with respect to $u_j$:
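$$\frac{\partial E}{\partial u_j} = y_j - t_j := e_j$$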

Here, because the loss involves only $j^*$ while the derivative is taken over all $j$, we define $t_j = 1$ when $j = j^*$ and $t_j = 0$ otherwise.

Then, continuing on to the weights $w'_{ij}$:
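$$\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} \cdot \frac{\partial u_j}{\partial w'_{ij}} = e_j \cdot h_i$$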

That is good; we can use it in gradient descent, either element by element or in vector form:
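Writing $\eta > 0$ for the learning rate:

$$w'^{(\text{new})}_{ij} = w'^{(\text{old})}_{ij} - \eta \cdot e_j \cdot h_i$$

or, for the output vector of every word $w_j$:

$$\mathbf v'^{(\text{new})}_{w_j} = \mathbf v'^{(\text{old})}_{w_j} - \eta \cdot e_j \cdot \mathbf h, \qquad j = 1, 2, \dots, V$$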

For the input-hidden vectors
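Backpropagating the error to the hidden layer and then to $W$:

$$\frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} e_j \cdot w'_{ij} := \mathrm{EH}_i, \qquad \frac{\partial E}{\partial w_{ki}} = \mathrm{EH}_i \cdot x_k$$

Because $\mathbf x$ is one-hot, only the row of $W$ corresponding to the input word $w_I$ gets a non-zero gradient.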

For gradient descent:
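$$\mathbf v^{(\text{new})}_{w_I} = \mathbf v^{(\text{old})}_{w_I} - \eta \cdot \mathbf{EH}^T$$

As a sanity check of the whole one-word derivation, here is a minimal NumPy sketch of a single training step; the vocabulary size, hidden size, word indices, and learning rate are made-up illustrative values, not from the original post or paper.

```python
import numpy as np

# Minimal sketch of one training step of the one-word-context model.
# V, N, lr, k, j_star are illustrative values only.
V, N, lr = 5, 3, 0.1
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # input -> hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights

k, j_star = 2, 4                  # input word index, target word index
x = np.zeros(V); x[k] = 1.0       # one-hot input
t = np.zeros(V); t[j_star] = 1.0  # one-hot target

# forward propagation
h = W.T @ x                             # equals W[k], i.e. v_{w_I}
u = W_prime.T @ h                       # one score per vocabulary word
y = np.exp(u - u.max()); y /= y.sum()   # softmax

# backward propagation
e = y - t                          # dE/du_j = y_j - t_j
EH = W_prime @ e                   # dE/dh, computed before updating W'
W_prime -= lr * np.outer(h, e)     # dE/dw'_{ij} = e_j * h_i
W[k] -= lr * EH                    # only the input word's row is updated
```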

Multi-word context

We use the average of the context word vectors as the hidden layer.
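With $C$ context words $w_{I,1}, \dots, w_{I,C}$ and their one-hot vectors $\mathbf x_1, \dots, \mathbf x_C$:

$$\mathbf h = \frac{1}{C} W^T (\mathbf x_1 + \mathbf x_2 + \dots + \mathbf x_C) = \frac{1}{C} \left( \mathbf v_{w_{I,1}} + \mathbf v_{w_{I,2}} + \dots + \mathbf v_{w_{I,C}} \right)^T$$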

and the loss becomes:
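$$E = -\log p(w_O \mid w_{I,1}, \dots, w_{I,C}) = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'})$$

The hidden-output update is the same as in the one-word case; on the input side, every context word vector is updated, with the gradient divided by $C$:

$$\mathbf v^{(\text{new})}_{w_{I,c}} = \mathbf v^{(\text{old})}_{w_{I,c}} - \frac{1}{C} \, \eta \cdot \mathbf{EH}^T, \qquad c = 1, 2, \dots, C$$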

Skip-Gram Model

Different from CBOW, skip-gram uses one input word and multiple output words.

For example, suppose we have a word with one-hot encoding $[0, 0, 1, 0, 0]$, and 3 words associated with it,

so $\mathbf y$ will be $[1, 1, 0, 1, 0]$.

Forward propagation

The calculation from the input to the hidden layer is the same as in CBOW.

For hidden to output:
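For each of the $C$ output positions $c = 1, \dots, C$, the scores and probabilities are computed from the same $\mathbf h$:

$$u_{c,j} = u_j = \mathbf v'^{T}_{w_j} \mathbf h, \qquad c = 1, 2, \dots, C$$

$$y_{c,j} = p(w_{c,j} = w_{O,c} \mid w_I) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$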

Note: in Figure 3, it looks like there are C copies of $W'_{N \times V}$, but actually they are the same single $W'_{N \times V}$.

So why are there so many $\mathbf y$ vectors?

For example, as we said before, $\mathbf y$ is $[1, 1, 0, 1, 0]$ for the 3 associated words; if we split it into 3 vectors, we get $[1, 0, 0, 0, 0]$, $[0, 1, 0, 0, 0]$, and $[0, 0, 0, 1, 0]$.

This explains why we have C outputs in Figure 3 even though they share one weight matrix.

Backpropagation

For the loss, it changes to:
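$$E = -\log p(w_{O,1}, w_{O,2}, \dots, w_{O,C} \mid w_I) = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})$$

where $j_c^*$ is the index of the $c$-th actual output word.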

For the hidden-output vectors

Taking the derivative with respect to the scores, for each of the C output words:
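$$\frac{\partial E}{\partial u_{c,j}} = y_{c,j} - t_{c,j} := e_{c,j}$$

where $t_{c,j} = 1$ when the $j$-th word is the $c$-th actual output word, and $t_{c,j} = 0$ otherwise.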

So, summing over the C words:
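Define $\mathrm{EI}_j = \sum_{c=1}^{C} e_{c,j}$; then

$$\frac{\partial E}{\partial w'_{ij}} = \sum_{c=1}^{C} \frac{\partial E}{\partial u_{c,j}} \cdot \frac{\partial u_{c,j}}{\partial w'_{ij}} = \mathrm{EI}_j \cdot h_i$$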

For gradient descent, either element by element or in vector form:
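$$w'^{(\text{new})}_{ij} = w'^{(\text{old})}_{ij} - \eta \cdot \mathrm{EI}_j \cdot h_i$$

or

$$\mathbf v'^{(\text{new})}_{w_j} = \mathbf v'^{(\text{old})}_{w_j} - \eta \cdot \mathrm{EI}_j \cdot \mathbf h, \qquad j = 1, 2, \dots, V$$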

For the input-hidden vectors:
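$$\mathbf v^{(\text{new})}_{w_I} = \mathbf v^{(\text{old})}_{w_I} - \eta \cdot \mathbf{EH}^T$$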

where $\mathbf{EH}$ is an N-dimensional vector, each component of which is defined as:
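$$\mathrm{EH}_i = \sum_{j=1}^{V} \mathrm{EI}_j \cdot w'_{ij}$$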
