The first CS224n assignment, covering softmax, neural network basics, and word vectors.

Softmax

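Softmax invariance to constant shifts

\mathrm{softmax}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}

To keep very large inputs from overflowing the exponential, a constant is added to every input before computing softmax. In practice the largest entry is subtracted.

\mathrm{softmax}(x) = \mathrm{softmax}(x + c)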
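A short derivation shows why the shift leaves the result unchanged: the factor $e^{c}$ appears in both the numerator and the denominator and cancels. With $c = -\max_{j} x_{j}$, every exponent is at most 0, so no term can overflow.

```latex
\begin{aligned}
\mathrm{softmax}(x + c)_{i}
  &= \frac{e^{x_{i} + c}}{\sum_{j} e^{x_{j} + c}}
   = \frac{e^{c} \, e^{x_{i}}}{e^{c} \sum_{j} e^{x_{j}}} \\
  &= \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}
   = \mathrm{softmax}(x)_{i}
\end{aligned}
```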

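Key code

python

import numpy as np


def softmax(x):
    # Subtract each row's max before exponentiating to avoid overflow
    exp_func = lambda row: np.exp(row - np.max(row))
    # Reciprocal of each row's sum, used as the normalizer
    sum_func = lambda row: 1.0 / np.sum(row)
    x = np.apply_along_axis(exp_func, -1, x)
    denom = np.apply_along_axis(sum_func, -1, x)
    denom = denom[..., np.newaxis]
    x = x * denom
    return x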
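To see the effect of the max subtraction, here is a minimal sketch that compares a naive softmax with a shifted one on large inputs. The name `softmax_stable` and the sample values are assumptions made for this illustration, and the function is a vectorized variant rather than the code above.

```python
import numpy as np


def softmax_stable(x):
    """Softmax over the last axis, subtracting the max before exponentiating."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


small = np.array([1.0, 2.0])
big = small + 1000.0  # same vector shifted by a constant

p_small = softmax_stable(small)
p_big = softmax_stable(big)

# The naive formula overflows: exp(1001) is inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.sum(np.exp(big))

print(p_small, p_big, naive)
```

The two stable results agree, which is the shift invariance in action, while the naive version returns only `nan` values.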

Neural network basics

Sigmoid implementation

My sigmoid notes

\begin{aligned} \sigma (z) &= \frac{1}{1 + \exp (- z)}, \quad \sigma (z) \in (0, 1) \\ \sigma^{'} (z) &= \sigma (z) (1 - \sigma (z)) \end{aligned}

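Key code

python

import numpy as np


def sigmoid(x):
    # Elementwise logistic function
    s = 1.0 / (1 + np.exp(-x))
    return s


def sigmoid_grad(s):
    """ Gradient of the sigmoid, computed from the sigmoid output s = sigmoid(x)
    """
    ds = s * (1 - s)
    return ds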
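Note that `sigmoid_grad` expects the sigmoid output, not the raw input. A small sketch, with sample points chosen only for illustration, compares it against a central finite difference:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(s):
    # s must already be sigmoid(x)
    return s * (1.0 - s)


x = np.array([-2.0, 0.0, 3.0])
eps = 1e-6

# Central difference approximation of d sigmoid / dx
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_grad(sigmoid(x))
max_err = np.max(np.abs(numeric - analytic))
print(analytic, max_err)
```

At `x = 0` the sigmoid is 0.5, so its gradient is 0.25, the maximum value the derivative can take.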

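Softmax gradient

Cross-entropy and softmax are defined as follows, where $\theta$ is the input to the softmax and $y$ is the true one-hot label vector.

\begin{aligned} \mathrm{CE}(y, \hat{y}) &= - \sum_{i} y_{i} \times \log (\hat{y}_{i}) \\ \hat{y} &= \mathrm{softmax}(\theta) \end{aligned}

Differentiating softmax

Introduce the notation:

\begin{aligned} f_{i} &= e^{\theta_{i}} && \text{numerator} \\ g_{i} &= \sum_{k = 1}^{K} e^{\theta_{k}} && \text{denominator, independent of } i \\ \hat{y}_{i} = S_{i} &= \frac{f_{i}}{g_{i}} && \text{softmax} \end{aligned}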

The gradient of $S_{i}$ with respect to a single input $\theta_{j}$ is then:

\frac{\partial S_{i}}{\partial \theta_{j}} = \frac{f_{i}^{'} g_{i} - f_{i} g_{i}^{'}}{g_{i}^{2}}

where the two derivatives are

f_{i}^{'} (\theta_{j}) = \begin{cases} e^{\theta_{j}}, & i = j \\ 0, & i \neq j \end{cases}

g_{i}^{'} (\theta_{j}) = e^{\theta_{j}}

When $i = j$:

\begin{aligned} \frac{\partial S_{i}}{\partial \theta_{j}} & = \frac{e^{\theta_{j}} \cdot \sum_{k} e^{\theta_{k}} - e^{\theta_{i}} \cdot e^{\theta_{j}}}{{(\sum_{k} e^{\theta_{k}})}^{2}} \\ & = \frac{e^{\theta_{j}}}{\sum_{k} e^{\theta_{k}}} \cdot (1 - \frac{e^{\theta_{j}}}{\sum_{k} e^{\theta_{k}}}) \\ & = S_{i} \cdot (1 - S_{i}) \end{aligned}

When $i \neq j$:

\begin{aligned} \frac{\partial S_{i}}{\partial \theta_{j}} & = \frac{- e^{\theta_{i}} \cdot e^{\theta_{j}}}{{(\sum_{k} e^{\theta_{k}})}^{2}} = - S_{i} \cdot S_{j} \end{aligned}
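Both cases combine into one Jacobian matrix, $J_{ij} = S_{i} (\delta_{ij} - S_{j})$, that is, $\mathrm{diag}(S) - S S^{T}$. The sketch below builds that matrix and compares it with central finite differences; the helper name `softmax_jacobian` and the test vector are assumptions made for this illustration.

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()


def softmax_jacobian(theta):
    """J[i, j] = dS_i / dtheta_j = S_i * (delta_ij - S_j)."""
    s = softmax(theta)
    return np.diag(s) - np.outer(s, s)


theta = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(theta)

# Numerical Jacobian, one column per perturbed input theta_j
eps = 1e-6
J_num = np.zeros((theta.size, theta.size))
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    J_num[:, j] = (softmax(theta + step) - softmax(theta - step)) / (2 * eps)

jac_err = np.max(np.abs(J - J_num))
print(jac_err)
```

Each column of the Jacobian sums to zero, because the softmax outputs always sum to 1 regardless of the input.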

Cross-entropy gradient

\begin{aligned} \mathrm{CE}(y, \hat{y}) &= - \sum_{i} y_{i} \times \log (\hat{y}_{i}) \\ \hat{y} &= S (\theta) \end{aligned}

Keeping only the term that matters, i.e. substituting $y_{i} = 1$ for the label index $i$:

\begin{aligned} \frac{\partial \mathrm{CE}}{\partial \theta_{i}} & = - \frac{\partial \log \hat{y}_{i}}{\partial \theta_{i}} = - \frac{1}{\hat{y}_{i}} \cdot \frac{\partial \hat{y}_{i}}{\partial \theta_{i}} \\ & = - \frac{1}{S_{i}} \cdot \frac{\partial S_{i}}{\partial \theta_{i}} = S_{i} - 1 \\ & = \hat{y}_{i} - y_{i} \end{aligned}

Differentiating without the substitution:

\begin{aligned} \frac{\partial \mathrm{CE}}{\partial \theta_{i}} & = - \sum_{k} y_{k} \times \frac{\partial \log S_{k}}{\partial \theta_{i}} \\ & = - \sum_{k} y_{k} \times \frac{1}{S_{k}} \times \frac{\partial S_{k}}{\partial \theta_{i}} \\ & = - y_{i} (1 - S_{i}) - \sum_{k \neq i} y_{k} \cdot \frac{1}{S_{k}} \cdot (- S_{i} \cdot S_{k}) \\ & = - y_{i} (1 - S_{i}) + \sum_{k \neq i} y_{k} \cdot S_{i} \\ & = S_{i} \sum_{k} y_{k} - y_{i} \\ & = S_{i} - y_{i} \end{aligned}

The last step uses $\sum_{k} y_{k} = 1$, since $y$ is one-hot.

So the derivative of the cross-entropy is

\frac{\partial \mathrm{CE}}{\partial \theta_{i}} = \hat{y}_{i} - y_{i}, \quad \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial \theta} = \hat{y} - y

that is,

\frac{\partial \mathrm{CE}(y, \hat{y})}{\partial \theta_{i}} = \begin{cases} \hat{y}_{i} - 1, & i \text{ is the label} \\ \hat{y}_{i}, & \text{otherwise} \end{cases}
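The result $\hat{y} - y$ is easy to confirm numerically. The following sketch, with an arbitrary input vector and label chosen for illustration, compares the analytic gradient with central finite differences of the loss:

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - np.max(theta))
    return e / e.sum()


def cross_entropy(theta, y):
    # CE(y, softmax(theta)) for a one-hot y
    return -np.sum(y * np.log(softmax(theta)))


theta = np.array([1.0, -0.5, 0.3, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0])  # label is index 2

analytic = softmax(theta) - y

eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    step = np.zeros_like(theta)
    step[i] = eps
    numeric[i] = (cross_entropy(theta + step, y)
                  - cross_entropy(theta - step, y)) / (2 * eps)

ce_err = np.max(np.abs(analytic - numeric))
print(analytic, ce_err)
```

Only the label entry of the gradient is negative; every other entry equals the predicted probability for that class.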

A simple network

Forward pass

\begin{aligned} z_{1} &= x W_{1} + b_{1} \\ h &= \mathrm{sigmoid}(z_{1}) \\ z_{2} &= h W_{2} + b_{2} \\ \hat{y} &= \mathrm{softmax}(z_{2}) \end{aligned}

Key code:

python

def forward_backward_prop(data, labels, params, dimensions):
    # Excerpt: W1, b1, W2, b2 are assumed to be already unpacked from params
    h = sigmoid(np.dot(data, W1) + b1)
    yhat = softmax(np.dot(h, W2) + b2)

Loss function

J = \mathrm{CE}(y, \hat{y})

Key code:

python

def forward_backward_prop(data, labels, params, dimensions):
    # yhat[labels == 1] is boolean-mask indexing; see my numpy_api.ipynb
    cost = np.sum(-np.log(yhat[labels == 1])) / data.shape[0]

Backpropagation

\begin{aligned} \delta_{2} &= \frac{\partial J}{\partial z_{2}} = \hat{y} - y \\ \frac{\partial J}{\partial h} &= \delta_{2} \cdot \frac{\partial z_{2}}{\partial h} = \delta_{2} W_{2}^{T} \\ \delta_{1} &= \frac{\partial J}{\partial z_{1}} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z_{1}} = \delta_{2} W_{2}^{T} \circ \sigma^{'} (z_{1}) \\ \frac{\partial J}{\partial x} &= \delta_{1} W_{1}^{T} \end{aligned}

In total there are $(d_{x} + 1) \cdot d_{h} + (d_{h} + 1) \cdot d_{y}$ parameters, where $d_{x}$, $d_{h}$, and $d_{y}$ are the input, hidden, and output dimensions.
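As a worked example with illustrative sizes (assumed here, not taken from the assignment) $d_{x} = 10$, $d_{h} = 5$, $d_{y} = 10$: $W_{1}$ and $b_{1}$ contribute $(10 + 1) \cdot 5$ parameters, and $W_{2}$ and $b_{2}$ contribute $(5 + 1) \cdot 10$.

```latex
(d_{x} + 1) \cdot d_{h} + (d_{h} + 1) \cdot d_{y} = 11 \cdot 5 + 6 \cdot 10 = 55 + 60 = 115
```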

Key code:

python

def forward_backward_prop(data, labels, params, dimensions):
    # Softmax cross-entropy gradient derived above, averaged over the batch
    gradyhat = (yhat - labels) / data.shape[0]
    # Chain rule
    gradW2 = np.dot(h.T, gradyhat)
    # The local derivative is 1, so sum over the batch axis (axis 0)
    gradb2 = np.sum(gradyhat, axis=0, keepdims=True)
    gradh = np.dot(gradyhat, W2.T)
    # sigmoid_grad takes the sigmoid output h, not z1
    gradz1 = gradh * sigmoid_grad(h)
    gradW1 = np.dot(data.T, gradz1)
    gradb1 = np.sum(gradz1, axis=0, keepdims=True)

    # Flatten in the order W1, b1, W2, b2
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
            gradW2.flatten(), gradb2.flatten()))
    return cost, grad
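The excerpts above can be assembled into one runnable function. This is a minimal sketch under stated assumptions: the flat `params` vector is taken to be laid out as W1, b1, W2, b2 (matching the order of the concatenated gradient), `dimensions` is taken to be the tuple `(d_x, d_h, d_y)`, and the random test data is made up for the check.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


def forward_backward_prop(data, labels, params, dimensions):
    """Cost and flat gradient for a sigmoid-hidden, softmax-output network."""
    d_x, d_h, d_y = dimensions
    # Assumed layout of the flat vector: W1, b1, W2, b2, in that order
    W1_end = d_x * d_h
    b1_end = W1_end + d_h
    W2_end = b1_end + d_h * d_y
    W1 = params[:W1_end].reshape(d_x, d_h)
    b1 = params[W1_end:b1_end].reshape(1, d_h)
    W2 = params[b1_end:W2_end].reshape(d_h, d_y)
    b2 = params[W2_end:].reshape(1, d_y)

    # Forward pass and mean cross-entropy loss
    n = data.shape[0]
    h = sigmoid(data @ W1 + b1)
    yhat = softmax(h @ W2 + b2)
    cost = -np.sum(np.log(yhat[labels == 1])) / n

    # Backward pass
    grad_z2 = (yhat - labels) / n
    grad_W2 = h.T @ grad_z2
    grad_b2 = grad_z2.sum(axis=0, keepdims=True)
    grad_z1 = (grad_z2 @ W2.T) * h * (1.0 - h)
    grad_W1 = data.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0, keepdims=True)
    grad = np.concatenate(
        [g.ravel() for g in (grad_W1, grad_b1, grad_W2, grad_b2)])
    return cost, grad


# Made-up problem: 6 samples, 4 inputs, 3 hidden units, 5 classes
rng = np.random.default_rng(0)
dims = (4, 3, 5)
n_params = (dims[0] + 1) * dims[1] + (dims[1] + 1) * dims[2]
data = rng.normal(size=(6, dims[0]))
labels = np.zeros((6, dims[2]))
labels[np.arange(6), rng.integers(0, dims[2], size=6)] = 1
params = rng.normal(size=n_params)

cost, grad = forward_backward_prop(data, labels, params, dims)

# Central-difference check of every parameter
eps = 1e-5
numeric = np.zeros_like(params)
for i in range(params.size):
    step = np.zeros_like(params)
    step[i] = eps
    numeric[i] = (forward_backward_prop(data, labels, params + step, dims)[0]
                  - forward_backward_prop(data, labels, params - step, dims)[0]) / (2 * eps)
net_err = np.max(np.abs(numeric - grad))
print(cost, net_err)
```

The gradient vector has exactly as many entries as the parameter count formula predicts, here $(4 + 1) \cdot 3 + (3 + 1) \cdot 5 = 35$.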

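Gradient check

My gradient check notes

python

import random

import numpy as np


def gradcheck_naive(f, x):
    """ Compare the analytic gradient of f at x with central differences.
    f takes x and returns (cost, gradient); x is perturbed in place and restored.
    """
    # Save the random state so every call to f sees the same randomness
    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4        # Do not change this!
    # Iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        # Key code: central difference (f(x + h) - f(x - h)) / (2h)
        x[ix] += h
        random.setstate(rndstate)
        new_f1 = f(x)[0]
        x[ix] -= 2 * h
        random.setstate(rndstate)
        new_f2 = f(x)[0]
        x[ix] += h  # Restore the original value
        numgrad = (new_f1 - new_f2) / (2 * h)

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("Gradient check failed.")
            print("First gradient error found at index %s" % str(ix))
            print("Your gradient: %f \t Numerical gradient: %f" % (
                grad[ix], numgrad))
            return

        it.iternext() # Step to next dimension

    print("Gradient check passed!")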
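A usage sketch may help here. The compact checker below is a simplified stand-in for `gradcheck_naive` (an assumption for this example: it returns a boolean instead of printing and skips the random-state handling), applied to a quadratic whose gradient is known exactly and to a deliberately wrong gradient.

```python
import numpy as np


def gradcheck(f, x, h=1e-4, tol=1e-5):
    """Return True if f's analytic gradient matches central differences at x."""
    _, grad = f(x)
    for ix in np.ndindex(x.shape):
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)[0]
        x[ix] = old - h
        f_minus = f(x)[0]
        x[ix] = old  # restore the original entry
        numgrad = (f_plus - f_minus) / (2 * h)
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > tol:
            return False
    return True


def quad(x):
    # f(x) = sum(x^2), gradient 2x
    return np.sum(x ** 2), 2 * x


def wrong(x):
    # Same cost, but a deliberately incorrect gradient
    return np.sum(x ** 2), 3 * x


ok = gradcheck(quad, np.array([[1.0, -2.0], [0.5, 3.0]]))
bad = gradcheck(wrong, np.array([[1.0, -2.0], [0.5, 3.0]]))
print(ok, bad)
```

The relative difference is normalized by `max(1, |numgrad|, |grad|)`, so small gradients are compared on an absolute scale and large ones on a relative scale.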

Word2Vec

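My word2vec notes

Gradients of the word vectors

Notation

$v_{c}$: center word vector (input vector), a row of $V \in R^{W \times d}$
$u_{o}$: context word vector (output vector), a column of $U = [u_{1}, u_{2}, \dots, u_{W}] \in R^{d \times W}$

Forward pass

Predict the probability that $o$ is a context word of $c$, where $o$ is the correct word:

\hat{y}_{o} = p (o \mid c) = \mathrm{softmax}(U^{T} v_{c})_{o} = \frac{\exp (u_{o}^{T} v_{c})}{\sum_{w} \exp (u_{w}^{T} v_{c})}

Score vector:

z = U^{T} \cdot v_{c}, \quad [W \times d] \cdot [d] \rightarrow z \in R^{W}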

Loss and gradients

J_{\text{softmax-CE}} (v_{c}, o, U) = \mathrm{CE}(y, \hat{y}), \quad \text{where } \frac{\partial \mathrm{CE}(y, \hat{y})}{\partial \theta} = \hat{y} - y

Gradient	Meaning	Computation	Dimension
$\frac{\partial J}{\partial z}$	softmax input	$\hat{y} - y$	$W$
$\frac{\partial J}{\partial v_{c}}$	center word	$\frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial v_{c}} = (\hat{y} - y) \cdot U^{T}$	$d$
$\frac{\partial J}{\partial U}$	context words	$\frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial U} = v_{c}^{T} \cdot (\hat{y} - y)$, with vectors treated as rows	$d \times W$

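Key code

python

import numpy as np

# softmax is the function defined in the Softmax section


def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ Softmax cost function for word2vec models
    Args:
        predicted: center word vector vc
        target: index of the context word uo
        outputVectors: output (context) matrix U, shape W*d, not transposed
        dataset: not used in this function
    Returns:
        cost: cross-entropy loss
        gradv: gradient w.r.t. vc, a 1-D vector of length d
        gradU: gradient w.r.t. outputVectors, shape W*d
    """
    vhat = predicted
    z = np.dot(outputVectors, vhat)
    preds = softmax(z)
    #  Calculate the cost:
    cost = -np.log(preds[target])
    #  Gradients: dJ/dz = yhat - y
    gradz = preds.copy()
    gradz[target] -= 1.0
    # Both gradients propagate gradz, not the raw scores z
    gradU = np.outer(gradz, vhat)
    gradv = np.dot(outputVectors.T, gradz)
    return cost, gradv, gradU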
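The gradient formulas from the table can be verified numerically. The sketch below is a self-contained restatement under assumptions made for this example: the name `softmax_cost_and_grad`, the sizes `W = 5` and `d = 3`, and the random vectors are all hypothetical, and the output matrix is stored with one word vector per row (shape `W x d`), as in the code above.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_cost_and_grad(v_c, target, U_rows):
    """U_rows has shape (W, d): one output vector per row."""
    y_hat = softmax(U_rows @ v_c)
    cost = -np.log(y_hat[target])
    grad_z = y_hat.copy()
    grad_z[target] -= 1.0            # dJ/dz = y_hat - y
    grad_v = U_rows.T @ grad_z       # shape (d,)
    grad_U = np.outer(grad_z, v_c)   # shape (W, d)
    return cost, grad_v, grad_U


rng = np.random.default_rng(1)
W, d = 5, 3
v_c = rng.normal(size=d)
U_rows = rng.normal(size=(W, d))
target = 2

cost, grad_v, grad_U = softmax_cost_and_grad(v_c, target, U_rows)

eps = 1e-6

# Numerical gradient with respect to the center vector
num_v = np.zeros(d)
for k in range(d):
    step = np.zeros(d)
    step[k] = eps
    num_v[k] = (softmax_cost_and_grad(v_c + step, target, U_rows)[0]
                - softmax_cost_and_grad(v_c - step, target, U_rows)[0]) / (2 * eps)

# Numerical gradient with respect to every entry of the output matrix
num_U = np.zeros((W, d))
for w in range(W):
    for k in range(d):
        step = np.zeros((W, d))
        step[w, k] = eps
        num_U[w, k] = (softmax_cost_and_grad(v_c, target, U_rows + step)[0]
                       - softmax_cost_and_grad(v_c, target, U_rows - step)[0]) / (2 * eps)

err_v = np.max(np.abs(grad_v - num_v))
err_U = np.max(np.abs(grad_U - num_U))
print(err_v, err_U)
```

Using the raw scores `z` in place of `grad_z` in either gradient line would make this check fail, which is a quick way to catch that kind of slip.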
