Four figures come first: the DCN Encoder and Decoder, then the DCN+ Encoder and Objective. A detailed summary follows.

Dynamic Coattention Networks For Question Answering

DCN+: Mixed Objective and Deep Residual Coattention for Question Answering

DCN

Coattention Encoder

Dynamic Pointing Decoder

HMN

DCN+

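Problems with DCN

The loss does not capture what the answer means

DCN is optimized with a conventional cross-entropy loss, which only measures how exactly the predicted answer string matches the ground truth. Human evaluation, however, judges what the answer means. Considering only the exact span leads to two cases:

Exact answers: unaffected.
Answers that overlap the ground-truth span without matching it exactly may be treated as wrong.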

Sentence: Some believe that the Golden State Warriors team of 2017 is one of the greatest teams in NBA history

Question: which team is considered to be one of the greatest teams in NBA history

Ground-truth answer: the Golden State Warriors team of 2017

"Warriors" is also a correct answer, yet conventional cross entropy rates it no better than "history".

DCN does not connect optimization to evaluation, which is based on word overlap.
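
To make the word-overlap point concrete, here is a minimal sketch of a token-level F1 score applied to the example above. The whitespace tokenization and lowercasing are simplifying assumptions, and `token_f1` is a name chosen for illustration, not something taken from the papers.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Word-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection: a shared token counts as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "the Golden State Warriors team of 2017"
print(token_f1("Warriors", gold))  # about 0.25 (precision 1, recall 1/7)
print(token_f1("history", gold))   # 0.0, no shared word
print(token_f1(gold, gold))        # 1.0, exact match
```

Under word overlap "Warriors" earns partial credit and "history" earns none, whereas a loss that only looks at the exact start and end positions cannot tell the two apart.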

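A single coattention layer has limited representational power

Improvements in DCN+

Mixed Loss

Cross entropy plus self-critical learning (reinforcement learning). A good reward is given only when the predicted words are genuinely close to the answer.

Reinforcement learning encourages words that are close in meaning and discourages words that are not.
Cross entropy keeps reinforcement learning on the right trajectory.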
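
The interplay of the two terms can be sketched numerically. This is only an illustration under stated assumptions: the reward is taken to be a word-overlap score in [0, 1], the baseline is the reward of the greedy prediction (the usual meaning of "self-critical"), and the fixed `weight` used to combine the terms is a placeholder, since these notes do not say how the two losses are weighted. All function names are hypothetical.

```python
import math

def cross_entropy(p_start_true: float, p_end_true: float) -> float:
    """Negative log-likelihood of the ground-truth start and end positions."""
    return -(math.log(p_start_true) + math.log(p_end_true))

def self_critical_loss(log_prob_sampled: float,
                       reward_sampled: float,
                       reward_greedy: float) -> float:
    """Policy-gradient loss with the greedy prediction's reward as baseline."""
    advantage = reward_sampled - reward_greedy
    return -advantage * log_prob_sampled

def mixed_loss(ce: float, rl: float, weight: float = 0.5) -> float:
    """Fixed-weight combination; `weight` is an assumption for illustration."""
    return weight * ce + (1.0 - weight) * rl

ce = cross_entropy(p_start_true=0.6, p_end_true=0.5)

# The sampled span beats the greedy one: minimizing the loss raises its log-probability.
rl_good = self_critical_loss(math.log(0.2), reward_sampled=0.8, reward_greedy=0.25)
# The sampled span is worse than the greedy one: minimizing the loss lowers its log-probability.
rl_bad = self_critical_loss(math.log(0.2), reward_sampled=0.0, reward_greedy=0.25)

print(rl_good > 0, rl_bad < 0)  # True True
print(mixed_loss(ce, rl_good))
```

The sign of the advantage decides whether a sampled span is pushed up or down, while the cross-entropy term keeps pulling toward the annotated span.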

Deep Residual Coattention Encoder

Multiple layers are more expressive; see the advantages listed below.

Deep Residual Encoder

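Advantages

Two important findings from prior work:

Stacked self-attention speeds up signal propagation.
Shortening the signal path helps model long-range dependencies.

Two improvements over DCN:

Coattention with self-attention, plus multiple coattention layers. This gives richer representations of the input.
Residual connections over the coattention outputs of every layer. This shortens the signal path.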

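Understanding Coattention in Depth

I struggled with this for a long time. Then I spent a whole afternoon on it, working through a machine translation implementation and computing concrete matrix examples by hand, and Attention and Coattention finally made sense.

I drew on three of my earlier notes.

Computing a single coattention layer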

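The inputs are the document $E_{0}^{D} \in R^{m \times e}$ and the question $E_{0}^{Q} \in R^{n \times e}$. A bidirectional RNN turns them into two semantic encodings, one for the document and one for the question: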

E_{1}^{D} = \mathrm{biGRU}_{1} (E_{0}^{D}) \in R^{m \times h}

E_{1}^{Q} = \tanh (W \cdot \mathrm{biGRU}_{1} (E_{0}^{Q}) + b) \in R^{n \times h}

Compute the affinity score matrix A

A = E_{1}^{D} (E_{1}^{Q})^{T} \in R^{m \times n}

{[\begin{matrix} 0 & 0 \\ 2 & 3 \\ 0 & 2 \\ 1 & 1 \\ 3 & 3 \end{matrix}]}_{5 \times 2} \cdot {[\begin{matrix} 1 & 3 \\ 1 & 1 \\ 1 & 3 \end{matrix}]}_{3 \times 2}^{T} = {[\begin{matrix} 0 & 0 & 0 \\ 11 & 5 & 11 \\ 6 & 2 & 6 \\ 4 & 2 & 4 \\ 12 & 6 & 12 \end{matrix}]}_{5 \times 3}
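
The same product can be checked with NumPy. This is a small sketch that reuses the toy numbers of the worked example; the variable names `D` and `Q` stand for $E_{1}^{D}$ and $E_{1}^{Q}$ and are chosen here for brevity.

```python
import numpy as np

# Toy encodings: m = 5 document words, n = 3 question words, h = 2.
D = np.array([[0, 0], [2, 3], [0, 2], [1, 1], [3, 3]])  # E_1^D, shape (m, h)
Q = np.array([[1, 3], [1, 1], [1, 3]])                  # E_1^Q, shape (n, h)

# A[i, j] is the dot product of document word i with question word j.
A = D @ Q.T
print(A.shape)  # (5, 3)
print(A)
```

Question words 1 and 3 have identical vectors, which is why the first and third columns of A coincide.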

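Apply a row-wise softmax to get $A^{Q}$, the attention weights (attention_weights) that the question assigns to the document.

Each row corresponds to one document word w.
The entries of a row are the attention weights of all question words with respect to the current document word w.
Each entry is the probability weight of one question word, so every row sums to 1.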

{[\begin{matrix} 0.3333 & 0.3333 & 0.3333 \\ 0.4994 & 0.0012 & 0.4994 \\ 0.4955 & 0.0091 & 0.4955 \\ 0.4683 & 0.0634 & 0.4683 \\ 0.4994 & 0.0012 & 0.4994 \end{matrix}]}_{5 \times 3}
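
A short NumPy sketch reproduces these numbers from the affinity matrix. The `softmax` helper is written out here as an assumption-free, numerically stable version (the row maximum is subtracted before exponentiating); it is not taken from any particular library.

```python
import numpy as np

A = np.array([[0, 0, 0], [11, 5, 11], [6, 2, 6], [4, 2, 4], [12, 6, 12]], dtype=float)

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Normalize over question words: one probability distribution per document word.
A_Q = softmax(A, axis=1)
print(np.round(A_Q, 4))
print(A_Q.sum(axis=1))  # every row sums to 1
```

The first document word is the zero vector, so it scores every question word equally and ends up with uniform weights.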

Compute the summary of D, $S^{D} = A^{Q} \cdot Q$, where Q denotes the question encoding $E_{1}^{Q}$

S^{D} = A^{Q} \cdot Q

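This is the new semantic representation that D needs, analogous to the context vector in attention-based machine translation.
Each row of $A^{Q}$ is multiplied with every column of Q to produce the representation of word w.
In other words, D is expressed in terms of Q: each $D_{w}$ is a linear combination of all question words, weighted by $A^{Q}$.
$S^{D}$ is therefore a summary of D, also called the context that D needs.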
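
The "linear combination" reading can be verified on the toy example. This is a sketch using the same illustrative `D` and `Q` as before; it compares the matrix product with a weighted sum written out by hand for one document word.

```python
import numpy as np

D = np.array([[0, 0], [2, 3], [0, 2], [1, 1], [3, 3]], dtype=float)  # (m, h)
Q = np.array([[1, 3], [1, 1], [1, 3]], dtype=float)                  # (n, h)

A = D @ Q.T
exp = np.exp(A - A.max(axis=1, keepdims=True))
A_Q = exp / exp.sum(axis=1, keepdims=True)   # row-wise softmax, shape (m, n)

S_D = A_Q @ Q                                # (m, n) @ (n, h) -> (m, h)

# Row w of S_D is the sum of the question vectors weighted by A_Q[w].
w = 1
by_hand = sum(A_Q[w, j] * Q[j] for j in range(Q.shape[0]))
print(np.allclose(S_D[w], by_hand))  # True
print(np.round(S_D, 4))
```

Every row of `S_D` lives in the question's vector space: it is the question as seen from one document word.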

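Likewise, apply softmax over the columns of A and transpose the result to get $A^{D} \in R^{n \times m}$, the weights that the document assigns to the question. This gives the summary of Q, $S^{Q} = A^{D} \cdot D \in R^{n \times h}$, where D denotes $E_{1}^{D}$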

Next, borrowing the idea of alternating coattention, compute the coattention context $C^{D}$ for D:

C^{D} = A^{Q} \cdot S^{Q} \in R^{m \times h}

In fact, $C^{D}$ is similar to $S^{D}$: both are summaries and both are contexts. The only difference is that $C^{D}$ is built from the new $S^{Q}$ rather than from $E_{1}^{Q}$.
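
Putting the steps of this section together, one coattention layer can be sketched as a single function. The sketch assumes the row-major shapes used in these notes ($E^{D}$ is m×h, $E^{Q}$ is n×h), so the column softmax is transposed to n×m and the last product is taken as $A^{Q}$ times $S^{Q}$ for the dimensions to match. The function name `coattention` is chosen for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    exp = np.exp(x - x.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def coattention(E_D: np.ndarray, E_Q: np.ndarray):
    """One coattention layer: E_D is (m, h), E_Q is (n, h)."""
    A = E_D @ E_Q.T              # (m, n) affinity scores
    A_Q = softmax(A, axis=1)     # (m, n) weights over question words, per document word
    A_D = softmax(A, axis=0).T   # (n, m) weights over document words, per question word
    S_D = A_Q @ E_Q              # (m, h) summary of D in terms of Q
    S_Q = A_D @ E_D              # (n, h) summary of Q in terms of D
    C_D = A_Q @ S_Q              # (m, h) coattention context, built from S_Q instead of E_Q
    return S_D, S_Q, C_D

D = np.array([[0, 0], [2, 3], [0, 2], [1, 1], [3, 3]], dtype=float)
Q = np.array([[1, 3], [1, 1], [1, 3]], dtype=float)
S_D, S_Q, C_D = coattention(D, Q)
print(S_D.shape, S_Q.shape, C_D.shape)  # (5, 2) (3, 2) (5, 2)
```

`S_D` and `C_D` share the same weights `A_Q`; only the matrix being summarized changes, which is exactly the difference described above.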

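Coattention Encoder Summary

Two coattention layers are stacked, their outputs are joined by a residual connection, and a final recurrent layer produces the output.

First layer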

E_{1}^{D} = \mathrm{biGRU}_{1} (E_{0}^{D}) \in R^{m \times h}

E_{1}^{Q} = \tanh (W \cdot \mathrm{biGRU}_{1} (E_{0}^{Q}) + b) \in R^{n \times h}

\mathrm{coattn}_{1} (E_{1}^{D}, E_{1}^{Q}) = S_{1}^{D}, S_{1}^{Q}, C_{1}^{D}

Second layer

E_{2}^{D} = \mathrm{biGRU}_{2} (E_{1}^{D}) \in R^{m \times h}

E_{2}^{Q} = \tanh (W \cdot \mathrm{biGRU}_{2} (E_{1}^{Q}) + b) \in R^{n \times h}

\mathrm{coattn}_{2} (E_{2}^{D}, E_{2}^{Q}) = S_{2}^{D}, S_{2}^{Q}, C_{2}^{D}

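Residual connection over all the document representations

c = \mathrm{concat} (E_{1}^{D}, E_{2}^{D}, S_{1}^{D}, S_{2}^{D}, C_{1}^{D}, C_{2}^{D})

Encode with a final bidirectional RNN (biGRU in the formula below) to obtain the encoder output

U = \mathrm{biGRU} (c) \in R^{m \times 2h}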
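
The data flow and tensor shapes of this summary can be traced with a toy sketch. It follows the formulas as written in these notes and makes one loud simplification: `make_encoder` is a hypothetical stand-in for the recurrent layers (a per-position tanh projection with random weights, not a real biGRU), so the separate tanh projection on the question is not modeled. Only the shapes and the concatenation are meaningful here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    exp = np.exp(x - x.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def coattention(E_D, E_Q):
    """One coattention layer in row-major form, returning S_D, S_Q, C_D."""
    A = E_D @ E_Q.T
    A_Q, A_D = softmax(A, axis=1), softmax(A, axis=0).T
    S_D, S_Q = A_Q @ E_Q, A_D @ E_D
    return S_D, S_Q, A_Q @ S_Q

def make_encoder(in_dim, out_dim):
    """Hypothetical placeholder for a biGRU: tanh(X W), no recurrence."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda X: np.tanh(X @ W)

m, n, e, h = 5, 3, 4, 2                      # toy sizes, chosen arbitrarily
E0_D, E0_Q = rng.normal(size=(m, e)), rng.normal(size=(n, e))
enc1, enc2, enc_out = make_encoder(e, h), make_encoder(h, h), make_encoder(6 * h, 2 * h)

# First layer
E1_D, E1_Q = enc1(E0_D), enc1(E0_Q)
S1_D, S1_Q, C1_D = coattention(E1_D, E1_Q)

# Second layer, with inputs as written in the formulas of these notes
E2_D, E2_Q = enc2(E1_D), enc2(E1_Q)
S2_D, S2_Q, C2_D = coattention(E2_D, E2_Q)

# Residual connection: concatenate every document-side representation.
c = np.concatenate([E1_D, E2_D, S1_D, S2_D, C1_D, C2_D], axis=1)  # (m, 6h)
U = enc_out(c)                                                    # (m, 2h)
print(c.shape, U.shape)  # (5, 12) (5, 4)
```

Because every intermediate document representation is concatenated into `c`, the final layer sees the first-layer encodings directly rather than only through the second coattention layer.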


Mixed Objective

DCN