
Training Observability Metrics

📅 Published 2026/01/31
🔄 Updated 2026/01/31

Actor Metrics

Total Policy Loss

Core computation
  • Policy loss = PG loss − entropy bonus + KL penalty
  • policy_loss = pg_loss - entropy_coeff * entropy_loss + kl_loss_coef * kl_loss
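The combination above can be sketched as a one-liner. This is a minimal illustration, not the implementation from any particular framework; the default coefficient values are placeholders I chose, not values from this post.

```python
def total_policy_loss(pg_loss, entropy_loss, kl_loss,
                      entropy_coeff=0.01, kl_loss_coef=0.001):
    """policy_loss = pg_loss - entropy bonus + KL penalty.

    The entropy term is subtracted: higher entropy lowers the loss,
    encouraging exploration. The KL term is added as a penalty that
    keeps the policy close to a reference policy.
    Coefficient defaults are illustrative assumptions."""
    return pg_loss - entropy_coeff * entropy_loss + kl_loss_coef * kl_loss
```

Monitoring each of the three terms separately (not just their sum) makes it easier to see whether a loss spike comes from the PG term, collapsing entropy, or KL drift.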

Policy Gradient Loss (pg_loss)

Core computation of the PG loss

Seq-Level PG Loss

$$\mathcal{L}_{\text{PPO}}(\theta)=\frac{1}{G}\sum_{i=1}^{G}\underbrace{\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}}_{\text{average within each sequence}}\min\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)$$
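A NumPy sketch of the seq-level objective: clipped per-token terms are averaged within each sequence first (the 1/|o_i| factor), then averaged across the G sequences. Function and argument names are my own; the mask marks valid response tokens so padded positions drop out.

```python
import numpy as np

def seq_level_pg_loss(log_probs, old_log_probs, advantages, mask, eps=0.2):
    """Seq-level clipped PG objective. All inputs have shape (G, T);
    mask is 1.0 on valid response tokens o_{i,t}, 0.0 on padding."""
    # importance ratio pi_theta / pi_theta_old, computed in log space
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratio * advantages, clipped * advantages)
    # average within each sequence: (1 / |o_i|) * sum_t
    per_seq = (per_token * mask).sum(axis=1) / mask.sum(axis=1)
    # then average across the G sequences: (1 / G) * sum_i
    return per_seq.mean()
```

With this averaging, a short sequence and a long sequence contribute equally to the loss, so each token in a short sequence carries more weight than each token in a long one.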

Token-Level PG Loss

$$\mathcal{L}_{\text{DAPO}}(\theta)=\underbrace{\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}}_{\text{flat average over all tokens}}\min\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)$$
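The token-level variant can be sketched the same way; the only change from the seq-level version is the normalization. Again a minimal illustration with names of my own choosing, assuming the same (G, T) tensors and validity mask.

```python
import numpy as np

def token_level_pg_loss(log_probs, old_log_probs, advantages, mask, eps=0.2):
    """Token-level clipped PG objective: one flat average over every
    valid token across all G sequences, 1 / (sum_i |o_i|)."""
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratio * advantages, clipped * advantages)
    # no per-sequence normalization: long sequences contribute more tokens
    return (per_token * mask).sum() / mask.sum()
```

Compared with the seq-level loss, every token is weighted equally here, so long responses dominate the gradient in proportion to their length rather than being down-weighted by 1/|o_i|.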
PLM's Blog @ 2016 - 2026