
Training Observability Metrics

📅 Published 2026/01/31
🔄 Updated 2026/01/31

Actor Metrics

Total Policy Loss

Core computation
  • Policy loss = PG loss − entropy bonus + KL penalty
  • policy_loss = pg_loss - entropy_coeff * entropy_loss + kl_loss_coef * kl_loss
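The combination above can be sketched as a one-liner. This is a minimal illustration, not the implementation from any particular framework; the default coefficient values are placeholders I chose, not values from this post.

```python
def total_policy_loss(pg_loss, entropy_loss, kl_loss,
                      entropy_coeff=0.01, kl_loss_coef=0.001):
    """policy_loss = pg_loss - entropy bonus + KL penalty.

    The entropy term is subtracted: higher entropy lowers the loss,
    encouraging exploration. The KL term is added as a penalty that
    keeps the policy close to a reference policy.
    Coefficient defaults are illustrative assumptions."""
    return pg_loss - entropy_coeff * entropy_loss + kl_loss_coef * kl_loss
```

Monitoring each of the three terms separately (not just their sum) makes it easier to see whether a loss spike comes from the PG term, collapsing entropy, or KL drift.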

Policy Gradient Loss (pg_loss)

Core computation of the PG loss

Seq-Level PG Loss

$$\mathcal{L}_{\text{PPO}}(\theta)=\frac{1}{G}\sum_{i=1}^{G}\underbrace{\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}}_{\text{average within each sequence}}\min\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)$$
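A NumPy sketch of the seq-level objective: clipped per-token terms are averaged within each sequence first (the 1/|o_i| factor), then averaged across the G sequences. Function and argument names are my own; the mask marks valid response tokens so padded positions drop out.

```python
import numpy as np

def seq_level_pg_loss(log_probs, old_log_probs, advantages, mask, eps=0.2):
    """Seq-level clipped PG objective. All inputs have shape (G, T);
    mask is 1.0 on valid response tokens o_{i,t}, 0.0 on padding."""
    # importance ratio pi_theta / pi_theta_old, computed in log space
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratio * advantages, clipped * advantages)
    # average within each sequence: (1 / |o_i|) * sum_t
    per_seq = (per_token * mask).sum(axis=1) / mask.sum(axis=1)
    # then average across the G sequences: (1 / G) * sum_i
    return per_seq.mean()
```

With this averaging, a short sequence and a long sequence contribute equally to the loss, so each token in a short sequence carries more weight than each token in a long one.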

Token-Level PG Loss

$$\mathcal{L}_{\text{DAPO}}(\theta)=\underbrace{\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}}_{\text{flat average over all tokens}}\min\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)$$
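The token-level variant can be sketched the same way; the only change from the seq-level version is the normalization. Again a minimal illustration with names of my own choosing, assuming the same (G, T) tensors and validity mask.

```python
import numpy as np

def token_level_pg_loss(log_probs, old_log_probs, advantages, mask, eps=0.2):
    """Token-level clipped PG objective: one flat average over every
    valid token across all G sequences, 1 / (sum_i |o_i|)."""
    ratio = np.exp(log_probs - old_log_probs)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratio * advantages, clipped * advantages)
    # no per-sequence normalization: long sequences contribute more tokens
    return (per_token * mask).sum() / mask.sum()
```

Compared with the seq-level loss, every token is weighted equally here, so long responses dominate the gradient in proportion to their length rather than being down-weighted by 1/|o_i|.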
PLM's Blog @ 2016 - 2026