fix(chapter5): fix labels/attention_mask semantics and add padding-aware batch inference #170

2026-02-26T15:07:45+08:00

dearsky commented

2026-02-26 15:07:45 +08:00

(Migrated from gitea.proxy.dearsky.top)

PR Description

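Background

The current `chapter5` example code has 4 consistency issues:

1. `forward` mistakenly assigns `attention_mask` to `targets`.
2. Training relies mainly on `loss_mask`; the model's attention layers do not use a padding mask.
3. Inference always takes the logits at position `-1`, which picks the wrong position for padded sequences in a batch.
4. The dataset's padding value does not match the tokenizer's `pad_token_id` (the implementation hard-codes 0).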
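Issue 3 above can be illustrated with a small dependency-free sketch. It assumes right padding and an `attention_mask` where 1 marks a real token and 0 marks padding; the helper `last_valid_index` and the constant `PAD` are hypothetical names used only for illustration, not code from this PR.

```python
PAD = 0  # hypothetical pad id, for illustration only


def last_valid_index(mask_row):
    """Index of the last real token, assuming right padding (1 = token, 0 = pad)."""
    return sum(mask_row) - 1


# Two sequences in one batch; the second is right-padded to length 4.
batch = [
    [11, 12, 13, 14],
    [21, 22, PAD, PAD],
]
attention_mask = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

# Always reading position -1 lands on padding for the shorter sequence.
fixed_pick = [row[-1] for row in batch]

# Reading the last valid position gives the real final token of each sequence.
padding_aware_pick = [
    row[last_valid_index(mask)] for row, mask in zip(batch, attention_mask)
]

print(fixed_pick)          # [14, 0]
print(padding_aware_pick)  # [14, 22]
```

The same indexing applies to logits: for each sample, the next-token distribution should be read at that sample's last valid position rather than at the last position of the padded tensor.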

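Changes in this PR

1. Fix `forward` to use `labels` as the supervision signal (`attention_mask` is no longer treated as targets).

2. Add padding-mask support to attention.

   - `Attention.forward` takes a new `attention_mask` parameter.
   - The Flash Attention path builds a combined `causal + key padding` mask.
   - The non-Flash path applies `masked_fill(-inf)` at the padded key positions.

3. Fix last-token position selection in batch inference.

   - When `attention_mask` is provided, logits are taken at the last valid token instead of always at `[:, -1, :]`.
   - `generate` / `generate_super` take new `attention_mask` and `pad_token_id` parameters.
   - Generation is supported after some samples in the batch finish early (on `stop_id`).

4. Align padding with the tokenizer configuration.

   - `dataset.py` now uses `tokenizer.pad_token_id` (falling back to 0).
   - The training, export, and sampling scripts sync `lm_config.pad_token_id` when initializing the model.

5. Update the corresponding code snippets in the documentation.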
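As a rough illustration of the combined `causal + key padding` mask from change 2, the sketch below builds the mask and applies it with `-inf` before a softmax. It is written in plain Python so it runs without dependencies; the PR itself works on PyTorch tensors, and the names `build_allowed` and `masked_softmax` are hypothetical. Right padding is assumed.

```python
import math


def build_allowed(key_mask):
    """allowed[q][k] is True when query q may attend to key k.

    Combines the causal rule (k <= q) with the key padding rule
    (key_mask[k] == 1 means key k is a real token).
    """
    n = len(key_mask)
    return [[k <= q and key_mask[k] == 1 for k in range(n)] for q in range(n)]


def masked_softmax(scores, allowed):
    """Row-wise softmax after setting disallowed scores to -inf."""
    weights = []
    for row, ok in zip(scores, allowed):
        masked = [s if a else float("-inf") for s, a in zip(row, ok)]
        peak = max(masked)
        if peak == float("-inf"):
            # Fully masked row: return zeros instead of dividing by zero.
            weights.append([0.0] * len(row))
            continue
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


key_mask = [1, 1, 0]                    # third position is padding
scores = [[0.0] * 3 for _ in range(3)]  # uniform raw attention scores
allowed = build_allowed(key_mask)
weights = masked_softmax(scores, allowed)

for row in weights:
    print(row)
```

With uniform scores, every query spreads its attention only over real tokens at or before its own position, and the padded key always receives weight 0.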

Affected files

- `docs/chapter5/code/k_model.py`
- `docs/chapter5/code/dataset.py`
- `docs/chapter5/code/ddp_pretrain.py`
- `docs/chapter5/code/ddp_sft_full.py`
- `docs/chapter5/code/model_sample.py`
- `docs/chapter5/code/export_model.py`
- `docs/chapter5/第五章动手搭建大模型.md`

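Compatibility

- Existing `model(X, Y)` training calls remain compatible.
- All new parameters are optional, so existing single-sample inference calls are unaffected.
- Behavior is now closer to the standard CausalLM `labels + attention_mask` semantics.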
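To make the split of roles concrete, here is a minimal sketch of how `labels` and a loss mask could interact during training: `labels` supply the targets, while the mask only decides which positions count toward the loss. This is an assumption about the general pattern, not the code in `k_model.py`; `token_nll` and `masked_mean_loss` are hypothetical helpers written in plain Python.

```python
import math


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def masked_mean_loss(logits_seq, labels, loss_mask):
    """Average per-token loss over positions where loss_mask is 1."""
    total = sum(
        token_nll(logits, y) * m
        for logits, y, m in zip(logits_seq, labels, loss_mask)
    )
    count = sum(loss_mask)
    return total / count if count else 0.0


# Three positions over a 2-token vocabulary; the last position is padding.
logits_seq = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
labels = [0, 1, 1]
loss_mask = [1, 1, 0]

loss = masked_mean_loss(logits_seq, labels, loss_mask)
print(loss)
```

Because the padded position has mask 0, changing its logits or label leaves the loss unchanged; here the two real positions each contribute `log(2)`.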


dearsky (Migrated from gitea.proxy.dearsky.top) approved these changes 2026-02-26 15:25:37 +08:00

dearsky commented

2026-02-26 15:26:27 +08:00

(Migrated from gitea.proxy.dearsky.top)

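Thanks for this PR! Review complete:

✅ Correctly fixes the bug in `forward` where `attention_mask` was mistakenly assigned to `targets`
✅ The `attention_mask` implementation logic is correct
✅ Batch generation support is correct

The implementation is correct and the logic is clear. LGTM 👍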



Reference: PullFromGitHub/happy-llm#170