1. SFT Training
Qwen-1_8B-Chat
The QLoRA training parameters are set as follows:
CUDA_VISIBLE_DEVICES=0,1 python src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path ./home/models/qwen/Qwen-1_8B-Chat \
--dataset alpaca_gpt4_en \
--template default \
--finetuning_type lora \
--quantization_bit 4 \
--lora_target c_attn \
--output_dir ./outputqlora \
--overwrite_cache \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs 1 \
--plot_loss \
--fp16
2024.06.04 update: the entry point has moved to train.py under src (still run from the LLaMA-Factory root directory):
CUDA_VISIBLE_DEVICES=0 python src/train.py examples/lora_single_gpu/llama3task.yaml
accelerate launch --gpu_ids 0,1 src/train.py examples/lora_single_gpu/llama3task.yaml | tee output.log
Edit the corresponding yaml file under examples; if the dataset is large, remove the max_samples: 1000 entry. A sketch of such a config is given below.
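As a rough sketch only: the yaml keys mirror the CLI flags used above, but the exact key set depends on your LLaMA-Factory version, and the model path and output_dir below are placeholders. One way to generate the file from Python (keeping all code examples here in Python) is:

# Sketch: write examples/lora_single_gpu/llama3task.yaml programmatically.
# The keys mirror the CLI flags shown earlier in this post; compare them with
# the sample configs shipped under examples/ in your LLaMA-Factory checkout.
import yaml  # pip install pyyaml

config = {
    "model_name_or_path": "/data/models/Meta-Llama-3-8B-Instruct",  # placeholder path
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_target": "all",
    "dataset": "alpaca_gpt4_en",
    "template": "llama3",
    "cutoff_len": 4096,
    # "max_samples": 1000,  # drop this entry for large datasets
    "output_dir": "saves/llama3-lora-sft",  # placeholder output dir
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 5.0e-5,
    "num_train_epochs": 1.0,
    "lr_scheduler_type": "cosine",
    "logging_steps": 10,
    "save_steps": 1000,
    "fp16": True,
}

with open("examples/lora_single_gpu/llama3task.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)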
2024.06.18 update: DPO training
# format of each DPO training sample
data = {
"conversations": [
{
"from": "human",
"value": prompt
}
],
"chosen": {
"from": "gpt",
"value": chosen
},
"rejected": {
"from": "gpt",
"value": rejected
}
}
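A minimal sketch for writing the training file registered below (the data/ location and the toy sample, taken from the comparison example later in this post, are just placeholders for your own preference data):

# Sketch: dump preference pairs into the sharegpt-style DPO format above.
import json

# Toy (prompt, chosen, rejected) triples; replace with your own data.
samples = [
    ("What are the three primary colors?",
     "The three primary colors are red, blue, and yellow.",
     "Red, Yellow, and Green."),
]

records = [
    {
        "conversations": [{"from": "human", "value": prompt}],
        "chosen": {"from": "gpt", "value": chosen},
        "rejected": {"from": "gpt", "value": rejected},
    }
    for prompt, chosen, rejected in samples
]

# LLaMA-Factory datasets conventionally live in the data/ folder.
with open("data/traindpo_0618.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)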
# add the following to dataset_info.json
"llama3dpo": {
"file_name": "traindpo_0618.json",
"ranking": true,
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"chosen": "chosen",
"rejected": "rejected"
}
},
# multi-GPU training with DeepSpeed
deepspeed --num_gpus 4 src/train.py \
--model_name_or_path /data/models/Meta-Llama-3-8B-Instruct \
--stage dpo \
--do_train \
--finetuning_type lora \
--lora_target all \
--pref_beta 0.1 \
--pref_loss sigmoid \
--deepspeed examples/deepspeed/ds_z3_config.json \
--dataset llama3dpo \
--template llama3 \
--cutoff_len 4096 \
--max_new_tokens 2000 \
--max_samples 10000 \
--overwrite_cache \
--preprocessing_num_workers 16 \
--output_dir savesdpo1 \
--logging_steps 100 \
--save_steps 100 \
--plot_loss \
--overwrite_output_dir \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--learning_rate 1.0e-5 \
--num_train_epochs 4.0 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--fp16 \
--ddp_timeout 180000000 \
--val_size 0.05 \
--per_device_eval_batch_size 4 \
--eval_strategy steps \
--eval_steps 100 > output6.log 2>&1
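For reference, --pref_loss sigmoid together with --pref_beta 0.1 corresponds to the standard sigmoid (DPO) preference loss, where β is pref_beta, y_w / y_l are the chosen / rejected responses, and π_ref is the frozen reference model:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log\sigma\!\left( \beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]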
2. Reward Model Training
CUDA_VISIBLE_DEVICES=0,1 python src/train_bash.py \
--stage rm \
--do_train \
--model_name_or_path ./home/models/qwen/Qwen-1_8B-Chat \
--adapter_name_or_path ./outputqlora \
--create_new_adapter \
--dataset comparison_gpt4_en \
--template default \
--finetuning_type lora \
--lora_target c_attn \
--output_dir output_reward \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-6 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16
comparison_gpt4_en: after one example goes through
dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
in src/data/loader.py, the resulting data looks like this (the converted token ids are omitted):
prompt:
Human: What are the three primary colors?
Assistant:
chosen:
The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB).<|endoftext|>
rejected:
Red, Yellow, and Green.<|endoftext|>
The loss is computed by the code below, which works as follows:
- Call model to obtain values, the per-token scores produced for the input sequence.
- Split the batch into two halves: the chosen examples and the rejected examples.
- For each sample, compute the lengths of the chosen and rejected sequences and look for a difference between them. If the two sequences are identical, the chosen length serves as both the end index and the divergence index; otherwise, the index of the first differing token becomes the divergence index.
- Assert that the divergence index is greater than zero, i.e. the chosen and rejected sequences really do diverge.
- Slice the chosen and rejected reward values from the divergence index to the end index; the loss is computed only on this slice.
- For each sample, take the difference between the chosen and rejected rewards and apply logsigmoid to obtain the loss. logsigmoid keeps the loss in a reasonable range and behaves well for both small and large differences.
- Sum the per-sample losses and divide by the batch size to get the loss of the whole batch.
In this way, training adjusts the model parameters according to the gap between the chosen and rejected sequences so as to minimize the loss.
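Written as a formula (B is the batch size, d_i and e_i are the divergence and end indices of sample i, and r^{c}_{i,t}, r^{r}_{i,t} are the chosen and rejected reward values at token position t):

\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B} \left( -\,\frac{1}{e_i - d_i} \sum_{t=d_i}^{e_i - 1} \log\sigma\!\left( r^{c}_{i,t} - r^{r}_{i,t} \right) \right)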
def compute_loss(
    self, model: "PreTrainedModel", inputs: Dict[str, torch.Tensor], return_outputs: Optional[bool] = False
) -> Union[torch.Tensor, Tuple[torch.Tensor, List[torch.Tensor]]]:
    r"""
    Computes pairwise loss. The first n examples are chosen and the last n examples are rejected.
    Subclass and override to inject custom behavior.
    Note that the first element will be removed from the output tuple.
    See: https://github.com/huggingface/transformers/blob/v4.30.2/src/transformers/trainer.py#L3509
    """
    # Compute rewards: one value per token of the input sequence
    _, _, values = model(**inputs, output_hidden_states=True, return_dict=True)
    unwrapped_model: "PreTrainedModel" = self.accelerator.unwrap_model(self.model)
    if getattr(unwrapped_model.config, "model_type", None) == "chatglm":
        values = torch.transpose(values, 0, 1)

    # Split the inputs and rewards into two parts, chosen and rejected
    batch_size = inputs["input_ids"].size(0) // 2
    chosen_input_ids, rejected_input_ids = inputs["input_ids"][:batch_size], inputs["input_ids"][batch_size:]
    chosen_rewards, rejected_rewards = values[:batch_size], values[batch_size:]
    chosen_scores, rejected_scores = [], []

    # Compute pairwise loss. Only backprop on the different tokens before padding
    # Inspired by: https://github.com/CarperAI/trlx/blob/main/examples/summarize_rlhf/reward_model/reward_model.py
    loss = 0
    for i in range(batch_size):
        # lengths of the chosen and rejected sequences (up to the last non-pad token)
        chosen_length = (chosen_input_ids[i] != self.tokenizer.pad_token_id).nonzero()[-1] + 1
        rejected_length = (rejected_input_ids[i] != self.tokenizer.pad_token_id).nonzero()[-1] + 1
        # positions where the chosen and rejected sequences differ (empty if identical)
        check_divergence = (chosen_input_ids[i] != rejected_input_ids[i]).nonzero()

        if len(check_divergence) == 0:
            # identical sequences: use the chosen length as both the end index and the divergence index
            end_index = chosen_length
            div_index = end_index - 1
        else:
            # otherwise use the longer length as the end index
            # and the first differing token as the divergence index
            end_index = max(chosen_length, rejected_length)
            div_index = check_divergence[0]

        # the divergence index must be positive, so the two sequences really differ
        assert div_index > 0
        # take the reward values after the divergence index; the loss is computed only on this slice
        chosen_trunc_rewards = chosen_rewards[i, div_index:end_index]
        rejected_trunc_rewards = rejected_rewards[i, div_index:end_index]
        if return_outputs:  # use the score on the last token except pad token for inference
            chosen_scores.append(chosen_rewards[i, chosen_length - 1])
            rejected_scores.append(rejected_rewards[i, rejected_length - 1])
        # difference between chosen and rejected rewards, passed through logsigmoid
        loss += -torch.nn.functional.logsigmoid(chosen_trunc_rewards - rejected_trunc_rewards).mean()

    loss = loss / batch_size
    if return_outputs:
        chosen_scores, rejected_scores = torch.stack(chosen_scores), torch.stack(rejected_scores)
        return loss, [loss, chosen_scores, rejected_scores]

    return loss
3. PPO Training
CUDA_VISIBLE_DEVICES=0,1 python src/train_bash.py \
--stage ppo \
--do_train \
--model_name_or_path ./home/models/qwen/Qwen-1_8B-Chat \
--adapter_name_or_path ./outputqlora \
--create_new_adapter \
--dataset alpaca_gpt4_en \
--template default \
--finetuning_type lora \
--lora_target c_attn \
--reward_model output_reward \
--output_dir output_ppo \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--top_k 0 \
--top_p 0.9 \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16
The input data looks like this:
inputs:
Human: Give three tips for staying healthy.
Assistant:
RLHF needs four models: the actor model, the ref_model, the reward model, and the critic model.
- actor model: the SFT-tuned model.
- ref_model: the same SFT-tuned model as the actor. Both are created via from ...train.utils import create_ref_model, create_reward_model, and run_ppo shows how model and ref_model are constructed.
- reward model and critic model: both are built from the model obtained in the reward (RM) training stage.
How the four models interact is sketched after this list.
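As a reminder (this is the standard RLHF/PPO formulation as commonly implemented, not code copied from LLaMA-Factory): the actor generates a response, the reward model scores it at the final token, the ref_model contributes a per-token KL penalty, and the critic supplies value estimates for the advantages:

r_t = \mathbb{1}[t = T]\, r_{\mathrm{RM}}(x, y) \;-\; \beta \left( \log \pi_{\theta}(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t}) \right)

\mathcal{L}_{\mathrm{PPO}}(\theta) = -\,\mathbb{E}_t\!\left[ \min\!\left( \rho_t \hat{A}_t,\; \mathrm{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \right) \right], \quad \rho_t = \frac{\pi_\theta(y_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(y_t \mid s_t)}

where \hat{A}_t is the advantage estimated (e.g. with GAE) from the critic's value predictions.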
References
[1] LLaMA-Factory