v0.10.1

huggingface/trl

Release date: 2024-08-29 22:34:50

We are excited to introduce the v0.10.1 release, with many exciting new features and post-training algorithms. The highlights are as follows:

Online DPO

Online DPO is a new alignment method from DeepMind for boosting the performance of LLMs. With Online DPO, data is generated on the fly by the model being trained (instead of being pre-collected): for each prompt, two completions are sampled, and a reward model selects the preferred one.

To train models with this method, use the OnlineDPOTrainer.
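The on-the-fly data-generation step described above can be sketched as follows. This is a minimal illustration, not the trainer's actual internals: `generate_two` and `reward_model` are hypothetical stand-ins for sampling from the current policy and scoring with a reward model.

```python
def generate_two(prompt: str) -> tuple[str, str]:
    # Stand-in for sampling two completions from the current policy.
    return f"{prompt} short", f"{prompt} a bit longer"

def reward_model(prompt: str, completion: str) -> float:
    # Stand-in reward: here we simply prefer longer completions.
    return float(len(completion))

def build_preference_pair(prompt: str) -> dict:
    """Generate two completions and let the reward model pick the preferred one."""
    c1, c2 = generate_two(prompt)
    if reward_model(prompt, c1) >= reward_model(prompt, c2):
        chosen, rejected = c1, c2
    else:
        chosen, rejected = c2, c1
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = build_preference_pair("Explain DPO in one line.")
```

In the real OnlineDPOTrainer this loop runs during training, so the preference pairs always reflect the current policy rather than a fixed offline dataset.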

Liger Triton kernels for supercharged SFT

DPO for VLMs

DPO training now supports vision-language models (VLMs). For example:

accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16

WinRate callback for LLM as a judge

from trl import DPOTrainer, WinRateCallback

trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
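Conceptually, the win rate the callback reports is the fraction of prompts for which a judge prefers the model's completion over a reference completion. A minimal sketch of that computation, using a hypothetical length-based judge in place of an LLM judge:

```python
def win_rate(model_outputs, reference_outputs, judge) -> float:
    """Fraction of comparisons where the model's completion beats the reference."""
    # judge(a, b) returns True when completion `a` is preferred over `b`.
    wins = sum(judge(m, r) for m, r in zip(model_outputs, reference_outputs))
    return wins / len(model_outputs)

# Toy judge: prefer the longer completion (stand-in for an LLM-as-a-judge call).
longer_wins = lambda a, b: len(a) > len(b)

rate = win_rate(
    ["good long answer", "ok"],
    ["short", "much longer answer"],
    longer_wins,
)
# rate == 0.5: the model wins the first comparison and loses the second.
```

The real callback delegates the `judge(a, b)` decision to an LLM judge and logs the resulting win rate during training.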

Anchored Preference Optimisation (APO) for fine-grained human/AI feedback

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.10
