


Comprehensively surpassing DPO: Chen Danqi's team proposed SimPO, a simple preference optimization method, and also trained the strongest open-source 8B model
To align large language models (LLMs) with human values and intentions, it is critical to learn from human feedback so that models are helpful, honest, and harmless. An effective approach to aligning LLMs is reinforcement learning from human feedback (RLHF). Although RLHF yields excellent results, it involves optimization challenges: it requires training a reward model and then optimizing a policy model to maximize that reward.
Recently, some researchers have explored simpler offline algorithms, one of which is direct preference optimization (DPO). DPO reparameterizes the reward function in RLHF so that a policy model can be learned directly from preference data, eliminating the need for an explicit reward model. The method is simple and stable and has been widely used in practice.
In DPO, the implicit reward is constructed from the log-ratio of the response likelihood between the current policy model and the supervised fine-tuned (SFT) model. However, this reward construction is not directly aligned with the metric that guides generation, which is approximately the average log-likelihood of a response under the policy model. This gap between training and inference can lead to suboptimal performance.
To this end, Meng Yu, an assistant professor at the University of Virginia, Xia Mengzhou, a doctoral student at Princeton University, and Chen Danqi, an assistant professor at Princeton, jointly proposed SimPO, a simple and effective offline preference optimization algorithm. The core design of SimPO is to align the reward in the preference optimization objective with the generation metric: it uses the length-normalized average log-likelihood of a response as the implicit reward, which removes the need for a reference model while remaining simple and effective.
- Paper title: SimPO: Simple Preference Optimization with a Reference-Free Reward
- Paper address: https://arxiv.org/pdf/2405.14734
- Code & Model: https://github.com/princeton-nlp/SimPO
To sum up, SimPO has the following characteristics:
- Simple: SimPO does not require a reference model, making it lighter and easier to implement than DPO and other methods that depend on a reference model.
- Clear performance advantage: Despite its simplicity, SimPO significantly outperforms DPO and its latest variants (such as the recent reference-free objective ORPO), as shown in Figure 1. This advantage holds stably across different training setups and multiple instruction-following benchmarks, including AlpacaEval 2 and the challenging Arena-Hard benchmark.
- Minimal length exploitation: Compared with the SFT or DPO models, SimPO does not significantly increase response length (see Table 1), indicating that it exploits length only minimally.
SimPO: Simple Preference Optimization
For ease of understanding, the following first introduces the background of DPO, then explains the mismatch between DPO's reward and the likelihood metric used in generation, and proposes a reference-free alternative reward formulation to alleviate this problem. Finally, the SimPO objective is derived by integrating a target reward margin term into the Bradley-Terry model.
Background: Direct Preference Optimization (DPO)
DPO is one of the most widely used offline preference optimization methods. Instead of learning an explicit reward model, DPO reparameterizes the reward function r with a closed-form expression involving the optimal policy:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), \tag{1}$$

where π_θ is the policy model, π_ref is the reference policy (usually the SFT model), and Z(x) is the partition function. By plugging this reward construction into the Bradley-Terry (BT) ranking objective, DPO can use the policy model rather than a reward model to represent the probability of the preference data, resulting in the objective

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right], \tag{2}$$

where (x, y_w, y_l) is a preference pair consisting of a prompt, a winning response, and a losing response from the preference dataset D.

A simple reference-free reward aligned with generation
The gap between DPO's reward and generation. Using equation (1) as the implicit reward expression has the following drawbacks: (1) the training phase requires a reference model π_ref, which brings additional memory and compute costs; (2) there is a discrepancy between the reward optimized during training and the metric that guides generation at inference time. Specifically, at generation time, the policy model π_θ is used to produce a sequence that approximately maximizes the average log-likelihood, defined as

$$p_\theta(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}). \tag{3}$$

Directly maximizing this metric during decoding is intractable, and various decoding strategies approximate it, such as greedy decoding, beam search, nucleus sampling, and top-k sampling. The metric is also commonly used to rank candidate options when language models perform multiple-choice tasks. In DPO, for any triplet (x, y_w, y_l), satisfying the reward ranking r(x, y_w) > r(x, y_l) does not necessarily mean satisfying the likelihood ranking p_θ(y_w | x) > p_θ(y_l | x). In fact, when training with DPO, only about 50% of the triplets in the holdout set meet this condition (see Figure 4b).

Constructing a length-normalized reward. Naturally, one would consider replacing DPO's reward construction with the p_θ in equation (3), so that the reward aligns with the likelihood metric that guides generation. This yields a length-normalized reward:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}), \tag{4}$$

where β is a constant that controls the scale of the reward difference. The team found that normalizing the reward by response length is critical: removing the length-normalization term from the reward formula causes the model to tend to generate longer but lower-quality sequences. This reward construction requires no reference model, making it more memory and compute efficient than algorithms that rely on one.
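To make the contrast concrete, below is a minimal PyTorch-style sketch (not the authors' code; the function names, tensor shapes, and β values are illustrative) of the two reward constructions, computed from per-token log-probabilities of a response: DPO's implicit reward, which requires a reference model, and the length-normalized reward of equation (4), which does not.

```python
# Illustrative sketch only. token_logps / mask have shape (batch, seq_len);
# mask is 1 on response tokens and 0 on prompt/padding tokens.
import torch

def sequence_logprob(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of log pi_theta(y_i | x, y_<i) over the response tokens."""
    return (token_logps * mask).sum(dim=-1)

def dpo_implicit_reward(policy_logps, ref_logps, mask, beta=0.1):
    """DPO's implicit reward, omitting the beta * log Z(x) term,
    which cancels when comparing two responses to the same prompt."""
    return beta * (sequence_logprob(policy_logps, mask) - sequence_logprob(ref_logps, mask))

def length_normalized_reward(policy_logps, mask, beta=2.0):
    """SimPO-style reward of equation (4): beta/|y| * log pi_theta(y|x),
    i.e. a scaled average log-likelihood, with no reference model."""
    length = mask.sum(dim=-1).clamp(min=1)
    return beta * sequence_logprob(policy_logps, mask) / length
```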
The SimPO objective
Target reward margin. In addition, the team introduces a target reward margin γ > 0 into the Bradley-Terry objective to ensure that the reward of the winning response, r(x, y_w), exceeds the reward of the losing response, r(x, y_l), by at least γ:

$$p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) - \gamma \big). \tag{5}$$

The margin between two classes is known to affect a classifier's generalization ability. In standard training settings with random model initialization, increasing the target margin usually improves generalization. In preference optimization, the two classes are the winning and losing responses to a single input. In practice, the team observed that generation quality first improves as the target margin grows, but degrades once the margin becomes too large. A DPO variant, IPO, also formulates a target reward margin similar to SimPO's, but its overall objective is less effective than SimPO's.

Objective. Finally, substituting equation (4) into equation (5) yields the SimPO objective:

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right]. \tag{6}$$

In summary, SimPO adopts an implicit reward form that aligns directly with the generation metric, eliminating the need for a reference model. In addition, it introduces a target reward margin γ to separate the winning and losing responses.
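Putting the pieces together, the objective in equation (6) can be sketched in a few lines. This is an illustrative sketch rather than the official implementation; the function name, tensor layout, and default β and γ values are assumptions.

```python
# Illustrative sketch of the SimPO objective (equation (6)).
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_mask, rejected_mask,
               beta=2.0, gamma=0.5):
    """Per-token log-probs of the winning / losing responses, shape (batch, seq_len)."""
    # Length-normalized implicit rewards (equation (4)); no reference model anywhere.
    chosen_reward = beta * (chosen_logps * chosen_mask).sum(-1) / chosen_mask.sum(-1).clamp(min=1)
    rejected_reward = beta * (rejected_logps * rejected_mask).sum(-1) / rejected_mask.sum(-1).clamp(min=1)
    # Bradley-Terry loss with a target reward margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```

Note that, unlike the DPO loss in equation (2), no reference-model log-probabilities appear anywhere in this objective.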
Experimental settings
Models and training settings. The team's experiments use two model families, Llama3-8B and Mistral-7B, each under both Base and Instruct setups.
Evaluation benchmarks. The team uses three of the most widely used open instruction-following benchmarks: MT-Bench, AlpacaEval 2, and Arena-Hard v0.1. These benchmarks evaluate a model's diverse conversational capabilities on a variety of queries and have been widely adopted by the community. Table 2 gives the details.
Baseline methods. Table 3 lists the other offline preference optimization methods compared against SimPO.

Experimental results
Main results and ablation studies
SimPO consistently and significantly outperforms existing preference optimization methods. As shown in Table 4, although all preference optimization algorithms improve over the SFT model, the simple SimPO achieves the best performance across all benchmarks and settings. Such a large lead across the board demonstrates SimPO's robustness and effectiveness.
Benchmark difficulty varies. The win rate on Arena-Hard is clearly lower than on AlpacaEval 2, indicating that Arena-Hard is a harder benchmark.
The Instruct setup brings significant performance gains. The Instruct setup outperforms the Base setup across all benchmarks, likely because these models are initialized from higher-quality SFT models and because the preference data they generate is of higher quality.
Both key design choices of SimPO matter. Table 5 shows ablation results for each key design of SimPO: (1) removing the length normalization in equation (4) (i.e., w/o LN); (2) setting the target reward margin in equation (6) to 0 (i.e., γ = 0). Removing length normalization has the largest impact: the team found it causes the model to generate long, repetitive patterns, severely degrading output quality. Setting γ to 0 also degrades SimPO's performance, indicating that 0 is not the optimal target reward margin. See the original paper for a more in-depth analysis of these two design choices.
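For concreteness, both ablations correspond to small switches in a SimPO-style loss; the sketch below (hypothetical names, assumed hyperparameter values) exposes them explicitly: passing length_normalize=False reproduces the w/o LN variant, and gamma=0.0 removes the target reward margin.

```python
# Illustrative sketch of the two ablated variants of a SimPO-style loss.
import torch
import torch.nn.functional as F

def ablated_simpo_loss(chosen_logps, rejected_logps, chosen_mask, rejected_mask,
                       beta=2.0, gamma=0.5, length_normalize=True):
    chosen = (chosen_logps * chosen_mask).sum(-1)
    rejected = (rejected_logps * rejected_mask).sum(-1)
    if length_normalize:
        # Standard SimPO reward: average log-likelihood scaled by beta.
        chosen = chosen / chosen_mask.sum(-1).clamp(min=1)
        rejected = rejected / rejected_mask.sum(-1).clamp(min=1)
    # "w/o LN" keeps the summed log-likelihood; gamma=0.0 drops the margin.
    return -F.logsigmoid(beta * chosen - beta * rejected - gamma).mean()
```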
In-depth comparison between DPO and SimPO
Finally, the team comprehensively compared DPO and SimPO along four axes: (1) likelihood-length correlation, (2) reward construction, (3) reward accuracy, and (4) algorithmic efficiency. The results show that SimPO outperforms DPO in both accuracy and efficiency.
DPO's reward implicitly promotes length normalization. Although the DPO reward expression β log(π_θ(y|x)/π_ref(y|x)) (excluding the partition function) has no explicit length-normalization term, the log-ratio between the policy model and the reference model can implicitly offset the length bias. As shown in Table 6 and Figure 4a, compared with a method without any length normalization (denoted SimPO w/o LN), DPO reduces the Spearman correlation coefficient between the average log-likelihood and response length. However, the correlation remains more strongly positive than with SimPO.
DPO's reward does not match the generation likelihood. There is a gap between DPO's reward and the average log-likelihood metric, which directly affects generation. As shown in Figure 4b, among instances from the UltraFeedback training set where r(x, y_w) > r(x, y_l), almost half of the data pairs have p_θ(y_w | x) < p_θ(y_l | x). In contrast, SimPO directly uses the average log-likelihood (scaled by β) as the reward, completely eliminating this discrepancy.
DPO falls short of SimPO in reward accuracy. Figure 4c compares the reward accuracy of SimPO and DPO, which measures how well their final learned rewards align with the preference labels on a holdout set. SimPO's reward accuracy is higher than DPO's, indicating that SimPO's reward design supports more effective generalization and higher-quality generation.
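Reward accuracy here is simply the fraction of held-out preference pairs for which the learned implicit reward ranks the winning response above the losing one; a minimal sketch of that check (illustrative names) is shown below.

```python
# Illustrative sketch: reward accuracy on a holdout set of preference pairs.
import torch

def reward_accuracy(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> float:
    """chosen_rewards / rejected_rewards: shape (num_pairs,), computed with either
    the DPO implicit reward or the SimPO length-normalized reward."""
    return (chosen_rewards > rejected_rewards).float().mean().item()
```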
SimPO is more memory and compute efficient than DPO. Another major advantage of SimPO is efficiency, since it does not use a reference model. Figure 4d presents the overall runtime and peak per-GPU memory usage of SimPO and DPO in the Llama3-Base setting on 8×H100 GPUs. Compared with the original DPO implementation, SimPO reduces runtime by approximately 20% and GPU memory usage by approximately 10%, thanks to eliminating forward passes through the reference model.
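The source of these savings can be illustrated with a toy sketch (stand-in modules rather than real language models): a DPO-style step performs an extra forward pass through a frozen reference model that must stay resident in GPU memory, while a SimPO-style step touches only the policy model.

```python
# Toy illustration only: the extra forward pass and resident reference model in DPO.
import torch
import torch.nn as nn

policy = nn.Linear(16, 8)                            # stand-in for the policy LM
reference = nn.Linear(16, 8).requires_grad_(False)   # stand-in for the frozen reference LM
x = torch.randn(4, 16)

# DPO-style step: two forward passes; the reference model occupies memory throughout training.
policy_out = policy(x)
with torch.no_grad():
    reference_out = reference(x)

# SimPO-style step: a single forward pass through the policy model only.
policy_out = policy(x)
```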
For more details, please read the original paper.