
Large model + Monte Carlo tree search: one trick brings LLaMa-3 8B's Mathematical Olympiad level close to GPT-4

Jun 18, 2024, 4:46 PM
project MCTSr

Through innovation at the algorithm level, the mathematical problem-solving ability of large language models will continue to improve.

In the past few days, news that Jiang Ping, a 17-year-old vocational school student, ranked 12th worldwide in the qualifiers of the 2024 Alibaba Global Mathematics Competition has been everywhere. At the same time, the results of the competition's AI Challenge show that among all 563 participating AI teams, the highest score was 34 points and the average was 18 points, on par with the average level of the human contestants.

The main weakness of AI in mathematics competitions is logical reasoning; it is difficult for AI to earn full marks on proof questions. This is also a major challenge faced by current large language models (LLMs) such as GPT-4 and LLaMA in tasks that require strategic and logical reasoning.

One important obstacle is the accuracy and credibility of the output. Especially in mathematical contexts, where correctness must be guaranteed, LLMs often hallucinate during reasoning: the output looks plausible on the surface but is actually irrelevant or factually incorrect, ultimately leading to unsound reasoning chains.

Rewriting techniques such as Self-Refine can help mitigate this tendency, but they can still produce misleading or erroneous results on complex real-world mathematical problems.

To address these challenges, researchers from Fudan University and the Shanghai AI Lab proposed MCT Self-Refine (MCTSr), which combines an LLM with the Monte Carlo Tree Search (MCTS) algorithm and focuses on improving LLM performance on complex mathematical reasoning tasks such as Mathematical Olympiad problems.

MCTS is a decision-making tool widely used in artificial-intelligence scenarios that require strategic planning, usually in games and complex problem-solving environments. By combining the systematic exploration capability of MCTS with the Self-Refine and Self-Evaluation capabilities of LLMs, the paper aims to build a more powerful framework for complex reasoning tasks that LLMs currently struggle to solve.


  • Paper address: https://arxiv.org/pdf/2406.07394

  • Project address: https://github.com/trotsky1997/MathBlackBox

However, integrating MCTS with LLMs poses some technical challenges. Traditional MCTS strategies may not fit well with the stochastic and generative nature of LLM outputs, which typically involve an infinite, continuous space of potential actions. This mismatch requires customized expectation calculation and backpropagation methods within the MCTS framework to better accommodate the unique properties of LLMs.

Additionally, the researchers introduced a dynamic pruning strategy that incorporates an improved upper confidence bound (UCB) formula to optimize the exploration-exploitation balance required for effective decision-making in high-stakes tasks.

This research advances the application of LLMs to complex reasoning challenges and lays a foundation for integrating future AI innovations, giving LLM-driven applications stronger decision-making accuracy and reliability in reasoning.

Method Overview

The MCTSr architecture is shown in Figure 1.

[Figure 1: MCTSr architecture]

The MCTSr workflow includes the following stages:

  • Initialization: root nodes are established from model-generated answers and dummy responses to minimize the model's tendency to overfit;

  • Selection: the algorithm uses a value function Q to rank all answers that are not fully expanded, then greedily selects the highest-valued node for further exploration and refinement;

  • Self-Refine: the selected answer a is optimized with the Self-Refine framework. The model first generates feedback m, which guides the refinement process toward an improved answer a';

  • Self-Evaluation: the refined answer is scored to sample a reward value, and its Q value is computed. This involves the model's self-reward feedback plus constraints, such as strict scoring standards and full-score suppression, which ensure the reliability and fairness of the scores;

  • Backpropagation: the value of the refined answer is propagated backward to its parent node and other related nodes to update the tree's value information; if the Q value of any child node changes, the Q value of its parent is updated as well;

  • UCT update: after the Q values of all nodes have been updated, a candidate set C of nodes for further expansion or selection is determined, and the UCT update formula recomputes the UCT values of all nodes in preparation for the next selection stage.

The above stages are iterated until a termination condition T is met.
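To make the workflow concrete, here is a minimal, self-contained Python sketch of this loop. It is not the authors' implementation (see the MathBlackBox repository for that): every `llm_*` function is a hypothetical stub standing in for a real LLM call, and the update rules are simplified to the forms described below.

```python
import math
import random

# --- Hypothetical LLM calls: stubs standing in for real chat-completion requests ---

def llm_answer(problem: str) -> str:
    """Initial (possibly naive) answer used to seed the root node."""
    return f"draft answer to: {problem}"

def llm_feedback(problem: str, answer: str) -> str:
    """Reflective/critical comment m on the current answer a."""
    return f"critique of [{answer}]"

def llm_rewrite(problem: str, answer: str, feedback: str) -> str:
    """Improved answer a' produced under the guidance of m."""
    return f"refined({answer})"

def llm_reward(problem: str, answer: str) -> float:
    """Self-reward sample; capping at 95 illustrates full-score suppression."""
    return min(random.uniform(-100.0, 100.0), 95.0)

class Node:
    def __init__(self, answer: str, parent: "Node | None" = None):
        self.answer = answer
        self.parent = parent
        self.children: list["Node"] = []
        self.rewards: list[float] = []
        self.visits = 0
        self.q = 0.0

    def update_q(self) -> None:
        # Q(a) averages the worst and the mean of the sampled rewards
        self.q = 0.5 * (min(self.rewards) + sum(self.rewards) / len(self.rewards))

def uct(node: Node, c: float = 1.4, eps: float = 1e-8) -> float:
    n_parent = node.parent.visits if node.parent else node.visits
    return node.q + c * math.sqrt(math.log(n_parent + 1) / (node.visits + eps))

def mctsr(problem: str, rollouts: int = 8) -> str:
    root = Node(llm_answer(problem))
    root.rewards.append(llm_reward(problem, root.answer))
    root.update_q()
    all_nodes = [root]
    for _ in range(rollouts):
        # Selection: greedily take the highest-UCT node among all candidates
        node = max(all_nodes, key=uct)
        node.visits += 1
        # Self-Refine: critique m guides the rewrite a -> a'
        m = llm_feedback(problem, node.answer)
        child = Node(llm_rewrite(problem, node.answer, m), parent=node)
        node.children.append(child)
        all_nodes.append(child)
        # Self-Evaluation: sample a reward for a' and compute its Q value
        child.rewards.append(llm_reward(problem, child.answer))
        child.update_q()
        # Backpropagation: blend each ancestor's Q with its best child's Q
        cur = node
        while cur is not None:
            cur.q = 0.5 * (cur.q + max(ch.q for ch in cur.children))
            cur = cur.parent
    return max(all_nodes, key=lambda n: n.q).answer

if __name__ == "__main__":
    print(mctsr("Prove that sqrt(2) is irrational."))
```

Note that, unlike classic game-tree MCTS, each "action" here is a whole rewritten answer, which is why selection ranges over all expandable nodes rather than over legal moves.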

Self-Refine

In the Self-Refine stage, the model optimizes an answer a to problem P through multi-turn refinement prompts. First, the model generates a reflective or critical comment m about answer a. Then, guided by m, the model modifies a to produce an improved version a'. This iterative refinement improves the quality of the model's response.
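One Self-Refine round can be written as a two-step chat exchange, sketched below. The prompt wording is a paraphrase, not the paper's exact prompts, and `chat` is a placeholder for any chat-completion client.

```python
def chat(messages: list[dict]) -> str:
    # Placeholder: wire this to a real chat-completion API of your choice.
    return "stub response"

def self_refine(problem: str, answer: str) -> str:
    # Step 1: generate a reflective or critical comment m about answer a
    m = chat([{"role": "user", "content":
               f"Problem: {problem}\nAnswer: {answer}\n"
               "Point out every weakness or error in this answer."}])
    # Step 2: modify a under the guidance of m to produce the improved a'
    return chat([{"role": "user", "content":
                  f"Problem: {problem}\nAnswer: {answer}\nCritique: {m}\n"
                  "Rewrite the answer so that it fixes the issues above."}])
```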

Self-Evaluation

During the refinement of answers to a mathematical problem P, the Q value of an answer a is defined as the expected quality of further refining a into a better answer. This definition rests on the Markov property of the transition from a to its rewritten form: the next state (the rewritten answer) depends only on the current state (the current answer a) and is independent of earlier states.

In addition, the researchers designed three constraints: prompt constraints, full-score suppression, and repeated sampling. After sampling, the Q value of a is calculated:

$$Q(a) = \frac{1}{2}\left(\min R_a + \frac{1}{|R_a|}\sum_{i=1}^{|R_a|} R_a^i\right)$$

where R_a is the set of reward samples for answer a.
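For illustration, the sampling constraints and this Q computation can be sketched as below; the sample count k, the cap of 95, and the reward range are placeholder values, not the paper's exact constants.

```python
import random

def q_value(sample_reward, k: int = 3, cap: float = 95.0) -> float:
    rewards = []
    for _ in range(k):                    # repeated-sampling constraint
        r = min(sample_reward(), cap)     # full-score suppression
        rewards.append(r)
    # Q(a) = (min(R_a) + mean(R_a)) / 2
    return 0.5 * (min(rewards) + sum(rewards) / len(rewards))

print(q_value(lambda: random.uniform(-100.0, 100.0)))
```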

Backpropagation

After the reward values of all leaf nodes have been sampled and their Q values updated, these changes are propagated upward to parent and ancestor nodes. During this update, if the Q value of any element of the set Children(a) of node a changes, the Q value of node a is also updated. Such propagation ensures that a node's Q value reflects the latest state and evaluation of all its possible children:

$$Q'(a) = \frac{1}{2}\left(Q(a) + \max_{i \in \mathrm{Children}(a)} Q(i)\right)$$

where Q'(a) is the updated quality value of answer a.
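A compact sketch of this bottom-up update, reusing the hypothetical Node fields (q, children, parent) from the workflow sketch above:

```python
def backpropagate(leaf: "Node") -> None:
    # Walk from the refined leaf's parent up to the root,
    # blending each node's Q with the best Q among its children.
    node = leaf.parent
    while node is not None:
        node.q = 0.5 * (node.q + max(child.q for child in node.children))
        node = node.parent
```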

Update UCT and selection

After the Q values of all nodes in the tree are updated, the algorithm enters the next selection phase. This process includes the following steps:

  • Candidate node selection: selection does not need to start from the root node; instead, the nodes of the tree are traversed in hierarchical (level) order to build the candidate set.

  • UCT update: drawing on AlphaGo, the study uses the UCT method with the UCB-1 formula to balance exploration and exploitation; for a node a in the candidate set C, its UCT_a value is:

$$\mathrm{UCT}_a = Q(a) + c\sqrt{\frac{\ln\big(N(\mathrm{father}(a)) + 1\big)}{N(a) + \epsilon}}$$

where N(·) denotes a node's visit count, c is the exploration constant, and ε is a small constant that avoids division by zero.
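A direct transcription of this selection rule in Python; the exploration constant c and the ε term are tunable assumptions, not values reported in the paper.

```python
import math

def uct_value(q: float, n_node: int, n_parent: int,
              c: float = 1.4, eps: float = 1e-8) -> float:
    # UCB-1-style bonus: favors rarely visited nodes under a well-visited parent
    return q + c * math.sqrt(math.log(n_parent + 1) / (n_node + eps))

def select_next(candidates: list):
    # Greedy choice of the highest-UCT node from the candidate set C
    return max(candidates, key=lambda n: uct_value(
        n.q, n.visits, n.parent.visits if n.parent else n.visits))
```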

Termination function

Early termination: the search terminates when the improvement of search results begins to diminish or when consecutive searches produce duplicate results.

Search constraints: The search terminates once the number of expansions reaches a predetermined limit or one or more nodes in the tree satisfy the maximum depth constraint.
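Combining both criteria, a termination test might look like this sketch; every threshold here is an illustrative placeholder rather than a value from the paper.

```python
def should_terminate(recent_answers: list[str], num_expansions: int,
                     depth: int, max_expansions: int = 16,
                     max_depth: int = 8, window: int = 3) -> bool:
    # Search constraints: hard caps on expansions and tree depth
    if num_expansions >= max_expansions or depth >= max_depth:
        return True
    # Early termination: consecutive searches keep producing duplicates
    tail = recent_answers[-window:]
    return len(tail) == window and len(set(tail)) == 1
```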

Experimental results

To evaluate the effectiveness of the MCTSr algorithm in solving mathematical problems, the researchers used LLaMA3-8B as the base model and enhanced it with MCTSr. They compared LLaMA3-8B against GPT-4, Claude 3, and Gemini 1.5-Pro under several setups, including Zero-Shot CoT, Self-Refine, 4-rollout MCTSr, and 8-rollout MCTSr.

The researchers evaluated the above methods on the GSM8K and GSM-Hard test sets (which contain typical and challenging mathematical problems, respectively); the results are shown in Table 1 below.

There is a direct correlation between the number of rollouts and MCTSr's success rate: the success rate rises significantly as iterations increase, especially on the less complex GSM8K. On the more complex GSM-Hard test set, however, a performance ceiling is reached even with more rollouts, indicating that the current strategy has limitations on hard problems.

These results highlight the robustness and potential boundaries of the MCT-Self-refine algorithm, as well as the need for continuous improvement to effectively address more complex challenges.

[Table 1: Results on GSM8K and GSM-Hard]

Table 2 below shows the results of applying the MCT-Self-refine algorithm at different complexity levels on the MATH dataset. The dataset is divided into five difficulty levels, from Level 1 (easiest) to Level 5 (most challenging).

The results show that Level 1 has the highest success rate. After 8 rollouts, MCTSr achieved a success rate of 90.16% and solved 394 of 437 problems. As the number of rollouts increases, the success rate at this level increases significantly.

At the most challenging Level 5 difficulty, MCTSr reached a success rate of 34.06% after 8 rollouts, solving 451 of 1324 problems. This shows that as difficulty keeps increasing, the algorithm's performance is limited in highly complex scenarios.

The overall performance of all levels shows that after 8 rollouts, MCTSr has a cumulative success rate of 58.24%, solving 2912 out of 5000 problems. This success rate is a significant improvement over Zero-Shot CoT’s initial success rate of 24.36%. This shows that the increase in the number of rollouts is consistent with the increase in success rate, emphasizing the effectiveness of the MCT-Self-refine algorithm in improving problem-solving capabilities at different levels of mathematical complexity.

These results also validate the potential of the MCT-Self-refine algorithm in academic and problem-solving contexts, and highlight its scalability and adaptability to problems of different complexity levels in the MATH dataset.

[Table 2: Results on the MATH dataset across difficulty levels]

Table 3 below shows the MCT-Self-refine algorithm tested on three Mathematical Olympiad benchmark datasets: AIME, GAIC Math Odyssey, and OlympiadBench.

AIME: From 2.36% for Zero-Shot CoT (22 problems solved) to 11.79% for MCTSr (110 problems solved).

GAIC Math Odyssey: Success rate increased from 17.22% (67 problems solved) to 49.36% (192 problems solved).

OlympiadBench: Improved from 1.25% on Zero-Shot CoT (16 problems solved) to 7.76% on MCTSr (99 problems solved).

These results confirm the applicability of the MCT-Self-refine algorithm to unseen mathematical problems, indicating its advantages in competitive academic environments such as Olympiads.

[Table 3: Results on AIME, GAIC Math Odyssey, and OlympiadBench]

As Table 4 shows, compared with current closed-source large models, MCTSr can effectively raise the mathematical reasoning capability of a small-parameter open-source model (such as LLaMa-3) to a comparable level.

[Table 4: Comparison with closed-source large models]

For more technical details and experimental results, please refer to the original paper.
