Terence Tao called him an expert after seeing it! Google and others used an LLM to automatically prove theorems and won an outstanding paper award at a top conference. The more complete the context, the better the proof.

Feb 04, 2024 am 09:30 AM

Transformer’s skill tree is getting more and more powerful.

Researchers from the University of Massachusetts, Google, and the University of Illinois at Urbana-Champaign (UIUC) recently published a paper in which they succeeded in automatically generating complete theorem proofs.

Paper address: https://arxiv.org/pdf/2303.04910.pdf

This work, named after Baldur (brother of Thor in Norse mythology), demonstrated for the first time that a Transformer can generate whole proofs, and also showed that the model's earlier proof attempts can be improved when the model is given additional context.

The paper was published at ESEC/FSE (the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering) in December 2023, where it won the Outstanding Paper Award.

As we all know, bugs are inevitable in software. For an average application or website they may not cause much of a problem. However, for the software behind critical systems, such as encryption protocols, medical devices, and space shuttles, we must ensure there are no bugs.

Ordinary code review and testing cannot give this guarantee; it requires formal verification.

For formal verification, ScienceDirect's explanation is:

"the process of mathematically checking that the behavior of a system, described using a formal model, satisfies a given property, also described using a formal model"

Put simply, it applies mathematical analysis: an algorithmic engine builds a model and exhaustively analyzes and verifies the state space of the design under test.

Formal software verification is one of the most challenging tasks for software engineers, but it pays off. For example, CompCert, a C compiler verified with the Coq interactive theorem prover, was the only compiler, on a list that includes the ubiquitous GCC and LLVM, in which a comprehensive study found no bugs.

However, the cost of manual formal verification (writing proofs) is very high: the proof of the C compiler is more than three times as long as the compiler code itself.

Therefore, formal verification itself is a "labor-intensive" task, and researchers are also exploring automated methods.

Existing automated approaches for proof assistants such as Coq and Isabelle train a model to predict one proof step at a time and use that model to search the space of possible proofs.

Baldur brings large language models to this field for the first time: the models are trained on natural language text and code and then fine-tuned on proofs. As a result, Baldur can generate the complete proof of a theorem in one go, rather than one step at a time.

As shown in the figure above, only the theorem statement is given as input to the proof generation model. A proof attempt is then sampled from the model and checked with Isabelle.

If Isabelle accepts the proof attempt without errors, the proof is successful; otherwise, another proof attempt is extracted from the proof generation model.
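The generate-and-check loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `prove`, `sample_proof`, `check_proof`, and the `max_attempts` default are hypothetical names and values, and the toy sampler and checker merely stand in for the language model and for Isabelle.

```python
from typing import Callable, Optional


def prove(
    theorem: str,
    sample_proof: Callable[[str], str],       # hypothetical: wraps the proof generation model
    check_proof: Callable[[str, str], bool],  # hypothetical: wraps a call to the proof assistant
    max_attempts: int = 16,                   # assumed budget, not a value from the article
) -> Optional[str]:
    """Return the first whole-proof attempt the checker accepts, or None."""
    for _ in range(max_attempts):
        attempt = sample_proof(theorem)       # a complete proof, not a single step
        if check_proof(theorem, attempt):
            return attempt                    # accepted without errors: success
    return None                               # attempt budget exhausted


# Toy stand-ins so the sketch runs without a model or a proof assistant.
_candidates = iter(["bad proof 1", "bad proof 2", "good proof"])


def toy_sampler(theorem: str) -> str:
    return next(_candidates)


def toy_checker(theorem: str, proof: str) -> bool:
    return proof == "good proof"


result = prove("theorem T", toy_sampler, toy_checker, max_attempts=3)
print(result)
```

Because the checker is the final judge, an incorrect attempt only costs one more sample; it can never be reported as a proof.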

Baldur is evaluated on a benchmark of 6336 Isabelle/HOL theorems and their proofs, empirically demonstrating the effectiveness of complete proof generation, repair and adding context.

The tool is probably called Baldur because the best automatic proof generation tool to date is called Thor.

Thor has a higher proof rate (57%). It uses a smaller language model combined with a search over the space of possible proofs, predicting the next proof step at each point, whereas Baldur's advantage is its ability to generate complete proofs.

But the brothers Thor and Baldur can also work together, which may increase the proof rate to close to 66%.

Automatically generate complete proofs

Baldur is powered by Minerva, Google's large language model, which was trained on scientific papers and web pages containing mathematical expressions and was then fine-tuned on data about proofs and theorems.

Baldur works together with the Isabelle proof assistant, which checks the generated proofs. Given only a theorem statement, Baldur was able to generate a complete proof almost 41% of the time.

To further improve Baldur's performance, the researchers provided the model with additional context (such as other definitions or theorem statements from the theory file), which raises the proof rate to 47.5%.

This means that Baldur is able to take the context and use it to predict new correct proofs, much as programmers are more likely to fix a bug in a program when they understand the related methods and code.

The following is an example (fun_sum_commute theorem):

This theorem comes from a project called Polynomials in the Archive of Formal Proofs.

The human-written proof distinguishes two cases: the set is finite, or it is not:

So, for the model, the input is the theorem statement, and the target output is this manually written proof.

Baldur recognized the need for induction here and applied a special induction rule called infinite_finite_induct. Its proof follows the same general approach as the human-written one but is more concise.

Because induction is required, Sledgehammer, the automation tool used with Isabelle, cannot prove this theorem by default.

Training

To train the proof generation model, the researchers constructed a new proof generation dataset.

The existing dataset consists of single-proof-step examples: each training example contains the proof state (input) and the next proof step to apply (target).

Since that dataset only contains single proof steps, a new dataset has to be derived from it in order to train the model to predict an entire proof at once.

The researchers extracted the proof steps for each theorem from the dataset and concatenated them to reconstruct the original proof.
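A minimal sketch of that reconstruction step is shown below. The record layout (theorem statement, step index, step text) and the newline separator are assumptions made for illustration; the article does not describe the actual dataset schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (theorem statement, step index, proof step text).
StepExample = Tuple[str, int, str]


def build_whole_proof_dataset(steps: Iterable[StepExample]) -> List[Dict[str, str]]:
    """Group single-step examples by theorem and join the steps into one proof."""
    by_theorem: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for theorem, index, step in steps:
        by_theorem[theorem].append((index, step))

    dataset = []
    for theorem, indexed_steps in by_theorem.items():
        # Restore the original step order before concatenating.
        ordered = [step for _, step in sorted(indexed_steps)]
        dataset.append({"input": theorem, "target": "\n".join(ordered)})
    return dataset


single_steps = [
    ("theorem A", 1, "step a2"),
    ("theorem A", 0, "step a1"),
    ("theorem B", 0, "step b1"),
]
whole_proofs = build_whole_proof_dataset(single_steps)
print(whole_proofs)
```

Each resulting example pairs a theorem statement with its full proof, which is the input/target shape a whole-proof generation model needs.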

Proof repair

Taking fun_sum_commute again as the example, Baldur's first generated proof attempt fails in the proof checker.

Baldur tried to apply induction but failed to first break down the proof into two cases (finite vs. infinite sets). Isabelle returns the following error message:

To derive a proof-repair training example from these strings, the theorem statement, the failed proof attempt, and the error message are concatenated as the input, and the correct human-written proof is used as the target.

The above figure details how the training data is created.

For each theorem in the original training set, the proof generation model samples one proof at temperature 0.

The proof assistant is used to record every failed proof together with its error message, and these records are used to build the new proof-repair training set.

For each original training example, the theorem statement, the (incorrect) candidate proof produced by the proof generation model, and the corresponding error message are concatenated to form the input sequence of the new training example.
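The data-creation procedure above could look roughly like this. It is a sketch under stated assumptions: `generate` and `check` are hypothetical callables standing in for greedy (temperature 0) decoding and for the proof assistant, and the newline separator between the three input parts is an illustrative choice, not the format used in the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple


def build_repair_dataset(
    training_set: List[Tuple[str, str]],          # (theorem statement, human-written proof)
    generate: Callable[[str], str],               # hypothetical: temperature-0 sample from the model
    check: Callable[[str, str], Optional[str]],   # hypothetical: None if accepted, else error message
) -> List[Dict[str, str]]:
    """Turn failed proof attempts into proof-repair training examples."""
    examples = []
    for theorem, human_proof in training_set:
        attempt = generate(theorem)
        error = check(theorem, attempt)
        if error is None:
            continue  # the attempt was accepted, so there is nothing to repair
        examples.append({
            # Input: theorem statement + failed attempt + error message.
            "input": f"{theorem}\n{attempt}\n{error}",
            # Target: the correct human-written proof.
            "target": human_proof,
        })
    return examples


# Toy stand-ins: the "model" always answers the same thing, and the
# "checker" only accepts that answer for theorem B.
toy_training_set = [("theorem A", "proof of A"), ("theorem B", "proof of B")]
repair_examples = build_repair_dataset(
    toy_training_set,
    generate=lambda theorem: "attempt",
    check=lambda theorem, proof: None if theorem == "theorem B" else "error: failed",
)
print(repair_examples)
```

Only theorem A yields a repair example here, because the toy checker accepts the attempt for theorem B.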

Add context

The lines that precede the theorem statement in the theory file are added as additional context, as in the example pictured below:

Baldur's proof generation model with context can make use of this additional information. Strings that appear in the fun_sum_commute theorem statement appear again in this context, so the additional information surrounding them can help the model make better predictions.

Context can be a statement (theorem, definition, proof) or a natural language annotation.

To take advantage of the LLM's available input length, the researchers first added up to 50 statements from the same theory file.

During training, all these statements are first tokenized and then the left side of the sequence is truncated to fit the input length.
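The context construction and left truncation can be illustrated as follows. This is a sketch only: whitespace splitting stands in for the model's real tokenizer, taking the statements closest to the theorem is an assumption about which 50 are chosen, and `build_input` and its parameters are hypothetical names.

```python
from typing import List


def build_input(
    context_statements: List[str],   # statements preceding the theorem in the theory file
    theorem: str,
    max_tokens: int,                 # the model's input length
    max_statements: int = 50,        # "up to 50 statements" from the article
) -> List[str]:
    """Prepend context to the theorem, then truncate the sequence from the left."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    # Assumption: keep the statements closest to the theorem.
    recent = context_statements[-max_statements:]

    tokens: List[str] = []
    for statement in recent + [theorem]:
        tokens.extend(statement.split())   # stand-in for a real tokenizer

    # Left truncation: drop the oldest tokens so the theorem itself survives.
    return tokens[-max_tokens:]


model_input = build_input(
    ["definition a b", "lemma c d"],
    "theorem e f",
    max_tokens=4,
)
print(model_input)
```

Truncating on the left rather than the right means that when the input is too long, the context farthest from the theorem is discarded first and the theorem statement is kept intact.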

The above figure shows the relationship between the proof success rate and the number of proof attempts for the generation model with and without context. The proof generation model with context consistently outperforms the plain generation model.

The graph above shows the ratio of proved theorems to inference cost for models of different sizes and temperatures.

It plots the proof success rate against the number of proof attempts for the generation model and for the 8B and 62B models with context.

The 62B proof generation model with context outperforms the 8B model with context.

However, the authors emphasize that because these experiments are so costly, they could not tune the hyperparameters, and the 62B model might perform better if it were optimized.
