One trick to spot large models that cheat on leaderboards: a PhD student open-sources an AI math "demon-revealing mirror"


WBOY

Nov 17, 2023 pm 12:38 PM

ai data

Nowadays, many large models claim to be good at mathematics. Which ones have real ability, and which ones "cheated" by memorizing the test questions?

Someone has now run a comprehensive test using the newly released questions from this year's Hungarian national high school mathematics final exam, and many models suddenly "showed their true colors".

Look at the green part first: these large models score similarly on the classic math test set GSM8k and on the new exam, and together they form the reference standard.

Now look at the red part: these models score significantly higher on GSM8k than other large models of the same parameter scale, but on the new exam their scores drop sharply, to roughly the level of other models of the same size. The researcher classified them as "suspected or known to have been trained on GSM8k".

After seeing this test, some people said it is time to start evaluating models on questions they have never seen before.

Others think that this kind of test, combined with everyone's hands-on experience of using large models, is currently the only reliable evaluation method.
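The comparison behind the green and red groups can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not Paster's actual test script: it fits a straight line to the reference group's (GSM8k, new exam) scores and flags any model whose new-exam score falls well below what that line predicts. It ignores parameter scale, which the article's comparison does take into account. The names `ModelScore`, `fit_reference_line`, and `flag_suspects`, the 10-point `margin`, and every score below are hypothetical placeholders, not results from the test.

```python
from dataclasses import dataclass


@dataclass
class ModelScore:
    name: str
    gsm8k: float     # GSM8k accuracy, in percent
    new_exam: float  # score on the unseen exam, in percent


def fit_reference_line(reference):
    """Least-squares fit of new_exam = slope * gsm8k + intercept over the reference group."""
    n = len(reference)
    mean_x = sum(m.gsm8k for m in reference) / n
    mean_y = sum(m.new_exam for m in reference) / n
    sxx = sum((m.gsm8k - mean_x) ** 2 for m in reference)
    if sxx == 0:
        raise ValueError("reference models need at least two distinct GSM8k scores")
    sxy = sum((m.gsm8k - mean_x) * (m.new_exam - mean_y) for m in reference)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_suspects(reference, candidates, margin=10.0):
    """Return names of candidates whose new-exam score is more than `margin`
    points below the score predicted from their GSM8k result."""
    slope, intercept = fit_reference_line(reference)
    suspects = []
    for m in candidates:
        expected = slope * m.gsm8k + intercept
        if expected - m.new_exam > margin:
            suspects.append(m.name)
    return suspects


# Hypothetical placeholder scores, made up purely for illustration.
reference = [
    ModelScore("ref-a", 40.0, 20.0),
    ModelScore("ref-b", 60.0, 40.0),
    ModelScore("ref-c", 80.0, 60.0),
]
candidates = [
    ModelScore("cand-x", 75.0, 25.0),  # high on GSM8k, low on the new exam
    ModelScore("cand-y", 50.0, 28.0),  # in line with the reference trend
]

suspects = flag_suspects(reference, candidates)
print(suspects)  # ['cand-x']
```

A large gap is only a hint that a model may have seen the benchmark during training; the threshold is arbitrary and would need to be chosen with the real score distribution in mind.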

Musk's Grok is second only to GPT-4, and the open-source Llemma scores well

The tester, Keiran Paster, is a PhD student at the University of Toronto, a student researcher at Google, and one of the authors of Llemma, one of the large models in the test.

Having large models sit the Hungarian national high school mathematics final exam is a trick that comes from Musk's xAI.

To rule out the possibility that its Grok model had accidentally seen test questions in web data, xAI ran this exam in addition to several common test sets.

This year's exam was only held at the end of May, so current large models have essentially had no chance to see this set of questions.

When it released the model, xAI also published the results of GPT-3.5, GPT-4, and Claude 2 for comparison.

Building on this data, Paster ran further tests on several open-source models with strong mathematical capabilities.

The test questions, the test scripts, and each model's answers are open-sourced on Huggingface, so that anyone can check them and go on to test other models.

The results show that GPT-4 and Claude 2 form the top tier, with very high scores on both GSM8k and the new exam.

This does not mean that no GSM8k questions leaked into the training data of GPT-4 and Claude 2, but at least they generalize well and solve new questions correctly, so that point can be let pass.

Next, Grok-0 (33B) and Grok-1 (parameter count not disclosed) from Musk's xAI both performed well.

Grok-1 has the highest score in the "non-cheating group", and its score on the new exam is even higher than Claude 2's.

Grok-0 comes close to GPT-3.5 Turbo on GSM8k and is slightly worse on the new exam.

Apart from the closed models above, all other models in the test are open source.

The Code Llama series is Meta's own fine-tune built on Llama 2, focused on generating code from natural language. Its mathematical ability now appears slightly worse than that of other models of the same scale.

Building on Code Llama, a number of universities and research institutions jointly launched the Llemma series, which was open-sourced by EleutherAI. The team assembled the Proof-Pile-2 dataset from scientific papers, web data containing mathematics, and mathematical code. After training, Llemma can use tools and do formal theorem proving without any further fine-tuning.

On the new exam, Llemma 34B performs close to the level of GPT-3.5 Turbo.

The Mistral series is trained by the French AI unicorn Mistral AI. Its Apache 2.0 open-source license is more permissive than Llama's, and it has become the most popular base model in the open-source community after the Llama family.

OpenChat 3.5 and MetaMath Mistral are both fine-tuned models from the Mistral ecosystem.

MetaMath and MAmmoTH Code are based on the Code Llama ecosystem.

Anyone planning to adopt open-source large models in real business should be careful to avoid this group: these models may look good mainly because they were tuned to climb the rankings, while their actual capabilities may be weaker than those of other models of the same scale.

Many netizens expressed their gratitude to Paster for this experiment, believing that this is exactly what is needed to understand the actual situation of the model.

Some people have expressed concerns:

From this day on, everyone who trains large models will add previous years' Hungarian math exam questions to their training data.

The same commenter believes the solution may be a specialized large-model evaluation company with proprietary tests.

Another proposal is to establish a test benchmark that is updated year by year, to alleviate the overfitting problem.
