Table of Contents
Contests and Trends
Platform
Academia
Prizes
How to win
Python Packages Used by Winners
Computer Vision
Natural Language Processing
Compute and Hardware
Team Formation
Conclusion

Revealing the secret to victory in the data competition: analyzing the advantages of A100 in 200 games

Apr 21, 2023 pm 07:55 PM

2022 was a big year for AI, and for data competitions too: total prize money across all platforms exceeded $5 million.

Recently, the machine learning competition analysis platform ML Contests published a large-scale survey of the 2022 data competition landscape. The new report reviews everything noteworthy that happened in 2022; what follows is a compilation of the original text.

Highlights:

  • Tool choices of successful contestants: Python, the PyData stack, PyTorch, and gradient-boosted decision trees.
  • Deep learning has not yet replaced gradient-boosted decision trees, though deep learning often adds value when combined with boosting methods.
  • Transformers continue to dominate NLP and are beginning to compete with convolutional neural networks in computer vision.
  • Today’s data competitions cover a wide range of research areas, including computer vision, NLP, data analysis, robotics, time series analysis, etc.
  • Large ensemble models are still common among winning solutions, and some single-model solutions can also win.
  • There are multiple active data competition platforms.
  • The data competition community continues to grow, including in academia.
  • About 50% of the winners are one-person teams, and 50% of the winners are first-time winners.
  • Some winners use high-end hardware, but free resources such as Google Colab are also enough to win.

The competition with the largest prize money was DrivenData’s Snowcast Showdown, sponsored by the U.S. Bureau of Reclamation. Participants competed for $500,000 in prizes, and the contest was designed to help improve water supply management by providing accurate estimates of snow water levels for different regions across the western United States. As always, DrivenData wrote a detailed article on the competition, with a thorough solution report that is well worth a read.

The most popular competition of 2022 was Kaggle’s American Express Default Prediction competition, which asked entrants to predict whether customers would repay their loans. More than 4,000 teams competed, with $100,000 in prize money distributed among the top four teams. It was won by a one-person team entering their first competition, using an ensemble of neural networks and LightGBM models.

The largest independent competition was Stanford University’s AI Audit Challenge, which offered a $71,000 prize pool for the best "models, solutions, datasets, and tools" for detecting unlawfully discriminatory AI systems.

Three large competitions based on financial prediction all ran on Kaggle: JPX’s Tokyo Stock Exchange prediction, Ubiquant’s market prediction, and G-Research’s crypto forecasting.

Comparing research areas, computer vision accounts for the highest proportion of competitions, NLP ranks second, and sequential decision-making problems (reinforcement learning) are on the rise. Kaggle responded to this growth in popularity by introducing simulation competitions in 2020, and AIcrowd also hosts many reinforcement learning competitions. In 2022, 25 of these interactive competitions awarded more than $300,000 in total prize money.

In the official NeurIPS 2022 Real Robot Challenge, participants had to learn to control a three-fingered robot to move a cube to a target location, or to position it at a specific point in space facing the right direction. Participants' policies were run on physical robots every week, with the results updated on a leaderboard. The reward was a $5,000 prize plus the academic honor of speaking at the NeurIPS workshop.

Platform

Although Kaggle and Tianchi are the most familiar names, there are currently many machine learning competition platforms, forming an active ecosystem.

The picture below shows the 2022 platform comparison:

[Figure: 2022 platform comparison]

Give some examples:

  • Kaggle is one of the most established platforms. It was acquired by Google in 2017, has the largest community, and recently reached 10 million registered users. Running a prize competition on Kaggle can be very expensive. In addition to hosting competitions, Kaggle also lets users host datasets, notebooks, and models.
  • CodaLab is an open-source competition platform maintained by Université Paris-Saclay. Anyone can register, host, or participate in a competition. It provides free CPU resources for inference, which competition organizers can supplement with their own hardware.
  • Zindi is a smaller platform with a very active community, focused on connecting institutions with data scientists in Africa.
  • DrivenData focuses on social-impact competitions and has developed competitions for NASA and other organizations; its competitions are always followed by in-depth solution write-ups.
  • AIcrowd started as a research project at the Swiss Federal Institute of Technology in Lausanne (EPFL) and is now one of the top five competition platforms. It hosts several official NeurIPS competitions.


Academia

Most of the prize money for competitions run on the large platforms comes from industry, but machine learning competitions have a much richer history in academia, as Isabelle Guyon discussed in her NeurIPS invited talk this year.

NeurIPS is one of the most prestigious academic machine learning conferences in the world. The most important machine learning papers in the past decade are often presented at the conference, including AlexNet, GAN, Transformer and GPT-3.

NeurIPS first held the Challenges in Machine Learning (CiML) workshop in 2014, and has had an official competition track since 2017. Since then, the number of competitions and the total prize money have continued to grow, reaching nearly $400,000 in December 2022.

Other machine learning conferences also host competitions, including CVPR, ICPR, IJCAI, ICRA, ECCV, PCIC, and AutoML.

Prizes

About half of all machine learning competitions have prize pools of over $10,000. There are undoubtedly many interesting competitions with small prizes, but this report only considers those offering monetary prizes or academic honors. Often, data competitions associated with prestigious academic conferences offer the winners travel grants to attend the conference.

While some competition platforms do tend to have larger prize pools on average than others (see the platform comparison chart), many platforms hosted at least one very large competition in 2022: the top ten competitions by total prize money include contests run on DrivenData, Kaggle, CodaLab, and AIcrowd.

How to win

This survey analyzed the techniques used in winning solutions through questionnaires and inspection of published code.

Quite consistently, Python was the language of choice among competition winners, which should surprise no one. Of those who used Python, about half primarily used Jupyter notebooks, and the other half used standard Python scripts.


The one winning solution written mostly in R came from Amir Ghazi, who won Kaggle's competition to predict the winners of the 2022 US men's college basketball tournament. He did so by using, apparently copied verbatim, the code of the winning solution from the 2018 edition, written by Kaggle Grandmaster Darius Barušauskas. Remarkably, Darius also entered the 2022 competition with a new approach and finished 593rd.

Python Packages Used by Winners

Looking at the packages used in the winning solutions, all winners who used Python relied to some extent on the PyData stack.

The most popular packages fall into three categories: core toolkits, NLP packages, and computer vision packages.


Among them, the deep learning framework PyTorch has grown steadily, and its jump from 2021 to 2022 is striking: PyTorch's share of winning solutions rose from 77% to 96%.

Of the 46 winning solutions using deep learning, 44 used PyTorch as their primary framework and only two used TensorFlow. Even more tellingly, one of the two competitions won with TensorFlow, Kaggle's Great Barrier Reef competition, offered an additional $50,000 in prize money to the top team using TensorFlow. The other TensorFlow win used the high-level Keras API.


While there were three winners using pytorch-lightning and one using fastai, both of which are built on top of PyTorch, the vast majority used PyTorch directly.

It now seems safe to say that, at least in data competitions, PyTorch has won the machine learning framework battle. This is consistent with broader machine learning research trends.

Notably, we found no instances of the winning team using other neural network libraries, such as JAX (built by Google and used by DeepMind), PaddlePaddle (developed by Baidu) or MindSpore (developed by Huawei).

Computer Vision

In tooling there is a clear winner, but in techniques there is not. At CVPR 2022, the ConvNeXt architecture was introduced as the "ConvNet of the 2020s" and shown to outperform recent Transformer-based models. It was used in at least two winning computer vision solutions, and CNNs overall remain the most popular neural network architecture among computer vision competition winners to date.


Computer vision is very similar to language modeling in its use of pre-trained models: well-understood architectures trained on public datasets such as ImageNet. The most popular source is the Hugging Face Hub, accessed through timm, which makes it extremely convenient to load pre-trained versions of dozens of different computer vision models.

The advantages of using pre-trained models are obvious: real-world images and human-generated text share common characteristics, and a pre-trained model brings that common-sense knowledge with it, similar to training on a larger and more general dataset.

Typically, pre-trained models are then fine-tuned, i.e. trained further, on task-specific data (such as the data provided by competition organizers), but not always. The winners of the Image Matching Challenge used pre-trained models without any fine-tuning at all: "Due to the (different) quality of the training and test data in this competition, we did not fine-tune using the provided training data because we thought it would not be very effective." The decision paid off.

So far, the most popular pre-trained computer vision model type among the 2022 winners is EfficientNet, which, as the name suggests, has the advantage of being less resource-intensive than many other models.


Natural Language Processing

Transformer-based models have dominated natural language processing (NLP) since their introduction in 2017. The Transformer is the "T" in BERT and GPT, and is also at the core of ChatGPT.

So it is no surprise that all winning solutions in natural language processing competitions had Transformer-based models at their core, nor that they were all implemented in PyTorch. They all used pre-trained models loaded with Hugging Face's Transformers library, and almost all used Microsoft Research's DeBERTa model, usually deberta-v3-large.

Many of them required large amounts of computing resources. For example, the Google AI4Code winner ran an A100 (80GB) for approximately 10 days to train a single deberta-v3-large for their final solution. That approach, a single main model with a fixed train/evaluation split, was the exception: all the other solutions made heavy use of ensembles, and almost all used some form of k-fold cross-validation. For example, the winner of the Jigsaw Toxic Comments competition used a weighted average of the outputs of 15 models.
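A minimal sketch of the two techniques mentioned above (the models, fold count, and blend weights are illustrative, not from any winning solution): k-fold cross-validation produces out-of-fold predictions for each model, which can then be blended with a weighted average:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def oof_predictions(model, X, y, n_splits=5):
    """Out-of-fold probabilities: each sample is predicted by a model
    that never saw it during training."""
    oof = np.zeros(len(y))
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    return oof

# Two simple classifiers stand in for a real ensemble of large models.
preds_a = oof_predictions(LogisticRegression(max_iter=1000), X, y)
preds_b = oof_predictions(LogisticRegression(C=0.1, max_iter=1000), X, y)

# Weighted average; in practice the weights are tuned on validation scores.
blend = 0.6 * preds_a + 0.4 * preds_b
print(blend.shape)  # (200,)
```

The same pattern scales up: the Jigsaw winner's 15-model blend is, structurally, just this average with more terms.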

Transformer-based ensembles were sometimes used in conjunction with LSTMs or LightGBM, and there were also at least two instances where pseudo-labeling was used effectively in a winning solution.
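Pseudo-labeling, sketched below with a simple classifier (the model and the 0.9 confidence threshold are illustrative), trains on labeled data, predicts labels for unlabeled data, and folds the high-confidence predictions back into the training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_labeled, y_labeled = X[:100], y[:100]   # small labeled set
X_unlabeled = X[100:]                     # labels treated as unknown

# 1. Train on the labeled data only.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Predict the unlabeled data; keep only confident predictions.
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) > 0.9
pseudo_labels = proba.argmax(axis=1)[confident]

# 3. Retrain on labeled + pseudo-labeled data.
X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
y_aug = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print(X_aug.shape[0] == len(y_aug))  # True
```

In competitions the "unlabeled data" is often the test set itself, which is why the technique must be used carefully to avoid leaderboard overfitting.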

XGBoost was once synonymous with Kaggle, but LightGBM was clearly the favorite GBDT library of the 2022 winners: winners mentioned LightGBM in their solution reports or questionnaires as many times as CatBoost and XGBoost combined, with CatBoost second and XGBoost, surprisingly, third.

Compute and Hardware


As one might expect, most winners used GPUs for training. GPUs greatly speed up the training of gradient-boosted trees and are in practice required for deep neural networks. A significant number of winners had access to clusters, often including GPUs, provided by their employer or university.

Somewhat surprisingly, we found no instances of Google's Tensor Processing Units (TPUs) being used to train a winning model. We also saw no winning models trained on Apple's M-series chips, which PyTorch has supported since May 2022.
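For context, choosing between a CUDA GPU, Apple's M-series backend ("mps"), and CPU in PyTorch is a short check per backend (a generic sketch, not drawn from any winning solution):

```python
import torch

def pick_device() -> torch.device:
    """Prefer a CUDA GPU, then Apple Silicon's Metal backend, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # available in PyTorch >= 1.12
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
# Tensors and models are then created on (or moved to) the chosen device.
batch = torch.randn(4, 8, device=device)
print(device.type)
```

The same code runs unchanged on a cloud A100, an M-series laptop, or a CPU-only Colab session, which is part of why hardware choice varies so widely among winners.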

Google's cloud notebook product Colab was popular: one winner used the free tier, one used the Pro plan, and another used Pro+ (we could not confirm which Colab tier the fourth Colab-using winner was on).

Local personal hardware was more popular than cloud hardware, though nine winners mentioned the GPU they trained on without specifying whether it was a local or a cloud GPU.


The most popular GPU was NVIDIA's latest high-end AI accelerator, the A100 (the A100 40GB and A100 80GB are grouped together here, since winners don't always say which they used), and often several at once: for example, the winner of Zindi's Turtle Recall competition used eight A100 (40GB) GPUs, and two other winners each used four A100s.

Team Formation

Many competitions allow up to five entrants per team, and individuals or smaller teams can "merge" into one team at any point before the team merger deadline, which falls ahead of the results submission deadline.

Some competitions allow for larger teams, for example, Waymo’s Open Data Challenge allows up to 10 people per team.


Conclusion

This has been a broad look at the 2022 machine learning competition landscape; we hope you found some useful information in it.

There are many exciting new competitions in 2023, and we look forward to releasing more insights as they wrap up.
