The data-centric AI secrets behind the GPT models
Translator | Zhu Xianzhong
Reviewer | Chonglou
The image is from the article https://www.php.cn/link/f74412c3c1c8899f3c130bb30ed0e363, created by the author.
Artificial intelligence is making incredible progress in changing the way we live, work, and interact with technology. Recently, an area that has made significant progress is the development of large language models (LLMs), such as GPT-3, ChatGPT, and GPT-4. These models can perform tasks such as language translation, text summarization, and question answering with impressive accuracy. While it is hard to ignore the ever-increasing model sizes of large language models, it is equally important to recognize that their success is largely due to the massive amounts of high-quality data used to train them.
In this article, we will provide an overview of recent advances in large language models from a data-centric AI perspective, drawing on the viewpoints in our recent survey papers ([1] and [2], listed at the end of this article) and the corresponding technical resources on GitHub. In particular, we will take a closer look at the GPT models through the lens of data-centric AI, a growing viewpoint in the data science community. We will reveal the data-centric AI concepts behind the GPT models by discussing the three data-centric AI goals: training data development, inference data development, and data maintenance.
Large language models vs. GPT models
An LLM (large language model) is a natural language processing model trained to infer words from context. For example, the most basic function of an LLM is to predict missing tokens given the context. To do this, the LLM is trained to predict the probability of each candidate token from massive amounts of data.
An illustrative example of using context to predict the probabilities of missing tokens with a large language model (image provided by the author)
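To make this concrete, here is a minimal sketch of next-token probability prediction. It assumes the open-source Hugging Face transformers library and the public GPT-2 checkpoint, which serve only as an illustrative stand-in for the much larger GPT models discussed in this article.

```python
# A minimal sketch of next-token probability prediction with an open LLM.
# Assumes the Hugging Face `transformers` library and the public GPT-2 checkpoint;
# this is illustrative only, not the exact setup used by OpenAI's GPT models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "I ate some pizza and it tasted"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Inspect the probability assigned to a few candidate tokens.
for word in [" good", " bad", " purple"]:
    token_id = tokenizer.encode(word)[0]
    print(f"P({word!r} | context) = {next_token_probs[token_id]:.4f}")
```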
The GPT models refer to a series of large language models created by OpenAI, such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT/GPT-4. Like other large language models, the architecture of the GPT models is heavily based on Transformers, which use text and positional embeddings as input and use attention layers to model the relationships between tokens.
GPT-1 model architecture diagram (image from the paper https://www.php.cn/link/c3bfbc2fc89bd1dd71ad5fc5ac96ae69)
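The following toy sketch illustrates what "token and positional embeddings followed by attention layers" means in code. It is a hedged, simplified PyTorch illustration of the Transformer building block, not the actual GPT implementation; all sizes are invented for demonstration.

```python
# A toy sketch of the Transformer building block behind GPT-style models:
# token embeddings + positional embeddings, followed by causal self-attention.
# Purely illustrative; real GPT models stack many such layers with more machinery.
import math
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64

tok_emb = nn.Embedding(vocab_size, d_model)      # token embeddings
pos_emb = nn.Embedding(max_len, d_model)         # positional embeddings
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))

token_ids = torch.randint(0, vocab_size, (1, 10))             # a batch with 10 tokens
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
x = tok_emb(token_ids) + pos_emb(positions)                    # (1, 10, d_model)

q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)          # token-token relationships
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))               # causal mask: no peeking ahead
attn = torch.softmax(scores, dim=-1)
output = attn @ v                                              # contextualized token representations
print(output.shape)  # torch.Size([1, 10, 64])
```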
The later GPT models use an architecture similar to GPT-1 but with more model parameters: more layers, longer context lengths, larger hidden-layer sizes, and so on.
Comparison of the model sizes of the various GPT models (image provided by the author)
What is data-centric artificial intelligence?
Data-centric artificial intelligence is an emerging new way of thinking about how to build artificial intelligence systems. Artificial intelligence pioneer Andrew Ng has been championing this idea.
Data-centric artificial intelligence is the discipline of systematic engineering of the data used to build artificial intelligence systems.
——Andrew Ng
In the past, we mainly focused on creating better models (model-centric AI) while leaving the data largely unchanged. However, this approach can cause problems in the real world because it does not account for the different issues that can arise in the data, such as inaccurate labels, duplicates, and biases. Therefore, "overfitting" a dataset does not necessarily lead to better model behavior.
In contrast, data-centric AI focuses on improving the quality and quantity of data used to build AI systems. This means that attention will be focused on the data itself, while the model is relatively more fixed. A data-centric approach to developing AI systems has greater potential in the real world because the data used for training ultimately determines the model’s maximum capabilities.
It is worth noting that "data-centric" is fundamentally different from "data-driven" because the latter only emphasizes the use of data to guide artificial intelligence development, while AI development often remains centered around developing models rather than engineering data.
Comparison of data-centric AI and model-centric AI (image from the authors of the paper https://www.php.cn/link/f9afa97535cf7c8789a1c50a2cd83787)
Overall, the data-centric AI framework consists of three goals:
- Training data development is the collection and generation of rich, high-quality data to support the training of machine learning models.
- Inference data development is about creating new evaluation sets that provide more granular insight into the model, or engineering data inputs that trigger specific capabilities of the model.
- Data maintenance ensures the quality and reliability of data in a dynamic environment. Data maintenance is critical because real-world data is not created once; it requires ongoing maintenance.
The data-centric AI framework (image from the author of the paper https://www.php.cn/link/f74412c3c1c8899f3c130bb30ed0e363)
Why does data-centric artificial intelligence make the GPT model so successful?
A few months ago, Yann LeCun, a leader in the artificial intelligence field, stated on Twitter that ChatGPT is nothing new. Indeed, none of the techniques used in ChatGPT and GPT-4 (Transformers, reinforcement learning from human feedback, etc.) is new. Yet they achieve incredible results that previous models could not. So what drives their success?
First, training data development. Through better data collection, data labeling, and data preparation strategies, the quantity and quality of the data used to train the GPT models have increased significantly.
- GPT-1: The BooksCorpus dataset was used for training. This dataset contains 4629 MB of raw text, covering books from a range of genres such as adventure, fantasy, and romance.
- Data-centric AI strategy: none.
- Result: fine-tuning GPT-1 on this dataset can improve performance on downstream tasks.
- GPT-2: WebText was used for training. This is an internal OpenAI dataset created by scraping outbound links from Reddit.
- Data-centric AI strategies: (1) curate/filter the data by using only outbound links from Reddit that received at least 3 karma; (2) use the tools Dragnet and Newspaper to extract clean content; (3) apply deduplication and some other heuristic-based cleaning (the details are not given in the paper).
- Result: 40 GB of text was obtained after cleaning. GPT-2 achieves strong zero-shot results without fine-tuning.
- GPT-3: Training was based mainly on Common Crawl.
- Data-centric AI strategies: (1) train a classifier to filter out low-quality documents based on their similarity to WebText, a proxy for high-quality documents; (2) use Spark's MinHashLSH to fuzzily deduplicate documents (a small single-machine sketch of this idea follows this list); (3) augment the data with WebText, book corpora, and Wikipedia.
- Result: 570 GB of text was obtained by filtering 45 TB of plaintext (only 1.27% of the data was selected in this quality filtering). GPT-3 significantly outperforms GPT-2 in the zero-shot setting.
- InstructGPT: Let humans evaluate and adjust GPT-3's answers so that they better match human expectations. Tests were designed for the annotators, and only those who passed were eligible to annotate. A survey was even designed to ensure that the annotators enjoyed the annotation process.
- Data-centric AI strategies: (1) use human-provided answers to prompts to tune the model with supervised training; (2) collect comparison data to train a reward model, and then use this reward model to tune GPT-3 with reinforcement learning from human feedback (RLHF).
- Result: InstructGPT shows better truthfulness and less bias, i.e., better alignment.
- ChatGPT/GPT-4: OpenAI has not disclosed the details. But it is widely understood that ChatGPT/GPT-4 largely follows the design of the previous GPT models, and they still use RLHF to tune the models (possibly with more and higher-quality data/labels). It is generally believed that GPT-4 used an even larger dataset as the model size grew.
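As promised above, here is a small single-machine sketch of fuzzy deduplication with MinHash and locality-sensitive hashing. GPT-3's pipeline used Spark's MinHashLSH at a much larger scale; this sketch instead uses the open-source datasketch library and toy documents, so it only illustrates the idea, not OpenAI's actual pipeline.

```python
# A single-machine sketch of fuzzy deduplication with MinHash + LSH,
# illustrating the idea behind GPT-3's Spark MinHashLSH step.
# Uses the open-source `datasketch` library; not OpenAI's actual pipeline.
from datasketch import MinHash, MinHashLSH

docs = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox jumped over the lazy dog",   # near-duplicate of doc1
    "doc3": "large language models are trained on web text",
}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)   # approximate Jaccard similarity threshold
kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if lsh.query(m):          # a near-duplicate was already kept, so drop this one
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(kept)  # e.g. ['doc1', 'doc3'] -- doc2 is dropped as a near-duplicate
```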
Second, inference data development. Since recent GPT models have become powerful enough, we can achieve various goals by tuning the prompts (i.e., tuning the inference data) while keeping the model fixed. For example, we can perform text summarization by providing the text to be summarized together with an instruction such as "summarize it" or "TL;DR" to steer the inference process. A minimal code sketch of this prompting idea follows the figure below.
Prompt tuning (image provided by the author)
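The sketch below appends a "TL;DR:" style instruction to a passage and lets a model continue the text, which is how such prompts steer inference without any retraining. It assumes the Hugging Face transformers library and uses the small public GPT-2 checkpoint purely as a stand-in; stronger LLMs follow such prompts far more reliably.

```python
# A minimal sketch of steering inference with a prompt instead of retraining.
# GPT-2 is only a small open stand-in here; stronger LLMs follow prompts far better.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = (
    "Large language models are trained on massive text corpora to predict tokens "
    "in context. Their success depends not only on model size but also on the "
    "quantity and quality of the training data."
)

# The instruction "TL;DR:" turns open-ended generation into a summarization-style task.
prompt = article + "\nTL;DR:"
result = generator(prompt, max_new_tokens=40, do_sample=False)
print(result[0]["generated_text"][len(prompt):])
```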
Designing the right inference prompts is a challenging task that relies heavily on heuristics. A good survey summarizes the different prompting methods proposed so far. Sometimes even semantically similar prompts can produce very different outputs. In this case, soft prompt-based calibration may be needed to reduce the discrepancy.
Soft prompt-based calibration (image from the paper https://arxiv.org/abs/2303.13035v1, used with permission from the original authors)
Research on inference data development for large language models is still in its early stages. In the near future, more inference data development techniques that have already been used for other tasks may be applied to large language models.
In terms of data maintenance, ChatGPT/GPT-4, as a commercial product, is not trained once and then left alone; it requires continuous updates and maintenance. Obviously, we cannot know how data maintenance is performed inside OpenAI, so we discuss some general data-centric AI strategies that have likely been, or will be, used for the GPT models:
- Continuous data collection: When we use ChatGPT/GPT-4, our prompts/feedback can in turn be used by OpenAI to further improve their models. Quality metrics and assurance strategies may have been designed and implemented to collect high-quality data in this process.
- Data Understanding Tools: It is possible that various tools have been developed to visualize and understand user data, promote a better understanding of user needs, and guide the future direction of improvement.
- Efficient data processing: With the rapid growth in the number of ChatGPT/GPT-4 users, an efficient data management system is required to enable fast data collection.
As shown in the figure, the ChatGPT/GPT-4 system collects user feedback through the "thumbs up" and "thumbs down" buttons to further drive the development of their system. The screenshot is from https://chat.openai.com/chat. A hypothetical sketch of such feedback collection follows below.
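Since OpenAI has not published how this is done, the following is a purely hypothetical sketch of what continuous feedback collection with a basic quality gate could look like; every name and rule in it is invented for illustration.

```python
# A hypothetical sketch of continuous feedback collection with a basic quality gate.
# OpenAI has not disclosed its actual pipeline; names and rules here are invented.
import json
import time

def record_feedback(prompt, response, rating, path="feedback.jsonl"):
    """Append a thumbs-up/down event to a JSONL file if it passes simple checks."""
    if rating not in {"thumbs_up", "thumbs_down"}:
        return False
    if not prompt.strip() or not response.strip():   # drop empty interactions
        return False
    event = {
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "rating": rating,
    }
    with open(path, "a", encoding="utf8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")
    return True

record_feedback("Summarize this article.", "The article argues ...", "thumbs_up")
```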
What can the data science community learn from this wave of large language models?
The success of large language models has revolutionized artificial intelligence. Going forward, large language models may further revolutionize the data science lifecycle. To this end, we make two predictions:
- Data-centric AI becomes more important. After years of research, model design has become very mature, especially after Transformers. Engineering data becomes the key (or perhaps the only) way to improve AI systems in the future. Furthermore, when the model becomes powerful enough, we do not need to train the model in our daily work. Instead, we only need to design the appropriate inference data (prompt engineering) to elicit knowledge from the model. Therefore, the research and development of data-centric AI will drive future progress.
- Large language models will enable better data-centric AI solutions. Many tedious data science tasks can be performed more efficiently with the help of large language models. For example, ChatGPT/GPT-4 already makes it possible to write working code to process and clean data. Additionally, large language models can even be used to create data for training. For example, recent work has shown that using large language models to generate synthetic data can improve model performance in clinical text mining. A hedged sketch of this idea follows the figure below.
Using a large language model to generate synthetic data for model training (image from the paper https://arxiv.org/abs/2303.04360, used with permission from the original authors)
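Here is a hedged sketch of the general idea: prompt a generator model to produce labeled synthetic examples that could later be mixed into a training set. The labels, prompts, and the use of the small public GPT-2 checkpoint are all illustrative assumptions and are not taken from the cited paper, which used far stronger models and a clinical task.

```python
# A hedged sketch of using an LLM to generate labeled synthetic training data.
# GPT-2 is only a small open stand-in; the cited work used far stronger models.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(0)

labels = ["positive", "negative"]          # hypothetical classification labels
synthetic_dataset = []

for label in labels:
    prompt = f"Write a short {label} movie review:\n"
    outputs = generator(prompt, max_new_tokens=30, num_return_sequences=3, do_sample=True)
    for out in outputs:
        text = out["generated_text"][len(prompt):].strip()
        synthetic_dataset.append({"text": text, "label": label})

print(len(synthetic_dataset))   # 6 synthetic labeled examples
```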
References
I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework and how it benefits large language models in the following papers:
[1] Data-centric Artificial Intelligence: A Survey.
[2] Data-centric AI: Perspectives and Challenges.
Note that we also maintain a GitHub repository of data-centric AI resources, which will be updated regularly.
In future articles, I will delve into the three goals of data-centric AI (training data development, inference data development, and data maintenance) and introduce representative methods.
Translator introduction
Zhu Xianzhong, 51CTO community editor, 51CTO expert blogger, lecturer, computer teacher at a university in Weifang, and a veteran of the freelance programming community.
Original title: What Are the Data-Centric AI Concepts behind GPT Models?, Author: Henry Lai