Table of Contents
1. Do you need pre-training?
2. How to pre-train
3. Experimental results
4. Conclusion

Poverty prepares me

Jun 26, 2023, 08:32 AM


1. Do you need pre-training?


The benefit of pre-training is direct, but the resources it requires are often prohibitive. Suppose there were a pre-training method whose startup needs very little compute, data, and manpower, perhaps no more than one person, one GPU, and a raw domain corpus. After unsupervised data processing and continued pre-training on your own domain, you would obtain zero-shot NLU, NLG, and vector-representation inference capabilities, with vector-representation recall that exceeds BM25. Would you be interested in trying it?



Whether to do something is decided by weighing input against output. Pre-training is a major undertaking: it needs certain prerequisites and resources, plus a sufficient expected benefit, before it is worth launching. The usual prerequisites are, first, enough corpus. Generally, quality is scarcer than quantity, so the quality requirement can be relaxed as long as the quantity is sufficient. Second, the corresponding talent reserve and manpower budget: by comparison, small models are easier to train and hit fewer obstacles, while large models run into far more problems. Last come the computing resources, which should be matched to the scenario and the team; large-memory GPUs are best.

The benefits of pre-training are equally intuitive: migrating the model to your domain directly improves effectiveness, and the degree of improvement is tied to the pre-training investment and the gap between domains. The final benefit is the model improvement multiplied by the business scale.

In our scenario, the data domain differs greatly from the general domain; even the vocabulary needs to be substantially replaced, and the business scale is sufficient. Without pre-training, every downstream task would still require task-specific fine-tuning, so the expected benefit of pre-training is certain. Our corpus is poor in quality but sufficient in quantity. Computing resources are very limited, which the matching talent reserve can compensate for. Under these conditions, the prerequisites for pre-training are already met.

What directly pushed us to start pre-training was the number of downstream models we had to maintain. They consume machines and people: each task needs a large amount of data to train a dedicated model, and the complexity of model management grows sharply. So we explored pre-training, hoping to build a unified pre-training task that benefits all downstream models. This was not accomplished overnight. Maintaining more models also means accumulating more modeling experience; combining the experience of several earlier projects, covering self-supervised learning, contrastive learning, multi-task learning and other models, the method took shape through repeated experiments and iterative fusion.

[Figure: the traditional NLP pipeline paradigm (top) and our proposed paradigm (bottom)]

The top of the figure is the traditional NLP pipeline paradigm: starting from an existing general-purpose pre-trained model, optionally continuing pre-training on the target domain, then collecting a dataset for each downstream task and fine-tuning it separately. This requires a lot of manpower and GPUs to maintain many downstream models and services.

The bottom of the figure is the new paradigm we propose. When migrating to our domain for continued pre-training, we jointly train a language modeling task and a contrastive learning task, so that the resulting model has zero-shot NLU, NLG, and vector-representation capabilities. These capabilities live in one model and can be used on demand. As a result, fewer models need to be maintained; at project start-up the model can be used directly for exploration, and if further fine-tuning is necessary, the amount of data required is also greatly reduced.

2. How to pre-train


[Figure: pre-trained model architecture]

This is our pre-trained model architecture: a Transformer encoder, a decoder, and a vector-representation head.

The pre-training objectives are language modeling and contrastive representation, trained jointly with the loss Total Loss = LM Loss + α · CL Loss, where α is the weight coefficient. Language modeling uses a span-mask objective similar to T5 that decodes only the masked spans, which keeps the decoding length short. The contrastive representation task is similar to CLIP: within a batch, each anchor has one paired positive sample, and all other samples in the batch serve as negatives; a symmetric cross-entropy loss pulls the representations of positive pairs together and pushes the representations of negatives apart. A non-linear vector-representation head sits on top of the encoder, for two reasons: the scenario requires fast vector inference (encoder only), and keeping the two heads far apart reduces conflict between the training objectives. This raises a question: cloze-style masking needs no labels, but where do the pairs of similar samples come from?
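To make the joint objective concrete, here is a minimal PyTorch-style sketch. It assumes a Hugging Face-style T5 model, a hypothetical non-linear projection `repr_head` on top of the encoder, mean pooling over encoder states, and illustrative batch field names; the contrastive part is the CLIP-style symmetric cross-entropy over in-batch positives described above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """CLIP-style symmetric cross-entropy over in-batch positives.

    emb_a, emb_b: (batch, dim) representations of the two sides of each
    positive pair; every other row in the batch serves as a negative.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Pull the diagonal (positive pairs) together and push the rest apart,
    # symmetrically in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_loss(model, repr_head, batch, alpha=1.0):
    """Total Loss = LM Loss + alpha * CL Loss (field names are illustrative)."""
    # Span-mask language modeling: the decoder only reconstructs masked spans.
    lm_out = model(input_ids=batch["masked_ids"], labels=batch["span_targets"])
    lm_loss = lm_out.loss

    # Vector representations come from the encoder only, through a
    # non-linear projection head, so inference can skip the decoder.
    enc_a = model.encoder(input_ids=batch["sent_a_ids"]).last_hidden_state.mean(dim=1)
    enc_b = model.encoder(input_ids=batch["sent_b_ids"]).last_hidden_state.mean(dim=1)
    cl_loss = contrastive_loss(repr_head(enc_a), repr_head(enc_b))

    return lm_loss + alpha * cl_loss
```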

[Figure: positive and negative sentence pairs judged by longest common substring]

Of course, since this is a pre-training method, the sample pairs must be mined by an unsupervised algorithm. The standard way to mine positive samples in information retrieval is the inverse cloze task, which takes several fragments from one document and assumes they are related. Here we split each document into sentences and enumerate sentence pairs, using the longest common substring to decide whether two sentences are related. As shown in the figure with one positive and one negative pair: if the longest common substring is long enough, the pair is judged similar, otherwise it is not. The threshold is up to you; for example, a long sentence might require three Chinese characters (more letters for English text), while a short sentence can be more relaxed.
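A minimal sketch of this mining step, with the longest common substring computed via Python's difflib; the threshold of three characters is only the illustrative value mentioned above, and in practice it could be made length-dependent.

```python
from difflib import SequenceMatcher

def longest_common_substring(s1, s2):
    """Length of the longest common substring of two sentences."""
    match = SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
    return match.size

def mine_positive_pairs(sentences, min_lcs=3):
    """Enumerate sentence pairs within one document and keep those whose
    longest common substring reaches the threshold (illustrative value)."""
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if longest_common_substring(sentences[i], sentences[j]) >= min_lcs:
                pairs.append((sentences[i], sentences[j]))
    return pairs
```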

We use relevance rather than semantic equivalence as the criterion for sample pairs because the two goals conflict. As shown in the figure above, "the cat catches the mouse" and "the mouse catches the cat" have opposite meanings but are still related. Search in our scenario is mainly relevance-oriented. Moreover, relevance is broader than semantic equivalence, and semantic equivalence is better obtained by continued fine-tuning on top of a relevance model.

Some sentences are selected many times while others are never selected, so we cap the frequency with which a sentence can be chosen. For sentences that are never matched, we can duplicate them as their own positive pair, splice them into already selected sentences, or fall back to the inverse cloze task to build positives, as in the sketch below.
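A small sketch of one possible fallback for never-selected sentences, under the assumption that duplicating a sentence or splitting it inverse-cloze style both count as acceptable positives; the length cutoff and probability are arbitrary illustrations.

```python
import random

def fallback_positive(sentence):
    """Build a positive pair for a sentence that was never matched by LCS:
    either split it into an inverse-cloze pair, or pair it with a copy of
    itself.  (Splicing it into a selected sentence is a third option.)"""
    if len(sentence) > 10 and random.random() < 0.5:
        cut = len(sentence) // 2
        return sentence[:cut], sentence[cut:]   # inverse-cloze style positive
    return sentence, sentence                   # duplicate as its own positive
```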


Traditional masking schemes such as SpanBERT sample mask-span lengths from a geometric distribution: short spans have high probability and longer spans have low probability, which suits long sentences. But our corpus is fragmented. Faced with short sentences of only ten or twenty characters, the traditional scheme tends to mask two single characters rather than one two-character word, which is not what we want. So we modified the distribution so that a chosen optimal length has the highest sampling probability and the probability decays for other lengths, like a camel's hump. This camel-hump geometric distribution is more robust in our short-sentence-rich scenario.
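The exact parameterization of the camel-hump distribution is not given here, so the sketch below simply lets the sampling probability decay geometrically on both sides of a chosen peak length; the peak, maximum length, and decay rate are assumptions for illustration.

```python
import numpy as np

def hump_length_probs(max_len=10, peak=3, decay=0.3):
    """A 'camel-hump' alternative to SpanBERT's geometric span lengths:
    probability peaks at `peak` and decays geometrically on both sides."""
    lengths = np.arange(1, max_len + 1)
    probs = decay ** np.abs(lengths - peak)   # symmetric decay around the peak
    return probs / probs.sum()

def sample_span_length(max_len=10, peak=3):
    probs = hump_length_probs(max_len, peak)
    return int(np.random.choice(np.arange(1, max_len + 1), p=probs))
```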

3. Experimental results


We ran a controlled experiment with several variants: GUR-FULL, which uses both language modeling and contrastive vector representation; GUR-LCS, whose sample pairs are not filtered by LCS; GUR-CL, which has no contrastive representation learning and is equivalent to a traditional language model; GUR-LM, which has only contrastive vector representation learning without language modeling and is equivalent to fine-tuning specifically for the downstream task; and NLPC, a word2vec tool from Baidu.

The experiment starts from T5-small and continues pre-training. The training corpora include Wikipedia, Wikisource, CSL, and our own corpus. Our own corpus is crawled from a material library and is of very poor quality; its best part is the titles. Therefore, in the other corpora positive pairs are mined from almost any text pair that passes the filter, while in our own corpus the title is matched against every sentence of the body. GUR-LCS's sample pairs are not selected by LCS; had the pairs not been constructed this way they would have been too poor, and constructed this way, its gap to GUR-FULL becomes much smaller.

[Figure: recall (left) and ranking comparison on retrieval tasks]

We evaluated the models' vector representations on several retrieval tasks. The left panel shows recall: the models trained with contrastive vector representation performed best, outperforming BM25. We also compared ranking, where BM25 wins again. This suggests that the dense model has strong generalization ability while the sparse model is strongly deterministic, and the two complement each other. Indeed, in downstream information-retrieval tasks, dense and sparse models are often used together.
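The article does not specify how the two are combined, but a common approach is a simple weighted score fusion; the sketch below is one such scheme with min-max normalization and an assumed fusion weight, not the authors' method.

```python
import numpy as np

def hybrid_scores(bm25_scores, dense_scores, weight=0.5):
    """Blend normalized BM25 and dense-retrieval scores for one query.
    Both inputs are arrays indexed by candidate document."""
    def min_max(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return weight * min_max(bm25_scores) + (1 - weight) * min_max(dense_scores)

# ranking = np.argsort(-hybrid_scores(bm25, dense))
```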

[Figure: NLU evaluation under zero-shot, few-shot, and full fine-tuning]

The figure above shows NLU evaluation tasks with different training sample sizes. Each task has dozens to hundreds of classes, and performance is measured by accuracy. The GUR model converts the class labels into vectors and assigns each sentence its nearest label. From left to right, the panels show zero-shot, few-shot, and full fine-tuning as the training sample size increases. The rightmost panel, full fine-tuning, reflects the difficulty of each subtask and is also the ceiling for zero-shot and few-shot performance. It shows that the GUR model achieves zero-shot inference on some classification tasks through its vector representation alone, and that its few-shot capability is the most outstanding.
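A minimal sketch of this label-as-vector zero-shot classification, where `encode` stands in for the model's vector-representation head and is a hypothetical helper; cosine similarity is assumed as the distance measure.

```python
import numpy as np

def zero_shot_classify(sentence_vec, label_vecs, label_names):
    """Pick the label whose embedding is closest (cosine) to the sentence."""
    s = sentence_vec / np.linalg.norm(sentence_vec)
    L = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return label_names[int(np.argmax(L @ s))]

# Usage sketch, assuming `encode` wraps the model's vector-representation head:
# labels = ["sports", "finance", "entertainment"]
# pred = zero_shot_classify(encode(text),
#                           np.stack([encode(l) for l in labels]), labels)
```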

[Figure: zero-shot NLG examples from the GUR model]

This is zero-shot performance on NLG. For title generation and query expansion, we mine titles from high-quality traffic, keep the keywords, and randomly mask the non-keywords. Models trained with the language modeling task perform well: this automatic prompting matches the effect of manually constructed targets, offers wider diversity, and can support mass production. The several models that went through the language modeling task perform similarly; the figure above shows examples from the GUR model.
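A rough sketch of how such keyword-preserving inputs might be constructed with T5-style sentinel tokens; the keyword set, masking probability, and tokenization are all assumptions, and the actual keyword mining from high-quality traffic is not shown.

```python
import random

def keyword_mask_prompt(tokens, keywords, mask_prob=0.5):
    """Keep keyword tokens and randomly replace runs of non-keyword tokens
    with T5-style sentinel tokens; the decoder then fills the masked spans."""
    out, sentinel, i = [], 0, 0
    while i < len(tokens):
        if tokens[i] in keywords or random.random() > mask_prob:
            out.append(tokens[i])
            i += 1
        else:
            out.append(f"<extra_id_{sentinel}>")   # T5 sentinel token
            sentinel += 1
            # swallow the whole run of non-keyword tokens into one masked span
            while i < len(tokens) and tokens[i] not in keywords:
                i += 1
    return " ".join(out)

# keyword_mask_prompt(["cheap", "red", "running", "shoes"], {"running", "shoes"})
# -> e.g. "<extra_id_0> running shoes"
```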

4. Conclusion

This article proposes a new pre-training paradigm. The controlled experiments above show that joint training does not create a conflict between the objectives: when the GUR model continues pre-training, it gains vector-representation capability while maintaining its language modeling capability. Pre-train once, then run zero-shot inference everywhere without labeled samples. It is suitable as low-cost pre-training for a business department.


Our training details are recorded in the paper; for specifics, please see the paper and the code (the code version is slightly newer than the paper). We hope to make a small contribution to the democratization of AI. Large and small models each have their own application scenarios. Besides being used directly for downstream tasks, the GUR model can also be combined with large models in a pipeline: first use the small model for recognition, then hand the task to the large model via instructions. The large model can in turn produce training samples for the small model, and the GUR small model can provide vector retrieval for the large model.

The model in the paper is a small one, chosen so that many experiments could be explored; in practice, choosing a larger model brings obvious gains. Our exploration is far from sufficient and further work is needed. If you are interested, contact laohur@gmail.com; we look forward to making progress together with everyone.
