Multifunctional RNA analysis: Baidu team's Transformer-based RNA language model is published in a Nature sub-journal

WBOY

Jun 10, 2024 pm 10:21 PM

Editor | Luoboxin

Pre-trained language models have shown promise in analyzing nucleotide sequences, but building a multifunctional model that performs well across different tasks with a single set of pre-trained weights remains challenging.

The Baidu Big Data Lab (BDL) and Shanghai Jiao Tong University teams developed RNAErnie, an RNA-centered pre-training model based on the Transformer architecture.

The researchers evaluated the model on seven datasets and five tasks, demonstrating RNAErnie’s superiority in both supervised and unsupervised learning.

RNAErnie surpasses the baselines with up to 1.8% higher classification accuracy, 2.2% higher interaction prediction accuracy, and a 3.3% higher F1 score in structure prediction, demonstrating its robustness and adaptability.

The study is titled "Multi-purpose RNA language modeling with motif-aware pretraining and type-guided fine-tuning" and was published on May 13, 2024 in "Nature Machine Intelligence".

RNA plays a key role in the central dogma of molecular biology, responsible for transmitting genetic information in DNA to proteins.

RNA molecules play a vital role in a variety of cellular processes including gene expression, regulation and catalysis. Given the importance of RNA in biological systems, there is a growing need for efficient and accurate analysis methods for RNA sequences.

Traditional RNA sequence analysis relies on experimental techniques such as RNA sequencing and microarrays, but these methods are often costly, time-consuming, and require large amounts of RNA input.

To address these challenges, the Baidu BDL and Shanghai Jiao Tong University teams developed a pre-trained RNA language model: RNAErnie.

RNAErnie

The model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and contains multi-layer, multi-head Transformer blocks, each with a hidden state dimension of 768. Pre-training uses an extensive corpus of approximately 23 million RNA sequences carefully selected from RNAcentral.

The proposed motif-aware pre-training strategy combines base-level masking, subsequence-level masking, and motif-level random masking. This captures subsequence-level and motif-level knowledge and enriches the representation of RNA sequences.
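The three masking granularities can be illustrated with a minimal sketch. It assumes one token per nucleotide, a hypothetical `[MASK]` symbol, and a hypothetical list of motif strings; the masking rates, span lengths, and motif sources used in the paper are not reproduced here.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_base_level(tokens, rate, rng):
    """Mask individual nucleotides independently with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def mask_subsequence_level(tokens, span_len, rng):
    """Mask one contiguous span of `span_len` nucleotides at a random start."""
    if len(tokens) <= span_len:
        return [MASK] * len(tokens)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [MASK] * span_len + tokens[start + span_len:]


def mask_motif_level(tokens, motifs, rng):
    """Mask all positions of one randomly chosen motif occurrence."""
    seq = "".join(tokens)
    hits = [
        (i, len(m))
        for m in motifs
        for i in range(len(seq) - len(m) + 1)
        if seq.startswith(m, i)
    ]
    if not hits:
        return list(tokens)  # no known motif present, leave unchanged
    start, length = rng.choice(hits)
    return tokens[:start] + [MASK] * length + tokens[start + length:]
```

Calling `mask_motif_level(list("AUGGCUACGUAGC"), ["GGCU"], random.Random(0))` masks positions 2 to 5, so the model has to recover the whole motif from its context rather than one base at a time.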

Additionally, RNAErnie adds coarse-grained RNA types to its vocabulary as special tokens and appends the corresponding type token to the end of each RNA sequence during pre-training. This gives the model the potential to discern the distinguishing features of different RNA types, which facilitates domain adaptation to various downstream tasks.
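As a rough illustration of appending a type token, the sketch below tokenizes a sequence per nucleotide and adds one special token at the end. The type list, the bracketed token spelling, and the vocabulary layout are all assumptions made for this example, not the model's actual vocabulary.

```python
RNA_TYPES = ["miRNA", "lncRNA", "tRNA", "rRNA"]  # hypothetical subset of types


def build_vocab(rna_types):
    """Map nucleotide tokens and one special token per RNA type to ids."""
    tokens = ["[PAD]", "[MASK]", "A", "C", "G", "U", "N"]
    tokens += [f"[{t}]" for t in rna_types]
    return {tok: i for i, tok in enumerate(tokens)}


def encode_with_type(sequence, rna_type, vocab):
    """Tokenize per nucleotide and append the type token at the end."""
    tokens = list(sequence.upper()) + [f"[{rna_type}]"]
    return tokens, [vocab[t] for t in tokens]
```

With this layout, `encode_with_type("AUGC", "tRNA", build_vocab(RNA_TYPES))` yields the tokens `A U G C [tRNA]`, so the type is visible to every position through self-attention.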

Illustration: Model overview. (Source: paper)

Specifically, the RNAErnie model consists of 12 Transformer layers. In the motif-aware pre-training stage, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database, using self-supervised learning and motif-aware multi-level random masking.

Illustration: Motif-aware pre-training and type-guided fine-tuning strategy. (Source: paper)

In the type-guided fine-tuning stage, RNAErnie first uses the output embeddings to predict possible coarse-grained RNA types, and then uses the predicted types as auxiliary information to fine-tune the model through task-specific heads.

This approach enables the model to adapt to various RNA types and enhances its utility in a wide range of RNA analysis tasks.
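One plausible reading of this two-step procedure is sketched below: rank the candidate RNA types by predicted probability, run a task head once per top-ranked type, and combine the outputs. The probability-weighted averaging and the `task_head(sequence, rna_type)` callable are assumptions of this sketch; the article does not specify how the predicted types are combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_types(type_logits, type_names, k):
    """Return the k most probable (type, probability) pairs."""
    probs = softmax(type_logits)
    ranked = sorted(zip(type_names, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]


def type_guided_predict(sequence, type_logits, type_names, task_head, k=2):
    """Average task-head scores over the top-k types, weighted by
    renormalized type probability (an assumed combination rule)."""
    candidates = top_k_types(type_logits, type_names, k)
    norm = sum(p for _, p in candidates)
    scores = None
    for rna_type, p in candidates:
        out = task_head(sequence, rna_type)  # hypothetical head: class scores
        weighted = [p / norm * s for s in out]
        if scores is None:
            scores = weighted
        else:
            scores = [a + b for a, b in zip(scores, weighted)]
    return scores
```

Passing the type explicitly to the head keeps the sketch independent of any particular framework; in a real model the type would more likely be injected as a token in the input.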

More specifically, to adapt to distribution shifts between the pre-training dataset and the target domain, RNAErnie leverages domain adaptation to combine the pre-trained backbone with downstream modules in three neural architectures: Frozen Backbone with Trainable Heads (FBTH), Trainable Backbone with Trainable Heads (TBTH), and Stacking for Type-Guided Fine-Tuning (STACK).

In this way, depending on the downstream application, the proposed method can either optimize the backbone and task-specific heads end-to-end, or fine-tune only the task-specific heads on embeddings extracted from the frozen backbone.
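The difference between FBTH and TBTH comes down to which parameters receive gradient updates. The toy sketch below uses a scalar "backbone" and a scalar "head" with a squared-error loss; the model, the learning rate, and the `STRATEGIES` table are illustrative assumptions, and STACK is not covered.

```python
# Which parameter groups are updated under each strategy (names from the article).
STRATEGIES = {
    "FBTH": {"head"},              # frozen backbone, trainable head
    "TBTH": {"backbone", "head"},  # trainable backbone, trainable head
}


def grads_for(params, x, y):
    """Gradients of 0.5 * (head * backbone * x - y)**2 for a toy scalar model."""
    err = params["head"] * params["backbone"] * x - y
    return {
        "backbone": err * params["head"] * x,
        "head": err * params["backbone"] * x,
    }


def sgd_step(params, x, y, strategy, lr=0.1):
    """Apply one SGD step, skipping any parameter the strategy keeps frozen."""
    grads = grads_for(params, x, y)
    trainable = STRATEGIES[strategy]
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }
```

Starting from `{"backbone": 1.0, "head": 1.0}` with `x = 1.0` and `y = 2.0`, an FBTH step moves only the head, while a TBTH step moves both parameters.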

Performance Evaluation

Illustration: RNAErnie captures multi-level ontology patterns. (Source: paper)

The researchers evaluated the method, and the results showed that RNAErnie outperformed existing state-of-the-art methods on seven RNA sequence datasets covering more than 17,000 major RNA motifs, 20 RNA types, and 50,000 RNA sequences.

Illustration: RNAErnie performance on the RNA secondary structure prediction task using the ArchiveII600 and TS0 datasets. (Source: paper)

Evaluation using 30 mainstream RNA sequencing technologies demonstrates RNAErnie's generalization and robustness. The team used accuracy, precision, recall, F1 score, MCC, and AUC as evaluation metrics to ensure a fair comparison of RNA sequence analysis methods.
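For reference, most of these metrics follow directly from the confusion-matrix counts. The sketch below covers the binary case with labels 0 and 1 and returns 0.0 where a denominator vanishes, which is a convention chosen for this example; AUC is left out because it needs ranked scores rather than hard predictions.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Matthews correlation coefficient; 0.0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}
```

MCC is useful alongside accuracy because it stays near zero for a classifier that only predicts the majority class, even on heavily imbalanced data.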

Currently, few studies apply the Transformer architecture with enhanced external knowledge to RNA sequence data analysis. Built from scratch, the RNAErnie framework integrates RNA sequence embedding and self-supervised learning strategies to bring superior performance, interpretability, and generalization potential to downstream RNA tasks.

Additionally, RNAErnie can be adapted to other tasks by modifying its outputs and supervision signals. RNAErnie is publicly available and is an efficient tool for type-guided RNA analysis and advanced applications.

Limitations

Although the RNAErnie model is innovative in RNA sequence analysis, it still faces some challenges.

First, the model is limited in the length of the RNA sequences it can analyze: sequences longer than 512 nucleotides are discarded, potentially overlooking important structural and functional information. Chunking methods developed to handle longer sequences may cause further loss of information about long-range interactions.
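The trade-off can be made concrete with a small sketch. It assumes non-overlapping fixed-size chunks, which is only one possible chunking scheme, and uses a hypothetical helper to check whether two interacting positions still land in the same chunk.

```python
MAX_LEN = 512  # length limit stated in the article


def filter_by_length(seqs, max_len=MAX_LEN):
    """Keep only sequences that fit the model's input limit."""
    return [s for s in seqs if len(s) <= max_len]


def chunk(seq, max_len=MAX_LEN):
    """Split a sequence into consecutive non-overlapping chunks."""
    return [seq[start:start + max_len] for start in range(0, len(seq), max_len)]


def pair_in_same_chunk(i, j, max_len=MAX_LEN):
    """True if positions i and j (0-based) fall inside the same chunk."""
    return i // max_len == j // max_len
```

For a 1,200-nucleotide sequence, `chunk` produces pieces of 512, 512, and 176 nucleotides. A base pair between positions 10 and 600 then spans two chunks, so no single forward pass sees both partners.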

Second, the scope of this study is narrow: it concentrates only on the RNA domain and does not extend to tasks such as RNA-protein prediction or binding site identification. Additionally, the model has difficulty accounting for RNA's three-dimensional structural motifs, such as loops and junctions, which are critical to understanding RNA function.

More importantly, existing post-hoc architecture designs also have potential limitations.

Conclusion

Nonetheless, RNAErnie has great potential to advance RNA analysis. The model demonstrates its versatility and effectiveness as a general solution in different downstream tasks.

In addition, the innovative strategies adopted by RNAErnie are expected to enhance the performance of other pre-trained models in RNA analysis. These findings make RNAErnie a valuable asset, providing researchers with a powerful tool to unravel the complexities of RNA-related research.

Paper link: https://www.nature.com/articles/s42256-024-00836-4
