


Multifunctional RNA analysis: Baidu team's Transformer-based RNA language model published in a Nature sub-journal
Editor | Luoboxin
Pre-trained language models have shown promise in analyzing nucleotide sequences, but building a multifunctional model that performs well across different tasks with a single set of pre-trained weights remains challenging.
The Baidu Big Data Lab (BDL) and Shanghai Jiao Tong University teams developed RNAErnie, an RNA-centered pre-training model based on the Transformer architecture.
The researchers evaluated the model on seven datasets and five tasks, demonstrating RNAErnie’s superiority in both supervised and unsupervised learning.
RNAErnie surpasses the baseline by improving classification accuracy by 1.8%, interaction prediction accuracy by 2.2%, and structure prediction F1 score by 3.3%, demonstrating its robustness and adaptability.
The study is titled "Multi-purpose RNA language modeling with motif-aware pretraining and type-guided fine-tuning" and was published on May 13, 2024 in "Nature Machine Intelligence".
RNA plays a key role in the central dogma of molecular biology, transmitting genetic information from DNA to proteins.
RNA molecules play a vital role in a variety of cellular processes including gene expression, regulation and catalysis. Given the importance of RNA in biological systems, there is a growing need for efficient and accurate analysis methods for RNA sequences.
Traditional RNA analysis relies on experimental techniques such as RNA sequencing and microarrays, but these methods are often costly, time-consuming, and require large amounts of RNA input.
To address these challenges, the Baidu BDL and Shanghai Jiao Tong University teams developed a pre-trained RNA language model: RNAErnie.
RNAErnie
The model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and contains multi-layer, multi-head Transformer blocks, each with a hidden-state dimension of 768. Pre-training uses an extensive corpus of approximately 23 million RNA sequences carefully selected from RNAcentral.
The proposed motif-aware pre-training strategy involves base-level masking, subsequence-level masking and motif-level random masking, which effectively captures subsequence-level and motif-level knowledge and enriches the representation of RNA sequences.
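The three masking granularities can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `[MASK]` token string, the function names, the masking rate, the span length and the toy motif list are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the article does not name it


def mask_span(tokens, start, length):
    """Return a copy of tokens with tokens[start:start+length] masked."""
    out = list(tokens)
    for i in range(start, min(start + length, len(out))):
        out[i] = MASK
    return out


def base_level_mask(tokens, rate, rng):
    """Mask each base independently with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]


def subsequence_level_mask(tokens, span_len, rng):
    """Mask one contiguous window of `span_len` bases at a random position."""
    if len(tokens) <= span_len:
        return [MASK] * len(tokens)
    start = rng.randrange(len(tokens) - span_len + 1)
    return mask_span(tokens, start, span_len)


def motif_level_mask(tokens, motifs, rng):
    """Mask one randomly chosen occurrence of a known motif, if any occurs."""
    seq = "".join(tokens)
    hits = [(i, len(m)) for m in motifs
            for i in range(len(seq) - len(m) + 1) if seq.startswith(m, i)]
    if not hits:
        return list(tokens)  # no motif found: leave the sequence unchanged
    start, length = rng.choice(hits)
    return mask_span(tokens, start, length)


rng = random.Random(0)
tokens = list("AUGGCUACGUAGC")
print(base_level_mask(tokens, 0.15, rng))
print(subsequence_level_mask(tokens, 4, rng))
print(motif_level_mask(tokens, ["GGCU"], rng))  # "GGCU" is a toy motif
```

In this sketch the model would be trained to recover the masked positions; masking whole windows or whole motifs forces it to use wider context than single-base masking does.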
Additionally, RNAErnie treats coarse-grained RNA types as special vocabulary tokens and appends the tag of the corresponding coarse-grained RNA type to the end of each RNA sequence during pre-training. By doing so, the model has the potential to discern unique features of various RNA types, thereby facilitating domain adaptation to various downstream tasks.
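A rough Python sketch of appending a type tag to a tokenized sequence follows. The tag strings, the fallback tag and the function name are hypothetical; the article does not specify the actual vocabulary.

```python
# Hypothetical special tokens for coarse-grained RNA types (illustrative only).
TYPE_TAGS = {"miRNA": "<miRNA>", "lncRNA": "<lncRNA>", "tRNA": "<tRNA>"}
UNKNOWN_TAG = "<unknown_type>"  # assumed fallback when the type is not known


def tokenize_with_type(sequence, rna_type=None):
    """Split a sequence into single-base tokens and append its type tag."""
    tokens = list(sequence.upper())
    tokens.append(TYPE_TAGS.get(rna_type, UNKNOWN_TAG))
    return tokens


print(tokenize_with_type("augcu", "miRNA"))
# ['A', 'U', 'G', 'C', 'U', '<miRNA>']
```

Because the tag occupies a regular position in the input, attention layers can relate every base to the RNA type without any change to the Transformer architecture.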
Specifically, the RNAErnie model consists of 12 Transformer layers. In the motif-aware pre-training stage, RNAErnie is trained on a dataset of approximately 23 million sequences extracted from the RNAcentral database, using self-supervised learning and motif-aware multi-level random masking.
Illustration: Motif-aware pre-training and type-guided fine-tuning strategy. (Source: paper)
In the type-guided fine-tuning stage, RNAErnie first uses the output embeddings to predict possible coarse-grained RNA types, and then uses the predicted types as auxiliary information to fine-tune the model through task-specific heads.
This approach enables the model to adapt to various RNA types and enhances its utility in a wide range of RNA analysis tasks.
More specifically, to adapt to distribution shifts between the pre-training dataset and the target domain, RNAErnie leverages domain adaptation to combine the pre-trained backbone with downstream modules in three neural architectures: Frozen Backbone with Trainable Heads (FBTH), Trainable Backbone with Trainable Heads (TBTH), and Stacking for Type-Guided Fine-Tuning (STACK).
In this way, the proposed method can either optimize the backbone and task-specific heads end-to-end, or use embeddings extracted from the frozen backbone to fine-tune only the task-specific heads, depending on the downstream application.
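The difference between the architectures can be sketched in plain Python. The helper names below are hypothetical, and the weighted averaging over the top-k predicted types in `stack_predict` is an illustrative assumption, not a detail confirmed by the article.

```python
def trainable_params(backbone_params, head_params, mode):
    """Select which parameters an optimizer would update in each mode."""
    if mode == "FBTH":  # frozen backbone: only the task head is trained
        return list(head_params)
    if mode == "TBTH":  # backbone and head are trained end-to-end
        return list(backbone_params) + list(head_params)
    raise ValueError(f"unknown mode: {mode}")


def stack_predict(seq, predict_type_probs, task_head, top_k=3):
    """Type-guided prediction: combine task scores over the top-k RNA types.

    predict_type_probs(seq) -> dict mapping RNA type to probability
    task_head(seq, rna_type) -> numeric task score given an assumed type
    """
    probs = predict_type_probs(seq)
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in top)
    if total == 0:
        raise ValueError("type predictor returned no probability mass")
    # Weight each type-conditioned score by its renormalized type probability.
    return sum(p / total * task_head(seq, t) for t, p in top)


# Toy stand-ins for the type predictor and the task head.
fake_types = lambda seq: {"miRNA": 0.6, "tRNA": 0.3, "rRNA": 0.1}
fake_head = lambda seq, t: 1.0 if t == "miRNA" else 0.0
print(trainable_params(["b1", "b2"], ["h1"], "FBTH"))
print(stack_predict("AUGC", fake_types, fake_head, top_k=2))
```

The trade-off is the usual one: freezing the backbone is cheaper and less prone to overfitting on small datasets, while end-to-end training lets the representation itself adapt to the target task.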
Performance Evaluation
Illustration: RNAErnie captures multi-level ontology patterns. (Source: paper)
The researchers evaluated the method, and the results showed that RNAErnie outperformed existing state-of-the-art methods on seven RNA sequence datasets covering more than 17,000 major RNA motifs, 20 RNA types, and 50,000 RNA sequences.
Illustration: RNAErnie performance on RNA secondary structure prediction task using ArchiveII600 and TS0 datasets. (Source: Paper)
Evaluation against 30 mainstream RNA sequencing technologies demonstrates RNAErnie's generalization and robustness. The team used accuracy, precision, recall, F1 score, MCC, and AUC as evaluation metrics to ensure a fair comparison of RNA-seq analysis methods.
Currently, few studies apply the Transformer architecture with enhanced external knowledge to RNA-seq data analysis. Trained from scratch, the RNAErnie framework integrates RNA sequence embedding and self-supervised learning strategies to bring superior performance, interpretability, and generalization potential to downstream RNA tasks.
Additionally, RNAErnie can be adapted to other tasks by modifying the outputs and supervision signals. RNAErnie is publicly available and is an efficient tool for type-guided RNA analysis and advanced applications.
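For reference, most of the metrics listed above follow directly from the confusion matrix of a binary classifier. The sketch below uses the standard textbook definitions; the function name and the toy labels are illustrative, and AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

MCC is useful alongside accuracy because it stays informative when the classes are imbalanced, which is common in RNA classification datasets.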
Limitations
Although the RNAErnie model is innovative in RNA sequence analysis, it still faces some challenges.
First, the model is limited by the length of the RNA sequences it can analyze: sequences longer than 512 nucleotides are discarded, potentially overlooking important structural and functional information. Chunking methods developed to handle longer sequences may result in further loss of information about long-range interactions.
Second, the scope of this study is narrow: it is limited to the RNA domain and does not extend to tasks such as RNA-protein prediction or binding site identification. Additionally, the model has difficulty accounting for RNA's three-dimensional structural motifs, such as loops and junctions, which are critical to understanding RNA function.
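The two ways of handling the length limit can be sketched as follows. The 512-nucleotide threshold comes from the article; the function names and the optional `overlap` parameter are assumptions made for illustration.

```python
MAX_LEN = 512  # maximum sequence length mentioned in the article


def filter_by_length(sequences, max_len=MAX_LEN):
    """Keep only sequences that fit in the model's input window."""
    return [s for s in sequences if len(s) <= max_len]


def chunk_sequence(seq, max_len=MAX_LEN, overlap=0):
    """Split a long sequence into windows of at most `max_len` bases.

    Consecutive windows share `overlap` bases. Any interaction between
    bases that never fall in the same window is invisible to the model.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not seq:
        return []
    step = max_len - overlap
    return [seq[i:i + max_len]
            for i in range(0, max(len(seq) - overlap, 1), step)]


long_seq = "A" * 1200
print([len(c) for c in chunk_sequence(long_seq)])  # [512, 512, 176]
print(len(filter_by_length(["A" * 512, "A" * 513])))  # 1
```

With non-overlapping windows, a base pair between position 10 and position 1,100 spans the first and third chunks, so no single forward pass sees both partners. This is the long-range information loss described above.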
More importantly, existing post-hoc architecture designs also have potential limitations.
Conclusion
Nonetheless, RNAErnie has great potential to advance RNA analysis. The model demonstrates its versatility and effectiveness as a general solution in different downstream tasks.
In addition, the innovative strategies adopted by RNAErnie are expected to enhance the performance of other pre-trained models in RNA analysis. These findings make RNAErnie a valuable asset, providing researchers with a powerful tool to unravel the complexities of RNA-related research.
Paper link: https://www.nature.com/articles/s42256-024-00836-4