
This audio went viral on the Internet! Generate realistic sound effects from text and pictures with one click, AIGC is coming to the audio industry

Apr 12, 2023, 06:25 PM
ai technology

Recently, AIGC has been trending, and its popularity remains high. Beyond its name recognition, its breakthroughs are genuinely remarkable: images, videos and even 3D models can be generated automatically from natural language input. Surprising, isn't it?

In audio and sound effects, however, AIGC has delivered somewhat less. This is mainly because high-degree-of-freedom audio generation depends on large amounts of paired text-audio data, and modeling long waveforms poses many difficulties. To address these difficulties, Zhejiang University and Peking University jointly proposed an innovative text-to-audio generation system, Make-An-Audio. It takes a natural language description as input, also accepts input in other modalities (such as text, audio, image and video), and outputs audio sound effects that match the description. Netizens have found its controllability and generalization hard not to like.


  • Paper link: https://arxiv.org/abs/2301.12661
  • Project link: https://text-to-audio.github.io

In just two days, the demo video received 45K views on Twitter.

After New Year's Eve 2023, a wave of audio synthesis papers emerged, including Make-An-Audio and MusicLM, with 4 breakthrough developments within 48 hours.

User comment 1

Many netizens have said that AIGC sound effect synthesis will change the future of film and short video production.

User comment 2

User comment 3

Other netizens sighed: "audio is all you need..."

User comment 4

Audio effect display

Without further ado, here are the results.

Generate sound effects based on text

Generating sound from text turns out to be this convenient and smooth.

Text 1: a speedboat running as wind blows into a microphone

Generated audio 1 (00:09)

Text 2: fireworks pop and explode

Generated audio 2 (00:09)

Have you ever struggled to repair damaged audio? With the Make-An-Audio model, this becomes much easier.

Before repair

Audio before repair (00:09)

After repair

Audio after repair (00:09)

Generate sound effects from pictures

Generating sound effects by understanding a picture is not impossible either.

Picture 1

Audio generated from picture 1 (00:09)

Picture 2

Audio generated from picture 2 (00:09)

Generate sound effects from video

The model can also easily generate sound effects that match video content.

Video 1

Audio generated from video 1 (00:09)

Video 2

Audio generated from video 2 (00:09)

Internal Technical Principles of the Model

Understanding what makes this viral model work means going back to an objective problem: paired audio and natural language data is sparse. To address this, Zhejiang University and Peking University, together with the Huoshan Voice (Volcano Voice) team, jointly proposed the Distill-then-Reprogram text enhancement strategy. It uses teacher models to obtain natural language descriptions of audio, and then builds dynamic training samples through random recombination.

Specifically, in the Distill stage, audio-to-text and audio-text retrieval models are used to find natural language description candidates for audio that has no language annotation. The matching similarity between each candidate text and the audio is computed, and the best-scoring candidate that clears a threshold is kept as the audio's description. This approach generalizes well, and because the descriptions are real natural language, it avoids out-of-domain text at test time. "In the Reprogram phase, the team randomly sampled from additional event data sets and combined them with the current training samples to obtain new concept combinations and descriptions to increase the model's robustness to different event combinations," the research team said.

Distill-then-Reprogram text enhancement strategy framework diagram
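The two stages described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the team's implementation: the function names `distill` and `reprogram`, the dictionary layout, the " and " joining template and the threshold value are all hypothetical, and the similarity scores are assumed to come from a text-audio matching model that is not shown.

```python
import random


def distill(candidates, threshold):
    """Return the best-matching candidate caption, or None if it misses the threshold.

    `candidates` is a list of (caption, similarity) pairs. The similarity scores
    are assumed to come from some text-audio matching model (not shown here).
    """
    if not candidates:
        return None
    caption, score = max(candidates, key=lambda pair: pair[1])
    return caption if score >= threshold else None


def reprogram(sample, event_db, rng, max_extra=2):
    """Combine a training sample with randomly drawn events into a new sample.

    `event_db` must be non-empty. The " and " template is a placeholder chosen
    for this sketch, not a template taken from the paper.
    """
    k = rng.randint(1, min(max_extra, len(event_db)))
    extras = rng.sample(event_db, k)
    return {
        "clips": [sample["clip"]] + [e["clip"] for e in extras],
        "text": " and ".join([sample["text"]] + [e["label"] for e in extras]),
    }


if __name__ == "__main__":
    # Hypothetical candidates for one unlabeled clip.
    candidates = [("a dog barks", 0.71), ("a cat meows", 0.32)]
    caption = distill(candidates, threshold=0.5)

    # Hypothetical extra event data set used for recombination.
    events = [
        {"clip": "rain.wav", "label": "rain falls"},
        {"clip": "bell.wav", "label": "a bell rings"},
    ]
    sample = {"clip": "dog.wav", "text": caption}
    print(reprogram(sample, events, random.Random(0)))
```

Because `reprogram` draws new events each time it is called, the same base clip yields different concept combinations across training steps, which is the sense in which the samples are dynamic.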

Self-supervised learning has been successfully transferred from images to audio spectrograms. As the system framework diagram shows, the team uses a spectrogram autoencoder to handle long audio sequences and a Latent Diffusion generative model to predict the self-supervised representation, which avoids predicting long waveforms directly.

Make-An-Audio model system framework diagram
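To make the idea of diffusion in a latent space concrete, the sketch below applies the standard forward noising step of a denoising diffusion model to a short latent vector, standing in for the compressed spectrogram representation. The article does not give the model's schedule or dimensions, so the linear schedule, its endpoints, the step count and the function names are assumptions for illustration only.

```python
import math
import random


def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances; the endpoints are assumed defaults."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * step for i in range(steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    zt = [a * z + b * e for z, e in zip(z0, eps)]
    return zt, eps  # eps is what a denoiser would be trained to predict


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_betas(1000))
    latent = [0.5, -1.0, 2.0]  # placeholder for an encoded spectrogram
    noisy, noise = add_noise(latent, 500, alpha_bars, random.Random(0))
    print(noisy)
```

The point of working on the autoencoder's latent rather than on raw samples is that the vector being noised and denoised is far shorter than the waveform it represents.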

In addition, the team explored strong text conditioning strategies, including Contrastive Language-Audio Pretraining (CLAP) and the language models T5 and BERT, and verified that the CLAP text representation is both effective and computationally friendly. The team also used CLAP Score for the first time to evaluate generated audio; it measures the consistency between the text and the generated scene. Using a combination of subjective and objective evaluation, the team verified the model's effectiveness on benchmark datasets and demonstrated excellent zero-shot generalization.

Subjective and objective evaluation results for Make-An-Audio and baseline models
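The article does not give a formula for CLAP Score. A common way to compare two embeddings from a contrastively trained model is cosine similarity, so the sketch below assumes that form. The function name `clap_style_score` and the hard-coded vectors are hypothetical stand-ins for the outputs of a text encoder and an audio encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def clap_style_score(text_embedding, audio_embedding):
    """Assumed scoring rule: higher means the audio matches the text better."""
    return cosine_similarity(text_embedding, audio_embedding)


if __name__ == "__main__":
    # Placeholder embeddings; real ones would come from the two encoders.
    text_vec = [0.2, 0.9, 0.1]
    matching_audio = [0.25, 0.85, 0.05]
    unrelated_audio = [0.9, -0.1, 0.4]
    print(clap_style_score(text_vec, matching_audio))
    print(clap_style_score(text_vec, unrelated_audio))
```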

What are the model's application prospects?

Overall, the Make-An-Audio model achieves high-quality, highly controllable audio synthesis. It also proposes "No Modality Left Behind": by fine-tuning the text-conditional audio model, it unlocks audio synthesis from any input modality (audio/image/video).

Make-An-Audio implements highly controllable X-to-audio AIGC synthesis for the first time, where X can be text, audio, image or video

For visually guided audio synthesis, Make-An-Audio uses the CLIP text encoder as its condition and exploits CLIP's joint image-text space, so it can synthesize audio directly from an image encoding.

Make-An-Audio vision-to-audio synthesis framework diagram

It is foreseeable that audio synthesis AIGC will play an important role in film dubbing, short video creation and other fields. With models such as Make-An-Audio, everyone may in the future be able to act as a professional sound effects engineer, using text, video and images to synthesize lifelike audio and sound effects anytime, anywhere. However, Make-An-Audio is not perfect at this stage. Perhaps because of its varied data sources and unavoidable sample quality issues, training can produce side effects such as generated audio that does not match the text. Make-An-Audio is technically positioned as a tool to assist artists in creation. One thing is certain: the progress in the AIGC field is indeed surprising.

Huoshan Voice has long provided ByteDance's major business lines with globally competitive AI voice technology and full-stack voice product solutions, including audio understanding, audio synthesis, virtual digital humans, dialogue interaction, music retrieval and intelligent hardware. Since its establishment in 2017, the team has focused on developing industry-leading AI voice technology and on exploring efficient combinations of AI and business scenarios to deliver greater user value. Its speech recognition and speech synthesis now cover multiple languages and dialects. Many of its technical papers have been accepted at top AI conferences, and it provides voice capabilities for Douyin, Jianying, Feishu, Tomato Novels, Pico and other businesses. Its technology serves scenarios such as short video, live streaming, video creation, office work and wearable devices, and is open to external companies through Volcano Engine.
