


This audio went viral online! Generate realistic sound effects from text and pictures with one click: AIGC comes to the audio industry
Recently, AIGC has been a trending topic, and its popularity remains high. Beyond its name recognition, its breakthroughs are remarkable: images, videos, and even 3D models can be generated automatically from natural language input. Surprising, isn't it?
In audio and sound effects, however, AIGC has made less headway, mainly because high-degree-of-freedom audio generation relies on large amounts of text-audio paired data, and modeling long waveforms poses many difficulties. To address these problems, Zhejiang University and Peking University jointly proposed an innovative text-to-audio generation system, Make-An-Audio. It takes a natural language description as input, accepts input in any modality (text, audio, image, video, and so on), and outputs audio sound effects that match the description. Its controllability and generalization have been hard for netizens to ignore.
- Paper link: https://arxiv.org/abs/2301.12661
- Project link: https://text-to-audio.github.io
In just two days, the demo video received 45K views on Twitter.
Since the start of 2023, a wave of audio synthesis work has emerged, including Make-An-Audio and MusicLM, with four breakthrough developments within 48 hours.
Many netizens have said that AIGC sound effect synthesis will change the future of film and short video production.
Others remarked: "audio is all you need..."
Audio effect display
Without further ado, let's look at the results.
Generate sound effects based on text: it turns out it can be this convenient and smooth.
Text 1: a speedboat running as wind blows into a microphone
[Audio 1, 00:09]
Text 2: fireworks pop and explode
[Audio 2, 00:09]
Have you ever been troubled by repairing damaged audio? With the Make-An-Audio model, this becomes much easier.
[Audio before repair, 00:09]
[Audio after repair, 00:09]
Understanding pictures to generate sound effects is not impossible either.
[Picture 1] [Audio generated from picture 1, 00:09]
[Picture 2] [Audio generated from picture 2, 00:09]
Generating corresponding sound effects from video content is also something this model can do easily.
[Video 1] [Audio generated from video 1, 00:09]
[Video 2] [Audio generated from video 2, 00:09]
To understand what makes this "Internet celebrity" model work, we have to go back to an objective problem: paired audio and natural language data is sparse. To address it, the Volcano Voice (Huoshan Voice) team, together with Zhejiang University and Peking University, proposed the Distill-then-Reprogram text enhancement strategy, which uses teacher models to obtain natural language descriptions of audio and then builds dynamic training samples through random recombination.
Specifically, in the Distill stage, audio-to-text and audio-text retrieval models are used to find natural language description candidates for audio that has no language annotation. The matching similarity between each candidate text and the audio is computed, and the best-matching candidate, subject to a threshold, is taken as the description of the audio. This approach generalizes well, and using real natural language avoids out-of-domain text at test time. "In the Reprogram stage, the team randomly samples from additional event datasets and combines the samples with the current training sample to obtain new concept combinations and descriptions, increasing the model's robustness to different event combinations," the research team said.
Internal technical principles of the model
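The Distill and Reprogram steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: the similarity scores are taken as given (in practice they would come from an audio-text retrieval model), "subject to a threshold" is read as "keep the best candidate only if its score reaches the threshold", audio is modelled as a plain list of floats, and the names `distill`, `reprogram`, and the connective "then" are hypothetical.

```python
import random


def distill(candidates, threshold):
    """Return the best-scoring caption candidate, or None if it misses the threshold.

    `candidates` is a list of (caption, similarity) pairs. The similarity
    values are assumed inputs; no retrieval model is run here.
    """
    if not candidates:
        return None
    caption, score = max(candidates, key=lambda pair: pair[1])
    return caption if score >= threshold else None


def reprogram(sample, event_bank, rng, max_extra=2):
    """Combine one training sample with randomly drawn event clips.

    `sample` is (audio, caption); `event_bank` is a non-empty list of
    (audio, label) pairs. Clips are concatenated and labels are appended
    to the caption with a fixed connective (both are assumptions).
    """
    audio, caption = sample
    count = rng.randint(1, min(max_extra, len(event_bank)))
    for extra_audio, label in rng.sample(event_bank, k=count):
        audio = audio + extra_audio          # new list, input is not mutated
        caption = f"{caption} then {label}"
    return audio, caption


if __name__ == "__main__":
    rng = random.Random(0)
    best = distill([("a dog barks", 0.71), ("rain falls", 0.35)], threshold=0.5)
    bank = [([0.1, 0.2], "a bell rings"), ([0.3], "a door slams")]
    print(reprogram(([0.0, 0.0], best), bank, rng))
```

Because `reprogram` draws a fresh random combination each time it is called, the same base sample can yield different audio-caption pairs across epochs, which is one way to read "dynamic training samples".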
Figure: Distill-then-Reprogram text enhancement strategy framework
As the figure above shows, self-supervised learning has been successfully transferred from images to audio spectrograms. The model uses a spectrogram autoencoder to handle long audio sequences, and a latent diffusion generative model predicts the self-supervised representation, which avoids predicting long waveforms directly.
Figure: Make-An-Audio model system framework
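To make the latent diffusion idea more concrete, the sketch below shows the standard DDPM forward noising step applied to a short latent vector, the kind of compact representation a spectrogram autoencoder would produce in place of a raw waveform. It is a generic textbook formulation, not the paper's code: the linear beta schedule, its endpoint values, the step count, and the three-element latent are all illustrative assumptions.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    Requires num_steps >= 2. The schedule endpoints are assumed values.
    """
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps.

    Returns the noisy latent and the noise that was added; the noise is
    what a denoising network would typically be trained to predict.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * z + noise_scale * e for z, e in zip(latent, noise)]
    return noisy, noise


if __name__ == "__main__":
    rng = random.Random(0)
    bars = alpha_bar_schedule(1000)
    latent = [1.0, -1.0, 0.5]  # stand-in for an encoded spectrogram
    for step in (0, 499, 999):
        noisy, _ = add_noise(latent, bars[step], rng)
        print(step, [round(v, 3) for v in noisy])
```

The point of working in the latent space is that the diffusion model handles a sequence far shorter than the waveform, and a separate decoder turns the denoised latent back into audio.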
In addition, the team explored strong text-conditioning strategies, including Contrastive Language-Audio Pretraining (CLAP) and language models such as T5 and BERT, and verified that CLAP text representations are both effective and computationally friendly. CLAP Score was also used for the first time to evaluate generated audio; it measures the consistency between the text and the generated scene. Using a combination of subjective and objective evaluation, the team verified the model's effectiveness on benchmark datasets, demonstrating excellent zero-shot generalization.
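A text-audio consistency score of this kind can be illustrated with a cosine similarity between a text embedding and an audio embedding. The sketch assumes, by analogy with CLIP-style scores, that CLAP Score is a similarity in a shared embedding space; the source does not give the exact formula. The embeddings here are hand-written placeholder vectors, and `mean_alignment_score` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_alignment_score(pairs):
    """Average similarity over (text_embedding, audio_embedding) pairs."""
    scores = [cosine_similarity(text, audio) for text, audio in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder embeddings; a real evaluation would obtain them from
    # pretrained text and audio encoders.
    pairs = [
        ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
        ([0.0, 1.0, 0.0], [0.1, 0.9, 0.2]),
    ]
    print(round(mean_alignment_score(pairs), 3))
```

A higher average indicates that generated audio lies closer to its prompt in the shared space, which is why such a score can complement subjective listening tests.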
What are the application prospects of this remarkable model?
Overall, Make-An-Audio achieves high-quality, highly controllable audio synthesis and proposes "No Modality Left Behind": fine-tuning the text-conditioned audio model to unlock audio synthesis from input in any modality (audio, image, or video).
For visually guided audio synthesis, Make-An-Audio is conditioned on the CLIP text encoder and, by exploiting CLIP's joint image-text space, can synthesize audio directly from image encodings.
It is foreseeable that AIGC audio synthesis will play an important role in film dubbing, short video creation, and other fields. With the help of models such as Make-An-Audio, everyone may be able to become a professional sound effects engineer, using text, video, and images to synthesize lifelike audio and sound effects anytime, anywhere. However, Make-An-Audio is not perfect at this stage. Perhaps because of the varied data sources and unavoidable sample quality issues, training inevitably produces side effects, such as generated audio that does not match the text. Make-An-Audio is therefore technically positioned as "assisted artist generation". One thing is certain: progress in the AIGC field is indeed surprising.
Huoshan Voice (Volcano Voice) has long provided ByteDance's major business lines with globally competitive AI voice technology and full-stack voice product solutions, including audio understanding, audio synthesis, virtual digital humans, dialogue interaction, music retrieval, and intelligent hardware. Since its establishment in 2017, the team has focused on developing industry-leading AI voice technology and on exploring how to combine AI efficiently with business scenarios to deliver greater user value. Its speech recognition and speech synthesis now cover multiple languages and dialects, and many of its technical papers have been accepted at top AI conferences. The team provides voice capabilities for Douyin, Jianying, Feishu, Tomato Novels, Pico, and other businesses, serving scenarios such as short videos, live broadcasts, video creation, office work, and wearable devices, and its technology is available to external companies through the Volcano Engine.