The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed 'Almighty Diffusion'-AI-php.cn

Table of Contents

Network Framework

Performance

Effect display

Summary

Team Introduction

Home

Technology peripherals

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed 'Almighty Diffusion'

王林

Apr 11, 2023 pm 07:30 PM

ai Model

Recent advances in Diffusion models set an impressive milestone in many generative tasks. Attractive works such as DALL·E 2, Imagen, and Stable Diffusion (SD) have aroused great interest in academia and industry.

However, although these models perform amazingly, they are basically focused on a certain type of task, such as generating images from given text. For different types of tasks, it is often necessary to Dedicated training alone, or building a new model from scratch.

So can we build an "all-round" Diffusion based on previous models to achieve the unification of the AIGC model? Some people are trying to explore in this direction and have made progress.

This joint team from the University of Illinois at Urbana-Champaign and the University of Texas at Austin is trying to extend the existing single-stream Diffusion into a multi-stream network, called Versatile Diffusion ( VD), the first unified multi-stream multi-modal diffusion framework and a step towards general generative artificial intelligence.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

##Paper address: https://arxiv.org/abs/2211.08332

In addition to the ordinary text generation function, Versatile Diffusion can also input images to generate similar images, input images to generate text, input text to generate similar text, image semantic decoupling editing, input images and Generate videos from text, edit image content based on latent space, and more.

Future versions will also support more modes such as voice, music, video and 3D.

According to the paper, it has been proven that VD and its basic framework have the following advantages:

a) It can be used with competitive high quality Process all subtasks.

b) Support new extensions and applications, such as separation of graphic style and semantics, image-text dual guidance generation, etc.

c) These experiments and applications provide richer semantic insights into the generated output.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

In terms of training data set, VD uses Laion2B-en with custom data filters as the main data set.

First exploration of

One of the exciting findings of VD is that it can semantically enhance or reduce image style without further supervision.

Such a phenomenon inspired the author to explore a completely new field, where the separation between style and semantics can occur for images with arbitrary styles and arbitrary content.

The authors state that they are the first to explore: a) interpretation of the semantics and style of natural images without domain specification; b) diffusion model latent space Semantic and stylistic decomposition team.

In the image below, the author first generates variants of the input image and then operates on them with a semantic (left) or stylistic (right) focus.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Since VD supports both image to text and text to image, the author team tried editing from the perspective of text prompts for the first time by following the steps below Images: a) Convert image to text, b) Edit text, c) Convert text back to image.

In the experiment the author removed the described content from the image and then added new content using this image-text-image (I2T2I) paradigm. Unlike painting or other image editing methods that require the location of objects as input, VD's I2T2I does not require masks because it automatically positions and replaces objects on command.

However, the output image of I2T2I is inconsistent with the pixels of the input image, which is caused by image-to-text semantic extraction and text-to-image content creation.

In the display below, the input image is first translated into a prompt, and then the prompt is edited using subtraction (red box) and addition (green box). Finally, the edited prompt is translated into an image.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

#In addition, they are also the first team to explore generating similar text based on given text.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Network Framework

Specifically, the VD framework proposed in the article is a multi-stream network with various types of data as input and background.

The VD multi-stream multi-modal diffusion framework inherits the advantages of LDM/SD, with interpretable latent space, modal structure and low computational cost.

VD can jointly train multiple streams, each stream representing a cross-modal task. Its core design is to diffuser the grouping, sharing and switching protocols within the network, adapting the framework to all supported tasks and beyond.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

diffuser is divided into three groups: global layer, data layer and context layer. The global layer is the temporal embedding layer, the data layer is the residual block, and the context layer is the cross-attention.

This grouping corresponds to the function of the layer. When working on multiple tasks, the global layer is shared among all tasks. The data layer and context layer contain multiple data flows. Each data stream can be shared or exchanged based on the current data and context type.

For example, when processing text-image requests, diffuser uses the image data layer and the text context layer. When dealing with image mutation tasks, the image data layer and image context layer are used.

A single VD process contains a VAE, a diffuser and a context encoder, processing a task under one data type (such as image) and one context type (such as text) ( Such as text to image).

The multi-stream structure of Versatile Diffusion is shown in the figure below:

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

The researchers based on Versatile Diffusion, A general multi-stream multi-modal framework is further proposed, which includes VAE, context encoder and diffuser containing three layers (i.e. global, data and context layers).

Diffuser:

VD uses the widely adopted cross-concern UNet as The main architecture of the diffuser network divides the layers into global layer, data layer and context layer. The data layer and context layer have two data streams to support images and text.

For image data flow, follow LDM and use residual block (ResBlock), whose spatial dimension gradually decreases and the number of channels gradually increases.

For text data flow, the new fully connected residual block (FCResBlock) is utilized to expand the 768-dimensional text latent vector into 320*4 hidden features and follow a similar channel Add normalization, reuse GroupNorms, SiLU and skip connections, just like normal ResBlock.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

As shown in the figure above, FCResBlock contains two sets of fully connected layers (FC), group normalization (GN) and sigmoid linear unit (SiLU). x is the input text latent code, t is the input time embedding, and hi is the intermediate feature.

For context groups, cross-attention layers are used for both image and context streams, where content embedding operates data features through projection layers, dot products, and sigmoids.

Variational Autoencoder (VAE):

VD adopts the previous potential The autoencoder-KL of the Latent Diffusion Model (LDM) is used as the image data VAE, and Optimus is used as the text data VAE. Optimus consists of the BERT text encoder and the GPT2 text decoder, which can bidirectionally convert sentences into 768-dimensional normally distributed latent vectors.

At the same time, Optimus also shows satisfactory VAE characteristics with its reconfigurable and interpretable text latent space. Optimus was therefore chosen as the text VAE because it fits well with the prerequisites of a multi-stream multi-modal framework.

Context Encoder:

VD uses CLIP text and Image encoder as context encoder. Unlike LDM and SD which only use raw text embeddings as context input, VD uses normalized and projected embeddings to minimize the CLIP contrast loss of text and images.

Experiments show that a closer embedding space between context types helps the model converge quickly and perform better. Similar conclusions can also be achieved in DALL·E 2, which fine-tunes the text-to-image model with an additional projection layer to minimize the difference between text and image embeddings for image variations.

Performance

The authors used early single-task models as baseline models and compared the results of VD with these baselines. Among them, SDv1.4 is used as the baseline model from text to image, SD-variation is used for image-variation, and BLIP is used for image-text.

Meanwhile, the authors also conduct a qualitative comparison of different VD models, where VDDC and VD-official are used for text to image, and all three models are used for image variants.

The image samples of SD and VD are generated with controlled random seeds to better check the quality.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Text to Image Performance

Although DALLE 2 and Imagen are State-of-the-art results were also achieved on these tasks, but since there are no public code or training details, the authors skip comparing them.

The results show that multi-process structure and multi-task training can help VD capture contextual semantics and generate output more accurately, and complete all sub-tasks excellently.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Performance of Image-Variant

Also, generated by VD The image annotation also contains some creative words. In comparison, the generation of BLIP is very short and lacks description of details.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Image to text performance

Effect display

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

文生图

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Image variations

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

with semantics Focused Image Variants

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Dual Boot

Summary

The authors introduce Versatile Diffusion (VD), a multi-stream multi-modal diffusion network that addresses text, images, and variations in a unified model. Based on VD, the author further introduces a general multi-stream multi-modal framework, which can involve new tasks and domains.
Through experiments, the authors found that VD can produce high-quality output on all supported tasks, among which VD’s text-to-image and image-to-variant results can better capture the semantics in context, VD The image-to-text results are creative and illustrative.
Given the multi-stream multi-modal nature of VD, the authors introduce novel extensions and applications that may further benefit downstream users working on this technology.

Team Introduction

The IFP team at the University of Illinois at Urbana-Champaign was founded by Professor Huang Xutao in the 1980s, originally as Beckman Advanced Science and Technology Research Institute's Image Formation and Processing Group.

The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed Almighty Diffusion

Over the years, IFP has been committed to research and innovation beyond images, including image and video coding, multi-modal human-computer interaction, and multimedia annotation and search, computer vision and pattern recognition, machine learning, big data, deep learning, and high-performance computing.

The current research direction of IFP is to solve the problem of multi-modal information processing by collaboratively combining big data, deep learning and high-performance computing.

In addition, IFP has won several best papers at top conferences in the field of artificial intelligence and won many international competitions, including the first NIST TrecVID, the first ImageNet Challenge and the first Artificial Intelligence City Challenge.

Interestingly, since Professor Huang began teaching at MIT in the 1960s, the "members" of the IFP group have even included friends, students, students' students, and students' students. Even students of students’ students.

The above is the detailed content of The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed 'Almighty Diffusion'. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055523 fails to install in Windows 11?

4 weeks ago By DDD

How to fix KB5055518 fails to install in Windows 10?

4 weeks ago By DDD

Roblox: Grow A Garden - Complete Mutation Guide

2 weeks ago By DDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

How to fix KB5055612 fails to install in Windows 10?

3 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial

1663

CakePHP Tutorial

1420

Laravel Tutorial

1313

PHP Tutorial

1266

C# Tutorial

1239

Related knowledge

Which of the top ten currency trading platforms in the world are among the top ten currency trading platforms in 2025 Apr 28, 2025 pm 08:12 PM

The top ten cryptocurrency exchanges in the world in 2025 include Binance, OKX, Gate.io, Coinbase, Kraken, Huobi, Bitfinex, KuCoin, Bittrex and Poloniex, all of which are known for their high trading volume and security.

How much is Bitcoin worth Apr 28, 2025 pm 07:42 PM

Bitcoin’s price ranges from $20,000 to $30,000. 1. Bitcoin’s price has fluctuated dramatically since 2009, reaching nearly $20,000 in 2017 and nearly $60,000 in 2021. 2. Prices are affected by factors such as market demand, supply, and macroeconomic environment. 3. Get real-time prices through exchanges, mobile apps and websites. 4. Bitcoin price is highly volatile, driven by market sentiment and external factors. 5. It has a certain relationship with traditional financial markets and is affected by global stock markets, the strength of the US dollar, etc. 6. The long-term trend is bullish, but risks need to be assessed with caution.

What are the top currency trading platforms? The top 10 latest virtual currency exchanges Apr 28, 2025 pm 08:06 PM

Currently ranked among the top ten virtual currency exchanges: 1. Binance, 2. OKX, 3. Gate.io, 4. Coin library, 5. Siren, 6. Huobi Global Station, 7. Bybit, 8. Kucoin, 9. Bitcoin, 10. bit stamp.

Which of the top ten currency trading platforms in the world are the latest version of the top ten currency trading platforms Apr 28, 2025 pm 08:09 PM

The top ten cryptocurrency trading platforms in the world include Binance, OKX, Gate.io, Coinbase, Kraken, Huobi Global, Bitfinex, Bittrex, KuCoin and Poloniex, all of which provide a variety of trading methods and powerful security measures.

Decryption Gate.io Strategy Upgrade: How to Redefine Crypto Asset Management in MeMebox 2.0? Apr 28, 2025 pm 03:33 PM

MeMebox 2.0 redefines crypto asset management through innovative architecture and performance breakthroughs. 1) It solves three major pain points: asset silos, income decay and paradox of security and convenience. 2) Through intelligent asset hubs, dynamic risk management and return enhancement engines, cross-chain transfer speed, average yield rate and security incident response speed are improved. 3) Provide users with asset visualization, policy automation and governance integration, realizing user value reconstruction. 4) Through ecological collaboration and compliance innovation, the overall effectiveness of the platform has been enhanced. 5) In the future, smart contract insurance pools, forecast market integration and AI-driven asset allocation will be launched to continue to lead the development of the industry.

What are the top ten virtual currency trading apps? The latest digital currency exchange rankings Apr 28, 2025 pm 08:03 PM

The top ten digital currency exchanges such as Binance, OKX, gate.io have improved their systems, efficient diversified transactions and strict security measures.

How to use the chrono library in C? Apr 28, 2025 pm 10:18 PM

Using the chrono library in C can allow you to control time and time intervals more accurately. Let's explore the charm of this library. C's chrono library is part of the standard library, which provides a modern way to deal with time and time intervals. For programmers who have suffered from time.h and ctime, chrono is undoubtedly a boon. It not only improves the readability and maintainability of the code, but also provides higher accuracy and flexibility. Let's start with the basics. The chrono library mainly includes the following key components: std::chrono::system_clock: represents the system clock, used to obtain the current time. std::chron

How to handle high DPI display in C? Apr 28, 2025 pm 09:57 PM

Handling high DPI display in C can be achieved through the following steps: 1) Understand DPI and scaling, use the operating system API to obtain DPI information and adjust the graphics output; 2) Handle cross-platform compatibility, use cross-platform graphics libraries such as SDL or Qt; 3) Perform performance optimization, improve performance through cache, hardware acceleration, and dynamic adjustment of the details level; 4) Solve common problems, such as blurred text and interface elements are too small, and solve by correctly applying DPI scaling.

See all articles