
Using Transformer for the diffusion model, AI-generated videos achieve photorealism

Dec 15, 2023 am 09:25 AM

In video generation, researchers including Fei-Fei Li have now shown that a Transformer can serve as the denoising backbone of a diffusion model. This can be considered a major success for the Transformer in the field of video generation.
Recently, a piece of video generation research has received a great deal of praise, with one user on X even calling it "the end of Hollywood."
Is it really that good? Take a look at the sample videos on the project website first.

These videos are nearly free of artifacts, highly coherent, and full of detail. It even seems that a few such frames could be spliced into a Hollywood blockbuster without looking obviously out of place.

These videos were generated by the Window Attention Latent Transformer, abbreviated W.A.L.T, proposed by researchers from Stanford University, Google, and the Georgia Institute of Technology. The method successfully integrates the Transformer architecture into a latent video diffusion model. Stanford professor Fei-Fei Li is one of the paper's authors.
  • Project website: https://walt-video-diffusion.github.io/
  • Paper address: https://walt-video-diffusion.github.io/assets/W.A.L.T.pdf
Prior to this, the Transformer architecture had achieved great success in many different fields, with the notable exception of generative modeling of images and videos, where the dominant paradigm is currently the diffusion model.

In image and video generation, the diffusion model has become the main paradigm. However, among all video diffusion methods, the dominant backbone is the U-Net architecture, which consists of a series of convolutional and self-attention layers. U-Net is preferred because the memory requirements of the Transformer's full attention mechanism grow quadratically with the length of the input sequence; when processing high-dimensional signals such as video, this growth makes the computational cost prohibitive.

Latent diffusion models (LDMs) operate in a lower-dimensional latent space produced by an autoencoder, thereby reducing computational requirements. Here, a key design choice is the type of latent space: spatial compression versus spatiotemporal compression.

Spatial compression is often preferred because it allows reuse of pretrained image autoencoders and LDMs, which are trained on large paired image-text datasets. However, choosing spatial compression increases network complexity and makes it difficult to use a Transformer as the backbone (due to memory constraints), especially when generating high-resolution videos. Spatiotemporal compression, on the other hand, can alleviate these problems, but it rules out working with paired image-text datasets, which tend to be larger and more diverse than video-text datasets.

W.A.L.T is a Transformer-based method for latent video diffusion models (LVDMs).

This method consists of two stages.

In the first stage, an autoencoder maps videos and images into a unified, low-dimensional latent space. This allows a single generative model to be trained jointly on image and video datasets and significantly reduces the computational cost of generating high-resolution videos.

In the second stage, the team designed a new Transformer block for latent video diffusion models, built from self-attention layers that alternate between non-overlapping, window-restricted spatial and spatiotemporal attention. This design has two main benefits: first, local window attention significantly reduces computational requirements; second, it facilitates joint training, since the spatial layers can process images and video frames independently, while the spatiotemporal layers model temporal relationships in videos.

Although conceptually simple, this study is the first to experimentally demonstrate on public benchmarks that a Transformer achieves superior generation quality and parameter efficiency in latent video diffusion.

Finally, to demonstrate the scalability and efficiency of the new method, the team also tackled the difficult task of photorealistic image-to-video generation. They trained a cascade of three models: a base latent video diffusion model and two video super-resolution diffusion models, producing videos at 512×896 resolution and 8 frames per second. This approach achieves a state-of-the-art zero-shot FVD score on the UCF-101 benchmark.


Additionally, this model can be used to generate videos with consistent 3D camera motion.


W.A.L.T

Learning visual tokens

In generative modeling of video, a key design decision is the choice of latent space representation. Ideally, one would like a shared, unified compressed visual representation that can be used for generative modeling of both images and videos.

Specifically, given a video sequence x, the goal is to learn a low-dimensional representation z that performs spatiotemporal compression at certain temporal and spatial scales. To obtain a unified representation of videos and still images, the first frame of a video is always encoded independently of the remaining frames, which allows a still image to be treated as a video with a single frame.

Based on this idea, the team's design uses the causal 3D CNN encoder-decoder architecture of the MAGVIT-v2 tokenizer.

After this stage, the input to the model is a batch of latent tensors representing either a single video or a stack of independent images (Figure 2). The latent representation here is real-valued and unquantized.
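To make the shapes concrete, here is a minimal sketch of how such a causal tokenizer compresses inputs; the compression factors and latent channel count are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical shape helper for a causal video autoencoder.
# Compression factors are illustrative assumptions: 4x temporal, 8x spatial.
TEMPORAL_FACTOR, SPATIAL_FACTOR = 4, 8

def latent_shape(num_frames: int, height: int, width: int, channels: int = 8):
    """The first frame is encoded on its own, so an image (num_frames == 1)
    becomes a single latent frame and a (1 + 4k)-frame clip becomes 1 + k
    latent frames."""
    t = 1 + (num_frames - 1) // TEMPORAL_FACTOR
    return (channels, t, height // SPATIAL_FACTOR, width // SPATIAL_FACTOR)

print(latent_shape(17, 128, 128))  # video clip  -> (8, 5, 16, 16)
print(latent_shape(1, 128, 128))   # still image -> (8, 1, 16, 16)
```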
Learning to generate images and videos

Patchify. Following the original ViT design, the team patchifies each latent frame independently, converting it into a sequence of non-overlapping patches. Learnable positional embeddings, defined as the sum of spatial and temporal positional embeddings, are added to the linear projections of the patches. Note that for images, only the temporal positional embedding corresponding to the first latent frame is added.
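A minimal PyTorch sketch of this patchify step follows; the module, dimensions, and patch size are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class Patchify(nn.Module):
    """Patchify latent frames and add spatial + temporal positional embeddings."""
    def __init__(self, in_ch=8, patch=2, dim=512, max_t=5, grid=16):
        super().__init__()
        # kernel/stride of (1, p, p) patchifies each latent frame independently
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(1, patch, patch),
                              stride=(1, patch, patch))
        n_spatial = (grid // patch) ** 2
        self.pos_s = nn.Parameter(torch.zeros(1, 1, n_spatial, dim))  # spatial
        self.pos_t = nn.Parameter(torch.zeros(1, max_t, 1, dim))      # temporal

    def forward(self, z):                      # z: (B, C, T, H, W) latents
        x = self.proj(z)                       # (B, D, T, H/p, W/p)
        B, D, T, h, w = x.shape
        x = x.permute(0, 2, 3, 4, 1).reshape(B, T, h * w, D)
        # images (T == 1) pick up only the first temporal embedding
        x = x + self.pos_s + self.pos_t[:, :T]
        return x.reshape(B, T * h * w, D)      # flat token sequence

tokens = Patchify()(torch.randn(2, 8, 5, 16, 16))
print(tokens.shape)  # torch.Size([2, 320, 512])
```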

Window attention. Transformer models built entirely from global self-attention modules are expensive in both computation and memory, especially for video tasks. For efficiency, and to process images and videos jointly, the team computes self-attention within windows, based on two types of non-overlapping configurations: spatial (S) and spatiotemporal (ST); see Figure 2.

Spatial window (SW) attention attends to all tokens within one latent frame, modeling spatial relationships in images and videos. Spatiotemporal window (STW) attention operates over a 3D window and models temporal relationships between the latent frames of a video. Finally, in addition to absolute positional embeddings, relative positional embeddings are also used.
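The sketch below shows how the two window configurations could be realized; the shapes, window sizes, and the bare single-head attention call are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def window_attention(x, window):
    """x: (B, T, H, W, D) tokens; window: (wt, wh, ww) non-overlapping window size."""
    B, T, H, W, D = x.shape
    wt, wh, ww = window
    # partition the token grid into non-overlapping windows
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, D)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, D)
    # self-attention restricted to each window (single head, no projections)
    x = F.scaled_dot_product_attention(x, x, x)
    # undo the partitioning
    x = x.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, D)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, D)
    return x

z = torch.randn(1, 4, 8, 8, 64)
spatial = window_attention(z, (1, 8, 8))          # S: attends within one latent frame
spatiotemporal = window_attention(z, (4, 4, 4))   # ST: attends across latent frames
print(spatial.shape, spatiotemporal.shape)
```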

Although this design is simple, it is computationally efficient and allows joint training on image and video datasets. Unlike methods based on frame-level autoencoders, the new method does not produce flickering artifacts, a common problem with approaches that encode and decode video frames independently.
           
Conditional generation

To achieve controllable video generation, diffusion models are typically conditioned not only on the timestep t but also on additional conditioning information c, such as class labels, natural language, past frames, or low-resolution videos. The team integrated three types of conditioning mechanisms into the proposed Transformer backbone, described below:

Cross-attention. In addition to the self-attention layers in the windowed Transformer blocks, they added cross-attention layers for text-conditioned generation. When training on video alone, the cross-attention layers use the same window-restricted attention as the self-attention layers, i.e., S/ST blocks have SW/STW cross-attention layers (Figure 2). For joint training, however, only SW cross-attention layers are used. In cross-attention, the input signal (queries) is concatenated with the conditioning signal (keys, values).
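A minimal sketch of that query/key-value concatenation, under assumed shapes and with a bare single-head attention call (not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def window_cross_attention(x, text):
    """x: (N, L, D) tokens of one window; text: (N, L_txt, D) text embeddings.
    Keys/values are the concatenation of the window tokens and the text tokens,
    as described above; queries are the window tokens."""
    kv = torch.cat([x, text], dim=1)
    return F.scaled_dot_product_attention(x, kv, kv)

out = window_cross_attention(torch.randn(4, 64, 512), torch.randn(4, 77, 512))
print(out.shape)  # torch.Size([4, 64, 512])
```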

AdaLN-LoRA. Adaptive normalization layers are an important component of many generative and visual synthesis models. A simple way to incorporate them is to include, for each layer i, an MLP that regresses a vector of conditioning parameters. The number of parameters in these additional MLPs grows linearly with the number of layers and quadratically with the model dimension. Inspired by LoRA, the researchers proposed a simple solution to reduce the number of model parameters: AdaLN-LoRA.
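A hedged sketch of the general idea follows; the shared-base-plus-low-rank factorization, rank, and output layout below are assumptions and may differ from the paper's exact formulation:

```python
import torch
import torch.nn as nn

class AdaLNLoRA(nn.Module):
    """Shared MLP plus per-layer low-rank corrections for adaptive layer norm.
    Illustrative only; 6*dim covers shift/scale/gate for the attention and MLP
    sub-blocks, as in common DiT-style blocks."""
    def __init__(self, dim=512, n_layers=24, rank=16):
        super().__init__()
        self.shared = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        self.down = nn.ModuleList([nn.Linear(dim, rank, bias=False) for _ in range(n_layers)])
        self.up = nn.ModuleList([nn.Linear(rank, 6 * dim, bias=False) for _ in range(n_layers)])

    def forward(self, c, layer_idx):              # c: (B, dim) conditioning vector
        base = self.shared(c)                     # regressed once, shared by all layers
        delta = self.up[layer_idx](self.down[layer_idx](c))  # layer-specific low-rank term
        return (base + delta).chunk(6, dim=-1)    # shift/scale/gate parameters

mods = AdaLNLoRA()(torch.randn(2, 512), layer_idx=3)
print([m.shape for m in mods])  # six tensors of shape (2, 512)
```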

Self-conditioning. In addition to being conditioned on external inputs, iterative generation algorithms can also be conditioned on samples they themselves generate during inference. Specifically, in the paper "Analog bits: Generating discrete data using diffusion models with self-conditioning", Chen et al. modified the training process of the diffusion model so that, with some probability p_sc, the model first generates a sample and then refines this estimate with another forward pass conditioned on it; with probability 1 − p_sc, only a single forward pass is performed. The team concatenates this model estimate with the input along the channel dimension and found that this simple technique works well in combination with v-prediction.
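A rough sketch of such a training step is shown below; `model`, its signature, and the probability value are placeholders rather than the authors' code:

```python
import torch

def self_conditioning_step(model, x_noisy, t, p_sc=0.9):
    """One denoising forward pass with optional self-conditioning.
    `model(x, t)` is assumed to take the noisy latents concatenated with the
    previous estimate along the channel dimension and return a v-prediction."""
    prev = torch.zeros_like(x_noisy)
    if torch.rand(()) < p_sc:
        with torch.no_grad():                          # first pass, no gradients
            prev = model(torch.cat([x_noisy, prev], dim=1), t)
    return model(torch.cat([x_noisy, prev], dim=1), t)  # refined estimate
```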

Autoregressive generation

To generate long videos via autoregressive prediction, the team also jointly trained the model on a frame-prediction task. This is done by conditioning the model on past frames with some probability p_fp during training. The conditioning signal is either 1 latent frame (image-to-video generation) or 2 latent frames (video prediction), and it is integrated into the model by concatenation with the noisy latent input along the channel dimension. Standard classifier-free guidance is used during inference, with c_fp as the conditioning signal.
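One possible way to wire in this conditioning is sketched below; the zero-padding and binary mask layout are assumptions, not the released implementation:

```python
import torch

def add_frame_conditioning(x_noisy, past_latents, p_fp=0.5):
    """x_noisy: (B, C, T, H, W) noisy latents; past_latents: (B, C, T_past, H, W)
    with T_past in {1, 2}. Returns the channel-wise concatenated model input."""
    B, C, T, H, W = x_noisy.shape
    cond = torch.zeros_like(x_noisy)
    mask = torch.zeros(B, 1, T, H, W, device=x_noisy.device)
    if torch.rand(()) < p_fp:                     # condition on past frames
        t_past = past_latents.shape[2]
        cond[:, :, :t_past] = past_latents        # zero-pad to full clip length
        mask[:, :, :t_past] = 1.0                 # mark which frames are given
    return torch.cat([x_noisy, cond, mask], dim=1)  # (B, 2C + 1, T, H, W)
```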

Video super-resolution

Generating high-resolution video with a single model is computationally very expensive and largely impractical. Following the paper "Cascaded diffusion models for high fidelity image generation", the researchers use a cascade of three models operating at increasingly higher resolutions.

The base model generates video at 128×128 resolution, which is then upsampled twice through two super-resolution stages. The low-resolution input (video or image) is first spatially upsampled using a depth-to-space convolution operation. Note that, unlike training (where ground-truth low-resolution inputs are provided), inference relies on the latents generated by the previous stages.
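A small sketch of that depth-to-space upsampling step, with assumed channel counts (PixelShuffle is PyTorch's standard depth-to-space operation):

```python
import torch
import torch.nn as nn

class DepthToSpaceUpsample(nn.Module):
    """Upsample low-resolution latents 2x via conv + depth-to-space."""
    def __init__(self, channels=8, factor=2):
        super().__init__()
        # expand channels by factor^2, then rearrange them into spatial resolution
        self.conv = nn.Conv2d(channels, channels * factor ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(factor)     # depth-to-space

    def forward(self, z):                          # z: (B*T, C, H, W), per frame
        return self.shuffle(self.conv(z))          # (B*T, C, 2H, 2W)

up = DepthToSpaceUpsample()(torch.randn(5, 8, 16, 28))
print(up.shape)  # torch.Size([5, 8, 32, 56])
```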

To reduce this discrepancy and make the super-resolution stages more robust to artifacts produced by the low-resolution stage, the team also used noise conditioning augmentation.

Aspect-ratio fine-tuning. To simplify training and exploit more data sources with different aspect ratios, the base stage uses a square aspect ratio. The model is then fine-tuned on a subset of the data to generate videos with a 9:16 aspect ratio via positional embedding interpolation.
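Aspect-ratio fine-tuning of this kind typically relies on interpolating the learned spatial positional embeddings to the new token grid; the sketch below illustrates that idea under assumed grid sizes, not the authors' exact procedure:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos, new_hw):
    """pos: (1, H*W, D) learned spatial embeddings on a square H x W grid;
    new_hw: (H', W') target grid, e.g. for a 9:16 aspect ratio."""
    n, d = pos.shape[1], pos.shape[2]
    side = int(n ** 0.5)
    grid = pos.reshape(1, side, side, d).permute(0, 3, 1, 2)       # (1, D, H, W)
    grid = F.interpolate(grid, size=new_hw, mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], d)

new_pos = interpolate_pos_embed(torch.randn(1, 64 * 64, 512), (36, 64))
print(new_pos.shape)  # torch.Size([1, 2304, 512])
```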
Experiments

The researchers evaluated the newly proposed method on a variety of tasks: class-conditional image and video generation, frame prediction, and text-to-video generation. They also explored the effects of different design choices through ablation studies.

Visual generation

Video generation: On both the UCF-101 and Kinetics-600 datasets, W.A.L.T outperforms all previous methods in terms of FVD; see Table 1.
Image generation: Table 2 compares W.A.L.T with other current best methods for generating images at 256×256 resolution. The newly proposed model outperforms previous methods without requiring specialized schedules, convolutional inductive biases, improved diffusion losses, or classifier-free guidance. Although VDM has a slight edge in FID, it uses far more model parameters (2B).
Ablation studies

To understand the contribution of different design decisions, the team conducted ablation studies. Table 3 presents ablation results for patch size, window attention, self-conditioning, AdaLN-LoRA, and the autoencoder.
Text-to-video generation

For text-to-video generation, the team jointly trained W.A.L.T on text-to-image and text-to-video data, using a dataset drawn from the public internet and internal sources that contains roughly 970M text-image pairs and roughly 89M text-video pairs.

The base model (3B) operates at a resolution of 17×128×128, and the two cascaded super-resolution models go from 17×128×224 to 17×256×448 (L, 1.3B, p = 2) and from 17×256×448 to 17×512×896 (L, 419M, p = 2). The base stage was also fine-tuned for aspect ratio to produce videos at 128×224 resolution. All text-to-video generation results use classifier-free guidance.

Below are some generated video examples; for more, please visit the project website.

Text: A squirrel eating a burger.


Text: A cat riding a ghost rider bike through the desert.


Quantitative evaluation

Evaluating text-based video generation scientifically remains a challenge, partly due to the lack of standardized training datasets and benchmarks. So far, the researchers' experiments and analyses have focused on standard academic benchmarks, which use the same training data to ensure fair comparisons.

Nevertheless, for comparison with previous text-to-video generation studies, the team reports results on the UCF-101 dataset in a zero-shot evaluation setting.
It can be seen that the advantages of W.A.L.T are obvious.

Please refer to the original paper for more details.
