
Zhipu's open-source Sora is a hit: 4K GitHub stars racked up, inference on a single 4090, fine-tuning on an A6000

Aug 07, 2024, 06:05 PM
Industry · Zhipu AI

Zhipu AI has open-sourced the large video generation model it developed in-house.


The domestic video generation field is heating up. Zhipu AI has just announced that it is open-sourcing CogVideoX, a video generation model built on the same foundation as "Qingying". The repository earned 4K stars on GitHub within just a few hours.


  • Code repository: https://github.com/THUDM/CogVideo
  • Model download: https://huggingface.co/THUDM/CogVideoX-2b
  • Technical report: https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf

On July 26, Zhipu AI officially released the video generation product "Qingying", which was widely praised. Given a good idea (a few words to a few hundred words) and a little patience (about 30 seconds), "Qingying" can generate a high-precision video at 1440x960 resolution.

Officially, "Qingying" is now available in the Qingyan App, where all users can try it in full. Anyone who wants to experience "Qingying's" video generation can do so via "Zhipu Qingyan".

"Qingying" has been hailed as the first Sora-like model available to everyone in China. Within six days of release, it had generated more than one million videos.

  • PC access link: https://chatglm.cn/
  • Mobile access link: https://chatglm.cn/download?fr=web_home

Why is Zhipu AI's open-source model so popular? Although video generation technology is gradually maturing, there is still no open-source video generation model that meets commercial-grade requirements. Familiar models such as Sora and Gen-3 are all closed source. For researchers, open-sourcing CogVideoX is akin to OpenAI open-sourcing the model behind Sora.
The CogVideoX line includes multiple models of different sizes. The currently open-sourced CogVideoX-2B requires only 18 GB of video memory for inference at FP16 precision and only 40 GB for fine-tuning. This means a single RTX 4090 can run inference, and a single A6000 can handle fine-tuning.

CogVideoX-2B accepts prompts up to 226 tokens and generates 6-second videos at 8 frames per second with a resolution of 720x480. Zhipu AI has left ample room for improving video quality, and looks forward to open-source contributions on prompt optimization, video length, frame rate, resolution, scene fine-tuning, and other video-centric features.
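Those memory figures make more sense once you compare raw pixel counts with latent counts. A minimal back-of-the-envelope sketch follows; the compression factors (4x in time, 8x along each spatial axis) and the 16 latent channels are illustrative assumptions, not figures stated in this article:

```python
# Compare the raw pixel volume of one CogVideoX-2B clip with the latent
# volume after a hypothetical 3D VAE (compression factors assumed).

FPS, SECONDS = 8, 6
WIDTH, HEIGHT = 720, 480

frames = FPS * SECONDS                   # 48 frames per clip
raw_elems = frames * WIDTH * HEIGHT * 3  # RGB values per clip

t_down, s_down = 4, 8                    # assumed temporal/spatial factors
latent_channels = 16                     # assumed latent channel count
latent_elems = ((frames // t_down)
                * (WIDTH // s_down) * (HEIGHT // s_down)
                * latent_channels)

print(raw_elems, latent_elems, raw_elems // latent_elems)
```

Under these assumed factors, the diffusion backbone operates on roughly 1/48 as many elements as the raw video, which is what makes single-card inference plausible.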
Models with stronger performance and larger parameter counts are on the way, so stay tuned.

Model


VAE

Video data contains both spatial and temporal information, so its data volume and computational burden far exceed those of image data. To address this challenge, Zhipu AI proposed a video compression method based on a 3D variational autoencoder (3D VAE). The 3D VAE compresses the spatial and temporal dimensions of video simultaneously through three-dimensional convolution, achieving a higher compression rate and better reconstruction quality.


The model structure includes an encoder, a decoder, and a latent-space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporally causal convolution preserves the causal ordering of frame information and reduces communication overhead, and Zhipu AI uses context parallelism to scale to large videos.
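The key property of a temporally causal convolution is that each output frame depends only on the current and earlier frames, which is achieved by padding only on the "past" side of the time axis. A minimal 1D sketch with plain lists (not the model's actual 3D implementation):

```python
# Illustrative temporally causal 1D convolution: output[t] is computed
# from signal[t], signal[t-1], ... and never from future frames.

def causal_conv1d(signal, kernel):
    """Convolve with left-only (past-side) zero padding."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)   # pad the past side only
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(signal))
    ]

# A sum over the current and previous frame:
out = causal_conv1d([1, 2, 3, 4], [1, 1])
print(out)  # → [1, 3, 5, 7]: the first output sees no future values
```

Because no output position reads future frames, the video can be encoded in chunks along time, which is what makes splitting work across devices (as with context parallelism) cheap in communication.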

In experiments, Zhipu AI found that encoding at large resolutions generalizes easily, whereas increasing the number of frames is more challenging. Zhipu AI therefore trains the model in two stages: first on lower frame rates with mini-batches, then fine-tuning on higher frame rates with context parallelism. The training loss combines an L2 loss, an LPIPS perceptual loss, and a GAN loss from a 3D discriminator.
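The combined objective is a weighted sum of the three terms. A hedged sketch of that combination follows; the weights and the stub loss functions are illustrative assumptions (in practice, LPIPS and the GAN term come from learned networks, not these stand-ins):

```python
# Sketch of combining L2 + LPIPS + GAN losses as a weighted sum.
# All three functions below are simplified stand-ins for illustration.

def l2_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def lpips_loss(pred, target):       # stand-in for a perceptual metric
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def gan_loss(discriminator_score):  # stand-in for a 3D-discriminator term
    return -discriminator_score

def total_loss(pred, target, d_score, w_lpips=0.1, w_gan=0.05):
    """Weighted combination; weights are assumed, not from the report."""
    return (l2_loss(pred, target)
            + w_lpips * lpips_loss(pred, target)
            + w_gan * gan_loss(d_score))

print(total_loss([1.0, 2.0], [1.0, 1.0], d_score=0.5))
```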
Expert Transformer

Zhipu AI uses the VAE encoder to compress the video into a latent space, then splits the latent into patches and flattens them into a long sequence embedding z_vision. At the same time, Zhipu AI uses T5 to encode the text input into a text embedding z_text, and concatenates z_text and z_vision along the sequence dimension. The concatenated embeddings are fed through a stack of expert Transformer blocks. Finally, the embeddings are split back apart to recover the original latent shape and decoded with the VAE to reconstruct the video.
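The concatenate-process-split round trip above can be sketched with plain lists standing in for tensors. The names z_text and z_vision follow the article; the "transformer" here is a stub, so only the sequence bookkeeping is real:

```python
# Illustrative concat -> process -> split pattern for joint
# text+vision sequences (stub in place of the expert-Transformer stack).

def run_expert_transformer_stub(sequence):
    # Placeholder: scales every embedding so the round trip is visible.
    return [x * 2 for x in sequence]

z_text = [10, 11]          # text embeddings from T5 (stand-ins)
z_vision = [1, 2, 3, 4]    # flattened video-latent patches (stand-ins)

joint = z_text + z_vision                    # concat along sequence dim
processed = run_expert_transformer_stub(joint)

# Split back: keep only the vision positions, which would then be
# reshaped to the latent grid and decoded by the VAE.
z_vision_out = processed[len(z_text):]
print(z_vision_out)  # → [2, 4, 6, 8]
```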


Data

Training a video generation model requires screening for high-quality video data so the model can learn real-world dynamics, but raw videos may be misleading due to human editing or filming problems. Zhipu AI developed negative labels to identify and exclude low-quality videos such as over-edited, choppy-motion, low-quality, lecture-style, text-dominated, and screen-noise clips. Using filters trained with video-llama, Zhipu AI annotated and filtered 20,000 video data points. It also computes optical flow and aesthetic scores and dynamically adjusts the thresholds to ensure the quality of the generated video.
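The filtering described above has two parts: drop any clip carrying a negative label, then apply adjustable score thresholds. A minimal sketch, where the tag names mirror the article but the threshold values are illustrative assumptions:

```python
# Sketch of negative-tag plus score-threshold filtering for video clips.
# Threshold defaults are assumptions for illustration only.

NEGATIVE_TAGS = {"over-edited", "choppy-motion", "low-quality",
                 "lecture-style", "text-dominated", "screen-noise"}

def keep_clip(tags, optical_flow, aesthetic,
              flow_min=0.5, aesthetic_min=4.0):
    """Return True if the clip survives tag and score filtering."""
    if NEGATIVE_TAGS & set(tags):        # any negative tag disqualifies
        return False
    return optical_flow >= flow_min and aesthetic >= aesthetic_min

print(keep_clip(["outdoor"], optical_flow=1.2, aesthetic=5.5))        # True
print(keep_clip(["lecture-style"], optical_flow=1.2, aesthetic=5.5))  # False
```

Because `flow_min` and `aesthetic_min` are parameters, the thresholds can be tuned per batch, matching the dynamic adjustment the article mentions.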
Video data usually lacks text descriptions, which must be generated for text-to-video training. Existing video captioning datasets have short captions that cannot fully describe the video content. Zhipu AI proposes a pipeline that generates video captions from image captions and fine-tunes an end-to-end video captioning model to obtain denser captions. The approach generates short captions with the Panda70M model and dense image captions with the CogView3 model, then summarizes them with GPT-4 to produce the final video caption. Zhipu AI also fine-tuned a CogVLM2-Caption model, based on CogVLM2-Video and Llama 3 and trained on the dense caption data, to accelerate video caption generation.


Performance

To evaluate the quality of text-to-video generation, Zhipu AI uses multiple metrics from VBench, such as human action, scene, and dynamics. Zhipu AI also uses two additional video evaluation tools focused on the dynamic characteristics of video: Devil's Dynamic Quality and Chrono-Magic's GPT4o-MT score. The results are shown in the table below.

Zhipu AI has verified the effectiveness of scaling laws in video generation, and will continue to scale up data and model size while exploring new model architectures with more breakthrough innovations, more efficient compression of video information, and fuller fusion of text and video content.

Finally, let's look at what "Qingying" can do.

Prompt: "A delicately crafted wooden toy boat with beautifully carved masts and sails glides smoothly over a plush blue carpet that mimics ocean waves. The hull is painted a rich brown, with tiny windows. The carpet is soft and textured, providing a perfect backdrop reminiscent of a vast sea. Around the boat are various toys and children's items, suggesting a playful environment. The scene symbolizes endless adventures aboard a toy boat."

Prompt: "The camera tracks an old white SUV with a black roof rack as it charges up a steep dirt slope, tires kicking up dust under a blazing sun. Racing along the unpaved road, the SUV curves into the distance in warm light, with sequoia trees lining both sides of the road. Seen from behind, the vehicle takes the curves smoothly, giving the impression of being surrounded by rugged hills and mountains, with thin clouds above. Snow-covered trees line the route and the ground is blanketed in snow, creating a bright, tranquil atmosphere. The style of the video is a natural-landscape shot focusing on the beauty of the snowy forest and the calm of the road."


