


Attention isn't all you need! Open-source Mamba hybrid model triples Transformer throughput
Has Mamba's time come?
The transformer architecture has dominated the field of generative artificial intelligence since the publication of the seminal research paper "Attention is All You Need" in 2017.
However, the transformer architecture actually has two significant disadvantages:
Large memory footprint: the memory footprint of a Transformer scales with context length. This makes it difficult to run long context windows or many parallel batches without substantial hardware resources, which limits widespread experimentation and deployment.
Slow inference as context grows: the Transformer's attention mechanism scales quadratically with sequence length and reduces throughput, because each token depends on the entire sequence before it. This puts long-context use cases beyond the scope of efficient production.
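The two scaling behaviours above can be made concrete with a rough back-of-the-envelope estimate. The sketch below assumes a generic decoder-only Transformer with full causal self-attention and a per-layer key/value cache. The layer count, head sizes and 16-bit element size are illustrative placeholders, not the configuration of any model named in this article.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence.

    Each attention layer stores one key and one value vector per token,
    so the cache grows linearly with context length.
    All default sizes are hypothetical.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


def attention_score_entries(context_len):
    """Query-key pairs in causal self-attention.

    Each token attends to itself and every earlier token, so the count
    grows quadratically with context length.
    """
    return context_len * (context_len + 1) // 2


for n in (4_096, 32_768, 262_144):
    gib = kv_cache_bytes(n) / 2**30
    pairs = attention_score_entries(n)
    print(f"{n:>7} tokens: KV cache ~{gib:.1f} GiB, {pairs:.2e} query-key pairs per head per layer")
```

Doubling the context doubles the cache but roughly quadruples the number of query-key pairs, which is why long contexts strain both memory and throughput.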
Transformers are not the only way forward for generative AI. AI21 Labs recently released and open-sourced a new model called "Jamba" that outperforms Transformers on multiple benchmarks.

Hugging Face Address: https://huggingface.co/ai21labs/Jamba-v0.1

Mamba's SSM architecture addresses the Transformer's memory and context-length problems well. On its own, however, Mamba struggles to match the output quality of Transformer models.
Jamba combines Mamba, which is based on the structured state space model (SSM), with the Transformer architecture, aiming to get the best properties of both.

Jamba is also accessible from the NVIDIA API catalog as an NVIDIA NIM inference microservice that enterprise application developers can deploy using the NVIDIA AI Enterprise software platform.
In general, the Jamba model has the following characteristics:
The first production-grade model based on Mamba, using a novel SSM-Transformer hybrid architecture;
3x throughput on long contexts compared to Mixtral 8x7B;
A 256K context window;
Open model weights;
The only model in its size class that fits up to 140K tokens of context on a single GPU.
Model Architecture
As shown in the figure below, Jamba uses a blocks-and-layers approach to integrate the two architectures. Each Jamba block contains either an attention layer or a Mamba layer, followed by a multilayer perceptron (MLP), giving an overall ratio of one Transformer layer in every eight layers.
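One way to picture the blocks-and-layers layout is to generate the layer sequence programmatically. The sketch below is illustrative only. The group size of eight with a single attention layer follows the ratio AI21 describes for Jamba, while the function name, the total of 32 layers and the position of the attention layer within each group are assumptions made for the example.

```python
def jamba_style_schedule(n_layers=32, block_size=8, attn_offset=4):
    """Return a list of (mixer, feed_forward) pairs, one per layer.

    Within each full group of `block_size` layers, exactly one layer uses
    attention and the rest use Mamba. Every mixer is followed by an MLP.
    `attn_offset` (hypothetical default) picks which layer in the group
    gets attention.
    """
    if not 0 <= attn_offset < block_size:
        raise ValueError("attn_offset must fall inside a group")
    schedule = []
    for i in range(n_layers):
        mixer = "attention" if i % block_size == attn_offset else "mamba"
        schedule.append((mixer, "mlp"))
    return schedule


layers = jamba_style_schedule()
n_attn = sum(1 for mixer, _ in layers if mixer == "attention")
print(f"{n_attn} attention layers out of {len(layers)}")
```

Because only the attention layers keep a key/value cache, reducing their share is what shrinks the memory cost of long contexts in a hybrid of this kind.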

Jamba uses a mixture-of-experts (MoE) design to increase the total number of model parameters while limiting the number of active parameters used at inference, giving higher model capacity without a corresponding increase in compute requirements. To maximize model quality and throughput on a single 80GB GPU, the research team tuned the number of MoE layers and experts, leaving enough memory for common inference workloads.
Jamba’s MoE layer allows it to utilize only 12B of the available 52B parameters at inference time, and its hybrid architecture makes these 12B active parameters more efficient than a pure transformer model of equivalent size.
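The gap between total and active parameters follows from simple arithmetic on a top-k routed MoE model. The sketch below is a simplified accounting, assuming 16 experts with 2 selected per token and treating every non-expert parameter as shared. Those two routing values and the function names are assumptions for illustration, and the resulting split is an estimate rather than a published breakdown.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a top-k routed MoE model.

    shared      parameters every token uses (mixers, embeddings, dense MLPs)
    per_expert  parameters in one expert, summed over all MoE layers
    """
    total = shared + n_experts * per_expert
    active = shared + top_k * per_expert
    return total, active


def split_from_totals(total, active, n_experts, top_k):
    """Invert moe_param_counts: recover (shared, per_expert) from the two totals."""
    per_expert = (total - active) / (n_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


# 52B total and 12B active come from the article; 16 experts / top-2 are assumed.
shared, per_expert = split_from_totals(52e9, 12e9, n_experts=16, top_k=2)
print(f"shared ~{shared / 1e9:.1f}B, per expert ~{per_expert / 1e9:.1f}B")
```

Under these assumptions, most of the 52B parameters sit in experts that a given token never touches, which is how total capacity can grow without a matching rise in per-token compute.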
Previously, no one had scaled Mamba beyond 3B parameters. Jamba is the first hybrid architecture of its kind to reach production scale.
Throughput and Efficiency
Initial evaluation experiments show that Jamba performs well on key metrics such as throughput and efficiency.
In terms of efficiency, Jamba achieves 3x the throughput of Mixtral 8x7B on long contexts, making it more efficient than Transformer-based models of similar size.

In terms of cost, Jamba fits 140K tokens of context on a single GPU, which opens up more deployment and experimentation options than other current open-source models of similar size.

It should be noted that Jamba is unlikely to replace today's Transformer-based large language models (LLMs), but it may complement them in some areas.
Reference link:
https://www.ai21.com/blog/announcing-jamba
https://venturebeat.com/ai/ai21-labs-juices-up-gen-ai-transformers-with-jamba/