BAT method: AAAI 2024's first multi-modal target tracking universal bidirectional adapter-AI-php.cn

Table of Contents

Main Contributions

Core method

Experimental results

Home

Technology peripherals

BAT method: AAAI 2024's first multi-modal target tracking universal bidirectional adapter

PHPz

Jan 24, 2024 pm 03:33 PM

ai train

Object tracking is one of the basic tasks of computer vision. In recent years, single-modality (RGB) object tracking has made significant progress. However, due to the limitations of a single imaging sensor, we need to introduce multi-modal images (such as RGB, infrared, etc.) to make up for this shortcoming to achieve all-weather target tracking in complex environments. The application of such multi-modal images can provide more comprehensive information and enhance the accuracy and robustness of target detection and tracking. The development of multimodal target tracking is of great significance for realizing higher-level computer vision applications.

However, existing multi-modal tracking tasks also face two main problems:

Due to multi-modal target tracking The cost of data annotation is high, and most existing data sets are limited in size and insufficient to support the construction of effective multi-modal trackers;
Because different imaging methods have different effects on Objects have different sensitivities, the dominant mode in the open world changes dynamically, and the dominant correlation between multi-modal data is not fixed.

Many multi-modal tracking efforts that pre-train on RGB sequences and then fully fine-tune to multi-modal scenes have time and efficiency issues, as well as limited performance.

In addition to the complete fine-tuning method, it is also inspired by the efficient fine-tuning method of parameters in the field of natural language processing (NLP). Some recent methods have introduced parameter-efficient prompt fine-tuning in multi-modal tracking. These methods do this by freezing the backbone network parameters and adding an additional set of learnable parameters.

Typically, these methods focus on one modality (usually RGB) as the primary modality and the other modality as the auxiliary modality. However, this method ignores the dynamic correlation between multi-modal data and therefore cannot fully utilize the complementary effects of multi-modal information in complex scenes, thus limiting the tracking performance.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Figure 1: Different dominant modes in complex scenarios.

To solve the above problems, researchers from Tianjin University proposed a solution called Bidirectional Adapter for Multimodal Tracking (BAT). Different from traditional methods, the BAT method does not rely on fixed dominant mode and auxiliary mode, but obtains better performance in the change of auxiliary mode to dominant mode through the process of dynamically extracting effective information. The innovation of this method is that it can adapt to different data characteristics and task requirements, thereby improving the representation ability of the basic model in downstream tasks. By using the BAT method, researchers hope to provide a more flexible and efficient multi-modal tracking solution, bringing better results to research and applications in related fields.

BAT consists of two base model encoders with shared parameters specific to the modal branches and a general bidirectional adapter. During the training process, BAT did not fully fine-tune the basic model, but adopted a step-by-step training method. Each specific modality branch is initialized by using the base model with fixed parameters, and only the newly added bidirectional adapters are trained. Each modal branch learns cue information from other modalities and combines it with the feature information of the current modality to enhance representation capabilities. Two modality-specific branches interact through a universal bidirectional adapter to dynamically fuse dominant and auxiliary information with each other to adapt to the paradigm of multi-modal non-fixed association. This design enables BAT to fine-tune the content without changing the meaning of the original content, improving the model's representation ability and adaptability.

The universal bidirectional adapter adopts a lightweight hourglass structure and can be embedded into each layer of the transformer encoder of the basic model to avoid introducing a large number of learnable parameters. By adding only a small number of training parameters (0.32M), the universal bidirectional adapter has lower training cost and achieves better tracking performance compared with fully fine-tuned methods and cue learning-based methods.

The paper "Bi-directional Adapter for Multi-modal Tracking":

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Paper link: https ://arxiv.org/abs/2312.10611

Code link: https://github.com/SparkTempest/BAT

Main Contributions

We first propose an adapter-based multi-modal tracking visual cue framework. Our model is able to perceive the dynamic changes of dominant modalities in open scenes and effectively fuse multi-modal information in an adaptive manner.
To the best of our knowledge, we propose a universal bidirectional adapter for the base model for the first time. It has a simple and efficient structure and can effectively realize multi-modal cross-cue tracking. By adding only 0.32M learnable parameters, our model is robust to multi-modal tracking in open scenarios.
We conducted an in-depth analysis of the impact of our universal adapter at different levels. We also explore a more efficient adapter architecture in experiments and verify our advantages on multiple RGBT tracking related datasets.

Core method

As shown in Figure 2, we propose a multi-modal tracking visual cue framework based on a bidirectional Adapter (BAT), the framework has a dual-stream encoder structure with RGB modality and thermal infrared modality, and each stream uses the same basic model parameters. The bidirectional Adapter is set up in parallel with the dual-stream encoder layer to cross-cue multimodal data from the two modalities.

The method does not completely fine-tune the basic model. It only efficiently transfers the pre-trained RGB tracker to multi-modal scenes by learning a lightweight bidirectional Adapter. It achieves excellent multi-modal complementarity and excellent tracking accuracy.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Figure 2: Overall architecture of BAT.

First, the 首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024 template frame of each modality (the initial frame of the target object in the first frame) and search frames (subsequent tracking images) are converted into , and they are spliced together and passed to the N-layer dual-stream transformer encoder respectively.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Bidirectional adapter is set up in parallel with the dual-stream encoder layer to learn feature cues from one modality to another. For this purpose, the output features of the two branches are added and input into the prediction head H to obtain the final tracking result box B.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

The bidirectional adapter adopts a modular design and is embedded in the multi-head self-attention stage and MLP stage respectively, as shown on the right side of Figure 1. Detailed structures designed to transfer feature cues from one modality to another. It consists of three linear projection layers, tn represents the number of tokens in each modality, the input token is first dimensionally reduced to de through down projection and passes through a linear projection layer, and then projected upward to the original dimension dt and fed back as a feature prompt Transformer encoder layers to other modalities.

Through this simple structure, the bidirectional adapter can effectively perform feature prompts between 首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024 modalities to achieve multi-modal tracking.

Since the transformer encoder and prediction head are frozen, only the parameters of the newly added adapter need to be optimized. Notably, unlike most traditional adapters, our bidirectional adapter functions as a cross-modal feature cue for dynamically changing dominant modalities, ensuring good tracking performance in the open world.

Experimental results

As shown in Table 1, the comparison on the two data sets of RGBT234 and LasHeR shows that our method has both accuracy and success rate. Outperforms state-of-the-art methods. As shown in Figure 3, the performance comparison with state-of-the-art methods under different scene properties of the LasHeR dataset also demonstrates the superiority of the proposed method.

These experiments fully prove that our dual-stream tracking framework and bidirectional Adapter successfully track targets in most complex environments and adaptively switch from dynamically changing dominant-auxiliary modes Extract effective information from the system and achieve state-of-the-art performance.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Table 1 Overall performance on RGBT234 and LasHeR datasets.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Figure 3 Comparison of BAT and competing methods under different attributes in the LasHeR dataset.

Experiments demonstrate our effectiveness in dynamically prompting effective information from changing dominant-auxiliary patterns in complex scenarios. As shown in Figure 4, compared with related methods that fix the dominant mode, our method can effectively track the target even when RGB is completely unavailable, when both RGB and TIR can provide effective information in subsequent scenes. , the tracking effect is much better. Our bidirectional Adapter dynamically extracts effective features of the target from both RGB and IR modalities, captures more accurate target response locations, and eliminates interference from the RGB modality.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

# Figure 4 Visualization of tracking results.

# We also evaluate our method on the RGBE trace dataset. As shown in Figure 5, compared with other methods on the VisEvent test set, our method has the most accurate tracking results in different complex scenarios, proving the effectiveness and generalization of our BAT model.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Figure 5 Tracking results under the VisEvent data set.

首个通用双向Adapter多模态目标追踪方法BAT，入选AAAI 2024

Figure 6 Attention weight visualization.

We visualize the attention weights of different layers tracking targets in Figure 6. Compared with the baseline-dual (dual-stream framework for basic model parameter initialization) method, our BAT effectively drives the auxiliary mode to learn more complementary information from the dominant mode, while maintaining the effectiveness of the dominant mode as the network depth increases. performance, thereby improving overall tracking performance.

Experiments show that BAT successfully captures multi-modal complementary information and achieves sample adaptive dynamic tracking.

The above is the detailed content of BAT method: AAAI 2024's first multi-modal target tracking universal bidirectional adapter. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055612 fails to install in Windows 10?

4 weeks ago By DDD

Roblox: Grow A Garden - Complete Mutation Guide

3 weeks ago By DDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Nordhold: Fusion System, Explained

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial

1670

CakePHP Tutorial

1428

Laravel Tutorial

1329

PHP Tutorial

1274

C# Tutorial

1256

Related knowledge

How to use the chrono library in C? Apr 28, 2025 pm 10:18 PM

Using the chrono library in C can allow you to control time and time intervals more accurately. Let's explore the charm of this library. C's chrono library is part of the standard library, which provides a modern way to deal with time and time intervals. For programmers who have suffered from time.h and ctime, chrono is undoubtedly a boon. It not only improves the readability and maintainability of the code, but also provides higher accuracy and flexibility. Let's start with the basics. The chrono library mainly includes the following key components: std::chrono::system_clock: represents the system clock, used to obtain the current time. std::chron

How to understand DMA operations in C? Apr 28, 2025 pm 10:09 PM

DMA in C refers to DirectMemoryAccess, a direct memory access technology, allowing hardware devices to directly transmit data to memory without CPU intervention. 1) DMA operation is highly dependent on hardware devices and drivers, and the implementation method varies from system to system. 2) Direct access to memory may bring security risks, and the correctness and security of the code must be ensured. 3) DMA can improve performance, but improper use may lead to degradation of system performance. Through practice and learning, we can master the skills of using DMA and maximize its effectiveness in scenarios such as high-speed data transmission and real-time signal processing.

Steps to add and delete fields to MySQL tables Apr 29, 2025 pm 04:15 PM

In MySQL, add fields using ALTERTABLEtable_nameADDCOLUMNnew_columnVARCHAR(255)AFTERexisting_column, delete fields using ALTERTABLEtable_nameDROPCOLUMNcolumn_to_drop. When adding fields, you need to specify a location to optimize query performance and data structure; before deleting fields, you need to confirm that the operation is irreversible; modifying table structure using online DDL, backup data, test environment, and low-load time periods is performance optimization and best practice.

What is real-time operating system programming in C? Apr 28, 2025 pm 10:15 PM

C performs well in real-time operating system (RTOS) programming, providing efficient execution efficiency and precise time management. 1) C Meet the needs of RTOS through direct operation of hardware resources and efficient memory management. 2) Using object-oriented features, C can design a flexible task scheduling system. 3) C supports efficient interrupt processing, but dynamic memory allocation and exception processing must be avoided to ensure real-time. 4) Template programming and inline functions help in performance optimization. 5) In practical applications, C can be used to implement an efficient logging system.

Top 10 digital currency trading platforms: Top 10 safe and reliable digital currency exchanges Apr 30, 2025 pm 04:30 PM

The top 10 digital virtual currency trading platforms are: 1. Binance, 2. OKX, 3. Coinbase, 4. Kraken, 5. Huobi Global, 6. Bitfinex, 7. KuCoin, 8. Gemini, 9. Bitstamp, 10. Bittrex. These platforms all provide high security and a variety of trading options, suitable for different user needs.

How to measure thread performance in C? Apr 28, 2025 pm 10:21 PM

Measuring thread performance in C can use the timing tools, performance analysis tools, and custom timers in the standard library. 1. Use the library to measure execution time. 2. Use gprof for performance analysis. The steps include adding the -pg option during compilation, running the program to generate a gmon.out file, and generating a performance report. 3. Use Valgrind's Callgrind module to perform more detailed analysis. The steps include running the program to generate the callgrind.out file and viewing the results using kcachegrind. 4. Custom timers can flexibly measure the execution time of a specific code segment. These methods help to fully understand thread performance and optimize code.

Quantitative Exchange Ranking 2025 Top 10 Recommendations for Digital Currency Quantitative Trading APPs Apr 30, 2025 pm 07:24 PM

The built-in quantization tools on the exchange include: 1. Binance: Provides Binance Futures quantitative module, low handling fees, and supports AI-assisted transactions. 2. OKX (Ouyi): Supports multi-account management and intelligent order routing, and provides institutional-level risk control. The independent quantitative strategy platforms include: 3. 3Commas: drag-and-drop strategy generator, suitable for multi-platform hedging arbitrage. 4. Quadency: Professional-level algorithm strategy library, supporting customized risk thresholds. 5. Pionex: Built-in 16 preset strategy, low transaction fee. Vertical domain tools include: 6. Cryptohopper: cloud-based quantitative platform, supporting 150 technical indicators. 7. Bitsgap:

How does deepseek official website achieve the effect of penetrating mouse scroll event? Apr 30, 2025 pm 03:21 PM

How to achieve the effect of mouse scrolling event penetration? When we browse the web, we often encounter some special interaction designs. For example, on deepseek official website, �...

See all articles