YOLOv10 is here! Truly real-time, end-to-end object detection
Over the past few years, YOLOs have become the mainstream paradigm in real-time object detection thanks to their effective balance between computational cost and detection performance. Researchers have explored YOLOs' architectural design, optimization objectives, data augmentation strategies, and more, making notable progress. However, the reliance on non-maximum suppression (NMS) for post-processing hinders end-to-end deployment of YOLOs and hurts inference latency. Moreover, the various components of YOLOs have never been comprehensively and thoroughly inspected, leaving significant computational redundancy that limits model capability, resulting in suboptimal efficiency alongside considerable untapped potential for performance improvement. In this work, we aim to push the performance-efficiency boundary of YOLOs further from both the post-processing and model-architecture perspectives. To this end, we first propose consistent dual assignments for NMS-free training of YOLOs, which simultaneously delivers competitive performance and low inference latency. We then introduce a holistic efficiency-accuracy driven model design strategy, comprehensively optimizing each component of YOLOs from both the efficiency and accuracy perspectives, which greatly reduces computational overhead and enhances model capability. The outcome is a new generation of the YOLO series designed for real-time end-to-end object detection, called YOLOv10. Extensive experiments show that YOLOv10 achieves state-of-the-art performance and efficiency across model scales. For example, on the COCO dataset, YOLOv10-S is 1.8 times faster than RT-DETR-R18 at similar AP, with 2.8 times fewer parameters and floating point operations (FLOPs). Compared with YOLOv9-C, YOLOv10-B reduces latency by 46% and parameters by 25% at the same performance. Code link: https://github.com/THU-MIG/yolov10.
What improvements does YOLOv10 bring?
First, it tackles the redundant predictions in post-processing by proposing a consistent dual assignments strategy for NMS-free YOLOs, combining dual label assignment with a consistent matching metric. This lets the model receive rich and harmonious supervision during training while eliminating the need for NMS at inference, achieving competitive performance with high efficiency.
Second, a holistic efficiency-accuracy driven model design strategy is proposed for the architecture, comprehensively examining each component of YOLOs. On the efficiency side, a lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design are proposed to cut obvious computational redundancy and yield a more efficient architecture.
On the accuracy side, large-kernel convolutions are explored and an effective partial self-attention module is proposed, strengthening model capability and unlocking performance gains at low cost.
Based on these methods, the authors implemented a family of real-time end-to-end detectors at different model sizes, namely YOLOv10-N/S/M/B/L/X. Extensive experiments on standard object detection benchmarks show that YOLOv10 outperforms previous state-of-the-art models in the computation-accuracy trade-off across model sizes. As shown in Figure 1, at similar performance, YOLOv10-S/X is 1.8 times/1.3 times faster than RT-DETR-R18/R101, respectively. Compared with YOLOv9-C, YOLOv10-B achieves a 46% latency reduction at the same performance. YOLOv10 also exhibits very high parameter efficiency: YOLOv10-L/X outperforms YOLOv8-L/X by 0.3 AP and 0.5 AP with 1.8 times and 2.3 times fewer parameters, respectively, and YOLOv10-M achieves similar AP to YOLOv9-M/YOLO-MS with 23% and 31% fewer parameters, respectively.
During training, YOLOs usually leverage TAL (Task Alignment Learning) to assign multiple positive samples to each instance. This one-to-many assignment yields rich supervision signals that aid optimization and lead to stronger performance. However, it also forces YOLOs to rely on NMS (non-maximum suppression) post-processing, which results in suboptimal inference efficiency at deployment. While previous works have explored one-to-one matching to suppress redundant predictions, they often add inference overhead or sacrifice performance. This work proposes an NMS-free training strategy that adopts dual label assignment and a consistent matching metric: the one-to-many head provides rich supervision during training, while the one-to-one head is used at inference, so NMS is no longer required, achieving high efficiency with competitive performance.
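The interplay of the two assignment branches can be sketched in a few lines. This is an illustrative toy, not the official implementation: the score function mimics the spirit of a consistent matching metric (classification confidence combined with IoU), and the candidate tuples, `alpha`, `beta`, and `k` are made-up values.

```python
# Toy sketch of dual label assignment for one ground-truth instance.
# The one-to-many branch keeps the top-k candidates (rich training supervision);
# the one-to-one branch keeps only the single best, so inference needs no NMS.
# Because both branches rank by the same metric, their choices stay consistent.

def match_score(cls_score, iou, alpha=0.5, beta=6.0):
    """Matching metric combining classification confidence and IoU (illustrative)."""
    return (cls_score ** alpha) * (iou ** beta)

def dual_assign(candidates, k=3):
    """candidates: list of (name, cls_score, iou). Returns (one-to-many, one-to-one)."""
    ranked = sorted(candidates, key=lambda c: match_score(c[1], c[2]), reverse=True)
    one_to_many = [c[0] for c in ranked[:k]]   # multiple positives for training
    one_to_one = ranked[0][0]                  # single positive for the NMS-free head
    return one_to_many, one_to_one

preds = [("p0", 0.9, 0.8), ("p1", 0.7, 0.9), ("p2", 0.6, 0.5), ("p3", 0.2, 0.3)]
many, one = dual_assign(preds, k=2)
```

Note that the one-to-one pick is always contained in the one-to-many set here, which is exactly the "harmonious supervision" property the shared metric is meant to provide.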
Efficiency-driven model design. The components of a YOLO include the stem, downsampling layers, stages built from basic building blocks, and the head. The stem incurs very little computational cost, so the efficiency-driven design focuses on the other three parts.
(1) Lightweight classification head. In YOLOs, the classification head and regression head usually share the same architecture, yet their computational costs differ markedly. In YOLOv8-S, for example, the classification head's FLOPs and parameter count (5.95G / 1.51M) are 2.5 times and 2.4 times those of the regression head (2.34G / 0.64M). However, analyzing the impact of classification errors versus regression errors (see Table 6) shows that the regression head matters more for YOLO's performance. We can therefore shrink the classification head without worrying about hurting performance. We simply adopt a lightweight architecture consisting of two depthwise separable convolutions with 3×3 kernels followed by a 1×1 convolution, which performs the classification task with far less computational overhead and far fewer parameters.
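The saving from swapping standard convolutions for depthwise separable ones can be verified with a rough multiply-accumulate (MAC) count. The feature-map size and channel count below are made-up illustration values, and biases are ignored:

```python
# Rough MAC/parameter accounting for a head of two standard 3x3 convs versus a
# head of two depthwise-separable 3x3 convs (depthwise 3x3 + pointwise 1x1 each).

def conv_cost(h, w, k, c_in, c_out, groups=1):
    """Returns (MACs, params) for one conv layer at output resolution h x w."""
    params = k * k * (c_in // groups) * c_out
    macs = h * w * params
    return macs, params

def add(*layers):
    """Element-wise sum of (MACs, params) tuples."""
    return tuple(map(sum, zip(*layers)))

H = W = 20   # illustrative feature-map size
C = 128      # illustrative channel count

# Standard head: two 3x3 convolutions, C -> C.
standard = add(conv_cost(H, W, 3, C, C), conv_cost(H, W, 3, C, C))

# Lightweight head: two depthwise-separable blocks (depthwise 3x3 + pointwise 1x1).
lightweight = add(
    conv_cost(H, W, 3, C, C, groups=C), conv_cost(H, W, 1, C, C),  # block 1
    conv_cost(H, W, 3, C, C, groups=C), conv_cost(H, W, 1, C, C),  # block 2
)
```

For these toy shapes the separable head needs roughly 8× fewer MACs and parameters, which is why the head can be slimmed without touching the regression branch.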
(2) Spatial-channel decoupled downsampling. YOLOs typically use a regular 3×3 standard convolution with stride 2 to perform spatial downsampling (from H × W to H/2 × W/2) and channel transformation (from C to 2C) at the same time. This introduces non-negligible computational cost and parameter count. Instead, we propose decoupling the spatial-reduction and channel-increase operations for more efficient downsampling. Specifically, a pointwise convolution first adjusts the channel dimension, and a depthwise convolution then performs the spatial downsampling. This substantially reduces both the computational cost and the parameter count, while retaining as much information as possible during downsampling, lowering latency while maintaining competitive performance.
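The cost gap between the coupled and decoupled schemes can again be checked with simple MAC/parameter arithmetic. Shapes below are illustrative:

```python
# Coupled downsampling: one stride-2 3x3 conv doing C -> 2C in a single step.
# Decoupled downsampling: 1x1 pointwise conv for channels (C -> 2C) at full
# resolution, then a stride-2 3x3 depthwise conv for the spatial reduction.

def conv_cost(h_out, w_out, k, c_in, c_out, groups=1):
    """Returns (MACs, params) given the conv's output resolution."""
    params = k * k * (c_in // groups) * c_out
    return h_out * w_out * params, params

H = W = 40   # illustrative input resolution
C = 64       # illustrative channel count

# Coupled: 3x3, stride 2, C -> 2C; output is (H/2, W/2).
coupled_macs, coupled_params = conv_cost(H // 2, W // 2, 3, C, 2 * C)

# Decoupled: pointwise at full resolution, then stride-2 depthwise on 2C channels.
pw_macs, pw_params = conv_cost(H, W, 1, C, 2 * C)
dw_macs, dw_params = conv_cost(H // 2, W // 2, 3, 2 * C, 2 * C, groups=2 * C)
decoupled_macs = pw_macs + dw_macs
decoupled_params = pw_params + dw_params
```

Even though the pointwise convolution runs at full resolution, the decoupled pair comes out well ahead on both MACs and parameters for these shapes, because the expensive 3×3 kernel no longer mixes channels.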
(3) Rank-guided block design. YOLOs typically use the same basic building block for all stages, e.g., the bottleneck block in YOLOv8. To thoroughly examine this homogeneous design, we use the intrinsic rank to analyze the redundancy of each stage. Specifically, we compute the numerical rank of the last convolution in the last basic block of each stage, i.e., the number of singular values larger than a threshold. Figure 3(a) shows the results for YOLOv8, indicating that deep stages and large models are more prone to redundancy. This observation suggests that simply applying the same block design to all stages is suboptimal for the capacity-efficiency trade-off. To address this, a rank-guided block design scheme is proposed that uses compact architectural designs to reduce the complexity of stages shown to be redundant.
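The numerical-rank probe described above is a one-liner over the singular values of a layer's weight flattened to 2-D. The threshold factor and the toy "weight" below are illustrative, not the paper's exact settings:

```python
# Numerical rank: count singular values above a threshold, here a fraction
# (lam) of the largest singular value. A low count flags a redundant stage.
import numpy as np

def numerical_rank(weight_2d, lam=0.05):
    s = np.linalg.svd(weight_2d, compute_uv=False)
    return int((s > lam * s.max()).sum())

rng = np.random.default_rng(0)
# A nearly rank-4 toy "weight": 4 strong directions plus tiny noise.
w = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
w += 1e-6 * rng.standard_normal((64, 64))
print(numerical_rank(w))  # small relative to the 64x64 shape: redundant
```

A 4-D conv weight would first be reshaped to `(c_out, -1)` before the SVD; the principle is unchanged.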
First, a compact inverted block (CIB) structure is introduced, which adopts cheap depthwise convolutions for spatial mixing and cost-effective pointwise convolutions for channel mixing, as shown in Figure 3(b). It can serve as an efficient basic building block, e.g., embedded in an ELAN structure (Figure 3(b)). A rank-guided block allocation strategy is then advocated to achieve the best efficiency while maintaining competitive capacity. Specifically, given a model, its stages are sorted in ascending order of intrinsic rank. We then examine the performance change after replacing the basic block of the leading (most redundant) stage with the CIB. If there is no performance degradation compared with the given model, we proceed to replace the next stage, and stop otherwise. This yields adaptive compact block designs across stages and model scales, achieving higher efficiency without compromising performance.
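The allocation loop above can be sketched as a greedy procedure. Everything here is mock: the stage names, rank values, tolerance, and the evaluation function (which in practice means retraining and validating the model) are placeholders for illustration:

```python
# Rank-guided block allocation: walk stages from most to least redundant
# (ascending intrinsic rank), swapping in CIBs until accuracy drops.

def rank_guided_allocation(stage_ranks, evaluate, baseline_ap):
    """stage_ranks: {stage_name: intrinsic rank}. evaluate(stages) -> AP with
    CIBs in those stages. Returns the list of stages that got a CIB."""
    replaced = []
    for stage in sorted(stage_ranks, key=stage_ranks.get):  # ascending rank
        if evaluate(replaced + [stage]) >= baseline_ap:
            replaced.append(stage)   # no AP drop: keep the CIB here
        else:
            break                    # AP dropped: stop the process
    return replaced

# Mock evaluation: pretend only the two most redundant stages tolerate a CIB.
def mock_eval(cib_stages):
    return 44.0 if set(cib_stages) <= {"stage3", "stage4"} else 43.5

ranks = {"stage1": 120, "stage2": 90, "stage3": 35, "stage4": 20}
print(rank_guided_allocation(ranks, mock_eval, baseline_ap=44.0))
```

The greedy stop rule is what makes the design adaptive: a large model with more redundant deep stages naturally ends up with more CIBs than a small one.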
Accuracy-driven model design. The paper further explores large-kernel convolution and the self-attention mechanism for accuracy-driven design, aiming to boost performance at minimal cost.
(1) Large-kernel convolution. Adopting large-kernel depthwise convolution is an effective way to enlarge the receptive field and enhance model capability. However, naively applying it in all stages may contaminate the shallow features used for detecting small objects, while also introducing significant I/O overhead and latency in the high-resolution stages. The authors therefore propose using large-kernel depthwise convolutions only in the compact inverted blocks (CIBs) of the deep stages, increasing the kernel size of the second 3×3 depthwise convolution in the CIB to 7×7. In addition, structural re-parameterization is adopted: an extra 3×3 depthwise convolution branch is introduced to ease optimization without adding inference overhead. Moreover, as model size grows, the receptive field naturally widens and the benefit of large-kernel convolution diminishes, so it is adopted only at small model scales.
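The "no inference overhead" claim rests on convolution being linear: at deploy time, a parallel 3×3 branch can be folded into the 7×7 kernel by zero-padding it to 7×7 and adding (per channel, since both are depthwise; BatchNorm folding is omitted here). The kernels below are random single-channel examples:

```python
# Kernel-space structural re-parameterization: merge a parallel 3x3 branch
# into a 7x7 kernel. conv(x, k7) + conv(x, k3) == conv(x, k7 + zero_pad(k3)),
# so the two training-time branches become one inference-time convolution.
import numpy as np

def merge_kernels(k7, k3):
    pad = (k7.shape[0] - k3.shape[0]) // 2   # 2 cells on each side for 7 vs 3
    return k7 + np.pad(k3, pad)

rng = np.random.default_rng(1)
k7 = rng.standard_normal((7, 7))
k3 = rng.standard_normal((3, 3))
merged = merge_kernels(k7, k3)
```

After merging, the 3×3 branch is discarded, so the deployed model runs a single 7×7 depthwise convolution per CIB.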
(2) Partial self-attention (PSA). The self-attention mechanism is widely used across vision tasks for its excellent global modeling capability, but it suffers from high computational complexity and memory footprint. To address this, and given the prevalent redundancy across attention heads, the authors design an efficient partial self-attention (PSA) module, as shown in Figure 3(c). Specifically, the features after a 1×1 convolution are split evenly into two parts along the channel dimension. Only one part is fed into N_PSA blocks composed of a multi-head self-attention module (MHSA) and a feed-forward network (FFN); the two parts are then concatenated and fused by a 1×1 convolution. In addition, the dimension of the queries and keys in MHSA is set to half that of the values, and LayerNorm is replaced with BatchNorm for fast inference. PSA is placed only after Stage 4, which has the lowest resolution, to avoid the excessive overhead of self-attention's quadratic complexity. In this way, global representation learning can be incorporated into YOLOs at low computational cost, nicely enhancing model capability and improving performance.
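A back-of-envelope MAC count shows why halving the attended channels and the query/key dimension pays off. The token count and channel width are illustrative, and the Q/K/V projection layers are ignored for brevity:

```python
# Quadratic attention cost for N tokens: N*N*d_qk for the score matrix plus
# N*N*d_v for the weighted sum of values. Full attention uses all C channels;
# the partial scheme attends over C/2 channels with d_qk = d_v / 2.

def attention_macs(n_tokens, d_qk, d_v):
    return n_tokens * n_tokens * (d_qk + d_v)

N = 20 * 20   # e.g. a low-resolution stage-4 map flattened to tokens
C = 256       # illustrative channel width

full = attention_macs(N, C, C)                  # attention over all channels
partial = attention_macs(N, C // 4, C // 2)     # half the channels, d_qk = d_v / 2
```

The N² factor is untouched, which is exactly why PSA is placed only after the lowest-resolution stage; the channel-side savings handle the rest.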
Experimental comparison
We won't go into much detail here; straight to the results! Latency keeps dropping while performance keeps climbing.
The above is the detailed content of YOLOv10 is here! True real-time end-to-end target detection. For more information, please follow other related articles on the PHP Chinese website!
