YOLOv4 (arXiv:2004.10934) paper brief: feature combinations for speed and accuracy in object detection

YOLOv4 studies which CNN features and training techniques reliably improve object detection, and it reports combining selected components such as CSP, CmBN, SAT, Mish, Mosaic augmentation, DropBlock, and CIoU loss to reach state-of-the-art results.

February 22, 2026 • Mira Vale • llm systems

What this paper is about

The paper, titled “YOLOv4: Optimal Speed and Accuracy of Object Detection,” presents a detector design that aims to balance speed and accuracy. [S1] It starts from the observation that “a huge number of features” are said to improve Convolutional Neural Network (CNN) accuracy, and it argues that such claims need practical testing of feature combinations on large datasets together with theoretical justification of the results. [S1] It distinguishes features that operate only on certain models, only for certain problems, or only for small-scale datasets from features such as batch-normalization and residual-connections, which are applicable to the majority of models, tasks, and datasets. [S1] The paper assumes that such “universal features” include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and Mish-activation. [S1] It reports using WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combining some of them to reach state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of about 65 FPS on a Tesla V100. [S1]

Core claims to remember

The paper states that many CNN “features” are claimed to improve accuracy, and that practical testing of feature combinations on large datasets, along with theoretical justification of the results, is required. [S1] The paper reports that feature applicability varies: some features operate only on certain models, certain problems, or small-scale datasets, while batch-normalization and residual-connections are applicable to the majority of models, tasks, and datasets. [S1] The paper assumes a specific set of “universal features,” namely WRC, CSP, CmBN, SAT, and Mish-activation. [S1] The feature set it reports using adds Mosaic data augmentation, DropBlock regularization, and CIoU loss to those five. [S1] The paper reports that combining some of the listed components achieves state-of-the-art results of 43.5% AP (65.7% AP50) on MS COCO at about 65 FPS on a Tesla V100. [S1]

Limitations and caveats

The paper reports that some features operate exclusively on certain models, exclusively for certain problems, or only for small-scale datasets. [S1] The paper states that practical testing of combinations of features on large datasets is required, which places emphasis on empirical evaluation across combinations. [S1] The paper also states that theoretical justification of the result is required, which sets an expectation for explanatory analysis beyond empirical testing. [S1] The reported speed and accuracy figures are tied to a specific setting: the MS COCO dataset and a Tesla V100 GPU. [S1]

How to apply this in study or projects

Make a written list of the “universal features” that the paper assumes, namely WRC, CSP, CmBN, SAT, and Mish-activation, and keep the wording aligned with the paper’s terminology. [S1] Make a second written list of the full set of features the paper reports using, including Mosaic data augmentation, DropBlock regularization, and CIoU loss, alongside WRC, CSP, CmBN, SAT, and Mish activation. [S1] Track the paper’s explicit distinction between features described as broadly applicable, such as batch-normalization and residual-connections, and features described as operating only on certain models, problems, or dataset scales. [S1]

Read EfficientNet’s description of compound scaling, including its statement that carefully balancing depth, width, and resolution can lead to better performance, and record the exact scaling dimensions it names. [S4] Read SimCLR’s list of major framework components, including its statements about the role of augmentation composition, the role of a learnable nonlinear transformation between representation and contrastive loss, and the role of larger batch sizes and more training steps. [S3] Read CLIP’s description of fixed-category supervision and its statement that predicting which caption goes with which image is used as a pre-training task, and record how it describes zero-shot transfer via natural language. [S2]
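The two YOLOv4 feature lists suggested in this section can also be kept as plain data, which makes the difference between them easy to check. The following is a minimal sketch: the feature names come from the abstract [S1], while the variable and function names are illustrative assumptions, and the abstract's "Mish-activation" and "Mish activation" are treated as the same feature.

```python
# Feature names are taken from the YOLOv4 abstract [S1].
# Variable and function names are illustrative, not from the paper.
# Assumption: "Mish-activation" and "Mish activation" name the same feature.
UNIVERSAL_FEATURES = ["WRC", "CSP", "CmBN", "SAT", "Mish activation"]

FEATURES_USED = [
    "WRC", "CSP", "CmBN", "SAT", "Mish activation",
    "Mosaic data augmentation", "DropBlock regularization", "CIoU loss",
]


def used_but_not_universal(used, universal):
    """Return features reported as used that are absent from the universal list, in order."""
    universal_set = set(universal)
    return [name for name in used if name not in universal_set]


extra = used_but_not_universal(FEATURES_USED, UNIVERSAL_FEATURES)
print(extra)
# ['Mosaic data augmentation', 'DropBlock regularization', 'CIoU loss']
```

The three remaining names are the ones the abstract lists as used without placing them in its assumed universal set.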

Sources

[S1] arxiv.org
YOLOv4: Optimal Speed and Accuracy of Object Detection
There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet
[S2] arxiv.org
Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

FAQ

What techniques does YOLOv4 (2004.10934) report using?

The paper reports using WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss. [S1]
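The abstract names CIoU loss without defining it. As a rough illustration, the sketch below follows the commonly published Complete-IoU definition (an IoU term, a normalized center-distance penalty, and an aspect-ratio penalty) for a single pair of axis-aligned boxes. The function name, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for this example; it is not the Darknet implementation referenced in [S1].

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.

    Hypothetical helper for illustration only; eps avoids division by zero.
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((x1 + x2) - (gx1 + gx2)) ** 2 / 4 + ((y1 + y2) - (gy1 + gy2)) ** 2 / 4

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(x2, gx2) - min(x1, gx1)
    enclose_h = max(y2, gy2) - min(y1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is approximately zero. For two non-overlapping unit squares such as `(0, 0, 1, 1)` and `(2, 2, 3, 3)`, the IoU term contributes 1 and the center-distance penalty adds 8/18, so the loss still reflects how far apart the boxes are.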

What does YOLOv4 call “universal features”?

The paper assumes that universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT), and Mish-activation. [S1]
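Of the universal features listed above, Mish activation is the easiest to write down. The sketch below uses the standard definition, mish(x) = x · tanh(softplus(x)), which comes from the general literature on Mish rather than from the abstract in [S1]. The scalar, pure-Python form and the function names are assumptions chosen for readability, not how a training framework would implement it.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-20.0, 0.0, 1.0, 20.0):
    print(value, mish(value))
```

The function is zero at zero, approaches the identity for large positive inputs, and approaches zero for large negative inputs.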