
Transformers only look once with nonlinear combination for real-time object detection

  • Mandy Qi
  • R. Xia
  • G. Li
  • Z. Huang
  • Y. Pang

    Research output: Contribution to journal > Article > peer-review

    Abstract

    In this article, a novel real-time object detector called Transformers Only Look Once (TOLO) is proposed to resolve two problems. The first is that many modern real-time object detectors are inefficient at building long-distance dependencies among local features. The second is that vision Transformer networks lack inductive biases and carry a heavy computational cost. TOLO is composed of a Convolutional Neural Network (CNN) backbone, a Feature Fusion Neck (FFN), and different Lite Transformer Heads (LTHs). These components, respectively, transfer the inductive biases, supply the extracted features with high-resolution and high-semantic properties, and efficiently mine multiple long-distance dependencies with less memory overhead for detection.

    Moreover, to recover the many potentially correct boxes during prediction, we propose a simple and efficient nonlinear combination of the object confidence and the classification score. Experiments on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO 2017 datasets demonstrate that TOLO significantly outperforms other state-of-the-art methods with a small input size. In addition, the proposed nonlinear combination further improves the detection performance of TOLO by boosting the scores of potentially correct predicted boxes, without extra training cost or additional model parameters.
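
The abstract does not state the form of the nonlinear combination, so the following is only an illustrative sketch and not the authors' formula. It assumes a weighted geometric mean of the two scores, with a hypothetical exponent `alpha`, and contrasts it with the plain product commonly used by one-stage detectors. The function names and the threshold value are likewise assumptions made for the example.

```python
def product_score(obj_conf: float, cls_score: float) -> float:
    """Conventional linear combination: plain product of the two scores."""
    return obj_conf * cls_score


def nonlinear_score(obj_conf: float, cls_score: float, alpha: float = 0.5) -> float:
    """Hypothetical nonlinear combination (an assumption, not the paper's formula).

    A weighted geometric mean: obj_conf**alpha * cls_score**(1 - alpha).
    For inputs in [0, 1] and alpha in [0, 1] it is never smaller than the
    plain product, so boxes with one weak score are penalized less.
    """
    return (obj_conf ** alpha) * (cls_score ** (1.0 - alpha))


def keep_box(score: float, threshold: float = 0.5) -> bool:
    """Post-processing filter; the threshold value is illustrative only."""
    return score >= threshold


# A box with low object confidence but a strong class score.
obj_conf, cls_score = 0.4, 0.9
linear = product_score(obj_conf, cls_score)
nonlinear = nonlinear_score(obj_conf, cls_score)
print(keep_box(linear), keep_box(nonlinear))
```

Because a combination of this kind is applied only when scoring predicted boxes, it leaves the trained weights untouched, which matches the abstract's claim that the method needs no extra training or parameters.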
    Original language: English
    Journal: Neural Computing and Applications
    DOIs
    Publication status: Published - 21 May 2022

    Keywords

    • Non-linear combination
    • Real-time object detector
    • TOLO
    • Vision Transformer networks
