DctViT: Discrete Cosine Transform meets vision transformers

Abstract

Vision transformers (ViTs) have become one of the dominant frameworks for vision tasks in recent years because self-attention lets them efficiently capture long-range dependencies in image recognition. Both CNNs and ViTs have strengths and weaknesses in vision tasks, and several studies suggest that combining the two is an effective way to balance performance and computational cost. In this paper, we propose a new hybrid network based on CNNs and transformers, using CNNs to extract local features and transformers to capture long-range dependencies. We also propose a new feature-map down-sampling method based on the Discrete Cosine Transform and self-attention, named DCT-Attention Down-sample (DAD). Our DctViT-L achieves 84.8% top-1 accuracy on ImageNet-1K, far outperforming CMT, Next-ViT, SpectFormer and other state-of-the-art models at a lower computational cost. With DctViT-B as the backbone, RetinaNet achieves 46.8% mAP on COCO val2017, improving mAP by 2.5% and 1.1% over CMT-S and SpectFormer backbones, respectively, with less computation.

Copyright © 2024 The Authors. Published by Elsevier Ltd. All rights reserved.
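The abstract describes DAD only at a high level. As a rough illustration of how DCT-based down-sampling can be combined with self-attention, here is a minimal PyTorch sketch. The module name `DctAttentionDownsample`, the low-frequency-truncation strategy, and the attention hyper-parameters are assumptions made for this example, not the paper's actual implementation.

```python
import math
import torch
import torch.nn as nn


def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # spatial index
    basis = torch.cos(math.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)


class DctAttentionDownsample(nn.Module):
    """Hypothetical sketch: 2x spatial down-sampling in the DCT (frequency)
    domain, followed by multi-head self-attention over the reduced tokens."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; H and W assumed even
        b, c, h, w = x.shape
        dh = dct_matrix(h).to(x.device)
        dw = dct_matrix(w).to(x.device)
        # 2D DCT: transform rows, then columns
        freq = dh @ x @ dw.t()                               # (B, C, H, W)
        # keep only the low-frequency quadrant -> halves spatial resolution
        low = freq[..., : h // 2, : w // 2]
        # inverse DCT with the truncated orthonormal bases; the 0.5 factor
        # keeps a constant input at the same value after resizing
        ih = dct_matrix(h // 2).to(x.device)
        iw = dct_matrix(w // 2).to(x.device)
        y = 0.5 * (ih.t() @ low @ iw)                        # (B, C, H/2, W/2)
        # self-attention over the down-sampled tokens
        tokens = y.flatten(2).transpose(1, 2)                # (B, H/2*W/2, C)
        q = self.norm(tokens)
        tokens = tokens + self.attn(q, q, q, need_weights=False)[0]
        return tokens.transpose(1, 2).reshape(b, c, h // 2, w // 2)
```

For example, `DctAttentionDownsample(dim=64)(torch.randn(2, 64, 32, 32))` returns a tensor of shape `(2, 64, 16, 16)`. In a real implementation the DCT bases would be precomputed and cached as buffers rather than rebuilt on every forward pass.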
