Transformer-based monocular 3D object detection methods have recently progressed significantly. However, most existing methods struggle to handle fine-grained objects and complex scenes effectively, particularly when capturing the features of occluded or small objects. To tackle these issues, we propose CU-DETR, a monocular 3D object detector built on the MonoDETR framework. CU-DETR introduces a local-global fusion encoder to enhance local feature extraction and fusion, and applies an uncertainty perturbation strategy to the position encoding to improve the model's performance in complex scenes. Experimental results on the public KITTI dataset demonstrate that CU-DETR outperforms MonoDETR.
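The abstract does not specify how the uncertainty perturbation is applied, so the following is only a minimal sketch of one plausible reading: positional embeddings are perturbed during training by Gaussian noise whose scale is a learned, per-position uncertainty. The module name, the learned log-sigma parameter, and the train-only noise are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn


class UncertainPositionEncoding(nn.Module):
    """Hypothetical sketch: learnable position embeddings perturbed by
    Gaussian noise scaled by a learned per-position uncertainty."""

    def __init__(self, dim: int, num_positions: int):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(num_positions, dim))
        # Learned log standard deviation controlling the perturbation scale.
        self.log_sigma = nn.Parameter(torch.zeros(num_positions, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, dim)
        pos = self.pos_embed.unsqueeze(0)
        if self.training:
            # Stochastic perturbation is applied only during training.
            pos = pos + torch.randn_like(pos) * self.log_sigma.exp()
        return x + pos
```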
CNN-Transformer hybrid models, which combine the strength of Transformers in capturing global context with that of CNNs in local feature extraction, have become an appealing direction in vision perception. However, hybrid models still face the significant challenge of keeping computational cost low while balancing throughput and accuracy. This paper proposes HTViT, an efficient CNN-Transformer hybrid model that improves throughput and reduces memory consumption while maintaining high accuracy. Building on the three-stage architecture of LeViT, HTViT introduces a sparse cascaded group attention mechanism and a global-local downsampling module. The sparse cascaded group attention mechanism compresses the keys and values in each group attention via local aggregation, improving throughput and reducing memory consumption. The global-local downsampling module introduces multi-scale convolutional downsampling to enhance local features and retain more valuable information, improving model performance. Comparison experiments with SOTA efficient hybrid models are conducted on the CIFAR-10, STL-10, and Imagenette datasets. The experimental results demonstrate that HTViT significantly outperforms the baseline model LeViT and achieves a better balance of model size, throughput, memory consumption, and accuracy than other hybrid models.
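As a rough illustration of the idea described above, the sketch below splits attention heads into groups and compresses the keys and values of each group by average pooling over tokens (one simple form of local aggregation), so attention is computed against fewer key/value tokens. The class name, the pooling window, and the omission of the cascaded connection between groups are all assumptions; this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseGroupAttention(nn.Module):
    """Hypothetical sketch: group attention with K/V compressed by local
    average pooling over tokens; the cascade between groups is omitted."""

    def __init__(self, dim: int, num_groups: int = 4, pool: int = 4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.group_dim = dim // num_groups
        self.pool = pool  # local window used to aggregate K/V tokens
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape  # n is assumed divisible by self.pool
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into groups: (b, groups, n, group_dim).
        q = q.view(b, n, self.num_groups, self.group_dim).transpose(1, 2)
        k = k.view(b, n, self.num_groups, self.group_dim).transpose(1, 2)
        v = v.view(b, n, self.num_groups, self.group_dim).transpose(1, 2)
        # Local aggregation: average-pool K and V along the token axis.
        k = F.avg_pool1d(k.reshape(b * self.num_groups, n, -1).transpose(1, 2),
                         self.pool, self.pool).transpose(1, 2)
        v = F.avg_pool1d(v.reshape(b * self.num_groups, n, -1).transpose(1, 2),
                         self.pool, self.pool).transpose(1, 2)
        k = k.view(b, self.num_groups, -1, self.group_dim)
        v = v.view(b, self.num_groups, -1, self.group_dim)
        # Attention against the compressed K/V: (b, groups, n, group_dim).
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```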