Transformer models have demonstrated remarkable emergent capabilities in natural language processing. Their performance is bounded largely by the availability of large training datasets, which can be tractably obtained because language models are pre-trained with self-supervision in the form of token masking. He et al. and Cao et al. recently showed the power of this masking technique in vision, using masked autoencoders as scalable, self-supervised pre-trainers for vision transformers. Feichtenhofer et al. extended these techniques to video, showing that masked autoencoders are scalable spatiotemporal learners as well. To the best of our knowledge, these techniques have only been evaluated on ground-level, object-centric imagery and video. Extending them to remote or overhead imagery presents two significant problems. First, the objects of interest are small relative to the typical mask patch size. Second, the frames are not object-centered. In this study, we explore whether modern self-supervised pre-training techniques such as masked autoencoding extend well to overhead wide area motion imagery (WAMI) data. We argue that pre-training techniques like MAE are well suited to WAMI given the typical object size in this domain and the ability to leverage strong global spatial context. To this end, we conduct a comprehensive exploration of patch sizes and masking ratios on the popular WAMI dataset WPAFB 2009. We find that domain-specific adjustments to these pre-training techniques yield downstream performance improvements on computer vision tasks, including object detection.
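To make the interplay between patch size and masking ratio concrete, the following is a minimal sketch of MAE-style random patch masking in PyTorch. It is not the paper's implementation; the helper names (`patchify`, `random_masking`) and the example image size are illustrative assumptions.

```python
# Minimal sketch of MAE-style random patch masking (illustrative, not the
# authors' code). Shows how patch size and masking ratio determine what the
# encoder sees during self-supervised pre-training.
import torch

def patchify(imgs: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, patch_size*patch_size*C) patches."""
    b, c, h, w = imgs.shape
    assert h % patch_size == 0 and w % patch_size == 0
    p = patch_size
    x = imgs.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)
    return x

def random_masking(patches: torch.Tensor, mask_ratio: float):
    """Keep a random subset of patches; return kept patches and a binary mask."""
    b, n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n)            # one random score per patch
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n)             # 1 = masked, 0 = visible to the encoder
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask

# Example: 64 px crops with 4 px patches -> 256 patches of dim 4*4*3 = 48;
# a 0.75 masking ratio leaves 64 visible patches per image.
imgs = torch.randn(2, 3, 64, 64)
kept, mask = random_masking(patchify(imgs, patch_size=4), mask_ratio=0.75)
print(kept.shape, mask.sum(dim=1))  # torch.Size([2, 64, 48]), 192 masked each
```

The design trade-off the abstract points at is visible here: shrinking the patch size below the 16 px default common in ViT/MAE setups keeps a small vehicle from being swallowed by a single masked patch, at the cost of a much longer token sequence per frame.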
Multiple object tracking (MOT) is a common computer vision problem that focuses on detecting objects and maintaining their identities through a sequence of image frames. Until now, there have been three main approaches to improving MOT performance: 1) improving the detector, 2) improving the tracker, or 3) jointly modeling detection and tracking. In this work, we argue that there is a fourth, simpler way to improve MOT performance: fusing the outputs of multiple trackers. We introduce a novel approach, TrackFuse, that fuses the final tracks from two different models into a single output, analogous to classification ensembling or weighted box fusion for object detection. The fundamental assumption of TrackFuse is that different trackers, like different detectors, fail in distinct ways; by fusing the outputs of multiple approaches to MOT, we can therefore improve tracking performance. We test our approach on combinations of several high-performing trackers and show state-of-the-art results on the MOTA metric on a held-out validation set of the MOT17 dataset, compared to the individual tracking models. Furthermore, fusing trackers consistently improves multiple metrics over the individual model outputs sent for fusion. Our code will be released soon.
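To make the fusion idea concrete, below is a minimal sketch of matching and merging tracks from two trackers by per-frame box overlap, in the spirit of weighted box fusion. This is not the TrackFuse algorithm itself; the `Track` structure, the greedy matching, and the 0.5 IoU threshold are assumptions for illustration.

```python
# Illustrative sketch of track-level fusion between two trackers (not the
# TrackFuse implementation): match tracks by mean per-frame IoU, merge matched
# pairs, and pass unmatched tracks through.
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)
    score: float = 1.0

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def track_overlap(t1: Track, t2: Track) -> float:
    """Mean IoU over frames where both tracks have a box (0 if disjoint)."""
    frames = set(t1.boxes) & set(t2.boxes)
    if not frames:
        return 0.0
    return sum(iou(t1.boxes[f], t2.boxes[f]) for f in frames) / len(frames)

def fuse_tracks(tracks_a, tracks_b, thr=0.5):
    """Greedily merge each track in A with its best-overlapping track in B."""
    fused, used_b = [], set()
    for ta in tracks_a:
        best, best_ov = None, thr
        for tb in tracks_b:
            if tb.track_id in used_b:
                continue
            ov = track_overlap(ta, tb)
            if ov > best_ov:
                best, best_ov = tb, ov
        if best is not None:
            used_b.add(best.track_id)
            merged = dict(best.boxes)
            merged.update(ta.boxes)  # prefer tracker A's boxes on shared frames
            fused.append(Track(ta.track_id, merged, max(ta.score, best.score)))
        else:
            fused.append(ta)  # unmatched A tracks pass through
    fused.extend(tb for tb in tracks_b if tb.track_id not in used_b)
    return fused
```

Unmatched tracks from either model pass through unchanged, which is what lets one tracker's unique failures be covered by the other, the core assumption behind fusing trackers.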