This study addresses the problem of temporal activity localization via natural language (TALL) in untrimmed videos. The task is difficult because localization of the target temporal activity can be misled by a disordered query. Existing approaches rely on sliding windows, regression, or ranking and handle the query without grammar-based rules; when a query is out of sequence and cannot be correlated with the relevant activity, their performance deteriorates. To address the issue of non-sequential queries, we introduce the concepts of visual, action, object, and connecting words. Our proposed architecture, the integration of visual-textual entities network (IVTEN), consists of three submodules: (1) a visual graph convolutional network (visual-GCN), (2) a textual graph convolutional network (textual-GCN), and (3) a compatible method for learning embeddings (CME). Visual nodes detect the activity, object, and actor, while textual nodes maintain the word sequence using grammar-based rules. The CME integrates the two modalities (activity and query) together with the trained grammar-based words into a single embedding space. We also include a stochastic latent variable in the CME to align the query sequence with the relevant activity and preserve its order. On three standard benchmark datasets, Charades-STA, TACoS, and ActivityNet-Captions, our IVTEN approach outperforms the state of the art.
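The abstract does not give implementation details, but a minimal sketch can illustrate how the three submodules might compose: a visual-GCN pooling actor/object/activity nodes, a textual-GCN pooling grammar-linked word nodes, and a CME that fuses both embeddings through a stochastic latent variable. All class names, dimensions, and layer choices below are assumptions for illustration (PyTorch is assumed), not the authors' code.

```python
# Hypothetical sketch of the three IVTEN submodules described in the abstract.
# Names, dimensions, and layer choices are assumptions, not the published model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step: relu(adj @ x @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: normalized adjacency (nodes x nodes); x: node features
        return F.relu(self.linear(adj @ x))


class VisualGCN(nn.Module):
    """Aggregates actor/object/activity node features from a video segment."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)

    def forward(self, node_feats, adj):
        h = self.gcn2(self.gcn1(node_feats, adj), adj)
        return h.mean(dim=0)  # pooled visual embedding


class TextualGCN(nn.Module):
    """Encodes query words as graph nodes; edges follow grammar-based rules."""
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gcn1 = GCNLayer(emb_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)

    def forward(self, word_ids, adj):
        h = self.embed(word_ids)
        h = self.gcn2(self.gcn1(h, adj), adj)
        return h.mean(dim=0)  # pooled textual embedding


class CME(nn.Module):
    """Maps both modalities into one space via a stochastic latent variable."""
    def __init__(self, hid_dim, latent_dim):
        super().__init__()
        self.to_stats = nn.Linear(2 * hid_dim, 2 * latent_dim)
        self.score = nn.Linear(latent_dim, 1)

    def forward(self, vis_emb, txt_emb):
        mu, logvar = self.to_stats(torch.cat([vis_emb, txt_emb])).chunk(2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.score(z), mu, logvar  # alignment score + terms for a KL loss
```

In this reading, the stochastic latent variable plays the role of a variational bottleneck over the joint visual-textual embedding, which is one plausible way to "align and retain the query sequence with the relevant activity" as the abstract states.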