Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
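To make the comparison concrete, the following is a minimal sketch of how such a retrieval experiment can be scored, assuming per-video OCR (slide) and ASR (spoken) transcripts are available as plain strings. The tokenizer, the term-frequency scoring, and all function names are illustrative assumptions, not the system described in the paper.

from collections import Counter

def tokenize(text):
    # Lowercase and strip common punctuation; a stand-in for real text normalization.
    return [t.strip('.,;:!?()"') for t in text.lower().split()]

def score(query, transcript):
    # Simple term-frequency match of the query against one transcript.
    tf = Counter(tokenize(transcript))
    return sum(tf[q] for q in tokenize(query))

def rank_videos(query, transcripts):
    # transcripts: dict mapping video id -> transcript text (OCR or ASR).
    return sorted(transcripts, key=lambda v: score(query, transcripts[v]), reverse=True)

def precision_at_k(ranked, relevant, k=10):
    # Fraction of the top-k results judged relevant for the query.
    return sum(1 for v in ranked[:k] if v in relevant) / k

Running rank_videos once over the OCR transcripts and once over the ASR transcripts, with the same queries and relevance judgments, yields the kind of precision comparison reported above.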
This paper describes research activities at FX Palo Alto Laboratory (FXPAL) in the area of multimedia browsing, search, and retrieval. We first consider interfaces for organization and management of personal photo collections. We then survey our work on interactive video search and retrieval. Throughout we discuss the evolution of both the research challenges in these areas and our proposed solutions.
Technology abounds for capturing presentations. However, no simple solution exists that is completely automatic. ProjectorBox is a "zero user interaction" appliance that automatically captures, indexes, and manages presentation multimedia. It operates continuously to record the RGB information sent from presentation devices, such as a presenter's laptop, to display devices, such as a projector. It seamlessly captures high-resolution slide images, text and audio. It requires no operator, specialized software, or changes to current presentation practice. Automatic media analysis is used to detect presentation content and segment presentations. The analysis substantially enhances the web-based user interface for browsing, searching, and exporting captured presentations. ProjectorBox has been in use for over a year in our corporate conference room, and has been deployed in two universities. Our goal is to develop automatic capture services that address both corporate and educational needs.
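The presentation-segmentation step can be pictured with a small sketch: flag a new slide whenever consecutive captured RGB frames differ by more than a threshold. The sampling rate, threshold, and function below are illustrative assumptions, not ProjectorBox's actual analysis pipeline.

import numpy as np

def slide_changes(frames, threshold=0.05):
    # frames: iterable of equally sized arrays (H x W x 3), values scaled to [0, 1].
    changes = []
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None:
            # Mean absolute per-pixel difference between consecutive captures.
            diff = float(np.mean(np.abs(frame - prev)))
            if diff > threshold:
                changes.append(i)  # frame index where a new slide likely appears
        prev = frame
    return changes

In practice such a detector would be combined with further filtering, since transient content such as video clips played during a talk also triggers large frame differences.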
We present a framework for analyzing the structure of digital media streams. Though our methods work for video, text, and audio, we concentrate on detecting the structure of digital music files. In the first step, spectral data is used to construct a similarity matrix calculated from inter-frame spectral similarity. The digital audio can be robustly segmented by correlating a kernel along the diagonal of the similarity matrix. Once segmented, spectral statistics of each segment are computed. In the second step, segments are clustered based on the self-similarity of their statistics. This reveals the structure of the digital music as a set of segment boundaries and labels. Finally, the music is summarized by selecting clusters with segments repeated throughout the piece. The summaries can be customized for various applications based on the structure of the original music.
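The segmentation step can be sketched compactly. The code below, with simplified feature handling and an illustrative kernel size, computes a cosine similarity matrix from per-frame spectral features and correlates a checkerboard kernel along its diagonal; peaks in the resulting novelty curve mark candidate segment boundaries.

import numpy as np

def novelty_curve(features, kernel_size=16):
    # features: (n_frames, n_bins) array of spectral features, one row per frame.
    norms = np.linalg.norm(features, axis=1, keepdims=True) + 1e-9
    unit = features / norms
    S = unit @ unit.T  # cosine similarity between every pair of frames

    # Checkerboard kernel: +1 within the past/future blocks, -1 across them.
    half = kernel_size // 2
    kernel = np.ones((kernel_size, kernel_size))
    kernel[:half, half:] = -1
    kernel[half:, :half] = -1

    # Correlate the kernel along the main diagonal of S.
    n = S.shape[0]
    novelty = np.zeros(n)
    for i in range(half, n - half):
        novelty[i] = np.sum(S[i - half:i + half, i - half:i + half] * kernel)
    return novelty  # peaks indicate likely segment boundaries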
Automated target recognition (ATR) is a problem of great importance in a wide variety of applications: from military target recognition to recognizing flow patterns in fluid dynamics to anatomical shape studies. The basic goal is to utilize observations (images, signals) from remote sensors (such as video, radar, MRI, or PET) to identify the objects being observed. In a statistical framework, probability distributions on parameters representing the object unknowns are derived and analyzed to compute inferences (please refer to [1] for a detailed introduction). An important challenge in ATR is to determine efficient mathematical models for the tremendous variability of object appearance which lend themselves to reasonable inferences. This variation may be due to differences in object shapes, sensor mechanisms, or scene backgrounds. To build models for object variability, we employ deformable templates. In brief, object occurrences are described through their typical representatives (called templates) and transformations/deformations which particularize the templates to the observed objects. Within this pattern-theoretic framework, ATR becomes a problem of selecting appropriate templates and estimating deformations. For an object α ∈ A, let I^α denote a template (for example, a triangulated CAD surface), and let s ∈ S be a particular transformation; denote the transformed template by sI^α. Figure 1 shows instances of the template for a T62 tank at several different orientations. For the purpose of object classification, the unknown transformation s is considered a nuisance parameter, leading to a classical formulation of Bayesian hypothesis testing in the presence of unknown, random nuisance parameters. S may not be a vector space, but it often has a group structure. For rigid objects, the variation in translation and rotation can be modeled through the action of the special Euclidean group SE(n). For flexible objects, such as anatomical shapes, higher-dimensional groups such as diffeomorphisms are utilized.
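In this formulation, classification with the nuisance transformation integrated out takes the standard Bayesian form (the priors P(α) and π(s) below are generic placeholders):

\hat{\alpha} = \arg\max_{\alpha \in A} \; P(\alpha) \int_{S} p(I \mid s I^{\alpha}) \, \pi(s) \, ds,

where p(I | sI^α) is the sensor likelihood of the observed data given the transformed template, and the integral is over the transformation group, e.g., SE(n) for rigid objects.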
The recognition of targets in infrared scenes is complicated by the wide variety of appearances associated with different thermodynamic states. We represent the variability in the thermodynamic signatures of targets via an expansion in terms of 'eigentanks' derived from a principal component analysis performed over the target's surface. Employing a Poisson sensor likelihood, or equivalently a likelihood based on Csiszár's I-divergence, a natural discrepancy measure for nonnegative images, yields a coupled set of nonlinear equations which must be solved to compute maximum a posteriori (MAP) estimates of the thermodynamic expansion coefficients. We propose a weighted least-squares approximation to the Poisson log-likelihood for which the MAP estimates are solutions of linear equations. Bayesian model order estimation techniques are employed to choose the number of coefficients; this prevents target models with numerous eigentanks in their representation from having an unfair advantage over simple target models. The Bayesian integral is approximated by Schwarz's application of Laplace's method of integration; this technique is closely related to Rissanen's minimum description length and Wallace's minimum message length criteria. Our implementation of these techniques on Silicon Graphics computers exploits the flexible nature of their rendering engines. The implementation is illustrated in estimating the orientation of a tank and the optimum number of representative eigentanks for real data provided by the U.S. Army Night Vision and Electronic Sensors Directorate.
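To spell out the relationships named above, write d_i for the nonnegative sensor counts and μ_i(θ) for the mean image rendered from the eigentank coefficients θ (this notation, and the expansions below, are our illustration of the standard identities rather than the paper's derivation). Maximizing the Poisson log-likelihood

L(\theta) = \sum_i \left[ d_i \log \mu_i(\theta) - \mu_i(\theta) \right]

is, up to terms independent of θ, equivalent to minimizing Csiszár's I-divergence

I(d \,\|\, \mu(\theta)) = \sum_i \left[ d_i \log \frac{d_i}{\mu_i(\theta)} - d_i + \mu_i(\theta) \right],

and a second-order expansion about the data motivates the weighted least-squares surrogate

L(\theta) \approx -\frac{1}{2} \sum_i \frac{\left( d_i - \mu_i(\theta) \right)^2}{d_i},

whose stationarity conditions are linear in θ when μ is linear in the expansion coefficients. Schwarz's Laplace-method approximation to the Bayesian integral then penalizes a model with m coefficients by roughly (m/2) log N for N measurements, which is what removes the unfair advantage of large eigentank expansions.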
In our earlier work, we focused on pose estimation of ground-based targets as viewed via forward-looking passive infrared (FLIR) systems and laser radar (LADAR) imaging sensors. In this paper, we will study individual and joint sensor performance to provide a more complete understanding of our sensor suite. We will also study the addition of a high range-resolution radar (HRR). Data from these three sensors are simulated using CAD models for the targets of interest in conjunction with XPATCH range radar simulation software, Silicon Graphics workstations, and the PRISM infrared simulation package. Using a Lie group representation of the orientation space and a Bayesian estimation framework, we quantitatively examine both pose-dependent variations in performance and the relative performance of the aforementioned sensors via mean squared error analysis. Using the Hilbert-Schmidt norm as an error metric, the minimum mean squared error (MMSE) estimator is reviewed and mean squared error (MSE) performance analysis is presented. Results of simulations are presented and discussed. In our simulations, FLIR and HRR sensitivities were characterized by their respective signal-to-noise ratios (SNRs) and the LADAR by its carrier-to-noise ratio (CNR). These figures of merit can, in turn, be related to the sensor, atmosphere, and target parameters for scenarios of interest.
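For reference, the Hilbert-Schmidt MMSE pose estimator has the standard form (the notation below is generic; O is the true pose, and the expectation is taken over the posterior):

\hat{O} = \arg\min_{\bar{O} \in SO(n)} E\left[ \| \bar{O} - O \|_{HS}^2 \mid \text{data} \right], \qquad \|A\|_{HS}^2 = \mathrm{tr}(A^T A).

Since \| \bar{O} - O \|_{HS}^2 = 2n - 2\,\mathrm{tr}(\bar{O}^T O), the minimizer maximizes \mathrm{tr}(\bar{O}^T E[O \mid \text{data}]), i.e., it projects the posterior mean matrix back onto SO(n), and the reported MSE is E[\| \hat{O} - O \|_{HS}^2].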
We have been studying information-theoretic measures, entropy and mutual information, as performance metrics for object recognition given a standard suite of sensors. Our work has focused on performance analysis for the pose estimation of ground-based objects viewed remotely via a standard sensor suite. Target pose is described by a single angle of rotation using a Lie group parameterization: O ∈ SO(2), the group of 2 × 2 rotation matrices. Variability in the data due to the sensor by which the scene is observed is statistically characterized via the data likelihood function. Taking a Bayesian approach, the inference is based on the posterior density, constructed as the product of the data likelihood and the prior density for object pose. Given multiple observations of the scene, sensor fusion is automatic in the joint likelihood component of the posterior density. The Bayesian approach is consistent with the source-channel formulation of the object recognition problem, in which parameters describing the sources (objects) in the scene must be inferred from the output (observation) of the remote sensing channel. In this formulation, mutual information is a natural performance measure. In this paper we consider the asymptotic behavior of these information measures as the signal-to-noise ratio (SNR) tends to infinity. We focus on the posterior entropy of the object rotation angle conditioned on image data. We consider single- and multiple-sensor scenarios and present quadratic approximations to the posterior entropy. Our results indicate that for broad ranges of SNR, low-dimensional posterior densities in object recognition estimation scenarios are accurately modeled asymptotically.
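The quadratic approximation can be summarized in one line (the scaling constants below are scenario-dependent and stated only to fix ideas): when the posterior over the rotation angle θ concentrates at high SNR, a Laplace approximation makes it approximately Gaussian with variance σ²(SNR), so the posterior entropy behaves as

h(\theta \mid I) \approx \frac{1}{2} \log\left( 2 \pi e \, \sigma^2(\mathrm{SNR}) \right),

and when σ² scales as 1/SNR the entropy falls by (1/2) log SNR; for K conditionally independent sensors the variances combine as \sigma^2 = \left( \sum_k \sigma_k^{-2} \right)^{-1}, one way the fusion gain appears.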
Our work focuses on pose estimation of ground-based targets viewed via multiple sensors, including forward-looking infrared (FLIR) systems and laser radar (LADAR) range imagers. Data from these two sensors are simulated using CAD models for the targets of interest in conjunction with Silicon Graphics workstations, the PRISM infrared simulation package, and the statistical model for LADAR described by Green and Shapiro. Using a Bayesian estimation framework, we quantitatively examine both pose-dependent variations in performance and the relative performance of the aforementioned sensors when their data are used separately or optimally fused together. Using the Hilbert-Schmidt norm as an error metric, the minimum mean squared error (MMSE) estimator is reviewed and its mean squared error (MSE) performance analysis is presented. Results of simulations are presented and discussed.
We have been studying information-theoretic measures, entropy and mutual information, as performance bounds on the information gain given a standard suite of sensors. Object pose is described by a single angle of rotation using a Lie group parameterization; observations are simulated using CAD models for the targets of interest and simulators such as the PRISM infrared simulator. Variability in the data due to the sensor by which the scene is remotely observed is statistically characterized via the data likelihood function. Taking a Bayesian approach, the inference is based on the posterior density, constructed as the product of the data likelihood and the prior density for target pose. Given observations from multiple sensors, data fusion is automatic in the posterior density. Here, we consider the mutual information between the target pose and the remote observation as a performance measure in the pose estimation context. We have quantitatively examined the dependency of FLIR information gain on target thermodynamic state, the relative information gain of FLIR and video sensors, and the additional information gain due to sensor fusion. Furthermore, we have applied Kullback-Leibler distance measures to quantify the information loss due to thermodynamic signature mismatch.
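The measures used here can be stated compactly (generic notation, with Θ the pose and I_1, ..., I_K the sensor observations): the performance measure is the mutual information

I(\Theta; I_1, \ldots, I_K) = h(\Theta) - h(\Theta \mid I_1, \ldots, I_K),

where fusion enters through the joint likelihood p(I_1, \ldots, I_K \mid \theta) = \prod_k p(I_k \mid \theta) for conditionally independent sensors, and the loss due to thermodynamic signature mismatch is quantified by the Kullback-Leibler distance D(p_{\mathrm{true}} \,\|\, p_{\mathrm{model}}) = \int p_{\mathrm{true}} \log (p_{\mathrm{true}} / p_{\mathrm{model}}).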
Our work has focused on deformable template representations of geometric variability in automatic target recognition (ATR). Within this framework we have proposed the generation of conditional mean estimates of the pose of ground-based targets remotely sensed via forward-looking infrared (FLIR) systems. Using the rotation group parameterization of the orientation space and a Bayesian estimation framework, conditional mean estimators are defined on the rotation group, with minimum mean squared error (MMSE) performance bounds calculated following our earlier work. This paper focuses on the accommodation of thermodynamic variation. Our new approach relaxes assumptions on the target's underlying thermodynamic state, expanding the thermodynamic state as a scalar field. Estimation within the deformable template setting casts geometric and thermodynamic variation as a joint inference problem. MMSE pose estimators under geometric variation are derived, demonstrating the 'cost' of accommodating thermodynamic variability. Performance is quantitatively examined, and simulations are presented.
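The joint inference can be written schematically (the notation follows the template formulation above and is illustrative): with pose s and a thermodynamic scalar field T entering the template as s I^{\alpha}_{T}, the posterior is

p(s, T \mid I) \propto p(I \mid s I^{\alpha}_{T}) \, \pi(s) \, \pi(T),

and the MMSE pose estimator treats T as a nuisance, marginalizing it to obtain p(s \mid I) = \int p(s, T \mid I) \, dT; the 'cost' of thermodynamic variability is the increase in pose MSE relative to estimation with T known.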