Abstract:
The goal of this research is to develop learning methods that advance the automatic analysis and interpretation of human motion and gestures from various perspectives and from diverse data sources, such as images, video, depth, motion capture (mocap) data, audio, and inertial sensors. We employ deep neural models, combining supervised classification with semi-supervised feature learning and the modeling of temporal dependencies, and demonstrate their effectiveness on a set of fundamental tasks, including detection, classification, parameter estimation, and user verification.
We propose a method for detecting and classifying human actions and gestures based on multi-dimensional and multi-modal deep learning from visual signals (for example, video streams, depth, and motion-based data). The training strategy first carefully initializes each individual modality and then gradually fuses them (a scheme called ModDrop), learning cross-modality correlations while preserving the uniqueness of each modality-specific representation. In addition, the ModDrop training technique ensures that the classifier remains robust to missing or weak signals in one or more channels, enabling it to make meaningful predictions from any number of available modalities. Data collected by inertial sensors (such as accelerometers and gyroscopes) embedded in mobile devices are also used.
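As a rough illustration of the modality-dropping idea described above, the sketch below randomly drops whole modality representations during training so that the fused classifier learns to predict from any subset of channels. It assumes PyTorch; the module and parameter names (ModDropFusion, p_drop, the per-modality encoders) are illustrative rather than the authors' implementation, and the dropping is applied here at the feature level rather than to raw input channels for brevity.

```python
# Minimal sketch of modality dropping during fused training (assumption: PyTorch).
import torch
import torch.nn as nn

class ModDropFusion(nn.Module):
    def __init__(self, encoders: nn.ModuleDict, fused_dim: int,
                 n_classes: int, p_drop: float = 0.2):
        super().__init__()
        self.encoders = encoders        # one separately pre-initialized encoder per modality
        self.p_drop = p_drop            # probability of dropping an entire modality
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = []
        for name, x in inputs.items():
            f = self.encoders[name](x)  # modality-specific representation
            if self.training:
                # With probability p_drop, zero out this modality for each sample,
                # forcing the classifier to cope with missing channels.
                keep = (torch.rand(f.shape[0], 1, device=f.device) > self.p_drop).float()
                f = f * keep
            feats.append(f)
        return self.classifier(torch.cat(feats, dim=1))
```

In such a setup, `fused_dim` would equal the sum of the per-modality feature sizes, and at test time the dropping is disabled so all available channels contribute to the prediction.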