Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Deep Convolutional Neural Networks On Multichannel Time Series For Human Activity Recognition

Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, Shonali Krishnaswamy
Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore

Abstract

This paper focuses on the human activity recognition (HAR) problem, in which the inputs are multichannel time series signals acquired from a set of body-worn inertial sensors and the outputs are predefined human activities. In this problem, extracting effective features for identifying activities is a critical but challenging task. Most existing work relies on heuristic hand-crafted feature design and shallow feature learning architectures, which cannot find those distinguishing features to accurately classify different activities. In this paper, we propose a systematic feature learning method for the HAR problem. This method adopts a deep convolutional neural network (CNN) to automate feature learning from the raw inputs in a systematic way. Through the deep architecture, the learned features are deemed as the higher-level abstract representation of low-level raw time series signals. By leveraging the labelled information via supervised learning, the learned features are endowed with more discriminative power. Unified in one model, feature learning and classification are mutually enhanced. All these unique advantages of the CNN make it outperform other HAR algorithms, as verified in the experiments on the Opportunity Activity Recognition Challenge and other benchmark datasets.

1 Introduction

Automatically recognizing human's physical activities (a.k.a. human activity recognition or HAR) has emerged as a key problem in ubiquitous computing, human-computer interaction and human behavior analysis [Bulling et al., 2014; Plätz et al., 2012; Reddy et al., 2010]. In this problem, human's activity is recognized based upon the signals acquired (in real time) from multiple body-worn (or body-embedded) inertial sensors. For HAR, signals acquired by on-body sensors are arguably favorable over the signals acquired by video cameras, due to the following reasons: i) on-body sensors alleviate the limitations of environment constraints and stationary settings that cameras often suffer from [Bulling et al., 2014; Ji et al., 2010; Le et al., 2011]; ii) multiple on-body sensors allow more accurate and more effective deployment of signal acquisition on the human body; iii) on-body sensors enjoy the merit of information privacy, as their acquired signals are target-specific, while the signals acquired by a camera may also contain information about other nontarget subjects in the scene. In the past few years, body-worn based HAR has found promising applications, e.g. game consoles, personal fitness training, medication intake and health monitoring. An excellent survey on this topic can be found in [Bulling et al., 2014].

The key factor in the success of a HAR system is finding an effective representation of the time series collected from the on-body sensors. Though considerable research efforts have been made to investigate this issue, diminishing returns occurred. Conventionally, the HAR problem is often taken as one specific application of time series analysis. The widely-used features in HAR include basis transform coding (e.g. signals with wavelet transform and Fourier transform) [Huynh and Schiele, 2005], statistics of raw signals (e.g., mean and variance of time sequences) [Bulling et al., 2014] and symbolic representation [Lin et al., 2003]. Although these features are widely used in many time series problems, they are heuristic and not task-dependent. It is worth noting that the HAR task has its own challenges, such as intraclass variability, interclass similarity, the NULL-class dominance, and the complexity and diversity of physical activities [Bulling et al., 2014]. All these challenges make it highly desirable to develop a systematic feature representation approach to effectively characterize the nature of signals relative to the activity recognition task.

Recently, deep learning has emerged as a family of learning models that aim to model high-level abstractions in data [Bengio, 2009; Deng, 2014]. In deep learning, a deep architecture with multiple layers is built up for automating feature design. Specifically, each layer in a deep architecture performs a nonlinear transformation on the outputs of the previous layer, so that through the deep learning models the data are represented by a hierarchy of features from low-level to high-level. The well-known deep learning models include the convolutional neural network, deep belief network and autoencoders. Depending on the usage of label information, the deep learning models can be learned in either a supervised or an unsupervised manner. Though deep learning models achieve remarkable results in computer vision, natural language processing, and speech recognition, they have not been fully exploited in the field of HAR.

In this paper, we tackle the HAR problem by adapting one particular deep learning model: the convolutional neural network (CNN). The key attribute of the CNN is applying different processing units (e.g. convolution, pooling, sigmoid/hyperbolic tangent squashing, rectifier and normalization) alternately. Such a variety of processing units can yield an effective representation of the local salience of the signals. Then, the deep architecture allows multiple layers of these processing units to be stacked, so that this deep learning model can characterize the salience of signals in different scales. Therefore, the features extracted by the CNN are task-dependent and non-handcrafted. Moreover, these features also have more discriminative power, since the CNN can be learned under the supervision of output labels. All these advantages of the CNN will be further elaborated in the following sections.

As detailed in the following sections, in the application to HAR, the convolution and pooling filters in the CNN are applied along the temporal dimension for each sensor, and all these feature maps for different sensors need to be unified as a common input for the neural network classifier. Therefore, a new architecture of the CNN is developed in this paper. In the experiments, we performed an extensive study on the comparison between the proposed method and the state-of-the-art methods on benchmark datasets. The results show that the proposed method is a very competitive algorithm for the HAR problems. We also investigate the efficiency of the CNN, and conclude that the CNN is fast enough for online human activity recognition.

2 Motivations and Related Work

It is highly desirable to develop a systematic and task-dependent feature extraction approach for HAR. Though the signals collected from wearable sensors are time series, they are different from other time series like speech signals and financial signals. Specifically, in HAR, only a few parts of the continuous signal stream are relevant to the concept of interest (i.e. human activities), and the dominant irrelevant part mostly corresponds to the Null activity. Furthermore, considering how human activity is performed in reality, we learn that every activity is a combination of several basic continuous movements. Typically, a human activity could last a few seconds in practice, and within one second a few basic movements could be involved. From the perspective of sensor signals, the basic continuous movements are more likely to correspond to smooth signals, and the transitions among different basic continuous movements may cause significant changes of signal values. These properties of signals in HAR require the feature extraction method to be effective enough to capture the nature of basic continuous movements as well as the salience of the combination of basic movements.

As such, we are motivated to build a deep architecture of a series of signal processing units for feature extraction. This deep architecture consists of multiple shallow architectures, and each shallow architecture is composed of a set of linear/nonlinear processing units on locally stationary signals. When all shallow architectures are cascaded, the salience of signals in different scales is captured. This deep architecture is not only for decomposing a large and complex problem into a series of small problems, but more importantly for obtaining specific “variance” of signals at different scales. Here, the “variances” of signals reflect the salient patterns of signals. As stated in [Bengio, 2009], what matters for generalization of a learning algorithm is the number of such “variance” of signals we wish to obtain after learning.

By contrast, the traditional feature extraction methods such as basis transform coding (e.g. signals with wavelet transform and Fourier transform) [Huynh and Schiele, 2005], statistics of raw signals (e.g., mean and covariance of time sequences) [Bulling et al., 2014] and symbolic representation [Lin et al., 2003] are deemed to play a role comparable to transforming the data by one or a few neurons in one layer of a deep learning model. Another type of deep learning model, called the Deep Belief Network (DBN) [Hinton and Osindero, 2006; Le Roux and Bengio, 2008; Tieleman, 2008], was also investigated for HAR by [Plätz et al., 2012]. However, this feature learning method does not employ the effective signal processing units (like convolution, pooling and rectifier) and also neglects the available label information in feature extraction. The CNN has primarily been used for 2D images [Krizhevsky et al., 2012; Zeiler and Fergus, 2014], 3D videos [Ji et al., 2010] and speech recognition [Deng et al., 2013]. However, in this paper, we attempt to build a new architecture of the CNN to handle the unique challenges that exist in HAR. The most related work is [Zeng et al., 2014], in which a shallow CNN is used and the HAR problem is restricted to the accelerometer data.

3 Convolutional Neural Networks in HAR

Convolutional neural networks have great potential to identify the various salient patterns of HAR's signals. Specifically, the processing units in the lower layers obtain the local salience of the signals (to characterize the nature of each basic movement in a human activity). The processing units in the higher layers obtain the salient patterns of signals at high-level representation (to characterize the salience of a combination of several basic movements). Note that each layer may have a number of convolution or pooling operators (specified by different parameters) as described below, so multiple salient patterns learned from different aspects are jointly considered in the CNN. When these operators with the same parameters are applied on local signals (or their mapping) at different time segments, a form of translation invariance is obtained [Fukushima, 1980; Bengio, 2009; Deng, 2014]. Consequently, what matters is only the salient patterns of signals instead of their positions or scales. However, in HAR we are confronted with multiple channels of time series signals, to which the traditional CNN cannot be applied directly. The challenges in our problem include (i) the processing units in the CNN need to be applied along the temporal dimension and (ii) the units in the CNN need to be shared or unified among multiple sensors. In what follows, we will define the convolution and pooling operators along the temporal dimension, and then present the entire architecture of the CNN used in HAR.
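The idea of applying the processing units along the temporal dimension, separately for each sensor, can be illustrated with a minimal sketch. It is not the operator definition of this paper: it assumes a plain "valid" 1D cross-correlation with one kernel shared by all sensor rows, followed by ReLU and non-overlapping max pooling. The function names `temporal_conv`, `relu` and `temporal_max_pool`, the kernel values and the toy input are all hypothetical. The kernel length 5 and pooling length 2 are chosen only so that a D x 30 input yields D x 26 and then D x 13, the dimensions shown in Figure 1.

```python
from typing import List

Signal = List[List[float]]  # D sensor rows, each a list of time steps


def temporal_conv(x: Signal, kernel: List[float], bias: float = 0.0) -> Signal:
    """Valid 1D cross-correlation along time; the same kernel is used for every sensor row."""
    p = len(kernel)
    return [
        [bias + sum(kernel[k] * row[t + k] for k in range(p))
         for t in range(len(row) - p + 1)]
        for row in x
    ]


def relu(x: Signal) -> Signal:
    """Element-wise rectifier: relu(v) = max(v, 0)."""
    return [[max(v, 0.0) for v in row] for row in x]


def temporal_max_pool(x: Signal, q: int) -> Signal:
    """Non-overlapping max pooling over q consecutive time steps, per sensor row."""
    return [
        [max(row[t:t + q]) for t in range(0, len(row) - q + 1, q)]
        for row in x
    ]


if __name__ == "__main__":
    D, T = 3, 30  # toy input: 3 sensor channels, 30 time steps (assumed values)
    x = [[float((d + 1) * ((t % 7) - 3)) for t in range(T)] for d in range(D)]
    h = temporal_max_pool(relu(temporal_conv(x, [0.2] * 5)), q=2)
    print(len(h), len(h[0]))  # 3 rows, 13 time steps: 30 -> 26 -> 13
```

Because every operation acts on one row at a time, the number of sensor rows D is preserved throughout, which is why the feature maps of different sensors still have to be unified before classification.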

[Figure 1: CNN architecture diagram. Panel labels, with the feature-map dimensions that remain legible: Input: D x 30; Section 1(c): D x 26; Section 1(s): D x 13; Section 2(c): D x 9; Section 2(s): D x 3; Section 3(c); Section 4(u); Section 5(o).]

Figure 1: Illustration of the CNN architecture used for a multi-sensor based human activity recognition problem. We use the Opportunity Activity Recognition dataset presented in Section 4 as an illustrative example. The symbols “c”, “s”, “u”, “o” in the parentheses of the layer tags refer to convolution, subsampling, unification and output operations respectively. The numbers before and after “@” refer to the number of feature maps and the dimension of a feature map in this layer. Note that pooling, ReLU and normalization layers are not shown due to the limitation of space.

We start with the notations used in the CNN. A sliding window strategy is adopted to segment the time series signal into a collection of short pieces of signals. Specifically, an instance used by the CNN is a two-dimensional matrix containing r raw samples (each sample with D attributes). Here, r is chosen to be the sampling rate (e.g. 30 and 32 used in the experiments), and the step size of sliding a window is chosen to be 3. One may choose a smaller step size to increase the number of instances, at the price of a higher computational cost. For training data, the true label of the matrix instance is determined by the most frequent label among the r raw records. The jth feature map in the ith layer of the CNN is also a matrix, and its value at the xth row for sensor d is denoted as $v_{ij}^{x,d}$ for convenience.

3.1

where $Q_i$ is the length of the pooling region.

3.2

Based on the above introduced operators, we construct a CNN shown in Figure 1.
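The sliding-window segmentation described in the notation paragraph above (windows of r raw samples, a step size of 3, and the most frequent label as the window label) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `sliding_windows`, the toy data, and the choice to store each instance as D rows by r columns (matching the "D x 30" input in Figure 1) are assumptions, and ties between labels are resolved by whichever label `Counter.most_common` returns first.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Instance = Tuple[List[List[float]], Hashable]  # (D x r matrix, window label)


def sliding_windows(
    samples: Sequence[Sequence[float]],  # T raw samples, each with D attributes
    labels: Sequence[Hashable],          # one label per raw sample
    r: int = 30,                         # window length (the sampling rate in the text)
    step: int = 3,                       # step size of the sliding window
) -> List[Instance]:
    """Cut a multichannel series into overlapping windows with majority-vote labels."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    instances: List[Instance] = []
    for start in range(0, len(samples) - r + 1, step):
        window = samples[start:start + r]
        # Most frequent label among the r raw records of this window.
        majority = Counter(labels[start:start + r]).most_common(1)[0][0]
        # Transpose r x D samples into a D x r matrix (one row per sensor attribute).
        matrix = [list(channel) for channel in zip(*window)]
        instances.append((matrix, majority))
    return instances


if __name__ == "__main__":
    # Toy stream (assumed): 36 samples, D = 2 attributes, 10 "null" then 26 "walk".
    stream = [[float(t), float(-t)] for t in range(36)]
    tags = ["null"] * 10 + ["walk"] * 26
    for matrix, tag in sliding_windows(stream, tags):
        print(len(matrix), len(matrix[0]), tag)
```

A smaller `step` produces more (and more strongly overlapping) instances from the same stream, which is the trade-off against computational cost mentioned in the text.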
For convenience, all layers of the CNN can be grouped into five sections as detailed below.

For the first two sections, each section is constituted by (i) a convolution layer that convolves the input or the previous layer's output with a set of kernels to be learned; (ii) a rectified linear unit (ReLU) layer that maps the output of the previous layer by the function relu(v) = max(v, 0); (iii) a max pooling layer that finds the maximum feature map over a range of local temporal neighborhood (a subsampling operator is often involved); (iv) a normalization layer that normalizes the values of different feature maps in the previous layer βP2vij