1. Introduction

In pattern recognition and computer vision, face recognition is a very important research field.1–6 Owing to the complexity of facial features and the difficulty of manual feature selection,1,5,6 it is commonly agreed that the best features can be obtained by unsupervised feature extraction methods.3–5 Recently, with Google's AlphaGo Zero defeating many Go masters, deep learning has received intensive attention.7,8 As a classical deep learning model, convolutional neural networks (CNNs) with convolution and pooling layers have achieved astonishing results in many image recognition tasks, reaching unprecedented accuracy.9,10 However, CNNs still have many shortcomings. Training a CNN model requires learning a huge number of parameters, which leads to high computational cost.11 To address this problem, researchers have sought simple CNN models that require a small number of parameters. Chan et al.12 proposed PCANet, a simple deep learning network based on unsupervised learning. PCANet uses PCA to learn the filters and deploys simple binary hashing and block histograms for indexing and pooling. Unlike other CNNs that learn filters by backpropagation, PCANet learns filters using the PCA method. Thus, PCANet requires less computation, time, and storage space. The experimental results show the astonishing performance of PCANet.

The PCA method used by PCANet is based on one-dimensional (1-D) vectors. Before deploying PCA, the two-dimensional (2-D) image matrices must be converted into 1-D vectors, which causes two major problems: (1) some spatial information of the image is implicit in its 2-D structure,13,14 and this intrinsic information is discarded when the image matrix is converted into a 1-D vector;13,15 (2) the long 1-D vectors require large computational time and storage space when computing the eigenvectors. To solve these problems, Yu et al.16 proposed the 2-D principal component analysis network (2DPCANet), which replaces PCA with 2DPCA,15,17–19 and Tian et al.20 proposed the multiple scales principal component analysis network (MS-PCANet). However, both PCA and 2DPCA are L2-norm-based methods. It is well known that methods based on the L2-norm are sensitive to outliers, so data with outliers can totally ruin their results.5,21,22 To address this problem, Kwak23 proposed a PCA method based on the L1-norm, which is widely considered to be more robust to outliers.21,24 L1-PCA adopts the L1-norm for measuring the reconstruction error. On this basis, Xuelong et al.14 proposed L1-norm-based 2DPCA (L1-2DPCA).

In this paper, we introduce the L1-norm into PCANet to obtain L1-PCANet. We then generalize L1-PCANet to L1-2D2PCANet, which shares the same structure with 2DPCANet for generating features from the input data but learns its filters by L1-2DPCA. In addition, we use a support vector machine (SVM) as the classifier for the features generated by the networks. To test the performance of L1-2D2PCANet, we compare it with three other networks (PCANet, 2DPCANet, and L1-PCANet) on the Yale, AR,25 extended Yale B,26 labeled faces in the wild-aligned (LFW-a),27 and Face Recognition Technology (FERET)28 face databases.

The rest of the paper is organized as follows. Sections 2.1 and 2.2 review related work on L1-PCA and L1-2DPCA. L1-PCANet and L1-2D2PCANet are presented in Sec. 2.3. Section 3 reports the details of the experiments. Section 4 reports the results and analysis of the experiments, and Sec. 5 concludes the paper.
2. Materials and Methods

2.1. L1-Norm-Based PCA

The proposed L1-PCANet is based on L1-PCA.21,23 L1-PCA is considered the simplest and most efficient among the many models of L1-norm-based PCA. Let $X = [x_1, x_2, \ldots, x_N]$ denote the training data, with $x_i = \mathrm{vec}(A_i) \in \mathbb{R}^d$, where $\mathrm{vec}(\cdot)$ is a function that maps a matrix to a vector and $A_i$ is the $i$'th training sample. Suppose $w \in \mathbb{R}^d$ is the principal vector to be obtained. Here, we set the number of principal vectors to one to simplify the procedure. The objective of L1-PCA is to maximize the L1-norm variance in the feature space, and the successive greedy solutions are expected to provide a good approximation:

$w^{*} = \arg\max_{\|w\|_2 = 1} \|w^{T} X\|_1 = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{N} |w^{T} x_i|, \qquad (1)$

where $\|\cdot\|_2$ denotes the L2-norm and $\|\cdot\|_1$ denotes the L1-norm.

To avoid the computational problems posed by the absolute value, we introduce a polarity parameter $p_i$ for Eq. (1):

$p_i = \begin{cases} 1, & w^{T} x_i \ge 0 \\ -1, & w^{T} x_i < 0 \end{cases}. \qquad (2)$

By introducing $p_i$, Eq. (1) can be rewritten as follows:

$w^{*} = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{N} p_i\, w^{T} x_i. \qquad (3)$

The maximization is achieved by Algorithm 1. Here, $t$ denotes the iteration index, and $w(t)$ and $p_i(t)$ denote $w$ and $p_i$ during iteration $t$.

Algorithm 1: L1-PCA method.
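As an illustration only, the greedy iteration of Algorithm 1 can be sketched in a few lines of NumPy; the function name, initialization choice, and convergence test below are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def l1_pca_first_component(X, max_iter=100, tol=1e-10):
    """Sketch of Algorithm 1 (Kwak's L1-PCA): find one vector w with
    ||w||_2 = 1 that maximizes sum_i |w^T x_i|.
    X is a (d, N) matrix whose columns are the training samples x_i."""
    # Initialize with the sample of largest norm (one common choice).
    w = X[:, np.argmax(np.linalg.norm(X, axis=0))].astype(float)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        p = np.sign(X.T @ w)        # polarity parameters p_i, Eq. (2)
        p[p == 0] = 1.0             # treat w^T x_i = 0 as positive polarity
        w_next = X @ p              # sum_i p_i x_i
        w_next /= np.linalg.norm(w_next)
        if np.linalg.norm(w_next - w) < tol:   # converged
            return w_next
        w = w_next
    return w
```

The sign-and-sum update is what makes the procedure robust: each sample contributes only its direction, not its squared magnitude, so a single outlier cannot dominate the solution.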
By the above algorithm, we can obtain the first principal vector $w_1$. To compute the $k$'th principal vector $w_k$ ($k > 1$), we have to update the training data as follows:

$x_i^{(k)} = x_i^{(k-1)} - w_{k-1}\,\bigl(w_{k-1}^{T} x_i^{(k-1)}\bigr), \quad i = 1, \ldots, N.$

2.2. L1-Norm-Based 2DPCA

In this section, we extend L1-PCA to L1-2DPCA.14 As mentioned above, 2DPCA computes eigenvectors directly from 2-D input. Suppose $A_1, A_2, \ldots, A_N$ denote the input training images, with $m \times n$ being the image size, and let $a_{ij}^{T}$ denote the $j$'th row of $A_i$. Let $w \in \mathbb{R}^n$, with $\|w\|_2 = 1$, be the first principal component to be learned. The objective of L1-2DPCA is to maximize the L1-norm variance in the feature space as follows:

$w^{*} = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{N} \|A_i w\|_1 = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{N} \sum_{j=1}^{m} |a_{ij}^{T} w|.$

The polarity parameter can be computed as follows:

$p_{ij} = \begin{cases} 1, & a_{ij}^{T} w \ge 0 \\ -1, & a_{ij}^{T} w < 0 \end{cases}.$

The maximization is achieved by Algorithm 2. To compute the $k$'th principal component $w_k$ ($k > 1$), we have to update the training data as follows:

$a_{ij}^{(k)} = a_{ij}^{(k-1)} - w_{k-1}\,\bigl(w_{k-1}^{T} a_{ij}^{(k-1)}\bigr).$

Algorithm 2: L1-2DPCA method.
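Likewise, a minimal NumPy sketch of Algorithm 2 together with the deflation step above might look as follows; it treats every image row as a sample and uses our own naming, not the authors' code.

```python
import numpy as np

def l1_2dpca(images, n_components, max_iter=100, tol=1e-10):
    """Sketch of L1-2DPCA: learn n_components projection vectors w in R^n
    from m x n images by maximizing sum_{i,j} |a_ij^T w| (Algorithm 2),
    deflating the rows between successive components."""
    R = np.vstack([np.asarray(img, dtype=float) for img in images])  # (N*m, n)
    W = []
    for _ in range(n_components):
        w = R[np.argmax(np.linalg.norm(R, axis=1))].copy()
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            p = np.sign(R @ w)           # polarity of each image row
            p[p == 0] = 1.0
            w_next = R.T @ p
            w_next /= np.linalg.norm(w_next)
            if np.linalg.norm(w_next - w) < tol:
                w = w_next
                break
            w = w_next
        W.append(w)
        R = R - np.outer(R @ w, w)       # remove the learned direction
    return np.stack(W, axis=1)            # columns are w_1, ..., w_k
```

Because each sample here is an image row of length n rather than a vectorized image of length mn, the projection vectors are much shorter, which is the computational advantage noted in Sec. 1.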
At this point, we can see that the difference between L1-PCA and L1-2DPCA is that L1-PCA converts an image matrix into a vector, whereas L1-2DPCA directly uses each row of the original image matrix as a vector.

2.3. Proposed Method

2.3.1. L1-PCANet

In this section, we propose a PCA-based deep learning network, L1-PCANet. To overcome the sensitivity to outliers that PCANet suffers from due to the use of the L2-norm, we use L1-PCA rather than PCA to learn the filters. L1-PCANet and PCANet12 share the same network architecture, which is shown in Fig. 1.

Suppose there are $N$ training images $\{I_i\}$ of size $m \times n$, and we take a patch of size $k_1 \times k_2$ around each pixel in $I_i$. Then, we take all overlapping patches and map them into vectors $x_{i,1}, x_{i,2}, \ldots, x_{i,mn} \in \mathbb{R}^{k_1 k_2}$. We remove the patch mean from each patch and obtain

$\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \ldots, \bar{x}_{i,mn}].$

For all input images, we construct the same matrix and combine them into one matrix to obtain

$X = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_N].$

Then, we use the L1-PCA described above to learn the filters in stage 1, taking $X$ as the input data of L1-PCA. Assuming that the number of filters in stage 1 is $L_1$, we can obtain the first-stage filters by repeatedly calling Algorithm 1. The L1-PCA filters of stage 1 are expressed as follows:

$W_l^1 = \mathrm{mat}_{k_1, k_2}(w_l), \quad l = 1, 2, \ldots, L_1,$

where $\mathrm{mat}_{k_1, k_2}(\cdot)$ maps a vector to a $k_1 \times k_2$ matrix and $w_l$ is the $l$'th principal vector obtained by Algorithm 1. The output of stage 1 can be expressed as follows:

$I_i^l = I_i * W_l^1, \quad i = 1, 2, \ldots, N, \quad l = 1, 2, \ldots, L_1,$

where $*$ denotes 2-D convolution. We zero-pad the boundary of the input image to make sure that $I_i^l$ is of the same size as $I_i$. We can get the filters of the second and subsequent layers by simply repeating the process of the first-layer design. The pooling layer of L1-PCANet is almost the same as the pooling layer of PCANet.

2.3.2. L1-2D2PCANet

In this section, we generalize L1-PCANet to L1-2D2PCANet, which shares the same network structure with 2DPCANet,16 as shown in Fig. 2.

First stage of L1-2D2PCANet. Let all the assumptions be the same as in Sec. 2.3.1. We take all the overlapping $k_1 \times k_2$ patches $P_{i,1}, P_{i,2}, \ldots, P_{i,mn}$ of $I_i$, subtract the patch mean from each of them, and form the matrix

$Y_i = [\bar{P}_{i,1}, \bar{P}_{i,2}, \ldots, \bar{P}_{i,mn}],$

and we use the transposes of the mean-removed patches to form the matrix

$\tilde{Y}_i = [\bar{P}_{i,1}^{T}, \bar{P}_{i,2}^{T}, \ldots, \bar{P}_{i,mn}^{T}].$

For all input images, we construct the matrices in the same way and put them into one matrix each, obtaining

$Y = [Y_1, Y_2, \ldots, Y_N], \qquad \tilde{Y} = [\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_N].$

Then, we use the L1-2DPCA described above to learn the filters in stage 1. We want to obtain the projection vectors $u_l$ and $v_l$, where $l = 1, 2, \ldots, L_1$; $Y$ and $\tilde{Y}$ are the input data for L1-2DPCA. Assuming that the number of filters in stage 1 is $L_1$, the first-stage projection vectors $u_l$ and $v_l$ are obtained by repeatedly calling Algorithm 2. The filters we need in stage 1 can finally be expressed as follows:

$W_l^1 = u_l v_l^{T}, \quad l = 1, 2, \ldots, L_1.$

The output of stage 1 will be

$I_i^l = I_i * W_l^1, \quad i = 1, 2, \ldots, N, \quad l = 1, 2, \ldots, L_1.$

Second stage of L1-2D2PCANet. As in the first stage, we start with the overlapping patches of $I_i^l$ and remove the patch mean from each patch. Then, we have

$Y_i^l = [\bar{P}_{i,1}^l, \bar{P}_{i,2}^l, \ldots, \bar{P}_{i,mn}^l].$

Further, we define the matrix that collects all the mean-removed patches of the $l$'th output as

$Y^l = [Y_1^l, Y_2^l, \ldots, Y_N^l].$

Finally, the input of the second stage is obtained by concatenating these matrices (and, analogously, the matrices of transposed patches) for all filters:

$Y = [Y^1, Y^2, \ldots, Y^{L_1}], \qquad \tilde{Y} = [\tilde{Y}^1, \tilde{Y}^2, \ldots, \tilde{Y}^{L_1}].$

We take $Y$ and $\tilde{Y}$ as the input data of L1-2DPCA. Assuming that the number of filters in stage 2 is $L_2$, we design the second-stage projection vectors $u_s$ and $v_s$ by repeatedly calling Algorithm 2. The L1-2DPCA filters of stage 2 are expressed as follows:

$W_s^2 = u_s v_s^{T}, \quad s = 1, 2, \ldots, L_2.$

Therefore, we have $L_2$ outputs for each output of stage 1:

$O_i^{l,s} = I_i^l * W_s^2, \quad s = 1, 2, \ldots, L_2.$

Note that the number of outputs of stage 2 is $L_1 L_2$.

Pooling stage. First, we use a Heaviside-like step function to binarize the outputs of stage 2. The function can be expressed as follows:

$H(x) = \begin{cases} 1, & x > 0 \\ 0, & \text{otherwise} \end{cases}.$

Each pixel is encoded by the following function:

$T_i^l = \sum_{s=1}^{L_2} 2^{s-1} H\bigl(O_i^{l,s}\bigr),$

where each entry of $T_i^l$ is an integer in the range $[0, 2^{L_2} - 1]$. Second, we divide $T_i^l$ into blocks of size $B$. Then, we make a histogram (with $2^{L_2}$ bins) of each block of $T_i^l$ and concatenate the histograms of all blocks into one vector $\mathrm{Bhist}(T_i^l)$. In this way, we obtain $L_1$ histogram vectors per image, and we put them into a single feature vector:

$f_i = \bigl[\mathrm{Bhist}(T_i^1), \mathrm{Bhist}(T_i^2), \ldots, \mathrm{Bhist}(T_i^{L_1})\bigr]^{T}.$
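To make the pooling stage concrete, here is a compact NumPy sketch of the binarization, integer encoding, and block-histogram steps for one image; the non-overlapping block layout and the helper name are our simplifying assumptions.

```python
import numpy as np

def block_histogram_features(stage2_maps, block_shape):
    """stage2_maps: list of L1 arrays, each of shape (L2, H, W) holding the
    L2 second-stage responses O_i^{l,s} for one first-stage output I_i^l.
    Returns the concatenated feature vector f_i."""
    feats = []
    for maps in stage2_maps:
        L2 = maps.shape[0]
        # Heaviside-like binarization, then weight the L2 binary maps by
        # 2^(s-1) so each pixel becomes an integer in [0, 2^L2 - 1].
        weights = (2 ** np.arange(L2)).reshape(-1, 1, 1)
        T = np.sum((maps > 0).astype(np.int64) * weights, axis=0)   # (H, W)
        bh, bw = block_shape
        H, W = T.shape
        # Histogram each (non-overlapping) block with 2^L2 bins.
        for r in range(0, H - bh + 1, bh):
            for c in range(0, W - bw + 1, bw):
                hist, _ = np.histogram(T[r:r + bh, c:c + bw],
                                       bins=2 ** L2, range=(0, 2 ** L2))
                feats.append(hist)
    return np.concatenate(feats)
```

For example, with $L_2 = 8$ filters in stage 2, each block contributes a $2^8 = 256$-bin histogram to the final feature vector.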
Using the L1-2DPCA model described above, we can transform an input image into a feature vector as the output of L1-2D2PCANet.

3. Experiments

In this section, we evaluate the performance of L1-PCANet and L1-2D2PCANet, with PCANet and 2DPCANet as baselines, on the Yale, AR, extended Yale B, and FERET databases, which are shown in Fig. 3. The SVM29 implementation from libsvm is used as the classifier with default settings. We repeat some experiments 10 times and calculate the average recognition accuracy and root mean square error (RMSE). In all experiments, we implement PCANet and its variants in MATLAB and the other CNNs in TensorFlow.

3.1. Extended Yale B

Extended Yale B consists of 2414 images of 38 individuals captured under different lighting conditions. These pictures are preprocessed to have the same size and alignment. The parameters are set as , , .

In experiment 1, we compare L1-PCANet and L1-2D2PCANet with PCANet and 2DPCANet. We randomly select images per individual for training and use the rest for testing. We also create AlexNet30 and GoogleNet11 instances for comparison, which are trained on 1024 images randomly selected from extended Yale B for 20 epochs. The architecture of AlexNet is the same as in Ref. 30 and the architecture of GoogleNet is the same as in Ref. 11. The parameters of the two CNNs are set as , , . The results are shown in Table 1.

Table 1: Experiment 1 on extended Yale B.26
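For reference, the classification step shared by all the experiments in this section (network feature vectors fed to an SVM with default settings) can be sketched as follows; scikit-learn's SVC, which wraps libsvm, is used here purely as an assumed stand-in for the authors' setup.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC is a libsvm wrapper

def evaluate(train_feats, train_labels, test_feats, test_labels):
    """train_feats / test_feats: (num_images, feature_dim) arrays of
    block-histogram features; labels are the subject identities."""
    clf = SVC()                               # default settings, as in the paper
    clf.fit(train_feats, train_labels)
    predictions = clf.predict(test_feats)
    return np.mean(predictions == test_labels)   # recognition accuracy
```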
In experiment 2, to evaluate the robustness of L1-PCANet and L1-2D2PCANet to outliers, we randomly add blockwise noise to the test images to generate test images with outliers. Within each block, the pixel values are randomly set to 0 or 255. These blocks occupy 10%, 20%, 30%, and 50% of the image, respectively, and are added at random positions, as can be seen in Fig. 4. The results are shown in Table 2.

Table 2: Experiment 2 on extended Yale B.26
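The blockwise occlusion used in experiment 2 can be reproduced with a short helper such as the following; this is a sketch assuming a single square block on a grayscale image, with each occluded pixel independently set to 0 or 255, and the authors' exact noise generation may differ.

```python
import numpy as np

def add_block_noise(image, area_fraction, rng=None):
    """Occlude a random square block covering `area_fraction` of a 2-D
    grayscale image (e.g., 0.1, 0.2, 0.3, or 0.5) with salt-and-pepper
    values (0 or 255)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    H, W = image.shape
    side = int(round(np.sqrt(area_fraction * H * W)))
    side = min(side, H, W)                    # keep the block inside the image
    r = rng.integers(0, H - side + 1)
    c = rng.integers(0, W - side + 1)
    noisy[r:r + side, c:c + side] = rng.choice([0, 255], size=(side, side))
    return noisy
```

Applying such a helper at the four occupancy levels to each test image yields corrupted test sets of the kind shown in Fig. 4.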
To demonstrate the superiority of the proposed methods, we compare L1-PCANet and L1-2D2PCANet with the traditional L1-PCA and L1-2DPCA in experiment 3. We create L1-PCA and L1-2DPCA instances based on Refs. 23 and 24. The parameters of L1-PCA and L1-2DPCA are set as . We randomly select images per individual as gallery images and seven images per individual for training. The results are shown in Table 3.

Table 3: Experiment 3 on extended Yale B.26
In experiment 4, we examine the impact of the block size B on L1-PCANet and L1-2D2PCANet. The block size changes from to . The results are shown in Fig. 5(a).

3.2. AR

The AR face database contains 2600 color images corresponding to 100 people's faces (50 men and 50 women). It contains data from two sessions recorded on two different days, and each person in each session has 13 images: 7 images with only illumination and expression changes, 3 images wearing sunglasses, and 3 images wearing a scarf. The images show frontal faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). These pictures are preprocessed to . The parameters are set as , , , respectively.

In experiment 5, in order to investigate the impact of the choice of training images, we divide the experiment into four groups: (1) in group 1, we randomly select five images with only illumination and expression changes from session 1 per individual as training images; (2) in group 2, we randomly select four images with only illumination and expression changes and one image wearing sunglasses from session 1 per individual as training images; (3) in group 3, we randomly select four images with only illumination and expression changes and one image wearing a scarf from session 1 per individual as training images, the remaining images being test samples in these three groups; and (4) in group 4, we randomly select three images with only illumination and expression changes, one image wearing sunglasses, and one image wearing a scarf from session 1 per individual as training images, and the remaining images in session 1 and all images in session 2 are used as test images. We manually select five images from session 1 as the gallery images and keep the gallery images of each group the same. The results are shown in Table 4.

Table 4: Experiment 5 on AR.25
In order to investigate the impact of the choice of gallery images, experiment 6 is the same as experiment 5 except that the gallery images and the training images are exchanged. We use the remaining images in session 1 and all images in session 2 as test samples. The results are shown in Table 5.

Table 5: Experiment 6 on AR.25
3.3. FERET

This database contains a total of 11,338 facial images, collected by photographing 994 subjects at various facial angles. We gathered a subset from FERET composed of 1400 images of 200 individuals, with seven images per individual exhibiting large variations in facial expression, facial angle, and illumination. This subset is available in our GitHub repository. These pictures are preprocessed to have the same size and alignment. The parameters are set as , , , respectively.

In experiment 7, we divide the experiment into seven groups. The training images of each group consist of 200 images from the subset with different facial angles, expressions, and illumination. We use the remaining images in the subset as test images. The results are shown in Table 6.

Table 6: Experiment 7 on FERET.28
In experiment 8, we examine the impact of the block size B on L1-PCANet and L1-2D2PCANet. The block size changes from to . The results are shown in Fig. 5(b).

3.4. Yale

Yale consists of 15 individuals with 11 images for each individual, showing varying facial expressions and configurations. These pictures are preprocessed to have the same size . The parameters are set as , , , respectively. In experiment 9, we randomly select images per individual for training and use the rest for testing. The results are shown in Table 7.

Table 7: Experiment 9 on Yale.26
3.5. LFW-a

LFW-a is a version of LFW after alignment with deep funneling. We gathered the individuals with more than nine images each from LFW-a. The parameters are set as , , , respectively. In experiment 10, we randomly choose images per individual as gallery images and keep the training images of each group the same. The results are shown in Table 8.

Table 8: Experiment 10 on LFW-a.27
4. Results and Analysis

Tables 1 and 3 show the results of experiments 1 and 3 on extended Yale B, Table 4 shows the result of experiment 5 on AR, Table 6 shows the result of experiment 7 on FERET, Table 7 shows the result on Yale, and Table 8 shows the result on LFW-a. In these experiments, we changed the training images by random selection. From the results, we can see that L1-2D2PCANet outperforms L1-PCA, L1-2DPCA, PCANet, 2DPCANet, and L1-PCANet in terms of recognition accuracy and RMSE, because we introduce the L1-norm into the network. The two L1-norm-based networks we propose are far superior to the traditional L2-norm-based networks in terms of RMSE, which means the proposed networks are insensitive to changes in the training images. That is, the accuracy of the traditional L2-norm-based networks largely depends on the choice of training images, whereas the proposed L1-norm-based networks achieve better and more stable accuracy for any choice of training images. A possible explanation of this phenomenon is as follows. The expression, posture, illumination condition, and occlusion in the images can be regarded as interference or noise in face recognition. This noise degrades L2-norm-based networks much more than it degrades L1-norm-based networks. Therefore, the proposed networks exhibit their superiority when the training images contain changes in expression, posture, illumination condition, and occlusion.

Table 2 shows the result of experiment 2 on extended Yale B. In this experiment, we randomly add blockwise noise to the test images. From the results, we can see that, as the blockwise noise increases from 10% of the image size to 50%, the performance of PCANet, 2DPCANet, and L1-PCANet drops rapidly, while L1-2D2PCANet still performs well. Therefore, it can be considered that L1-2D2PCANet has better robustness against outliers and noise than the other three networks.

We also investigate the impact of the choice of gallery images on AR; see Table 4. From the horizontal comparison of Table 5, the more categories the gallery contains, the higher the accuracy is.

Figure 5 shows the results of experiment 4 on extended Yale B and experiment 8 on FERET. When the block is small, the local information cannot be captured completely, whereas a large block may admit more noise.

5. Conclusion

In this paper, we have proposed a deep learning network, L1-2D2PCANet, which is a simple but robust method. We use L1-norm-based 2DPCA14 instead of L2-norm-based 2DPCA15 for filter learning because of the advantages of the L1-norm: it is more robust to outliers than the L2-norm. By introducing the L1-norm into 2DPCANet,16 we hope the network will inherit these advantages. To verify the performance of L1-2D2PCANet, we evaluate it on facial datasets including AR, extended Yale B, Yale, and FERET. The results show that L1-2D2PCANet has three distinct advantages over traditional L2-norm-based networks: (1) statistically, the accuracy of L1-2D2PCANet is higher than that of the other networks on all test datasets; (2) L1-2D2PCANet has better robustness to changes in the training images than the other networks; and (3) compared with the other networks, L1-2D2PCANet has better robustness to noise and outliers. Therefore, L1-2D2PCANet is an efficient and robust network for face recognition. However, L1-2DPCA brings more computational load to the network, which increases the computational cost of L1-2D2PCANet. Despite this, the computational cost of L1-2D2PCANet is far less than that of traditional CNNs, which are trained by backpropagation.
In future work, we will work on improving the L1-2DPCA algorithm to reduce the computational cost of L1-2D2PCANet.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672265 and U1836218), the 111 Project of the Ministry of Education of China (Grant No. B12018), the UK EPSRC under Grant No. EP/N007743/1, and MURI/EPSRC/dstl under Grant No. EP/R018456/1.

References

1. P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall International, New Jersey (1982).
2. B. D. Ripley, "Pattern recognition and neural networks," Technometrics 39(2), 233–234 (1999).
3. A. K. Jain, R. P. W. Duin and J. Mao, "Statistical pattern recognition: a review," IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 4–37 (2000). https://doi.org/10.1109/34.824819
4. C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, New York (2006).
5. X.-J. Wu et al., "A new direct LDA (D-LDA) algorithm for feature extraction in face recognition," in Int. Conf. Pattern Recognit. (2004). https://doi.org/10.1109/ICPR.2004.1333830
6. Y. Yi et al., "Face recognition using spatially smoothed discriminant structure-preserved projections," J. Electron. Imaging 23(2), 023012 (2014). https://doi.org/10.1117/1.JEI.23.2.023012
7. Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539
8. D. Silver et al., "Mastering the game of Go without human knowledge," Nature 550(7676), 354–359 (2017). https://doi.org/10.1038/nature24270
9. S. Lawrence et al., "Face recognition: a convolutional neural-network approach," IEEE Trans. Neural Networks 8(1), 98–113 (1997). https://doi.org/10.1109/72.554195
10. N. Kalchbrenner, E. Grefenstette and P. Blunsom, "A convolutional neural network for modelling sentences," (2014).
11. C. Szegedy et al., "Going deeper with convolutions," in IEEE Conf. Comput. Vision and Pattern Recognit. (2015). https://doi.org/10.1109/CVPR.2015.7298594
12. T. H. Chan et al., "PCANet: a simple deep learning baseline for image classification?," IEEE Trans. Image Process. 24(12), 5017–5032 (2015). https://doi.org/10.1109/TIP.2015.2475625
13. X. J. Wu et al., "A new algorithm for generalized optimal discriminant vectors," J. Comput. Sci. Technol. 17(3), 324–330 (2002). https://doi.org/10.1007/BF02947310
14. L. Xuelong, P. Yanwei and Y. Yuan, "L1-norm-based 2DPCA," IEEE Trans. Syst. Man Cybern. Part B Cybern. 40(4), 1170–1175 (2010). https://doi.org/10.1109/TSMCB.2009.2035629
15. J. Yang et al., "Two-dimensional PCA: a new approach to appearance-based face representation and recognition," IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004). https://doi.org/10.1109/TPAMI.2004.1261097
16. D. Yu and X. J. Wu, "2DPCANet: a deep learning network for face recognition," Multimedia Tools Appl. 77(10), 12919–12934 (2018).
17. M. Hirose et al., "Principal component analysis for surface reflection components and structure in the facial image and synthesis of the facial image in various ages," Proc. SPIE 9398, 939809 (2015). https://doi.org/10.1117/12.2076694
18. Z. Jia, B. Han and X. Gao, "2DPCANet: dayside aurora classification based on deep learning," in CCF Chin. Conf. Comput. Vision, 323–334 (2015).
19. Q. R. Zhang, "Two-dimensional parameter principal component analysis for face recognition," Adv. Mater. Res. 971–973, 1838–1842 (2014). https://doi.org/10.4028/www.scientific.net/AMR.971-973
20. L. Tian, C. Fan and Y. Ming, "Multiple scales combined principle component analysis deep learning network for face recognition," J. Electron. Imaging 25(2), 023025 (2016). https://doi.org/10.1117/1.JEI.25.2.023025
21. C. Ding, "R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization," in Int. Conf. Mach. Learn. (2006).
22. A. Baccini, P. Besse and A. D. Falguerolles, "A L1-norm PCA and a heuristic approach," in Ordinal and Symbolic Data Analysis, 359–368, Springer, New York (1996).
23. N. Kwak, "Principal component analysis based on L1-norm maximization," IEEE Trans. Pattern Anal. Mach. Intell. 30(9), 1672–1680 (2008). https://doi.org/10.1109/TPAMI.2008.114
24. X. Li, Y. Pang and Y. Yuan, "L1-norm-based 2DPCA," IEEE Trans. Syst. Man Cybern. Part B 40(4), 1170–1175 (2010). https://doi.org/10.1109/TSMCB.2009.2035629
25. A. M. Martinez, "The AR face database," CVC Technical Report 24 (1998).
26. A. S. Georghiades et al., "From few to many: illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 643–660 (2001). https://doi.org/10.1109/34.927464
27. P. Zhu et al., "Multi-scale patch based collaborative representation for face recognition with margin distribution optimization," in Eur. Conf. Comput. Vision, 822–835 (2012).
28. P. J. Phillips et al., "The FERET September 1996 database and evaluation procedure," Lect. Notes Comput. Sci. 1206, 395–402 (1997). https://doi.org/10.1007/BFb0015972
29. C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining Knowl. Discovery 2, 121–167 (1998). https://doi.org/10.1023/A:1009715923555
30. A. Krizhevsky, I. Sutskever and G. Hinton, "ImageNet classification with deep convolutional neural networks," Adv. Neural Inf. Process. Syst. 25 (2012).
Biography

Yun-Kun Li received his BS degree in microelectronics from the School of Internet of Things Engineering, Jiangnan University, in 2017. He is currently a postgraduate in the Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University. His research interests include pattern recognition and deep learning.

Xiao-Jun Wu received his BS degree in mathematics from Nanjing Normal University, Nanjing, in 1991, and his MS and PhD degrees in pattern recognition and intelligent systems from Nanjing University of Science and Technology, Nanjing, in 1996 and 2002, respectively. He has published more than 200 papers in his fields of research. His current research interests include pattern recognition, computer vision, and computational intelligence.

Josef Kittler received his BA, PhD, and DSc degrees from the University of Cambridge in 1971, 1974, and 1991, respectively. He is currently a professor of machine intelligence with the Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford, United Kingdom. He has authored the textbook Pattern Recognition: A Statistical Approach and over 600 scientific papers. His current research interests include biometrics, video and image database retrieval, medical image analysis, and cognitive vision.