Graph Regularized Semi-Supervised Concept Factorization

Concept Factorization (CF) is a new matrix decomposition technique for data representation. A modified CF algorithm, called Graph Regularized Semi-supervised Concept Factorization (GRSCF), is proposed to address the limitations of CF and Locally Consistent Concept Factorization (LCCF), which do not jointly consider the geometric structure and the label information of the data. GRSCF preserves the intrinsic geometry of the data through a graph regularization term and uses the label information as semi-supervised constraints, so that nearby samples with the same class label become more compact and nearby classes are separated. Experimental results on the ORL face database and the Coil20 image database show that the proposed method achieves better clustering results than Non-negative Matrix Factorization (NMF), CNMF, CF and LCCF.


Introduction
Machine learning, pattern recognition and data mining have been areas of great development during the past decade. One of the primary goals of many data mining and machine learning systems is dimensionality reduction. The goal of dimensionality reduction is to reduce the number of features of the data in order to perform tasks such as clustering or training a classifier. Matrix factorization techniques such as eigendecomposition and Singular Value Decomposition (SVD) have been widely applied in dimensionality reduction algorithms, including Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and graph embedding.
Recently, Nonnegative Matrix Factorization (NMF, [1]) has been proposed to incorporate non-negativity constraints and obtain a parts-based representation. It has been widely applied to image processing and pattern recognition applications, because the learned bases can be interpreted as a natural parts-based representation of the data. In particular, it represents data as a linear combination of a set of basis vectors, in which both the combination coefficients and the basis vectors are nonnegative. However, one drawback of NMF is that it can only be performed in the original feature space of the data points, and thus cannot make use of the power of kernelization. To overcome the limitations of NMF while inheriting all its strengths, Xu and Gong developed Concept Factorization (CF) for image clustering [2]. In CF, each basis vector (named concept) is modeled as a linear combination of the data points, and each data point as a linear combination of the basis vectors. With this model, CF is accomplished by computing the two sets of linear coefficients, which is carried out by finding the non-negative solution that minimizes the reconstruction error of the data points. After some mathematical transformations, it can be shown that the computation of CF only involves the kernel matrix of the data. The capability of using any kernel function for defining the kernel matrix dramatically increases the power of CF.
Motivated by recent progress in semi-supervised and manifold learning [3,4,5,6,7,8], in this paper we propose a novel algorithm, called Graph Regularized Semi-supervised Concept Factorization (GRSCF), which explicitly considers the local geometrical information of the data and takes the label information of two images from each category as additional hard constraints. The central idea of our approach is that data points from the same class should be merged together in the new representation space, and that we want to find a parts-based representation space in which two data points are sufficiently close to each other if they are connected in the nearest neighbor graph. In this way, GRSCF learns a concept factorization of the data matrix which has more discriminating power. The objective function of GRSCF is not convex in its variables jointly, but it is convex in each of them individually. This fact suggests an optimization scheme that minimizes the objective function alternately, each time optimizing one variable while keeping the others fixed. Thus, we can find a locally optimal solution for GRSCF through alternating convex programming. We present some initial results of the proposed method on the ORL and Coil20 image databases.

Non-negative Matrix Factorization (NMF)
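Suppose that the data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ consists of $n$ nonnegative $m$-dimensional data vectors, each of which is a sample vector. NMF aims to find two nonnegative matrices $U \in \mathbb{R}^{m \times k}$ and $H \in \mathbb{R}^{k \times n}$ which minimize the following objective function:

$$O = \|X - UH\|_F^2$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm. The multiplicative updating rules minimizing the above objective function are as follows:

$$u_{ik} \leftarrow u_{ik}\,\frac{(XH^T)_{ik}}{(UHH^T)_{ik}}, \qquad h_{kj} \leftarrow h_{kj}\,\frac{(U^TX)_{kj}}{(U^TUH)_{kj}}$$

It has been proved that the above update rules will find a local minimum of the objective function.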
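The multiplicative updates can be illustrated with a short NumPy sketch. It assumes the factorization is written as $X \approx UH$ with $U$ of size $m \times k$ and $H$ of size $k \times n$; the function name `nmf`, the random initialization, the iteration count and the small constant `eps` that guards against division by zero are illustrative assumptions, not part of the method as published.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF sketch: X (m x n, nonnegative) ~ U (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))  # assumed random nonnegative start
    H = rng.random((k, n))
    errors = [np.linalg.norm(X - U @ H, "fro") ** 2]
    for _ in range(n_iter):
        # u_ik <- u_ik * (X H^T)_ik / (U H H^T)_ik
        U *= (X @ H.T) / (U @ H @ H.T + eps)
        # h_kj <- h_kj * (U^T X)_kj / (U^T U H)_kj
        H *= (U.T @ X) / (U.T @ U @ H + eps)
        errors.append(np.linalg.norm(X - U @ H, "fro") ** 2)
    return U, H, errors

# Toy usage on random nonnegative data (illustrative only).
X_demo = np.random.default_rng(1).random((20, 30))
U_demo, H_demo, nmf_errors = nmf(X_demo, k=4)
print(nmf_errors[0], nmf_errors[-1])
```

Because every factor in the updates is nonnegative, $U$ and $H$ stay nonnegative without any explicit projection step.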

Graph Regularized Semi-supervised Concept Factorization (GRSCF)
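Concept Factorization (CF). Concept Factorization (CF) is developed on the basis of NMF. It models the basis vector matrix $U$ of NMF as a linear combination of the data points, that is, $U = XW$, so that the data matrix is factorized as $X \approx XWH$ with $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{k \times n}$. The optimization problem of CF is given by

$$\min_{W,H}\; O = \|X - XWH\|_F^2 \quad \text{s.t. } W \geq 0,\; H \geq 0$$

The iterative update algorithm is as follows:

$$w_{ik} \leftarrow w_{ik}\,\frac{(KH^T)_{ik}}{(KWHH^T)_{ik}}, \qquad h_{kj} \leftarrow h_{kj}\,\frac{(W^TK)_{kj}}{(W^TKWH)_{kj}}$$

where $K = X^TX$, which computes the inner products in the original data space.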
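Since the CF updates depend on the data only through $K = X^TX$, they can be sketched with the kernel matrix as the sole input. The code below is a minimal illustration assuming $X \approx XWH$ with $W$ of size $n \times k$ and $H$ of size $k \times n$, and a kernel with nonnegative entries; the names `concept_factorization` and `cf_objective`, the initialization and `eps` are assumptions made for the example.

```python
import numpy as np

def cf_objective(K, W, H):
    """||X - X W H||_F^2 expressed through K: Tr((I - WH)^T K (I - WH))."""
    R = np.eye(K.shape[0]) - W @ H
    return float(np.trace(R.T @ K @ R))

def concept_factorization(K, k, n_iter=200, eps=1e-10, seed=0):
    """CF sketch by multiplicative updates; K is the n x n kernel matrix."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    W = rng.random((n, k))
    H = rng.random((k, n))
    history = [cf_objective(K, W, H)]
    for _ in range(n_iter):
        # w_ik <- w_ik * (K H^T)_ik / (K W H H^T)_ik
        W *= (K @ H.T) / (K @ W @ H @ H.T + eps)
        # h_kj <- h_kj * (W^T K)_kj / (W^T K W H)_kj
        H *= (W.T @ K) / (W.T @ K @ W @ H + eps)
        history.append(cf_objective(K, W, H))
    return W, H, history

# Toy usage with the linear kernel of random nonnegative data.
X_demo = np.random.default_rng(0).random((15, 25))
K_demo = X_demo.T @ X_demo
W_demo, H_demo, cf_history = concept_factorization(K_demo, k=3)
direct_error = np.linalg.norm(X_demo - X_demo @ W_demo @ H_demo, "fro") ** 2
print(cf_history[0], cf_history[-1], direct_error)
```

Replacing `K_demo` by any other kernel matrix with nonnegative entries leaves the update code unchanged, which is the kernelization advantage of CF over NMF.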

Locally Consistent Concept Factorization (LCCF). CF fails to discover the local geometrical structure of the data space, which is essential to the clustering problem. Recent studies on manifold learning theory have demonstrated that the local geometric structure can be efficiently modeled through a nearest neighbor graph on a scatter of data points. Let $G = \{X, S\}$ be an undirected weighted graph, where $X$ denotes the set of data vectors (the vertices) and $S \in \mathbb{R}^{n \times n}$ is the similarity matrix, which is symmetric and whose element $S_{ij}$ measures the similarity between a pair of vertices. Define the edge weight matrix $S$ as follows:

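$$S_{ij} = \begin{cases} s(x_i, x_j), & \text{if } x_i \in N(x_j) \text{ or } x_j \in N(x_i), \\ 0, & \text{otherwise,} \end{cases}$$

where $s(x_i, x_j)$ is the similarity weight of the pair and $N(x_i)$ denotes the set of nearest neighbors of $x_i$. Then, the following term can be used to measure the smoothness of the low-dimensional representations:

$$R = \frac{1}{2}\sum_{i,j=1}^{n} \|h_i - h_j\|^2 S_{ij} = \operatorname{Tr}(HDH^T) - \operatorname{Tr}(HSH^T) = \operatorname{Tr}(HLH^T)$$

where $h_j$ is the $j$-th column of $H$, $\operatorname{Tr}(\cdot)$ denotes the trace of a matrix, $D$ is a diagonal matrix with $D_{ii} = \sum_j S_{ij}$, and $L = D - S$, which is called the graph Laplacian.

The objective function of LCCF is as follows:

$$O = \|X - XWH\|_F^2 + \lambda \operatorname{Tr}(HLH^T)$$

where $\lambda \geq 0$ is the regularization parameter.

We can get the updating rules as follows:

$$w_{ik} \leftarrow w_{ik}\,\frac{(KH^T)_{ik}}{(KWHH^T)_{ik}}, \qquad h_{kj} \leftarrow h_{kj}\,\frac{(W^TK + \lambda HS)_{kj}}{(W^TKWH + \lambda HD)_{kj}}$$

584 Information Technology for Manufacturing Systems III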
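The graph construction and the smoothness term can be checked numerically with a small sketch. It assumes binary (0/1) edge weights, which is only one possible choice of similarity, and the helper names `knn_graph` and `graph_laplacian` are hypothetical. The example verifies that the pairwise form of the smoothness term equals $\operatorname{Tr}(HLH^T)$ with $L = D - S$.

```python
import numpy as np

def knn_graph(X, p):
    """Symmetric 0/1 p-nearest-neighbor graph over the columns of X (m x n).

    Assumption: binary weights; any symmetric nonnegative similarity could be used.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # squared distances
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :p]
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), p), nearest.ravel()] = 1.0
    return np.maximum(S, S.T)  # edge if x_i in N(x_j) or x_j in N(x_i)

def graph_laplacian(S):
    """Return the degree matrix D and the graph Laplacian L = D - S."""
    D = np.diag(S.sum(axis=1))
    return D, D - S

rng = np.random.default_rng(0)
X_demo = rng.random((5, 12))   # 12 points in 5 dimensions
H_demo = rng.random((2, 12))   # a 2-dimensional representation, one column per point
S_demo = knn_graph(X_demo, p=3)
D_demo, L_demo = graph_laplacian(S_demo)

# Pairwise form: 0.5 * sum_ij ||h_i - h_j||^2 * S_ij
diff = H_demo[:, :, None] - H_demo[:, None, :]
pairwise_demo = 0.5 * np.sum(S_demo * np.sum(diff ** 2, axis=0))
trace_demo = np.trace(H_demo @ L_demo @ H_demo.T)
print(pairwise_demo, trace_demo)
```

The two printed values agree up to rounding, which is why the regularizer can be written compactly as a trace.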

Graph Regularized Semi-supervised Concept Factorization (GRSCF). NMF, CF and LCCF are unsupervised learning algorithms. In real-world applications, however, a small amount of labeled data is often available, and many researchers have found that using it can produce considerable improvement in learning accuracy [8].
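Consider a data set consisting of $n$ data points $x_i$, $i = 1, 2, \ldots, n$, whose available label information is encoded in a label information matrix $C$. We want the data points from the same class to be merged together in the new representation space, so that the obtained low-dimensional representation has the same label as the original data and therefore has more discriminating power. Similar to CNMF, we incorporate the label information into LCCF by writing the low-dimensional representation as $HC$, where $H$ is now a nonnegative auxiliary matrix, and get the following objective function of GRSCF:

$$O = \|X - XWHC\|_F^2 + \lambda \operatorname{Tr}(HCLC^TH^T)$$

Using $K = X^TX$, we rewrite the objective function $O$:

$$O = \operatorname{Tr}(K) - 2\operatorname{Tr}(C^TH^TW^TK) + \operatorname{Tr}(C^TH^TW^TKWHC) + \lambda \operatorname{Tr}(HCLC^TH^T)$$

Let $\phi_{ik}$ and $\psi_{kj}$ be the Lagrange multipliers for the constraints $w_{ik} \geq 0$ and $h_{kj} \geq 0$, with $\Phi = [\phi_{ik}]$ and $\Psi = [\psi_{kj}]$. The Lagrangian is

$$\mathcal{L} = O + \operatorname{Tr}(\Phi W^T) + \operatorname{Tr}(\Psi H^T)$$

The partial derivatives of $\mathcal{L}$ with respect to $W$ and $H$ are

$$\frac{\partial \mathcal{L}}{\partial W} = -2KC^TH^T + 2KWHCC^TH^T + \Phi$$

$$\frac{\partial \mathcal{L}}{\partial H} = -2W^TKC^T + 2W^TKWHCC^T + 2\lambda HCLC^T + \Psi$$

Using the KKT conditions $\phi_{ik} w_{ik} = 0$ and $\psi_{kj} h_{kj} = 0$, we get

$$-(KC^TH^T)_{ik}\, w_{ik} + (KWHCC^TH^T)_{ik}\, w_{ik} = 0$$

$$-(W^TKC^T)_{kj}\, h_{kj} + (W^TKWHCC^T)_{kj}\, h_{kj} + \lambda (HCLC^T)_{kj}\, h_{kj} = 0$$

With $L = D - S$, the above equations lead to the following updating formulas:

$$w_{ik} \leftarrow w_{ik}\,\frac{(KC^TH^T)_{ik}}{(KWHCC^TH^T)_{ik}}, \qquad h_{kj} \leftarrow h_{kj}\,\frac{(W^TKC^T + \lambda HCSC^T)_{kj}}{(W^TKWHCC^T + \lambda HCDC^T)_{kj}}$$

When $C = I$, GRSCF reduces to LCCF, and when additionally $\lambda = 0$, it reduces to CF. Hence CF and LCCF are special cases of GRSCF.

Advanced Engineering Forum Vols. 6-7 585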
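A compact NumPy sketch of these updates is given below. It assumes the objective $\|X - XWHC\|_F^2 + \lambda \operatorname{Tr}(HCLC^TH^T)$ with $L = D - S$, a label matrix $C$ that stacks a class indicator block for the labeled points on top of an identity block for the unlabeled ones, and a 0/1 neighbor graph; the function names, the toy data, `lam=1.0` and `eps` are all illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

def grscf_objective(K, L, C, W, H, lam):
    V = H @ C  # k x n low-dimensional representation
    R = np.eye(K.shape[0]) - W @ V
    return float(np.trace(R.T @ K @ R) + lam * np.trace(V @ L @ V.T))

def grscf(K, S, C, k, lam=1.0, n_iter=100, eps=1e-10, seed=0):
    """GRSCF sketch: K (n x n kernel, nonnegative), S (n x n graph), C (q x n labels)."""
    rng = np.random.default_rng(seed)
    n, q = K.shape[0], C.shape[0]
    D = np.diag(S.sum(axis=1))
    L = D - S
    W = rng.random((n, k))
    H = rng.random((k, q))  # auxiliary matrix; the representation is H @ C
    history = [grscf_objective(K, L, C, W, H, lam)]
    for _ in range(n_iter):
        V = H @ C
        W *= (K @ V.T) / (K @ W @ V @ V.T + eps)
        num = W.T @ K @ C.T + lam * H @ C @ S @ C.T
        den = W.T @ K @ W @ H @ C @ C.T + lam * H @ C @ D @ C.T + eps
        H *= num / den
        history.append(grscf_objective(K, L, C, W, H, lam))
    return W, H, history

# Toy data: 12 points, the first 4 labeled (classes 0, 0, 1, 1).
rng = np.random.default_rng(0)
n, l, c = 12, 4, 2
X_demo = rng.random((6, n))
K_demo = X_demo.T @ X_demo
A = np.zeros((c, l))
A[[0, 0, 1, 1], np.arange(l)] = 1.0
C_demo = np.block([[A, np.zeros((c, n - l))],
                   [np.zeros((n - l, l)), np.eye(n - l)]])
# Assumed 0/1 three-nearest-neighbor graph.
dist = ((X_demo[:, :, None] - X_demo[:, None, :]) ** 2).sum(axis=0)
np.fill_diagonal(dist, np.inf)
idx = np.argsort(dist, axis=1)[:, :3]
S_demo = np.zeros((n, n))
S_demo[np.repeat(np.arange(n), 3), idx.ravel()] = 1.0
S_demo = np.maximum(S_demo, S_demo.T)

W_demo, H_demo, grscf_history = grscf(K_demo, S_demo, C_demo, k=2)
V_demo = H_demo @ C_demo
print(grscf_history[0], grscf_history[-1])
```

In `V_demo`, the columns of points sharing a label are identical by construction, which is the hard constraint that merges same-class samples in the new space.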

Numerical Experiments
Clustering is an important application of NMF and its variants. To evaluate the clustering performance of the proposed algorithm, we compare GRSCF with NMF [1], CF [2], LCCF [6], CNMF [7] and k-means. Two metrics, purity [9] and normalized mutual information (NMI) [2], are adopted to measure the clustering performance. All these algorithms were implemented in Matlab 7.10, and all experiments were run on an Intel Core 2 Quad 2.2 GHz processor with 2 GB memory under Windows XP. We set the maximum number of iterations of these algorithms to 1000 and keep it constant in all the following experiments.
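For reference, the two metrics can be sketched in plain Python. Purity is the fraction of samples that belong to the majority class of their cluster. For NMI, the sketch assumes the mutual information is normalized by the larger of the two entropies; other normalizations exist, so this choice and the function names are assumptions.

```python
import math
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of samples that fall in the majority class of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(true_labels)

def normalized_mutual_information(true_labels, cluster_labels):
    """Mutual information divided by max(H(true), H(cluster)) (assumed normalization)."""
    n = len(true_labels)
    n_t = Counter(true_labels)
    n_c = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    mi = sum((n_tc / n) * math.log(n_tc * n / (n_t[t] * n_c[c]))
             for (t, c), n_tc in joint.items())
    h_t = -sum((v / n) * math.log(v / n) for v in n_t.values())
    h_c = -sum((v / n) * math.log(v / n) for v in n_c.values())
    denom = max(h_t, h_c)
    return mi / denom if denom > 0 else 1.0

true_demo = [0, 0, 0, 1, 1, 1]
perfect_demo = [1, 1, 1, 0, 0, 0]    # same partition, permuted cluster ids
imperfect_demo = [0, 0, 1, 1, 1, 1]  # one sample assigned to the wrong cluster
print(purity(true_demo, perfect_demo), purity(true_demo, imperfect_demo))
print(normalized_mutual_information(true_demo, imperfect_demo))
```

Both metrics are invariant to a relabeling of the clusters, so no matching between cluster ids and class ids is needed.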
Our empirical studies on clustering were carried out on two image databases: ORL and Coil20. Preprocessing to locate the faces was applied. The original images were normalized and aligned, and each cropped image is 32 × 32 pixels. Fig. 1 shows some sample images.
For each data set, the evaluations are conducted with different numbers of clusters, varying from two to ten. For the semi-supervised algorithms GRSCF and CNMF, we randomly pick two images from each category and use their label information as the semi-supervised information. We set the dimensionality of the new space to be the same as the number of clusters. The input matrix X for all methods has the same arrangement: each column denotes an instance. The experiments are performed 10 times and the average results are reported. The number of nearest neighbors p is set to 9, and the regularization parameter λ is searched from the grid $\{10^{-5}, \ldots, 10^{0}, \ldots, 10^{5}\}$.

Conclusions
A novel method, Graph Regularized Semi-supervised Concept Factorization (GRSCF), which incorporates the local geometrical structure of the data and the label information into concept factorization, is presented. Specifically, GRSCF models the data space as a submanifold embedded in the ambient space and, as semi-supervised learning, projects the two labeled samples of each class to one point in this space. As a result, nearby samples with the same class label become more compact and nearby classes are separated, so the new representations of the data points have more discriminating power. Compared with NMF, CNMF, CF, LCCF and k-means, the experimental results on two standard image databases have demonstrated the effectiveness of our approach.
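The indicator matrix $A$ is defined by $a_{ij} = 1$ if $x_j$ is labeled with the $i$-th class and $a_{ij} = 0$ otherwise. The label information matrix $C$ is defined as follows:

$$C = \begin{pmatrix} A & 0 \\ 0 & I \end{pmatrix}$$

where $I$ is the identity matrix associated with the unlabeled data points.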
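As a small illustration, the two matrices can be built from a list of class indices. The sketch assumes the labeled points come first, that classes are numbered from 0 and all appear among the labels, and that the unlabeled points are each given their own identity column; the function name `label_constraint_matrix` is hypothetical.

```python
import numpy as np

def label_constraint_matrix(labels, n):
    """Build the indicator matrix A (c x l) and the label matrix C ((c + n - l) x n).

    labels: class indices (0..c-1) of the first l data points; n: total number of points.
    """
    labels = np.asarray(labels)
    l = labels.size
    c = int(labels.max()) + 1
    A = np.zeros((c, l))
    A[labels, np.arange(l)] = 1.0   # a_ij = 1 if x_j is labeled with class i
    C = np.zeros((c + n - l, n))
    C[:c, :l] = A                   # labeled block
    C[c:, l:] = np.eye(n - l)       # unlabeled points keep their own column
    return A, C

# Six points, the first four labeled with classes 0, 0, 1, 1.
A_demo, C_demo = label_constraint_matrix([0, 0, 1, 1], n=6)
print(C_demo)
```

Every column of `C_demo` contains a single 1, so multiplying an auxiliary matrix by `C_demo` copies one of its columns per data point and gives identical representations to points that share a label.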

The value λ = 10 is used for LCCF, CNMF and GRSCF. Fig. 2 and Fig. 3 show the average clustering performance on these two databases. As can be seen from Fig. 2 and Fig. 3, the proposed GRSCF algorithm consistently outperforms all the other algorithms.