Session Details

Clustering 1

Tuesday, 11 December
10:00 – 12:00
Room: Salle des nations I
Session Chair: Mohammad Hasan

10:00 Efficient Kernel Clustering Using Random Fourier Features
Radha Chitta, Rong Jin, and Anil K. Jain
DM275

Kernel clustering algorithms can capture the non-linear structure inherent in many real-world data sets and thereby achieve better clustering performance than Euclidean-distance-based algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large-scale classification, to accelerate kernel clustering. The key idea behind using random Fourier maps for clustering is to project the data into a low-dimensional space in which the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme that uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy comparable to that of state-of-the-art kernel clustering algorithms.
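
A minimal sketch of this pipeline, assuming an RBF kernel and using scikit-learn's KMeans as the linear clustering step; the function and parameter names are illustrative, not the authors' implementation:

import numpy as np
from sklearn.cluster import KMeans

def rff_cluster(X, n_clusters, n_features=128, gamma=1.0, use_svd=False, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # For k(x, y) = exp(-gamma * ||x - y||^2), sample frequencies from
    # N(0, 2 * gamma * I) and phases uniformly from [0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_features, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)  # Z @ Z.T approximates the kernel matrix
    if use_svd:
        # Improved scheme: cluster the top singular vectors of Z instead.
        U, _, _ = np.linalg.svd(Z, full_matrices=False)
        Z = U[:, :n_clusters]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)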

10:20 Kernel-based Weighted Multi-view Clustering
Grigorios Tzortzis and Aristidis Likas
DM359

Exploiting multiple representations, or views, of the same set of instances within a clustering framework is a popular practice for boosting clustering accuracy. However, some of the available sources may be misleading (due to noise, measurement errors, etc.) about the true structure of the data, so their inclusion in the clustering process may negatively affect the result. This aspect seems to be overlooked in the multi-view literature, where all representations are considered equally. In this work, views are expressed in terms of given kernel matrices, and a weighted combination of the kernels is learned in parallel with the partitioning. The weights assigned to the kernels indicate the quality of the corresponding views' information. Additionally, the combination scheme incorporates a parameter that controls the admissible sparsity of the weights, to avoid extremes and tailor them to the data. Two efficient iterative algorithms are proposed that alternate between updating the view weights and recomputing the clusters, optimizing the intra-cluster variance from different perspectives. The conducted experiments reveal the effectiveness of our methodology compared to other multi-view methods.
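
A minimal sketch of the alternating scheme, assuming precomputed kernel matrices; spectral clustering on the combined kernel stands in here for the partitioning step, and the closed-form weight update (views with lower intra-cluster variance receive higher weight, with p > 1 controlling sparsity) is the usual one for objectives of this form, not necessarily the authors' exact update:

import numpy as np
from sklearn.cluster import SpectralClustering

def intra_cluster_variance(K, labels):
    # Kernel k-means distortion: sum_i K_ii - sum_c (1' K_c 1) / |c|.
    total = np.trace(K)
    for c in np.unique(labels):
        idx = labels == c
        total -= K[np.ix_(idx, idx)].sum() / idx.sum()
    return total

def weighted_multiview_cluster(kernels, n_clusters, p=2.0, n_iter=10):
    w = np.full(len(kernels), 1.0 / len(kernels))  # start from uniform weights
    for _ in range(n_iter):
        K = sum(wv**p * Kv for wv, Kv in zip(w, kernels))  # weighted combination
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity='precomputed').fit_predict(K)
        d = np.array([intra_cluster_variance(Kv, labels) for Kv in kernels])
        w = (1.0 / d) ** (1.0 / (p - 1.0))  # low distortion -> high weight
        w /= w.sum()
    return labels, w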

10:40 A General and Scalable Approach to Mixed Membership Clustering
Frank Lin and William W. Cohen
DM481

Spectral clustering methods are elegant and effective graph-based node clustering methods, but they do not allow mixed membership clustering. We describe an approach that first transforms the data from a node-centric representation to an edge-centric one, and then uses this representation to define a scalable and competitive mixed membership alternative to spectral clustering methods. Experimental results show that the proposed approach substantially improves over node clustering methods on mixed membership clustering tasks.
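
A minimal sketch of the edge-centric idea: cluster edges (each edge receives a single label), then read off a node's mixed membership from the labels of its incident edges. The edge representation used here (the sum of the endpoints' spectral embeddings) is an illustrative choice, not the authors' construction:

import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def mixed_membership(adj, edges, n_clusters, dim=8):
    # Node embedding from the top eigenvectors of the (symmetric) adjacency.
    _, node_emb = eigsh(adj.astype(float), k=dim, which='LA')
    # Edge-centric representation: one instance per edge, order-invariant.
    edge_feats = node_emb[edges[:, 0]] + node_emb[edges[:, 1]]
    edge_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(edge_feats)
    # A node's membership vector is the label histogram of its incident edges.
    memb = np.zeros((adj.shape[0], n_clusters))
    for (u, v), c in zip(edges, edge_labels):
        memb[u, c] += 1
        memb[v, c] += 1
    return memb / np.maximum(memb.sum(axis=1, keepdims=True), 1)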

11:00 Robust Nonnegative Matrix Factorization via Half-Quadratic Minimization
Liang Du, Xuan Li, and Yi-Dong Shen
DM272

Nonnegative matrix factorization (NMF) is a popular technique for learning parts-based representations and for data clustering. It usually uses the squared residuals to quantify the quality of the factorization, which is optimal only for zero-mean Gaussian noise and is sensitive to outliers in general. In this paper, we propose a robust NMF method based on the correntropy-induced metric, which is much less sensitive to outliers. A half-quadratic optimization algorithm is developed to solve the proposed problem efficiently. The proposed method is further extended to handle outlier rows by incorporating structural knowledge about the outliers. Experimental results on data sets with and without apparent outliers demonstrate the effectiveness of the proposed algorithms.
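
A minimal sketch of the half-quadratic scheme with a correntropy (Welsch) loss: the auxiliary weights Q down-weight entries with large residuals, and each inner step is a standard weighted-NMF multiplicative update. The bandwidth heuristic and the initialization are illustrative choices, not the authors' exact algorithm:

import numpy as np

def robust_nmf(X, rank, n_iter=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        R = X - W @ H
        sigma2 = np.mean(R**2) + eps        # simple kernel-bandwidth heuristic
        Q = np.exp(-R**2 / (2.0 * sigma2))  # half-quadratic auxiliary weights
        WH = W @ H
        W *= ((Q * X) @ H.T) / ((Q * WH) @ H.T + eps)
        WH = W @ H
        H *= (W.T @ (Q * X)) / (W.T @ (Q * WH) + eps)
    return W, H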

11:20 An Ellipsoidal K-means for Document Clustering
Fabon Dzogang, Christophe Marsala, Marie-Jeanne Lesot, and Maria Rifqi
DM915

We propose an extension of the spherical K-means algorithm for settings where the number of data points is much smaller than the number of dimensions. We assume the data lie in local, dense regions of the original space, and we propose to embed each cluster into its own ellipsoid. A new objective function is introduced, and analytical solutions are derived for both the centroids and the associated ellipsoids. Furthermore, a complexity study shows that the algorithm is of the same order as the regular K-means algorithm. Results on both synthetic and real data show the efficiency of the proposed method.
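
A minimal sketch of the per-cluster ellipsoid idea as a Mahalanobis-style alternation with diagonal metrics and closed-form updates; the diagonal parameterization and the unit-determinant normalization (to prevent metric collapse) are illustrative choices, not the authors' spherical formulation:

import numpy as np

def ellipsoidal_kmeans(X, k, n_iter=20, eps=1e-6, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    C = X[rng.choice(n, k, replace=False)].copy()  # centroids
    A = np.ones((k, d))                            # diagonal metric per cluster
    for _ in range(n_iter):
        # Assignment under each cluster's squared Mahalanobis distance.
        dist = np.stack([((X - C[j])**2 * A[j]).sum(axis=1) for j in range(k)])
        labels = dist.argmin(axis=0)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue
            C[j] = pts.mean(axis=0)                # closed-form centroid
            A[j] = 1.0 / (pts.var(axis=0) + eps)   # axis scales from variances
            A[j] /= A[j].prod() ** (1.0 / d)       # keep det(metric) = 1
    return labels, C, A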

11:40 Self-adjusting Models for Semi-supervised Learning in Partially-observed Settings
Ferit Akova, Murat Dundar, Yuan Qi, and Bartek Rajwa
DM931

We present a new direction for semi-supervised learning in which self-adjusting generative models replace fixed ones and unlabeled data can improve learning even when the set of labeled classes is only partially observed. We model each class by a mixture model and use a hierarchical Dirichlet process (HDP) to model observed as well as unobserved classes. We extend the standard HDP model to accommodate unlabeled samples and introduce a new sharing strategy, within the context of Gaussian mixture models, that restricts sharing to the covariance matrices while leaving the mean vectors free. Our research is mainly driven by real-world applications with evolving data-generating mechanisms, where obtaining a fully observed labeled data set is impractical. We demonstrate the feasibility of the proposed approach for semi-supervised learning in two such applications.
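
A minimal sketch of the covariance-sharing idea in a semi-supervised Gaussian mixture: components keep free means but share one pooled covariance, and unlabeled points enter the updates through their responsibilities. This is a single-level EM illustration that assumes every class is observed, not the hierarchical Dirichlet process model of the paper:

import numpy as np
from scipy.stats import multivariate_normal

def shared_cov_em(X_lab, y_lab, X_unl, n_classes, n_iter=20):
    d = X_lab.shape[1]
    means = np.stack([X_lab[y_lab == c].mean(axis=0) for c in range(n_classes)])
    cov = np.cov(X_lab.T) + 1e-6 * np.eye(d)  # single covariance, shared by all
    priors = np.bincount(y_lab, minlength=n_classes) / len(y_lab)
    for _ in range(n_iter):
        # E-step on the unlabeled data only; labeled assignments stay fixed.
        lik = np.stack([priors[c] * multivariate_normal.pdf(X_unl, means[c], cov)
                        for c in range(n_classes)], axis=1)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: free mean per class, one pooled covariance across classes.
        cov = np.zeros((d, d))
        for c in range(n_classes):
            lab, w = X_lab[y_lab == c], resp[:, c]
            tot = len(lab) + w.sum()
            means[c] = (lab.sum(axis=0) + w @ X_unl) / tot
            dl, du = lab - means[c], X_unl - means[c]
            cov += dl.T @ dl + (w[:, None] * du).T @ du
            priors[c] = tot / (len(X_lab) + len(X_unl))
        cov = cov / (len(X_lab) + len(X_unl)) + 1e-6 * np.eye(d)
    return means, cov, priors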