You are here

Session Details

Clustering 3

Thursday, 13 December
10:00 – 12:00
Room: Alto & Mezzo & Tempo
Session Chair: Emmanuel Müller

10:00 Robust Ensemble Clustering by Matrix Completion short_paper) echo(" (Short)");?>
Jinfeng Yi, Tianbao Yang, Rong Jin, Anil K. Jain, and Mehrdad Mahdavi

Data clustering is an important task and has found applications in numerous real-world problems. Since no single clustering algorithm is able to identify all different types of cluster shapes and structures, ensemble clustering was proposed to combine different partitions of the same data generated by multiple clustering algorithms. The key idea of most ensemble clustering algorithms is to find a partition that is consistent with most of the available partitions of the input data. One problem with these algorithms is their inability to handle uncertain data pairs, i.e. data pairs for which about half of the partitions put them into the same cluster and the other half do the opposite. When the number of uncertain data pairs is large, they can mislead the ensemble clustering algorithm in generating the final partition. To overcome this limitation, we propose an ensemble clustering approach based on the technique of matrix completion. The proposed algorithm constructs a partially observed similarity matrix based on the data pairs whose cluster memberships are agreed upon by most of the clustering algorithms in the ensemble. It then deploys the matrix completion algorithm to complete the similarity matrix. The final data partition is computed by applying an efficient spectral clustering algorithm to the completed matrix. Our empirical studies with multiple real-world datasets show that the proposed algorithm performs significantly better than the state-of-the-art algorithms for ensemble clustering.

10:12 Sparse Subspace Representation for Spectral Document Clustering short_paper) echo(" (Short)");?>
Budhaditya Saha, Dinh Phung, Duc Son Pham, and Svetha Venkatesh

We present a novel method for document clustering using sparse representation of documents in conjunction with spectral clustering. An l1 - norm optimization formulation is posed to learn the sparse representation of each document, allowing us to characterize the affinity between documents by considering the overall information instead of traditional pair wise similarities. This document affinity is encoded through a graph on which spectral clustering is performed. The decomposition into multiple subspaces allows documents to be part of a sub-group that shares a smaller set of similar vocabulary, thus allowing for cleaner clusters. Extensive experimental evaluations on two real-world datasets from Reuters-21578 and 20Newsgroup corpora show that our proposed method consistently outperforms state-of-the-art algorithms. Significantly, the performance improvement over other methods is prominent for this datasets.

10:24 Labels vs. Pairwise Constraints: A Unified View of Label Propagation and Constrained Spectral Clustering short_paper) echo(" (Short)");?>
Xiang Wang, Buyue Qian, and Ian Davidson

In many real-world applications we can model the data as a graph with each node being an instance and the edges indicating a degree of similarity. Side information is often available in the form of labels for a small subset of instances, which gives rise to two problem settings and two types of algorithms. In the label propagation style algorithms, the known labels are propagated to the unlabeled nodes. In the constrained clustering style algorithms, known labels are first converted to pair wise constraints (Must-Link and Cannot-Link), then a constrained cut is computed as a tradeoff between minimizing the cut cost and maximizing the constraint satisfaction. Both techniques are evaluated by their ability to recover the ground truth labeling, i.e. by 0/1 loss function either directly on the labels or on the pair wise relations derived from the labels. These two fields have developed separately, but in this paper, we show that they are indeed related. This insight allows us to propose a novel way to generate constraints from the propagated labels, which our empirical study shows outperforms and is more stable than the state-of-the-art label propagation and constrained spectral clustering algorithms.

10:36 A Cluster-Oriented Genetic Algorithm for Alternative Clustering short_paper) echo(" (Short)");?>
Duy Tin Truong and Roberto Battiti

Supervised alternative clusterings is the problem of finding a set of clusterings which are of high quality and different from a given negative clustering. The task is therefore a clear multi-objective optimization problem. Optimizing two conflicting objectives requires dealing with trade-offs. Most approaches in the literature optimize these objectives sequentially or indirectly, resulting in solutions which are dominated. We develop a multi-objective algorithm, called COGNAC, able to optimize the objectives directly and simultaneously and producing solutions approximating the Pareto front. COGNAC performs the recombination operator at the cluster level instead of the object level as in traditional genetic algorithms. It can accept arbitrary clustering quality and dissimilarity objectives and provide solutions dominating those of other state-of-the-art algorithms. COGNAC can also be used to generate a sequence of alternative clusterings, each of which is guaranteed to be different from all previous ones.

10:48 Low Dimensional Localized Clustering (LDLC) short_paper) echo(" (Short)");?>
Pooyan Khajehpour Tadavani and Ali Ghodsi

In the space of high-dimensional data, it is generally reasonable to assume that the data points are on (or close to) one or more submanifolds. Each of these submanifolds can be modeled by a number of linear subspaces. This is in fact the main intuition behind a majority of subspace clustering algorithms. In many cases, however, the subspaces computed by these algorithms consist of disconnected subsets of the underlying submanifolds and therefore, do not form localized and compact clusters. To address this problem, we propose "Low Dimensional Localized Clustering (LDLC)", a new method for subspace clustering. Unlike existing methods, LDLC respects the topology of the underling submanifolds and assigns the data points to localized clusters such that the total reconstruction error is minimized. This is a valuable property in many tasks, such as semi-supervised classification, data visualization and local dimensionality reduction. We establish connections between LDLC, K-Means, and VQPCA from different perspectives, and validate our method through various experiments on synthetic and real data sets.

11:00 Exclusive Row Biclustering for Gene Expression Using a Combintorial Auction Approach short_paper) echo(" (Short)");?>
Amichai Painsky and Saharon Rosset

The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclusteing problem (also known as projected clustering) for gene expression data sets, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). In this paper we present a novel method to identify these exclusive row biclusters through a combination of existing biclustering algorithms and combinatorial auction techniques. We devise an approach for tuning the threshold for our algorithm based on comparison to a null model in the spirit of the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large span non-overlapping rows sub matrices, while considering their unique nature. The Gap statistic approach succeeds in identifying appropriate thresholds in all our examples.

11:12 Co-clustering of Multi-View Datasets: a Parallelizable Approach short_paper) echo(" (Short)");?>
Bisson Gilles and Clément Grimal

In many applications, entities of the domain are described through different views that clustering methods often process one by one. We introduce here the architecture MVSim, that is able to deal simultaneously with all the information contained in such multi-view datasets by using several instances of a co-similarity algorithm. We show that this architecture provides better results than both single-view and multi-view approaches and that it can be easily parallelized thus reducing both time and space complexities of the computations.

11:24 Clustering by Learning Constraints Priorities short_paper) echo(" (Short)");?>
Masayuki Okabe and Seiji Yamada

A method for creating a constrained clustering ensemble by learning the priorities of pair wise constraints is proposed in this paper. This method integrates multiple clusters produced by using a simple constrained K-means algorithm that we modify to utilize the constraints priorities. The cluster ensemble is executed according to a boosting framework, which adaptively learns the constraints priorities and provides them for the modified constrained K-means to create diverse clusters that finally improve the clustering performance. The experimental results show that our proposed method outperforms the original constrained K-means and is comparable to several state-of-the-art constrained clustering methods.

11:36 Deep Learning to Hash with Multiple Representations short_paper) echo(" (Short)");?>
Yoonseop Kang, Saehoon Kim, and Seungjin Choi

Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of data. However, objects are often described by multiple representations. For instance, images are described by a few different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH), compared to recently-proposed multi-modal deep network as well as existing shallow models of hashing.

11:48 A Similarity Model and Segmentation Algorithm for White Matter Fiber Tracks short_paper) echo(" (Short)");?>
Son Mai, Sebastian Goebl, and Claudia Plant

Recently, fiber segmentation has become an emerging technique in neuroscience. Grouping fiber tracts into anatomical meaningful bundles allows to study the structure of the brain and to investigate onset and progression of neurodegenerative and mental diseases. In this paper, we propose a novel technique for fiber tracts based on shape similarity and connection similarity. For shape similarity, we propose some new techniques adapted from existing similarity measures for trajectory data. We also propose a new technique called Warped Longest Common Subsequence (WLCS) for which we additionally developed a lower-bounding distance function to speed up the segmentation process. Our segmentation is based on an outlier-robust density-based clustering algorithm. Extensive experiments on diffusion tensor images demonstrate the efficiency and effectiveness of our technique.