Session Details

Classification 1

Tuesday, 11 December
10:00 – 12:00
Room: Rembrandt & Permeke
Session Chair: Chengqi Zhang

10:00 Hierarchical Multilabel Classification with Minimum Bayes Risk
Wei Bi and James T. Kwok
DM220

Hierarchical multilabel classification (HMC) allows an instance to have multiple labels residing in a hierarchy. A popular loss function used in HMC is the H-loss, which penalizes only the first classification mistake along each prediction path. However, the H-loss metric can only be used on tree-structured label hierarchies, but not on DAG hierarchies. Moreover, it may lead to misleading predictions as not all misclassifications in the hierarchy are penalized. In this paper, we overcome these deficiencies by proposing a hierarchy-aware loss function that is more appropriate for HMC. Using Bayesian decision theory, we then develop a Bayes-optimal classifier with respect to this loss function. Instead of requiring an exhaustive summation and search for the optimal multilabel, the proposed classification problem can be efficiently solved using a greedy algorithm on both tree- and DAG-structured label hierarchies. Experimental results on a large number of real-world data sets show that the proposed algorithm outperforms existing HMC methods.

10:20 Multi-Task Semi-Supervised Semantic Feature Learning for Classification
Changying Du, Fuzhen Zhuang, Qing He, and Zhongzhi Shi
DM223

Multi-task learning has proven to be useful to boost the learning of multiple related but different tasks. Meanwhile, latent semantic models such as LSA and LDA are popular and effective methods to extract discriminative semantic features of high dimensional dyadic data. In this paper, we present a method to combine these two techniques together by introducing a new matrix tri-factorization based formulation for semi-supervised latent semantic learning, which can incorporate labeled information into traditional unsupervised learning of latent semantics. Our inspiration for multi-task semantic feature learning comes from two facts: 1) multiple tasks generally share a set of common latent semantics, and 2) a semantic usually has a stable indication of categories no matter which task it is from. Thus to make multiple tasks learn from each other we wish to share the associations between categories and those common semantics among tasks. Along this line, we propose a novel joint nonnegative matrix tri-factorization framework with the aforesaid associations shared among tasks in the form of a semantic-category relation matrix. Our new formulation for multi-task learning can simultaneously learn (1) discriminative semantic features of each task, (2) predictive structure and categories of unlabeled data in each task, (3) common semantics shared among tasks and specific semantics exclusive to each task. We give an alternating iterative algorithm to optimize our objective and theoretically show its convergence. Finally, extensive experiments on text data, together with comparisons against various baselines and three state-of-the-art multi-task learning algorithms, demonstrate the effectiveness of our method.

10:40 Handling Ambiguity via Input-Output Kernel Learning
Xinxing Xu, Ivor W. Tsang, and Dong Xu
DM486

Data ambiguities exist in many data mining and machine learning applications such as text categorization and image retrieval. For instance, it is generally beneficial to utilize the ambiguous unlabeled documents to learn a more robust classifier for text categorization under the semi-supervised learning setting. To handle general data ambiguities, we present a unified kernel learning framework named Input-Output Kernel Learning (IOKL). Based on our framework, we further propose a novel soft margin group sparse Multiple Kernel Learning (MKL) formulation by introducing a group kernel slack variable to each group of base input-output kernels. Moreover, an efficient block-wise coordinate descent algorithm with an analytical solution for the kernel combination coefficients is developed to solve the proposed formulation. We conduct comprehensive experiments on benchmark datasets for both semi-supervised learning and multiple instance learning tasks, and also apply our IOKL framework to a computer vision application called text-based image retrieval on the NUS-WIDE dataset. Promising results demonstrate the effectiveness of our proposed IOKL framework.

11:00 ConfDTree: Improving Decision Trees Using Confidence Intervals
Gilad Katz, Asaf Shabtai, Lior Rokach, and Nir Ofek
DM403

Decision trees have three main disadvantages: reduced performance when the training set is small, rigid decision criteria, and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree, a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses confidence intervals to identify these hard-to-classify instances and proposes alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets, for which an average improvement of 5%-9% in AUC is reported.

11:20 Unsupervised Multi-Class Regularized Least-Squares Classification
Tapio Pahikkala, Antti Airola, Fabian Gieseke, and Oliver Kramer
DM597

Regularized least-squares classification is one of the most promising alternatives to standard support vector machines, with the desirable property of closed-form solutions that can be obtained analytically and efficiently. While the supervised, mostly binary case has received tremendous attention in recent years, unsupervised multi-class settings have not yet been considered. In this work we present an efficient implementation for the unsupervised extension of the multi-class regularized least-squares classification framework, which is, to the best of the authors' knowledge, the first one in the literature addressing this task. The resulting kernel-based framework efficiently combines steepest descent strategies with powerful meta-heuristics for avoiding local minima. The computational efficiency of the overall approach is ensured through the application of matrix algebra shortcuts that render efficient updates of the intermediate candidate solutions possible. Our experimental evaluation indicates the potential of the novel method, and demonstrates its superior clustering performance over a variety of competing methods on real-world data sets.

11:40 A Novel Semantic Smoothing Method based on Higher Order Paths for Text Classification
Mithat Poyraz, Zeynep Hilal Urhan, and Murat Can Ganiz
DM686

It has been shown that Latent Semantic Indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". Inspired by this, a novel Bayesian framework for classification named Higher Order Naïve Bayes (HONB), which can explicitly make use of these higher-order relations, has been introduced previously. We present a novel semantic smoothing method named Higher Order Smoothing (HOS) for the Naive Bayes algorithm. HOS builds on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. Additionally, HOS takes the concept one step further and exploits the relationships between instances of different classes in order to improve the parameter estimation when dealing with insufficient labeled data. As a result, we are able to move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. The results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.

Clustering 1

Tuesday, 11 December
10:00 – 12:00
Room: Salle des nations I
Session Chair: Mohammad Hasan

10:00 Efficient Kernel Clustering Using Random Fourier Features
Radha Chitta, Rong Jin, and Anil K. Jain
DM275

Kernel clustering algorithms have the ability to capture the non-linear structure inherent in many real world data sets and thereby achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms.
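
To make the recipe concrete, a minimal sketch of the core idea (random Fourier features approximating an RBF kernel, followed by ordinary k-means) is given below; the feature dimension D, the kernel width sigma, and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def rff_kmeans(X, n_clusters, D=200, sigma=1.0, seed=0):
        """Approximate RBF-kernel clustering via random Fourier features."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), sample frequencies
        # from N(0, sigma^-2 I) and phases uniformly from [0, 2 pi).
        W = rng.normal(scale=1.0 / sigma, size=(d, D))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        # z(x)^T z(y) approximates k(x, y) in expectation.
        Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
        # Linear clustering in the transformed space.
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)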

10:20 Kernel-based Weighted Multi-view Clustering
Grigorios Tzortzis and Aristidis Likas
DM359

Exploiting multiple representations, or views, for the same set of instances within a clustering framework is a popular practice for boosting clustering accuracy. However, some of the available sources may be misleading (due to noise, measurement errors, etc.) in revealing the true structure of the data, and thus their inclusion in the clustering process may have a negative influence. This aspect seems to be overlooked in the multi-view literature, where all representations are considered equally. In this work, views are expressed in terms of given kernel matrices and a weighted combination of the kernels is learned in parallel to the partitioning. Weights assigned to kernels are indicative of the quality of the corresponding views' information. Additionally, the combination scheme incorporates a parameter that controls the admissible sparsity of the weights to avoid extremes and tailor them to the data. Two efficient iterative algorithms are proposed that alternate between updating the view weights and recomputing the clusters to optimize the intra-cluster variance from different perspectives. The conducted experiments reveal the effectiveness of our methodology compared to other multi-view methods.

10:40 A General and Scalable Approach to Mixed Membership Clustering
Frank Lin and William W. Cohen
DM481

Spectral clustering methods are elegant and effective graph-based node clustering methods, but they do not allow mixed membership clustering. We describe an approach that first transforms the data from a node-centric representation to an edge-centric one, and then uses this representation to define a scalable and competitive mixed membership alternative to spectral clustering methods. Experimental results show the proposed approach improves substantially over node clustering methods on mixed membership clustering tasks.

11:00 Robust Nonnegative Matrix Factorization via Half-Quadratic Minimization
Liang Du, Xuan Li, and Yi-Dong Shen
DM272

Nonnegative matrix factorization (NMF) is a popular technique for learning parts-based representations and data clustering. It usually uses the squared residuals to quantify the quality of the factorization, which is optimal specifically for zero-mean Gaussian noise and sensitive to outliers in general cases. In this paper, we propose a robust NMF method based on the correntropy induced metric, which is much more insensitive to outliers. A half-quadratic optimization algorithm is developed to solve the proposed problem efficiently. The proposed method is further extended to handle outlier rows by incorporating structural knowledge about the outliers. Experimental results on data sets with and without apparent outliers demonstrate the effectiveness of the proposed algorithms.
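
One way to picture the half-quadratic scheme is as alternating between correntropy-style reweighting of residuals and weighted multiplicative NMF updates; the sketch below is an illustrative toy version under that reading, with a fixed Gaussian kernel width sigma, and is not the paper's exact algorithm.

    import numpy as np

    def robust_nmf(X, k, iters=100, sigma=1.0, eps=1e-9):
        """Toy half-quadratic NMF: X ~ U V^T with outlier-downweighting."""
        rng = np.random.default_rng(0)
        n, m = X.shape
        U, V = rng.random((n, k)), rng.random((m, k))
        for _ in range(iters):
            R = X - U @ V.T
            # Half-quadratic step: entries with large residuals (outliers)
            # receive small weights under the correntropy-induced metric.
            W = np.exp(-R**2 / (2 * sigma**2))
            WX = W * X
            # Weighted multiplicative updates (Lee-Seung style); ratios of
            # nonnegative terms keep the factors nonnegative.
            U *= (WX @ V) / ((W * (U @ V.T)) @ V + eps)
            V *= (WX.T @ U) / ((W * (U @ V.T)).T @ U + eps)
        return U, V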

11:20 An Ellipsoidal K-means for document clustering
Faabom Dzogang, Christophe Marsala, Marie-Jeanne Lesot, and Maria Rifqi
DM915

We propose an extension of the spherical K-means algorithm to deal with settings where the number of data points is much smaller than the number of dimensions. We assume the data to lie in local and dense regions of the original space, and we propose to embed each cluster into its specific ellipsoid. A new objective function is introduced, and analytical solutions are derived for both the centroids and the associated ellipsoids. Furthermore, a study of the complexity of this algorithm shows that it is of the same order as the regular K-means algorithm. Results on both synthetic and real data show the efficiency of the proposed method.

11:40 Self-adjusting Models for Semi-supervised Learning in Partially-observed Settings
Ferit Akova, Murat Dundar, Yuan Qi, and Bartek Rajwa
DM931

We present a new direction for semi-supervised learning where self-adjusting generative models replace fixed ones and unlabeled data can potentially improve learning even when labeled data is only partially observed. We model each class data by a mixture model and use a hierarchical Dirichlet process (HDP) to model observed as well as unobserved classes. We extend the standard HDP model to accommodate unlabeled samples and introduce a new sharing strategy, within the context of Gaussian mixture models, that restricts sharing to covariance matrices while leaving the mean vectors free. Our research is mainly driven by real-world applications with evolving data-generating mechanisms where obtaining a fully-observed labeled data set is impractical. We demonstrate the feasibility of the proposed approach for semi-supervised learning in two such applications.

Social Networks 1

Tuesday, 11 December
10:00 – 12:00
Room: Salle des nations II
Session Chair: Ruoming Jin

10:00 Defining and Evaluating Network Communities based on Ground-truth
Jaewon Yang and Jure Leskovec
DM911

Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task mainly due to a plethora of definitions of a community, intractability of algorithms, issues with evaluation and the lack of a reliable gold-standard ground-truth. In this paper we study a set of 230 large real-world social, collaboration and information networks where nodes explicitly state their group memberships. For example, in social networks nodes explicitly join various interest-based social groups. We use such groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology which allows us to compare and quantitatively evaluate how different structural definitions of network communities correspond to ground-truth communities. We choose 13 commonly used structural definitions of network communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad-participation-ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than a hundred million nodes. The proposed method achieves 30% relative improvement over current local clustering methods.
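
For reference, conductance (one of the two best-performing definitions) scores a node set S by the fraction of edge volume that crosses the cut; a small self-contained sketch, with the graph given as a plain adjacency dict:

    def conductance(adj, S):
        """adj: dict node -> set of neighbors; S: iterable of nodes.
        Returns cut(S, complement) / min(vol(S), vol(complement));
        lower values indicate a better-separated community."""
        S = set(S)
        cut = sum(1 for u in S for v in adj[u] if v not in S)
        vol_S = sum(len(adj[u]) for u in S)
        vol_total = sum(len(adj[u]) for u in adj)
        denom = min(vol_S, vol_total - vol_S)
        return cut / denom if denom > 0 else float("inf")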

10:20 Community Preserving Lossy Compression of Social Networks
Hossein Maserrat and Jian Pei
DM381

Compression plays an important role in social network analysis from both practical and theoretical points of view. Although there are a few pioneering studies on social network compression, they mainly focus on lossless approaches. In this paper, we tackle the novel problem of community preserving lossy compression of social networks. The trade-off between space and information preserved in a lossy compression presents an interesting angle for social network analysis, and, at the same time, makes the problem very challenging. We propose a sequence graph compression approach, discuss the design of objective functions towards community preservation, and present an interesting and practically effective greedy algorithm. Our experimental results on both real data sets and synthetic data sets demonstrate the promise of our method.

10:40 Spotting Culprits in Epidemics: How many and Which ones?
B. Aditya Prakash, Jilles Vreeken, and Christos Faloutsos
DM399

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper we answer this question affirmatively, and give an efficient method called NETSLEUTH for the well-known Susceptible-Infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the Minimum Description Length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NETSLEUTH can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments show that our method detects seed nodes with high accuracy and correctly identifies their number automatically. Moreover, we show NETSLEUTH scales linearly in the number of nodes of the graph.

11:00 Detecting Anomalies in Bipartite Graphs with Mutual Dependency Principles
Hanbo Dai, Feida Zhu, Ee-Peng Lim, and HweeHwa Pang
DM545

Bipartite graphs can model many real life applications including users-rating-products in online marketplaces, users-clicking-webpages on the World Wide Web and users-referring-users in social networks. In these graphs, the anomalousness of nodes in one partite often depends on that of their connected nodes in the other partite. Previous studies have shown that this dependency can be positive (the anomalousness of a node in one partite increases or decreases along with that of its connected nodes in the other partite) or negative (the anomalousness of a node in one partite rises or falls in the opposite direction to that of its connected nodes in the other partite). In this paper, we unify both positive and negative mutual dependency relationships in an unsupervised framework for detecting anomalous nodes in bipartite graphs. This is the first work that integrates both mutual dependency principles to model the complete set of anomalous behaviors of nodes that cannot be identified by either principle alone. We formulate our principles and design an iterative algorithm to simultaneously compute the anomaly scores of nodes in both partites. Moreover, we mathematically prove that the ranking of nodes by anomaly scores in each partite converges. Our framework is examined on synthetic graphs and the results show that our model outperforms existing models with only positive or negative mutual dependency principles. We also apply our framework to two real life datasets: Goodreads as a users-rating-books setting and Buzzcity as a users-clicking-advertisements setting. The results show that our method is able to detect suspected spamming users and spammed books in Goodreads and achieve higher precision in identifying fraudulent advertisement publishers than existing approaches.

11:20 RankTopic: Ranking Based Topic Modeling
Dongsheng Duan, Yuhua Li, Ruixuan Li, Rui Zhang, and Aiming Wen
DM864

Topic modeling has become a widely used tool for document management due to its superior performance. However, there are few topic models distinguishing the importance of documents on different topics. In this paper, we investigate how to utilize the importance of documents to improve topic modeling and propose to incorporate link based ranking into topic modeling. Specifically, topical pagerank is used to compute the topic level ranking of documents, which indicates the importance of documents on different topics. By treating the topical ranking of a document as the probability of the document being involved in the corresponding topic, a generalized relation is built between ranking and topic modeling. Based on this relation, a ranking based topic model, RankTopic, is proposed. With RankTopic, a mutual enhancement framework is established between ranking and topic modeling. Extensive experiments on paper citation data and Twitter data are conducted to compare the performance of RankTopic with that of some state-of-the-art topic models. Experimental results show that RankTopic performs much better than some baseline models and is comparable with the state-of-the-art link combined relational topic model (RTM) in generalization performance, document clustering and classification by setting a proper balancing parameter. It is also demonstrated in both quantitative and qualitative ways that topics detected by RankTopic are more interpretable than those detected by some baseline models and still competitive with RTM.

11:40 Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos
Eric Nichols, Charles DuHadway, Hrishikesh Aradhye, and Richard Lyon
DM856

Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.

Efficient Algorithms 1

Tuesday, 11 December
13:30 – 15:30
Room: Salle des nations I
Session Chair: Joost Kok

13:30 Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems
Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon
DM332

Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence and is sensitive to its parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
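
The flavor of the rank-one, one-factor-at-a-time update can be sketched as follows on a dense matrix with an observation mask; this toy version (NumPy, a fixed number of inner sweeps) illustrates the update rule only, not the parallel CCD++ implementation.

    import numpy as np

    def ccd_pp(R, mask, k=10, lam=0.1, outer=20):
        """R: ratings matrix; mask: 1 where observed. Fits R ~ U V^T."""
        m, n = R.shape
        rng = np.random.default_rng(0)
        U, V = rng.random((m, k)) * 0.1, rng.random((n, k)) * 0.1
        E = mask * (R - U @ V.T)             # residual on observed entries
        for _ in range(outer):
            for f in range(k):               # update rank-one factors one by one
                u, v = U[:, f], V[:, f]
                E += mask * np.outer(u, v)   # add factor f back to the residual
                for _ in range(3):           # a few inner coordinate sweeps
                    u = (E @ v) / (lam + mask @ (v * v))
                    v = (E.T @ u) / (lam + mask.T @ (u * u))
                U[:, f], V[:, f] = u, v
                E -= mask * np.outer(u, v)   # subtract the updated factor
        return U, V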

13:50 Tuple MapReduce: Beyond classic MapReduce
Pedro Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, and Giovanna Di Marzo Serugendo
DM805

This paper proposes Tuple MapReduce, a new foundational model extending MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and higher-level needs required by programmers, such as compound records, sorting or joins. This paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of applications based on MapReduce and increases their flexibility, while maintaining Hadoop's performance.

14:10 Parallelization with Multiplicative Algorithms for Big Data Mining
Dijun Luo, Chris Ding, and Heng Huang
DM671

We propose a nontrivial strategy to parallelize a series of data mining and machine learning problems, including 1-class and 2-class support vector machines, nonnegative least squares problems, and l1 regularized regression (LASSO) problems. Our strategy fortunately leads to extremely simple multiplicative algorithms which can be straightforwardly implemented in parallel computational environments, such as MapReduce or CUDA. We provide rigorous analysis of the correctness and convergence of the algorithm. We demonstrate the scalability and accuracy of our algorithms in comparison with other current leading algorithms.

14:30 Efficient Pattern-Based Time Series Classification on GPU
Kai-Wei Chang, Biplab Deka, Wen-Mei W. Hwu, and Dan Roth
DM699

The time series shapelet discovery algorithm finds subsequences from a set of time series for use as primitives for time series classification. This algorithm has drawn a lot of interest because of the interpretability of its results. However, its computational requirements restrict the algorithm from dealing with large data sets and may limit its application in many domains. In this paper, we address this issue by redesigning the algorithm for implementation on highly parallel Graphics Processing Units (GPUs). We investigate several concepts of GPU programming and propose a dynamic programming algorithm that is suitable for implementation on GPUs. Results show that the proposed GPU implementation significantly reduces the running time of the shapelet discovery algorithm. For example, on the largest sample dataset from the original authors, the running time is reduced from half a day to two minutes.
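
The kernel being parallelized is the subsequence distance: for each candidate shapelet and each time series, the minimum Euclidean distance over all sliding windows. A plain CPU reference of that inner loop, as a sketch (the GPU version evaluates many such shapelet-series pairs concurrently):

    import numpy as np

    def subsequence_distance(shapelet, series):
        """shapelet, series: 1-D NumPy arrays. Minimum Euclidean distance
        between the shapelet and any window of the same length in the
        series; this O(len(series) * len(shapelet)) loop is exactly the
        work a GPU performs for many pairs in parallel."""
        m = len(shapelet)
        best = np.inf
        for start in range(len(series) - m + 1):
            window = series[start:start + m]
            best = min(best, np.sum((window - shapelet) ** 2))
        return np.sqrt(best)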

14:50 GPU-Accelerated Feature Selection for Outlier Detection using the Local Kernel Density Ratio
Fatemeh Azmandian, Ayse Yilmazer, Jennifer G. Dy, Javed A. Aslam, and David R. Kaeli
DM371

Effective outlier detection requires the data to be described by a set of features that captures the behavior of normal data while emphasizing those characteristics of outliers which make them different than normal data. In this work, we present a novel non-parametric evaluation criterion for filter-based feature selection which caters to outlier detection problems. The proposed method seeks the subset of features that represents the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared to popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold, as its performance scales very well in terms of the number of features, as well as the number of data points.

15:10 Scalable Training of Sparse Linear SVM
Guo-Xun Yuan and Kwan-Liu Ma
DM247

Sparse linear support vector machines have been widely applied to variable selection in many applications. For large data, managing the cost of training a sparse model with good prediction performance is an essential topic. In this work, we propose a scalable training algorithm for large-scale data with millions of examples and features. We develop a dual alternating direction method for solving L1-regularized linear SVMs. The learning procedure simply involves quadratic programming in the same form as the standard SVM dual, followed by a soft-thresholding operation. The proposed training algorithm possesses two favorable properties. First, it is a decomposable algorithm by which a large problem can be reduced to small ones. Second, the sparsity of intermediate solutions is maintained throughout the training process, as the soft-thresholding naturally promotes solution sparsity. We demonstrate through experiments that our method outperforms state-of-the-art approaches on large-scale benchmark data sets. We also show that it is well suited for training large sparse models on a distributed system.
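
The soft-thresholding step that follows each dual QP solve is the standard shrinkage operator; a short sketch (the threshold t would be derived from the L1 weight and the method's penalty parameter, which we do not reproduce here):

    import numpy as np

    def soft_threshold(w, t):
        """Elementwise S_t(w) = sign(w) * max(|w| - t, 0).
        Zeroes out small coefficients, which is what keeps the
        intermediate solutions sparse throughout training."""
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)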

Privacy and Security

Tuesday, 11 December
13:30 – 15:30
Room: Salle des nations II
Session Chair: Bettina Berendt

13:30 Reconstructing Graphs from Neighborhood Data
Dora Erdos, Rainer Gemulla, and Evimaria Terzi
DM387

Consider a social network and suppose that we are given the number of common friends between each pair of users. Can we reconstruct the underlying network? Similarly, consider a set of documents and the words that appear in them. If we know the number of common words for every pair of documents, as well as the number of common documents for every pair of words, can we infer which words appear in which documents? In this paper, we develop a general methodology for answering questions like the ones above. We formalize these questions in what we call the Reconstruct problem: Given information about the common neighbors of nodes in a network, our goal is to reconstruct the hidden binary matrix that indicates the presence or absence of relationships between individual nodes. We propose an effective and practical heuristic, which exploits properties of the singular value decomposition of the hidden binary matrix. More specifically, we show that using the available neighborhood information, we can reconstruct the hidden matrix by finding the components of its singular value decomposition and then combining them appropriately. Our extensive experimental study suggests that our methods are able to reconstruct binary matrices of different characteristics with up to 100% accuracy.
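
To make the SVD idea concrete: the eigenvectors of A·A^T and A^T·A provide candidate left and right singular vectors of the hidden binary matrix A, but each component carries a sign ambiguity. The toy sketch below resolves each sign by whichever choice makes the running reconstruction look more binary; this heuristic is our illustrative assumption, not the paper's combination procedure.

    import numpy as np

    def reconstruct(AAt, AtA, rank):
        """Recover a binary matrix from its co-neighborhood matrices."""
        lu, U = np.linalg.eigh(AAt)          # eigh: ascending eigenvalues
        _, V = np.linalg.eigh(AtA)
        U, lu = U[:, ::-1][:, :rank], lu[::-1][:rank]
        V = V[:, ::-1][:, :rank]
        s = np.sqrt(np.clip(lu, 0.0, None))  # singular values of A
        A = np.zeros((AAt.shape[0], AtA.shape[0]))
        for f in range(rank):
            best = None
            for sign in (+1.0, -1.0):
                cand = A + s[f] * np.outer(U[:, f], sign * V[:, f])
                # Prefer the sign whose entries sit closer to {0, 1}.
                cost = np.minimum(np.abs(cand), np.abs(cand - 1.0)).sum()
                if best is None or cost < best[0]:
                    best = (cost, cand)
            A = best[1]
        return (A > 0.5).astype(int)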

13:50 Differentially Private Histogram Publishing through Lossy Compression
Gergely Acs, Claude Castelluccia, and Rui Chen
DM928

Differential privacy has emerged as one of the most promising privacy models for private data release. It can be used to release different types of data, and, in particular, histograms, which provide useful summaries of a dataset. Several differentially private histogram releasing schemes have been proposed recently. However, most of them directly add noise to the histogram counts, resulting in undesirable accuracy. In this paper, we propose two sanitization techniques that exploit the inherent redundancy of real-life datasets in order to boost the accuracy of histograms. They lossily compress the data and sanitize the compressed data. Our first scheme is an optimization of the Fourier Perturbation Algorithm (FPA) presented in [RN10]. It improves the accuracy of the initial FPA by a factor of 10. The other scheme relies on clustering and exploits the redundancy between bins. Our extensive experimental evaluation over various real-life and synthetic datasets demonstrates that our techniques preserve very accurate distributions and considerably improve the accuracy of range queries over attributed histograms.
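
The FPA shape (compress the histogram to k Fourier coefficients, perturb them with Laplace noise, invert) can be sketched as below; note the noise scale shown is a placeholder assumption, and a real implementation must calibrate it to the query sensitivity and the privacy budget epsilon as derived in the paper.

    import numpy as np

    def fpa_histogram(hist, k, epsilon, noise_scale=None):
        """Lossy-compress a histogram with the FFT, add Laplace noise to
        the k retained coefficients, and reconstruct."""
        n = len(hist)
        F = np.fft.rfft(hist)
        F[k:] = 0.0                        # keep only the first k coefficients
        if noise_scale is None:
            noise_scale = k / epsilon      # placeholder; see the paper for
                                           # the correct sensitivity calibration
        rng = np.random.default_rng()
        F[:k] += (rng.laplace(0.0, noise_scale, size=k)
                  + 1j * rng.laplace(0.0, noise_scale, size=k))
        return np.fft.irfft(F, n=n)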

14:10 Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them)
Byron C. Wallace and Issa J. Dahabreh
DM578

Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increases this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates attained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. Motivated by our exposition of this issue, we propose a simple, effective and theoretically motivated method to mitigate the bias of probability estimates for imbalanced data that bags estimators calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.
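
Operationally the proposed fix is simple: train each base estimator on a class-balanced bootstrap sample and average the predicted probabilities. A sketch with scikit-learn, where the choice of logistic regression, the number of bags, and the resampling details are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def balanced_bagged_proba(X, y, X_test, n_bags=50, seed=0):
        """Average minority-friendly probability estimates over bags."""
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        m = min(len(pos), len(neg))
        probs = np.zeros(len(X_test))
        for _ in range(n_bags):
            # Balanced bootstrap: equal numbers drawn from each class.
            idx = np.concatenate([rng.choice(pos, m, replace=True),
                                  rng.choice(neg, m, replace=True)])
            clf = LogisticRegression().fit(X[idx], y[idx])
            probs += clf.predict_proba(X_test)[:, 1]
        return probs / n_bags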

14:30 A General Framework for Publishing Privacy Protected and Utility Preserved Graph
Mingxuan Yuan, Lei Chen, Weixiong Rao, and Hong Mei
DM494

The privacy protection of graph data has become more and more important in recent years. Many works have been proposed to publish a privacy-preserving graph. All these works aim to publish a graph that guarantees the protection of certain privacy properties with the smallest change to the original graph. However, there is no guarantee on how the utilities are preserved in the published graph. In this paper, we propose a general fine-grained adjusting framework to publish a privacy protected and utility preserved graph. With this framework, the data publisher can obtain a trade-off between privacy and utility according to his customized preferences. We use the protection of a weighted graph as an example to demonstrate the implementation of this framework.

14:42 Privacy-preserving SimRank over Distributed Information Network
Yu-Wei Chu, Chih-Hua Tai, Ming-Syan Chen, and Philip S. Yu
DM682

Information network analysis has drawn a lot of attention in recent years. Among all the aspects of network analysis, similarity measures of nodes have been shown to be useful in many applications, such as clustering, link prediction and community identification, to name a few. As linkage data in a large network is inherently sparse, collecting more data can improve the quality of the similarity measure. This gives different parties a motivation to cooperate. In this paper, we address the problem of link-based similarity measurement of nodes in an information network distributed over different parties. Concerning data privacy, we propose a privacy-preserving SimRank protocol based on fully-homomorphic encryption to provide cryptographic protection for the links.

14:54 Mining Permission Request Patterns from Android and Facebook Applications
Mario Frank, Ben Dong, Adrienne Porter Felt, and Dawn Song
DM370

Android and Facebook provide third-party applications with access to users' private data and the ability to perform potentially sensitive operations (e.g., post to a user's wall or place phone calls). As a security measure, these platforms restrict applications' privileges with permission systems: users must approve the permissions requested by applications before the applications can make privacy- or security-relevant API calls. However, recent studies have shown that users often do not understand permission requests and are unsure of which permissions are typical for applications. As a first step towards simplifying permission systems, we cluster a corpus of 188,389 Android applications and 27,029 Facebook applications to find patterns in permission requests. Using a method for Boolean matrix factorization to find overlapping clusters of permissions, we find that Facebook permission requests follow a clear structure that can be fitted well with only five patterns, whereas Android applications demonstrate more complex permission requests. We also find that low-reputation applications often deviate from the permission request patterns that we identified for high-reputation applications, which suggests that permission request patterns can be indicative of user satisfaction or application quality.

15:06 Multiple Kernel Learning Clustering with an Application to Malware
Blake Anderson, Curtis Storlie, and Terran Lane
DM716

With the increasing prevalence of richer, more complex data sources, learning with multiple views is becoming more widespread. Multiple kernel learning (MKL) has been developed to address this problem, but in general, the solutions provided by traditional MKL are restricted to a classification objective function. In this work, we develop a novel multiple kernel learning algorithm that is based on a spectral clustering objective function which is able to find an optimal kernel weight vector for the clustering problem. We go on to show how this optimization problem can be cast as a semidefinite program and efficiently solved using off-the-shelf interior point methods.

15:18 Risks of Friendships on Social Networks
Cuneyt Gurcan Akcora, Barbara Carminati, and Elena Ferrari
DM674

In this paper, we explore the risks of friends in social networks caused by their friendship patterns, by using real life social network data and starting from a previously defined risk model. Particularly, we observe that risks of friendships can be mined by analyzing users' attitude towards friends of friends. This allows us to give new insights into friendship and risk dynamics on social networks.

Spatial and Temporal

Tuesday, 11 December
13:30 – 15:30
Room: Rembrandt & Permeke
Session Chair: Takashi Washio

13:30 Effective and Robust Mining of Temporal Subspace Clusters
Hardy Kremer, Stephan Günnemann, Arne Held, and Thomas Seidl
DM667

Mining temporal multivariate data by clustering is an important research topic. In today's complex data, interesting patterns are often bound to neither the whole dimensional nor the whole temporal extent of the data domain. This challenge is met by temporal subspace clustering methods. Their effectiveness, however, is impeded by aspects unavoidable in real world data: misalignments between time series, for example caused by out-of-sync sensors, and measurement errors. Under these conditions, existing temporal subspace clustering approaches miss the patterns contained in the data. In this paper, we propose a novel clustering method that mines temporal subspace clusters reflected by sets of objects and relevant intervals. We enable flexible handling of misaligned time series by adaptively shifting time series in the time domain, and we achieve robustness to measurement errors by allowing certain fractions of deviating values at each relevant point in time. We show the effectiveness of our method in experiments on real and synthetic data.

13:50 Student-t based Robust Spatio-Temporal Prediction
Yang Chen, Feng Chen, Jing Dai, T. Charles Clancy, and Yao-Jan Wu
DM847

This paper describes an efficient and effective design of Robust Spatio-Temporal Prediction based on Student's t distribution, namely St-RSTP, to provide estimations based on observations over spatio-temporal neighbors. The proposed St-RSTP is more resilient to outliers or other small departures from model assumptions than its ancestor, the Spatio-Temporal Random Effects (STRE) model. STRE is a state-of-the-art statistical model with linear order complexity for large scale processing. However, it assumes Gaussian observations, which has the well-known limitation of non-robustness. In our St-RSTP design, the measurement error follows Student's t distribution instead of a traditional Gaussian distribution. This design reduces the influence of outliers and improves prediction quality, but renders the problem analytically intractable. We propose a novel approximate inference approach, which approximates the model in a form that separates the high dimensional latent variables into groups, and then estimates the posterior distributions of the different groups of variables separately in the framework of Expectation Propagation. As a desirable property, our approximate approach reduces to the standard STRE-based prediction when the degree of freedom of the Student's t distribution is set to infinity. Extensive experimental evaluations based on both simulation and real-life data sets demonstrated the robustness and the efficiency of our Student-t prediction model. The proposed approach provides critical functionality for stochastic processes on spatio-temporal data.

14:10 Robust Prediction and Outlier Detection for Spatial Dataset
Xutong Liu, Feng Chen, and Chang-Tien Lu
DM876

Spatial kriging is a widely used predictive model for spatial datasets. In the spatial kriging model, the observations are assumed to be Gaussian for computational convenience. However, its predictive accuracy can be significantly compromised if the observations are contaminated by outliers. This deficiency can be systematically addressed by increasing the robustness of the spatial kriging model using heavy-tailed distributions, such as the Huber, Laplace, and Student's t distributions. This paper presents a novel Robust and Reduced Rank Spatial Kriging Model (R3-SKM), which is resilient to the influences of outliers and allows for fast spatial inference. Furthermore, three effective and efficient algorithms are proposed based on the R3-SKM framework that perform robust parameter estimation, spatial prediction, and spatial outlier detection with linear-order time complexity. Extensive experiments on both simulated and real data sets demonstrated the robustness and efficiency of our proposed techniques.

14:30 Understanding Data Completeness in Network Monitoring Systems
Flip Korn, Ruilin Liu, and Hui (Wendy) Wang
DM884

In many networks including Internet Service Providers, transportation monitoring systems and the electric grid, measurements from a set of objects are continuously taken over time and used for important decisions such as provisioning and pricing. It is therefore vital to understand the completeness and reliability of such data. Given the large volume of data generated by these systems, rather than enumerating the times and objects incurring missing or spurious data, it is more effective to provide patterns (groups of tuples) concisely summarizing trends that may not otherwise be readily observable. In this paper, we define the Graph Tableau Discovery Problem where the measured tuples can be thought of as edges in a bipartite graph on an ordered attribute (time) and an unordered attribute (object identifiers). We define the problem of finding an optimal summary, show that it is NP-complete, and then provide a polynomial-time approximation algorithm with guarantees to find a good summary. Experiments on real and synthetic data demonstrate the effectiveness and efficiency of our approach.

14:50 Inferring the Root Cause in Road Traffic Anomalies
Sanjay Chawla, Yu Zheng, and Jiafeng Hu
DM711

We propose a novel two-step mining and optimization framework for inferring the root cause of anomalies that appear in road traffic data. We model road traffic as a time-dependent flow on a network formed by partitioning a city into regions bounded by major roads. In the first step we identify link anomalies based on their deviation from their historical traffic profile. However, link anomalies on their own shed very little light on what caused them. In the second step we take a generative approach by modeling the flow in the network in terms of the origin-destination (OD) matrix, which physically relates the latent flow between origins and destinations to the observable flow on the links. The key insight is that instead of using all of the link traffic as the observable vector we only use the link anomaly vector. By solving an L1 inverse problem we infer the routes (the origin-destination pairs) which gave rise to the link anomalies. Experiments on a very large GPS data set consisting of nearly eight hundred million data points demonstrate that we can discover routes which clearly explain the appearance of link anomalies. The use of optimization techniques to explain observable anomalies in a generative fashion is, to the best of our knowledge, entirely novel.
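
The second step amounts to an L1 inverse problem, min ||x||_1 subject to B x = a, where B maps latent OD-pair flows to links and a is the observed link-anomaly vector; with the split x = x+ - x- this becomes a linear program. A sketch using SciPy, with the equality-constrained form as our simplification of the paper's formulation:

    import numpy as np
    from scipy.optimize import linprog

    def l1_inverse(B, a):
        """Solve min ||x||_1  s.t.  B x = a  via  x = xp - xm, xp, xm >= 0."""
        m, n = B.shape
        c = np.ones(2 * n)                 # minimize sum(xp) + sum(xm)
        A_eq = np.hstack([B, -B])          # B (xp - xm) = a
        res = linprog(c, A_eq=A_eq, b_eq=a,
                      bounds=[(0, None)] * (2 * n), method="highs")
        xp, xm = res.x[:n], res.x[n:]
        return xp - xm                     # sparse route-anomaly vector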

15:10 A Stochastic Model for Context-Aware Anomaly Detection in Indoor Location Traces
Chuanren Liu, Hui Xiong, Yong Ge, Wei Geng, Matt Perkins
DM720

Rapid growth in the development of real-time location system solutions has led to an increased interest in indoor location-aware services, such as hospital asset management. Although there are extensive studies in the literature on the analysis of outdoor location traces, studies of indoor location traces remain sparse and fragmented. To that end, in this paper, we provide a focused study of indoor location traces collected by the sensors attached to medical devices in a hospital environment. Along this line, we first introduce some unique properties of these indoor location traces. We show that they can capture the movement patterns of the medical devices, which are tightly coupled with the work flow in the controlled hospital environment. Based on this observation, we propose a stochastic model for context-aware anomaly detection in indoor location traces, which exploits the hospital work flow and models the movements of medical devices as transitions in finite state machines. In detail, we first develop a density-based method to identify the hotspots filled with high-level abnormal activities in the indoor environment. The discovered hotspots serve as the context for nearby trajectories. Then, we introduce an N-gram based method for measuring the degree of anomaly based on the detected hotspots, which is able to detect missing events, possibly due to devices being stolen. Moreover, to address the noisy nature of indoor sensor networks, we also propose an iterative algorithm to estimate the transition probabilities. This algorithm effectively recovers the missing location records, which are critical for the abnormality estimation. Finally, the experimental results on real-world data sets validate the effectiveness of the proposed context-aware anomaly detection method for identifying abnormal events.
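
The N-gram scoring step can be pictured as follows: estimate transition probabilities over hotspot symbols from normal traces, then score a trace by its average negative log-likelihood per transition. A bigram sketch, with Laplace smoothing and the anomaly threshold left to the caller as illustrative choices:

    from collections import Counter
    import math

    def train_bigram(traces, alpha=1.0):
        """traces: list of symbol sequences observed under normal operation."""
        pair_counts, ctx_counts = Counter(), Counter()
        vocab = {s for t in traces for s in t}
        for t in traces:
            for a, b in zip(t, t[1:]):
                pair_counts[(a, b)] += 1
                ctx_counts[a] += 1
        def prob(a, b):   # Laplace-smoothed transition probability
            return (pair_counts[(a, b)] + alpha) / (ctx_counts[a] + alpha * len(vocab))
        return prob

    def anomaly_score(trace, prob):
        """Average negative log-likelihood per transition; higher = stranger."""
        nll = [-math.log(prob(a, b)) for a, b in zip(trace, trace[1:])]
        return sum(nll) / max(len(nll), 1)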

Models and Algorithms for New Applications

Tuesday, 11 December
16:00 – 18:00
Room: Salle des nations I
Session Chair: Matthijs van Leeuwen

16:00 ETM: Entity Topic Models for Mining Documents Associated with Entities
Hyungsul Kim, Yizhou Sun, Julia Hockenmaier, and Jiawei Han
DM822

Topic models, which factor each document into different topics and represent each topic as a distribution of terms, have been widely and successfully used to better understand collections of text documents. However, documents are also associated with further information, such as the set of real-world entities mentioned in them. For example, news articles are usually related to several people, organizations, countries or locations. Since those associated entities carry rich information, it is highly desirable to build more expressive, entity-based topic models, which can capture the term distributions for each topic, each entity, as well as each topic-entity pair. In this paper, we therefore introduce a novel Entity Topic Model (ETM) for documents that are associated with a set of entities. ETM not only models the generative process of a term given its topic and entity information, but also models the correlation of entity term distributions and topic term distributions. A Gibbs sampling-based algorithm is proposed to learn the model. Experiments on real datasets demonstrate the effectiveness of our approach over several state-of-the-art baselines.

16:20 Efficient Algorithms for Finding Richer Subgroup Descriptions in Numeric and Nominal Data
Michael Mampaey, Siegfried Nijssen, Ad Feelders, and Arno Knobbe
DM694

Subgroup discovery systems are concerned with finding interesting patterns in labeled data. How these systems deal with numeric and nominal data has a large impact on the quality of their results. In this paper, we consider two ways to extend the standard pattern language of subgroup discovery: using conditions that test for interval membership for numeric attributes, and value set membership for nominal attributes. We assume a greedy search setting, that is, iteratively refining a given subgroup, with respect to a (convex) quality measure. For numeric attributes, we propose an algorithm that finds the optimal interval in linear (rather than quadratic) time, with respect to the number of examples and split points. Similarly, for nominal attributes, we show that finding the optimal set of values can be achieved in linear (rather than exponential) time, with respect to the number of examples and the size of the domain of the attribute. These algorithms operate by only considering subgroup refinements that lie on a convex hull in ROC space, thus significantly narrowing down the search space. We further provide efficient algorithms specifically for the popular Weighted Relative Accuracy quality measure, taking advantage of some of its properties. Our algorithms are shown to perform well in practice, and furthermore provide additional expressive power leading to higher-quality results.
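
For the Weighted Relative Accuracy measure in particular, the linear-time interval result has a compact intuition: WRAcc of a subgroup decomposes into a sum of per-example weights (y_i - p/N)/N, so the best interval over a sorted numeric attribute is a maximum-sum contiguous run of distinct values, found by a single Kadane-style scan. The sketch below is specific to WRAcc; the paper's algorithms handle general convex measures via the ROC convex hull.

    import numpy as np

    def best_wracc_interval(x, y):
        """x: numeric attribute, y: 0/1 labels (NumPy arrays). Returns
        (lo, hi, wracc) for the interval lo <= x <= hi maximizing WRAcc."""
        N, p = len(y), y.sum()
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        # WRAcc(S) = sum over i in S of (y_i - p/N) / N, so the optimal
        # interval is a maximum-sum run over the sorted distinct values.
        vals, starts = np.unique(xs, return_index=True)
        w = (ys - p / N) / N
        group_w = np.add.reduceat(w, starts)   # one weight per distinct value
        best = (None, None, -np.inf)
        run_sum, run_start = 0.0, 0
        for i, g in enumerate(group_w):
            if run_sum <= 0:                   # restart the run at this value
                run_sum, run_start = g, i
            else:
                run_sum += g
            if run_sum > best[2]:
                best = (vals[run_start], vals[i], run_sum)
        return best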

16:40 Sparse Group Selection on Fused Lasso Components for Identifying Group-specific DNA Copy Number Variations
Ze Tian, Huanan Zhang, and Rui Kuang
DM380

Detecting DNA copy number variations (CNVs) from arrayCGH or genotyping-array data to correlate with cancer outcomes is crucial for understanding the molecular mechanisms underlying cancer. Previous methods either focus on detecting CNVs in each individual patient sample or common CNVs across all the patient samples. These methods ignore the discrepancies introduced by the heterogeneity in the patient samples, which implies that common CNVs might only be shared within some groups of samples instead of all samples. In this paper, we propose a latent feature model that couples sparse sample group selection with fused lasso on CNV components to identify group-specific CNVs. Assuming a given group structure on patient samples by clinical information, sparse group selection on fused lasso (SGS-FL) identifies the optimal latent CNV components, each of which is specific to the samples in one or several groups. The group selection for each CNV component is determined dynamically by an adaptive algorithm to achieve a desired sparsity. Simulation results show that SGS-FL can more accurately identify the latent CNV components when there is a reliable underlying group structure in the samples. In the experiments on arrayCGH breast cancer and bladder cancer datasets, SGS-FL detected CNV regions that are more relevant to cancer, and provided latent feature weights that can be used for better sample classification.

17:00 Multiplicative Algorithms for Constrained Non-negative Matrix Factorization
Chengbin Peng, Ka-Chun Wong, Alyn Rockwood, Xiangliang Zhang, Jinling Jiang, and David E. Keyes
DM334

Non-negative matrix factorization (NMF) provides the advantage of parts-based data representation through additive-only combinations. It has been widely adopted in areas like item recommendation, text mining, data clustering, speech denoising, etc. In this paper, we provide an algorithm that allows the factorization to have linear or approximately linear constraints with respect to each factor. We prove that if the constraint function is linear, algorithms within our multiplicative framework will converge. This theory supports a large variety of equality and inequality constraints, and can facilitate application of NMF to a much larger domain. Taking the recommender system as an example, we demonstrate how a specialized weighted and constrained NMF algorithm can be developed to fit exactly for the problem, and the tests justify that our constraints improve the performance for both weighted and unweighted NMF algorithms under several different metrics. In particular, on the MovieLens data with 94% of items, the constrained NMF improves the recall rate by 3% compared to SVD50 and by 45% compared to SVD150, which were reported as the best two in the top-N metric.
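
For orientation, the unconstrained multiplicative baseline that such frameworks extend is the classic Lee-Seung update for the squared-error objective, sketched below; folding linear constraints on the factors into updates of this multiplicative form is the paper's contribution and is not reproduced here.

    import numpy as np

    def nmf_multiplicative(X, k, iters=200, eps=1e-9):
        """Lee-Seung multiplicative updates for X ~ W H with W, H >= 0."""
        rng = np.random.default_rng(0)
        n, m = X.shape
        W, H = rng.random((n, k)), rng.random((k, m))
        for _ in range(iters):
            # Ratios of nonnegative terms keep nonnegativity by construction.
            H *= (W.T @ X) / (W.T @ W @ H + eps)
            W *= (X @ H.T) / (W @ H @ H.T + eps)
        return W, H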

17:12 Rapid and Robust Denoising of Pyrosequenced Amplicons for Metagenomics
Byunghan Lee, Joonhong Park, and Sungroh Yoon
DM446

Metagenomic sequencing has become a crucial tool for obtaining a gene catalogue of operational taxonomic units (OTUs) in a microbial community. High-throughput pyrosequencing is a next-generation sequencing technique very popular in microbial community analysis due to its longer read length compared to alternative methods. Computational tools are indispensable for processing raw data from pyrosequencers, and in particular, noise removal is a critical data-mining step for obtaining robust sequence reads. However, the slow rate of existing denoisers has become a bottleneck in the whole pyrosequencing process and has hindered efforts to improve robustness. To address these issues, we propose a new approach that can accelerate the denoising process substantially. By using our approach, it takes only about 2 hours to denoise 62,873 pyrosequenced amplicons from a mixture of 91 full-length 16S rRNA clones, whereas it would take nearly 2.5 days with existing software tools. Furthermore, our approach can effectively reduce overestimation of the number of OTUs, producing 6.7 times fewer species-level OTUs on average than a state-of-the-art alternative under the same conditions. Leveraged by our approach, we hope that metagenomic sequencing will become an even more appealing tool for microbial community analysis.

17:24 Multi-task Learning for Classifying Proteins with Dual Hierarchies
Anveshi Charuvaka and Huzefa Rangwala
DM724

Several biological databases organize information in taxonomies/hierarchies. These databases differ in terms of curation process, input data, coverage and annotation errors. SCOP and CATH are examples of two databases that classify proteins hierarchically into structurally related groups based on experimentally determined structures. Given the large number of protein sequences with unavailable structure, there is a need to develop prediction methods to classify protein sequences into structural classes. We have developed a novel classification approach that utilizes the underlying relationships across multiple hierarchical source databases within a multi-task learning (MTL) framework. MTL is used to simultaneously learn multiple related tasks, and has been shown to improve generalization performance. Specifically, we have developed and evaluated an MTL approach for predicting the structural class, as defined by two hierarchical databases, CATH and SCOP, using protein sequence information only. We define one task per node of the hierarchies and formulate the MTL problem as a combination of these binary classification tasks. Our experimental evaluation demonstrates that the MTL approach that integrates both hierarchies outperforms the baseline approach that trains independent models per task, as well as an MTL approach that integrates tasks across a single hierarchical database. We also performed extensive experiments evaluating different regularization penalties and incorporating different task relationships, achieving superior classification performance.

17:36 A New Proposal for Score Normalization in Biometric Signature Recognition Based on Client Threshold Prediction
Carlos Vivaracho-Pascual, Arancha Simon-Hurtado, Esperanza Manso-Martinez, and Juan Pascual-Gaspar
DM579

Score normalization is a common technique in pattern recognition for standardizing classifier output ranges so that, for example, these outputs can be fused. Score normalization is a very important part of a biometric recognition system, especially in systems based on behavioral traits such as handwritten signature or voice, as it conditions the final system performance. Consequently, many works focus on this problem. A successful new approach to client threshold prediction, based on Multiple Linear Prediction, has been presented in recent works. Here, a new approach for score normalization based on this proposal is presented for biometric handwritten signature user verification. This proposal is compared with state-of-the-art methods, achieving improvements of 19% and 16% in Equal Error Rate (EER), and 60% and 26% in Detection Cost Function (DCF), for random and skilled forgeries, respectively.

17:48 Analysis of Temporal High-Dimensional Gene Expression Data for Identifying Informative Biomarker Candidates
Qiang Lou and Zoran Obradovic
DM859

Identifying informative biomarkers from a large pool of candidates is the key step for accurate prediction of an individual's health status. In clinical applications, traditional static feature selection methods that flatten the temporal data cannot be directly applied, since the patient's observed clinical condition is a temporal multivariate time series in which different variables capture various stages of temporal change in the patient's health status. In this study, in order to identify informative genes in temporal microarray data, a margin-based feature selection filter is proposed. The proposed method is based on well-established machine learning techniques without any assumptions about the data distribution. The objective function of temporal margin-based feature selection is defined to maximize each subject's temporal margin in its own relevant subspace. In the objective function, the uncertainty in calculating nearest neighbors is taken into account by considering the change in feature weights in each iteration. A fixed-point gradient descent method is proposed to solve the formulated objective function. The experimental results on both synthetic and real data provide evidence that the proposed method can identify more informative features than alternatives that flatten the temporal data in advance.

Data Pre-Processing

Tuesday, 11 December
16:00 – 18:00
Room: Salle des nations II
Session Chair: Pauli Miettinen

16:00 Estimating Local Information Trustworthiness via Multi-Source Joint Matrix Factorization
Liang Ge, Jing Gao, Xiao Yu, Wei Fan, and Aidong Zhang
DM784

We investigate how to estimate information trustworthiness by considering multiple information sources jointly in a latent matrix space. We particularly focus on user review and recommendation systems, as there are multiple platforms where people can rate items and services that they have purchased, and many potential customers rely on these opinions to make decisions. Information trustworthiness is a serious problem because ratings are generated freely by end-users, and many spammers take advantage of this freedom of speech to promote their business or damage the reputation of competitors. We propose to use customer ratings alone to estimate each individual source's reliability by exploring correlations among multiple sources. Ratings of items are provided by users of diverse tastes and styles, and thus may appear noisy and conflicting across sources; however, they share some underlying common behavior. Therefore, we can group users based on their opinions, and a source is reliable on an item if the opinions given by its latent groups are consistent across platforms. Inspired by this observation, we solve the problem with a two-step model -- a joint matrix factorization procedure followed by reliability score computation. We propose two effective approaches to decompose rating matrices as the products of group membership and group rating matrices, and then compute consistency degrees from the group rating matrices as source reliability scores. We conduct experiments on both synthetic data and real user ratings collected from Orbitz, Priceline and TripAdvisor for all the hotels in Las Vegas and New York City. Results show that the proposed method gives accurate estimates of source reliability and thus successfully identifies inconsistent, conflicting and unreliable information.
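
A hedged sketch of the two-step model, assuming for simplicity that all sources rate the same aligned set of items: a shared group-membership factor is fit jointly across sources, and a source's reliability on an item is its agreement with the cross-source mean of the group-rating factors. The alternating least-squares updates and all names are illustrative, not the paper's exact decomposition.

    import numpy as np

    def joint_factorize(ratings, k=3, iters=50, seed=1):
        # ratings: list of (n_users x n_items) matrices, one per source,
        # with items assumed aligned across sources (a simplification).
        rng = np.random.default_rng(seed)
        U = rng.random((ratings[0].shape[0], k))       # shared memberships
        Gs = [rng.random((k, R.shape[1])) for R in ratings]
        for _ in range(iters):
            for i, R in enumerate(ratings):            # per-source group ratings
                Gs[i] = np.clip(np.linalg.lstsq(U, R, rcond=None)[0], 0, None)
            stackG, stackR = np.hstack(Gs), np.hstack(ratings)
            U = np.clip(np.linalg.lstsq(stackG.T, stackR.T, rcond=None)[0].T,
                        0, None)                       # refit shared factor
        return U, Gs

    def reliability(Gs):
        # A source is scored per item by how closely its group ratings track
        # the cross-source consensus: higher (less negative) = more reliable.
        mean_G = np.mean(Gs, axis=0)
        return [-np.sum((G - mean_G) ** 2, axis=0) for G in Gs]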

16:12 Heterogeneous Constraint Propagation with Constrained Sparse Representation
Zhiwu Lu and Yuxin Peng
DM568

This paper presents a graph-based method for heterogeneous constraint propagation on multi-modal data using constrained sparse representation. Since heterogeneous pairwise constraints are defined over pairs of data points from different modalities, heterogeneous constraint propagation is more challenging than the traditional homogeneous constraint propagation on single-modal data, which has been studied extensively in previous work. The main difficulty of heterogeneous constraint propagation lies in how to effectively propagate heterogeneous pairwise constraints across different modalities. To address this issue, we decompose heterogeneous constraint propagation into a set of semi-supervised learning subproblems which can then be efficiently solved by graph-based label propagation. Moreover, we develop a constrained sparse representation method for graph construction over each modality using homogeneous pairwise constraints. Experimental results on cross-modal retrieval show the superior performance of our heterogeneous constraint propagation.

16:24 Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach
Yuanhong Wang, Yang Liu, and Xiaohui Yu
DM974

Collaborative filtering (CF) aims to produce user-specific recommendations based on other users' ratings of items. Most existing CF methods rely only on users' overall ratings of items, ignoring the variety of opinions users may have towards different aspects of the items. Using the movie domain as a case study, we propose a framework that is able to capture users' opinions on different aspects from the textual reviews, and use that information to improve the effectiveness of CF. This framework has two components, an opinion mining component and a rating inference component. The former extracts and summarizes the opinions on multiple aspects from the reviews, generating ratings on the various aspects. The latter component infers the overall ratings of items based on the aspect ratings, which forms the basis for item recommendation. Our core contribution is the proposal of a tensor factorization approach for the rating inference. Operating on the tensor composed of the overall and aspect ratings, this approach is able to capture the intrinsic relationships between users, items, and aspects, and provide accurate predictions of unknown ratings. Experiments on a movie dataset show that our proposal significantly improves the prediction accuracy compared with two baseline methods.
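
To make the tensor step concrete, below is a generic CP decomposition by alternating least squares over a dense users x items x aspects rating tensor; predictions are then read off the reconstruction. This is a textbook CP-ALS sketch under the assumption of a fully observed tensor, not the authors' formulation, which would also have to handle missing entries.

    import numpy as np

    def unfold(T, mode):
        # Mode-m matricization of a 3-way tensor.
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def khatri_rao(A, B):
        # Column-wise Kronecker product of two factor matrices.
        return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

    def cp_als(T, rank=4, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        F = [rng.random((d, rank)) for d in T.shape]   # user/item/aspect factors
        for _ in range(iters):
            for m in range(3):
                others = [F[i] for i in range(3) if i != m]
                kr = khatri_rao(others[0], others[1])
                gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
                F[m] = unfold(T, m) @ kr @ np.linalg.pinv(gram)
        return F

    # Reconstructed ratings: T_hat[u, i, a] = sum_r U[u,r] * I[i,r] * A[a,r]
    # T_hat = np.einsum('ur,ir,ar->uia', *cp_als(T))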

16:36 Signal Disaggregation via Sparse Coding with Featured Discriminative Dictionary
Bingsheng Wang, Feng Chen, Haili Dong, Arnold P. Boedihardjo, Chang-Tien Lu
DM954

As the issue of freshwater shortage grows daily, it is critical to take effective measures for water conservation. Previous studies suggest that feedback on device-level consumption can lead to significant conservation of freshwater. However, current smart meter deployments only produce low-sample-rate aggregated data. In this paper, we examine the task of separating whole-home water consumption into its component appliances. A key challenge is to address the unique features of low-sample-rate data. To this end, we propose Sparse Coding with Featured Discriminative Dictionary (SCFDD), which incorporates inherent shape and activation features to capture the discriminative characteristics of devices. In addition, extensive experiments were performed to validate the effectiveness of SCFDD.

16:48 Hashing with Generalized Nystrom Approximation
Jeong-Min Yun, Saehoon Kim, and Seungjin Choi
DM376

Hashing, which involves learning binary codes to embed high-dimensional data into a similarity-preserving low-dimensional Hamming space, is often formulated as linear dimensionality reduction followed by binary quantization. Linear dimensionality reduction, based on the maximum variance formulation, requires leading eigenvectors of the data covariance or graph Laplacian matrix. Computing leading singular vectors or eigenvectors in the case of high dimension and large sample size is the main bottleneck in most data-driven hashing methods. In this paper we address the use of the generalized Nyström method, where a subset of rows and columns is used to approximately compute the leading singular vectors of the data matrix, in order to improve the scalability of hashing methods for high-dimensional data with large sample size. In particular, we validate the useful behavior of generalized Nyström approximation with uniform sampling in the case of a recently developed hashing method based on principal component analysis (PCA) followed by an iterative quantization, referred to as PCA+ITQ, developed by Gong and Lazebnik. We compare the performance of generalized Nyström approximation with uniform and non-uniform sampling to the full singular value decomposition (SVD) method, confirming that uniform sampling improves the computational and space complexities dramatically while sacrificing little performance. In addition, we present low-rank approximation error bounds for generalized Nyström approximation with uniform sampling, which is not a trivial extension of available results for the non-uniform sampling case.
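
A small sketch of the core idea, under the assumption that a plain pseudoinverse-based generalized Nyström surrogate suffices: sample rows and columns uniformly, form the low-rank surrogate C pinv(W) R, and read off its leading singular vectors without ever decomposing the full matrix. The names and the QR-based shortcut are illustrative; a hashing pipeline would then binarize the resulting projections (e.g., with np.sign).

    import numpy as np

    def nystrom_singular_vectors(X, n_rows, n_cols, k, seed=0):
        # Approximate the top-k left singular vectors of X from uniformly
        # sampled rows and columns (generalized-Nystrom-style sketch).
        rng = np.random.default_rng(seed)
        rows = rng.choice(X.shape[0], n_rows, replace=False)
        cols = rng.choice(X.shape[1], n_cols, replace=False)
        C = X[:, cols]                    # all rows, sampled columns
        W = X[np.ix_(rows, cols)]         # row/column intersection block
        R = X[rows, :]                    # sampled rows, all columns
        A = C @ np.linalg.pinv(W)         # X is approximated by A @ R
        Q, T = np.linalg.qr(A)            # SVD of A @ R via the small T @ R
        Um, s, _ = np.linalg.svd(T @ R, full_matrices=False)
        return Q @ Um[:, :k], s[:k]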

17:00 Low-Rank Transfer Subspace Learning
Ming Shao, Carlos Castillo, Zhenghong Gu, and Yun Fu
DM459

One of the most important challenges in machine learning is performing effective learning when only limited training data are available. An important case, however, is when sufficient training data are available from other (source) domains. Transfer learning aims at finding ways to transfer knowledge learned from a source domain to a target domain by handling the subtle differences between them. In this paper, we propose a novel framework to solve the aforementioned knowledge transfer problem via low-rank representation constraints. This is achieved by finding an optimal subspace where each datum in the target domain can be linearly represented by the corresponding subspace in the source domain. Extensive experiments on several databases, i.e., the Yale B, CMU PIE, and UB KinFace databases, validate the effectiveness of the proposed approach and show its superiority to existing, well-established methods.

17:12 Geodesic based semi-supervised multi-manifold feature extraction
Mingyu Fan, Xiaoqin Zhang, Zhouchen Lin, Zhongfei Zhang, and Hujun Bao
DM538

Manifold learning is an important feature extraction approach in data mining. This paper presents a new semi-supervised manifold learning algorithm, called Multi-Manifold Discriminative Analysis (Multi-MDA). The proposed method is designed to explore the discriminative information hidden in geodesic distances. Its main contributions are: 1) we propose a semi-supervised graph construction method which can effectively capture the multi-manifold structure of the data; 2) each data point is replaced with an associated feature vector whose elements are the graph distances from it to the other data points, so that information about the nonlinear structure is contained in the feature vectors, which is helpful for classification; 3) we propose a new semi-supervised linear dimension reduction method for the feature vectors which introduces class information into the manifold learning process and establishes an explicit dimension reduction mapping. Experiments on benchmark data sets show the effectiveness of the proposed method.

17:24 Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining
Phong Nguyen, Jun Wang, Melanie Hilario, and Alexandros Kalousis
DM663

The notion of meta-mining has appeared recently and extends traditional meta-learning in two ways. First, it provides support for the whole data-mining process. Second, it pries open the so-called algorithm black-box approach, in which algorithms and workflows also have descriptors. With the availability of descriptors both for datasets and data-mining workflows, we are faced with a problem whose nature is much more similar to those appearing in recommendation systems. In order to account for the meta-mining specificities, we derive a novel metric-based learning recommender approach. Our method learns two homogeneous metrics, one in the dataset space and one in the workflow space, and a heterogeneous one in the dataset-workflow space. All learned metrics reflect similarities established from the dataset-workflow preference matrix, which is constructed from the performance results obtained by applying workflows to datasets. We demonstrate our method on meta-mining over biological (microarray) datasets. The application of our method is not limited to the meta-mining problem; its formulation is general enough to apply to problems with similar requirements.

17:36 IceCube: Efficient Targeted Mining in Data Cubes
Shrutendra K Harsola, Prasad M Deshpande, and Jayant R Haritsa
DM932

We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over the cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y}R, where R is a convex region in the cube and X -> Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the Cell Union technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution time, especially for large and complex data cubes.

17:48 Direct Discovery of High Utility Itemsets without Candidate Generation
Junqiang Liu, Ke Wang, and Benjamin C.M. Fung
DM554

Utility mining emerged recently to address the limitation of frequent itemset mining by introducing interestingness measures that reflect both the statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one, as no anti-monotone property holds for the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which suffers from scalability issues due to the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune the search space by utility upper bounding, and to maintain original utility information during mining via a novel data structure. Such a data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance efficiency significantly by introducing recursive irrelevant item filtering for sparse data and a lookahead strategy for dense data. Extensive experiments on sparse and dense, synthetic and real data show that our algorithm outperforms the state-of-the-art algorithms by over one order of magnitude.

Foundations and Quality Control

Tuesday, 11 December
16:00 – 18:00
Room: Rembrandt & Permeke
Session Chair: Jilles Vreeken

16:00 Hierarchical Multi-task Learning with Application to Wafer Quality Prediction
Jingrui He and Yada Zhu
DM416

Many real problems of multi-task learning exhibit hierarchical task relatedness. In other words, the tasks are partitioned into multiple groups: different tasks within the same group are related on the task level, whereas different groups are related on the group level. For example, in semiconductor manufacturing, the problem of wafer quality prediction can be considered hierarchical multi-task learning, where each task corresponds to a single side of a chamber with side-level relatedness, and a group of tasks corresponds to a chamber of multiple sides with chamber-level relatedness. Motivated by this application, in this paper, we propose an optimization framework for hierarchical multi-task learning which partitions all the input features into two sets based on their characteristics, and models task-level and group-level relatedness by imposing different constraints on the coefficient vectors of the two sets. This differs from existing work on task clustering, where the goal is to uncover the grouping of tasks, the tasks do not exhibit group-level relatedness, and the input features are not discriminated in the prediction model. To solve this framework, we propose the HEAR algorithm based on block coordinate descent, and demonstrate its effectiveness on both synthetic and real data sets from the domains of semiconductor manufacturing and document classification.

16:20 Stream Classification with Recurring and Novel Class Detection using Class-Based Ensemble
Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham
DM284

Concept-evolution has recently received a lot of attention in the context of mining data streams. Concept-evolution occurs when a new class evolves in the stream. Although many recent studies address this issue, most of them do not consider the scenario of recurring classes in the stream. A class is called recurring if it appears in the stream, disappears for a while, and then reappears. Existing data stream classification techniques either misclassify recurring class instances as another class, or falsely identify recurring classes as novel. This increases the prediction error of the classifiers, and in some cases causes unnecessary waste of memory and computational resources. In this paper we address the recurring class issue by proposing a novel "class-based" ensemble technique, which replaces traditional "chunk-based" ensemble approaches and correctly distinguishes between a recurring class and a novel one. We analytically and experimentally confirm the superiority of our method over state-of-the-art techniques.

16:40 Assessing the Significance of Data Mining Results on Graphs with Feature Vectors
Stephan Günnemann, Phuong Dao, Mohsen Jamali, and Martin Ester
DM773

Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at first glance, they can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods to assess data mining results on vector data or network data have been proposed. In many applications, however, both sources are given simultaneously. Since these sources are rarely independent of each other but highly correlated, naively applying existing randomization methods to each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary feature vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization exploits adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while simultaneously using both sources is beneficial, often one source of information is dominant in determining the mining results.

17:00 Scaling Inference for Markov Logic via Dual Decomposition
Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik
DM377

Markov logic is a knowledge-representation language that allows one to specify large graphical models. However, the resulting large graphical models can make inference for Markov logic a computationally challenging problem. Recently, dual decomposition (DD) has become a popular approach for scalable inference on graphical models. We study how to apply DD to scale up inference in Markov logic. A standard approach for DD first partitions a graphical model into multiple tree-structured subproblems. We apply this approach to Markov logic and show that DD can outperform prior inference approaches. Nevertheless, we observe that the standard approach for DD is suboptimal, as it does not exploit the rich structure often present in the Markov logic program. Thus, we describe a novel decomposition strategy that partitions a Markov logic program into parts based on its structure. A crucial advantage of our approach is that we can use specialized algorithms for portions of the input problem -- some of which have been studied for decades, e.g., coreference resolution. Empirically, we show that our program-level decomposition approach outperforms both non-decomposition and graphical-model-based decomposition approaches to Markov logic inference on several data-mining tasks.

17:12 Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts
Nedim Lipka, Benno Stein, and James G. Shanahan
DM786

Automated text classification is one of the most important learning technologies for fighting information overload. However, the information society is confronted not only with an information flood but also with increasing "information volatility", by which we understand the fact that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among others, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution, we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering, both theoretically and on standard text corpora. An important result is a heuristic for estimating expected effectiveness that is based solely on the initial test sample and can be computed without resampling.

17:24 Adapting Component Analysis
Fatemeh Dorri and Ali Ghodsi
DM747

A central problem in machine learning is to predict the response variables of a test set given the training data and its corresponding response variables. A predictive model can perform satisfactorily only if the training data is an appropriate representative of the test data. This is usually reflected in the assumption that the training data and the test data are drawn from the same underlying probability distribution. However, this assumption may not hold in many applications for various reasons. We propose a method based on kernel distribution embedding and the Hilbert-Schmidt Independence Criterion (HSIC) to address this problem. The proposed method explores a new representation of the data in a new feature space with two properties: (i) the distributions of the training and the test data sets are as close as possible in the new feature space, and (ii) the important structural information of the data is preserved. The algorithm can reduce the dimensionality of the data while preserving the aforementioned properties, and can therefore be seen as a dimensionality reduction method as well. Our method has a closed-form solution, and the experimental results show that it works well in practice.

17:36 Cost-Sensitive Online Classification and Its Application to Online Anomaly Detection
Jialei Wang, Peilin Zhao, and Steven C.H. Hoi
DM468

Both cost-sensitive classification and online learning have been studied extensively in the data mining and machine learning communities, yet there has been very limited comprehensive study of their important intersection: cost-sensitive online classification. In this paper, we formally study this problem and propose a new framework for cost-sensitive online classification that exploits online gradient descent techniques. Based on this framework, we propose a family of cost-sensitive online classification algorithms designed to directly optimize two well-known cost-sensitive measures: (i) maximization of the weighted sum of sensitivity and specificity, and (ii) minimization of the weighted misclassification cost. We analyze the theoretical bounds of the cost-sensitive measures achieved by the proposed algorithms, and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks.
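
As a hedged illustration of the framework's flavor (not the paper's exact algorithms), the sketch below runs online gradient descent on a weighted hinge loss, charging different costs to mistakes on the two classes; the parameter names and cost values are assumptions.

    import numpy as np

    def cost_sensitive_ogd(stream, dim, c_pos=5.0, c_neg=1.0, eta=0.1):
        # Online gradient descent on a weighted hinge loss: errors on the
        # positive (often rare) class are charged c_pos, negatives c_neg.
        w = np.zeros(dim)
        for x, y in stream:                 # y in {-1, +1}
            cost = c_pos if y > 0 else c_neg
            if y * (w @ x) < 1.0:           # weighted hinge loss is active
                w += eta * cost * y * x     # gradient step on this example
        return w

    # Toy usage on a random stream of (features, label) pairs.
    rng = np.random.default_rng(0)
    stream = [(rng.standard_normal(5), rng.choice([-1, 1])) for _ in range(1000)]
    w = cost_sensitive_ogd(stream, dim=5)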

17:48 Self-Taught Active Learning from Crowds
Meng Fang, Xingquan Zhu, Bin Li, Wei Ding, and Xindong Wu
DM826

The emergence of social tagging and crowdsourcing systems provides a unique platform where multiple weak labelers can form a crowd to fulfill a labeling task. Yet crowd labelers are often noisy and inaccurate, have limited labeling knowledge, and, worst of all, act independently without seeking complementary knowledge from each other to improve labeling performance. In this paper, we propose a Self-Taught Active Learning (STAL) paradigm, where imperfect labelers are able to learn complementary knowledge from one another to expand their knowledge sets and benefit the underlying active learner. We employ a probabilistic model to characterize the knowledge of each labeler, through which a weak labeler can learn complementary knowledge from a stronger peer. As a result, the self-taught active learning process eventually helps achieve high classification accuracy with minimized labeling costs and labeling errors.

Classification 2

Wednesday, 12 December
10:00 – 12:00
Room: Rembrandt & Permeke
Session Chair: David Martens

10:00 The mixture of multi-kernel relevance vector machines model
Konstantinos Blekas and Aristidis Likas
DM651

We present a new regression mixture model where each mixture component is a multi-kernel version of the Relevance Vector Machine (RVM). In the proposed model, we exploit the enhanced modeling capability of RVMs due to their embedded sparsity-enforcing properties. Moreover, robustness is achieved with respect to the kernel parameters by employing a weighted multi-kernel scheme. The mixture model is trained using the maximum a posteriori (MAP) approach, where the Expectation Maximization (EM) algorithm is applied, offering closed-form update equations for the model parameters. An incremental learning methodology is also presented to tackle the parameter initialization problem of the EM algorithm. The efficiency of the proposed mixture model is empirically demonstrated on the time series clustering problem using various artificial and real benchmark datasets, and through comparisons with other regression mixture models.

10:20 Self-Training with Selection-by-Rejection
Yan Zhou, Murat Kantarcioglu, and Bhavani Thuraisingham
DM607

Practical machine learning and data mining problems often face a shortage of labeled training data. Self-training algorithms are among the earliest attempts to use unlabeled data to enhance learning. Traditional self-training algorithms label the unlabeled data on which classifiers trained on limited training data have the highest confidence. In this paper, a self-training algorithm that decreases the disagreement region of hypotheses is presented. The algorithm supplements the training set with self-labeled instances; only instances that greatly reduce the disagreement region of hypotheses are labeled and added to the training set. Empirical results demonstrate that the proposed self-training algorithm can effectively improve classification performance.

10:40 Co-Labeling: A New Multi-view Learning Approach for Ambiguous Problems
Wen Li, Lixin Duan, Ivor Wai-Hung Tsang, and Dong Xu
DM502

We propose a multi-view learning approach called co-labeling which is applicable to several machine learning problems where the labels of training samples are uncertain, including semi-supervised learning (SSL), multi-instance learning (MIL) and max-margin clustering (MMC). In particular, we first unify those problems into a general ambiguous problem in which we simultaneously learn a robust classifier and find the optimal training labels from a finite label candidate set. To effectively utilize multiple views of data, we then develop our co-labeling approach for the general multi-view ambiguous problem. In our work, classifiers trained on different views teach each other by iteratively passing the predictions of training samples from one classifier to the others. The predictions from one classifier are considered as label candidates for the other classifiers. To train a classifier with a label candidate set for each view, we adopt the Multiple Kernel Learning (MKL) technique, constructing each base kernel by associating the input kernel calculated from input features with one label candidate. Compared with the traditional co-training method, which was specifically designed for SSL, the advantages of our co-labeling approach are two-fold: 1) it can be applied to other ambiguous problems such as MIL and MMC; 2) it is more robust, using the MKL method to integrate multiple labeling candidates obtained from different iterations and biases. Promising results on several real-world multi-view data sets clearly demonstrate the effectiveness of our proposed co-labeling for both MIL and SSL.

11:00 Graph-oriented Learning via Automatic Group Sparsity for Data Analysis
Yuqiang Fang, Ruili Wang, and Bin Dai
DM922

The key task in graph-oriented learning is constructing an informative graph to model the geometrical and discriminant structure of a data manifold. Since traditional graph construction methods are sensitive to noise and less datum-adaptive to changes in density, a new graph construction method, the so-called L1-Graph, has recently been proposed [1]. A graph construction method needs to have two important properties: sparsity and locality. However, the L1-Graph is strong in sparsity but weak in locality. To overcome this limitation, we propose a new method of constructing an informative graph using automatic group sparse regularization based on the work of L1-Graph, which we call the group sparse graph (GroupSp-Graph). The newly developed GroupSp-Graph has the same noise-insensitivity as the L1-Graph, and can also successfully preserve group and local information in the graph. In other words, the proposed group sparse graph has both sparsity and locality simultaneously. Furthermore, we integrate the proposed graph with several graph-oriented learning algorithms: spectral embedding, spectral clustering, subspace learning and manifold regularized non-negative matrix factorization. Empirical studies on benchmark data sets show that the proposed algorithms achieve considerable improvement over classic graph construction methods and the L1-Graph method in various learning tasks.

11:20 Ensemble Pruning via Constrained Eigen-Optimization
Linli Xu, Bo Li, and Enhong Chen
DM941

An ensemble is composed of a set of base learners that make predictions jointly. The generalization performance of ensembles has been justified both theoretically and in practice. However, existing ensemble learning methods sometimes produce unnecessarily large ensembles, at the expense of extra computational costs and memory consumption. The purpose of ensemble pruning is to select a subset of base learners with comparable or better prediction performance. In this paper, we formulate ensemble pruning as a combinatorial optimization problem with the goal of maximizing accuracy and diversity at the same time. Solving this problem exactly is computationally hard. Fortunately, we can relax and reformulate it as a constrained eigenvector problem, which can be solved with an efficient algorithm that is guaranteed to converge globally. Convincing experimental results demonstrate that this optimization-based ensemble pruning algorithm outperforms the state-of-the-art heuristics in the literature.
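
The following is a loose sketch of the accuracy-plus-diversity idea, under a much-simplified objective: build a symmetric matrix mixing per-learner accuracy with pairwise disagreement, take its leading eigenvector, and keep the base learners with the largest components. The mixing weight lam and the rank-then-threshold step are assumptions; the paper's constrained eigen-optimization is more involved.

    import numpy as np

    def prune_ensemble(preds, y, k, lam=0.5):
        # preds: (m, n) matrix of m base learners' {-1,+1} predictions on n
        # labeled samples; y: length-n array of {-1,+1} labels.
        m, n = preds.shape
        acc = (preds @ y) / n                 # per-learner agreement with y
        div = 1.0 - (preds @ preds.T) / n     # pairwise disagreement
        G = lam * np.diag(acc) + (1.0 - lam) * div
        _, vecs = np.linalg.eigh(G)           # symmetric eigendecomposition
        v = np.abs(vecs[:, -1])               # leading eigenvector
        return np.argsort(-v)[:k]             # indices of learners to keep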

11:40 Healing Truncation Bias: Self-weighted Truncation Framework for Dual Averaging
Hidekazu Oiwa, Shin Matsushima, and Hiroshi Nakagawa
DM879

We propose a new truncation framework for online supervised learning. Learning a compact predictive model in an online setting has recently attracted a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than a conventional learning framework. However, a simple combination of the two triggers the truncation of weights whose corresponding features rarely appear, even if these features are crucial for prediction; furthermore, it is difficult to emphasize these features in advance while preserving the advantages of online learning. We develop an extended truncation framework for Dual Averaging that retains rarely occurring but informative features. Our proposed framework integrates information from all previous subgradients of the loss functions into a regularization term. Our enhancement of conventional L1-regularization accomplishes automatic adjustment of each feature's truncation. This extension enables us to identify and retain rare but informative features without preprocessing. In addition, our framework achieves the same computational complexity and regret bound as standard Dual Averaging. Experiments demonstrate that our framework outperforms other sparse online learning algorithms.
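
For context, here is standard L1-regularized Dual Averaging (RDA) with its closed-form soft-thresholding update, the baseline that the proposed self-weighted truncation extends; in the paper's framework the effective truncation threshold would additionally adapt per feature based on past subgradients. The hinge loss and all parameter names are assumptions.

    import numpy as np

    def rda_l1(stream, dim, lam=0.01, gamma=1.0):
        # Regularized Dual Averaging: the weights are recomputed each round
        # from the running average of all past subgradients, with the L1
        # term applied in closed form via soft-thresholding.
        g_sum, w = np.zeros(dim), np.zeros(dim)
        for t, (x, y) in enumerate(stream, start=1):
            if y * (w @ x) < 1.0:            # hinge-loss subgradient
                g_sum -= y * x
            g_bar = g_sum / t
            w = -(np.sqrt(t) / gamma) * np.sign(g_bar) \
                * np.maximum(np.abs(g_bar) - lam, 0.0)
        return w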

Clustering 2

Wednesday, 12 December
10:00 – 12:00
Room: Salle des nations I
Session Chair: Katharina Morik

10:00 Scalable and Memory-efficient Clustering of Large Scale Social Networks
Joyce Jiyoung Whang, Xin Sui, and Inderjit S. Dhillon
DM892

Clustering of social networks is an important task for their analysis; however, most existing algorithms do not scale to the massive size of today's social networks. A popular class of graph clustering algorithms for large-scale networks, such as PMetis, KMetis and Graclus, is based on a multilevel framework. Generally, these multilevel algorithms work reasonably well on networks with a few million vertices. However, when the network size increases to the scale of 10 million vertices or greater, their performance rapidly degrades. Furthermore, an inherent property of social networks, the power-law degree distribution, makes these algorithms infeasible to apply to large-scale social networks. In this paper, we propose a scalable and memory-efficient clustering algorithm for large-scale social networks. We name our algorithm GEM, combining two key concepts of the algorithm: Graph Extraction and weighted kernel k-Means. GEM efficiently extracts a good skeleton graph from the original graph, and propagates the clustering result of the extracted graph to the rest of the network. Experimental results show that GEM produces clusters of quality comparable to or better than existing state-of-the-art graph clustering algorithms, while being much faster and consuming much less memory. Furthermore, the parallel implementation of GEM, called PGEM, not only produces higher-quality clusters but also achieves much better scalability than most current parallel graph clustering algorithms.

10:20 Clustering Time Series using Unsupervised-Shapelets
Jesin Zakaria, Abdullah Mueen, and Eamonn Keogh
DM708

Time series clustering has become an increasingly important research topic over the past decade. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real-world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series "behaviors," e.g., individual heartbeats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, contriving the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates for the first time the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insights into the domains to which it is applied.
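
A hedged sketch of the unsupervised-shapelet idea: score every fixed-length candidate subsequence by how cleanly the distances from it to all series split into two groups, and keep the best candidate. The gap-based score below is a simplification of the paper's criterion; the subsequence length and all names are assumptions, and the brute-force search is toy-scale only.

    import numpy as np

    def znorm(s, eps=1e-8):
        return (s - s.mean()) / (s.std() + eps)

    def subseq_dist(s, t):
        # Smallest Euclidean distance between z-normalized subsequence s
        # and any z-normalized window of series t.
        L = len(s)
        return min(np.linalg.norm(s - znorm(t[i:i + L]))
                   for i in range(len(t) - L + 1))

    def best_u_shapelet(series, L):
        # Score each length-L candidate by the widest gap in the sorted
        # distances it induces over the whole data set (illustrative).
        best, best_score = None, -np.inf
        for t in series:
            for i in range(len(t) - L + 1):
                cand = znorm(t[i:i + L])
                d = np.sort([subseq_dist(cand, s) for s in series])
                score = np.max(np.diff(d))   # widest gap splits two groups
                if score > best_score:
                    best, best_score = cand, score
        return best, best_score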

10:40 Dimensional Testing for Multi-Step Similarity Search
Michael E. Houle, Xiguo Ma, Michael Nett, and Vincent Oria
DM875

In data mining applications such as subspace clustering or feature selection, changes to the underlying feature set can require the reconstruction of search indices to support fundamental data mining tasks. For such situations, multi-step search approaches have been proposed that can accommodate changes in the underlying similarity measure without the need to rebuild the index. In this paper, we present a heuristic multi-step search algorithm that utilizes a measure of intrinsic dimension, the generalized expansion dimension (GED), as the basis of its search termination condition. Compared to the current state-of-the-art method, experimental results show that our heuristic approach is able to obtain significant improvements in both the number of candidates and the running time, while losing very little in the accuracy of the query results.

11:00 Discovery of Causal Rules Using Partial Association
Zhou Jin, Jiuyong Li, Lin Liu, Thuc Duy Le, Bingyu Sun, and Rujing Wang
DM413

Discovering causal relationships in large databases of observational data is challenging. The pioneering work in this area was rooted in the theory of Bayesian network (BN) learning, which, however, is an NP-complete problem. Hence, several constraint-based algorithms have been developed to efficiently discover causal relationships in large databases. These methods usually use the idea of BN learning, directly or indirectly, and focus on causal relationships with single cause variables. In this paper, we propose an approach to mine causal rules in large databases of binary variables. Our method expands the scope of causality discovery to causal relationships with multiple cause variables, and we utilize partial association tests to exclude non-causal associations and ensure the high reliability of the discovered causal rules. Furthermore, an efficient algorithm is designed for conducting the tests in large databases. We assess the method on a set of real-world diagnostic data. The results show that our method can effectively discover interesting causal rules in large databases.
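
The partial-association machinery can be illustrated with the classical Cochran-Mantel-Haenszel test over 2x2 tables stratified by the controlled variables; this is just the statistical test, not the paper's rule-mining algorithm built around it.

    import numpy as np
    from scipy.stats import chi2

    def mantel_haenszel(tables):
        # tables: iterable of 2x2 contingency tables, one per stratum of
        # the controlled covariates; tests the X-Y association net of them.
        num, var = 0.0, 0.0
        for t in tables:
            t = np.asarray(t, dtype=float)
            n = t.sum()
            r1, c1 = t[0].sum(), t[:, 0].sum()
            num += t[0, 0] - r1 * c1 / n          # observed minus expected
            var += r1 * (n - r1) * c1 * (n - c1) / (n ** 2 * (n - 1))
        stat = (abs(num) - 0.5) ** 2 / var        # with continuity correction
        return stat, chi2.sf(stat, df=1)          # statistic and p-value

    # Example: association between X and Y across two strata.
    stat, p = mantel_haenszel([[[20, 10], [10, 20]], [[15, 5], [5, 15]]])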

11:20 Outlier Detection in Arbitrarily Oriented Subspaces
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek
DM727

In this paper, we propose a novel outlier detection model to find outliers that deviate from the generating mechanisms of normal instances by considering combinations of different subsets of attributes, as they occur when there are local correlations in the data set. Our model enables us to search for outliers in arbitrarily oriented subspaces of the original feature space. We show how, in addition to an outlier score, our model also derives an explanation of the outlierness that is useful in investigating the results. Our experiments suggest that our novel method can find different outliers than existing work and can be seen as a complement to those approaches.

11:40 Outlier Ranking via Subspace Analysis in Multiple Views of the Data
Emmanuel Müller, Ira Assent, Patricia Iglesias, Yvonne Mülle, and Klemens Böhm
DM740

Outlier mining is an important task for finding anomalous objects. In practice, however, there is not always a clear distinction between outliers and regular objects, as objects play different roles w.r.t. different attribute sets. An object may deviate in one subspace, i.e., a subset of attributes, while appearing perfectly regular in other subspaces. One can think of subspaces as multiple views of one database. Traditional methods consider only one view (the full attribute space), and thus miss complex outliers that are hidden in multiple subspaces. In this work, we propose Outrank, a novel outlier ranking concept. Outrank exploits subspace analysis to determine the degree of outlierness. It considers different subsets of the attributes as individual outlier properties, compares clustered regions in arbitrary subspaces, and derives an outlierness score for each object. Its principled integration of multiple views into an outlierness measure uncovers outliers that are not detectable in the full attribute space. Our experimental evaluation demonstrates that Outrank successfully determines a high-quality outlier ranking and outperforms state-of-the-art outlierness measures.

Social Networks 2

Wednesday, 12 December
10:00 – 12:00
Room: Salle des nations II
Session Chair: Jie Tang

10:00 Sequential Network Change Detection with Its Applications to Ad Impact Relation Analysis
Yu Hayashi and Kenji Yamanishi
DM300

We are concerned with the issue of tracking changes of variable dependencies from multivariate time series. Conventionally, this issue has been addressed in the batch scenario, where the whole data set is given at once and change detection must be done retrospectively. This paper addresses the issue in a sequential scenario, where multivariate data are input sequentially and detection must be done in a sequential fashion. We propose a new method for sequential tracking of variable dependencies, in which we employ a Bayesian network as a representation of variable dependencies. The key ideas of our method are: 1) we extend the theory of dynamic model selection (DMS), which has been developed in the batch-learning scenario, to the sequential setting and apply it to our issue; 2) we conduct change detection sequentially using dynamic programming over a window, where we employ Hoeffding's bound to automatically determine the window size. We empirically demonstrate that our proposed method can perform change detection more efficiently than a conventional batch method. Further, we give a new framework for an application of variable dependency change detection, which we call Ad Impact Relation analysis (AIR): we detect the time point when a commercial message advertisement has had an impact on the market and effectively visualize the impact through network changes. We employ real data sets to demonstrate the validity of AIR.

10:20 GUISE: Uniform Sampling of Graphlets for Large Graph Analysis
Mansurul A Bhuiyan, Mahmudur Rahman, Mahmuda Rahman, and Mohammad Al Hasan
DM947

Graphlet frequency distribution (GFD) has recently become popular for characterizing large networks. However, computing the GFD of a network requires the exact count of embedded graphlets in that network, which is a computationally expensive task. As a result, it is practically infeasible to compute the GFD of even a moderately large network. In this paper, we propose GUISE, which uses a Markov Chain Monte Carlo (MCMC) sampling method to construct the approximate GFD of a large network. Our experiments on networks with millions of nodes show that GUISE obtains the GFD within a few minutes, whereas the exhaustive counting-based approach takes several days.
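
Below is a rough, illustrative Metropolis-Hastings walk over connected 3-node subsets in the spirit of GUISE, with an acceptance ratio acting as a crude correction toward uniform sampling; it is not the authors' exact chain, it assumes a connected graph with more than `size` nodes, and it requires networkx.

    import random
    from collections import Counter
    import networkx as nx

    def frontier(G, subset):
        # Nodes adjacent to the subset but outside it.
        nbrs = set()
        for u in subset:
            nbrs.update(G[u])
        return list(nbrs - subset)

    def mcmc_graphlet_freq(G, size=3, steps=20000, seed=7):
        rng = random.Random(seed)
        subset = {next(iter(G.nodes))}
        while len(subset) < size:                 # grow an initial subset
            subset.add(rng.choice(frontier(G, subset)))
        counts, d_cur = Counter(), len(frontier(G, subset))
        for _ in range(steps):
            drop = rng.choice(list(subset))       # propose a vertex swap
            add = rng.choice(frontier(G, subset))
            proposal = (subset - {drop}) | {add}
            if nx.is_connected(G.subgraph(proposal)):
                d_new = len(frontier(G, proposal))
                if rng.random() < min(1.0, d_cur / max(d_new, 1)):
                    subset, d_cur = proposal, d_new
            # For 3-node graphlets, the edge count identifies the type:
            # 2 edges = wedge/path, 3 edges = triangle.
            counts[G.subgraph(subset).number_of_edges()] += 1
        total = sum(counts.values())
        return {edges: c / total for edges, c in counts.items()}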

10:40 Reliable Clustering on Uncertain Graphs
Lin Liu, Ruoming Jin, Charu Aggarwal, and Yelong Shen
DM871

Many graphs in practical applications are not deterministic, but are probabilistic in nature because the existence of the edges is inferred with the use of a variety of statistical approaches. In this paper, we will examine the problem of clustering uncertain graphs. Uncertain graphs are best clustered with the use of a possible worlds model in which the most reliable clusters are discovered in the presence of uncertainty. Reliable clusters are those which are not likely to be disconnected in the context of different instantiations of the uncertain graph. We present experimental results which illustrate the effectiveness of our model and approach.

11:00 Towards Annotating Media Contents through Social Diffusion Analysis
Tong Xu, Dong Liu, Enhong Chen, Huanhuan Cao, and Jilei Tian
DM333

Recently, the boom of media content on the Internet has raised challenges in managing it effectively, and thus requires automatic media annotation techniques. Motivated by the observation that media content is usually shared frequently in online communities and thus has many social diffusion records, we propose a novel media annotation approach that depends on these social diffusion records instead of metadata. The basic assumption is that the social diffusion records reflect the common interests (CI) between users, which can be analyzed to generate annotations. With this assumption, we present a novel CI-based social diffusion model and translate the automatic annotation task into the CI-based diffusion maximization (CIDM) problem. Moreover, we propose to solve the CIDM problem through two optimization tasks, corresponding to the training and test stages in supervised learning. Extensive experiments on real-world data sets show that our approach can effectively generate high-quality annotations, and thus demonstrate the capability of social diffusion analysis in annotating media.

11:12 Inferring the Underlying Structure of Information Cascades
Bo Zong, Yinghui Wu, Ambuj K. Singh, and Xifeng Yan
DM385

In social networks, information and influence diffuse among users as cascades. While the importance of studying cascades has been recognized in various applications, it is difficult to observe the complete structure of cascades in practice. In this paper we study the cascade inference problem under the independent cascade model, and provide a full treatment from complexity to algorithms: (a) we propose the idea of consistent trees as the inferred structures for cascades; these trees connect source nodes and observed nodes with paths satisfying the constraints from the observed temporal information; (b) we introduce metrics to measure the likelihood of consistent trees as inferred cascades, as well as several optimization problems for finding them; (c) we show that the decision problems for consistent trees are in general NP-complete, and that the optimization problems are hard to approximate; (d) we provide approximation algorithms with performance guarantees on the quality of the inferred cascades, as well as heuristics. We experimentally verify the efficiency and effectiveness of our inference algorithms, using real and synthetic data.

11:24 Detecting Spam and Promoting Campaigns in the Twitter Social Network
Xianchao Zhang, Shaoping Zhu, and Wenxin Liang
DM440

The Twitter social network has become a target platform for both promoters and spammers to disseminate their messages. There are a large number of campaigns containing coordinated spam or promoting accounts in Twitter, which are more harmful than traditional methods such as email spamming. Since traditional solutions mainly check individual accounts or messages, it is an urgent task to detect spam and promoting campaigns in Twitter. In this paper, we propose a scalable framework to detect both spam and promoting campaigns. Our framework consists of three steps: first, linking accounts that post URLs for similar purposes; second, extracting candidate campaigns which may exist for spam or promoting purposes; and finally, distinguishing their intents. One salient aspect of the framework is the introduction of a URL-driven estimation method to measure the similarity between accounts' purposes in posting URLs; another is the proposal of multiple features to distinguish the candidate campaigns using a machine learning method. On a large-scale dataset from Twitter, we extract the actual campaigns with high precision and recall, and distinguish the majority of the candidate campaigns correctly.

11:36 Towards Automatic Image Understanding and Mining via Social Curation
Katsuhiko Ishiguro, Akisato Kimura, and Koh Takeuchi
DM533

The amount and variety of multimedia data such as images, movies and music available over social networks are increasing rapidly. However, the ability to analyze and exploit these unorganized multimedia data remains inadequate, even with state-of-the-art media processing techniques. Our finding in this paper is that the emerging social curation services are a promising information source for the automatic understanding and mining of images distributed and exchanged via social media. One remarkable virtue of social curation datasets is that they are weakly supervised: the content in the service is manually collected, selected and maintained by users. This is very different from other social information sources, and we can utilize this characteristic for media content mining without expensive media processing techniques. In this paper we present a machine learning system for predicting the view counts of images in social curation data as a first step toward automatic image content evaluation. Our experiments confirm that simple features extracted from a social curation corpus are far superior for count prediction than the gold-standard image features of computer vision research.

11:48 IRIE: Scalable and Robust Influence Maximization in Social Networks
Kyomin Jung, Wooram Heo, and Wei Chen
DM613

Influence maximization is the problem of selecting the top k seed nodes in a social network to maximize their influence coverage under certain influence diffusion models. In this paper, we propose a novel algorithm, IRIE, that integrates the advantages of influence ranking (IR) and influence estimation (IE) methods for influence maximization in both the independent cascade (IC) model and its extension IC-N, which incorporates negative opinion propagation. Through extensive experiments, we demonstrate that IRIE matches the influence coverage of other algorithms while scaling much better. Moreover, IRIE is much more robust and stable than other algorithms in both running time and memory usage across various network densities and cascade sizes. It runs up to two orders of magnitude faster than other state-of-the-art algorithms such as PMIA on large networks with tens of millions of nodes and edges, while using only a fraction of the memory.
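
As a hedged sketch of the influence-ranking (IR) ingredient only: under the IC model with a uniform propagation probability p, a node's rank can be estimated by iterating a simple linear system over its out-neighbours. This omits IRIE's influence-estimation step and its discounting of already-chosen seeds; the adjacency format and parameters are assumptions.

    def influence_rank(adj, p=0.05, iters=30):
        # Iterate r(u) = 1 + p * sum of r(v) over out-neighbours v, a
        # simplified global influence-ranking recurrence.
        r = {u: 1.0 for u in adj}
        for _ in range(iters):
            r = {u: 1.0 + p * sum(r.get(v, 1.0) for v in adj[u]) for u in adj}
        return r

    # Toy usage: take the k top-ranked nodes as seeds (IRIE would instead
    # re-estimate the ranks after discounting each chosen seed).
    adj = {0: [1, 2], 1: [2], 2: [0], 3: [0, 1, 2]}
    seeds = sorted(adj, key=influence_rank(adj).get, reverse=True)[:2]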

Classification 3

Thursday, 13 December
10:00 – 12:00
Room: Rembrandt & Permeke
Session Chair: Srivatsan Laxman

10:00 Transductive Representation Learning for Cross-Lingual Text Classification
Yuhong Guo and Min Xiao
DM409

In cross-lingual text classification problems, it is costly and time-consuming to annotate documents for each individual language. To avoid the expensive re-labeling process, domain adaptation techniques can be applied to adapt a learning system trained in one language domain to another language domain. In this paper we develop a transductive subspace representation learning method to address domain adaptation for cross-lingual text classification. The proposed approach is formulated as a nonnegative matrix factorization problem and solved using an iterative optimization procedure. Our empirical study on cross-lingual text classification tasks shows that the proposed approach consistently outperforms a number of comparison methods.

10:12 Learning Target Predictive Function without Target Labels
Chun-Wei Seah, Ivor Wai-Hung Tsang, Yew-Soon Ong, and Qi Mao
DM427

In the absence of labeled samples in a domain, referred to as the target domain, Domain Adaptation (DA) techniques come in handy. Generally, DA techniques assume there are available source domains that share a similar predictive function with the target domain. Two core challenges of DA typically arise: the variance that exists between source and target domains, and the inherent source hypothesis bias. In this paper, we first propose a Stability Transfer criterion for selecting relevant source domains with low variance. With this criterion, we introduce a TARget learning Assisted by Source Classifier Adaptation (TARASCA) method to address the two core challenges that have impeded the performance of DA techniques. To verify the robustness of TARASCA, extensive experimental studies are carried out with comparison to several state-of-the-art DA methods on the real-world Sentiment and Newsgroups datasets, where various settings for the class ratios of the source and target domains are considered.

10:24 Fast Kernel Sparse Representation Approaches for Classification
Yifeng Li and Alioune Ngom
DM316

Sparse representation involves two related procedures: sparse coding and dictionary learning. Learning a dictionary from data provides a concise knowledge representation, and learning a dictionary in a higher-dimensional feature space may allow a better representation of a signal. However, with existing algorithms it is usually computationally expensive to learn a dictionary when the number of training samples and/or dimensions is very large. In this paper, we propose a kernel dictionary learning framework for three models. We reveal that the optimization has dimension-free and parallel properties, and we devise fast active-set algorithms for this framework. We investigate their performance on classification. Experimental results show that our kernel sparse representation approaches obtain better accuracy than their linear counterparts. Furthermore, our active-set algorithms are faster than the existing interior-point and proximal algorithms.

10:36 Learning Attitudes and Attributes from Multi-Aspect Reviews
Julian McAuley, Jure Leskovec, and Dan Jurafsky
DM476

Most online reviews consist of plain-text feedback together with a single numeric score. However, understanding the multiple 'aspects' that contribute to users' ratings may help us to better understand their individual preferences. For example, a user's impression of an audio book presumably depends on aspects such as the story and the narrator, and knowing their opinions on these aspects may help us to recommend better products. In this paper, we build models for rating systems in which such dimensions are explicit, in the sense that users leave separate ratings for each aspect of a product. By introducing new corpora consisting of five million reviews, rated with between three and six aspects, we evaluate our models on three prediction tasks: First, we uncover which parts of a review discuss which of the rated aspects. Second, we summarize reviews by finding the sentences that best explain a user's rating. Finally, since aspect ratings are optional in many of the datasets we consider, we recover ratings that are missing from a user's evaluation. Our model matches state-of-the-art approaches on existing small-scale datasets, while scaling to the real-world datasets we introduce. Moreover, our model is able to 'disentangle' content and sentiment words: we automatically learn content words that are indicative of a particular aspect as well as the aspect-specific sentiment words that are indicative of a particular rating.

10:48 Simultaneously Combining Multi-View Multi-Label Learning with Maximum Margin Classification
Zheng Fang and Zhongfei (Mark) Zhang
DM511

Multiple feature views arise in various important data classification scenarios. However, finding a consensus feature view from multiple feature views for a classifier is still a challenging task. We present a new classification framework that uses multi-label correlation information to address the problem of simultaneously combining multiple feature views with maximum margin classification. Under this framework, we propose a novel algorithm that iteratively computes the multiple view feature mapping matrices, the consensus feature view representation, and the coefficients of the classifier. Extensive experimental evaluations demonstrate the effectiveness and promise of this framework as well as the algorithm for discovering a consensus view from multiple feature views.

11:00 Decision Theory for Discrimination-aware Classification
Faisal Kamiran, Asim Karim, and Xiangliang Zhang
DM532

Social discrimination (e.g., against females) arising from data mining techniques is a growing concern worldwide. In recent years, several methods have been proposed for making classifiers learned over discriminatory data discrimination-aware. However, these methods suffer from two major shortcomings: (1) they require either modifying the discriminatory data or tweaking a specific classification algorithm, and (2) they are not flexible with respect to discrimination control and the handling of multiple sensitive attributes. In this paper, we present two solutions for discrimination-aware classification that require neither data modification nor classifier tweaking. Our first and second solutions exploit, respectively, the reject option of probabilistic classifiers and the disagreement region of general classifier ensembles to reduce discrimination. We relate both solutions to decision theory for a better understanding of the process. Our experiments using real-world datasets demonstrate that our solutions outperform existing state-of-the-art methods, especially at low discrimination levels, which is a significant advantage. The superior performance, coupled with flexible control over discrimination and easy applicability to multiple sensitive attributes, makes our solutions an important step forward in practical discrimination-aware classification.

11:12 Active Label Correction
Umaa Rebbapragada, Carla E. Brodley, Damien Sulla-Menashe, and Mark Friedl
DM741

Active Label Correction (ALC) is an interactive method that cleans an established training set of mislabeled examples in conjunction with a domain expert. ALC presumes that the expert who conducts this review is either more accurate than the original annotator or has access to additional resources that ensure a high quality label. A high-cost re-review is possible because ALC proceeds iteratively, scoring the full training set but selecting only small batches of examples that are likely mislabeled. The expert reviews each batch and corrects any mislabeled examples, after which the classifier is retrained and the process repeats until the expert terminates it. We compare several instantiations of ALC to fully-automated methods that attempt to discard or correct label noise in a single pass. Our empirical results show that ALC outperforms single-pass methods in terms of selection efficiency and classifier accuracy. We evaluate the best ALC instantiation on our motivating task of detecting mislabeled and poorly formulated sites within a land cover classification training set from the geography domain.

11:24 Sparse Bayesian Adversarial Learning Using Relevance Vector Machine Ensembles
Yan Zhou, Murat Kantarcioglu, and Bhavani Thuraisingham
DM611

Data mining tasks are made more complicated when adversaries attack by modifying malicious data to evade detection. The main challenge lies in finding a robust learning model that is insensitive to unpredictable malicious data distribution. In this paper, we present a sparse relevance vector machine ensemble for adversarial learning. The novelty of our work is the use of individualized kernel parameters to model potential adversarial attacks during model training. We allow the kernel parameters to drift in the direction that minimizes the likelihood of the positive data. This step is interleaved with learning the weights and the weight priors of a relevance vector machine. Our empirical results demonstrate that an ensemble of such relevance vector machine models is more robust to adversarial attacks.

11:36 An AdaBoost Algorithm for Multiclass Semi-Supervised Learning
Jafar Tanha, Maarten van Someren, and Hamideh Afsarmanesh
DM792

We present an algorithm for multiclass semi-supervised learning, that is, learning from a limited amount of labeled data and plenty of unlabeled data. Existing semi-supervised algorithms use approaches such as one-versus-all to convert the multiclass problem into several binary classification problems, which is not optimal. We propose a multiclass semi-supervised boosting algorithm that solves multiclass classification problems directly. The algorithm is based on a novel multiclass loss function consisting of the margin cost on labeled data and two regularization terms on labeled and unlabeled data. Experimental results on a number of UCI datasets show that the proposed algorithm performs better than state-of-the-art boosting algorithms for multiclass semi-supervised learning.

11:48 A Classification Based Framework For Concept Summarization
Dhruv Mahajan, Sundararajan Sellamanickam, Subhajit Sanyal, and Amit Madaan
DM517

In this paper we propose a novel classification based framework for finding a small number of images that summarize a given concept. Our method exploits metadata information available with the images to get category information using Latent Dirichlet Allocation. Using this category information for each image, we solve the underlying classification problem by building a sparse classifier model for each concept. We demonstrate that the images that specify the sparse model form a good summary. In particular, our summary satisfies important properties such as likelihood, diversity and balance in both the visual and the semantic sense. Furthermore, the framework allows users to specify desired distributions over categories to create personalized summaries. Experimental results on seven broad query types show that the proposed method performs better than state-of-the-art methods.

Clustering 3

Thursday, 13 December
10:00 – 12:00
Room: Alto & Mezzo & Tempo
Session Chair: Emmanuel Müller

10:00 Robust Ensemble Clustering by Matrix Completion
Jinfeng Yi, Tianbao Yang, Rong Jin, Anil K. Jain, and Mehrdad Mahdavi
DM288

Data clustering is an important task and has found applications in numerous real-world problems. Since no single clustering algorithm is able to identify all the different types of cluster shapes and structures, ensemble clustering was proposed to combine different partitions of the same data generated by multiple clustering algorithms. The key idea of most ensemble clustering algorithms is to find a partition that is consistent with most of the available partitions of the input data. One problem with these algorithms is their inability to handle uncertain data pairs, i.e., data pairs for which about half of the partitions put them into the same cluster and the other half do the opposite. When the number of uncertain data pairs is large, they can mislead the ensemble clustering algorithm in generating the final partition. To overcome this limitation, we propose an ensemble clustering approach based on the technique of matrix completion. The proposed algorithm constructs a partially observed similarity matrix from the data pairs whose cluster memberships are agreed upon by most of the clustering algorithms in the ensemble. It then deploys a matrix completion algorithm to complete the similarity matrix. The final data partition is computed by applying an efficient spectral clustering algorithm to the completed matrix. Our empirical studies with multiple real-world datasets show that the proposed algorithm performs significantly better than the state-of-the-art algorithms for ensemble clustering.
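
A minimal sketch of the first step, constructing the partially observed similarity matrix from an ensemble of partitions (the agreement threshold and the NaN encoding of unobserved entries are our own illustrative choices; the completion step itself would be handled by a matrix completion solver):

    import numpy as np

    def partial_coassociation(partitions, agree=0.8):
        """Build a partially observed similarity matrix from an ensemble
        of partitions (each a length-n array of cluster labels). A pair
        is observed as 1 (0) only if at least `agree` of the partitions
        put it in the same (different) cluster; uncertain pairs remain
        unobserved (NaN)."""
        P = np.asarray(partitions)          # shape: (m, n)
        m, n = P.shape
        same = np.zeros((n, n))
        for labels in P:
            same += (labels[:, None] == labels[None, :])
        frac = same / m
        S = np.full((n, n), np.nan)
        S[frac >= agree] = 1.0
        S[frac <= 1 - agree] = 0.0
        return S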

10:12 Sparse Subspace Representation for Spectral Document Clustering
Budhaditya Saha, Dinh Phung, Duc Son Pham, and Svetha Venkatesh
DM319

We present a novel method for document clustering using sparse representation of documents in conjunction with spectral clustering. An l1-norm optimization formulation is posed to learn the sparse representation of each document, allowing us to characterize the affinity between documents by considering the overall information instead of traditional pairwise similarities. This document affinity is encoded through a graph on which spectral clustering is performed. The decomposition into multiple subspaces allows documents to be part of a sub-group that shares a smaller set of similar vocabulary, thus allowing for cleaner clusters. Extensive experimental evaluations on two real-world datasets from the Reuters-21578 and 20Newsgroups corpora show that our proposed method consistently outperforms state-of-the-art algorithms. Significantly, the performance improvement over other methods is prominent for these datasets.

10:24 Labels vs. Pairwise Constraints: A Unified View of Label Propagation and Constrained Spectral Clustering
Xiang Wang, Buyue Qian, and Ian Davidson
DM347

In many real-world applications we can model the data as a graph with each node being an instance and the edges indicating a degree of similarity. Side information is often available in the form of labels for a small subset of instances, which gives rise to two problem settings and two types of algorithms. In label propagation style algorithms, the known labels are propagated to the unlabeled nodes. In constrained clustering style algorithms, known labels are first converted to pairwise constraints (Must-Link and Cannot-Link), and then a constrained cut is computed as a tradeoff between minimizing the cut cost and maximizing constraint satisfaction. Both techniques are evaluated by their ability to recover the ground truth labeling, i.e., by a 0/1 loss function either directly on the labels or on the pairwise relations derived from the labels. These two fields have developed separately, but in this paper we show that they are indeed related. This insight allows us to propose a novel way to generate constraints from the propagated labels, which our empirical study shows outperforms and is more stable than state-of-the-art label propagation and constrained spectral clustering algorithms.

10:36 A Cluster-Oriented Genetic Algorithm for Alternative Clustering
Duy Tin Truong and Roberto Battiti
DM543

Supervised alternative clustering is the problem of finding a set of clusterings that are of high quality and different from a given negative clustering. The task is therefore a clear multi-objective optimization problem. Optimizing two conflicting objectives requires dealing with trade-offs. Most approaches in the literature optimize these objectives sequentially or indirectly, resulting in solutions that are dominated. We develop a multi-objective algorithm, called COGNAC, able to optimize the objectives directly and simultaneously, producing solutions that approximate the Pareto front. COGNAC performs the recombination operator at the cluster level instead of the object level as in traditional genetic algorithms. It can accept arbitrary clustering quality and dissimilarity objectives and provides solutions dominating those of other state-of-the-art algorithms. COGNAC can also be used to generate a sequence of alternative clusterings, each of which is guaranteed to be different from all previous ones.

10:48 Low Dimensional Localized Clustering (LDLC)
Pooyan Khajehpour Tadavani and Ali Ghodsi
DM891

In the space of high-dimensional data, it is generally reasonable to assume that the data points lie on (or close to) one or more submanifolds. Each of these submanifolds can be modeled by a number of linear subspaces. This is in fact the main intuition behind the majority of subspace clustering algorithms. In many cases, however, the subspaces computed by these algorithms consist of disconnected subsets of the underlying submanifolds and therefore do not form localized and compact clusters. To address this problem, we propose "Low Dimensional Localized Clustering (LDLC)", a new method for subspace clustering. Unlike existing methods, LDLC respects the topology of the underlying submanifolds and assigns the data points to localized clusters such that the total reconstruction error is minimized. This is a valuable property in many tasks, such as semi-supervised classification, data visualization and local dimensionality reduction. We establish connections between LDLC, K-Means, and VQPCA from different perspectives, and validate our method through various experiments on synthetic and real data sets.

11:00 Exclusive Row Biclustering for Gene Expression Using a Combinatorial Auction Approach
Amichai Painsky and Saharon Rosset
DM576

The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem (also known as projected clustering) for gene expression data sets, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters through a combination of existing biclustering algorithms and combinatorial auction techniques. We devise an approach for tuning the threshold of our algorithm based on comparison to a null model, in the spirit of the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large-span, non-overlapping row submatrices, while considering their unique nature. The Gap statistic approach succeeds in identifying appropriate thresholds in all our examples.

11:12 Co-clustering of Multi-View Datasets: a Parallelizable Approach
Gilles Bisson and Clément Grimal
DM806

In many applications, entities of the domain are described through different views that clustering methods often process one by one. We introduce here the MVSim architecture, which can deal simultaneously with all the information contained in such multi-view datasets by using several instances of a co-similarity algorithm. We show that this architecture provides better results than both single-view and multi-view approaches, and that it can easily be parallelized, thus reducing both the time and space complexity of the computations.

11:24 Clustering by Learning Constraints Priorities
Masayuki Okabe and Seiji Yamada
DM950

A method for creating a constrained clustering ensemble by learning the priorities of pairwise constraints is proposed in this paper. This method integrates multiple clusterings produced by a simple constrained K-means algorithm that we modify to utilize the constraint priorities. The cluster ensemble is executed according to a boosting framework, which adaptively learns the constraint priorities and provides them to the modified constrained K-means to create diverse clusterings that ultimately improve the clustering performance. The experimental results show that our proposed method outperforms the original constrained K-means and is comparable to several state-of-the-art constrained clustering methods.

11:36 Deep Learning to Hash with Multiple Representations
Yoonseop Kang, Saehoon Kim, and Seungjin Choi
DM350

Hashing seeks an embedding of high-dimensional objects into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes with small Hamming distances. A variety of hashing methods have been developed, but most of them resort to a single view (representation) of the data. However, objects are often described by multiple representations. For instance, images are described by several visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate multiple representations into hashing, leading to multi-view hashing. In this paper we present a deep network for multi-view hashing, referred to as deep multi-view hashing, where each layer of hidden nodes is composed of view-specific and shared hidden nodes, in order to learn individual and shared hidden spaces from multiple views of data. Numerical experiments on image datasets demonstrate the useful behavior of our deep multi-view hashing (DMVH) compared to a recently proposed multi-modal deep network as well as existing shallow hashing models.
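
For context on why such binary codes are attractive for indexing: the Hamming distance between two packed codes reduces to a popcount of an XOR, as in this small example:

    def hamming(code_a: int, code_b: int) -> int:
        """Hamming distance between binary codes packed into integers."""
        return bin(code_a ^ code_b).count("1")

    # 0b10110 and 0b10011 differ in exactly two bit positions
    assert hamming(0b10110, 0b10011) == 2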

11:48 A Similarity Model and Segmentation Algorithm for White Matter Fiber Tracks
Son Mai, Sebastian Goebl, and Claudia Plant
DM504

Recently, fiber segmentation has become an emerging technique in neuroscience. Grouping fiber tracts into anatomically meaningful bundles makes it possible to study the structure of the brain and to investigate the onset and progression of neurodegenerative and mental diseases. In this paper, we propose a novel segmentation technique for fiber tracts based on shape similarity and connection similarity. For shape similarity, we propose new techniques adapted from existing similarity measures for trajectory data. We also propose a new measure called Warped Longest Common Subsequence (WLCS), for which we additionally develop a lower-bounding distance function to speed up the segmentation process. Our segmentation is based on an outlier-robust density-based clustering algorithm. Extensive experiments on diffusion tensor images demonstrate the efficiency and effectiveness of our technique.

Social Networks 3

Thursday, 13 December
10:00 – 12:00
Room: Salle des nations I & II
Session Chair: Feida Zhu

10:00 Topic-aware Social Influence Propagation Models
Nicola Barbieri, Francesco Bonchi, and Giuseppe Manco
DM600

We study social influence from a topic modeling perspective. We introduce novel topic-aware influence-driven propagation models that experiments show to be more accurate in describing real-world cascades than the standard propagation models studied in the literature. In particular, we first propose simple topic-aware extensions of the well-known Independent Cascade and Linear Threshold models. Next, we propose a different approach that explicitly models authoritativeness, influence and relevance from a topic-aware perspective. We devise methods to learn the parameters of the models from a dataset of past propagations. Our experiments confirm the high accuracy of the proposed models and learning schemes.

10:20 Profit Maximization over Social Networks
Wei Lu and Laks V. S. Lakshmanan
DM834

Influence maximization is the problem of finding a set of influential users in a social network such that the expected spread of influence under a certain propagation model is maximized. Much of the previous work has neglected the important distinction between social influence and actual product adoption. However, as recognized in the management science literature, an individual who gets influenced by social acquaintances may not necessarily adopt a product (or technology), due, e.g., to monetary concerns. In this work, we distinguish between influence and adoption by explicitly modeling the states of being influenced and of adopting a product. We extend the classical Linear Threshold (LT) model to incorporate prices and valuations, and factor them into users' decision-making process of adopting a product. We show that the expected profit function under our proposed model maintains submodularity under certain conditions, but no longer exhibits monotonicity, unlike the expected influence spread function. To maximize the expected profit under our extended LT model, we employ an unbudgeted greedy framework to propose three profit maximization algorithms. The results of our detailed experimental study on three real-world datasets demonstrate that of the three algorithms, PAGE, which assigns prices dynamically based on the profit potential of each candidate seed, has the best performance both in the expected profit achieved and in running time.

10:40 Diffusion of Information in Social Networks: Is It All Local?
Ceren Budak, Divyakant Agrawal, and Amr El Abbadi
DM429

Recent studies on the diffusion of information in social networks have largely focused on models based on the influence of local friends. In this paper, we challenge the generalizability of this approach and revive theories introduced by social scientists in the context of the diffusion of innovations to model user behavior. To this end, we study various diffusion models in two different online social networks, Digg and Twitter. We first evaluate the applicability of two representative local influence models and show that the behavior of most social network users is not captured by these local models. Next, driven by theories from diffusion-of-innovations research, we introduce a novel diffusion model called the Gaussian Logit Curve Model (GLCM) that models user behavior with respect to the behavior of the general population. Our analysis shows that GLCM captures user behavior significantly better than local models, especially in the context of Digg. Aiming to capture both local and global signals, we introduce various hybrid models and evaluate them through statistical methods. Our methodology models each user separately, automatically determining which users are driven by their local relations and which users are better defined through adopter categories, thereby capturing the complexity of human behavior.

11:00 Clash of the Contagions: Cooperation and Competition in Information Diffusion
Seth A. Myers and Jure Leskovec
DM908

In networks, contagions such as information, purchasing behaviors, and diseases spread and diffuse from node to node over the edges of the network. Moreover, in real-world scenarios multiple contagions spread through the network simultaneously. These contagions not only propagate at the same time but also interact and compete with each other as they spread over the network. While traditional empirical studies and models of diffusion consider individual contagions as independent, and thus spreading in isolation, we study how different contagions interact with each other as they spread through the network. We develop a statistical model that allows for competition as well as cooperation of different contagions in information diffusion. Competing contagions decrease each other's probability of spreading, while cooperating contagions help each other in being adopted throughout the network. We evaluate our model on 18,000 contagions simultaneously spreading through the Twitter network. Our model learns how different contagions interact with each other and then uses these interactions to more accurately predict the diffusion of a contagion through the network. Moreover, the model also provides a compelling hypothesis for the principles that govern content interaction in information diffusion. Most importantly, we find very strong effects of interactions between contagions: interactions cause a relative change in the spreading probability of a contagion of 71% on average.

11:20 Link Prediction and Recommendation across Heterogeneous Social Networks
Yuxiao Dong, Jie Tang, Sen Wu, Jilei Tian, Nitesh V. Chawla, Jinghai Rao, and Huanhuan Cao
DM838

Link prediction and recommendation is a fundamental problem in social network analysis. The key challenge of link prediction comes from the sparsity of networks due to the strong disproportion between the links that could potentially form and the links that do form. Most previous work tries to solve the problem in a single network; little research has focused on capturing the general principles of link formation across heterogeneous networks. In this work, we give a formal definition of link recommendation across heterogeneous networks. We then propose a ranking factor graph model (RFG) for predicting links in social networks, which effectively improves predictive performance. Motivated by the intuition that people make friends in different networks according to similar principles, we find several social patterns that are general across heterogeneous networks. With these general social patterns, we develop a transfer-based RFG model that combines them with network structure information. This model provides insight into the fundamental principles that drive link formation and network evolution. Finally, we verify the predictive performance of the presented transfer model on 12 pairs of transfer cases. Our experimental results demonstrate that the transfer of general social patterns indeed helps the prediction of links.

11:40 Predicting Links in Multi-Relational and Heterogeneous Networks
Yang Yang, Nitesh V. Chawla, Yizhou Sun, and Jiawei Han
DM955

Link prediction is an important task in network analysis, benefiting researchers and organizations in a variety of fields. Many networks in the real world, for example social networks, are heterogeneous, having multiple types of links and complex dependency structures. Link prediction in such networks must model the influence propagating between heterogeneous relationships to achieve better link prediction performance than in homogeneous networks. In this paper, we introduce Multi-Relational Influence Propagation (MRIP), a novel probabilistic method for heterogeneous networks. We demonstrate that MRIP is useful for predicting links in sparse networks, which present a significant challenge due to the severe disproportion of the number of potential links to the number of real formed links. We also explore some factors that can inform the task of classification yet remain unexplored, such as temporal information. In this paper we make use of the temporal-related features by carefully investigating the issues of feasibility and generality. In accordance with our work in unsupervised learning, we further design an appropriate supervised approach in heterogeneous networks. Our experiments on co-authorship prediction demonstrate the effectiveness of our approach.

Efficient Algorithms 2

Thursday, 13 December
13:30 – 15:30
Room: Salle des nations I & II
Session Chair: Tijl De Bie

13:30 Efficient Learning for Hashing Proportional Data
Zhao Xu, Kristian Kersting, and Christian Bauckhage
DM398

Spectral hashing (SH) seeks compact binary codes of data points so that Hamming distances between codes correlate with data similarity. Quickly learning such codes typically boils down to principal component analysis (PCA). However, this is only justified for normally distributed data. For proportional data (normalized histograms), this is not the case: due to the sum-to-unity constraint, features that are as independent as possible will not all be uncorrelated. In this paper, we show that a linear-time transformation efficiently copes with sum-to-unity constraints: first, we select a small number K of diverse data points by maximizing the volume of the simplex spanned by these prototypes; second, we represent each data point by means of its cosine similarities to the K selected prototypes. This maximum volume hashing is sensible since each dimension in the transformed space is likely to follow a von Mises (vM) distribution, and, in very high dimensions, the vM distribution closely resembles a Gaussian distribution. This justifies employing PCA on the transformed data. Our extensive experiments validate this: maximum volume hashing outperforms spectral hashing and other state-of-the-art techniques.
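
A rough sketch of the two-step transformation as we read it; the greedy residual-norm heuristic for volume maximization below is our own stand-in, and the paper's exact procedure may differ:

    import numpy as np

    def max_volume_prototypes(X, K):
        """Greedily pick K rows of X (n x d) whose simplex has
        approximately maximal volume: each step takes the point with
        the largest residual after projection onto the span of the
        prototypes chosen so far."""
        idx = [int(np.argmax(np.linalg.norm(X, axis=1)))]
        for _ in range(K - 1):
            B = X[idx].T                          # d x |idx|
            proj = B @ np.linalg.pinv(B)          # projector onto span(B)
            resid = np.linalg.norm(X.T - proj @ X.T, axis=0)
            resid[idx] = -1.0                     # exclude chosen points
            idx.append(int(np.argmax(resid)))
        return idx

    def cosine_features(X, prototypes):
        """Represent each point by its cosine similarities to the
        prototypes; standard PCA-based hashing can then be applied."""
        P = X[prototypes]
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        Pn = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-12)
        return Xn @ Pn.T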

13:50 Nested Subtree Hash Kernels for Large-scale Graph Classification over Streams
Bin Li, Xingquan Zhu, Lianhua Chi, and Chengqi Zhang
DM713

Most studies on graph classification focus on designing fast and effective kernels. Several fast subtree kernels achieve linear time complexity w.r.t. the number of edges, under the condition that a common feature space (e.g., a subtree pattern list) is available to represent all graphs. This becomes infeasible when graphs arrive in a stream with rapidly emerging subtree patterns: the graphs in expired chunks cannot be projected onto the unlimitedly expanding feature space again, so computing a kernel matrix for graphs over the entire stream is difficult. This creates a serious problem for graph classification over streams -- different portions of the stream have different feature spaces. In this paper, we aim to enable large-scale graph classification over streams using the classical ensemble learning framework, which requires the data in different chunks to be in the same feature space. To this end, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project the multi-resolution subtree patterns of different chunks onto a set of common low-dimensional feature spaces. We theoretically analyze the derived NSH kernel and obtain a number of favorable properties: (1) the NSH kernel is an unbiased and highly concentrated estimator of the fast subtree kernel; (2) the bound on the convergence rate tightens as the NSH algorithm steps into higher resolutions; (3) the NSH kernel is robust in tolerating concept drift between chunks over a stream. We also empirically test the NSH kernel on both a large-scale synthetic graph dataset and a real-world chemical compound dataset for anticancer activity prediction. The experimental results validate that the NSH kernel is indeed efficient and robust for graph classification over streams.

14:10 Efficient Behavior Targeting Using SVM Ensemble Indexing
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, and Li Guo
DM977

Behavior targeting (BT) is a promising tool for online advertising. The state-of-the-art BT methods, which are mainly based on regression models, have two limitations. First, learning regression models for behavior targeting is difficult since user clicks are typically several orders of magnitude fewer than views. Second, user interests are not fixed, but often transient and influenced by media and pop culture. In this paper, we propose to formulate behavior targeting as a classification problem. Specifically, we propose to use an SVM ensemble for behavior prediction. The challenge of using an SVM ensemble for BT stems from its computational complexity (in our experiments, it takes 53 minutes to predict behavior for 32 million users, which is inadequate for online application). To this end, we propose a fast ensemble SVM prediction framework that builds an indexing structure over the SVM ensemble to achieve sub-linear prediction time complexity. Experimental results on real-world, large-scale behavior targeting data demonstrate that the proposed method is efficient and outperforms existing linear regression based BT models.

14:30 Active Evaluation of Classifiers on Large Datasets
Namit Katariya, Arun Iyer, and Sunita Sarawagi
DM706

The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data, we develop a novel strategy of learning r-bit hash functions to preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that in an interactive setting even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close to exact estimates while reading three orders of magnitude less data.
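
A minimal sketch of the stratified estimator underlying such an approach (the hash-based stratification itself is not shown; the strata sizes and sample counts below are illustrative):

    import numpy as np

    def stratified_accuracy(strata_sizes, sampled_correct, sampled_total):
        """Estimate accuracy on a large unlabeled set from small labeled
        samples drawn per stratum: a population-weighted average of the
        per-stratum sample accuracies."""
        sizes = np.asarray(strata_sizes, dtype=float)
        weights = sizes / sizes.sum()
        per_stratum = np.asarray(sampled_correct) / np.asarray(sampled_total)
        return float(weights @ per_stratum)

    # two strata of 90,000 and 10,000 instances with sample accuracies
    # 0.9 and 0.5 give the overall estimate 0.9*0.9 + 0.1*0.5 = 0.86
    print(stratified_accuracy([90000, 10000], [45, 5], [50, 10]))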

14:50 High Performance Offline & Online Distributed Collaborative Filtering
Ankur Narang, Abhinav Srivastav, and Naga Praveen Kumar Katta
DM929

Big data analytics is a hot research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights with positive impact (for both users and providers) on industries such as e-commerce, telecom, finance and the life sciences. We consider collaborative filtering (CF) and clustering algorithms, key fundamental analytics kernels that help achieve these aims. High-throughput CF and co-clustering on highly sparse and massive datasets, along with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than one minute) distributed co-clustering based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load-balancing optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) as well as the Yahoo! KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy. In online mode, we demonstrate around 3x better performance than a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.

15:10 Distributed Matrix Completion
Christina Teflioudi, Faraz Makari, and Rainer Gemulla
DM743

We discuss parallel and distributed algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run on a small cluster of commodity nodes; even very large problems can be handled effectively in such a setup. Our DALS, ASGD, and DSGD++ algorithms are novel variants of the popular alternating least squares and stochastic gradient descent algorithms; they exploit thread-level parallelism, in-memory processing, and asynchronous communication. We provide some guidance on the asymptotic performance of each algorithm and investigate the performance of both our algorithms and previously proposed MapReduce algorithms in large-scale experiments. We find that DSGD++ outperforms competing methods in terms of overall runtime, memory consumption, and scalability. Using DSGD++, we can factor a matrix with 10 billion entries on 16 compute nodes in around 40 minutes.
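
As a single-node reference point for the SGD family (not DSGD++ itself, which additionally partitions the matrix into blocks so that threads and nodes can update disjoint blocks in parallel), plain SGD over the revealed entries looks roughly like this:

    import numpy as np

    def sgd_mf(entries, n_rows, n_cols, rank=10, lr=0.01, reg=0.05,
               epochs=20):
        """Factor a sparse matrix given as (i, j, value) triples into
        L @ R.T by stochastic gradient descent with L2 regularization."""
        rng = np.random.default_rng(0)
        L = rng.normal(scale=0.1, size=(n_rows, rank))
        R = rng.normal(scale=0.1, size=(n_cols, rank))
        for _ in range(epochs):
            for i, j, v in entries:
                err = v - L[i] @ R[j]
                # tuple assignment so both updates use the old factors
                L[i], R[j] = (L[i] + lr * (err * R[j] - reg * L[i]),
                              R[j] + lr * (err * L[i] - reg * R[j]))
        return L, R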

Feature Selection and Text Mining

Thursday, 13 December
13:30 – 15:30
Room: Alto & Mezzo & Tempo
Session Chair: Latifur Khan

13:30 Dimensionality Reduction on Heterogeneous Feature Space
Xiaoxiao Shi and Philip S. Yu
DM244

Combining correlated data sources may help improve the learning performance of a given task. For example, in recommendation problems, one can combine (1) a user profile database (e.g., genders, ages, etc.), (2) users' log data (e.g., click-through data, purchasing records, etc.), and (3) users' social network (useful in social targeting) to build a recommendation model. All these data sources provide informative but heterogeneous features. For instance, a user profile database usually has nominal features reflecting users' backgrounds, log data provides term-based features about users' historical behaviors, and a social network database has graph relational features. Given multiple heterogeneous data sources, one important challenge is to find a unified feature subspace that captures the knowledge from all sources. To this end, we propose the principle of collective component analysis (CoCA), in order to handle dimensionality reduction across a mixture of vector-based features and graph relational features. The CoCA principle is to find a feature subspace with maximal variance under two constraints. First, there should be consensus among the projections from different feature spaces. Second, the similarity between connected data (in any of the network databases) should be maximized. The optimal solution is obtained by solving an eigenvalue problem. Moreover, we discuss how to use prior knowledge to distinguish informative data sources and optimally weight them in CoCA. Since no previous model can be directly applied to solve this problem, we devised a straightforward comparison method that performs dimension reduction on the concatenation of the data sources. Three sets of experiments show that CoCA substantially outperforms the comparison method.

13:50 Isometric multi-manifold learning for feature extraction
Mingyu Fan, Hong Qiao, Bo Zhang, and Xiaoqin Zhang
DM620

Manifold learning is an important topic in pattern recognition and computer vision. However, most manifold learning algorithms implicitly assume the data are aligned on a single manifold, which is too strict in actual applications. Isometric feature mapping (Isomap), a promising manifold learning method, fails on data distributed in clusters within a single manifold or across multiple manifolds. In this paper, we propose a new multi-manifold learning algorithm (M-Isomap). The algorithm first discovers the data manifolds and then reduces the dimensionality of each manifold separately. Meanwhile, a skeleton representing the global structure of the whole data set is built and kept in the low-dimensional space. Secondly, by referring to the low-dimensional representation of the skeleton, the embeddings of the manifolds are relocated into a global coordinate system. Compared with previous methods, the proposed algorithm faithfully preserves both intra- and inter-manifold geodesics. The features and effectiveness of the proposed multi-manifold learning algorithm are demonstrated and compared through experiments.

14:10 Feature Weighting and Selection Using Hypothesis Margin of Boosting
Malak Alshawabkeh, Javed A. Aslam, Jennifer G. Dy, and David Kaeli
DM877

Utilizing the concept of hypothesis margins to measure the quality of a set of features has been a growing line of research over the last decade. However, most previous algorithms, such as Simba, were developed under the large hypothesis margin principle of the 1-NN algorithm. Little attention has been paid so far to exploiting the hypothesis margins of boosting to evaluate features. Boosting is well known to maximize the training examples' hypothesis margins, in particular the average margin, which is the first statistic that considers the whole margin distribution. In this paper, we describe how to utilize the training examples' mean margins of boosting to select features. A weight criterion, termed Margin Fraction (MF), is assigned to each feature that contributes to the average margin distribution combined in the final output produced by boosting. Applying the idea of MF to a sequential backward selection method, a new embedded selection algorithm is proposed, called SBS-MF. Experiments on several data sets compare the proposed SBS-MF with two boosting-based feature selection approaches, as well as with Simba. The results show that SBS-MF is effective in most cases.

14:30 An Approach to Evaluate the Local Completeness of Event Logs
Hedong Yang, Lijie Wen, and Jianmin Wang
DM232

Process mining links traditional model-driven business process management and data mining by deriving knowledge from event logs to improve operational business processes. Since it affects the quality of process mining results, the degree of completeness of a given event log should be measured. In this paper, an approach is proposed, in the context of mining control-flow dependencies, to evaluate the local completeness of an event log without knowing any information about the original process model. Experimental results show that the proposed approach works robustly and gives better estimates than available approaches.

14:42 Cross-Language Opinion Target Extraction in Review Texts
Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao
DM512

Opinion target extraction is a subtask of opinion mining that is very useful in many applications. In this study, we investigate the problem in a cross-language scenario, leveraging rich labeled data in a source language for opinion target extraction in a different target language. An English labeled corpus is used as the training set. We generate two Chinese training datasets with different features, and learn two labeling models for Chinese opinion target extraction based on Conditional Random Fields (CRFs). After that, we use a monolingual co-training algorithm to improve the performance of both models by leveraging the enormous amount of unlabeled Chinese review text on the web. Experimental results show the effectiveness of our proposed approach.

14:54 Inductive Model Generation for Text Categorization using a Bipartite Heterogeneous Network
Rafael Geraldeli Rossi, Thiago de Paulo Faleiros, Alneu de Andrade Lopes, and Solange Oliveira Rezende
DM868

Usually, algorithms for the categorization of numeric data have been applied to text categorization after a preprocessing phase that assigns weights to textual terms treated as attributes. However, due to the characteristics of textual data, some algorithms for data categorization are not efficient for text categorization. Characteristics of textual data such as sparsity and high dimensionality can impair the quality of general-purpose classifiers. Here, we propose a text classifier based on a bipartite heterogeneous network used to represent textual document collections. The algorithm induces a classification model by assigning weights to objects that represent the terms of the textual document collection. The induced weights correspond to the influence of the terms on the classification of the documents in which they appear. The least-mean-square algorithm is used in the inductive process. Empirical evaluation on a large number of textual document collections shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM and Naïve Bayes algorithms.

15:06 Learning to Refine an Automatically Extracted Knowledge Base using Markov Logic
Shangpu Jiang, Daniel Lowd, and Dejing Dou
DM973

A number of text mining and information extraction projects such as TextRunner and NELL seek to automatically build knowledge bases from the rapidly growing amount of information on the web. In order to scale to the size of the web, these projects often employ ad hoc heuristics to reason about uncertain and contradictory information rather than reasoning jointly about all candidate facts. In this paper, we present a Markov logic-based system for cleaning an extracted knowledge base. This allows a scalable system such as NELL to take advantage of joint probabilistic inference, or, conversely, allows Markov logic to be applied to a web-scale problem. Our system uses only the ontological constraints and confidence values of the original system, along with human-labeled data if available. The labeled data can be used to calibrate the confidence scores from the original system or to learn the effectiveness of individual extraction patterns. To achieve scalability, we introduce a neighborhood grounding method that instantiates only the part of the network most relevant to the given query. This allows us to partition the knowledge cleaning task into tractable pieces that can be solved individually. In experiments on NELL's knowledge base, we evaluate several variants of our approach and find that they improve both F1 and the area under the precision-recall curve.

15:18 Semantic Aspect Discovery For Online Reviews
Md. Hijbul Alam and SangKeun Lee
DM559

The number of opinions and reviews about different products and services is growing online. Users frequently look for important aspects of a product or service in the reviews. Usually, they are interested in semantic (i.e., sentiment-oriented) aspects. However, extracting semantic aspects with supervised methods is very expensive. We propose a domain-independent unsupervised model to extract semantic aspects, and conduct qualitative and quantitative experiments to evaluate the extracted aspects. The experiments show that our model effectively extracts semantic aspects with correlated top words. In addition, an evaluation of aspect sentiment classification shows that our model outperforms other models by 5-7% in terms of macro-averaged F1.

Matrices, Graphs and Metrics

Thursday, 13 December
13:30 – 15:30
Room: Rembrandt & Permeke
Session Chair: Jan Ramon

13:30 Bounded Matrix Low Rank Approximation
Ramakrishnan Kannan, Mariya Ishteva, and Haesun Park
DM861

Low-rank matrix approximations such as nonnegative matrix factorization (NMF) have been successfully used to solve many data mining tasks. In this paper, we propose a new low-rank approximation called Bounded Matrix Low Rank Approximation (BMA), which imposes a lower and an upper bound on every element of the low-rank matrix that best approximates a given matrix with missing elements. This new approximation models many real-world problems, such as recommender systems, and performs better than alternatives such as the singular value decomposition (SVD) or NMF. We present an efficient algorithm for solving BMA based on coordinate descent. BMA differs from NMF in that it imposes bounds on the approximation itself rather than on each of the low-rank factors. We show that our algorithm is scalable for large matrices with missing elements on multi-core systems with low memory. We present substantial experimental results illustrating that the proposed method outperforms state-of-the-art algorithms for recommender systems, such as Stochastic Gradient Descent, Alternating Least Squares with regularization, SVD++ and Bias-SVD, on real-world data sets such as Jester, MovieLens, Book-Crossing, Online Dating and Netflix.
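
A toy sketch of the bounded-approximation idea, using a soft penalty on bound violations of the reconstruction; the paper instead enforces the bounds exactly within a coordinate descent scheme:

    import numpy as np

    def bounded_mf(M, mask, lo, hi, rank=5, lr=0.005, mu=1.0, iters=500):
        """Gradient method fitting observed entries of M (mask == 1)
        with a low-rank product W @ H while penalizing entries of W @ H
        that leave the interval [lo, hi]."""
        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(M.shape[0], rank))
        H = rng.normal(scale=0.1, size=(rank, M.shape[1]))
        for _ in range(iters):
            P = W @ H
            E = mask * (P - M)                                 # data error
            V = np.maximum(P - hi, 0) + np.minimum(P - lo, 0)  # violations
            G = E + mu * V
            dW, dH = G @ H.T, W.T @ G
            W -= lr * dW
            H -= lr * dH
        return W, H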

13:50 Robust Matrix Completion via Joint Schatten p-Norm and lp-Norm Minimization
Feiping Nie, Hua Wang, Xiao Cai, Heng Huang, and Chris Ding
DM943

The low-rank matrix completion problem is a fundamental machine learning problem with many important applications. Standard low-rank matrix completion methods relax the rank minimization problem via trace norm minimization. However, this relaxation may make the solution deviate seriously from the original solution. Meanwhile, most completion methods minimize the squared prediction errors on the observed entries, which is sensitive to outliers. In this paper, we propose a new robust matrix completion method to address these two problems. The joint Schatten p-norm and lp-norm are used to better approximate the rank minimization problem and to enhance robustness to outliers. Extensive experiments are performed on both synthetic data and real-world applications in collaborative filtering and social network link prediction. All empirical results show that our new method outperforms standard matrix completion methods.
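
In our reading, the joint objective has roughly the following shape (our notation; Ω indexes the observed entries of the input matrix M, and λ trades off the two terms):

    \min_X \; \|X\|_{S_p}^p + \lambda \sum_{(i,j) \in \Omega} |X_{ij} - M_{ij}|^p,
    \qquad \|X\|_{S_p}^p = \sum_k \sigma_k(X)^p

where σ_k(X) are the singular values of X. For p < 1 the Schatten p-norm approximates the rank more tightly than the trace norm (the p = 1 case), and the lp-norm loss on the residuals down-weights outlying observations.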

14:10 Sequential Alternating Proximal Method for Scalable Sparse Structural SVMs
P. Balamurugan, Shirish Shevade, and T. Ravindra Babu
DM644

Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects such as parse trees, image segments and part-of-speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameter vectors residing in a large-dimensional feature space. Such high-dimensional parameter vectors contain many non-zero components, which often leads to slow prediction and storage issues. Hence there is a need for sparse parameter vectors that contain a very small number of non-zero components. The L1-regularizer and the elastic net regularizer have traditionally been used to obtain sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of the elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving comparable generalization performance. Hence the proposed algorithm is a competitive way to obtain sparse model parameters and comparable generalization performance when elastic net regularized structural SVMs are applied to very large datasets.

14:30 A New Anomaly Detection Algorithm based on Quantum Mechanics
Hao Huang, Hong Qin, Shinjae Yoo, and Dantong Yu
DM213

This paper makes, to our knowledge, the first attempt to apply quantum mechanics theory to anomaly (outlier) detection in high-dimensional datasets for data mining. We propose the Fermi Density Descriptor (FDD), which represents the probability of measuring a fermion at a specific location, for anomaly detection. We also quantify and examine different Laplacian normalization effects and choose the best one for anomaly detection. Both theoretical proof and quantitative experiments demonstrate that our proposed FDD is substantially more discriminative and robust than commonly used algorithms.

14:42 A Semi-Definite Positive Linear Discriminant Analysis and its Applications
Deguang Kong and Chris Ding
DM855

Linear Discriminant Analysis (LDA) is widely used for dimensionality reduction in classification tasks. However, the standard LDA formulation is not semi-definite positive (s.d.p.), and thus it is difficult to obtain the global optimal solution when it is combined with other loss functions or with graph embedding. In this paper, we present an alternative approach to LDA. We rewrite the LDA criterion as a convex formulation (semi-definite positive LDA, i.e., sdpLDA) using the largest eigenvalue of the generalized eigenvalue problem of standard LDA. We give applications by incorporating sdpLDA as a regularization term into discriminant regression analysis. Another application incorporates sdpLDA into standard Laplacian embedding, which uses the supervised information to improve Laplacian embedding performance. The proposed sdpLDA formulation can be used for multi-class classification tasks. Extensive experimental results on 10 multi-class datasets indicate the promise of the proposed method.

14:54 Towards Active Learning on Graphs: An Error Bound Minimization Approach
Quanquan Gu and Jiawei Han
DM361

Active learning on graphs has received increasing interest in the past years. In this paper, we propose a nonadaptive active learning approach on graphs, based on generalization error bound minimization. In particular, we present a data-dependent error bound for a graph-based learning method, namely learning with local and global consistency (LLGC). We show that the empirical transductive Rademacher complexity of the function class for LLGC provides a natural criterion for active learning. The resulting active learning approach is to select a subset of nodes on a graph such that the empirical transductive Rademacher complexity of LLGC is minimized. We propose a simple yet effective sequential optimization algorithm to solve it. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art active learning methods on graphs.
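
For reference, LLGC (Zhou et al.) admits the closed-form solution

    F^* = (1 - \alpha)\,(I - \alpha S)^{-1} Y, \qquad S = D^{-1/2} W D^{-1/2}

where W is the graph affinity matrix, D its degree matrix, Y the label matrix, and α ∈ (0, 1) balances local consistency against fitting the given labels; the error bound in the paper is stated for this learner.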

15:06 Rough Set Subspace Error-Correcting Output Codes
Mohammad Ali Bagheri, Qigang Gao, and Sergio Escalera
DM793

Among the methods proposed to deal with multi-class classification problems, Error-Correcting Output Codes (ECOC) represent a powerful framework. The key factor in designing any ECOC matrix is the independence of the binary classifiers, without which the ECOC method would be ineffective. This paper proposes an efficient new approach to the ECOC framework that improves independence among classifiers. The underlying rationale for our work is that we design a three-dimensional code matrix, where the third dimension is the feature space of the problem domain. Using rough-set-based feature selection, a new algorithm, named "Rough Set Subspace ECOC (RSS-ECOC)", is proposed. We introduce the Quick Multiple Reduct algorithm to generate a set of reducts for a binary problem, where each reduct is used to train a dichotomizer. In addition to creating more independent classifiers, ECOC matrices with longer codes can be built. The numerical experiments in this study compare the classification accuracy of the proposed RSS-ECOC with the classical ECOC, one-versus-one, and one-versus-all methods on 24 UCI datasets. The results show that the proposed technique increases classification accuracy in comparison with state-of-the-art coding methods.

15:18 Metric Learning From Relative Comparisons by Minimizing Squared Residual
Eric Yi Liu, Zhishan Guo, Xiang Zhang, Vladimir Jojic, and Wei Wang
DM357

Recent studies [1]-[5] have suggested using constraints in the form of relative distance comparisons to represent domain knowledge: d(a,b) < d(c,d), where d(·) is the distance function and a, b, c, d are data objects. Such constraints are readily available in many problems where pairwise constraints are not natural to obtain. In this paper we consider the problem of learning a Mahalanobis distance metric from supervision in the form of relative distance comparisons. We propose a simple yet effective algorithm that minimizes a convex objective function corresponding to the sum of squared residuals of the constraints. We also extend our model and algorithm to promote sparsity in the learned metric matrix. Experimental results suggest that our method consistently outperforms existing methods in terms of clustering accuracy. Furthermore, the sparsity extension leads to more stable estimation when the dimension is high and only a small amount of supervision is given.
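
One plausible reading of such an objective, sketched under our own assumptions of squared Mahalanobis distances, squared hinge residuals, and projected gradient descent onto the PSD cone (the paper's exact objective and solver may differ):

    import numpy as np

    def learn_metric(constraints, dim, lr=1e-3, iters=500):
        """constraints: list of (a, b, c, d) arrays with d(a,b) < d(c,d) desired.
        Returns a positive semi-definite metric matrix M."""
        M = np.eye(dim)
        for _ in range(iters):
            grad = np.zeros((dim, dim))
            for a, b, c, d in constraints:
                u, v = a - b, c - d
                r = u @ M @ u - v @ M @ v          # residual; violated if r > 0
                if r > 0:
                    grad += 2 * r * (np.outer(u, u) - np.outer(v, v))
            M -= lr * grad
            w, V = np.linalg.eigh(M)               # project back onto the PSD cone
            M = (V * np.maximum(w, 0)) @ V.T
        return M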

Dynamics

Thursday, 13 December
16:00 – 18:00
Room: Salle des nations I
Session Chair: Nikolaj Tatti

16:00 Efficient Episode Mining of Dynamic Event Streams
Debprakash Patnaik, Srivatsan Laxman, Badrish Chandramouli, and Naren Ramakrishnan
DM477

Discovering frequent episodes over event sequences is an important data mining problem. Existing methods typically require multiple passes over the data, rendering them unsuitable for streaming contexts. We present the first streaming algorithm for mining frequent episodes over a window of recent events in the stream. We derive approximation guarantees for our algorithm in terms of (i) the separation of frequent episodes from infrequent ones, and (ii) the rate of change of stream characteristics. Our parameterization of the problem provides a new sweet spot in the trade-off between making distributional assumptions over the stream and the algorithmic efficiency of mining. We illustrate how this yields significant benefits when mining practical streams from neuroscience and telecommunications logs.

16:20 Dynamic Boolean Matrix Factorizations
Pauli Miettinen
DM880

Boolean matrix factorization is a method to decompose a binary matrix into two binary factor matrices. Akin to other matrix factorizations, the factor matrices can be used for various data analysis tasks. Many (if not most) real-world data sets are dynamic, though, meaning that new information is recorded over time. Incorporating this new information can require re-computing the factorization from scratch -- something we cannot afford if we want to keep the factorization up-to-date after each update. This paper proposes a method to dynamically update the Boolean matrix factorization when new data is added to the database. The method is extended with a mechanism that improves the factorization at a trade-off in computation speed. The method is tested on a number of real-world and synthetic data sets, including a study of its efficiency against offline methods. The results show that, with good initialization, the proposed online and dynamic methods can beat state-of-the-art offline Boolean matrix factorization algorithms.
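
As a hedged illustration of one dynamic-update step, the sketch below greedily assigns a newly arrived binary row to a Boolean OR of fixed basis rows so as to minimize Hamming error; the paper's full method also revises the basis itself:

    import numpy as np

    def assign_row(v, B):
        """v: new binary row (m,); B: binary basis (k, m). Returns usage vector."""
        k = B.shape[0]
        use = np.zeros(k, dtype=bool)
        cover = np.zeros(v.shape, dtype=bool)
        best_err = np.sum(v.astype(bool) != cover)
        improved = True
        while improved:
            improved = False
            for i in range(k):
                if use[i]:
                    continue
                cand = cover | B[i].astype(bool)          # Boolean OR with row i
                err = np.sum(v.astype(bool) != cand)      # Hamming error
                if err < best_err:
                    best_err, best_i, improved = err, i, True
            if improved:
                use[best_i] = True
                cover |= B[best_i].astype(bool)
        return use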

16:40 Local and Global Algorithms for Learning Dynamic Bayesian Networks
Nguyen Xuan Vinh, Madhu Chetty, Ross Coppel, and Pramod P. Wangikar
DM230

Learning optimal Bayesian networks (BN) from data is NP-hard in general. Nevertheless, certain BN classes with additional topological constraints, such as the dynamic BN (DBN) models widely applied in fields such as systems biology, can be learned in polynomial time. Such algorithms have been developed for the Bayesian-Dirichlet (BD), Minimum Description Length (MDL), and Mutual Information Test (MIT) scoring metrics. The BD-based algorithm admits a large polynomial bound, hence it is impractical for even modestly sized networks. The MDL- and MIT-based algorithms admit much smaller bounds, but require the very restrictive assumption that all variables have the same cardinality, significantly limiting their applicability. In this paper, we first propose an improvement to the MDL- and MIT-based algorithms that drops the equicardinality constraint, significantly enhancing their generality. We also explore local Markov blanket based algorithms for constructing BNs in the context of DBNs, and show an interesting result: under the faithfulness assumption, the mutual information test based local Markov blanket algorithms yield the same network as the global optimization MIT-based algorithm. Experimental validation on small and large scale genetic networks demonstrates the effectiveness of our proposed approaches.
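
The tractability of DBN learning rests on per-node decomposability: parents live in the previous time slice, so no acyclicity constraint couples the nodes and each parent set can be scored independently. A hedged sketch with a simple BIC/MDL-style score over discrete data, bounded parent sets keeping the search polynomial (the paper's MIT score is not reproduced):

    import numpy as np
    from itertools import combinations

    def bic_score(child, parents, data_prev, data_next):
        """data_prev/data_next: (T, n) int arrays of states at times t and t+1."""
        y = data_next[:, child]
        r = int(y.max()) + 1
        keys = np.zeros(len(y), dtype=np.int64)
        if parents:
            for col in data_prev[:, list(parents)].T:   # encode parent configs
                keys = keys * (int(col.max()) + 1) + col
        ll = 0.0
        for key in np.unique(keys):
            counts = np.bincount(y[keys == key], minlength=r).astype(float)
            nz = counts > 0
            ll += float((counts[nz] * np.log(counts[nz] / counts.sum())).sum())
        q = len(np.unique(keys))
        return ll - 0.5 * np.log(len(y)) * (r - 1) * q  # BIC complexity penalty

    def best_parents(child, n_vars, data_prev, data_next, max_parents=2):
        cands = [()] + [c for p in range(1, max_parents + 1)
                        for c in combinations(range(n_vars), p)]
        return max(cands, key=lambda ps: bic_score(child, ps, data_prev, data_next))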

17:00 Dynamic Multi-Relational Chinese Restaurant Process for Analyzing Influences on Users in Social Media
Himabindu Lakkaraju, Indrajit Bhattacharya, and Chiranjib Bhattacharyya
DM829

We study the problem of analyzing the influence of various factors affecting individual messages posted in social media. The problem is challenging because various types of influences propagate through the social media network and act simultaneously on any user. Additionally, the topic composition of the influencing factors and the susceptibility of users to these influences evolve over time. This problem has not been studied before, and off-the-shelf models are unsuitable for this purpose. To capture the complex interplay of these various factors, we propose a new non-parametric model called the Dynamic Multi-Relational Chinese Restaurant Process. This model accounts for the user network in data generation and also allows the parameters to evolve over time. Designing inference algorithms for this model suited to large-scale social media data is another challenge. To this end, we propose a scalable, multi-threaded inference algorithm based on online Gibbs sampling. Extensive evaluations on large-scale Twitter and Facebook data show that the extracted topics, when applied to authorship and commenting prediction, outperform state-of-the-art baselines. More importantly, our model produces valuable insights on topic trends and user personality trends beyond the capability of existing approaches.

17:20 Time Constrained Influence Maximization in Social Networks
Bo Liu, Gao Cong, Dong Xu, and Yifeng Zeng
DM382

Influence maximization is a fundamental research problem in social networks. Viral marketing, one of its applications, aims to get a small number of users to adopt a product, which subsequently triggers a large cascade of further adoptions by exploiting the "word-of-mouth" effect in social networks. The influence maximization problem has been studied extensively, but none of the previous work considers a time constraint. In this paper, we propose the time constrained influence maximization problem. We show that the problem is NP-hard, and prove the monotonicity and submodularity of the time constrained influence spread function. Based on this, we develop a greedy algorithm with performance guarantees. To improve scalability, we propose two Influence Spreading Path based methods. Extensive experiments conducted over four publicly available datasets demonstrate the efficiency and effectiveness of the Influence Spreading Path based methods.
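
A hedged sketch of the greedy baseline with a deadline T, estimating time-constrained influence spread by Monte Carlo simulation of independent cascades that stop after T steps (the graph encoding is our assumption; the faster Influence Spreading Path methods are not shown):

    import random

    def spread(graph, seeds, T, runs=1000):
        """graph: {u: [(v, p), ...]}; estimated spread within T steps."""
        total = 0
        for _ in range(runs):
            active, frontier = set(seeds), set(seeds)
            for _ in range(T):
                nxt = set()
                for u in frontier:                     # each new node tries once
                    for v, p in graph.get(u, []):
                        if v not in active and random.random() < p:
                            nxt.add(v)
                active |= nxt
                frontier = nxt
                if not frontier:
                    break
            total += len(active)
        return total / runs

    def greedy(graph, k, T):
        seeds = []
        for _ in range(k):                             # marginal-gain greedy
            cand = set(graph) - set(seeds)
            seeds.append(max(cand, key=lambda v: spread(graph, seeds + [v], T)))
        return seeds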

17:40 Computational Television Advertising
Suhrid Balakrishnan, Sumit Chopra, David Applegate, and Simon Urbanek
DM923

Ever wonder why that Kia ad ran during Iron Chef? Traditional advertising methodology on television is a fascinating mix of marketing, branding, measurement, and predictive modeling. While still a robust business, it is at risk with the recent growth of online and time-shifted (recorded) television. A particular issue is that traditional methods for television advertising are far less efficient than their counterparts in the online world, which employ highly sophisticated computational techniques. This paper formalizes an approach to eliminate some of these inefficiencies by recasting television advertising media campaign generation in a computational framework. We describe efficient mathematical approaches for finding optimal campaigns for specific target audiences. In two case studies, our campaigns report gains in key operational metrics of up to 56% compared to campaigns generated by traditional methods.

Sequences and Mobility

Thursday, 13 December
16:00 – 18:00
Room: Salle des nations II
Session Chair: Li Xiaoli

16:00 Online Induction of Probabilistic Real Time Automata
Jana Schmidt and Stefan Kramer
DM342

Probabilistic real time automata (PRTAs) are a representation of dynamic processes arising in the sciences and industry. Currently, the induction of automata is divided into two steps: the creation of the prefix tree acceptor (PTA) and a merge procedure based on clustering of the states. These two steps can be very time intensive when a PRTA is to be induced for massive or even unbounded data sets. The latter step can be handled efficiently, as scalable online clustering algorithms exist; the creation of the PTA, however, can still be very time consuming. To overcome this problem, we propose a genuine online PRTA induction approach that incorporates new instances by first collapsing them and then using a maximum frequent pattern based clustering. The approach is tested against a predefined synthetic automaton and on real-world data sets, on which it proves scalable and stable. Moreover, we present a broad evaluation on a real-world disease group data set that shows the applicability of such a model to the analysis of medical processes.

16:20 Online Maritime Abnormality Detection using Gaussian Processes and Extreme Value Theory
Mark Smith, Stephen Reece, Stephen Roberts, and Iead Rezek
DM755

Novelty, or abnormality, detection aims to identify patterns within data streams that do not conform to expected behaviour. This paper introduces a novelty detection technique that combines Gaussian processes and extreme value theory to identify anomalous behaviour in streaming data. The proposed combination of continuous and count stochastic processes is a principled approach to dynamic extreme value modeling that accounts for the dynamics in the time series, the streaming nature of its observation, and its sampling process. The approach is tested on both synthetic and real data, showing itself to be effective in our primary application of maritime vessel track analysis.
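
A loose sketch of the combination, under our own simplifying assumptions: a Gaussian process models expected behaviour, and a generalized extreme value fit on block maxima of standardized residuals sets the novelty threshold (the paper couples these components far more tightly than this):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from scipy.stats import genextreme

    def evt_threshold(residuals, block=50, q=0.99):
        """Fit a GEV to block maxima of |standardized residuals|."""
        m = len(residuals) // block
        maxima = np.abs(residuals[:m * block]).reshape(m, block).max(axis=1)
        c, loc, scale = genextreme.fit(maxima)
        return genextreme.ppf(q, c, loc=loc, scale=scale)

    def flag_anomalies(t_train, y_train, t_new, y_new, block=50):
        gp = GaussianProcessRegressor().fit(t_train.reshape(-1, 1), y_train)
        mu, sd = gp.predict(t_new.reshape(-1, 1), return_std=True)
        z = (y_new - mu) / np.maximum(sd, 1e-9)   # standardized residuals
        return np.abs(z) > evt_threshold(z, block)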

16:40 Utilizing Real-World Transportation Data for Accurate Traffic Prediction
Bei Pan, Ugur Demiryurek, and Cyrus Shahabi
DM279

For the first time, real-time high-fidelity spatiotemporal data on the transportation networks of major cities have become available. This gold mine of data can be used to learn about traffic behavior at different times and locations, potentially resulting in major savings in time and fuel, two important commodities of the 21st century. As a first step towards utilizing this data, we study real-world data collected from the Los Angeles County transportation network and incorporate the data's intrinsic behavior into a time-series mining technique to enhance its accuracy for traffic prediction. In particular, we utilize the spatiotemporal behavior of rush hours and events to perform more accurate prediction of both short-term and long-term average speed on road segments, even in the presence of infrequent events (e.g., accidents). Our results show that by taking historical rush-hour behavior into account we can improve the accuracy of traditional predictors by up to 67% and 78% in short-term and long-term predictions, respectively. Moreover, incorporating the impact of an accident can improve prediction accuracy by up to 91%.

17:00 Granger Causality for Time-Series Anomaly Detection
Huida Qiu, Yan Liu, Niranjan A Subrahmanya, and Weichang Li
DM274

Recent developments in industrial systems provide us with a large amount of time series data from sensors, logs, system settings, and physical measurements. These data are extremely valuable for providing insights about complex systems and could be used to detect anomalies at early stages. However, the special characteristics of these time series, such as high dimensionality and complex dependencies between variables, as well as their massive volume, pose great challenges to existing anomaly detection algorithms. In this paper, we propose Granger graphical models as an effective and scalable approach to anomaly detection whose results can be readily interpreted. Specifically, Granger graphical models are a family of graphical models that exploit the temporal dependencies between variables by applying L1-regularized learning to Granger causality. Our goal is to efficiently compute a robust "correlation anomaly" score for each variable via Granger graphical models that can provide insights into the possible causes of anomalies. We evaluate the effectiveness of our proposed algorithms on both synthetic and application datasets. The results show that the proposed algorithm achieves significantly better performance than baseline algorithms and is scalable for large-scale applications.
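
A minimal sketch of the L1-regularized Granger step: regress each series on the lagged values of all series, so the series receiving nonzero weight become its parents in the Granger graphical model (an illustrative sketch; the anomaly scoring built on top of the learned graphs is not shown, and alpha/lag are hypothetical settings):

    import numpy as np
    from sklearn.linear_model import Lasso

    def granger_parents(X, target, lag=3, alpha=0.05):
        """X: (T, n) multivariate series. Returns indices of series whose
        lagged values get nonzero weight when predicting X[:, target]."""
        T, n = X.shape
        Z = np.hstack([X[lag - l - 1:T - l - 1] for l in range(lag)])  # lagged design
        y = X[lag:, target]
        model = Lasso(alpha=alpha).fit(Z, y)
        W = model.coef_.reshape(lag, n)                # per-lag weight blocks
        return np.where(np.abs(W).sum(axis=0) > 1e-8)[0]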

17:12 Progressive Mining of Transition Dynamics for Autonomous Control
Steven Loscalzo, Robert Wright, Kevin Acunto, and Lei Yu
DM760

Autonomous agents are emerging in diverse areas, and many rely on reinforcement learning (RL) to learn optimal control policies by acting in the environment. This form of learning generates large amounts of transition dynamics data, which can be mined to improve the agent's understanding of the environment. There could be many uses for this data; here we focus on mining it to identify a relevant feature subspace. This is vital since RL performs poorly in high-dimensional spaces, such as those autonomous agents commonly face in real-world problems. This paper demonstrates the necessity and feasibility of integrating data mining into the learning process while an agent is learning, enabling it to learn to act by both acting and understanding. Doing so requires overcoming challenges regarding data quantity and quality, and the difficulty of measuring feature relevance with respect to the control policy. We propose the progressive mining framework to address these challenges by relying on a cyclic interaction between data mining and RL. We show that a feature selection algorithm developed under this framework, PROFESS, can improve RL scalability better than a competing approach.

17:24 Socialized Gaussian Process Model for Human Behavior Prediction in a Health Social Network
Yelong Shen, Ruoming Jin, Dejing Dou, Nafisa Chowdhury, Junfeng Sun, Brigitta Piniewski, and David Kil
DM927

Modeling and predicting human behaviors, such as activity level and intensity, is key to preventing cascades of obesity and to helping spread wellness and healthy behavior in a social network. In this work, we propose a Socialized Gaussian Process (SGP) for socialized human behavior modeling. The SGP naturally incorporates a personal behavior factor and a social correlation factor into a unified model: a basic Gaussian process captures each individual's personal behavior pattern, and the extension to the SGP captures social correlation phenomena in the social network. A detailed experimental evaluation shows that the SGP model achieves the best prediction accuracy compared with baseline methods.

17:36 Mining Personal Context-Aware Preferences for Mobile Users
Hengshu Zhu, Enhong Chen, Kuifei Yu, Huanhuan Cao, Hui Xiong, and Jilei Tian
DM419

In this paper, we illustrate how to extract personal context-aware preferences from the context-rich device logs (i.e., context logs) for building novel personalized context-aware recommender systems. A critical challenge along this line is that the context log of each individual user may not contain sufficient data for mining his/her context-aware preferences. Therefore, we propose to first learn common context-aware preferences from the context logs of many users. Then, the preference of each user can be represented as a distribution of these common context-aware preferences. Specifically, we develop two approaches for mining common context-aware preferences based on two different assumptions, namely, context independent and context dependent assumptions, which can fit into different application scenarios. Finally, extensive experiments on a real-world data set show that both approaches are effective and outperform baselines with respect to mining personal context-aware preferences for mobile users.
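
As a hedged illustration of the shared-preference idea (not the paper's two approaches), one can pool the users' context logs, learn K common latent preferences with an off-the-shelf topic model, and represent each user as a distribution over them; the string encoding of context-activity pairs below is hypothetical:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    def user_preferences(user_logs, k=20):
        """user_logs: one string per user, e.g. 'evening:music morning:news ...'"""
        counts = CountVectorizer(token_pattern=r"\S+").fit_transform(user_logs)
        lda = LatentDirichletAllocation(n_components=k).fit(counts)
        return lda.transform(counts)   # row u: user u's mix of common preferences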

17:48 Topic Models Over Spoken Language
Niketan Pansare, Chris Jermaine, Peter Haas, and Nitendra Rajput
DM894

Virtually all work on topic modeling has assumed that the topics are to be learned over a text-based document corpus. However, there exist important applications where topic models must be learned over an audio corpus of spoken language. Unfortunately, speech-to-text programs can have very low accuracy. We therefore propose a novel topic model for spoken language that incorporates a statistical model of speech-to-text software behavior. Crucially, our model exploits the uncertainty numbers returned by the software. Our ideas apply to any domain in which it would be useful to build a topic model over data in which uncertainties are explicitly represented.

Graphs and Networks

Thursday, 13 December
16:00 – 18:00
Room: Rembrandt & Permeke
Session Chair: George Karypis

16:00 CT-IC: Continuously activated and Time-restricted Independent Cascade Model for Viral Marketing
Wonyeol Lee, Jinha Kim, and Hwanjo Yu
DM832

The influence maximization problem, with applications to viral marketing, has gained much attention. The underlying influence diffusion model affects which nodes maximize influence, because different models focus on different aspects of influence diffusion. Nevertheless, existing diffusion models overlook two important aspects of real-world marketing: continuous trials and time restriction. This paper proposes a new, more realistic influence diffusion model called the Continuously activated and Time-restricted IC (CT-IC) model, which generalizes the IC model by embedding these two aspects. We first prove that the CT-IC model satisfies two crucial properties -- monotonicity and submodularity. We then provide an efficient method for calculating exact influence spread when the social network is restricted to a directed tree or a simple path. Finally, we propose a scalable algorithm for influence maximization under the CT-IC model, called CT-IPA. Our experiments show that the CT-IC model provides seeds of higher influence spread than the IC model, and that CT-IPA is four orders of magnitude faster than the greedy algorithm while providing similar influence spread.
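
An illustrative Monte Carlo simulation of the two ingredients CT-IC adds to IC: continuous trials (every active node retries its inactive neighbors at each step, rather than attempting once) and a deadline T. A hedged sketch only; the paper's exact per-trial activation probabilities may differ:

    import random

    def ct_ic_spread(graph, seeds, T, runs=1000):
        """graph: {u: [(v, p), ...]} with per-trial probability p."""
        total = 0
        for _ in range(runs):
            active = set(seeds)
            for _ in range(T):                    # time restriction: stop at T
                newly = set()
                for u in active:                  # continuous trials: all retry
                    for v, p in graph.get(u, []):
                        if v not in active and random.random() < p:
                            newly.add(v)
                active |= newly
            total += len(active)
        return total / runs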

16:12 Community-Affiliation Graph Model for Overlapping Network Community Detection
Jaewon Yang and Jure Leskovec
DM917

One of the main organizing principles in real-world networks is that of network communities, where sets of nodes organize into densely linked clusters. Communities in networks often overlap, as nodes can belong to multiple communities at once. Identifying such overlapping communities is crucial for understanding the structure as well as the function of real-world networks. Even though community structure in networks has been widely studied, practically all research makes the implicit assumption that overlaps between communities are less densely connected than the non-overlapping parts themselves. Here we validate this assumption on 6 large-scale social, collaboration, and information networks where nodes explicitly state their community memberships. By examining such ground-truth communities we find that community overlaps are more densely connected than the non-overlapping parts, in sharp contrast to the conventional wisdom that overlaps are more sparsely connected than the communities themselves. Practically all existing community detection methods fail to detect communities with dense overlaps. We propose the Community-Affiliation Graph Model, a model-based community detection method that builds on bipartite node-community affiliation networks. Our method successfully captures overlapping, non-overlapping, and hierarchically nested communities, and identifies relevant communities more accurately than state-of-the-art methods in networks ranging from biological to social and information networks.
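
A sketch of the generative rule at the heart of such an affiliation model: each shared community c links two of its members independently with probability p_c, so pairs sharing more communities, i.e. overlaps, come out denser. The small background probability eps is a common modeling choice assumed here, not taken from the paper:

    import random

    def edge_prob(comms_u, comms_v, p, eps=1e-4):
        """comms_*: sets of community ids; p: {community id: link probability}."""
        miss = 1.0
        for c in comms_u & comms_v:
            miss *= (1.0 - p[c])            # fail to link via each shared community
        return 1.0 - miss * (1.0 - eps)     # eps covers pairs with no shared community

    def sample_graph(affil, p, eps=1e-4):
        """affil: {node: set of community ids}. Returns a sampled edge list."""
        nodes = list(affil)
        edges = []
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                if random.random() < edge_prob(affil[u], affil[v], p, eps):
                    edges.append((u, v))
        return edges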

16:24 Predicting Directed Links using Nondiagonal Matrix Decompositions
Jérôme Kunegis and Jörg Fliege
DM489

We present a method for trust prediction based on nondiagonal decompositions of the asymmetric adjacency matrix of a directed network. The method we propose is based on a nondiagonal decomposition into directed components (DEDICOM), which we use to learn the coefficients of a matrix polynomial of the network's adjacency matrix. We show that our method can compute better low-rank approximations to a polynomial of a network's adjacency matrix than the singular value decomposition, and that higher precision can be achieved at the task of predicting directed links than by undirected or bipartite methods.
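
A hedged, one-shot sketch of a nondiagonal decomposition A ≈ U H U^T for directed link scoring, with U taken from the symmetric part of A and the nondiagonal core H fitted by least squares. This is only an illustration of the decomposition's shape; the paper's iterative DEDICOM algorithm and matrix-polynomial learning are not reproduced:

    import numpy as np

    def dedicom_scores(A, k=16):
        """A: (n, n) asymmetric adjacency matrix. Returns an (n, n) score matrix."""
        w, V = np.linalg.eigh(A + A.T)            # eigenvectors of the symmetric part
        U = V[:, np.argsort(-np.abs(w))[:k]]      # top-k by eigenvalue magnitude
        H = U.T @ A @ U                           # least-squares core; H is NOT diagonal,
                                                  # so it keeps the edge directionality
        return U @ H @ U.T                        # higher score = more likely directed link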

16:36 Mining User Mobility Features for Next Place Prediction in Location-based Services
Anastasios Noulas, Salvatore Scellato, Neal Lathia, and Cecilia Mascolo
DM746

Mobile location-based services are thriving, providing an unprecedented opportunity to collect fine-grained spatio-temporal data about the places users visit. This multi-dimensional source of data offers new possibilities to tackle established research problems on human mobility, but it also opens avenues for the development of novel mobile applications and services. In this work we study the problem of predicting the next venue a mobile user will visit, by exploring the predictive power offered by different facets of user behavior. We first analyze about 35 million check-ins made by about 1 million Foursquare users in over 5 million venues across the globe, spanning a period of five months. We then propose a set of features that aim to capture the factors that may drive users' movements. Our features exploit information on transitions between types of places, mobility flows between venues, and spatio-temporal characteristics of user check-in patterns. We further combine all individual features in two supervised learning models, based on linear regression and M5 model trees, resulting in higher overall prediction accuracy. We find that the supervised methodology based on the combination of multiple features offers the highest levels of prediction accuracy: for one in two user check-ins, M5 model trees rank the true next venue within the top fifty of thousands of candidate items in the prediction list.

16:48 Spatial Interpolation using Multiple Regression
Orlando Ohashi and Luís Torgo
DM785

Many real-world data mining applications involve analyzing geo-referenced data. Frequently, such data sets are incomplete in the sense that not all geographical coordinates have measured values of the variable(s) of interest. This incompleteness may be caused by poor data collection, measurement errors, cost management, and many other factors, and the resulting missing values cause difficulties in many applications. Spatial imputation/interpolation methods try to fill in these unknown values in geo-referenced data sets. In this paper we propose a new spatial imputation method based on machine learning algorithms and a series of data pre-processing steps. The key distinguishing factor of this method is that it allows the use of data from faraway regions, contrary to the state of the art in spatial data mining. Images (e.g., from a satellite or video surveillance cameras) may also suffer from this kind of incompleteness, with some pixels missing for many possible reasons. An image can be seen as a spatial data set in a Cartesian coordinate system, where each pixel (location) registers some value (e.g., the degree of gray in a black-and-white image). Being able to recover the original image from a partial or incomplete version of reality is a key application in many domains (e.g., surveillance, security, etc.). In this paper we evaluate our general methodology for spatial interpolation on this type of problem: namely, we check the ability of our method to fill in unknown pixels in several images. We compare it to state-of-the-art methods and provide strong experimental evidence of the advantages of our proposal.
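
A minimal sketch of regression-based spatial imputation on an image: train a learner on the pixels with known values from their coordinates (to which engineered features, possibly drawn from faraway regions as the paper advocates, could be appended) and predict the missing ones. A random forest stands in here for the paper's learners:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def impute_image(img):
        """img: 2-D float array with np.nan at missing pixels."""
        yy, xx = np.indices(img.shape)
        coords = np.column_stack([yy.ravel(), xx.ravel()])  # (row, col) features
        vals = img.ravel()
        known = ~np.isnan(vals)
        model = RandomForestRegressor(n_estimators=100)
        model.fit(coords[known], vals[known])               # learn from known pixels
        filled = img.copy().ravel()
        filled[~known] = model.predict(coords[~known])      # fill in the unknown ones
        return filled.reshape(img.shape)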