Session Details

Efficient Algorithms 2

Thursday, 13 December
13:30 – 15:30
Room: Salle des nations I & II
Session Chair: Tijl De Bie

13:30 Efficient Learning for Hashing Proportional Data
Zhao Xu, Kristian Kersting, and Christian Bauckhage
DM398

Spectral hashing (SH) seeks compact binary codes of data points so that Hamming distances between codes correlate with data similarity. Quickly learning such codes typically boils down to principal component analysis (PCA). However, this is only justified for normally distributed data. For proportional data (normalized histograms), this is not the case: due to the sum-to-unity constraint, features that are as independent as possible will not all be uncorrelated. In this paper, we show that a linear-time transformation efficiently copes with sum-to-unity constraints: first, we select a small number K of diverse data points by maximizing the volume of the simplex spanned by these prototypes; second, we represent each data point by means of its cosine similarities to the K selected prototypes. This maximum volume hashing is sensible since each dimension in the transformed space is likely to follow a von Mises (vM) distribution, and, in very high dimensions, the vM distribution closely resembles a Gaussian distribution. This justifies employing PCA on the transformed data. Our extensive experiments validate this: maximum volume hashing outperforms spectral hashing and other state-of-the-art techniques.
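
A minimal Python sketch of this two-step transformation, assuming a greedy residual-norm heuristic as the surrogate for maximizing simplex volume (the function names and the heuristic are illustrative, not taken from the paper):

    import numpy as np

    def select_prototypes(X, K):
        # Greedy surrogate for maximum-volume selection: repeatedly pick the
        # point with the largest residual norm after projecting out the
        # directions of the prototypes chosen so far (Gram-Schmidt deflation).
        residual = X.astype(float).copy()
        chosen = []
        for _ in range(K):
            norms = np.linalg.norm(residual, axis=1)
            i = int(np.argmax(norms))
            chosen.append(i)
            v = residual[i] / (norms[i] + 1e-12)
            residual -= np.outer(residual @ v, v)
        return X[chosen].astype(float)

    def max_volume_transform(X, prototypes):
        # Represent each point by its cosine similarities to the K prototypes.
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        Pn = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-12)
        return Xn @ Pn.T

    def pca_binary_codes(Z, n_bits):
        # PCA on the transformed data; sign-threshold the projections.
        Zc = Z - Z.mean(axis=0)
        _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
        return (Zc @ Vt[:n_bits].T) > 0

Sign-thresholding the PCA projections is a simplification of the final coding step; spectral hashing proper fits sinusoidal eigenfunctions along the principal directions.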

13:50 Nested Subtree Hash Kernels for Large-scale Graph Classification over Streams
Bin Li, Xingquan Zhu, Lianhua Chi, and Chengqi Zhang
DM713

Most studies on graph classification focus on designing fast and effective kernels. Several fast subtree kernels have achieved linear time complexity w.r.t. the number of edges, under the condition that a common feature space (e.g., a subtree pattern list) is available to represent all graphs. This becomes infeasible when graphs arrive in a stream with rapidly emerging subtree patterns: computing a kernel matrix for graphs over the entire stream is difficult since the graphs in expired chunks cannot be projected onto the unlimitedly expanding feature space again. This poses a serious problem for graph classification over streams -- different portions of the stream have different feature spaces. In this paper, we aim to enable large-scale graph classification over streams using the classical ensemble learning framework, which requires the data in different chunks to be in the same feature space. To this end, we propose a Nested Subtree Hashing (NSH) algorithm to recursively project the multi-resolution subtree patterns of different chunks onto a set of common low-dimensional feature spaces. We theoretically analyze the derived NSH kernel and obtain a number of favorable properties: 1) the NSH kernel is an unbiased and highly concentrated estimator of the fast subtree kernel; 2) the bound on the convergence rate becomes tighter as the NSH algorithm steps into a higher resolution; 3) the NSH kernel is robust in tolerating concept drift between chunks over a stream. We also empirically test the NSH kernel on both a large-scale synthetic graph data set and a real-world chemical compounds data set for anticancer activity prediction. The experimental results validate that the NSH kernel is indeed efficient and robust for graph classification over streams.
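
A minimal sketch of the hashing idea, assuming Weisfeiler-Lehman style subtree relabeling over integer node labels; the per-resolution dimensions are illustrative, and the paper's independent hash functions per resolution are approximated here by Python's built-in hash:

    import numpy as np

    def nsh_features(adj, labels, iterations=3, dims=(128, 256, 512)):
        # adj: adjacency list (list of neighbor lists); labels: integer node
        # labels. Each iteration refines the subtree patterns and hashes
        # their counts into a fixed-dimensional vector, so graphs from
        # different stream chunks share a common feature space.
        vecs = []
        cur = list(labels)
        for it in range(iterations):
            dim = dims[it]
            vec = np.zeros(dim)
            new = []
            for v, nbrs in enumerate(adj):
                # Subtree pattern: own label plus sorted neighbor labels.
                pattern = (cur[v],) + tuple(sorted(cur[u] for u in nbrs))
                h = hash(pattern)
                new.append(h)           # compressed relabel for next round
                vec[h % dim] += 1.0     # hashed feature count
            vecs.append(vec)
            cur = new
        return np.concatenate(vecs)

    def nsh_kernel(f1, f2):
        # Approximate subtree kernel: inner product of hashed feature vectors.
        return float(f1 @ f2)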

14:10 Efficient Behavior Targeting Using SVM Ensemble Indexing
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, and Li Guo
DM977

Behavior targeting (BT) is a promising tool for online advertising. The state-of-the-art BT methods, which are mainly based on regression models, have two limitations. First, learning regression models for behavior targeting is difficult since user clicks are typically several orders of magnitude fewer than views. Second, user interests are not fixed, but often transient and influenced by media and pop culture. In this paper, we propose to formulate behavior targeting as a classification problem. Specifically, we propose to use an SVM ensemble for behavior prediction. The challenge of using an ensemble SVM for BT stems from its computational complexity (in our experiments, it takes 53 minutes to predict behavior for 32 million users, which is too slow for online applications). To this end, we propose a fast ensemble SVM prediction framework, which builds an indexing structure over the SVM ensemble to achieve sub-linear prediction time complexity. Experimental results on real-world large-scale behavior targeting data demonstrate that the proposed method is efficient and outperforms existing linear regression based BT models.
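
The paper's indexing structure is not reproduced here; as a hedged sketch of one way ensemble prediction can avoid evaluating every member, the following early-stopping majority vote (assuming linear SVMs stacked as rows of W with biases b) halts once the remaining votes cannot change the outcome:

    import numpy as np

    def ensemble_svm_predict(W, b, x):
        # W: (m, d) array of linear SVM weight vectors; b: (m,) biases;
        # x: (d,) feature vector. Stop early once one class holds a
        # majority that the unevaluated members cannot overturn.
        m = len(W)
        pos = neg = 0
        for i in range(m):
            if W[i] @ x + b[i] > 0:
                pos += 1
            else:
                neg += 1
            remaining = m - (i + 1)
            if pos > neg + remaining or neg > pos + remaining:
                break  # majority already decided
        return 1 if pos > neg else -1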

14:30 Active Evaluation of Classifiers on Large Datasets
Namit Katariya, Arun Iyer, and Sunita Sarawagi
DM706

The goal of this work is to estimate the accuracy of a classifier on a large unlabeled dataset based on a small labeled set and a human labeler. We seek to estimate accuracy and select instances for labeling in a loop via a continuously refined stratified sampling strategy. For stratifying data we develop a novel strategy of learning r-bit hash functions that preserve similarity in accuracy values. We show that our algorithm provides better accuracy estimates than existing methods for learning distance-preserving hash functions. Experiments on a wide spectrum of real datasets show that our estimates achieve between 15% and 62% relative reduction in error compared to existing approaches. We show how to perform stratified sampling on unlabeled data that is so large that, in an interactive setting, even a single sequential scan is impractical. We present an optimal algorithm for performing importance sampling on a static index over the data that achieves close-to-exact estimates while reading three orders of magnitude less data.
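
A minimal sketch of stratified accuracy estimation, assuming the strata are hash buckets over the unlabeled data and label_fn stands in for the human labeler (both names are illustrative; the paper's proportional allocation may differ):

    import numpy as np

    def stratified_accuracy(strata, budget, label_fn):
        # strata: dict mapping a stratum id (e.g., a hash-bucket key) to a
        # list of instances; label_fn(x) returns True iff the classifier
        # got x right, as judged by the human labeler. Samples are
        # allocated proportionally and per-stratum accuracies are combined
        # by stratum weight.
        rng = np.random.default_rng(0)
        total = sum(len(items) for items in strata.values())
        estimate = 0.0
        for items in strata.values():
            weight = len(items) / total
            k = max(1, round(budget * weight))
            idx = rng.choice(len(items), size=min(k, len(items)), replace=False)
            acc = np.mean([1.0 if label_fn(items[i]) else 0.0 for i in idx])
            estimate += weight * acc
        return estimate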

14:50 High Performance Offline & Online Distributed Collaborative Filtering
Ankur Narang, Abhinav Srivastav, and Naga Praveen Kumar Katta
DM929

Big data analytics is a hot research area in both academia and industry. It envisages processing massive amounts of data at high rates to generate new insights leading to positive impact (for both users and providers) in industries such as e-commerce, telecom, finance, life sciences, and so forth. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. High-throughput CF and co-clustering on highly sparse and massive datasets, along with high prediction accuracy, is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering based collaborative filtering algorithm. We study both the online and offline variants of this algorithm. Theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Further, we present the impact of load-balancing based optimizations on multi-core cluster architectures. Using the Netflix dataset (900M training ratings with replication) as well as the Yahoo! KDD Cup dataset (2.3B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a large multi-core cluster architecture. In offline mode, our distributed algorithm demonstrates around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy. In online mode, we demonstrate around 3x better performance than a baseline MPI implementation. To the best of our knowledge, our algorithm provides the best known online and offline performance and scalability results with high accuracy on multi-core cluster architectures.
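
As a hedged sketch of the co-clustering based prediction step only (the distributed, load-balanced implementation described above is omitted), a rating for a user-item pair can be estimated from the mean of its co-cluster block:

    def cocluster_predict(R, user_cluster, item_cluster):
        # R: dict {(user, item): rating} of observed ratings.
        # user_cluster / item_cluster: cluster assignments, which would come
        # from the (distributed) co-clustering step, not shown here.
        sums, counts = {}, {}
        for (u, i), r in R.items():
            key = (user_cluster[u], item_cluster[i])
            sums[key] = sums.get(key, 0.0) + r
            counts[key] = counts.get(key, 0) + 1
        global_mean = sum(R.values()) / len(R)

        def predict(u, i):
            # Fall back to the global mean for empty co-cluster blocks.
            key = (user_cluster[u], item_cluster[i])
            return sums[key] / counts[key] if key in counts else global_mean

        return predict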

15:10 Distributed Matrix Completion
Christina Teflioudi, Faraz Makari, and Rainer Gemulla
DM743

We discuss parallel and distributed algorithms for large-scale matrix completion on problems with millions of rows, millions of columns, and billions of revealed entries. We focus on in-memory algorithms that run on a small cluster of commodity nodes; even very large problems can be handled effectively in such a setup. Our DALS, ASGD, and DSGD++ algorithms are novel variants of the popular alternating least squares and stochastic gradient descent algorithms; they exploit thread-level parallelism, in-memory processing, and asynchronous communication. We provide some guidance on the asymptotic performance of each algorithm and investigate the performance of both our algorithms and previously proposed MapReduce algorithms in large-scale experiments. We find that DSGD++ outperforms competing methods in terms of overall runtime, memory consumption, and scalability. Using DSGD++, we can factor a matrix with 10B entries on 16 compute nodes in around 40 minutes.
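
A minimal sketch of the sequential stochastic gradient descent core that such algorithms parallelize (the learning rate, regularization constant, and epoch structure are illustrative; DSGD++'s thread-level parallelism and asynchronous communication are omitted):

    import numpy as np

    def sgd_epoch(entries, U, V, lr=0.01, reg=0.05):
        # entries: list of revealed entries (i, j, r); U: (m, k) row factors;
        # V: (n, k) column factors. One pass of SGD on the regularized
        # squared error of the rank-k model U[i] @ V[j].
        for i, j, r in entries:
            err = r - U[i] @ V[j]
            ui = U[i].copy()  # keep the old row factor for V's update
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * ui - reg * V[j])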