Session Details

Data Pre-Processing

Tuesday, 11 December
16:00 – 18:00
Room: Salle des nations II
Session Chair: Pauli Miettinen

16:00 Estimating Local Information Trustworthiness via Multi-Source Joint Matrix Factorization
Liang Ge, Jing Gao, Xiao Yu, Wei Fan, and Aidong Zhang
DM784

We investigate how to estimate information trustworthiness by considering multiple information sources jointly in a latent matrix space. We particularly focus on user review and recommendation systems, as there are multiple platforms where people can rate items and services that they have purchased, and many potential customers rely on these opinions to make decisions. Information trustworthiness is a serious problem because ratings are generated freely by end-users, so many spammers take advantage of this freedom of speech to promote their business or damage the reputation of competitors. We propose to use customer ratings alone to estimate each individual source's reliability by exploring correlations among multiple sources. Ratings of items are provided by users of diverse tastes and styles, and thus may appear noisy and conflicting across sources; however, they share some underlying common behavior. Therefore, we can group users based on their opinions, and a source is reliable on an item if the opinions given by its latent groups are consistent across platforms. Inspired by this observation, we solve the problem with a two-step model: a joint matrix factorization procedure followed by reliability score computation. We propose two effective approaches to decompose rating matrices into the products of group membership and group rating matrices, and then compute consistency degrees from the group rating matrices as source reliability scores. We conduct experiments on both synthetic data and real user ratings collected from Orbitz, Priceline, and Trip Advisor for all the hotels in Las Vegas and New York City. Results show that the proposed method gives accurate estimates of source reliability and thus successfully identifies inconsistent, conflicting, and unreliable information.
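
A minimal numpy sketch of the two-step idea described above, assuming a joint NMF with a shared group-membership matrix and a consensus-based consistency score; the multiplicative updates and the scoring rule are illustrative choices, not the paper's exact formulation (which proposes two decomposition approaches).

    import numpy as np

    def joint_nmf(R_list, k=3, iters=200, eps=1e-9):
        """Jointly factorize each source's user-item rating matrix R_s
        as U @ V_s: a shared group-membership matrix U (users x k) and
        per-source group-rating matrices V_s (k x items)."""
        m, n = R_list[0].shape
        rng = np.random.default_rng(0)
        U = rng.random((m, k))
        Vs = [rng.random((k, n)) for _ in R_list]
        for _ in range(iters):  # standard NMF multiplicative updates
            U *= sum(R @ V.T for R, V in zip(R_list, Vs)) / \
                 (U @ sum(V @ V.T for V in Vs) + eps)
            for i, R in enumerate(R_list):
                Vs[i] *= (U.T @ R) / (U.T @ U @ Vs[i] + eps)
        return U, Vs

    def reliability_scores(Vs):
        """Score each source by how closely its group ratings agree with
        the cross-source consensus (higher = more consistent)."""
        consensus = np.mean(Vs, axis=0)
        return [-np.linalg.norm(V - consensus) for V in Vs]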

16:12 Heterogeneous Constraint Propagation with Constrained Sparse Representation
Zhiwu Lu and Yuxin Peng
DM568

This paper presents a graph-based method for heterogeneous constraint propagation on multi-modal data using constrained sparse representation. Since heterogeneous pairwise constraints are defined over pairs of data points from different modalities, heterogeneous constraint propagation is more challenging than the traditional homogeneous constraint propagation on single-modal data, which has been studied extensively in previous work. The main difficulty of heterogeneous constraint propagation lies in how to effectively propagate heterogeneous pairwise constraints across different modalities. To address this issue, we decompose heterogeneous constraint propagation into semi-supervised learning subproblems, which can then be efficiently solved by graph-based label propagation. Moreover, we develop a constrained sparse representation method for graph construction over each modality using homogeneous pairwise constraints. Experimental results on cross-modal retrieval show the superior performance of our heterogeneous constraint propagation.
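
The decomposition step reduces each subproblem to standard graph-based label propagation on a single modality; a generic sketch of that building block follows (the symmetric normalization and closed-form solution follow the common Zhou et al. formulation, an assumption rather than the paper's exact propagation scheme).

    import numpy as np

    def label_propagation(W, Y, alpha=0.9):
        """Propagate initial constraint/label scores Y over a graph with
        affinity matrix W; returns the converged scores F."""
        d = np.maximum(W.sum(axis=1), 1e-12)
        S = W / np.sqrt(np.outer(d, d))       # symmetric normalization
        n = W.shape[0]
        # closed form of the propagation fixed point
        return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)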

16:24 Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach
Yuanhong Wang, Yang Liu, and Xiaohui Yu
DM974

Collaborative filtering (CF) aims to produce user-specific recommendations based on other users' ratings of items. Most existing CF methods rely only on users' overall ratings of items, ignoring the variety of opinions users may have towards different aspects of the items. Using the movie domain as a case study, we propose a framework that captures users' opinions on different aspects from textual reviews and uses that information to improve the effectiveness of CF. The framework has two components: an opinion mining component and a rating inference component. The former extracts and summarizes opinions on multiple aspects from the reviews, generating ratings on the various aspects. The latter infers the overall ratings of items based on the aspect ratings, forming the basis for item recommendation. Our core contribution is a tensor factorization approach for the rating inference. Operating on the tensor composed of the overall and aspect ratings, this approach captures the intrinsic relationships between users, items, and aspects, and provides accurate predictions of unknown ratings. Experiments on a movie dataset show that our proposal significantly improves prediction accuracy compared with two baseline methods.
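
A minimal sketch of the rating-inference idea: fit a rank-k CP decomposition to the observed entries of the (user, item, aspect) rating tensor by stochastic gradient descent, treating the overall rating as one more aspect slice. The rank, learning rate, and squared loss are illustrative assumptions, not the paper's exact model.

    import numpy as np

    def cp_sgd(entries, shape, k=8, lr=0.01, reg=0.05, epochs=50):
        """entries: iterable of (user, item, aspect, rating) tuples;
        shape: (n_users, n_items, n_aspects). Returns CP factors."""
        rng = np.random.default_rng(0)
        U, V, A = (rng.normal(scale=0.1, size=(d, k)) for d in shape)
        for _ in range(epochs):
            for u, i, a, r in entries:
                e = np.sum(U[u] * V[i] * A[a]) - r   # prediction error
                gu = e * V[i] * A[a] + reg * U[u]    # all gradients first,
                gv = e * U[u] * A[a] + reg * V[i]    # then update in place
                ga = e * U[u] * V[i] + reg * A[a]
                U[u] -= lr * gu; V[i] -= lr * gv; A[a] -= lr * ga
        return U, V, A

    # An unknown rating is then predicted as np.sum(U[u] * V[i] * A[a]).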

16:36 Signal Disaggregation via Sparse Coding with Featured Discriminative Dictionary
Bingsheng Wang, Feng Chen, Haili Dong, Arnold P. Boedihardjo, and Chang-Tien Lu
DM954

As freshwater shortage becomes an increasingly pressing issue, it is critical to take effective measures for water conservation. Previous studies suggest that device-level consumption data could lead to significant conservation of freshwater. However, current smart meter deployments only produce low-sample-rate aggregated data. In this paper, we examine the task of separating whole-home water consumption into its component appliances. A key challenge is to address the unique features of low-sample-rate data. To this end, we propose Sparse Coding with Featured Discriminative Dictionary (SCFDD), which incorporates inherent shape and activation features to capture the discriminative characteristics of devices. Extensive experiments were performed to validate the effectiveness of SCFDD.
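
SCFDD's contribution is the featured discriminative dictionary itself; the sketch below shows only the generic sparse-coding disaggregation step it builds on, assuming pre-trained per-device dictionaries and using sklearn's Lasso with nonnegative codes (both assumptions, not the paper's solver).

    import numpy as np
    from sklearn.linear_model import Lasso

    def disaggregate(x_total, device_dicts, alpha=0.1):
        """Split an aggregate signal x_total (length T) into per-device
        signals via sparse coding over concatenated device bases.
        device_dicts: list of (T x k_i) dictionaries, one per appliance."""
        D = np.hstack(device_dicts)                 # T x sum(k_i)
        code = Lasso(alpha=alpha, positive=True, fit_intercept=False,
                     max_iter=5000).fit(D, x_total).coef_
        parts, start = [], 0
        for Di in device_dicts:
            parts.append(Di @ code[start:start + Di.shape[1]])
            start += Di.shape[1]
        return parts                                # one signal per device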

16:48 Hashing with Generalized Nyström Approximation
Jeong-Min Yun, Saehoon Kim, and Seungjin Choi
DM376

Hashing, which involves learning binary codes to embed high-dimensional data into a similarity-preserving low-dimensional Hamming space, is often formulated as linear dimensionality reduction followed by binary quantization. Linear dimensionality reduction, based on the maximum variance formulation, requires the leading eigenvectors of the data covariance or graph Laplacian matrix. Computing leading singular vectors or eigenvectors for high-dimensional data with large sample size is a main bottleneck in most data-driven hashing methods. In this paper we address the use of the generalized Nyström method, where a subset of rows and columns is used to approximately compute the leading singular vectors of the data matrix, in order to improve the scalability of hashing methods for high-dimensional data with large sample size. In particular, we validate the useful behavior of generalized Nyström approximation with uniform sampling in the case of a recently developed hashing method based on principal component analysis (PCA) followed by iterative quantization, referred to as PCA+ITQ, developed by Gong and Lazebnik. We compare the performance of generalized Nyström approximation with uniform and non-uniform sampling to the full singular value decomposition (SVD) method, confirming that uniform sampling improves the computational and space complexities dramatically while sacrificing little performance. In addition, we present low-rank approximation error bounds for generalized Nyström approximation with uniform sampling, which is not a trivial extension of available results for the non-uniform sampling case.
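
A small numpy sketch of generalized Nyström approximation with uniform sampling, as used to recover approximate leading singular vectors for PCA-style hashing: sample a column subset C and a row subset R of the data matrix A, form A ~= C pinv(W) R with W their intersection, and take the SVD of the thin factorization. The subset sizes and the QR route are standard choices, not necessarily the paper's exact procedure.

    import numpy as np

    def generalized_nystrom_svd(A, c, r, k, seed=0):
        """Approximate the top-k singular triplets of A (m x n) from c
        uniformly sampled columns and r uniformly sampled rows."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        J = rng.choice(n, size=c, replace=False)   # column subset
        I = rng.choice(m, size=r, replace=False)   # row subset
        C, R, W = A[:, J], A[I, :], A[np.ix_(I, J)]
        B = C @ np.linalg.pinv(W)                  # m x r thin factor
        Q, T = np.linalg.qr(B)                     # orthonormal basis
        Uh, s, Vt = np.linalg.svd(T @ R, full_matrices=False)
        return Q @ Uh[:, :k], s[:k], Vt[:k]        # approx top-k SVD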

17:00 Low-Rank Transfer Subspace Learning
Ming Shao, Carlos Castillo, Zhenghong Gu, and Yun Fu
DM459

One of the most important challenges in machine learning is performing effective learning when only limited training data are available. In many cases, however, there are sufficient training data from other (source) domains. Transfer learning aims at finding ways to transfer knowledge learned from a source domain to a target domain by handling the subtle differences between them. In this paper, we propose a novel framework to solve this knowledge transfer problem via low-rank representation constraints. This is achieved by finding an optimal subspace where each datum in the target domain can be linearly represented by the corresponding subspace in the source domain. Extensive experiments on several databases, i.e., the Yale B, CMU PIE, and UB Kin Face databases, validate the effectiveness of the proposed approach and show its superiority over existing, well-established methods.
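
The paper optimizes the subspace and the low-rank representation jointly; the sketch below is a deliberate simplification that fixes the subspace to source PCA and solves only the low-rank representation step by proximal gradient with singular value thresholding, so it illustrates the constraint "target data linearly represented by source data in a shared subspace" rather than the authors' full algorithm.

    import numpy as np

    def svt(M, tau):
        """Singular value thresholding: prox operator of the nuclear norm."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    def low_rank_transfer(Xs, Xt, dim=20, tau=1.0, iters=200):
        """Xs, Xt: features x samples. Finds Z minimizing
        0.5*||P'Xs Z - P'Xt||_F^2 + tau*||Z||_* with P from source PCA."""
        mu = Xs.mean(axis=1, keepdims=True)
        P = np.linalg.svd(Xs - mu, full_matrices=False)[0][:, :dim]
        Ys, Yt = P.T @ (Xs - mu), P.T @ (Xt - mu)
        Z = np.zeros((Xs.shape[1], Xt.shape[1]))
        step = 1.0 / np.linalg.norm(Ys, 2) ** 2    # 1 / Lipschitz constant
        for _ in range(iters):                     # proximal gradient loop
            Z = svt(Z - step * (Ys.T @ (Ys @ Z - Yt)), tau * step)
        return P, Z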

17:12 Geodesic-Based Semi-Supervised Multi-Manifold Feature Extraction
Mingyu Fan, Xiaoqin Zhang, Zhouchen Lin, Zhongfei Zhang, and Hujun Bao
DM538

Manifold learning is an important feature extraction approach in data mining. This paper presents a new semi-supervised manifold learning algorithm, called Multi-Manifold Discriminative Analysis (Multi-MDA). The proposed method is designed to explore the discriminative information hidden in geodesic distances. The main contributions are: 1) we propose a semi-supervised graph construction method that can effectively capture the multi-manifold structure of the data; 2) each data point is replaced with an associated feature vector whose elements are the graph distances from it to the other data points, so that information about the nonlinear structure is contained in feature vectors that are helpful for classification; 3) we propose a new semi-supervised linear dimension reduction method for these feature vectors, which introduces class information into the manifold learning process and establishes an explicit dimension reduction mapping. Experiments on benchmark data sets show the effectiveness of the proposed method.
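
A sketch of the geodesic-feature idea (contribution 2) with off-the-shelf building blocks: compute graph distances on a kNN graph and use each point's distance vector as its feature, then learn an explicit linear reduction from the labeled subset. The kNN graph, Dijkstra distances, and LDA here are stand-ins for the paper's own graph construction and semi-supervised reduction.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from sklearn.neighbors import kneighbors_graph
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def geodesic_features(X, n_neighbors=10):
        """Replace each point by its vector of graph (geodesic) distances
        to all points, computed on a kNN graph (assumed connected)."""
        W = kneighbors_graph(X, n_neighbors, mode='distance')
        return shortest_path(W, method='D', directed=False)

    # Hypothetical usage with labeled indices `idx` and labels `y`:
    # G = geodesic_features(X)
    # lda = LinearDiscriminantAnalysis(n_components=2).fit(G[idx], y)
    # Z = lda.transform(G)   # explicit mapping applies to every point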

17:24 Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining
Phong Nguyen, Jun Wang, Melanie Hilario, and Alexandros Kalousis
DM663

The notion of meta-mining has appeared recently and extends traditional meta-learning in two ways. First, it provides support for the whole data-mining process. Second, it pries open the so-called algorithm black box: algorithms and workflows also have descriptors. With descriptors available both for datasets and data-mining workflows, we are faced with a problem whose nature is much more similar to those appearing in recommender systems. To account for the meta-mining specificities, we derive a novel metric-based learning recommender approach. Our method learns two homogeneous metrics, one in the dataset space and one in the workflow space, and a heterogeneous one in the dataset-workflow space. All learned metrics reflect similarities established from the dataset-workflow preference matrix, which is constructed from the performance results obtained by applying workflows to datasets. We demonstrate our method on meta-mining over biological (microarray) datasets. The application of our method is not limited to the meta-mining problem; its formulation is general enough to apply to problems with similar requirements.
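
A sketch of the heterogeneous piece only, cast as a regularized bilinear fit of the preference matrix: learn M so that a dataset descriptor and a workflow descriptor score high together exactly when the preference matrix says so. The squared loss, plain gradient descent, and the names Xd, Xw, P are assumptions for illustration, not the paper's metric-learning formulation.

    import numpy as np

    def fit_heterogeneous_metric(Xd, Xw, P, lam=0.1, lr=0.01, epochs=500):
        """Xd: dataset descriptors (n_d x p); Xw: workflow descriptors
        (n_w x q); P: preference matrix (n_d x n_w). Learns M with
        P[i, j] ~ Xd[i] @ M @ Xw[j]."""
        rng = np.random.default_rng(0)
        M = rng.normal(scale=0.01, size=(Xd.shape[1], Xw.shape[1]))
        for _ in range(epochs):
            E = Xd @ M @ Xw.T - P                # residual on preferences
            M -= lr * (Xd.T @ E @ Xw + lam * M)  # gradient step
        return M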

17:36 IceCube: Efficient Targeted Mining in Data Cubes
Shrutendra K Harsola, Prasad M Deshpande, and Jayant R Haritsa
DM932

We address the problem of mining targeted association rules over multidimensional market-basket data. Here, each transaction has, in addition to the set of purchased items, ancillary dimension attributes associated with it. Based on these dimensions, transactions can be visualized as distributed over the cells of an n-dimensional cube. In this framework, a targeted association rule is of the form {X -> Y}R, where R is a convex region in the cube and X -> Y is a traditional association rule within region R. We first describe the TOARM algorithm, based on classical techniques, for identifying targeted association rules. Then, we discuss the concepts of bottom-up aggregation and cubing, leading to the Cell Union technique. This approach is further extended, using notions of cube-count interleaving and credit-based pruning, to derive the IceCube algorithm. Our experiments demonstrate that IceCube consistently provides the best execution-time performance, especially for large and complex data cubes.
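
To make the {X -> Y}R formulation concrete, here is a tiny brute-force baseline in the spirit of TOARM (the very approach IceCube is designed to beat): restrict transactions to the region, then count itemsets directly. The transaction layout and the 1-/2-itemset limit are assumptions for brevity.

    from itertools import combinations

    def mine_targeted(transactions, region, min_sup=2):
        """transactions: dicts with 'items' (set) and 'dims' (attr -> value);
        region: dict attr -> (lo, hi) defining a convex box in the cube."""
        inside = [t['items'] for t in transactions
                  if all(lo <= t['dims'][a] <= hi
                         for a, (lo, hi) in region.items())]
        counts = {}
        for items in inside:
            for k in (1, 2):                    # 1- and 2-itemsets only
                for s in combinations(sorted(items), k):
                    counts[s] = counts.get(s, 0) + 1
        return {s: c for s, c in counts.items() if c >= min_sup}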

17:48 Direct Discovery of High Utility Itemsets without Candidate Generation
Junqiang Liu, Ke Wang, and Benjamin C.M. Fung
DM554

Utility mining emerged recently to address a limitation of frequent itemset mining by introducing interestingness measures that reflect both statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one, as no anti-monotone property holds for the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which suffers from scalability issues due to the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune the search space by utility upper bounding, and to maintain original utility information during the mining process in a novel data structure. This data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance efficiency significantly by introducing recursive irrelevant item filtering for sparse data and a lookahead strategy for dense data. Extensive experiments on sparse and dense, synthetic and real data suggest that our algorithm outperforms the state-of-the-art algorithms by over one order of magnitude.
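
A compact sketch of the single-phase idea: enumerate itemsets by prefix extension and prune a branch whenever the prefix utility plus all remaining utility cannot reach the threshold. The bound used here is the simple remaining-utility bound, looser than the paper's tight bound, and the flat dict-based projection stands in for the paper's novel data structure.

    def mine_hui(transactions, min_util):
        """transactions: list of dicts mapping item -> utility in that
        transaction. Returns {itemset: utility} for high utility itemsets."""
        items = sorted({i for t in transactions for i in t})
        results = {}

        def search(prefix, proj):
            # proj: (utility of prefix in t, items of t beyond the prefix)
            last = prefix[-1] if prefix else None
            for x in items:
                if last is not None and x <= last:
                    continue                    # enumerate in item order
                new = [(pu + t[x], {i: u for i, u in t.items() if i > x})
                       for pu, t in proj if x in t]
                if not new:
                    continue
                util = sum(pu for pu, _ in new)
                if util >= min_util:
                    results[tuple(prefix + [x])] = util
                # prune: even keeping all remaining utility cannot help
                if sum(pu + sum(rest.values()) for pu, rest in new) >= min_util:
                    search(prefix + [x], new)

        search([], [(0, dict(t)) for t in transactions])
        return results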