Session Details

Foundations and Quality Control

Tuesday, 11 December
16:00 – 18:00
Room: Rembrandt & Permeke
Session Chair: Jilles Vreeken

16:00 Hierarchical Multi-task Learning with Application to Wafer Quality Prediction
Jingrui He and Yada Zhu
DM416

Many real problems of multi-task learning exhibit hierarchical task relatedness. In other words, the tasks are partitioned into multiple groups: different tasks within the same group are related on the task level, whereas different groups are related on the group level. For example, in semiconductor manufacturing, the problem of wafer quality prediction can be considered as hierarchical multi-task learning, where each task corresponds to a single side of a chamber with side-level relatedness, and a group of tasks corresponds to a chamber of multiple sides with chamber-level relatedness. Motivated by this application, in this paper we propose an optimization framework for hierarchical multi-task learning, which partitions all the input features into two sets based on their characteristics, and models task-level and group-level relatedness by imposing different constraints on the coefficient vectors of the two sets. This differs from existing work on task clustering, where the goal is to uncover the grouping of tasks, the tasks do not exhibit group-level relatedness, and the input features are not discriminated in the prediction model to capture task-level and group-level relatedness. To solve this framework, we propose the HEAR algorithm based on block coordinate descent, and demonstrate its effectiveness on both synthetic and real data sets from the domains of semiconductor manufacturing and document classification.
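
A minimal sketch of block coordinate descent for hierarchical multi-task regression, for readers unfamiliar with the setting. This is not the paper's HEAR algorithm (which additionally partitions the input features into two sets with different constraints); it only illustrates the generic shared-plus-task decomposition, where each task's coefficients are a group-level component plus a task-level deviation, and the two blocks are minimized alternately in closed form. All names and the ridge penalties are illustrative assumptions.

```python
import numpy as np

def hierarchical_mtl(X, y, groups, lam_g=1.0, lam_t=1.0, iters=50):
    """X, y: dicts task_id -> (n_t, d) array / (n_t,) array.
    groups: dict group_id -> list of task_ids.
    Model per task t in group g: w_t = u_g + v_t (both ridge-penalized)."""
    d = next(iter(X.values())).shape[1]
    u = {g: np.zeros(d) for g in groups}   # group-level coefficients
    v = {t: np.zeros(d) for t in X}        # task-level deviations
    I = np.eye(d)
    for _ in range(iters):
        for g, tasks in groups.items():    # block 1: update each u_g exactly
            A = sum(X[t].T @ X[t] for t in tasks) + lam_g * I
            b = sum(X[t].T @ (y[t] - X[t] @ v[t]) for t in tasks)
            u[g] = np.linalg.solve(A, b)
        for g, tasks in groups.items():    # block 2: update each v_t exactly
            for t in tasks:
                A = X[t].T @ X[t] + lam_t * I
                b = X[t].T @ (y[t] - X[t] @ u[g])
                v[t] = np.linalg.solve(A, b)
    return {t: u[g] + v[t] for g, ts in groups.items() for t in ts}
```

Since each block update is the exact minimizer of a convex objective with the other block fixed, the objective decreases monotonically, which is the property block coordinate descent relies on.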

16:20 Stream Classification with Recurring and Novel Class Detection using Class-Based Ensemble
Tahseen Al-Khateeb, Mohammad M. Masud, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham
DM284

Concept-evolution has recently received a lot of attention in the context of mining data streams. Concept-evolution occurs when a new class evolves in the stream. Although many recent studies address this issue, most of them do not consider the scenario of recurring classes in the stream. A class is called recurring if it appears in the stream, disappears for a while, and then reappears. Existing data stream classification techniques either misclassify the recurring class instances as another class, or falsely identify the recurring classes as novel. This increases the prediction error of the classifiers and, in some cases, wastes memory and computational resources. In this paper we address the recurring class issue by proposing a novel "class-based" ensemble technique, which replaces the traditional "chunk-based" ensemble approaches and correctly distinguishes between a recurring class and a novel one. We analytically and experimentally confirm the superiority of our method over state-of-the-art techniques.
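
A minimal sketch of why a class-based ensemble handles recurring classes where a chunk-based one does not: keep one lightweight model per class (here just a centroid and a radius), so a class trained in an early chunk remains recognizable after it disappears from recent chunks. This is an illustration of the general idea, not the paper's exact method; the per-class model and the novelty rule are simplifying assumptions.

```python
import numpy as np

class ClassBasedEnsemble:
    def __init__(self):
        self.models = {}                     # label -> (centroid, radius)

    def train_chunk(self, X, y):
        """Update/add one model per class seen in this chunk."""
        for label in np.unique(y):
            pts = X[y == label]
            c = pts.mean(axis=0)
            r = np.linalg.norm(pts - c, axis=1).max()
            self.models[label] = (c, r)

    def classify(self, x):
        """Nearest enclosing class region, or NOVEL if none encloses x."""
        best, best_dist = None, np.inf
        for label, (c, r) in self.models.items():
            d = np.linalg.norm(x - c)
            if d <= r and d < best_dist:
                best, best_dist = label, d
        return best if best is not None else "NOVEL"
```

Because models are indexed by class rather than by chunk, a recurring class still matches its old model when it reappears; only instances falling outside every class region are declared novel.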

16:40 Assessing the Significance of Data Mining Results on Graphs with Feature Vectors
Stephan Günnemann, Phuong Dao, Mohsen Jamali, and Martin Ester
DM773

Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at first glance, they can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods to assess data mining results on vector data or network data have been proposed. In many applications, however, both sources are simultaneously given. Since the two sources are rarely independent but highly correlated, naively applying existing randomization methods to each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary feature vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization exploits adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while simultaneously using both sources is beneficial, often one source of information is dominant for determining the mining results.
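
A minimal sketch of randomization-based significance testing on a graph with binary node attributes: interleave degree-preserving edge swaps with attribute swaps, recompute the mining statistic on each randomized copy, and report an empirical p-value. This simplified null model preserves degrees and attribute counts but, unlike the paper's adaptive Metropolis sampler, does not preserve the correlation between graph and attributes; all function names are illustrative.

```python
import numpy as np

def edge_swap(A, rng):
    """One degree-preserving double-edge swap on 0/1 adjacency matrix A."""
    edges = np.argwhere(np.triu(A, 1))
    (a, b), (c, d) = edges[rng.choice(len(edges), 2, replace=False)]
    if len({a, b, c, d}) == 4 and not A[a, d] and not A[c, b]:
        A[a, b] = A[b, a] = A[c, d] = A[d, c] = 0
        A[a, d] = A[d, a] = A[c, b] = A[b, c] = 1

def attr_swap(F, rng):
    """Swap one binary attribute value between two nodes (keeps column sums)."""
    i, j = rng.choice(F.shape[0], 2, replace=False)
    k = rng.integers(F.shape[1])
    F[i, k], F[j, k] = F[j, k], F[i, k]

def empirical_p(A, F, statistic, n_samples=1000, steps=200, seed=0):
    """Fraction of randomized datasets scoring at least the observed value."""
    rng = np.random.default_rng(seed)
    observed = statistic(A, F)
    hits = 0
    for _ in range(n_samples):
        A_r, F_r = A.copy(), F.copy()
        for _ in range(steps):          # interweave the two randomizations
            edge_swap(A_r, rng)
            attr_swap(F_r, rng)
        hits += statistic(A_r, F_r) >= observed
    return (hits + 1) / (n_samples + 1)   # smoothed empirical p-value
```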

17:00 Scaling Inference for Markov Logic via Dual Decomposition
Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik
DM377

Markov logic is a knowledge-representation language that allows one to specify large graphical models. However, the resulting large graphical models can make inference for Markov logic a computationally challenging problem. Recently, dual decomposition (DD) has become a popular approach for scalable inference on graphical models. We study how to apply DD to scale up inference in Markov logic. A standard approach for DD first partitions a graphical model into multiple tree-structured subproblems. We apply this approach to Markov logic and show that DD can outperform prior inference approaches. Nevertheless, we observe that the standard approach for DD is suboptimal, as it does not exploit the rich structure often present in the Markov logic program. Thus, we describe a novel decomposition strategy that partitions a Markov logic program into parts based on its structure. A crucial advantage of our approach is that we can use specialized algorithms for portions of the input problem, some of which have been studied for decades (e.g., coreference resolution). Empirically, we show that our program-level decomposition approach outperforms both non-decomposition and graphical model-based decomposition approaches to Markov logic inference on several data-mining tasks.
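
A minimal sketch of the dual decomposition mechanism itself, on a toy problem: split a model into two tractable subproblems that share a variable, give each its own copy of that variable, and run subgradient updates on a Lagrange multiplier until the copies agree. The scoring functions are made up for illustration, and each subproblem is solved by brute-force enumeration; real Markov logic programs decompose into far larger tree-structured or specialized parts.

```python
import itertools

# Two subproblems that share the variable s; all weights are illustrative.
def score1(s, a):   # subproblem 1 over (s, a)
    return 2.0 * (s == a) + 0.5 * a

def score2(s, b):   # subproblem 2 over (s, b)
    return 1.5 * (s != b) - 0.3 * s

lam = 0.0           # multiplier enforcing agreement between the two copies of s
for it in range(100):
    # Each subproblem maximizes its own score plus/minus the Lagrangian term.
    s1, a = max(itertools.product([0, 1], repeat=2),
                key=lambda v: score1(*v) + lam * v[0])
    s2, b = max(itertools.product([0, 1], repeat=2),
                key=lambda v: score2(*v) - lam * v[0])
    if s1 == s2:    # copies agree: zero duality gap, assignment is optimal
        break
    lam -= (1.0 / (1 + it)) * (s1 - s2)   # subgradient step on the dual

print("s =", s1, "a =", a, "b =", b)      # prints the MAP assignment s=1, a=1, b=0
```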

17:12 Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts
Nedim Lipka, Benno Stein, and James G. Shanahan
DM786

Automated text classification is one of the most important learning technologies for fighting information overload. However, the information society is confronted not only with an information flood but also with an increase in "information volatility", by which we mean that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among other places, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution, we partition a test sample by means of clustering. Then, using repeated resampling with different marginal distributions over the clustering, we study the effectiveness characteristics. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering, both theoretically and on standard text corpora. An important result is a heuristic for expected-effectiveness estimation that is based solely on the initial test sample and can be computed without resampling.
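
A minimal sketch of the cluster-and-resample idea: cluster a labeled test sample, draw new test sets under different subclass mixing proportions, and study the induced distribution of effectiveness. It assumes scikit-learn is available, uses accuracy as the effectiveness measure, and reports a simple 5th-percentile lower bound rather than the paper's probabilistic bound; `clf` is any fitted classifier with a `.predict` method.

```python
import numpy as np
from sklearn.cluster import KMeans

def effectiveness_under_shifts(clf, X_test, y_test, k=5, n_shifts=200, seed=0):
    """X_test: (n, d) array, y_test: (n,) array of labels."""
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=k, n_init=10,
                      random_state=seed).fit_predict(X_test)
    members = [np.flatnonzero(clusters == c) for c in range(k)]
    accuracies = []
    for _ in range(n_shifts):
        weights = rng.dirichlet(np.ones(k))            # random subclass mix
        counts = rng.multinomial(len(y_test), weights)
        idx = np.concatenate([rng.choice(m, size=n, replace=True)
                              for m, n in zip(members, counts)
                              if n > 0 and len(m) > 0])
        acc = np.mean(clf.predict(X_test[idx]) == y_test[idx])
        accuracies.append(acc)
    accuracies = np.asarray(accuracies)
    # expected effectiveness and a pessimistic (5th percentile) lower bound
    return accuracies.mean(), np.percentile(accuracies, 5)
```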

17:24 Adapting Component Analysis
Fatemeh Dorri and Ali Ghodsi
DM747

A central problem in machine learning is to predict the response variables of a test set given the training data and its corresponding response variables. A predictive model can perform satisfactorily only if the training data is an appropriate representative of the test data. This is usually reflected in the assumption that the training data and the test data are drawn from the same underlying probability distribution. However, this assumption may not hold in many applications, for various reasons. We propose a method based on kernel distribution embedding and the Hilbert-Schmidt Independence Criterion (HSIC) to address this problem. The proposed method learns a new representation of the data in a feature space with two properties: (i) the distributions of the training and test data sets are as close as possible in the new feature space, and (ii) the important structural information of the data is preserved. The algorithm can reduce the dimensionality of the data while preserving these properties, and can therefore also be seen as a dimensionality reduction method. Our method has a closed-form solution, and the experimental results show that it works well in practice.
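
A minimal sketch of the two quantities such a method trades off, evaluated here for a candidate linear projection W: (i) the discrepancy between the embedded training and test distributions, estimated below by squared MMD via kernel mean embeddings, and (ii) the dependence between the projected data and its labels, measured by empirical HSIC. This only scores a given projection; the paper derives W in closed form, and the RBF kernel, the delta label kernel, and the trade-off weight `mu` are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel matrix between row-sample matrices A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(X, Z, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Z."""
    return (rbf(X, X, gamma).mean() + rbf(Z, Z, gamma).mean()
            - 2 * rbf(X, Z, gamma).mean())

def hsic(X, y, gamma=1.0):
    """Empirical HSIC between features X and (discrete) labels y."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    K = rbf(X, X, gamma)
    L = (y[:, None] == y[None, :]).astype(float)     # delta kernel on labels
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def score_projection(W, X_train, y_train, X_test, mu=1.0):
    """Lower is better: distribution mismatch minus mu * label dependence."""
    return mmd2(X_train @ W, X_test @ W) - mu * hsic(X_train @ W, y_train)
```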

17:36 Cost-Sensitive Online Classification and Its Application to Online Anomaly Detection
Jialei Wang, Peilin Zhao, and Steven C.H. Hoi
DM468

Cost-sensitive classification and online learning have each been studied extensively in the data mining and machine learning communities. It is therefore surprising that their important intersection, cost-sensitive online classification, has received very little comprehensive study. In this paper, we formally study this problem and propose a new framework for cost-sensitive online classification that exploits online gradient descent techniques. Based on this framework, we propose a family of cost-sensitive online classification algorithms designed to directly optimize two well-known cost-sensitive measures: (i) maximization of the weighted sum of sensitivity and specificity, and (ii) minimization of the weighted misclassification cost. We analyze theoretical bounds on the cost-sensitive measures achieved by the proposed algorithms, and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks.
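
A minimal sketch of the underlying mechanism: online gradient descent on a hinge loss whose margin violations are weighted by class-dependent misclassification costs, so that false negatives (cost `c_pos`) are penalized more heavily than false positives (cost `c_neg`). The specific costs, step size, and function name are illustrative; the paper proposes and analyzes a whole family of such algorithms.

```python
import numpy as np

def cost_sensitive_ogd(stream, d, c_pos=5.0, c_neg=1.0, eta=0.1):
    """stream yields (x, y) pairs with x an (d,) array and y in {-1, +1};
    returns the final weight vector."""
    w = np.zeros(d)
    for t, (x, y) in enumerate(stream, start=1):
        cost = c_pos if y > 0 else c_neg          # class-dependent cost
        if y * (w @ x) < 1:                       # cost-weighted hinge active
            w += (eta / np.sqrt(t)) * cost * y * x  # scaled gradient step
    return w
```

The only change relative to plain online gradient descent is the `cost` factor on the gradient, which is what steers the learned classifier toward the chosen cost-sensitive measure.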

17:48 Self-Taught Active Learning from Crowds
Meng Fang, Xingquan Zhu, Bin Li, Wei Ding, and Xindong Wu
DM826

The emergence of social tagging and crowdsourcing systems provides a unique platform where multiple weak labelers can form a crowd to fulfill a labeling task. Yet crowd labelers are often noisy and inaccurate, have limited labeling knowledge, and, worst of all, act independently without seeking complementary knowledge from each other to improve labeling performance. In this paper, we propose a Self-Taught Active Learning (STAL) paradigm, in which imperfect labelers are able to learn complementary knowledge from one another to expand their knowledge sets and benefit the underlying active learner. We employ a probabilistic model to characterize the knowledge of each labeler, through which a weak labeler can learn complementary knowledge from a stronger peer. As a result, the self-taught active learning process eventually helps achieve high classification accuracy with minimal labeling cost and labeling error.
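
A minimal sketch of the self-taught idea: track each labeler's estimated reliability per region of the instance space (here simply per cluster index), and let a labeler who is unreliable in a region adopt the answer of the most reliable peer there, nudging its own reliability upward. Everything here is a simplifying assumption for illustration (binary 0/1 labels, a simulated noisy oracle, an ad-hoc reliability update); it is not the paper's probabilistic knowledge model.

```python
import numpy as np

class Labeler:
    def __init__(self, name, n_regions, seed):
        self.name = name
        self.reliability = np.full(n_regions, 0.5)  # est. accuracy per region
        self.rng = np.random.default_rng(seed)

    def label(self, region, true_label):
        """Simulated noisy oracle: correct w.p. reliability in that region."""
        if self.rng.random() < self.reliability[region]:
            return true_label
        return 1 - true_label                        # binary labels assumed

def self_taught_label(labelers, region, true_label, threshold=0.6):
    """Collect answers; weak labelers adopt the strongest peer's answer."""
    answers = {l.name: l.label(region, true_label) for l in labelers}
    best = max(labelers, key=lambda l: l.reliability[region])
    for l in labelers:
        if l.reliability[region] < threshold and l is not best:
            answers[l.name] = answers[best.name]     # learn from the peer
            # ad-hoc update: move reliability toward the stronger peer's
            l.reliability[region] = 0.5 * (l.reliability[region]
                                           + best.reliability[region])
    return answers
```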