Session Details

Feature Selection and Text Mining

Thursday, 13 December
13:30 – 15:30
Room: Alto & Mezzo & Tempo
Session Chair: Latifur Khan

13:30 Dimensionality Reduction on Heterogeneous Feature Space
Xiaoxiao Shi and Philip S. Yu
DM244

Combining correlated data sources may help improve the learning performance of a given task. For example, in recommendation problems, one can combine (1) a user profile database (e.g., gender, age), (2) users' log data (e.g., click-through data, purchasing records), and (3) users' social network (useful in social targeting) to build a recommendation model. All these data sources provide informative but heterogeneous features. For instance, a user profile database usually has nominal features reflecting users' backgrounds, log data provides term-based features about users' historical behaviors, and a social network database has graph relational features. Given multiple heterogeneous data sources, one important challenge is to find a unified feature subspace that captures the knowledge from all of them. To this end, we propose the principle of collective component analysis (CoCA), which handles dimensionality reduction across a mixture of vector-based features and graph relational features. The CoCA principle is to find a feature subspace with maximal variance under two constraints: first, there should be consensus among the projections from the different feature spaces; second, the similarity between connected data (in any of the network databases) should be maximized. The optimal solution is obtained by solving an eigenvalue problem. Moreover, we discuss how to use prior knowledge to distinguish informative data sources and weight them optimally in CoCA. Since no previous model can be applied directly to this problem, we devised a straightforward comparison method that performs dimensionality reduction on the concatenation of the data sources. Three sets of experiments show that CoCA substantially outperforms this comparison method.
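
For a concrete picture of the kind of computation the abstract describes, here is a minimal hypothetical sketch, not the authors' CoCA: it concatenates two vector sources (much like the paper's own comparison baseline), adds a graph Laplacian penalty for the relational source, and reads the subspace off a single eigenvalue problem. All names (coca_sketch, X1, X2, A, lam) are illustrative.

    import numpy as np
    from scipy.linalg import eigh

    def coca_sketch(X1, X2, A, k, lam=1.0):
        """X1, X2: (n, d1)/(n, d2) feature matrices from two vector sources;
        A: (n, n) adjacency matrix of the relational source; k: target dim."""
        X = np.hstack([X1, X2])                  # naive shared feature space
        Xc = X - X.mean(axis=0)                  # center the features
        C = Xc.T @ Xc                            # scatter matrix (variance term)
        L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
        M = C - lam * (Xc.T @ L @ Xc)            # variance minus graph penalty
        vals, vecs = eigh(M)                     # symmetric eigenproblem
        W = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
        return Xc @ W                            # low-dimensional embedding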

13:50 Isometric multi-manifold learning for feature extraction
Mingyu Fan, Hong Qiao, Bo Zhang, and Xiaoqin Zhang
DM620

Manifold learning is an important topic in pattern recognition and computer vision. However, most manifold learning algorithms implicitly assume that the data lie on a single manifold, which is too strict for real applications. Isometric feature mapping (Isomap), a promising manifold learning method, fails on data distributed over several clusters within a manifold or over multiple manifolds. In this paper, we propose a new multi-manifold learning algorithm (M-Isomap). The algorithm first discovers the data manifolds and reduces the dimensionality of each manifold separately; meanwhile, a skeleton representing the global structure of the whole data set is built and kept in the low-dimensional space. Then, by referring to the low-dimensional representation of the skeleton, the embeddings of the individual manifolds are relocated into a global coordinate system. Compared with previous methods, the algorithm preserves both intra- and inter-manifold geodesics faithfully. The features and effectiveness of the proposed algorithm are demonstrated and compared through experiments.
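
A rough sketch of the two-stage recipe, assuming scikit-learn, with k-means standing in for the paper's manifold discovery; the skeleton-based relocation is approximated here by translating each manifold's embedding onto a global Isomap view, so this illustrates the idea rather than M-Isomap itself.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import Isomap

    def multi_isomap_sketch(X, n_manifolds=2, n_components=2, n_neighbors=8):
        # Stage 1: discover the manifolds (plain k-means stands in here).
        labels = KMeans(n_clusters=n_manifolds, n_init=10).fit_predict(X)
        # "Skeleton": one global embedding used only as an anchor frame
        # (assumes a scikit-learn recent enough to bridge disconnected
        # neighborhood graphs).
        skeleton = Isomap(n_neighbors=n_neighbors,
                          n_components=n_components).fit_transform(X)
        Y = np.zeros((X.shape[0], n_components))
        for c in range(n_manifolds):
            idx = labels == c
            # Stage 2: embed each manifold separately, preserving its
            # intra-manifold geodesics.
            emb = Isomap(n_neighbors=min(n_neighbors, int(idx.sum()) - 1),
                         n_components=n_components).fit_transform(X[idx])
            # Relocate: translate so the manifold's centroid lands where
            # the skeleton places it (the paper aligns more carefully).
            Y[idx] = emb - emb.mean(axis=0) + skeleton[idx].mean(axis=0)
        return Y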

14:10 Feature Weighting and Selection Using Hypothesis Margin of Boosting
Malak Alshawabkeh, Javed A. Aslam, Jennifer G. Dy, and David Kaeli
DM877

Using hypothesis margins to measure the quality of a set of features has been a growing line of research over the last decade. However, most previous algorithms, such as Simba, have been developed under the large hypothesis margin principle of the 1-NN algorithm; little attention has been paid so far to exploiting the hypothesis margins of boosting to evaluate features. Boosting is well known to maximize the hypothesis margins of the training examples, in particular the average margin, the first statistic to take the whole margin distribution into account. In this paper, we describe how to use the mean margins of boosting over the training examples to select features. A weight criterion, termed Margin Fraction (MF), is assigned to each feature according to its contribution to the average margin of the final output produced by boosting. Applying the idea of MF to a sequential backward selection method yields a new embedded selection algorithm, called SBS-MF. Experiments on several data sets compare the proposed SBS-MF with two boosting-based feature selection approaches as well as with Simba. The results show that SBS-MF is effective in most cases.
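
As one hypothetical reading of the criterion, the sketch below credits each feature with the share of AdaBoost's vote weight that flows through stumps splitting on it, then backward-eliminates the weakest feature; this is a simplification of the paper's Margin Fraction, not its exact definition.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def margin_fraction(X, y, feats):
        # AdaBoost's default base learner is a depth-1 decision stump.
        clf = AdaBoostClassifier(n_estimators=100).fit(X[:, feats], y)
        score = np.zeros(len(feats))
        for stump, w in zip(clf.estimators_, clf.estimator_weights_):
            f = stump.tree_.feature[0]         # feature split on at the root
            if f >= 0:                         # skip degenerate one-leaf stumps
                score[f] += w                  # weight of this stump's vote
        return score / score.sum()             # share of the ensemble's vote

    def sbs_mf(X, y, n_keep):
        feats = list(range(X.shape[1]))
        while len(feats) > n_keep:
            mf = margin_fraction(X, y, feats)
            feats.pop(int(np.argmin(mf)))      # drop the weakest feature
        return feats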

14:30 An Approach to Evaluate the Local Completeness of Event Logs (Short)
Hedong Yang, Lijie Wen, and Jianmin Wang
DM232

Process mining links traditional model-driven Business Process Management and data mining by deriving knowledge from event logs to improve operational business processes. Because it affects the quality of process mining results, the degree of completeness of a given event log should be measured. In this paper, an approach is proposed, in the context of mining control-flow dependencies, to evaluate the local completeness of an event log without any knowledge of the original process model. Experimental results show that the proposed approach works robustly and gives better estimates than existing approaches.
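
The abstract does not spell out the estimator, so the following is only a hypothetical illustration of "local completeness": it treats the directly-follows pairs in a log as samples and applies Turing's classic missing-mass estimate (singletons over observations) to gauge how much of the relation remains unseen. The paper's actual estimator differs.

    from collections import Counter

    def local_completeness(log):
        """log: list of traces, each a list of activity names."""
        pairs = Counter()
        for trace in log:
            for a, b in zip(trace, trace[1:]):       # directly-follows pairs
                pairs[(a, b)] += 1
        n = sum(pairs.values())                      # total observations
        n1 = sum(1 for c in pairs.values() if c == 1)  # pairs seen exactly once
        return 1.0 - n1 / n if n else 0.0            # estimated coverage

    log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
    print(local_completeness(log))                   # prints ~0.67 for this toy log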

14:42 Cross-Language Opinion Target Extraction in Review Texts (Short)
Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao
DM512

Opinion target extraction is a subtask of opinion mining that is useful in many applications. In this study, we investigate the problem in a cross-language scenario, leveraging rich labeled data in a source language for opinion target extraction in a different target language. An English labeled corpus is used as the training set, from which we generate two Chinese training data sets with different features. Two labeling models for Chinese opinion target extraction are then learned based on Conditional Random Fields (CRFs). After that, we use a monolingual co-training algorithm to improve the performance of both models by leveraging the enormous amount of unlabeled Chinese review text on the web. Experimental results show the effectiveness of the proposed approach.
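
A bare-bones co-training loop in the spirit of the paper, with the per-token CRFs replaced by plain logistic regression so the sketch stays self-contained; all names are illustrative and the confidence pooling is my own simplification.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def co_train(X1, X2, y, U1, U2, rounds=5, k=10):
        """(X1, X2): two feature views of the labeled data; y: labels;
        (U1, U2): the same unlabeled pool under both views."""
        for _ in range(rounds):
            m1 = LogisticRegression(max_iter=1000).fit(X1, y)
            m2 = LogisticRegression(max_iter=1000).fit(X2, y)
            if len(U1) == 0:
                break
            conf1 = m1.predict_proba(U1).max(axis=1)
            conf2 = m2.predict_proba(U2).max(axis=1)
            # Pool the k most confident predictions from each view.
            picked = np.unique(np.concatenate([np.argsort(conf1)[-k:],
                                               np.argsort(conf2)[-k:]]))
            # Label each picked item with the more confident model's guess.
            labels = np.where(conf1[picked] >= conf2[picked],
                              m1.predict(U1[picked]), m2.predict(U2[picked]))
            X1 = np.vstack([X1, U1[picked]])
            X2 = np.vstack([X2, U2[picked]])
            y = np.concatenate([y, labels])
            U1 = np.delete(U1, picked, axis=0)
            U2 = np.delete(U2, picked, axis=0)
        return m1, m2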

14:54 Inductive Model Generation for Text Categorization using a Bipartite Heterogeneous Network (Short)
Rafael Geraldeli Rossi, Thiago de Paulo Faleiros, Alneu de Andrade Lopes, and Solange Oliveira Rezende
DM868

Algorithms designed for categorizing numeric data are usually applied to text categorization after a preprocessing phase that assigns weights to the textual terms treated as attributes. However, owing to characteristics of textual data such as sparsity and high dimensionality, some of these general-purpose classifiers perform poorly on text. Here, we propose a text classifier based on a bipartite heterogeneous network that represents the textual document collection. The algorithm induces a classification model by assigning weights to the objects that represent the terms of the collection; the induced weights correspond to the influence of the terms on the classification of the documents in which they appear. The least-mean-square algorithm is used in the inductive process. An empirical evaluation on a large number of textual document collections shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naïve Bayes algorithms.
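
A condensed, hypothetical sketch of the inductive step: the bipartite network is reduced to a term-frequency matrix, each term carries one weight per class, and least-mean-square updates push every document's combined term weights toward its class indicator. Variable names are mine, not the paper's.

    import numpy as np

    def imbhn_sketch(tf, y, n_classes, eta=0.1, epochs=50):
        """tf: (docs, terms) term-frequency matrix; y: class index per doc."""
        n_docs, n_terms = tf.shape
        W = np.zeros((n_terms, n_classes))      # one weight per term and class
        Y = np.eye(n_classes)[y]                # one-hot class indicators
        for _ in range(epochs):
            for d in range(n_docs):
                pred = tf[d] @ W                # combined influence of d's terms
                err = Y[d] - pred               # least-mean-square error
                W += eta * np.outer(tf[d], err) # update terms occurring in d
        return W                                # new doc: argmax of tf_new @ W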

15:06 Learning to Refine an Automatically Extracted Knowledge Base using Markov Logic (Short)
Shangpu Jiang, Daniel Lowd, and Dejing Dou
DM973

A number of text mining and information extraction projects, such as TextRunner and NELL, seek to build knowledge bases automatically from the rapidly growing amount of information on the web. In order to scale to the size of the web, these projects often employ ad hoc heuristics to reason about uncertain and contradictory information rather than reasoning jointly about all candidate facts. In this paper, we present a Markov logic-based system for cleaning an extracted knowledge base. This allows a scalable system such as NELL to take advantage of joint probabilistic inference or, conversely, allows Markov logic to be applied to a web-scale problem. Our system uses only the ontological constraints and confidence values of the original system, along with human-labeled data if available. The labeled data can be used to calibrate the confidence scores of the original system or to learn the effectiveness of individual extraction patterns. To achieve scalability, we introduce a neighborhood grounding method that instantiates only the part of the network most relevant to the given query. This allows us to partition the knowledge-cleaning task into tractable pieces that can be solved individually. In experiments on NELL's knowledge base, we evaluate several variants of our approach and find that they improve both F1 and the area under the precision-recall curve.
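
One plausible (not the authors') realization of neighborhood grounding: starting from a query fact, collect only the candidate facts reachable within a few hops through shared entities, and run joint inference over that subnetwork alone.

    from collections import deque

    def neighborhood(query, facts, hops=2):
        """facts: set of (relation, subject, object) candidate triples."""
        by_entity = {}
        for f in facts:                          # index facts by their entities
            for e in (f[1], f[2]):
                by_entity.setdefault(e, set()).add(f)
        seen, frontier = {query}, deque([(query, 0)])
        while frontier:
            fact, d = frontier.popleft()
            if d == hops:
                continue
            for e in (fact[1], fact[2]):         # hop through shared entities
                for g in by_entity.get(e, ()):
                    if g not in seen:
                        seen.add(g)
                        frontier.append((g, d + 1))
        return seen                              # ground the MLN over this set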

15:18 Semantic Aspect Discovery for Online Reviews (Short)
Md. Hijbul Alam and SangKeun Lee
DM559

The number of opinions and reviews about products and services posted online keeps growing. Users frequently look for the important aspects of a product or service in these reviews, and they are usually interested in semantic (i.e., sentiment-oriented) aspects. However, extracting semantic aspects with supervised methods is very expensive. We propose a domain-independent unsupervised model to extract semantic aspects, and we conduct qualitative and quantitative experiments to evaluate the extracted aspects. The experiments show that our model effectively extracts semantic aspects with correlated top words. In addition, an evaluation on aspect sentiment classification shows that our model outperforms other models by 5-7% in terms of macro-averaged F1.
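
The paper's model is unsupervised but presumably not plain LDA; as a stand-in, the snippet below runs scikit-learn's LDA over a toy review corpus and prints each topic's top words, the usual starting point for aspect discovery. The corpus and all parameter choices are illustrative.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = ["the battery life is great", "terrible battery, poor screen",
               "screen quality is amazing", "fast shipping, great service"]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(reviews)               # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-4:][::-1]]
        print(f"aspect {k}: {', '.join(top)}")   # candidate aspect words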