You are here

Session Details

Models and Algorithms for New Applications

Tuesday, 11 December
16:00 – 18:00
Room: Salle des nations I
Session Chair: Matthijs van Leeuwen

16:00 ETM: Entity Topic Models for Mining Documents Associated with Entities short_paper) echo(" (Short)");?>
Hyungsul Kim, Yizhou Sun, Julia Hockenmaier, and Jiawei Han
DM822

Topic models, which factor each document into different topics and represent each topic as a distribution of terms, have been widely and successfully used to better understand collections of text documents. However, documents are also associated with further information, such as the set of real-world entities mentioned in them. For example, news articles are usually related to several people, organizations, countries or locations. Since those associated entities carry rich information, it is highly desirable to build more expressive, entity-based topic models, which can capture the term distributions for each topic, each entity, as well as each topic-entity pair. In this paper, we therefore introduce a novel Entity Topic Model (ETM) for documents that are associated with a set of entities. ETM not only models the generative process of a term given its topic and entity information, but also models the correlation of entity term distributions and topic term distributions. A Gibbs sampling-based algorithm is proposed to learn the model. Experiments on real datasets demonstrate the effectiveness of our approach over several state-of-the-art baselines.

16:20 Efficient Algorithms for Finding Richer Subgroup Descriptions in Numeric and Nominal Data short_paper) echo(" (Short)");?>
Michael Mampaey, Siegfried Nijssen, Ad Feelders, and Arno Knobbe
DM694

Subgroup discovery systems are concerned with finding interesting patterns in labeled data. How these systems deal with numeric and nominal data has a large impact on the quality of their results. In this paper, we consider two ways to extend the standard pattern language of subgroup discovery: using conditions that test for interval membership for numeric attributes, and value set membership for nominal attributes. We assume a greedy search setting, that is, iteratively refining a given subgroup, with respect to a (convex) quality measure. For numeric attributes, we propose an algorithm that finds the optimal interval in linear (rather than quadratic) time, with respect to the number of examples and split points. Similarly, for nominal attributes, we show that finding the optimal set of values can be achieved in linear (rather than exponential) time, with respect to the number of examples and the size of the domain of the attribute. These algorithms operate by only considering subgroup refinements that lie on a convex hull in ROC space, thus significantly narrowing down the search space. We further provide efficient algorithms specifically for the popular Weighted Relative Accuracy quality measure, taking advantage of some of its properties. Our algorithms are shown to perform well in practice, and furthermore provide additional expressive power leading to higher-quality results.

16:40 Sparse Group Selection on Fused Lasso Components for Identifying Group-specific DNA Copy Number Variations short_paper) echo(" (Short)");?>
Ze Tian, Huanan Zhang, and Rui Kuang
DM380

Detecting DNA copy number variations (CNVs) from arrayCGH or genotyping-array data to correlate with cancer outcomes is crucial for understanding the molecular mechanisms underlying cancer. Previous methods either focus on detecting CNVs in each individual patient sample or common CNVs across all the patient samples. These methods ignore the discrepancies introduced by the heterogeneity in the patient samples, which implies that common CNVs might only be shared within some groups of samples instead of all samples. In this paper, we propose a latent feature model that couples sparse sample group selection with fused lasso on CNV components to identify group-specific CNVs. Assuming a given group structure on patient samples by clinical information, sparse group selection on fused lasso (SGS-FL) identifies the optimal latent CNV components, each of which is specific to the samples in one or several groups. The group selection for each CNV component is determined dynamically by an adaptive algorithm to achieve a desired sparsity. Simulation results show that SGS-FL can more accurately identify the latent CNV components when there is a reliable underlying group structure in the samples. In the experiments on arrayCGH breast cancer and bladder cancer datasets, SGS-FL detected CNV regions that are more relevant to cancer, and provided latent feature weights that can be used for better sample classification.

17:00 Multiplicative Algorithms for Constrained Non-negative Matrix Factorization short_paper) echo(" (Short)");?>
Chengbin Peng, Ka-Chun Wong, Alyn Rockwood, Xiangliang Zhang, Jinling Jiang, and David E. Keyes
DM334

Non-negative matrix factorization (NMF) provides the advantage of parts-based data representation through additive only combinations. It has been widely adopted in areas like item recommending, text mining, data clustering, speech denoising, etc. In this paper, we provide an algorithm that allows the factorization to have linear or approximately linear constraints with respect to each factor. We prove that if the constraint function is linear, algorithms within our multiplicative framework will converge. This theory supports a large variety of equality and inequality constraints, and can facilitate application of NMF to a much larger domain. Taking the recommender system as an example, we demonstrate how a specialized weighted and constrained NMF algorithm can be developed to fit exactly for the problem, and the tests justify that our constraints improve the performance for both weighted and unweighted NMF algorithms under several different metrics. In particular, on the Movie lens data with 94% of items, the Constrained NMF improves recall rate 3% compared to SVD50 and 45% compared to SVD150, which were reported as the best two in the top-N metric.

17:12 Rapid and Robust Denoising of Pyrosequenced Amplicons for Metagenomics short_paper) echo(" (Short)");?>
Byunghan Lee, Joonhong Park, and Sungroh Yoon
DM446

Metagenomic sequencing has become a crucial tool for obtaining a gene catalogue of operational taxonomic units (OTUs) in a microbial community. High-throughput pyrosequencing is a next-generation sequencing technique very popular in microbial community analysis due to its longer read length compared to alternative methods. Computational tools are inevitable to process raw data from pyrosequencers, and in particular, noise removal is a critical data-mining step to obtain robust sequence reads. However, the slow rate of existing denoisers has bottlenecked the whole pyrosequencing process, let alone hindering efforts to improve robustness. To address these, we propose a new approach that can accelerate the denoising process substantially. By using our approach, it now takes only about 2 hours to denoise 62,873 pyrosequenced amplicons from a mixture of 91 full-length 16S rRNA clones. It would otherwise take nearly 2.5 days if existing software tools were used. Furthermore, our approach can effectively reduce overestimating the number of OTUs, producing 6.7 times fewer species-level OTUs on average than a state-of-the-art alternative under the same condition. Leveraged by our approach, we hope that metagenomic sequencing will become an even more appealing tool for microbial community analysis.

17:24 Multi-task Learning for Classifying Proteins with Dual Hierarchies short_paper) echo(" (Short)");?>
Anveshi Charuvaka and Huzefa Rangwala
DM724

Several biological databases organize information in taxonomies/hierarchies. These databases differ in terms of curation process, input data, coverage and annotation errors. SCOP and CATH are examples of two databases that classify proteins hierarchically into structurally related groups based on experimentally determined structures. Given the large number of protein sequences with unavailable structure, there is a need to develop prediction methods to classify protein sequences into structural classes. We have developed a novel classification approach that utilizes the underlying relationships across multiple hierarchical source databases within a multi-task learning (MTL) framework. MTL is used to simultaneously learn multiple related tasks, and has been shown to improve generalization performance. Specifically, we have developed and evaluated an MTL approach for predicting the structural class, as defined by two hierarchical databases, CATH and SCOP, using protein sequence information only. We define one task per node of the hierarchies and formulate the MTL problem as a combination of these binary classification tasks. Our experimental evaluation demonstrates that the MTL approach that integrates both the hierarchies outperforms the base-line approach that trains independent models per task, as well as a MTL approach that integrates tasks across a single hierarchical database. We also performed extensive experiments that evaluate different regularization penalties and incorporate different task relationships that achieve superior classification performance.

17:36 A New Proposal for Score Normalization in Biometric Signature Recognition Based on Client Threshold Prediction short_paper) echo(" (Short)");?>
Carlos Vivaracho-Pascual, Arancha Simon-Hurtado, Esperanza Manso-Martinez, and Juan Pascual-Gaspar
DM579

Score Normalization is a usual technique in pattern recognition to standardize the classifier output ranges so as to, for example, fuse these outputs. The use of score normalization in biometric recognition is a very important part of the system, specially in those based on behavioral traits, such as written signature or voice, conditioning the final system performance. Then, many works can be found that focus on the problem. A successful new approach for client threshold prediction, based on Multiple Linear Prediction, has been presented in recent works. Here, a new approach for score normalization, based on this proposal for biometric manuscript signature user verification, is shown. This proposal is compared with the state of the art methods, achieving an improvement of 19% and 16% for Equal Error Rate (EER) and 60% and 26% for Detection Cost Function (DCF) performance measures, for random and skilled forgeries, respectively.

17:48 Analysis of Temporal High-Dimensional Gene Expression Data for Identifying Informative Biomarker Candidates short_paper) echo(" (Short)");?>
Qiang Lou and Zoran Obradovic
DM859

Identifying informative biomarkers from a large pool of candidates is the key step for accurate prediction of an individual's health status. In clinical applications traditional static feature selection methods that flatten the temporal data cannot be directly applied since the patient's observed clinical condition is a temporal multivariate time series where different variables can capture various stages of temporal change in the patient's health status. In this study, in order to identify informative genes in temporal microarray data, a margin based feature selection filter is proposed. The proposed method is based on well-established machine learning techniques without any assumptions about the data distribution. The objective function of temporal margin-based feature selection is defined to maximize each subject's temporal margin in its own relevant subspace. In the objective function, the uncertainty in calculating nearest neighbors is taken into account by considering the change in feature weights in each iteration. A fixed-point gradient descent method is proposed to solve the formulated objective function. The experimental results on both synthetic and real data provide evidence that the proposed method can identify more informative features than the alternatives that flatten the temporal data in advance.