You are here

Session Details

Social Networks 1

Tuesday, 11 December
10:00 – 12:00
Room: Salle des nations II
Session Chair: Ruoming Jin

10:00 Defining and Evaluating Network Communities based on Ground-truth short_paper) echo(" (Short)");?>
Jaewon Yang and Jure Leskovec
DM911

Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task mainly due to a plethora of definitions of a community, intractability of algorithms, issues with evaluation and the lack of a reliable gold-standard ground-truth. In this paper we study a set of 230 large real-world social, collaboration and information networks where nodes explicitly state their group memberships. For example, in social networks nodes explicitly join various interest based social groups. We use such groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology which allows us to compare and quantitatively evaluate how different structural definitions of network communities correspond to ground-truth communities. We choose 13 commonly used structural definitions of network communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad-participation-ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than hundred million nodes. The proposed method achieves 30% relative improvement over current local clustering methods.

10:20 Community Preserving Lossy Compression of Social Networks short_paper) echo(" (Short)");?>
Hossein Maserrat and Jian Pei
DM381

Compression plays an important role in social network analysis from both practical and theoretical points of view. Although there are a few pioneering studies on social network compression, they mainly focus on lossless approaches. In this paper, we tackle the novel problem of community preserving lossy compression of social networks. The trade-off between space and information preserved in a lossy compression presents an interesting angle for social network analysis, and, at the same time, makes the problem very challenging. We propose a sequence graph compression approach, discuss the design of objective functions towards community preservation, and present an interesting and practically effective greedy algorithm. Our experimental results on both real data sets and synthetic data sets demonstrate the promise of our method.

10:40 Spotting Culprits in Epidemics: How many and Which ones? short_paper) echo(" (Short)");?>
B. Aditya Prakash, Jilles Vreeken, and Christos Faloutsos
DM399

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper we answer this question affirmatively, and give an efficient method called NETSLEUTH for the well-known Susceptible-Infected virus propagation model. Essentially, we are after that set of seed nodes that best explain the given snapshot. We propose to employ the Minimum Description Length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give an highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NETSLEUTH can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experimentation on our method shows high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, we show NETSLEUTH scales linearly in the number of nodes of the graph.

11:00 Detecting Anomalies in Bipartite Graphs with Mutual Dependency Principles short_paper) echo(" (Short)");?>
Hanbo DAI, Feida ZHU, Ee-Peng LIM, and HweeHwa Pang
DM545

Bipartite graphs can model many real life applications including users-rating-products in online marketplaces, users-clicking-webpages on the World Wide Web and users referring- users in social networks. In these graphs, the anomalousness of nodes in one partite often depends on that of their connected nodes in the other partite. Previous studies have shown that this dependency can be positive (the anomalousness of a node in one partite increases or decreases along with that of its connected nodes in the other partite) or negative (the anomalousness of a node in one partite rises or falls in opposite direction to that of its connected nodes in the other partite). In this paper, we unify both positive and negative mutual dependency relationships in an unsupervised framework for detecting anomalous nodes in bipartite graphs. This is the first work that integrates both mutual dependency principles to model the complete set of anomalous behaviors of nodes that cannot be identified by either principle alone. We formulate our principles and design an iterative algorithm to simultaneously compute the anomaly scores of nodes in both partites. Moreover, we mathematically prove that the ranking of nodes by anomaly scores in each partite converges. Our framework is examined on synthetic graphs and the results show that our model outperforms existing models with only positive or negative mutual dependency principles. We also apply our framework to two real life datasets: Goodreads as a users-rating-books setting and Buzzcity as a users-clicking advertisements setting. The results show that our method is able to detect suspected spamming users and spammed books in Goodreads and achieve higher precision in identifying fraudulent advertisement publishers than existing approaches.

11:20 RankTopic: Ranking Based Topic Modeling short_paper) echo(" (Short)");?>
Dongsheng Duan, Yuhua Li, Ruixuan Li, Rui Zhang, and Aiming Wen
DM864

Topic modeling has become a widely used tool for document management due to its superior performance. However, there are few topic models distinguishing the importance of documents on different topics. In this paper, we investigate how to utilize the importance of documents to improve topic modeling and propose to incorporate link based ranking into topic modeling. Specifically, topical pagerank is used to compute the topic level ranking of documents, which indicates the importance of documents on different topics. By retreating the topical ranking of a document as the probability of the document involved in corresponding topic, a generalized relation is built between ranking and topic modeling. Based on the relation, a ranking based topic model Rank Topic is proposed. With Rank Topic, a mutual enhancement framework is established between ranking and topic modeling. Extensive experiments on paper citation data and Twitter data are conducted to compare the performance of Rank Topic with that of some state-of-the-art topic models. Experimental results show that Rank Topic performs much better than some baseline models and is comparable with the state-of-the-art link combined relational topic model (RTM) in generalization performance, document clustering and classification by setting a proper balancing parameter. It is also demonstrated in both quantitative and qualitative ways that topics detected by Rank Topic are more interpretable than those detected by some baseline models and still competitive with RTM.

11:40 Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos short_paper) echo(" (Short)");?>
Eric Nichols, Charles DuHadway, Hrishikesh Aradhye, and Richard Lyon
DM856

Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.