ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010

Mining Heterogeneous Information Networks

Jiawei Han†, Yizhou Sun†, Xifeng Yan§, Philip S. Yu‡
† University of Illinois at Urbana‐Champaign
§ University of California at Santa Barbara
‡ University of Illinois at Chicago

Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing

July 12, 2010

Outline
• Motivation: Why Mining Heterogeneous Information Networks?
• Part I: Clustering, Ranking and Classification
  • Clustering and Ranking in Information Networks
  • Classification of Information Networks
• Part II: Data Quality and Search in Information Networks
  • Data Cleaning and Data Validation by InfoNet Analysis
  • Similarity Search in Information Networks
• Part III: Advanced Topics on Information Network Analysis
  • Role Discovery and OLAP in Information Networks
  • Mining Evolution and Dynamics of Information Networks
• Conclusions

What Are Information Networks?
• Information network: A network where each node represents an entity (e.g., an actor in a social network) and each link (e.g., a tie) represents a relationship between entities
  • Each node/link may have attributes, labels, and weights
  • Links may carry rich semantic information
• Homogeneous vs. heterogeneous networks
  • Homogeneous networks
    • Single object type and single link type
    • Single model social networks (e.g., friends)
    • WWW: a collection of linked Web pages
  • Heterogeneous, multi‐typed networks
    • Multiple object and link types
    • Medical network: patients, doctors, diseases, contacts, treatments
    • Bibliographic network: publications, authors, venues

Ubiquitous Information Networks
• Graphs and substructures
  • Chemical compounds, computer vision objects, circuits, XML
• Biological networks
• Bibliographic networks: DBLP, ArXiv, PubMed, …
• Social networks: Facebook > 100 million active users
• World Wide Web (WWW): > 3 billion nodes, > 50 billion arcs
• Cyber‐physical networks
[Figures: yeast protein interaction network; an Internet Web; co-author network; social network sites]

Homogeneous vs. Heterogeneous Networks
[Figures: Co-author Network; Conference-Author Network]

DBLP: An Interesting and Familiar Network
• DBLP: A computer science publication bibliographic database
  • 1.4 M records (papers), 0.7 M authors, 5 K conferences, …
• Will this database disclose interesting knowledge about computer science research?
  • What are the popular research fields/subfields in CS?
  • Who are the leading researchers on DB or XQueries?
  • How do the authors in this subfield collaborate and evolve?
  • How many Wei Wangs are in DBLP, and which paper was written by which one?
  • Who was Sergey Brin's supervisor, and when?
  • Who are very similar to Christos Faloutsos? ……
• All these kinds of questions, and potentially much more, can be nicely answered by the DBLP‐InfoNet
  • How? Exploring the power of links in information networks!

Homo. vs. Hetero.: Differences in DB‐InfoNet Mining
• Homogeneous networks can often be derived from their original heterogeneous networks
  • Coauthor networks can be derived from author‐paper‐conference networks by projection on authors only
  • Paper citation networks can be derived from a complete bibliographic network with papers and citations projected
• Heterogeneous DB‐InfoNet carries richer information than its corresponding projected homogeneous networks
• Typed heterogeneous InfoNet vs. non‐typed hetero. InfoNet (i.e., not distinguishing different types of nodes)
  • Typed nodes and links imply a more structured InfoNet, and thus often lead to more informative discovery
• Our emphasis: Mining “structured” information networks!

Why Mining Heterogeneous Information Networks?
• Most datasets can be “organized” or “transformed” into a “structured” heterogeneous information network!
  • Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
• Structures can be progressively extracted from less organized data sets by information network analysis
• Information‐rich, inter‐related, organized data sets form one or a set of gigantic, interconnected, multi‐typed heterogeneous information networks
• Surprisingly rich knowledge can be derived from such structured heterogeneous information networks
• Our goal: Uncover knowledge hidden from “organized” data
  • Exploring the power of multi‐typed, heterogeneous links
  • Mining “structured” heterogeneous information networks!
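
The projection described under "Homo. vs. Hetero." (deriving a coauthor network from an author‐paper‐conference network) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy papers, the author and venue names, the tuple layout for typed links, and the helper names build_hetero_network and project_coauthors. None of it comes from the tutorial or from DBLP.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (paper id, authors, venue). Not real DBLP records.
PAPERS = [
    ("p1", ["alice", "bob"], "KDD"),
    ("p2", ["alice", "carol"], "VLDB"),
    ("p3", ["alice", "bob", "carol"], "KDD"),
]


def build_hetero_network(papers):
    """Build typed links: (source type, source, link type, target type, target)."""
    links = []
    for paper, authors, venue in papers:
        for author in authors:
            links.append(("author", author, "writes", "paper", paper))
        links.append(("paper", paper, "published_in", "venue", venue))
    return links


def project_coauthors(links):
    """Project onto authors only; edge weight = number of co-written papers."""
    authors_of = defaultdict(set)
    for _src_type, src, link_type, _dst_type, dst in links:
        if link_type == "writes":
            authors_of[dst].add(src)

    weights = defaultdict(int)
    for authors in authors_of.values():
        # Every unordered pair of authors on the same paper becomes one coauthor tie.
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return dict(weights)


links = build_hetero_network(PAPERS)
coauthors = project_coauthors(links)
print(coauthors)
```

In this sketch the projected network keeps only author pairs and counts. The paper and venue nodes, and with them the fact that a given collaboration appeared at a particular venue, are no longer recoverable from it, which is one concrete sense in which the heterogeneous network carries richer information than its projection.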