Jure Leskovec, Computer Science Dept., Stanford

Includes joint work with Jaewon Yang, Manuel Gomez-Rodriguez, Jon Kleinberg, Lars Backstrom, and Andreas Krause
http://memetracker.org

Jure Leskovec (jure@cs.stanford.edu)

Global vs. local effects: interaction of global effects from mass media and local effects carried by the social structure (e.g., blogs, Twitter).
Internet, blogs, social media: with social media, the dichotomy between global and local influence is evaporating.
The speed of media reporting and discussion has intensified: stories progress very rapidly.
How does information transmitted by the media interact with social networks?

In principle, we can collect nearly all (online) news media content:
  10 million articles/day (50GB of data)
  Collecting data since Aug 08: ~10TB
Could study the media ecosystem at large.
Challenges:
  Humans don't scale: develop automatic computational methods
  What are the basic units of information? Units that propagate between the nodes

[w/ Backstrom-Kleinberg, KDD 09]
Would like units that correspond to aggregates of articles, vary over the order of days, and can be handled at terabyte scale.
Plan: identify textual fragments, phrases, and memes that travel relatively intact through many articles.
Things that don't work:
  Cascading hyperlinks to articles: too fine-grained
  Topics as probabilistic term mixtures: too coarse-grained
  Named entities: too coarse-grained
  Common sequences of words: too noisy
Idea: quoted phrases, ".*":
  Are integral parts of journalistic practice
  Tend to follow iterations of a story as it evolves
  Are attributed to individuals and have time and location
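The quoted-phrase idea can be illustrated with a minimal sketch. This is not the pipeline used in the work cited above. The regular expression, the `min_words` threshold, and the toy articles are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: text between straight or curly double quotes.
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(text, min_words=4):
    """Return quoted phrases with at least `min_words` words, lowercased.

    `min_words` is a hypothetical filter that drops very short quotes.
    """
    phrases = []
    for match in QUOTE_RE.finditer(text):
        # Normalize case and whitespace so near-identical quotes collapse.
        phrase = " ".join(match.group(1).lower().split())
        if len(phrase.split()) >= min_words:
            phrases.append(phrase)
    return phrases


def phrase_volume(articles, min_words=4):
    """Count, for each phrase, the number of articles that quote it."""
    counts = Counter()
    for text in articles:
        counts.update(set(extract_quotes(text, min_words)))
    return counts


# Toy articles (made up for this example).
articles = [
    'She said "he is palling around with terrorists" at the rally.',
    'Critics replayed the line "He is palling around with terrorists" today.',
    'The senator called it "nonsense" and moved on.',
]
volume = phrase_volume(articles)
```

Exact matching after normalization only groups identical quotes. Grouping mutated variants of the same phrase would need an additional clustering step that this sketch does not attempt.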

[w/ Backstrom-Kleinberg, KDD 09]
Data from Spinn3r on the 3 months leading up to the 2008 U.S. Presidential Election:
  1 million news articles and blog posts per day
  Essentially complete online media coverage:
    20,000 sites that are part of Google News
    1.6 million blogs
  From August 1 to October 31, 2008: 90 million documents from 1.65 million sites, 390GB
We extract 112 million quotes (phrases).

[w/ Backstrom-Kleinberg, KDD 09]
Phrase: "Our opponent is someone who sees America, it seems, as being so imperfect, imperfect enough that he's palling around with terrorists who would target their own country."

is periodic (weekly), no trends
The bandwidth of the online media is constant

[Figure: volume over time of the top 50 phrases by total volume, August to October. Source: http://memetracker.org]

Peak blog intensity comes about 2.5 hours after the news peak.
Using Google News we label:
  Mainstream media: 20,000 sites (44% vol.)
  Blogs (everything else): 1.6 million sites (56% vol.)

Can classify individual sources by their typical timing relative to the peak aggregate intensity.
[Figure labels: Professional blogs; News media]

The oscillation of attention between mainstream and social media

[w/ Yang, ICDM 10]
Question: If the New York Times mentions a meme at time t, how many subsequent mentions of the meme does this generate at times t+1, t+2, ...?
Formulation: We want to predict the volume x(t) of phrase x at time t as a function of the influences of the sites that mentioned the meme before time t.

[w/ Yang, ICDM 10]
LIM model:
Given the volume over time x(t) of meme x, let:
  I_A(t): influence curve of site A
  t_A: time when A mentioned x
Then we model:
  x(t+1) = \sum_{W} I_W(t - t_W)
where the sum runs over the sites W that have already mentioned x.
For each site W, estimate I_W(t).
It boils down to a least-squares-like problem.
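As a rough sketch of how the estimation could reduce to least squares, assume each influence curve is a vector of L discrete time steps and time is indexed by integers. The curve length `L`, the helper names, and the synthetic data below are assumptions for illustration. The cited paper's exact formulation (for example, any constraints on the curves) is not reproduced here.

```python
import numpy as np


def active_lags(mention_times, t, L):
    """Yield (site, lag) pairs with 0 <= t - t_W < L, i.e. sites still influencing time t+1."""
    for site, t_w in mention_times.items():
        lag = t - t_w
        if 0 <= lag < L:
            yield site, lag


def simulate(true_curves, mention_times, T):
    """Generate x(t) from x(t+1) = sum_W I_W(t - t_W) for known curves."""
    L = true_curves.shape[1]
    volume = np.zeros(T)
    for t in range(T - 1):
        for site, lag in active_lags(mention_times, t, L):
            volume[t + 1] += true_curves[site, lag]
    return volume


def fit_influence(memes, n_sites, L):
    """Estimate one length-L influence curve per site by ordinary least squares.

    memes: list of (volume, mention_times), where mention_times maps
    a site index to the time t_W at which that site mentioned the meme.
    """
    rows, targets = [], []
    for volume, mention_times in memes:
        for t in range(len(volume) - 1):
            row = np.zeros(n_sites * L)
            for site, lag in active_lags(mention_times, t, L):
                row[site * L + lag] = 1.0  # indicator for unknown I_site(lag)
            rows.append(row)
            targets.append(volume[t + 1])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coef.reshape(n_sites, L)


# Synthetic check: two sites, curves of length 3, three memes (made-up numbers).
true_curves = np.array([[5.0, 3.0, 1.0], [2.0, 1.0, 0.0]])
schedules = [{0: 0}, {1: 0}, {0: 0, 1: 2}]
memes = [(simulate(true_curves, s, T=8), s) for s in schedules]
estimated = fit_influence(memes, n_sites=2, L=3)
```

On noise-free synthetic data the fit recovers the generating curves. Real volumes are noisy, and a non-negativity constraint on the curves may be a sensible addition.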

[w/ Yang, ICDM 10]
Task: Predict the volume x(t+1) of phrase x based on the influences of the sites that have already mentioned x.
Setting: 1,000 phrases and only 20 websites.
[Figure: improvement in L1 error over the 1-time-lag predictor]
By monitoring only 20 sites, we can reliably predict the overall future volume of a phrase (link, hashtag).
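The evaluation metric can be sketched as follows, assuming the 1-time-lag predictor simply forecasts x(t+1) = x(t). The function names and the numbers are hypothetical and are not results from the cited work.

```python
def l1_error(actual, predicted):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def relative_improvement(volume, model_pred):
    """Fractional reduction in L1 error relative to the 1-time-lag baseline.

    volume: observed x(0), ..., x(T-1)
    model_pred: model forecasts for x(1), ..., x(T-1)
    """
    actual = volume[1:]
    baseline_pred = volume[:-1]  # assumed baseline: x(t+1) = x(t)
    base_err = l1_error(actual, baseline_pred)
    model_err = l1_error(actual, model_pred)
    return (base_err - model_err) / base_err  # undefined if base_err == 0


# Toy numbers, made up for illustration only.
volume = [0, 10, 30, 20, 5]
model_pred = [8, 25, 22, 6]
improvement = relative_improvement(volume, model_pred)
```

Here the baseline error is 55 and the model error is 10, so the improvement is 45/55, about 0.82.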

[w/ Yang, ICDM 10]
Business and politics are driven by mainstream media.
Entertainment (and sports) is driven by blogs and TV.
Newspapers and news agencies do not influence the volume.

[w/ Gomez-Krause, KDD 10]
But how does information really spread?
We only see the mentions but not the propagation.
Can we reconstruct the (hidden) diffusion network?

[w/ Gomez-Krause, KDD 10]
There is a hidden diffusion network (figure: nodes a, b, c, d, e).
We only see the times when nodes get infected:
  c_1: (a,1), (c,2), (b,3), (e,4)
  c_2: (c,1), (a,4), (b,5), (d,6)
Want to infer the who-infects-whom network.
The problem is NP-hard.
Our algorithm can do it near-optimally in O(N^2).
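To make the input concrete, the sketch below scores candidate edges from the two cascades above. It is a deliberately naive baseline and not the algorithm from the cited paper: each ordered pair (u, v) is scored by how often u is infected before v, with an assumed exponential decay in the time gap. The `alpha` parameter and the function names are hypothetical.

```python
import math
from collections import defaultdict


def score_edges(cascades, alpha=1.0):
    """Sum exp(-alpha * (t_v - t_u)) over cascades where u is infected before v."""
    scores = defaultdict(float)
    for cascade in cascades:
        for u, t_u in cascade:
            for v, t_v in cascade:
                if t_u < t_v:  # influence can only flow forward in time
                    scores[(u, v)] += math.exp(-alpha * (t_v - t_u))
    return scores


def top_k_edges(cascades, k, alpha=1.0):
    """Return the k highest-scoring directed edges (ties broken by edge name)."""
    scores = score_edges(cascades, alpha)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [edge for edge, _ in ranked[:k]]


# The two cascades from the slide, as (node, infection time) pairs.
cascades = [
    [("a", 1), ("c", 2), ("b", 3), ("e", 4)],
    [("c", 1), ("a", 4), ("b", 5), ("d", 6)],
]
scores = score_edges(cascades)
edges = top_k_edges(cascades, k=2)
```

Pairwise scoring like this cannot tell a direct infection from an indirect one, and that gap is part of what makes the real inference problem hard.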

[w/ Gomez-Krause, KDD 10]
[Figure: 5,000 news sites. Legend: Blogs; Mainstream media]

[w/ Gomez-Krause, KDD 10]
[Figure. Legend: Blogs; Mainstream media]

Want to read things before others do.
[Figure: one selection detects the blue and yellow stories soon but misses the red one; another detects all stories, but late.]

Given a budget (e.g., of 3 blogs), select sites to cover as much of the Web as possible.
Bad news: Solving this exactly is NP-hard.
Good news: Theorem: Our algorithm can do it near-optimally in linear time.
[Figure label: Blogosphere]

Question: Which websites should one read to catch big stories?
Idea: Each blog covers part of the Web.
[Figure: each dot is a blog; proximity is based on the number of common cascades]
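The budgeted selection problem can be sketched with a plain greedy heuristic for maximum coverage. This is an illustration only: the slides' own algorithm and its running-time guarantee are not reproduced here, and the blog names and story identifiers are hypothetical.

```python
def greedy_select(coverage, budget):
    """Pick up to `budget` blogs, each time adding the one that covers
    the most stories not yet covered.

    coverage: dict mapping a blog name to the set of stories it covers.
    Returns (selected blogs in pick order, set of covered stories).
    """
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for blog in sorted(coverage):  # sorted for deterministic tie-breaking
            if blog in selected:
                continue
            gain = len(coverage[blog] - covered)  # marginal gain
            if gain > best_gain:
                best, best_gain = blog, gain
        if best is None:  # no remaining blog adds new stories
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered


# Hypothetical blogs and story identifiers.
coverage = {
    "blog_a": {1, 2, 3, 4},
    "blog_b": {3, 4, 5},
    "blog_c": {5, 6},
    "blog_d": {1, 6},
}
selected, covered = greedy_select(coverage, budget=2)
```

For plain coverage objectives like this one, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal coverage. Rewarding early detection, as in the slides, would need a time-aware objective in place of the set-size gain.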

Which blogs to read to be most up to date?
[Figure: % of stories detected (higher is better) vs. number of selected blogs. Curves: Our solution; In-links; Out-links; # posts (used by Technorati); Random]
www.blogcascades.org


J. Leskovec, L. Backstrom, J. Kleinberg. Meme-tracking and the Dynamics of the News Cycle. KDD, 2009. http://cs.stanford.edu/people/jure/pubs/quoteskdd09.pdf
J. Yang, J. Leskovec. Modeling Information Diffusion in Implicit Networks. ICDM, 2010. http://cs.stanford.edu/people/jure/pubs/lim-icdm10.pdf
M. Gomez-Rodriguez, J. Leskovec, A. Krause. Inferring Networks of Diffusion and Influence. KDD, 2010. http://cs.stanford.edu/people/jure/pubs/netinf-kdd2010.pdf
Pew Research Center's Project for Excellence in Journalism. Covering the Great Recession. 2009. http://www.journalism.org/analysis_report/covering_great_recession
J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance. Cost-effective Outbreak Detection in Networks. KDD, 2007. http://cs.stanford.edu/people/jure/pubs/detect-kdd07.pdf
J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst. Cascading Behavior in Large Blog Graphs. SDM, 2007. http://cs.stanford.edu/~jure/pubs/blogs-sdm07.pdf