-- Final exam logistics -- Please fill out course evaluation forms (THANKS!!!) CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
3/12/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2
Date: Tuesday, March 20, 3:30-6:30 pm
Location: Gates B01 (last name A-H), Skilling Auditorium (last name I-N), NVIDIA Auditorium (last name O-Z)
SCPD: Your exam monitor will receive the exam this week. You may come to Stanford to take the exam, or take it at most 48 hours before the exam time. Email the exam PDF to cs246.mmds@gmail.com by Tuesday, March 20, 11:59 pm Pacific Time.
Call Jessica at +1 561-543-1855 if you have questions during the exam, or email the course staff mailing list.
The final exam is open book and open notes. A calculator or computer is REQUIRED.
You may only use your computer to do arithmetic calculations (i.e., the buttons found on a standard scientific calculator). You may also use your computer to read course notes or the textbook, but no internet/Google/Python access is allowed.
Anyone who brings a power strip to the final exam gets 1 point extra credit for helping out their classmates.
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
Redundancy leads to a bad user experience. There is uncertainty around the user's information need, so don't put all eggs in one basket. How do we optimize for diversity directly?
Monday, January 14, 2013: France intervenes; Chuck for Defense; Argo wins big; Hagel expects fight
Monday, January 14, 2013: France intervenes; Chuck for Defense; Argo wins big; New gun proposals
Idea: Encode diversity as a coverage problem
Example: Word cloud of news for a single day
Want to select articles so that most words are covered
Q: What is being covered? A: Concepts (in our case: named entities): France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL
Q: Who is doing the covering? A: Documents (e.g., "Hagel expects fight")
Suppose we are given a set of documents D. Each document d covers a set X_d of words/topics/named entities from W.
For a set of documents A ⊆ D we define F(A) = |⋃_{d∈A} X_d|
Goal: We want to max_{A⊆D} F(A)
Note: F(A) is a set function: F maps subsets of D to ℕ
Given a universe of elements W = {w_1, …, w_n} and sets X_1, …, X_m ⊆ W
Goal: Find k sets X_i that cover the most of W
More precisely: Find k sets X_i whose union has the largest size
Bad news: A known NP-complete problem
Simple heuristic: the greedy algorithm:
Start with A_0 = {}
For i = 1 … k:
  Find the set d that maximizes F(A_{i-1} ∪ {d})
  Let A_i = A_{i-1} ∪ {d}
Example: Evaluate F({X_1}), …, F({X_m}), pick the best (say X_1)
Evaluate F({X_1, X_2}), …, F({X_1, X_m}), pick the best (say X_2)
Evaluate F({X_1, X_2, X_3}), …, F({X_1, X_2, X_m}), pick the best
And so on, where F(A) = |⋃_{X_i∈A} X_i|
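The greedy steps above can be sketched in Python; the document ids and element sets below are made-up examples, not from the lecture.

```python
# Greedy heuristic for maximum coverage: k times, pick the set with the
# largest marginal gain (number of newly covered elements).

def greedy_max_coverage(sets, k):
    """sets: dict mapping an id to the frozenset of elements it covers.
    Returns (chosen ids in pick order, the union of covered elements)."""
    chosen, covered = [], set()
    for _ in range(k):
        # Re-evaluate the marginal gain of every remaining set.
        best = max((s for s in sets if s not in chosen),
                   key=lambda s: len(sets[s] - covered),
                   default=None)
        if best is None or not (sets[best] - covered):
            break  # nothing left to gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Toy instance (hypothetical documents and the concepts they mention):
docs = {
    "d1": frozenset({"France", "Mali"}),
    "d2": frozenset({"Hagel", "Pentagon", "Obama"}),
    "d3": frozenset({"Obama", "Romney"}),
}
picked, covered = greedy_max_coverage(docs, k=2)
print(picked, sorted(covered))  # d2 first (3 new elements), then d1 (2 new)
```

Each round costs one pass over the remaining sets, matching the O(|D|·k) evaluation count discussed later for naive greedy.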
Goal: Maximize the covered area
Goal: Maximize the size of the covered area
Example with three sets A, B, C: greedy first picks A and then C, but the optimal way would be to pick B and C
Greedy produces a solution A where: F(A) ≥ (1 − 1/e)·OPT, i.e., F(A) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]
The claim holds for functions F(·) with 2 properties:
F is monotone (adding more docs doesn't decrease coverage): if A ⊆ B then F(A) ≤ F(B), and F({}) = 0
F is submodular: adding an element to a set gives less improvement than adding it to one of its subsets
Definition: A set function F(·) is called submodular if:
For all A, B ⊆ W: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
Diminishing-returns characterization. Equivalent definition: a set function F(·) is called submodular if:
For all A ⊆ B, d ∉ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Gain of adding d to a small set ≥ gain of adding d to a large set (large improvement vs. small improvement)
F(·) is submodular: for A ⊆ B, F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
(gain of adding d to a small set vs. gain of adding d to a large set)
Natural example: sets X_1, …, X_m with F(A) = |⋃_{i∈A} X_i| (size of the covered area)
Claim: F(A) is submodular!
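The claim can be sanity-checked by brute force on a tiny instance: enumerate every A ⊆ B and d ∉ B and verify the diminishing-returns inequality for the coverage function. The three sets below are hypothetical.

```python
# Brute-force check that coverage F(A) = |union of X_i, i in A| satisfies
# F(A ∪ {d}) - F(A) >= F(B ∪ {d}) - F(B) for all A ⊆ B and d ∉ B.
from itertools import chain, combinations

X = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}  # illustrative sets

def F(A):
    """Coverage: size of the union of the chosen sets."""
    return len(set().union(*(X[i] for i in A))) if A else 0

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

ok = True
for B in powerset(X):
    B = set(B)
    for A in powerset(B):          # every subset A of B
        A = set(A)
        for d in set(X) - B:       # every element d outside B
            ok &= F(A | {d}) - F(A) >= F(B | {d}) - F(B)
print("submodular:", ok)
```

This is only a check on one instance, of course; the general proof follows from the fact that any element d newly covers a subset of X_d, and that subset can only shrink as the chosen collection grows.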
Submodularity is the discrete analogue of concavity:
For all A ⊆ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B)
Adding d to B helps less than adding it to A!
(Picture: F grows with solution size, but with diminishing slope.)
Marginal gain: Δ_F(d | A) = F(A ∪ {d}) − F(A)
Submodular: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B
Concavity: f(a + d) − f(a) ≥ f(b + d) − f(b) for a ≤ b
Let F_1, …, F_m be submodular and λ_1, …, λ_m > 0. Then F(A) = Σ_i λ_i F_i(A) is submodular.
Submodularity is closed under non-negative linear combinations!
This is an extremely useful fact:
Average of submodular functions is submodular: F(A) = (1/m) Σ_i F_i(A)
Multicriterion optimization: F(A) = Σ_i λ_i F_i(A)
Objective: pick k docs that cover most concepts
Concepts: France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL
Example docs: "Enthusiasm for Inauguration wanes", "Inauguration weekend"
F(A): the number of concepts covered by A
Elements = concepts; sets = concepts in docs
F(A) is submodular and monotone! We can use greedy to optimize F
Objective: pick k docs that cover most concepts
The good: penalizes redundancy; submodular
The bad: concept importance? All-or-nothing coverage is too harsh
Objective: pick k docs that cover most concepts
Each concept c has importance weight w_c
Document coverage function: cover_d(c) = probability document d covers concept c [e.g., how strongly d covers c]
Example: the document "Enthusiasm for Inauguration wanes" covers the concepts Obama and Romney to different degrees
Document coverage function: cover_d(c) = probability document d covers concept c
cover_d(c) can also model how relevant concept c is for user u
Set coverage function: cover_A(c) = probability that at least one document in A covers c
Objective: max_{A: |A|≤k} F(A) = Σ_c w_c · cover_A(c), where the w_c are the concept weights
max_{A: |A|≤k} F(A) = Σ_c w_c · cover_A(c)
This objective function is also submodular: intuitive diminishing-returns property
The greedy algorithm leads to a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution
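A minimal sketch of this probabilistic set-cover objective, assuming independent coverage events so that "at least one document in A covers c" takes the noisy-OR form cover_A(c) = 1 − ∏_{d∈A}(1 − cover_d(c)); the concept weights and per-document probabilities below are invented for illustration.

```python
# Probabilistic set-cover objective F(A) = sum_c w_c * cover_A(c).
import math

w = {"Obama": 0.5, "Romney": 0.3, "Argo": 0.2}   # concept importance weights
cover = {                                         # cover_d(c), per document
    "d1": {"Obama": 0.8, "Romney": 0.6},
    "d2": {"Obama": 0.4, "Argo": 0.9},
}

def cover_A(A, c):
    """Probability that at least one document in A covers concept c,
    assuming independence across documents (noisy-OR)."""
    p_miss = math.prod(1.0 - cover[d].get(c, 0.0) for d in A)
    return 1.0 - p_miss

def F(A):
    return sum(w[c] * cover_A(A, c) for c in w)

# Adding d2 helps, but its overlap with d1 on "Obama" is partly redundant:
print(round(F({"d1"}), 3), round(F({"d1", "d2"}), 3))
```

Partial coverage fixes the "all-or-nothing" harshness noted above: a second document about Obama still adds value, just less than the first one did.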
Objective: pick k docs that cover most concepts
Each concept c has importance weight w_c
Documents partially cover concepts: cover_d(c)
Greedy. Marginal gain: F(A ∪ {x}) − F(A). At each step, add the document with the highest marginal gain.
The greedy algorithm is slow! At each iteration we need to re-evaluate the marginal gains of all remaining documents.
Runtime O(|D|·k) for selecting k documents out of the set of |D| of them.
[Leskovec et al., KDD '07] In round i: So far we have A_{i-1} = {d_1, …, d_{i-1}}
Now we pick d_i = argmax_d F(A_{i-1} ∪ {d}) − F(A_{i-1})
The greedy algorithm maximizes the marginal benefit Δ_i(d) = F(A_{i-1} ∪ {d}) − F(A_{i-1})
Observation: By submodularity, for every d: Δ_i(d) ≥ Δ_j(d) for i < j, since A_{i-1} ⊆ A_{j-1}
Marginal benefits Δ_i(d) only shrink! (as i grows)
Selecting document d in step i covers more words than selecting d at step j (j > i)
[Leskovec et al., KDD '07] Idea: Use Δ_i as an upper bound on Δ_j (j > i)
Lazy greedy:
Keep an ordered list of marginal benefits Δ_i from the previous iteration
Re-evaluate Δ_i only for the top element
Re-sort and prune
(The stale Δ_i is a valid upper bound on the true marginal gain because F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B.)
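The lazy-greedy idea maps naturally onto a max-heap of stale marginal gains: because a stale gain is an upper bound, only the top element ever needs re-evaluation. A sketch, with illustrative document sets:

```python
# Lazy greedy for maximum coverage using a heap of stale marginal gains.
# By submodularity, a stale gain only over-estimates the true gain, so if
# the freshly re-evaluated top element still beats the next bound, take it.
import heapq

def lazy_greedy(sets, k):
    """sets: dict id -> set of covered elements. Returns chosen ids."""
    covered, chosen = set(), []
    # Min-heap of (-stale_gain, id); initial gains are exact.
    heap = [(-len(elems), d) for d, elems in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, d = heapq.heappop(heap)
        fresh = len(sets[d] - covered)        # re-evaluate only the top
        if not heap or fresh >= -heap[0][0]:  # still the best -> take it
            if fresh == 0:
                break
            chosen.append(d)
            covered |= sets[d]
        else:                                  # push back with tighter bound
            heapq.heappush(heap, (-fresh, d))
    return chosen

docs = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}}
print(lazy_greedy(docs, 2))  # after picking "a", the stale bound for "b"
                             # drops below "c", so "c" is chosen next
```

The output matches naive greedy, but in practice most documents are never re-evaluated, which is the source of the large speedups reported in the KDD '07 paper.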
Summary so far:
Diversity can be formulated as a set cover
Set cover is a submodular optimization problem
It can be (approximately) solved using the greedy algorithm
Lazy greedy gives a significant speedup
[Plot: running time in seconds (lower is better) vs. number of blogs selected (1-10); exhaustive search over all subsets and naive greedy are far slower than lazy greedy.]
But what about personalization?
[Figure: a personalized "Recommendations" panel with articles such as "Election trouble", "Songs of Syria", "Sandy delays".]
We assumed the same concept weighting for all users
Concepts: France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL
Docs: France intervenes; Chuck for Defense; Argo wins big
Each user has different preferences over concepts: a politico weighs France, Mali, Hagel, Pentagon heavily, while a movie buff weighs Zero Dark Thirty, Argo heavily
Assume each user u has a different preference vector w_c^(u) over concepts c:
max_{A: |A|≤k} F(A) = Σ_c w_c^(u) · cover_A(c)
Goal: Learn personal concept weights from user feedback
Multiplicative Weights algorithm
Assume each concept c has weight w_c
We recommend document d and receive feedback, say r = +1 or −1
Update the weights: for each c ∈ d set w_c = β^r · w_c
If concept c appears in doc d and we received positive feedback r = +1, then we increase the weight w_c by multiplying it by β (β > 1); otherwise we decrease the weight (divide by β)
Normalize the weights so that Σ_c w_c = 1
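The update above can be sketched as follows; β = 2 and the concept names are illustrative choices, not fixed by the lecture.

```python
# Multiplicative-weights update for a user's personal concept weights:
# multiply w_c by beta**r for each concept in the recommended document,
# then renormalize so the weights sum to 1.

def update_weights(w, doc_concepts, r, beta=2.0):
    """w: dict concept -> weight; doc_concepts: concepts in the recommended
    document; r: feedback, +1 (liked) or -1 (disliked)."""
    w = dict(w)  # don't mutate the caller's weights
    for c in doc_concepts:
        w[c] *= beta ** r
    total = sum(w.values())
    return {c: v / total for c, v in w.items()}

w = {"Obama": 0.25, "Romney": 0.25, "Argo": 0.25, "NFL": 0.25}
# The user liked a document about Obama and Romney:
w = update_weights(w, {"Obama", "Romney"}, r=+1)
print(w)  # political concepts gain weight, the others shrink
```

Note that normalization means untouched concepts lose weight in relative terms, which is exactly the intended drift toward the user's demonstrated interests.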
Steps of the algorithm:
1. Identify items to recommend from
2. Identify concepts [what makes items redundant?]
3. Weigh concepts by general importance
4. Define the item-concept coverage function
5. Select items using probabilistic set cover
6. Obtain feedback, update the weights