
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University, http://cs246.stanford.edu

Final exam logistics. Please fill out course evaluation forms (thanks!).

3/12/18, Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Date: Tuesday, March 20, 3:30-6:30 pm
Location: Gates B01 (last name A-H), Skilling Auditorium (last name I-N), NVIDIA Auditorium (last name O-Z)
SCPD: Your exam monitor will receive the exam this week. You may come to Stanford to take the exam, or take it at most 48 hours before the exam time. Email the exam PDF to cs246.mmds@gmail.com by Tuesday, March 20, 11:59 pm Pacific Time. Call Jessica at +1 561-543-1855 if you have questions during the exam, or email the course staff mailing list.

The final exam is open book and open notes. A calculator or computer is REQUIRED. You may only use your computer to do arithmetic calculations (i.e., the buttons found on a standard scientific calculator). You may also use your computer to read the course notes or the textbook, but no internet/Google/Python access is allowed. Anyone who brings a power strip to the final exam gets 1 point of extra credit for helping out their classmates.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Redundancy leads to a bad user experience. There is uncertainty around the user's information need, so don't put all your eggs in one basket. How do we optimize for diversity directly?

Monday, January 14, 2013: "France intervenes", "Chuck for Defense", "Argo wins big", "Hagel expects fight"

Monday, January 14, 2013: "France intervenes", "Chuck for Defense", "Argo wins big", "New gun proposals"

Idea: Encode diversity as a coverage problem. Example: given the word cloud of news for a single day, we want to select articles so that most words are covered.


Q: What is being covered? A: Concepts (in our case: named entities), e.g. France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL.
Q: Who is doing the covering? A: Documents (e.g. "Hagel expects fight").

Suppose we are given a set of documents D. Each document d covers a set X_d of words/topics/named entities from a universe W. For a set of documents A ⊆ D we define F(A) = |⋃_{d∈A} X_d|. Goal: maximize F(A) over A ⊆ D. Note: F is a set function, F: 2^D → ℕ.

Given a universe of elements W = {w_1, …, w_n} and sets X_1, …, X_m ⊆ W. Goal: Find k sets X_i that cover the most of W. More precisely: find k sets X_i whose union has the largest size. Bad news: this is a known NP-complete problem (Maximum Coverage).

Simple heuristic: the Greedy Algorithm. Start with A_0 = {}. For i = 1 … k: find the set X that maximizes F(A_{i-1} ∪ {X}), and let A_i = A_{i-1} ∪ {X}. Example: evaluate F({X_1}), …, F({X_m}), pick the best (say X_1); evaluate F({X_1} ∪ {X_2}), …, F({X_1} ∪ {X_m}), pick the best (say X_2); evaluate F({X_1, X_2} ∪ {X_3}), …, F({X_1, X_2} ∪ {X_m}), pick the best; and so on. Here F(A) = |⋃_{X∈A} X|.
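The greedy loop above can be sketched in a few lines of Python. This is a minimal illustration; the sets and the budget k below are made-up toy data:

```python
# A minimal sketch of the greedy max-coverage heuristic (toy data).

def greedy_max_coverage(sets, k):
    """Repeatedly add the set with the largest marginal coverage gain."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))  # marginal gain
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, 2)
print(len(covered))  # 7: picks {4,5,6,7} first, then {1,2,3}
```

Each round costs one pass over all m sets, so selecting k sets takes O(m·k) evaluations of the marginal gain.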

Goal: Maximize the covered area.


Goal: Maximize the size of the covered area. With three sets A, B, C, greedy first picks A and then C, but the optimal choice would be to pick B and C.

Greedy produces a solution A where F(A) ≥ (1 − 1/e)·OPT, i.e. F(A) ≥ 0.63·OPT [Nemhauser, Fisher, Wolsey '78]. The claim holds for functions F(·) with two properties:
F is monotone (adding more docs doesn't decrease coverage): if A ⊆ B then F(A) ≤ F(B), and F({}) = 0.
F is submodular: adding an element to a set gives less improvement than adding it to one of its subsets.

Definition: a set function F(·) is called submodular if, for all A, B ⊆ W: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B).

Diminishing-returns characterization. Equivalent definition: a set function F(·) is submodular if, for all A ⊆ B and d ∉ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B). The left side is the gain of adding d to the small set A (large improvement); the right side is the gain of adding d to the large set B (small improvement).

F(·) is submodular: for A ⊆ B, F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) (gain of adding d to a small set vs. gain of adding d to a large set). Natural example: sets X_1, …, X_m and F(A) = |⋃_{i∈A} X_i| (the size of the covered area). Claim: F(A) is submodular!
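The claim can be checked exhaustively on a small instance. The sketch below verifies the diminishing-returns inequality for the coverage function on a made-up toy collection of four sets:

```python
# Numeric check of diminishing returns for the coverage function
# F(A) = |union of X_d for d in A|, on a made-up toy instance.
from itertools import combinations

X = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}  # X_d per document d

def F(A):
    covered = set()
    for d in A:
        covered |= X[d]
    return len(covered)

def subsets(s):
    s = list(s)
    for r in range(len(s) + 1):
        for c in combinations(s, r):
            yield set(c)

docs = set(X)
for B in subsets(docs):
    for A in subsets(B):
        for d in docs - B:
            # gain of adding d to the small set A >= gain on the large set B
            assert F(A | {d}) - F(A) >= F(B | {d}) - F(B)
print("coverage F is submodular on this instance")
```

This is a check, not a proof; the general proof follows because any element newly covered by X_d on top of B is also newly covered on top of A ⊆ B.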

Submodularity is the discrete analogue of concavity. For all A ⊆ B: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B), i.e. adding d to B helps less than adding it to A. [Plot: F as a concave-shaped function of the solution size.]

Marginal gain: Δ_F(d | A) = F(A ∪ {d}) − F(A). Submodularity: F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B. Concavity: f(a + d) − f(a) ≥ f(b + d) − f(b) for a ≤ b.

Let F_1, …, F_m be submodular and λ_1, …, λ_m > 0; then F(A) = Σ_i λ_i F_i(A) is also submodular. Submodularity is closed under non-negative linear combinations! This is an extremely useful fact. Average of submodular functions is submodular: F(A) = (1/m) Σ_i F_i(A). Multicriterion optimization: F(A) = Σ_i λ_i F_i(A).


Objective: pick k docs that cover the most concepts (France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL; example docs: "Enthusiasm for Inauguration wanes", "Inauguration weekend"). F(A): the number of concepts covered by A. Elements are concepts; sets are the concepts in docs. F(A) is submodular and monotone, so we can use greedy to optimize F.

Objective: pick k docs that cover the most concepts. The good: it penalizes redundancy, and it is submodular. The bad: it ignores concept importance, and all-or-nothing coverage is too harsh.


Objective: pick k docs that cover the most concepts. Each concept c has an importance weight w_c.

Document coverage function cover_d(c): the probability that document d covers concept c [e.g., how strongly d covers c]. (Example: how strongly the article "Enthusiasm for Inauguration wanes" covers the concepts Obama and Romney.)

Document coverage function: cover_d(c) is the probability that document d covers concept c. cover_d(c) can also model how relevant concept c is for user u. Set coverage function: cover_A(c) is the probability that at least one document in A covers c. Objective: max_{A: |A| ≤ k} F(A) = Σ_c w_c · cover_A(c), where w_c are the concept weights.

max_{A: |A| ≤ k} F(A) = Σ_c w_c · cover_A(c). This objective function is also submodular (the intuitive diminishing-returns property holds), so the greedy algorithm leads to a (1 − 1/e) ≈ 63% approximation, i.e., a near-optimal solution.
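As a concrete sketch of this objective: if we model documents as covering concepts independently, then cover_A(c) = 1 − ∏_{d∈A}(1 − cover_d(c)). The independence assumption and all weights/probabilities below are illustrative, not from the slides:

```python
# Sketch of the weighted probabilistic coverage objective
# F(A) = sum_c w_c * cover_A(c), with cover_A(c) computed as
# 1 - prod_{d in A} (1 - cover_d(c)), i.e. assuming documents
# cover concepts independently. All numbers are made-up toys.

w = {"Obama": 0.5, "Argo": 0.3, "NFL": 0.2}   # concept weights w_c
cover = {                                      # cover_d(c)
    "doc1": {"Obama": 0.9, "Argo": 0.1, "NFL": 0.0},
    "doc2": {"Obama": 0.2, "Argo": 0.8, "NFL": 0.5},
}

def F(A):
    total = 0.0
    for c, wc in w.items():
        miss = 1.0                 # prob. that no doc in A covers c
        for d in A:
            miss *= 1.0 - cover[d].get(c, 0.0)
        total += wc * (1.0 - miss)
    return total

print(F(["doc1"]))           # 0.5*0.9 + 0.3*0.1 + 0.2*0.0 = 0.48
print(F(["doc1", "doc2"]))   # 0.806
```

Note how the marginal value of "doc2" for the concept Obama is small once "doc1" (which already covers Obama with probability 0.9) is in A: that is the diminishing-returns property in action.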

Objective: pick k docs that cover the most concepts. Each concept c has an importance weight w_c, and documents partially cover concepts: cover_d(c).


Greedy: at each step, add the document with the highest marginal gain F(A ∪ {x}) − F(A). The greedy algorithm is slow! At each iteration we need to re-evaluate the marginal gains of all remaining documents. Runtime is O(|D| · k) for selecting k documents out of a set of |D|.

[Leskovec et al., KDD '07] In round i: so far we have A_{i-1} = {d_1, …, d_{i-1}}. Now we pick d_i = argmax_d [F(A_{i-1} ∪ {d}) − F(A_{i-1})]: the greedy algorithm maximizes the marginal benefit Δ_i(d) = F(A_{i-1} ∪ {d}) − F(A_{i-1}). Observation: by submodularity, for every d, Δ_i(d) ≥ Δ_j(d) for i < j, since A_{i-1} ⊆ A_{j-1}. Marginal benefits Δ_i(d) only shrink as i grows: selecting document d in step i covers more words than selecting d at step j (j > i).

[Leskovec et al., KDD '07] Idea: use Δ_i as an upper bound on Δ_j (j > i). Lazy Greedy: keep an ordered list of the marginal benefits Δ_i from the previous iteration; re-evaluate Δ_i only for the top element; re-sort and prune. The stale Δ values are valid upper bounds on the current marginal gain because F(A ∪ {d}) − F(A) ≥ F(B ∪ {d}) − F(B) for A ⊆ B. [Animation over items a, b, c, d, e: after picking A_1 = {a}, only the top entries of the bound list are refreshed before picking b, giving A_2 = {a, b}.]
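The lazy re-evaluation above can be sketched with a max-heap of stale gains. F here is any monotone submodular set function; the coverage instance is a made-up toy:

```python
# Lazy Greedy sketch: stale marginal gains from earlier rounds are
# upper bounds on current gains (by submodularity), so we only
# re-evaluate the element at the top of the heap.
import heapq

def lazy_greedy(F, items, k):
    A = set()
    fA = F(A)
    # initial gains w.r.t. the empty set (valid upper bounds)
    heap = [(-(F({d}) - fA), d) for d in items]
    heapq.heapify(heap)
    for _ in range(k):
        while True:
            neg_bound, d = heapq.heappop(heap)
            gain = F(A | {d}) - fA          # re-evaluate only the top element
            if not heap or gain >= -heap[0][0]:
                A.add(d)                    # fresh gain still beats all bounds
                fA += gain
                break
            heapq.heappush(heap, (-gain, d))  # bound was stale: re-insert

    return A

X = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}

def cov(A):
    covered = set()
    for d in A:
        covered |= X[d]
    return len(covered)

A = lazy_greedy(cov, list(X), 2)
print(sorted(A))  # [0, 2]: greedy picks X_2 first, then X_0
```

Lazy Greedy returns exactly the same solution as naive greedy; it only skips gain evaluations that submodularity proves cannot win.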

Summary so far: diversity can be formulated as a set cover; set cover is a submodular optimization problem; it can be (approximately) solved using the greedy algorithm; and lazy greedy gives a significant speedup. [Plot: running time in seconds vs. number of blogs selected (1-10), comparing exhaustive search over all subsets, naive greedy, and lazy greedy; lower is better, and lazy greedy is by far the fastest.]

But what about personalization? [Example recommendations: "Election trouble model", "Songs of Syria", "Sandy delays".]

So far we assumed the same concept weighting for all users (France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL; docs: "France intervenes", "Chuck for Defense", "Argo wins big").

But each user has different preferences over concepts: a politico and a movie buff weight the same concepts (France, Mali, Hagel, Pentagon, Obama, Romney, Zero Dark Thirty, Argo, NFL) very differently.

Assume each user u has a different preference vector w_c^(u) over concepts c. The objective becomes max_{A: |A| ≤ k} F(A) = Σ_c w_c^(u) · cover_A(c). Goal: learn the personal concept weights from user feedback.


Multiplicative Weights algorithm: assume each concept c has weight w_c. We recommend document d and receive feedback, say r = +1 or −1. Update the weights: for each c ∈ d, set w_c = β^r · w_c. That is, if concept c appears in doc d and we received positive feedback (r = +1), we increase the weight w_c by multiplying it by β (β > 1); otherwise we decrease the weight (divide by β). Finally, normalize the weights so that Σ_c w_c = 1.
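The update rule above can be sketched in a few lines. β = 2.0 and the concept sets/weights below are illustrative choices, not values from the slides:

```python
# Sketch of the multiplicative-weights update for per-user concept
# weights; beta = 2.0 and the concepts/weights are illustrative.

def update_weights(w, doc_concepts, r, beta=2.0):
    """w: concept -> weight; r: feedback, +1 or -1; beta > 1."""
    for c in doc_concepts:
        w[c] *= beta ** r        # multiply by beta if r=+1, divide if r=-1
    total = sum(w.values())
    for c in w:
        w[c] /= total            # renormalize so the weights sum to 1
    return w

w = {"France": 0.25, "Argo": 0.25, "Obama": 0.25, "NFL": 0.25}
w = update_weights(w, {"Argo"}, r=+1)   # positive feedback on an Argo doc
print(w["Argo"])  # 0.5 / 1.25 = 0.4
```

After one positive signal, "Argo" carries 40% of the user's weight mass while the untouched concepts shrink proportionally, which is exactly the effect the normalization step is there to produce.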

Steps of the algorithm:
1. Identify items to recommend from
2. Identify concepts [what makes items redundant?]
3. Weigh concepts by general importance
4. Define the item-concept coverage function
5. Select items using probabilistic set cover
6. Obtain feedback, update the weights