Big Data Integration Xin Luna Dong (Amazon) Divesh Srivastava (AT&T Labs-Research)
A Shameless Plug
A Lot of Information on the Web
Information Can Be Erroneous
  The story, marked "Hold for release - Do not use", was sent in error to the news service's thousands of corporate clients.
Case Study: Deep Web Quality [LDL+13]
  Study on two domains
    Belief of clean data
    Poor quality data can have big impact
  Domain  #Sources  Period   #Objects  #Local attrs  #Global attrs  Considered items
  Stock   55        7/2011   1000*20   333           153            16000*20
  Flight  38        12/2011  1200*31   43            15             7200*31
Deep Web Quality
  Is the data consistent?
  Tolerance to 1% value difference, 15 min for time
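A minimal Python sketch of the tolerance check named on this slide. The 1% and 15-minute thresholds come from the slide; the function names, and the choice to measure the 1% against the larger of the two values, are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def values_consistent(a: float, b: float, rel_tol: float = 0.01) -> bool:
    """True if a and b differ by at most rel_tol of the larger magnitude.

    Measuring against the larger value is an assumed reading of the 1% rule.
    """
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

def times_consistent(t1: datetime, t2: datetime,
                     tol: timedelta = timedelta(minutes=15)) -> bool:
    """True if two timestamps are within the time tolerance of each other."""
    return abs(t1 - t2) <= tol

def parse_clock(text: str) -> datetime:
    """Parse a 12-hour clock string such as '6:15 PM' (hypothetical helper)."""
    return datetime.strptime(text, "%I:%M %p")
```

For example, `times_consistent(parse_clock("6:15 PM"), parse_clock("6:22 PM"))` treats a 7-minute difference as consistent, while a 67-minute difference is flagged as a conflict.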
Deep Web Quality
  Why such inconsistency? Semantic ambiguity
  Nasdaq, Yahoo! Finance
    Day's Range: 93.80-95.71
    52wk Range: 25.38-95.71
    52 Wk: 25.38-93.72
Deep Web Quality
  Why such inconsistency? Unit errors
  76.82B vs. 76,821,000
Deep Web Quality
  Why such inconsistency? Pure errors
  FlightView  FlightAware  Orbitz
  6:15 PM     6:15 PM      6:22 PM
  9:40 PM     8:33 PM      9:54 PM
Deep Web Quality
  Why such inconsistency?
  Random sample of 20 data items + 5 items with largest # of values
Deep Web Quality
  Do sources copy from other sources?
Deep Web Quality
  Do sources copy from accurate sources?
Why Do We Need Big Data Integration?
  Building web-scale knowledge bases with correct information
  Google Knowledge Graph
Using KB in Social Media
Small Data Integration: What Is It?
  Data integration = solving lots of jigsaw puzzles
    Each jigsaw puzzle (e.g., Taj Mahal) is an integrated entity
    Each piece of a puzzle comes from some source
  Small data integration: solving small puzzles
BDI: Why is it Challenging?
  Data integration = solving lots of jigsaw puzzles
  Big data integration: big, messy puzzles
    E.g., missing, duplicate, damaged pieces
BDI: Why is it Challenging?
  Number of structured sources: Volume
    Millions of websites with domain-specific structured data [DMP12]
    154 million high-quality relational tables on the web [CHW+08]
    10s of millions of high-quality deep web sources [MKK+08]
    10s of millions of useful relational tables from web lists [EMH09]
  Challenges:
    Difficult to do schema alignment
    Expensive to warehouse all the integrated data
    Infeasible to support virtual integration
BDI: Why is it Challenging?
  Rate of change in structured sources: Velocity
    43,000 to 96,000 deep web sources (with HTML forms) [B01]
    450,000 databases, 1.25M query interfaces on the web [CHZ05]
    10s of millions of high-quality deep web sources [MKK+08]
    Many sources provide rapidly changing data, e.g., stock prices
  Challenges:
    Difficult to understand evolution of semantics
    Extremely expensive to warehouse data history
    Infeasible to capture rapid data changes in a timely fashion
BDI: Why is it Challenging?
  Representation differences among sources: Variety
  Free-text extractors
BDI: Why is it Challenging?
  Poor data quality of deep web sources [LDL+13]: Veracity
Outline
  Motivation
  Record linkage
  Data fusion
  Emerging topics
BDI: Record Linkage
  Volume: dealing with billions of records
    Map-reduce based record linkage [VCL10, KTR12]
    Adaptive record blocking [DNS+12, MKB12, VN12]
    Blocking in heterogeneous data spaces [PIP+12, PKP+13]
  Velocity
    Incremental record linkage [WGM10, WGM13, GDS14]
BDI: Record Linkage
  Variety
    Matching structured and unstructured data [KGA+11, KTT+12]
    Matching Web tables and catalogs [LSC10]
  Veracity
    Linking temporal records [LDM+11]
    Using crowdsourcing oracle [WLK+13, VBD14, FSS16]
Linking Temporal Records [LDM+11]
  How many Wei Wangs are in DBLP, and with which publications?
Linking Temporal Records: Motivation
  Traditional record linkage
    Links records of an entity from multiple sources at a point in time
  Record linkage in Long Data
    Links records of an entity over a long time period
    Attribute values of an entity evolve over time
    Different entities across time may have the same attribute value
  Adam Smith (1723-1790), Adam Smith (1965-)
Linking Temporal Records: Challenges
  Who authored what?
  Timeline: 1991, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011
  r1: Xin Dong, R. Polytechnic Institute
  r2: Xin Dong, University of Washington
  r3: Xin Dong, University of Washington
  r4: Xin Luna Dong, University of Washington
  r5: Xin Luna Dong, AT&T Labs-Research
  r6: Xin Luna Dong, AT&T Labs-Research
  r7: Dong Xin, University of Illinois
  r8: Dong Xin, University of Illinois
  r9: Dong Xin, Microsoft Research
  r10: Dong Xin, University of Illinois
  r11: Dong Xin, Microsoft Research
  r12: Dong Xin, Microsoft Research
Linking Temporal Records: Challenges
  Ground truth
  [Figure: records r1-r12 on the 1991-2011 timeline]
Linking Temporal Records: Challenges
  Traditional solution 1: high value consistency
  [Figure: records r1-r12 on the 1991-2011 timeline]
Linking Temporal Records: Challenges
  Traditional solution 2: using similar names
  [Figure: records r1-r12 on the 1991-2011 timeline]
Linking Temporal Records: Opportunities
  Smooth transition in one attribute, despite evolution of another
  ID   Name           Affiliation               Co-authors          Year
  r1   Xin Dong       R. Polytechnic Institute  Wozny               1991
  r2   Xin Dong       University of Washington  Halevy, Tatarinov   2004
  r7   Dong Xin       University of Illinois    Han, Wah            2004
  r3   Xin Dong       University of Washington  Halevy              2005
  r4   Xin Luna Dong  University of Washington  Halevy, Yu          2007
  r8   Dong Xin       University of Illinois    Wah                 2007
  r9   Dong Xin       Microsoft Research        Wu, Han             2008
  r10  Dong Xin       University of Illinois    Ling, He            2009
  r11  Dong Xin       Microsoft Research        Chaudhuri, Ganti    2009
  r5   Xin Luna Dong  AT&T Labs-Research        Das Sarma, Halevy   2009
  r6   Xin Luna Dong  AT&T Labs-Research        Naumann             2010
  r12  Dong Xin       Microsoft Research        He                  2011
Linking Temporal Records: Opportunities
  Erratic changes in an attribute value are quite unlikely
Linking Temporal Records: Opportunities
  Typically, there is continuity of history, i.e., no big gaps in time
Linking Temporal Records: Solution
  High penalty for value disagreement over a short time period
Linking Temporal Records: Solution
  Lower penalty for value disagreement over a long time period
Linking Temporal Records: Solution
  High reward for value agreement across a small time gap
Linking Temporal Records: Solution
  Lower reward for value agreement across a big time gap
Linking Temporal Records: Solution
  Consider records in time order for clustering
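The penalty and reward rules on the Solution slides can be sketched in Python as a time-decayed comparison of one attribute. This is only an illustration: the exponential form, the half-life constants, and the names `decay` and `temporal_score` are assumptions, not the functions used in [LDM+11].

```python
# Hypothetical half-lives (in years); the source does not give numeric values.
DISAGREE_HALF_LIFE = 3.0   # penalty for disagreement fades as the gap grows
AGREE_HALF_LIFE = 10.0     # reward for agreement fades as the gap grows

def decay(gap_years: float, half_life: float) -> float:
    """Weight in (0, 1]: 1.0 at zero gap, halved every half_life years."""
    return 0.5 ** (gap_years / half_life)

def temporal_score(a: dict, b: dict, attr: str = "affiliation") -> float:
    """Positive reward if the attribute agrees, negative penalty if not.

    Both magnitudes shrink with the time gap between the two records.
    """
    gap = abs(a["year"] - b["year"])
    if a[attr] == b[attr]:
        return decay(gap, AGREE_HALF_LIFE)
    return -decay(gap, DISAGREE_HALF_LIFE)

# Records taken from the table on the Opportunities slide.
r1 = {"affiliation": "R. Polytechnic Institute", "year": 1991}
r2 = {"affiliation": "University of Washington", "year": 2004}
r3 = {"affiliation": "University of Washington", "year": 2005}
r7 = {"affiliation": "University of Illinois", "year": 2004}
```

Under these assumed constants, r2 and r7 (different affiliations in the same year) receive the full penalty, r1 and r2 (different affiliations 13 years apart) receive only a small one, and r2 and r3 (same affiliation one year apart) receive a high reward.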
Outline
  Motivation
  Record linkage
  Data fusion
  Emerging topics
BDI: Data Fusion
  Veracity
    Using source trustworthiness [YHY08, GAM+10, PR11, YT11, GSH11, PR13]
    Combining source accuracy and copy detection [DBS09a, QAH+13]
    Multiple truth values [ZRG+12]
    Erroneous numeric data [ZH12]
    Experimental comparison on deep web data [LDL+13]
BDI: Data Fusion
  Volume: Online data fusion [LDO+11]
  Velocity
    Truth discovery for dynamic data [DBS09b, PRM+12]
  Variety
    Combining record linkage with data fusion [GDS+10]
Basic Solution: Naïve Voting
  Supports difference of opinion, allows conflict resolution
  Works well for independent sources that have similar accuracy
  When sources have different accuracies
    Need to give more weight to votes by knowledgeable sources
  When sources copy from other sources
    Need to reduce the weight of votes by copiers
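As a point of reference, naïve voting can be sketched in a few lines of Python. The input layout (source name mapped to item/value pairs) and the function name `naive_vote` are assumptions for illustration; every source gets exactly one equal vote per data item.

```python
from collections import Counter

def naive_vote(claims: dict) -> dict:
    """Pick, for each data item, the value provided by the most sources.

    claims maps source -> {item: value}. Ties go to the value seen first.
    """
    tallies = {}
    for items in claims.values():
        for item, value in items.items():
            tallies.setdefault(item, Counter())[value] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in tallies.items()}

# Toy input: three hypothetical sources on one data item.
claims = {
    "S1": {"44th president": "Barack Obama"},
    "S2": {"44th president": "John McCain"},
    "S3": {"44th president": "Barack Obama"},
}
```

Here `naive_vote(claims)` selects "Barack Obama" with two votes against one. The sketch makes the two weaknesses listed on the slide visible: an inaccurate source counts as much as an accurate one, and a copier adds a full extra vote.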
Source Accuracy [YHY08, DBS09a]
  Need to give more weight to knowledgeable sources
  Computing source accuracy:
    $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi)$
    $v_i(D) \in S$: S provides value $v_i$ on data item D
    $\Phi$: observations on all data items by the sources
    $\Pr(v_i(D)\ \text{true} \mid \Phi)$: probability of $v_i(D)$ being true
  How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?
Source Accuracy
  Input: data item D, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$
  Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$, for $i = 0, \ldots, n$ (sum = 1)
  Based on Bayes' rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$
  Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$
    If S provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$
    If S does not: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$
  Challenge: Inter-dependence between source accuracy and value probability?
Source Accuracy
  Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi)$
  Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
  Value vote count: $C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S)$, where $\bar{S}(v(D))$ is the set of sources that provide $v$ on $D$
  Value probability: $\Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}}$
  Continue until source accuracy converges
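The loop on this slide (source accuracy, source vote count, value vote count, value probability, and back) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the input layout, the function name `accu`, the initial accuracy of 0.8, the clamping of accuracies to [0.01, 0.99], and the default of n = 10 false values per item are all choices made here.

```python
import math
from collections import defaultdict

def accu(claims, n=10, init_acc=0.8, max_iters=50, eps=1e-6):
    """Iterate value probabilities and source accuracies until they settle.

    claims maps source -> {item: value}; every source must claim at least one
    item. n is the assumed number of false values per item.
    """
    acc = {s: init_acc for s in claims}
    probs = {}
    for _ in range(max_iters):
        # Source vote count A'(S) = ln(n*A(S) / (1 - A(S))), summed per value.
        votes = defaultdict(lambda: defaultdict(float))
        for s, items in claims.items():
            a = min(max(acc[s], 0.01), 0.99)  # clamp so the log stays finite
            weight = math.log(n * a / (1 - a))
            for item, value in items.items():
                votes[item][value] += weight
        # Value probability: exp(C(v)) normalised over all n + 1 values;
        # values no source provides have vote count 0.
        probs = {}
        for item, counts in votes.items():
            shift = max(max(counts.values()), 0.0)  # for numerical stability
            unseen = max(n + 1 - len(counts), 0)
            z = sum(math.exp(c - shift) for c in counts.values())
            z += unseen * math.exp(-shift)
            probs[item] = {v: math.exp(c - shift) / z for v, c in counts.items()}
        # Source accuracy: average probability of the values the source provides.
        new_acc = {s: sum(probs[i][v] for i, v in items.items()) / len(items)
                   for s, items in claims.items()}
        change = max(abs(new_acc[s] - acc[s]) for s in acc)
        acc = new_acc
        if change < eps:
            break
    truth = {item: max(p, key=p.get) for item, p in probs.items()}
    return truth, acc

# Toy input: S1 and S3 disagree on d4; S1 agrees with S2 elsewhere, S3 does not.
claims = {
    "S1": {"d1": "a", "d2": "b", "d3": "c", "d4": "x"},
    "S2": {"d1": "a", "d2": "b", "d3": "c"},
    "S3": {"d1": "a", "d2": "z", "d3": "q", "d4": "y"},
}
truth, acc = accu(claims)
```

On d4 a naïve vote would be tied one to one; with accuracy weighting, S1's track record on d1 to d3 gives its value "x" the larger vote count.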
Copy Detection
  Are Source 1 and Source 2 dependent? Not necessarily
  USA Presidents  Source 1            Source 2
  1st             George Washington   George Washington
  2nd             John Adams          John Adams
  3rd             Thomas Jefferson    Thomas Jefferson
  4th             James Madison       James Madison
  41st            George H.W. Bush    George H.W. Bush
  42nd            William J. Clinton  William J. Clinton
  43rd            George W. Bush      George W. Bush
  44th            Barack Obama        Barack Obama
Copy Detection
  Are Source 1 and Source 2 dependent? Very likely
  USA Presidents  Source 1            Source 2
  1st             George Washington   George Washington
  2nd             Benjamin Franklin   Benjamin Franklin
  3rd             John F. Kennedy     John F. Kennedy
  4th             Abraham Lincoln     Abraham Lincoln
  41st            George W. Bush      George W. Bush
  42nd            Hillary Clinton     Hillary Clinton
  43rd            Dick Cheney         Dick Cheney
  44th            Barack Obama        John McCain
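Copy Detection: Bayesian Analysis
  Observations on S1 and S2: different values ($O_d$); same values, either TRUE ($O_t$) or FALSE ($O_f$)
  Notation: $S1 \perp S2$ (independent), $S1 \sim S2$ (copying)
  Goal: $\Pr(S1 \perp S2 \mid \Phi)$, $\Pr(S1 \sim S2 \mid \Phi)$ (sum = 1)
  According to Bayes' rule, we need $\Pr(\Phi \mid S1 \perp S2)$, $\Pr(\Phi \mid S1 \sim S2)$
  Key: compute $\Pr(\Phi_D \mid S1 \perp S2)$, $\Pr(\Phi_D \mid S1 \sim S2)$, for each $D \in S1 \cap S2$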
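Copy Detection: Bayesian Analysis
  Observations on S1 and S2: different values ($O_d$); same values, either TRUE ($O_t$) or FALSE ($O_f$)
  Pr     Independence                               Copying
  $O_t$  $A^2$                                 <    $A \cdot c + A^2 (1 - c)$
  $O_f$  $\frac{(1 - A)^2}{n}$                 <<   $(1 - A) \cdot c + \frac{(1 - A)^2}{n} (1 - c)$
  $O_d$  $P_d = 1 - A^2 - \frac{(1 - A)^2}{n}$  >   $P_d (1 - c)$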
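A small Python sketch of how the per-item probabilities could be combined into a posterior probability of copying. It assumes both sources share the same accuracy A, n false values per item, a copy rate c strictly between 0 and 1, and a prior probability of copying; the function name `copy_posterior` and all default values are hypothetical.

```python
import math

def copy_posterior(k_t, k_f, k_d, A=0.8, n=10, c=0.8, prior_copy=0.5):
    """Posterior probability that S1 and S2 are dependent (copying).

    k_t, k_f: number of shared items with the same true / same false value.
    k_d: number of shared items with different values.
    Requires 0 < A < 1, 0 < c < 1 and 0 < prior_copy < 1.
    """
    # Per-item probabilities if the two sources are independent.
    ind_t = A * A
    ind_f = (1 - A) ** 2 / n
    ind_d = 1 - ind_t - ind_f
    # Per-item probabilities if one source copies with probability c.
    cp_t = A * c + ind_t * (1 - c)
    cp_f = (1 - A) * c + ind_f * (1 - c)
    cp_d = ind_d * (1 - c)
    # Bayes' rule in log space, assuming items are independent of each other.
    log_ind = (math.log(1 - prior_copy) + k_t * math.log(ind_t)
               + k_f * math.log(ind_f) + k_d * math.log(ind_d))
    log_cp = (math.log(prior_copy) + k_t * math.log(cp_t)
              + k_f * math.log(cp_f) + k_d * math.log(cp_d))
    top = max(log_ind, log_cp)
    e_ind, e_cp = math.exp(log_ind - top), math.exp(log_cp - top)
    return e_cp / (e_cp + e_ind)

# The two presidents examples: 8 shared true values, versus
# 1 shared true value, 6 shared false values and 1 different value.
all_true = copy_posterior(k_t=8, k_f=0, k_d=0)
shared_errors = copy_posterior(k_t=1, k_f=6, k_d=1)
```

With these assumed parameters, sharing false values pushes the posterior far more strongly toward copying than sharing true values does, which is the intuition behind the "<<" entry in the table.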
Iterative Process
  Step 1: Copy Detection
  Step 2: Truth Discovery
  Step 3: Accuracy Computation
  Typically converges when #objs >> #srcs
Outline
  Motivation
  Record linkage
  Data fusion
  Emerging topics
BDI: Source Selection [DSS13, RDS14]
  How to select sources before integration to balance gain, cost?
  [Figure: Source Selection, Big Data Integration]
BDI: Using Crowdsourcing [FSS16]
  Improving progressive quality of linkage using an oracle
Quality Diagnosis
Source Exploration Tool
  Data.gov
Integrate Data Over Time
Conclusions
  Big data integration is an important area of research
    Knowledge bases, linked data, geo-spatial fusion, scientific data
  Much interesting work has been done in this area
    Challenges due to volume, velocity, variety, veracity
  A lot more research needs to be done!