Big Data Integration. Xin Luna Dong (Amazon) Divesh Srivastava (AT&T Labs-Research)


A Shameless Plug

A Lot of Information on the Web

Information Can Be Erroneous
The story, marked "Hold for release - Do not use", was sent in error to the news service's thousands of corporate clients.

Case Study: Deep Web Quality [LDL+13]
- Study on two domains
  - Belief of clean data
  - Poor quality data can have big impact

Domain   #Sources  Period   #Objects  #Local attrs  #Global attrs  Considered items
Stock    55        7/2011   1000*20   333           153            16000*20
Flight   38        12/2011  1200*31   43            15             7200*31

Deep Web Quality
- Is the data consistent?
  - Tolerance to 1% value difference, 15 min for time
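The tolerance rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in [LDL+13]: the function names, the choice to measure the 1% difference relative to the larger of the two values, and the "H:MM AM/PM" time format are all assumptions made here.

```python
from datetime import datetime


def numeric_consistent(a, b, rel_tol=0.01):
    """Treat two numeric values as consistent if they differ by at most
    rel_tol (1% by default) of the larger magnitude (an assumed convention)."""
    scale = max(abs(a), abs(b))
    return scale == 0 or abs(a - b) <= rel_tol * scale


def time_consistent(t1, t2, tol_minutes=15):
    """Treat two same-day clock times such as '6:15 PM' as consistent
    if they are at most tol_minutes apart."""
    fmt = "%I:%M %p"  # assumed input format
    delta = abs(datetime.strptime(t1, fmt) - datetime.strptime(t2, fmt))
    return delta.total_seconds() <= tol_minutes * 60
```

With these assumed conventions, 6:15 PM and 6:22 PM count as consistent, while 9:40 PM and 8:33 PM do not.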

Deep Web Quality
- Why such inconsistency? Semantic ambiguity
  - Sources compared: Nasdaq, Yahoo! Finance
  - Values shown: Day's Range: 93.80-95.71; 52wk Range: 25.38-95.71; 52 Wk: 25.38-93.72

Deep Web Quality
- Why such inconsistency? Unit errors
  - 76.82B vs. 76,821,000

Deep Web Quality
- Why such inconsistency? Pure errors

FlightView  FlightAware  Orbitz
6:15 PM     6:15 PM      6:22 PM
9:40 PM     8:33 PM      9:54 PM

Deep Web Quality
- Why such inconsistency?
  - Random sample of 20 data items + 5 items with largest # of values

Deep Web Quality
- Do sources copy from other sources?

Deep Web Quality
- Do sources copy from accurate sources?

Why Do We Need Big Data Integration?
- Building web-scale knowledge bases with correct information
  - Google knowledge graph

Using KB in Social Media

Small Data Integration: What Is It?
- Data integration = solving lots of jigsaw puzzles
  - Each jigsaw puzzle (e.g., Taj Mahal) is an integrated entity
  - Each piece of a puzzle comes from some source
  - Small data integration: solving small puzzles

BDI: Why is it Challenging?
- Data integration = solving lots of jigsaw puzzles
  - Big data integration: big, messy puzzles
  - E.g., missing, duplicate, damaged pieces

BDI: Why is it Challenging?
- Number of structured sources: Volume
  - Millions of websites with domain specific structured data [DMP12]
  - 154 million high quality relational tables on the web [CHW+08]
  - 10s of millions of high quality deep web sources [MKK+08]
  - 10s of millions of useful relational tables from web lists [EMH09]
- Challenges:
  - Difficult to do schema alignment
  - Expensive to warehouse all the integrated data
  - Infeasible to support virtual integration

BDI: Why is it Challenging?
- Rate of change in structured sources: Velocity
  - 43,000-96,000 deep web sources (with HTML forms) [B01]
  - 450,000 databases, 1.25M query interfaces on the web [CHZ05]
  - 10s of millions of high quality deep web sources [MKK+08]
  - Many sources provide rapidly changing data, e.g., stock prices
- Challenges:
  - Difficult to understand evolution of semantics
  - Extremely expensive to warehouse data history
  - Infeasible to capture rapid data changes in a timely fashion

BDI: Why is it Challenging?
- Representation differences among sources: Variety
  - Free-text extractors

BDI: Why is it Challenging?
- Poor data quality of deep web sources [LDL+13]: Veracity

Outline
- Motivation
- Record linkage
- Data fusion
- Emerging topics

BDI: Record Linkage
- Volume: dealing with billions of records
  - Map-reduce based record linkage [VCL10, KTR12]
  - Adaptive record blocking [DNS+12, MKB12, VN12]
  - Blocking in heterogeneous data spaces [PIP+12, PKP+13]
- Velocity
  - Incremental record linkage [WGM10, WGM13, GDS14]

BDI: Record Linkage
- Variety
  - Matching structured and unstructured data [KGA+11, KTT+12]
  - Matching Web tables and catalogs [LSC10]
- Veracity
  - Linking temporal records [LDM+11]
  - Using crowdsourcing oracle [WLK+13, VBD14, FSS16]

Linking Temporal Records [LDM+11]
- How many Wei Wangs are in DBLP, with which publications?

Linking Temporal Records: Motivation
- Traditional record linkage
  - Links records of an entity from multiple sources at a point in time
- Record linkage in Long Data
  - Links records of an entity over a long time period
  - Attribute values of an entity evolve over time
  - Different entities across time may have the same attribute value, e.g., Adam Smith (1723-1790) vs. Adam Smith (1965-)

Linking Temporal Records: Challenges
Records placed on a timeline (1991, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011):
- r1: Xin Dong, R. Polytechnic Institute
- r2: Xin Dong, University of Washington
- r3: Xin Dong, University of Washington
- r4: Xin Luna Dong, University of Washington
- r5: Xin Luna Dong, AT&T Labs-Research
- r6: Xin Luna Dong, AT&T Labs-Research
- r7: Dong Xin, University of Illinois
- r8: Dong Xin, University of Illinois
- r9: Dong Xin, Microsoft Research
- r10: Dong Xin, University of Illinois
- r11: Dong Xin, Microsoft Research
- r12: Dong Xin, Microsoft Research
Timeline figure panels:
- Who authored what?
- Ground truth
- Traditional solution 1: high value consistency
- Traditional solution 2: using similar names

Linking Temporal Records: Opportunities and Solution

ID   Name           Affiliation               Co-authors         Year
r1   Xin Dong       R. Polytechnic Institute  Wozny              1991
r2   Xin Dong       University of Washington  Halevy, Tatarinov  2004
r7   Dong Xin       University of Illinois    Han, Wah           2004
r3   Xin Dong       University of Washington  Halevy             2005
r4   Xin Luna Dong  University of Washington  Halevy, Yu         2007
r8   Dong Xin       University of Illinois    Wah                2007
r9   Dong Xin       Microsoft Research        Wu, Han            2008
r10  Dong Xin       University of Illinois    Ling, He           2009
r11  Dong Xin       Microsoft Research        Chaudhuri, Ganti   2009
r5   Xin Luna Dong  AT&T Labs-Research        Das Sarma, Halevy  2009
r6   Xin Luna Dong  AT&T Labs-Research        Naumann            2010
r12  Dong Xin       Microsoft Research        He                 2011

Linking Temporal Records: Opportunities
- Smooth transition in one attribute, despite evolution of another
- Erratic changes in an attribute value are quite unlikely
- Typically, there is continuity of history, i.e., no big gaps in time

Linking Temporal Records: Solution
- High penalty for value disagreement over a short time period
- Lower penalty for value disagreement over a long time period
- High reward for value agreement across a small time gap
- Lower reward for value agreement across a big time gap
- Consider records in time order for clustering
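The penalty and reward rules on these slides can be illustrated with a small time-aware similarity score. This is only a sketch under assumptions made here: the exponential decay shape, the half-life values, the exact-match comparison, and the function names are all hypothetical and are not the decay functions defined in [LDM+11].

```python
import math


def disagreement_penalty(gap_years, half_life=3.0):
    """Penalty for differing values: 1.0 at gap 0, shrinking as the gap grows.
    The exponential shape and half-life are illustrative assumptions."""
    return math.exp(-math.log(2) * gap_years / half_life)


def agreement_reward(gap_years, half_life=10.0):
    """Reward for equal values: 1.0 at gap 0, shrinking as the gap grows."""
    return math.exp(-math.log(2) * gap_years / half_life)


def temporal_score(rec_a, rec_b, attrs=("name", "affiliation")):
    """Sum decayed rewards for agreeing attributes and subtract decayed
    penalties for disagreeing ones. Higher means more likely the same entity."""
    gap = abs(rec_a["year"] - rec_b["year"])
    score = 0.0
    for attr in attrs:
        if rec_a[attr] == rec_b[attr]:
            score += agreement_reward(gap)
        else:
            score -= disagreement_penalty(gap)
    return score


def in_time_order(records):
    """Return records sorted by year, the order in which a clustering
    pass would consider them."""
    return sorted(records, key=lambda r: r["year"])
```

Under these assumed parameters, r2 and r3 (same name and affiliation, one year apart) score highest, r1 and r2 (same name, different affiliation, 13 years apart) still score positive because the disagreement penalty has decayed, and r2 and r7 (different name and affiliation in the same year) score lowest.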

Outline
- Motivation
- Record linkage
- Data fusion
- Emerging topics

BDI: Data Fusion
- Veracity
  - Using source trustworthiness [YHY08, GAM+10, PR11, YT11, GSH11, PR13]
  - Combining source accuracy and copy detection [DBS09a, QAH+13]
  - Multiple truth values [ZRG+12]
  - Erroneous numeric data [ZH12]
  - Experimental comparison on deep web data [LDL+13]

BDI: Data Fusion
- Volume
  - Online data fusion [LDO+11]
- Velocity
  - Truth discovery for dynamic data [DBS09b, PRM+12]
- Variety
  - Combining record linkage with data fusion [GDS+10]

Basic Solution: Naïve Voting
- Supports difference of opinion, allows conflict resolution
- Works well for independent sources that have similar accuracy
- When sources have different accuracies
  - Need to give more weight to votes by knowledgeable sources
- When sources copy from other sources
  - Need to reduce the weight of votes by copiers
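As a minimal sketch of the two ideas on this slide, the Python below shows plain majority voting and a weighted variant. The function names and the per-source weight dictionary are assumptions made for illustration; how the weights are derived (from accuracy, or discounted for copiers) is left open here.

```python
from collections import Counter, defaultdict


def naive_vote(claims):
    """claims maps source -> claimed value; return the most frequent value."""
    return Counter(claims.values()).most_common(1)[0][0]


def weighted_vote(claims, weight):
    """Like naive_vote, but each source's vote counts with its weight.
    Sources missing from `weight` default to a weight of 1.0."""
    totals = defaultdict(float)
    for source, value in claims.items():
        totals[value] += weight.get(source, 1.0)
    return max(totals, key=totals.get)
```

For example, with three sources claiming 6:15 PM, 6:15 PM, and 6:22 PM, naive voting picks 6:15 PM, while a weighting that trusts the third source much more than the other two picks 6:22 PM.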

Source Accuracy [YHY08, DBS09a]
- Need to give more weight to knowledgeable sources
- Computing source accuracy:
  $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi)$
  - $v_i(D) \in S$: S provides value $v_i$ on data item D
  - $\Phi$: observations on all data items by the set of sources
  - $\Pr(v_i(D)\ \text{true} \mid \Phi)$: probability of $v_i(D)$ being true
- How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?

Source Accuracy
- Input: data item D, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$
- Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$, for $i = 0, \ldots, n$ (sum = 1)
- Based on Bayes Rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$
  - Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$
  - If S provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$
  - If S does not: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$
- Challenge: inter-dependence between source accuracy and value probability?
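The Bayes step can be spelled out as a short derivation. This is a sketch assuming a uniform prior over the $n+1$ candidate values and independent sources; the notation $\bar{S}(D)$ (sources providing some value on D) and $\bar{S}(v_i)$ (sources providing $v_i$ on D) is introduced here for illustration.

```latex
\begin{align*}
\Pr(v_i(D)\ \text{true} \mid \Phi)
  &\propto \Pr(\Phi \mid v_i(D)\ \text{true}) \\
  &= \prod_{S \in \bar{S}(v_i)} A(S)
     \prod_{S \in \bar{S}(D) \setminus \bar{S}(v_i)} \frac{1 - A(S)}{n} \\
  &\propto \prod_{S \in \bar{S}(v_i)} \frac{n\,A(S)}{1 - A(S)}
   = \exp\Big( \sum_{S \in \bar{S}(v_i)} \ln \frac{n\,A(S)}{1 - A(S)} \Big)
\end{align*}
```

The second proportionality divides by $\prod_{S \in \bar{S}(D)} (1 - A(S))/n$, which is the same for every candidate value. Normalizing the last expression over all values in $val(D)$ makes the probabilities sum to 1.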

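Source Accuracy
Iteration: Source Accuracy -> Source Vote Count -> Value Vote Count -> Value Probability -> Source Accuracy
- Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi)$
- Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
- Value vote count: $C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S)$, where $\bar{S}(v(D))$ is the set of sources providing value v on D
- Value probability: $\Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}}$
- Continue until source accuracy converges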
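The iteration on this slide can be sketched in Python as follows. This is an illustrative implementation under assumptions made here, not the authors' code: the function name, the initial accuracy of 0.8, the default n = 10 false values per item, the clamping of accuracies away from 0 and 1, and the convergence tolerance are all hypothetical choices.

```python
import math
from collections import defaultdict


def truth_discovery(claims, n=10, init_accuracy=0.8, max_iters=50, tol=1e-6):
    """claims maps item -> {source: value}. Returns (value probabilities,
    source accuracies). n is the assumed number of false values per item."""
    sources = {s for by_src in claims.values() for s in by_src}
    acc = {s: init_accuracy for s in sources}
    prob = {}
    for _ in range(max_iters):
        prob = {}
        for item, by_src in claims.items():
            # Value vote count: sum of source vote counts ln(nA / (1 - A)).
            count = defaultdict(float)
            for s, v in by_src.items():
                a = min(max(acc[s], 1e-6), 1 - 1e-6)  # avoid log(0) or division by 0
                count[v] += math.log(n * a / (1 - a))
            # Values in the domain that no source provides have vote count 0.
            unobserved = max(n + 1 - len(count), 0)
            top = max(max(count.values()), 0.0)  # shift exponents for stability
            z = sum(math.exp(c - top) for c in count.values())
            z += unobserved * math.exp(-top)
            prob[item] = {v: math.exp(c - top) / z for v, c in count.items()}
        # Source accuracy: average probability of the values the source provides.
        new_acc = {}
        for s in sources:
            ps = [prob[i][by_src[s]] for i, by_src in claims.items() if s in by_src]
            new_acc[s] = sum(ps) / len(ps)
        delta = max(abs(new_acc[s] - acc[s]) for s in sources)
        acc = new_acc
        if delta < tol:  # stop once source accuracy has converged
            break
    return prob, acc
```

A source that repeatedly disagrees with two mutually agreeing sources ends up with a lower accuracy, so its votes carry less weight in later rounds.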

Copy Detection
Are Source 1 and Source 2 dependent? Not necessarily

      Source 1 on USA Presidents  Source 2 on USA Presidents
1st   George Washington           George Washington
2nd   John Adams                  John Adams
3rd   Thomas Jefferson            Thomas Jefferson
4th   James Madison               James Madison
41st  George H.W. Bush            George H.W. Bush
42nd  William J. Clinton          William J. Clinton
43rd  George W. Bush              George W. Bush
44th  Barack Obama                Barack Obama

Copy Detection
Are Source 1 and Source 2 dependent? Very likely

      Source 1 on USA Presidents  Source 2 on USA Presidents
1st   George Washington           George Washington
2nd   Benjamin Franklin           Benjamin Franklin
3rd   John F. Kennedy             John F. Kennedy
4th   Abraham Lincoln             Abraham Lincoln
41st  George W. Bush              George W. Bush
42nd  Hillary Clinton             Hillary Clinton
43rd  Dick Cheney                 Dick Cheney
44th  Barack Obama                John McCain

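Copy Detection: Bayesian Analysis
- Data items shared by S1 and S2 fall into three sets: different values ($O_d$), same true values ($O_t$), same false values ($O_f$)
- Notation: $S_1 \perp S_2$ denotes independence, $S_1 \sim S_2$ denotes copying
- Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum = 1)
- According to Bayes Rule, we need $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
- Key: compute $\Pr(\Phi_D \mid S_1 \perp S_2)$, $\Pr(\Phi_D \mid S_1 \sim S_2)$, for each $D \in S_1 \cap S_2$

Copy Detection: Bayesian Analysis

Pr     Independence                              Copying
$O_t$  $A^2$                                <    $A \cdot c + A^2 (1 - c)$
$O_f$  $\frac{(1-A)^2}{n}$                  <<   $(1 - A) \cdot c + \frac{(1-A)^2}{n} (1 - c)$
$O_d$  $P_d = 1 - A^2 - \frac{(1-A)^2}{n}$  >    $P_d (1 - c)$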
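The per-item probabilities in this table can be combined into a posterior probability of copying. The Python below is a minimal sketch under assumptions made here: both sources share one accuracy A, items are treated as independent, and the function name, the prior probability of copying, and the default parameter values are hypothetical.

```python
import math


def copy_probability(k_t, k_f, k_d, accuracy=0.8, n=10, c=0.8, prior_copy=0.5):
    """Posterior probability that two sources are dependent (copying), given
    k_t shared true values, k_f shared false values, and k_d differing values.
    c is the assumed probability that a copier copies a given item."""
    a = accuracy
    # Per-item probabilities under independence.
    p_t_ind = a * a
    p_f_ind = (1 - a) ** 2 / n
    p_d_ind = 1 - p_t_ind - p_f_ind
    # Per-item probabilities under copying.
    p_t_cp = a * c + p_t_ind * (1 - c)
    p_f_cp = (1 - a) * c + p_f_ind * (1 - c)
    p_d_cp = p_d_ind * (1 - c)
    # Log-likelihoods over all shared items, plus log priors.
    log_ind = (k_t * math.log(p_t_ind) + k_f * math.log(p_f_ind)
               + k_d * math.log(p_d_ind) + math.log(1 - prior_copy))
    log_cp = (k_t * math.log(p_t_cp) + k_f * math.log(p_f_cp)
              + k_d * math.log(p_d_cp) + math.log(prior_copy))
    # Normalize in log space to avoid underflow.
    m = max(log_ind, log_cp)
    e_ind, e_cp = math.exp(log_ind - m), math.exp(log_cp - m)
    return e_cp / (e_ind + e_cp)
```

With these assumed defaults, each shared true value multiplies the odds of copying by 1.2, each shared false value by roughly 40, and each differing value by 0.2. That matches the intuition of the presidents example: agreeing on correct answers is weak evidence of copying, while agreeing on the same mistakes is strong evidence.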

Iterative Process
- Step 1: Copy Detection
- Step 2: Truth Discovery
- Step 3: Accuracy Computation
- Typically converges when #objs >> #srcs

Outline
- Motivation
- Record linkage
- Data fusion
- Emerging topics

BDI: Source Selection [DSS13, RDS14]
- How to select sources before integration to balance gain, cost?
- Diagram: Source Selection, Big Data Integration

BDI: Using Crowdsourcing [FSS16]
- Improving progressive quality of linkage using an oracle

Quality Diagnosis

Source Exploration Tool
- Data.gov

Integrate Data Over Time

Conclusions
- Big data integration is an important area of research
  - Knowledge bases, linked data, geo-spatial fusion, scientific data
- Much interesting work has been done in this area
  - Challenges due to volume, velocity, variety, veracity
- A lot more research needs to be done!